To extract the key topics in the corpora of Twitter data, we utilize an unsupervised clustering technique called Latent Dirichlet Allocation (LDA), which has been used extensively to detect topics on Twitter [20]. LDA is a probabilistic topic model, a class of Bayesian latent variable models. At a high level, it represents documents as mixtures of topics, which are in turn represented by sets of words [21]. Given a corpus of documents, an LDA model learns the topic representation of each document and the words associated with each topic. Once the model is trained, it produces, for a given document, a topic likelihood distribution that identifies the relevant topic(s) in that document. For our analysis, we use a popular Natural Language Processing (NLP) Python package, Gensim [22], with the Machine Learning for Language Toolkit (MALLET) implementation [23].
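As a concrete illustration, a minimal training sketch using Gensim's MALLET wrapper is shown below. This is not the exact code used in this study: the wrapper import path applies to Gensim 3.x (it was removed in 4.x), and the MALLET binary path, function name, and topic count are placeholders.

```python
from gensim.models.wrappers import LdaMallet  # Gensim 3.x wrapper around MALLET

def train_lda(corpus, dictionary, num_topics, mallet_path="/path/to/mallet"):
    """Train a MALLET LDA topic model on a bag-of-words corpus."""
    return LdaMallet(mallet_path, corpus=corpus,
                     num_topics=num_topics, id2word=dictionary)
```

Applying a trained model to a document's bag-of-words vector (model[bow]) then yields the topic likelihood distribution for that document.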

We note that there are a significant number of methods for extracting topics from large text corpora, each with its own advantages and shortcomings. In the case of LDA, the algorithm assumes that each document contains multiple topics. However, in the case of tweets, which are likely to contain a limited number of topics due to character restrictions, this may not be a reasonable assumption. Previous research has proposed a modified LDA algorithm, namely Twitter-LDA, which assumes that each tweet contains exactly one topic [20]. In a comparison between LDA and Twitter-LDA on unfiltered Twitter data, where the number of topics was set at 110, Twitter-LDA outperformed standard LDA [20]. However, the rationale for using standard LDA in this study is twofold: (1) the goal of the LDA analysis is to derive topic distributions for distinct geographic regions; as this is an aggregate measure, assuming that each tweet contains exactly one topic would limit the components of the topic distribution, potentially reducing the accuracy of the overall distribution, and (2) because the data are filtered to pertain only to climate change, the topics identified are likely to overlap more than in a general tweet corpus, so it is not reasonable to assume that each tweet contains only a single topic. Here we utilize LDA as a topic model, but we emphasize that the approach presented in this study is not dependent on that specific method. In summary, the choice of algorithm should be guided by the research question and the available data.

LDA requires preprocessing each of the incoming documents. We begin this process by removing punctuation, URLs, and mentions (i.e., users referencing other users) from tweets. Though this information can be important for certain applications of Twitter data, it offers little value in the development of a topic model. After removing these features, we convert the remaining text to lowercase. Next, we remove any stopwords from the document. Stopwords (e.g., “the”, “and”, “are”) are the most common words in the English language and have little bearing on the overall meaning of a document [24]. Additionally, we remove stopwords specific to Twitter discussion. Examples include ‘&amp;’ (the Twitter rendering of the ampersand sign) and ‘RT’ (Twitter code indicating that a tweet is a retweet, or copy, of another user’s tweet). We also remove the words “climate” and “carbon” as well as the phrase “global warming”, as those were the keywords contained in the query used to construct the dataset and would otherwise be over-represented in every topic. Finally, we reduce each word to its lemma, or root. This ensures that the algorithm does not incorrectly identify different tenses or forms of a word as separate words. To perform these tasks, we use the NLTK package in Python [25].
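A rough sketch of this preprocessing pipeline with NLTK is given below. The regular expressions and the custom stopword set are illustrative assumptions rather than the exact rules used in this study, and the NLTK stopword and WordNet corpora must be downloaded beforehand.

```python
import re
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# requires nltk.download("stopwords") and nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))
# Twitter- and query-specific stopwords (illustrative, not the exact list used)
stop_words |= {"amp", "rt", "climate", "carbon"}

def preprocess(tweet):
    """Return lowercase, lemmatized tokens for one tweet."""
    text = re.sub(r"http\S+|www\.\S+", " ", tweet)       # strip URLs
    text = re.sub(r"@\w+", " ", text)                     # strip mentions
    text = text.lower().replace("global warming", " ")    # drop the query phrase
    text = text.translate(str.maketrans("", "", string.punctuation))
    # WordNetLemmatizer defaults to noun lemmatization; a POS-aware pass would be stricter
    return [lemmatizer.lemmatize(tok) for tok in text.split()
            if tok not in stop_words]
```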

Following this preliminary preprocessing, we take further steps to increase the efficiency with which the model is developed. Notably, we remove words that occur very frequently and those that occur infrequently from the set of words (i.e., dictionary) used to construct the topic model. Reducing the size of the dictionary significantly improves the run-time of the algorithm, allowing for rapid testing and iteration. To reduce the size of the dictionary, we first consider only words longer than two characters. We then eliminate words that appear in more than 50% of tweets (e.g., “the”), as they would provide little information about the relevant topic. Next, we drop words that appear in fewer than 100 tweets (less than 0.03% of the dataset).
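In Gensim, these reductions map naturally onto the dictionary's filter_extremes method; a minimal sketch under the thresholds described above (more than two characters, at least 100 tweets, at most 50% of tweets) follows. The function name is a placeholder, not the study's code.

```python
from gensim.corpora import Dictionary

def build_dictionary(tokenized_tweets):
    """Build a filtered Gensim dictionary and bag-of-words corpus."""
    # keep only words longer than two characters
    docs = [[tok for tok in doc if len(tok) > 2] for doc in tokenized_tweets]
    dictionary = Dictionary(docs)
    # drop words in fewer than 100 tweets or in more than 50% of tweets
    dictionary.filter_extremes(no_below=100, no_above=0.5)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    return dictionary, corpus
```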

Finally, we use a randomized subset of the tweets for preliminary model training and identification of the optimal number of topics. The size of the subset depends on the phase of model development and is specified in the following section.
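Such a subset can be drawn reproducibly with Python's random module, for example as below; the seed is a placeholder, not a value from this study.

```python
import random

def sample_tweets(tweets, n, seed=0):
    """Draw a reproducible random subset of tweets for preliminary model runs."""
    return random.Random(seed).sample(tweets, n)
```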

Topic modeling requires user input to determine the optimal number of topics. In a well-performing model, the topics are distinct and intuitive. A variety of metrics can assess different aspects of model quality, but there is no formalized method for making a holistic assessment. One popular metric is coherence, which rewards similarity within a topic and contrast between topics [26]. There are several ways to compute coherence, and one of the most widely used metrics is Cv [26]. This metric combines several measurements, specifically an indirect cosine measure, a boolean sliding window, and normalized pointwise mutual information, and has been shown to accurately indicate the degree to which a topic can be easily interpreted [26].
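A sketch of computing Cv coherence with Gensim's CoherenceModel is shown below. Depending on the Gensim version, a MALLET-wrapped model may first need to be converted to a standard LdaModel (e.g., with malletmodel2ldamodel); the helper name is illustrative.

```python
from gensim.models import CoherenceModel

def cv_coherence(model, tokenized_tweets, dictionary):
    """Compute the Cv coherence score for a trained topic model."""
    cm = CoherenceModel(model=model, texts=tokenized_tweets,
                        dictionary=dictionary, coherence="c_v")
    return cm.get_coherence()
```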

To determine the ideal number of topics, we iteratively construct models over a large range of topic numbers with a step size of two and compute coherence with the Cv metric. Based on these results, we select the range of topic numbers where coherence is highest and repeat the process with a smaller range and a step size of one.

Many topic models constructed with Twitter data have over one hundred topics in the final model [20]. However, as the tweets used in this analysis have already been filtered a priori for relevance to climate change, we use lower ranges of topic numbers, since the training corpus is unlikely to contain as many topics as an unfiltered tweet stream. In the first round of model development, we test models with 10, 12, 14, …, 50 topics. Based on these results (depicted in S2 Fig in S1 File), we then test models with 14, 15, 16, …, 22 topics.
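Putting the pieces together, this two-pass search can be sketched as below, reusing the train_lda and cv_coherence helpers from the earlier sketches; corpus, dictionary, and texts stand for the filtered bag-of-words corpus, the Gensim dictionary, and the tokenized tweets, and the exact training configuration in this study may differ.

```python
def coherence_sweep(topic_range, corpus, dictionary, texts):
    """Train one model per candidate topic count and record its Cv coherence."""
    return {k: cv_coherence(train_lda(corpus, dictionary, k), texts, dictionary)
            for k in topic_range}

coarse = coherence_sweep(range(10, 51, 2), corpus, dictionary, texts)  # 10, 12, ..., 50
fine = coherence_sweep(range(14, 23), corpus, dictionary, texts)       # 14, 15, ..., 22
```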

Based on the coherence score analysis, we identify seventeen as the appropriate number of topics in the model. This set largely encompasses the breadth of Twitter discussion on climate change while minimizing overlap between topics. The final topic model makes use of the entire 350,000-tweet corpus to ensure that the training data accurately represent the extent and variety of discourse on Twitter. A concise summary of the final topics is presented in Table 1.

It should be noted that the topic names and descriptions (Table 1) are subjective and represent our best effort at concisely describing the complexity of each topic category. A key challenge in LDA analysis is discerning the distinct characteristics of each topic set. The state of the art involves referring to the top n-grams and the most representative tweets for each category (see S3 Table and S4 Fig in S1 File), an inherently subjective task. While this analysis is based on quantitative representations of these topics, it is critical to acknowledge the subjectivity inherent in unsupervised topic models and its implications for the study’s inferences.
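For reference, one way to pull the top words and the most representative tweets per topic is sketched below. It assumes a standard Gensim LdaModel interface (a MALLET model may first need conversion), and the function name and parameters are illustrative rather than the code used to produce S3 Table and S4 Fig.

```python
def describe_topics(model, corpus, tweets, topn=10, n_examples=3):
    """Print the top words and the highest-probability example tweets per topic."""
    # most likely topic (and its probability) for every tweet
    best = [max(model.get_document_topics(bow), key=lambda p: p[1]) for bow in corpus]
    for topic_id in range(model.num_topics):
        top_words = [w for w, _ in model.show_topic(topic_id, topn=topn)]
        examples = sorted((prob, tweets[i])
                          for i, (tid, prob) in enumerate(best) if tid == topic_id)
        print(topic_id, top_words, [t for _, t in examples[-n_examples:]])
```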
