Latent Dirichlet Allocation (LDA) requires preprocessing each incoming document. We begin by removing punctuation, URLs, and mentions (i.e., users referencing other users) from tweets. Though this information can be important for certain applications of Twitter data, it offers little value in the development of a topic model. After removing these features, we convert the remaining text to lowercase. Next, we remove stopwords from each document. Stopwords (e.g., “the”, “and”, “are”) are the most common words in the English language and have little bearing on the overall meaning of a document [24]. Additionally, we remove stopwords specific to Twitter discussion. Examples include ‘&amp;’ (the Twitter rendering of the ampersand sign) and ‘RT’ (Twitter code indicating that a tweet is a retweet, or copy, of another user’s tweet). We also remove the words “climate” and “carbon” as well as the phrase “global warming”, as these were the keywords in the query used to construct the dataset and would otherwise be over-represented in every topic. Finally, we reduce each word to its lemma, or root. This ensures that the algorithm does not treat different tenses or forms of a word as separate words. To perform these tasks, we use the NLTK package in Python [25].
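
The cleaning steps described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the names `preprocess`, `ENGLISH_STOPWORDS`, `TWITTER_STOPWORDS`, and `QUERY_TERMS` are hypothetical, the short stopword set is a stand-in for NLTK's English list, and the default lemmatizer is an identity function. With NLTK installed and its corpora downloaded, `nltk.corpus.stopwords.words("english")` and `nltk.stem.WordNetLemmatizer().lemmatize` could be supplied instead. URLs and mentions are stripped before punctuation here so that their patterns are still intact when matched.

```python
import re
import string

# Stand-in stopword list (assumption); the article uses NLTK's English stopwords.
ENGLISH_STOPWORDS = {"the", "and", "are", "is", "a", "of", "to", "in", "on", "for"}
# "&amp;" becomes "amp" once punctuation is stripped; "RT" becomes "rt" after lowercasing.
TWITTER_STOPWORDS = {"amp", "rt"}
# Keywords from the query used to build the dataset.
QUERY_TERMS = {"climate", "carbon"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
PHRASE_RE = re.compile(r"\bglobal warming\b")
# Map every ASCII punctuation character to a space (Unicode punctuation is not covered).
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def preprocess(tweet, lemmatize=lambda word: word):
    """Return the cleaned, lemmatized tokens of a single tweet."""
    text = URL_RE.sub(" ", tweet)                  # drop URLs
    text = MENTION_RE.sub(" ", text)               # drop @mentions
    text = text.translate(PUNCT_TABLE).lower()     # drop punctuation, lowercase
    text = " ".join(text.split())                  # collapse runs of whitespace
    text = PHRASE_RE.sub(" ", text)                # drop the query phrase
    stop = ENGLISH_STOPWORDS | TWITTER_STOPWORDS | QUERY_TERMS
    return [lemmatize(tok) for tok in text.split() if tok not in stop]
```

Passing the lemmatizer as an argument keeps the sketch runnable without any downloaded corpora, while leaving room for a real lemmatizer to be swapped in.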

Following this preliminary preprocessing, we take further steps to increase the efficiency with which the model is developed. Notably, we remove words that occur very frequently and those that occur infrequently from the set of words (i.e., the dictionary) used to construct the topic model. Reducing the size of the dictionary significantly improves the run-time of the algorithm, allowing for rapid testing and iteration. To reduce the size of the dictionary, we first consider only words longer than two characters. We then eliminate words that appear in more than 50% of tweets (e.g., “the”), as they provide little information about the relevant topic. Next, we drop words that appear in fewer than 100 tweets (less than 0.03% of the dataset).
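
As an illustration of the dictionary reduction described above, the sketch below counts how many tweets each word appears in and keeps only words within the stated bounds. The function names `build_vocabulary` and `filter_docs` are hypothetical, and the article does not say which tooling performed this step. Topic-modeling libraries offer comparable filters (for example, gensim's `Dictionary.filter_extremes` with `no_below` and `no_above`), but a plain-Python version makes the thresholds explicit.

```python
from collections import Counter


def build_vocabulary(docs, min_len=3, min_docs=100, max_doc_frac=0.5):
    """Select words by length and document frequency.

    docs is a list of token lists, one per tweet. A word is kept if it has
    at least min_len characters, appears in at least min_docs tweets, and
    appears in no more than max_doc_frac of all tweets.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        # Use a set so each word counts once per tweet, not once per occurrence.
        doc_freq.update({tok for tok in tokens if len(tok) >= min_len})
    return {
        tok
        for tok, df in doc_freq.items()
        if df >= min_docs and df <= max_doc_frac * n_docs
    }


def filter_docs(docs, vocab):
    """Drop every token that is not in the reduced vocabulary."""
    return [[tok for tok in tokens if tok in vocab] for tokens in docs]
```

The defaults mirror the thresholds in the text (more than two characters, at least 100 tweets, at most 50% of tweets); on a small toy corpus `min_docs` would need to be lowered for any word to survive.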

Finally, we use a random subset of the tweets in preliminary model training to identify the optimal number of topics. The size of the subset depends on the phase of model development and is specified in the following section.
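
Drawing such a subset reproducibly might look like the sketch below. The function name `sample_tweets`, the `subset_size` parameter, and the fixed seed are assumptions made for illustration; the article defers the actual subset sizes to the following section.

```python
import random


def sample_tweets(tweets, subset_size, seed=0):
    """Return a random sample of tweets, drawn without replacement."""
    rng = random.Random(seed)              # a fixed seed makes the draw repeatable
    k = min(subset_size, len(tweets))      # never request more tweets than exist
    return rng.sample(list(tweets), k)
```

Fixing the seed means repeated runs with different numbers of topics are compared on the same tweets.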
