Corpus and document similarity
This protocol is extracted from research article:
The public and legislative impact of hyperconcentrated topic news
Sci Adv, Aug 28, 2019; DOI: 10.1126/sciadv.aat8296

We estimate the similarity of a corpus of documents as the median of its pairwise document similarities, using all n(n − 1)/2 pairs, that is, the (n choose 2) combinations of the n documents in the corpus. To estimate similarity between two documents, we adopted doc2vec (12), a well-known tool that generates a vector representation (called a “paragraph vector”) of a document. Specifically, we used a standard doc2vec model (30), trained on each domain corpus, to compute a vector for each document in our corpus. We defined the pairwise similarity of two documents as the cosine similarity of their respective document vectors (31).
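The corpus-level estimate described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the document vectors have already been produced by a doc2vec model and are passed in as plain lists of floats. The function names `cosine_similarity` and `corpus_similarity` are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import median


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def corpus_similarity(doc_vectors):
    """Median cosine similarity over all n-choose-2 document pairs.

    doc_vectors is assumed to hold one precomputed document vector
    (e.g., a doc2vec paragraph vector) per document in the corpus.
    """
    if len(doc_vectors) < 2:
        raise ValueError("need at least two documents to form a pair")
    sims = [cosine_similarity(u, v) for u, v in combinations(doc_vectors, 2)]
    return median(sims)
```

Because every pair is visited, the cost grows quadratically with corpus size. For large corpora, a vectorized or sampled variant may be preferable, although the text above specifies the exhaustive version.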

Although high median similarities can occur in annual corpora with low news volume (see fig. S1), we found that legislative activity tends to correlate with periods in which news volume and median similarity are simultaneously high. We therefore apply a threshold: the similarity of an annual corpus is set to zero if it contains fewer than c% of the articles in the respective domain corpus. We use a threshold of c = 5% in this paper.
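The volume threshold can be illustrated with a short sketch. The function and parameter names are hypothetical. It is also an assumption that an annual corpus holding exactly c% of the domain articles keeps its similarity, since the text zeroes only corpora below c%.

```python
def thresholded_similarity(median_similarity, n_annual_articles,
                           n_domain_articles, c_percent=5.0):
    """Return the annual corpus similarity, or zero for low-volume years.

    The similarity is set to zero when the annual corpus contains fewer
    than c_percent of the articles in the whole domain corpus.
    """
    if n_domain_articles <= 0:
        raise ValueError("domain corpus must contain at least one article")
    share = 100.0 * n_annual_articles / n_domain_articles
    return median_similarity if share >= c_percent else 0.0
```

For example, with a domain corpus of 1,000 articles and c = 5%, a year with 49 articles would be assigned a similarity of zero regardless of its median, while a year with 50 or more articles would keep its median similarity.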

We note that because cosine similarity is bounded in [−1, 1] and our models are trained on datasets that discuss a common topic, the variation in the similarities we obtain is relatively small compared with what a metric with a larger range would yield.

Despite this conservative choice, we demonstrate consistent Granger (G) causality with legislation (table S2). Stronger significance might be obtained with similarity measures that have larger ranges. However, the results obtained with our conservative approach inspire confidence in the validity of our hypothesis.
