Discriminative keywords
The public and legislative impact of hyperconcentrated topic news

Procedure

We are interested in identifying and summarizing those aspects of a domain’s current framing that distinguish it from the domain’s framing at a previous time period. To this end, we adopted the idea of an entropic formulation of discriminative keywords, as proposed by Sheshadri et al. (26).

Below, a corpus T is a set of news articles. Specifically, given two disjoint sets of news articles $T_1$ and $T_2$, we identified the set of k n-grams that yield the largest Cross Entropy (29) in the combined corpus $T = T_1 \cup T_2$. Let A be an article in corpus T. Let $x_i$ represent any of the m possible n-grams in T. Let $S(x_i, T) = \{A \in T \mid x_i \in A\}$ be the set of articles in corpus T in which the n-gram $x_i$ appears. We used a |T| × m term frequency matrix representing the corpus to calculate H, the information entropy of T. We used MATLAB’s fitctree and predictor importance functions with a split criterion parameter of “deviance” to estimate the utility of each n-gram:

$IG(T, x_i) = H(T) - \frac{|S(x_i, T)|}{|T|} H(S(x_i, T))$ (1)
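To make Eq. 1 concrete, the following is a minimal Python sketch of the quantity it describes. It is not the MATLAB fitctree procedure used in the study. It makes several assumptions of its own: each article is represented as a set of unigrams, the class label of an article is the corpus it came from ("T1" or "T2"), H is the Shannon entropy of those labels in bits, and Eq. 1 is read as H(T) minus the fraction of articles containing the n-gram times the entropy of that subset. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(articles, labels, ngram):
    """Eq. 1 as read here: H(T) - |S(x_i, T)| / |T| * H(S(x_i, T))."""
    # S(x_i, T): labels of the articles in which the n-gram appears
    subset = [lab for art, lab in zip(articles, labels) if ngram in art]
    return entropy(labels) - len(subset) / len(labels) * entropy(subset)


def top_k_keywords(articles, labels, k):
    """Return the k n-grams with the largest score (ties broken alphabetically)."""
    vocab = sorted(set().union(*articles))
    scored = [(information_gain(articles, labels, x), x) for x in vocab]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [x for _, x in scored[:k]]


# Hypothetical toy corpus: two articles from T1, two from T2.
articles = [
    {"policy", "bill", "vote"},
    {"policy", "bill"},
    {"policy", "leak", "vote"},
    {"policy", "leak"},
]
labels = ["T1", "T1", "T2", "T2"]

for x in sorted(set().union(*articles)):
    print(x, information_gain(articles, labels, x))
print(top_k_keywords(articles, labels, 2))
```

In this toy corpus, "policy" occurs in every article and so carries no information about corpus membership, whereas "bill" and "leak" each occur in only one of the two corpora and score highest.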

Following Entman’s (9) formulation, this approach weights n-grams that are specific to a particular corpus more highly than n-grams that are common to both corpora. A quick intuition for the approach is obtained by considering that the unigram “Snowden” may have a high utility in distinguishing Surveillance articles published after 1 January 2014 from those published before that date, whereas the unigram “surveillance” is common to articles from both periods and therefore may not. Because keywords from a particular news corpus distinguish it from others, they may be said to represent the “concentration” of news in that corpus.
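As a hypothetical worked instance of this intuition (the counts are invented for illustration and are not taken from the study), suppose T holds eight Surveillance articles, four published before 1 January 2014 and four after, so that H(T) = 1 bit when the class label is the period. Assume further that “surveillance” appears in all eight articles and “Snowden” appears only in the four later ones. Reading Eq. 1 as H(T) minus the fraction of articles containing the n-gram times the entropy of that subset, the two unigrams score as follows:

$IG(T, \text{surveillance}) = 1 - \frac{8}{8} \cdot 1 = 0$

$IG(T, \text{Snowden}) = 1 - \frac{4}{8} \cdot 0 = 1$

The subset of articles containing “surveillance” is as mixed as the full corpus, so the unigram contributes nothing, while the subset containing “Snowden” is drawn entirely from one period and attains the maximum score.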

