Tree-Based Scan Statistics for Cohort Data

Judith C. Maro
Michael D. Nguyen
Inna Dashevsky
Meghan A. Baker
Martin Kulldorff

The tree-based scan statistic detects elevated frequencies of outcomes in electronic health data that have been grouped into hierarchical tree structures. In our case, the tree structure is derived from the Agency for Healthcare Research and Quality’s Multi-Level Clinical Classifications Software (MLCCS) (http://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp). The MLCCS groups outcomes into clinically meaningful categories and arranges them into four grouping levels. The broadest grouping identifies eighteen body systems, and the narrowest grouping may contain multiple ICD-9-CM codes, forming a “branch.” Each individual ICD-9-CM code is a “leaf.” Any particular location on the tree, whether at the leaf or the branch level, is referred to as a node. Table 1 shows an example branch.
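The leaf, branch, and node terminology can be illustrated with a minimal sketch. The tree below is a hypothetical toy example: the labels are placeholders rather than actual MLCCS categories or ICD-9-CM codes, and the function name `node_counts` is an assumption made for illustration. The sketch shows how outcome counts recorded at the leaves roll up to every node above them.

```python
from collections import defaultdict

# Hypothetical toy tree stored as child -> parent. The labels are placeholders,
# not actual MLCCS categories or ICD-9-CM codes.
PARENT = {
    "leaf_a": "branch_1",
    "leaf_b": "branch_1",
    "leaf_c": "branch_2",
    "branch_1": "system_x",
    "branch_2": "system_x",
    "system_x": None,  # broadest grouping level in this toy tree
}


def node_counts(leaf_counts, parent=PARENT):
    """Add each leaf-level count to the leaf itself and to all of its ancestors."""
    totals = defaultdict(int)
    for leaf, count in leaf_counts.items():
        node = leaf
        while node is not None:
            totals[node] += count
            node = parent[node]
    return dict(totals)


counts = node_counts({"leaf_a": 3, "leaf_b": 1, "leaf_c": 2})
```

Here `counts` holds one entry per node, so a branch carries the sum of the counts on its leaves and the top-level node carries the total for the whole toy tree.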

We curated the full MLCCS tree by excluding ICD-9-CM outcome codes that 1) are unlikely to be caused by medical product exposures, such as codes for well-care visits and pregnancy; 2) are unlikely to manifest within a few weeks after exposure, such as cancer; and 3) are common and of a less serious or nonspecific nature, such as fever or diarrhea. After curating the original thirteen thousand unique ICD-9-CM codes, we evaluated 6,162 ICD-9-CM codes, each of which represents an individual leaf on the tree. Overall, there are 6,861 nodes on the tree. The curated tree is available upon request.

The null hypothesis being tested is that, for all nodes on the tree, outcomes occur in proportion to the underlying expected count defined for that node, with counts generated from a Poisson distribution. The alternative hypothesis is that one or more particular nodes on the tree have outcomes occurring with higher probability than the specified expected counts on those nodes.

A log-likelihood ratio was calculated for every node on the tree. The maximum among these log-likelihood ratios from the real dataset is the test statistic for the entire analytic dataset. This maximum was compared with the maximum log-likelihood ratios calculated in the same way from simulated datasets generated under the null hypothesis. If the test statistic from the real dataset is among the highest 5 percent of all the maxima, the null hypothesis is rejected. Because the comparison uses the maximum over the whole tree, it adjusts for the multiple testing inherent in evaluating more than six thousand nodes, which allows one to detect whether any node on the tree had a statistically significant cluster of excess outcomes [31]. Specific details of this procedure are included in the eAppendix.
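The procedure above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the log-likelihood ratio is the standard unconditional Poisson form from the tree-based scan statistic literature and may differ in notation from the eAppendix, and the toy tree, the expected counts, the 999 replicates, and all function names are hypothetical.

```python
import math
import random

# Hypothetical toy tree (child -> parent) and made-up expected counts per leaf.
PARENT = {"leaf_a": "branch_1", "leaf_b": "branch_1", "leaf_c": "branch_2",
          "branch_1": "root", "branch_2": "root", "root": None}
EXPECTED = {"leaf_a": 1.0, "leaf_b": 1.0, "leaf_c": 2.0}


def llr(observed, expected):
    """Unconditional Poisson log-likelihood ratio; zero unless there is an excess."""
    if observed <= expected:
        return 0.0
    return observed * math.log(observed / expected) - (observed - expected)


def poisson_draw(mean, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, k, product = math.exp(-mean), 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def max_llr(leaf_counts, leaf_expected, parent):
    """Aggregate leaves up the tree and return the largest node-level LLR."""
    obs, exp = {}, {}
    for leaf, mu in leaf_expected.items():
        node = leaf
        while node is not None:
            obs[node] = obs.get(node, 0) + leaf_counts.get(leaf, 0)
            exp[node] = exp.get(node, 0.0) + mu
            node = parent[node]
    return max(llr(obs[n], exp[n]) for n in obs)


def monte_carlo_p(leaf_counts, leaf_expected, parent, replicates=999, seed=0):
    """Rank the observed maximum LLR among maxima simulated under the null."""
    rng = random.Random(seed)
    observed_max = max_llr(leaf_counts, leaf_expected, parent)
    at_least = 0
    for _ in range(replicates):
        sim = {leaf: poisson_draw(mu, rng) for leaf, mu in leaf_expected.items()}
        if max_llr(sim, leaf_expected, parent) >= observed_max:
            at_least += 1
    return observed_max, (at_least + 1) / (replicates + 1)


# Made-up observed data with a large excess on leaf_a.
observed_max, p_value = monte_carlo_p(
    {"leaf_a": 9, "leaf_b": 1, "leaf_c": 2}, EXPECTED, PARENT)
reject_null = p_value <= 0.05
```

Because each simulated dataset contributes only its single largest log-likelihood ratio, the resulting p-value already accounts for the fact that every node on the tree was examined.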

Tree-based scan statistics can be used unconditionally, or one can condition on the total number of observed outcomes in the dataset. Mathematical expressions for both versions can be found in the eAppendix. Conditioning is a mechanism to control for situations in which there is an across-the-board increase in health care utilization during a particular time period that is unrelated to the exposure of interest. This situation might occur commonly in vaccine safety surveillance when the cohort has follow-up tests or visits in the days immediately following the well-care visit at which a vaccine was administered. The conditional tree-based scan statistic attenuates the effect of this exposure-unrelated health care utilization by standardizing all diagnoses by the frequency with which they appear in the dataset.
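The effect of conditioning can be seen in a small numerical sketch. The formulas below are the standard Poisson log-likelihood ratios from the tree-based scan statistic literature and are assumed here for illustration; they may differ in notation from the expressions in the eAppendix. The node labels, counts, and function names are hypothetical. In the made-up data every node has exactly twice its expected count, mimicking an across-the-board increase in utilization.

```python
import math


def unconditional_llr(c, n):
    """c observed and n expected outcomes in a node; zero unless c exceeds n."""
    if c <= n:
        return 0.0
    return c * math.log(c / n) - (c - n)


def conditional_llr(c, n, total):
    """Conditional form: n is rescaled so expected counts sum to the observed total."""
    if c <= n:
        return 0.0
    inside = c * math.log(c / n)
    # Treat 0 * log(0) as 0 when every outcome falls inside the node.
    outside = 0.0 if c == total else (total - c) * math.log((total - c) / (total - n))
    return inside + outside


# Hypothetical leaf-level data: every observed count is double its expected count.
expected = {"leaf_a": 1.0, "leaf_b": 1.0, "leaf_c": 2.0}
observed = {"leaf_a": 2, "leaf_b": 2, "leaf_c": 4}

total_obs = sum(observed.values())
scale = total_obs / sum(expected.values())  # rescale expected to the observed total

uncond = {k: unconditional_llr(observed[k], expected[k]) for k in expected}
cond = {k: conditional_llr(observed[k], expected[k] * scale, total_obs)
        for k in expected}
```

In this example every entry of `uncond` is positive, whereas every entry of `cond` is zero, because after rescaling no node has more outcomes than its share of the observed total.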
