Topic modeling is a widely used text-mining method that aims to uncover latent semantic structures in document sets by identifying potential topics.25 It can statistically capture those topics with different algorithms,22 such as Principal Components Analysis (PCA) and Latent Dirichlet Allocation (LDA). LDA is a popular algorithm in natural language processing (NLP) and was employed in this study. First, a Python toolkit was used to parse the reviews, after which meaningless words (eg, "I" and "we") and high-frequency words (eg, "doctor" and "patient") were excluded from the texts. Using LDA, we selected the optimal number of topics in the corpus based on the perplexity criterion, which measures model quality. We experimented with candidate topic numbers ranging from 2 to 10, running the LDA model 10 times for each candidate in Anaconda 3. After evaluating the perplexity statistics, we found that the minimum perplexity occurred at 3 topics.
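The topic-number search described above can be sketched as follows. This is an illustrative reconstruction, not the study's exact pipeline: the scikit-learn LDA implementation, the toy corpus, and the use of built-in English stop words (rather than the study's custom word lists) are all assumptions.

```python
# Sketch: choosing the LDA topic count by minimum perplexity.
# Assumes scikit-learn's LDA; the real study used its own parsed,
# filtered review corpus rather than this toy stand-in.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-in for the cleaned review texts (hypothetical examples).
reviews = [
    "friendly staff short wait",
    "long wait rude staff",
    "accurate diagnosis clear explanation",
    "explanation unclear diagnosis rushed",
    "clean office easy parking",
    "parking difficult office crowded",
    "helpful nurse quick appointment",
    "appointment delayed nurse helpful",
]

# Bag-of-words counts; built-in stop list stands in for the study's
# exclusion of meaningless and high-frequency words.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)

# Fit LDA for each candidate topic number (2 to 10, as in the study)
# and record perplexity on the corpus.
perplexities = {}
for k in range(2, 11):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(X)
    perplexities[k] = lda.perplexity(X)

# Select the topic count with the lowest perplexity.
best_k = min(perplexities, key=perplexities.get)
```

With the study's corpus this procedure returned a minimum at 3 topics; on a toy corpus like the one above the winning value will of course differ.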
Data management was performed using Microsoft Excel 2017, while SPSS software version 22.0 was used for all statistical analyses. Given the exploratory nature of our study, stepwise regression was deemed suitable because it can screen for significant independent variables affecting the dependent variable and simplify the regression equation. Specifically, independent variables were introduced only if their partial regression sums of squares were significant, and variables deemed to have little influence on the dependent variable were eliminated to identify the optimal regression subset. Our models were constructed hierarchically: control variables were entered in model 1, independent variables were added in model 2, and interaction terms were added in model 3. All reported P values were 2-sided, and P < .05 was considered statistically significant. The regression equation was expressed as equation (1), where β0 represents the constant term, β1 through β20 represent the regression coefficients, and ε represents the error term. The interaction term Topic_i,t-1 × Spe_dummy_i,t-1 (i = 1, 2, 3) reflects the moderating effect of the specialty.
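The hierarchical model-building logic (models 1-3) can be sketched in code. This is a minimal illustration under assumed, hypothetical variable names (one control, one topic score, one specialty dummy); the study's actual equation (1) contains 20 coefficients, and the analyses were run in SPSS rather than Python. A key property the sketch demonstrates is that R-squared cannot decrease as each block of predictors is added.

```python
# Sketch: hierarchical OLS with three nested models, using plain
# least squares. Variable names and simulated data are hypothetical.
import numpy as np

def ols_r2(X, y):
    """Fit OLS with an intercept and return R-squared."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(0)
n = 200
control = rng.normal(size=n)                 # model 1: control variable
topic = rng.normal(size=n)                   # model 2: topic variable (lagged in the study)
spe_dummy = rng.integers(0, 2, size=n).astype(float)  # model 2: specialty dummy
interaction = topic * spe_dummy              # model 3: moderating term

# Simulated outcome (coefficients here are arbitrary illustration values).
y = 1.0 + 0.5 * control + 0.8 * topic + 0.4 * interaction \
    + rng.normal(scale=0.3, size=n)

# Model 1: controls only; model 2: + independent variables;
# model 3: + interaction terms, mirroring the hierarchical design.
r2_m1 = ols_r2(np.column_stack([control]), y)
r2_m2 = ols_r2(np.column_stack([control, topic, spe_dummy]), y)
r2_m3 = ols_r2(np.column_stack([control, topic, spe_dummy, interaction]), y)
```

Comparing the R-squared values (and, in the study, the significance of the added coefficients) across the three nested models shows how much explanatory power each block contributes.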