2.3.2. Topic Evolution Analysis Based on the LDA Model

Yuanyuan Ge
Xingying Xu
Meng Cao
Baijun Liu
Ying Wang
Ping Liao
Jiajing Wang
Yifei Chen
Hongmei Yuan
Guiliang Chen

Patent text mining can uncover latent patterns and internal relations in large volumes of unstructured text and is an important method for technical topic evolution analysis. In this study, we use the Latent Dirichlet Allocation (LDA) model to identify the technical topics contained in patents as part of that analysis. Technical topic evolution refers to the dynamic change of technical topics across continuous, sliding time windows; it traces the course of technological innovation over time and reveals its regularities from a dynamic perspective. The LDA model, proposed by Blei [21], is an unsupervised machine learning model that extracts and analyzes topics from text at the probabilistic level. It is a three-layer Bayesian model with a “document–topic–word” structure and can be used to mine large-scale text data. At present, research methods for technical topic evolution analysis mainly fall into three categories.

In summary, methods based on bibliometrics and patent citation analysis have limitations: they do not describe the process of topic change in a technical field comprehensively or accurately enough, and the accuracy and rigor of their results need improvement. Applying the LDA topic model to the evolution of technical topics, by contrast, allows the technical topics to be identified accurately. The LDA topic model has been widely used for topic recognition in many fields and can effectively analyze large-scale unstructured document sets [28,29,30,31].

Figure 2 shows the LDA model construction process. For each document, the LDA model first draws a topic from the document's topic distribution, then randomly selects a word from the vocabulary of that topic, and repeats this process until every word in the document has been traversed, finally generating the document–topic distribution and the topic–vocabulary distribution.

Figure 2. Probabilistic graph model of the Latent Dirichlet Allocation (LDA) algorithm.
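The sampling procedure described for the LDA model can be illustrated with a minimal sketch that uses only the Python standard library. It shows the generative view only (draw a topic, then draw a word from that topic); fitting an LDA model to real patents inverts this process and is not shown. The function names, the toy vocabulary, the two topic–word distributions, and the Dirichlet prior `alpha` are all assumptions made for illustration and do not come from the study.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, topic_word, alpha, rng):
    """Generate one document following the LDA generative process.

    topic_word: list of K word distributions, each a list over the vocabulary.
    alpha: Dirichlet prior over the K topics (hypothetical values).
    """
    n_topics = len(topic_word)
    vocab_ids = range(len(topic_word[0]))
    # Document-topic distribution for this document.
    theta = sample_dirichlet(alpha, rng)
    topics, words = [], []
    for _ in range(n_words):
        # Step 1: pick a topic for this word position from theta.
        z = rng.choices(range(n_topics), weights=theta, k=1)[0]
        # Step 2: pick a word from the chosen topic's word distribution.
        w = rng.choices(vocab_ids, weights=topic_word[z], k=1)[0]
        topics.append(z)
        words.append(w)
    return theta, topics, words


rng = random.Random(0)
# Toy vocabulary and two toy topics (illustrative values only).
vocab = ["battery", "electrode", "sensor", "signal"]
topic_word = [
    [0.5, 0.4, 0.05, 0.05],
    [0.05, 0.05, 0.5, 0.4],
]
theta, topics, words = generate_document(10, topic_word, alpha=[0.5, 0.5], rng=rng)
print([vocab[w] for w in words])
```

In this sketch the topic–word distributions are fixed by hand; in a fitted model they are estimated from the corpus together with each document's topic distribution.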

However, the topics generated by the LDA topic model under each time window exist independently. To explore the evolutionary relationship between topics, the continuity between topics is measured by the cosine similarity between the topics of neighboring time windows, from which the association between those topics is established. The calculation is shown in Equation (1):

$$S(T_i^{t-1}, T_j^{t}) = \frac{\sum_{n_0=1}^{N_0} p(w_{n_0} \mid T_i^{t-1})\, p(w_{n_0} \mid T_j^{t})}{\sqrt{\sum_{n_0=1}^{N_0} p(w_{n_0} \mid T_i^{t-1})^{2}}\, \sqrt{\sum_{n_0=1}^{N_0} p(w_{n_0} \mid T_j^{t})^{2}}} \tag{1}$$

where $T_i^{t-1}$ is the $i$th topic under time window $t-1$, $T_j^{t}$ is the $j$th topic under time window $t$, $p(w_{n_0} \mid T_i^{t-1})$ and $p(w_{n_0} \mid T_j^{t})$ are the topic–vocabulary distributions under time windows $t-1$ and $t$, $w_{n_0}$ is the $n_0$th word in the topic, and $N_0$ is the total number of words contained in the topic. The closer the value of $S(T_i^{t-1}, T_j^{t})$ is to 1, the more similar the topics are. A threshold, $\varepsilon$, is set to remove weak and invalid associations: when $S(T_i^{t-1}, T_j^{t}) \geq \varepsilon$, the current topic is a continuation of the previous topic; if $S(T_i^{t-1}, T_j^{t}) < \varepsilon$, the current topic is not related to the previous topic. In this study, the threshold, $\varepsilon$, was set to 0.2.
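The similarity and threshold rule can be sketched in a few lines of standard-library Python. This is an illustration, not the authors' implementation: the function names `cosine_similarity` and `link_topics` and the toy distributions are hypothetical, and the sketch assumes that the topics of both time windows are expressed over the same vocabulary in the same word order, which the text does not state explicitly.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two topic-word distributions over a shared vocabulary."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # An all-zero vector has no direction; treat it as unrelated.
    return dot / (norm_p * norm_q)


def link_topics(prev_topics, curr_topics, epsilon=0.2):
    """Return (i, j, similarity) for topic pairs whose similarity reaches epsilon."""
    links = []
    for i, p in enumerate(prev_topics):
        for j, q in enumerate(curr_topics):
            s = cosine_similarity(p, q)
            if s >= epsilon:
                links.append((i, j, s))
    return links


# Toy distributions over a shared four-word vocabulary (illustrative values only).
topics_prev = [[0.7, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 0.7]]  # time window t-1
topics_curr = [[0.6, 0.3, 0.1, 0.0], [0.0, 0.0, 0.1, 0.9]]  # time window t

for i, j, s in link_topics(topics_prev, topics_curr):
    print(f"topic {i} (t-1) -> topic {j} (t): S = {s:.3f}")
```

Pairs that fall below the threshold are simply omitted, so a topic at time window t with no incoming link would be read as unrelated to every topic of the previous window.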

The topic evolution model was implemented in Python (version 3.9.0), and text preprocessing was carried out with the Jieba toolkit.
