We used cluster analysis to identify subgroups of participants with similar message histories. The analysis was performed in the Communication History Analysis Interface (CHAI), a visual interface that we developed to let users visualize participants’ message histories, run cluster analysis, and explore the results.
To identify these subgroups, we employed the k-means clustering method [45] to cluster parent and teen pairs by the topics that they discussed with their coaches. K-means cluster analysis partitions a set of n-dimensional points into K clusters [45]. Each parent and teen pair’s communications with the coaches were represented as a 15-dimensional vector, with 1 dimension for each topic identified in the topic modeling procedure and each entry giving the number of messages on that topic. For example, suppose a parent and teen pair authored 6 messages in total, 2 each for topics 3, 5, and 7. Their contribution would be represented by {0, 0, 2, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0}. Thus, the vector representing each parent and teen pair captures the common topics within that pair’s communications, and the cluster analysis groups together parent and teen pairs that discussed similar topics. Because 1 topic, Time, was highly prevalent yet carried no specific meaning beyond references to time, it was excluded from the clustering.
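The vector construction described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CHAI code: the function name `topic_count_vector`, the 1-based topic numbering, and the `exclude` parameter are hypothetical, and the topic number used for Time in the last line is a placeholder because the text does not say which topic it was.

```python
from collections import Counter

N_TOPICS = 15  # number of topics from the topic modeling procedure


def topic_count_vector(message_topics, n_topics=N_TOPICS, exclude=()):
    """Count messages per topic (topics numbered from 1) for one parent-teen pair.

    `exclude` lists topic numbers to drop from the vector, for example a
    topic that should not take part in the clustering.
    """
    counts = Counter(message_topics)
    return [counts.get(t, 0) for t in range(1, n_topics + 1) if t not in exclude]


# The example from the text: 6 messages, 2 each for topics 3, 5, and 7.
pair_topics = [3, 3, 5, 5, 7, 7]
vector = topic_count_vector(pair_topics)

# Dropping one topic (placeholder number 15 stands in for Time) shortens the vector.
vector_without_time = topic_count_vector(pair_topics, exclude={15})
```

With the example input, `vector` reproduces the 15-element vector given in the text, and excluding a single topic leaves 14 entries.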
We employed 2 methods together, visual examination and the inverse scree plot [46], to select the number of clusters. We plotted the variance for solutions with the number of clusters k varying from 1 to 20, and selected 4 as the optimal solution for 2 reasons. First, beyond this point increasing the number of clusters led to less substantial decreases in variance, although there was no “clear bend.” Second, we visually examined solutions with differing numbers of clusters through the CHAI interface and decided on 4 to err on the side of coarser clusters that illustrated differences in participants’ textual communications but did not differentiate too granularly within the sample. The k-means clustering method can be sensitive to the starting seeds [47]. To avoid bias, we repeated the clustering with different starting seeds and observed that the defining characteristics of the clustering solutions remained the same across the repetitions.
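The selection procedure can be illustrated with a small, self-contained sketch. It is not the CHAI implementation, which was built on scikit-learn; here a plain-Python version of Lloyd's algorithm stands in for k-means so the example runs without dependencies. The data are synthetic, the function names (`kmeans`, `wcss_by_k`) are hypothetical, and the within-cluster sum of squares (WCSS) is assumed to be the variance measure plotted against k.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed, n_iter=100):
    """Plain Lloyd's algorithm; returns (labels, within-cluster sum of squares)."""
    rng = random.Random(seed)
    # The starting seeds are k data points drawn at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    wcss = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, wcss


def wcss_by_k(points, k_values, seeds):
    """Lowest WCSS found over several starting seeds, for each k."""
    return {k: min(kmeans(points, k, s)[1] for s in seeds) for k in k_values}


# Synthetic data (hypothetical): three separated groups in 2 dimensions.
rng = random.Random(0)
points = [
    [cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3)]
    for cx, cy in [(0, 0), (5, 5), (0, 5)]
    for _ in range(20)
]

# Values to plot against k when looking for the point of diminishing returns.
curve = wcss_by_k(points, range(1, 7), seeds=range(5))

# Seed sensitivity: WCSS of the same k under different starting seeds.
per_seed = [kmeans(points, 3, s)[1] for s in range(5)]
```

Plotting `curve` against k gives the kind of curve examined for a bend, and the spread of `per_seed` is a crude proxy for how much a solution depends on its starting seeds. Comparing WCSS alone does not show whether cluster characteristics are preserved, so a check like the one described in the text would also compare the topic profiles of the resulting clusters.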
We examined the results of the cluster analysis using the CHAI application. Its clustering feature has 2 primary views: an Overview of the clustering results (Figure 2) and a Cluster Detail view that can be used to examine the messages for each cluster. CHAI provides summaries of cluster engagement characteristics that show the prevalence of all topics in each cluster, so that users can compare the clusters in terms of topic and authorship. The application performs the cluster analysis and displays participants’ message histories by cluster. For any given participant identification (ID) number, the message history is rendered as a horizontal sequence, with the earliest message on the left and the last message on the right. The right pane enables users to view outcomes and demographic characteristics for each cluster. The CHAI application was developed using Python, the machine learning library scikit-learn, and Web development frameworks and JavaScript visualization libraries including AngularJS and D3.
Communication history analysis interface: (a) cluster controls, (b) cluster engagement characteristics (theme proportions and parent/teen participation), (c) message sequences, and (d) cluster demographics.