A multi-class text classification model for sentiment analysis was implemented using the language-specific pre-trained BERT model for Swedish ("KB-BERT", specifically bert-base-swedish-cased (v1)), developed by KBLab at the National Library of Sweden (KB)42, and fine-tuned to suit the domain.
The last hidden layer of the KB-BERT model was extracted, and a single-hidden-layer feed-forward neural network was implemented as the sentiment classifier. This model was implemented for both the summary and the transcript format.
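The following is a minimal sketch of this architecture, assuming the publicly available KBLab/bert-base-swedish-cased checkpoint on the Hugging Face Hub. The number of sentiment classes, the size of the hidden layer in the classifier head, and the use of the [CLS] token from the last hidden layer are illustrative assumptions, not details taken from the text above.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class KBBertSentimentClassifier(nn.Module):
    """KB-BERT encoder with a single-hidden-layer feed-forward classifier head."""

    def __init__(self, num_classes: int = 3, clf_hidden: int = 256):  # sizes are assumptions
        super().__init__()
        self.bert = AutoModel.from_pretrained("KBLab/bert-base-swedish-cased")
        self.classifier = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, clf_hidden),
            nn.ReLU(),
            nn.Linear(clf_hidden, num_classes),
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Pool the last hidden layer via the [CLS] token (pooling strategy assumed).
        cls = outputs.last_hidden_state[:, 0, :]
        return self.classifier(cls)  # raw logits, one per sentiment class


tokenizer = AutoTokenizer.from_pretrained("KBLab/bert-base-swedish-cased")
model = KBBertSentimentClassifier()
batch = tokenizer(["Det var ett mycket bra samtal."],
                  return_tensors="pt", padding=True, truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"])
```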
Since this approach follows a semi-supervised methodology and only 1.5% of the data is annotated, additional measures had to be applied to ensure an accurate outcome. For this reason, a voting ensemble classifier was created with the aim of achieving better performance than a single classifier. As the name suggests, a soft-voting ensemble chooses the label based on the majority of classifier predictions: each model predicts a probability for each class, these probabilities are accumulated, and the class with the highest accumulated value is selected. The complete sentiment model therefore consists of three pre-trained KB-BERT models that were fine-tuned according to the dataset provided earlier.
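A sketch of the soft-voting step is shown below: each model's class probabilities are summed and the class with the highest accumulated probability is selected. The `models` list is assumed to hold the three fine-tuned classifier instances from the sketch above; loading and batching are omitted.

```python
import torch

@torch.no_grad()
def soft_vote(models, input_ids, attention_mask):
    """Soft-voting ensemble: sum per-class probabilities, pick the argmax."""
    summed = None
    for m in models:
        m.eval()
        probs = torch.softmax(m(input_ids, attention_mask), dim=-1)
        summed = probs if summed is None else summed + probs
    return summed.argmax(dim=-1)  # one predicted sentiment label per sentence
```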
First, the KB-BERT model was trained using the training set, and the hyperparameters, batch size and learning rate, were fine-tuned based on the validation set. These two datasets were then used to train each model within the ensemble classifier again. To further improve the reliability, it was decided to train the ensemble classifier again on another retrospectively annotated dataset. This time, the dataset was created from the unlabelled dataset, where the ensemble was first used to predict the full unlabelled set. Sentences on which the three classifiers did not agree on the predicted sentiment were annotated manually and fed back into the models. In total, 301 and 401 sentences were annotated for the Summary and Transcript category, respectively. Thereby, 1407 annotated sentences were used for training and evaluation, which is around 2.9% of the full dataset.
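The sketch below illustrates how the disagreement step could be detected on the unlabelled set so that those sentences can be routed to manual annotation. The "flag any sentence where not all three models agree" rule follows the description above; the function signature and batching are simplifying assumptions.

```python
import torch

@torch.no_grad()
def find_disagreements(models, input_ids, attention_mask):
    """Return indices of sentences on which the three classifiers disagree."""
    preds = []
    for m in models:
        m.eval()
        preds.append(m(input_ids, attention_mask).argmax(dim=-1))
    preds = torch.stack(preds)               # shape: (num_models, batch)
    agree = (preds == preds[0]).all(dim=0)   # True where all models predict the same label
    return (~agree).nonzero(as_tuple=True)[0]  # indices to send for manual annotation
```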
The metrics used for evaluation were precision, recall, accuracy, and the F1-score. For the summaries, a weighted average of the F1-score was used, while for the transcripts, a macro F1-score was applied.
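These metrics could be computed with scikit-learn as sketched below, using a weighted-average F1 for the summaries and a macro F1 for the transcripts. The label arrays are placeholders, not data from the study.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred, average):
    """Compute accuracy, precision, recall, and F1 with the given averaging scheme."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=average, zero_division=0)
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": precision, "recall": recall, "f1": f1}

# Placeholder gold labels and predictions for illustration only.
summary_gold, summary_preds = [0, 1, 2, 1], [0, 1, 1, 1]
transcript_gold, transcript_preds = [2, 0, 1, 2], [2, 0, 0, 2]

summary_scores = evaluate(summary_gold, summary_preds, average="weighted")
transcript_scores = evaluate(transcript_gold, transcript_preds, average="macro")
```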