We evaluated the performance of the final models using four approaches. First, a frame-wise evaluation compares the ball status in the ground truth to the prediction in each frame. As performance metrics, we computed the accuracy and F1-score for each model. Furthermore, we compared each prediction to a random-guessing baseline per match, given by the percentage of in-game frames; we refer to this difference as the knowledge gain, calculated for each match as the prediction accuracy minus the percentage of in-game frames. However, good frame-wise performance does not necessarily translate into correctly identified match interruptions, since all consecutive frames of a stoppage must carry the same label. For example, a single ball-in detection within a long streak of ball-out labels splits one stoppage into two.
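As a minimal sketch of the frame-wise evaluation, the metrics above can be computed as follows. The function name, the binary label encoding (1 = ball in play, 0 = ball out of play), and the choice of the ball-out class as the positive class for the F1-score are illustrative assumptions, not specified in the text.

```python
def frame_wise_metrics(y_true, y_pred):
    """Frame-wise accuracy, F1-score and knowledge gain.

    y_true, y_pred: sequences of 1 (ball in play) / 0 (ball out of play).
    The F1-score here treats ball-out frames as the positive class
    (an assumption; the paper does not state the positive class).
    """
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    # F1-score for the ball-out class (label 0)
    tp = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0

    # Knowledge gain: accuracy minus the random-guessing baseline,
    # i.e. the share of in-game frames in this match.
    in_game_share = sum(y_true) / n
    knowledge_gain = accuracy - in_game_share
    return accuracy, f1, knowledge_gain
```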
Consequently, and second, model performance was evaluated stoppage-wise. The basic idea is to extract stoppages from the ground truth and the predictions and to search for matching pairs between them. Stoppages are extracted by identifying the frames at which the match was interrupted and resumed: the first and last frames with the ball-out label define the start and end of a stoppage, respectively. A suitable metric to compare predicted stoppages with actual ones is the intersection over union (IoU), which is common in object detection benchmarks31,32. The IoU of a pair of real and predicted stoppages is the overlap time between them divided by the overall time covered by both stoppages, i.e., the overlap time plus the sum of the non-overlapping durations. For our paper, two stoppages are matched if their IoU is at least 50%, which guarantees that each real stoppage is matched with at most one predicted stoppage and vice versa.
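The IoU computation and the 50% matching rule can be sketched as follows, assuming stoppages are represented as (start, end) frame intervals; the function names and interval representation are hypothetical.

```python
def interval_iou(a, b):
    """IoU of two stoppages given as (start, end) intervals on the frame axis."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - overlap
    return overlap / union if union else 0.0

def match_stoppages(real, predicted, threshold=0.5):
    """Pair real and predicted stoppages whose IoU is at least `threshold`.

    With a threshold of 0.5, each stoppage can take part in at most one
    pair: no two intervals can both overlap the same third interval with
    an IoU above one half.
    """
    matches = []
    for r in real:
        for p in predicted:
            if interval_iou(r, p) >= threshold:
                matches.append((r, p))
    return matches
```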
A good model should be applicable to analysis tasks. Hence, the results of performance metrics computed from the predicted ball status must be comparable to those computed from the real data. The IoU metric assigns a predicted stoppage to the corresponding ground truth interruption even when the overlap is imperfect. Consequently, the predicted start and end points of each matched stoppage may be shifted relative to the true ones, which affects applications in video analysis tasks.
Third, we checked how much the predicted stoppages’ start and end points deviated from the ground truth by calculating the shifts between the real and predicted points for the start and end of each matched stoppage. A fundamental task in video analysis is to analyze standard situations; thus, we assumed that our predictions should fall within ±2 s of the true values. In that case, a practitioner could easily find and add time marks to analyze the execution of a standard situation.
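A sketch of this boundary-shift check is given below. The frame rate of 25 Hz is an assumption for illustration (the text does not state the tracking rate), and the function name and pair representation are likewise hypothetical.

```python
def boundary_shifts(matched_pairs, fps=25, tolerance_s=2.0):
    """Start/end shifts in seconds for matched stoppage pairs.

    matched_pairs: list of ((real_start, real_end), (pred_start, pred_end))
    in frames. fps is the tracking frame rate (25 Hz assumed here).
    Returns the signed shifts and the share of boundaries that fall
    within +/- tolerance_s of the ground truth.
    """
    shifts = []
    for (r_start, r_end), (p_start, p_end) in matched_pairs:
        shifts.append((p_start - r_start) / fps)  # start-point shift
        shifts.append((p_end - r_end) / fps)      # end-point shift
    within = sum(abs(s) <= tolerance_s for s in shifts) / len(shifts)
    return shifts, within
```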
Fourth, we evaluated the quality of the performance indicator Total Distance Covered (TDC) during the effective playing time (TDCE). TDC is one of the most common performance indicators for estimating workload33–35. TDCE represents the running activity while the ball is in play and can be interpreted as the match intensity. Since errors in the predicted ball status propagate into TDCE, we checked whether this error is acceptable for performance analysis. Therefore, we calculated TDCE for each player three times per match, using the ball status based on (1) the ground truth, (2) the AD prediction, and (3) a naïve approach, and compared the results. For the naïve approach, we approximated TDCE by multiplying the player’s real TDC for the whole match by the mean percentage of effective playing time across all matches. In our test matches, the mean percentage of effective playing time was 59.6 ± 6.0% (min. = 47.4%, max. = 69.7%). To reduce noise, only field players who played a full match were included in the analysis.
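The two ways of obtaining TDCE can be sketched as follows; the function names and the per-frame distance representation are illustrative assumptions, while the 59.6% mean share of effective playing time is taken from the test matches reported above.

```python
def tdce_from_status(frame_distances, ball_status):
    """TDCE from a ball-status sequence (ground truth or prediction):
    sum the per-frame distance covered only over in-play frames (status 1)."""
    return sum(d for d, s in zip(frame_distances, ball_status) if s == 1)

def naive_tdce(tdc_match, mean_effective_share=0.596):
    """Naive TDCE approximation: the player's whole-match TDC scaled by
    the mean share of effective playing time over all matches
    (59.6% in the paper's test matches)."""
    return tdc_match * mean_effective_share
```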