Extracting attention values

Julien Heitmann; Alban Glangetas; Jonathan Doenz; Juliane Dervaux; Deeksha M. Shama; Daniel Hinjos Garcia; Mohamed Rida Benissa; Aymeric Cantais; Alexandre Perez; Daniel Müller; Tatjana Chavdarova; Isabelle Ruchonnet-Metrailler; Johan N. Siebert; Laurence Lacroix; Martin Jaggi; Alain Gervaix; Mary-Anne Hartley


Concise Method

This method is extracted from research article: NPJ Digit Med, Jun 2023

DeepBreath—automated detection of respiratory pathology from lung auscultation in 572 pediatric outpatients across 5 countries

DOI: 10.1038/s41746-023-00838-3

DeepBreath is interpretable by design. At the level of the CNN model, we can plot the segment-level predictions $\{p(x_{i})\}_{i = 1}^{T}$ and the attention values $\{g(x_{i})\}_{i = 1}^{T}$ for the $T$ segments of a recording. Along the time dimension, these values identify the parts of the recording that contribute most to the prediction. By comparing them with segment-level annotations made by medical doctors (identifying inspiration and expiration), we can visualize how the model interprets disease over the breath cycle, which allows clinicians to interrogate the model’s alignment with physiology.
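One simple way to make the comparison with annotations concrete is to average the attention values that fall inside each annotated breath phase. The sketch below is illustrative only and is not taken from the DeepBreath code: the function name `mean_attention_by_label`, the representation of each segment by a single center time in seconds, and the `(start, end, label)` annotation format are all assumptions, and the numbers in the example are made up.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple

# Hypothetical annotation format: (start_s, end_s, label), end exclusive.
Annotation = Tuple[float, float, str]


def mean_attention_by_label(
    centers_s: Sequence[float],
    attention: Sequence[float],
    annotations: Sequence[Annotation],
) -> Dict[str, float]:
    """Average the attention values g(x_i) within each annotated phase.

    A segment is assigned to the first annotation whose interval contains
    its center time; segments outside every annotation are ignored.
    """
    if len(centers_s) != len(attention):
        raise ValueError("centers_s and attention must have the same length")

    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for t, g in zip(centers_s, attention):
        for start, end, label in annotations:
            if start <= t < end:
                sums[label] += g
                counts[label] += 1
                break
    return {label: sums[label] / counts[label] for label in sums}


# Made-up example: four segments, two annotated phases.
example = mean_attention_by_label(
    centers_s=[0.5, 1.5, 2.5, 3.5],
    attention=[0.1, 0.2, 0.4, 0.3],
    annotations=[(0.0, 2.0, "inspiration"), (2.0, 4.0, "expiration")],
)
print(example)
```

A higher mean for one phase would suggest that the model attends more to that part of the breath cycle, although a per-phase mean is only one of several possible summaries.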

Each CNN audio classifier passes over a recording and computes segment-level outputs, then aggregates those intermediate outputs into a single clip-level prediction. The duration captured by a single segment is determined by the size of the receptive field of the CNN architecture. The receptive field of the final convolutional layer has a width of 78, which corresponds to a duration of 1296 ms. For every segment, the CNN model computes an attention value $g(x_{i})$ and a prediction $p(x_{i})$. The attention value $g(x_{i})$ determines how strongly the segment prediction $p(x_{i})$ is weighted in the overall clip-level output $p(x)$. Plotting $\{g(x_{i})\}_{i = 1}^{T}$ therefore lets us identify the parts of the recording (and hence of the respiration) that contribute strongly to the clip-level prediction.

To interpret these singled-out parts, we used annotations of breath sounds, which were provided for the recordings from Geneva. With those annotations, we can evaluate whether the way medical experts label breath sounds resembles the way respiration is perceived by a model that was trained for diagnosis prediction without any knowledge of respiration phases or sounds.
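The weighting of segment predictions by attention values can be sketched as a weighted sum. The text does not give the exact formula, so the block below assumes a common attention-pooling form in which non-negative attention scores are normalised to sum to one and the clip-level output is $p(x) = \sum_{i=1}^{T} g(x_{i})\, p(x_{i})$. The function name `attention_pool` and the example values are hypothetical.

```python
from typing import List, Sequence, Tuple


def attention_pool(
    segment_preds: Sequence[float],
    raw_attention: Sequence[float],
) -> Tuple[List[float], float]:
    """Return normalised attention values and the clip-level prediction.

    Assumption: raw attention scores are non-negative and are normalised
    by their sum, so the resulting g(x_i) add up to 1.
    """
    if len(segment_preds) != len(raw_attention):
        raise ValueError("inputs must have the same length")
    if any(a < 0 for a in raw_attention):
        raise ValueError("raw attention scores must be non-negative")
    total = sum(raw_attention)
    if total <= 0:
        raise ValueError("raw attention scores must not all be zero")

    # Normalised attention values g(x_i).
    g = [a / total for a in raw_attention]
    # Clip-level prediction p(x) as the attention-weighted sum of p(x_i).
    clip_pred = sum(gi * pi for gi, pi in zip(g, segment_preds))
    return g, clip_pred


# Made-up example: the middle segment receives most of the attention,
# so its prediction dominates the clip-level output.
g, p_clip = attention_pool([0.2, 0.9, 0.4], [1.0, 3.0, 1.0])
print(g, p_clip)
```

Under this assumption, a segment with a large $g(x_{i})$ pulls $p(x)$ toward its own prediction $p(x_{i})$, which is why plotting the attention values over time highlights the influential parts of the recording.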

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
