CHI NL Read: Handling Failures in Deep Learning Computer Vision Models

Hello everyone! We are very pleased to have Agathe Balayn showcase her upcoming CHI 2023 paper with us, titled “Faulty or Ready? Handling Failures in Deep-Learning Computer Vision Models until Deployment: A Study of Practices, Challenges, and Needs”.

[Reading time: 5 min.]

This image is composed of two parts. On the left, it shows research question 1 and summaries of research insights from the literature (that is, insights about failures, bugs, objectives, and satisfaction points). On the right, it shows research question 2 and related research insights from the literature (around approaches, artifacts, workflow, and explainability types).
Summary of the research questions and of the related insights from the literature, used as initial guides for exploring the research questions and as working assumptions to assess. Each working assumption (bold text in the light blue boxes) involves one major concept from the debugging literature (in italic) and its different instances (plain text in the white boxes), and is formulated solely based on the assumptions the literature seems to implicitly make about practices.

What’s your name?

Hi, my name is Agathe Balayn.


Agathe Balayn, Natasa Rikalo, Jie Yang, and Alessandro Bozzon. 2023. Faulty or Ready? Handling Failures in Deep-Learning Computer Vision Models until Deployment: A Study of Practices, Challenges, and Needs. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23), April 23–28, 2023, Hamburg, Germany. ACM, New York, NY, USA, 20 pages.

TL;DR (if you had to describe this work in one sentence, what would you tell someone?):

Taking a step back from the technical development of algorithmic explainability methods intended for future debugging of machine learning models, we investigate current debugging practices: how do machine learning developers debug their models today, what challenges do they really face, and what are their true needs?

What problem is this research addressing, and why does it matter?

While much research effort goes into developing explainability methods for machine learning models, practitioners still find it challenging to develop trustworthy models (as evidenced by the accidents and social harms such models have caused). We investigate the practices and needs of model developers, and the misalignments between research and practice, to propose future research directions.

How did you approach this problem?

After investigating what the current research literature tells us about machine learning debugging practices (almost nothing, especially for computer vision) and about potential debugging solutions (among which are well-known explainability methods), we conducted a qualitative study. We performed semi-structured interviews with 18 developers of varied backgrounds and machine learning experience. We presented them with a task in which they had to investigate a model and either declare it ready for deployment or debug it. We prompted them to think aloud so we could understand how they would tackle the debugging task. We then analyzed their debugging stages, the methods they might use, the challenges they face, and the limitations they encounter.

This image is composed of two parts. At the top, it shows the design brief. The design brief says the following. “Context: A company wants to develop a system to support blind people in understanding the spaces in which they live. An intern has already developed a deep learning model for scene classification (bathroom, bedroom, dining room, kitchen, living room). For this, he created a dataset by scraping images from the Web using the Google search engine and applying some typical data augmentation methods (e.g., flipping and cropping images, brightness transformation). He then fine-tuned a ResNet model pre-trained on ImageNet on this data. Your task: Unfortunately, his internship has ended. The company asks you to take over his model and investigate whether the model can be deployed, or whether it needs improvement. In the latter case, what issues should be improved on, and how? To start your analysis, the company already provides you with the test accuracy, the confusion matrix of the model, and examples of test data (below).” At the bottom, it provides example images that were shown with the design brief. These images represent kitchens, living rooms, and bedrooms, and both a ground truth and a prediction are associated with each of them (not all of them received a correct prediction).
Top: our design brief, inspired by the multitude of computer vision works on scene recognition as support for visually-impaired individuals to create mental maps of their environment. Bottom: example images from four dataset classes shown to the participants, next to their ground truth and the class inferred by the model (prediction). These examples indicate feature errors in the model. For instance, among all the kitchen images, only the one that received an incorrect prediction contains stools. This hints that the model may weight this concept more heavily than more relevant kitchen features such as the oven.

What were your key findings?

While the practices broadly follow the traditional software debugging workflow, they differ in the ambiguous way model requirements are defined, in the types of hypothesis formulation and instrumentation activities performed, in the artifacts employed to facilitate the workflow, and in the fluidity of the relevant concepts.

Debugging workflows are typically carried out manually and collaboratively, without resorting to methods developed specifically for machine learning models. Moreover, practitioners tend to have a narrow understanding of the bugs that any machine learning model might suffer from, skewed by their prior experience.

These results bear implications for debugging tool design and machine learning education.

This figure presents the main stages of the software debugging workflow at the top, with the corresponding stakeholders involved in each stage. Then, below, it summarizes the corresponding insights from our research, in terms of failures, debugging goals, artifacts, bugs, hypothesis testing methods, and bug correction methods.
Summary of the debugging practices identified through the interviews. In orange, we show the stakeholders that can intervene in each step of the debugging process.

What is the main message you’d like people to take away?

The machine learning community primarily focuses on developing algorithmic tools that could potentially be used for model debugging. Yet it is also important to acknowledge the socio-technical nature of machine learning, introduce design methods into research, and study practices, practitioners’ challenges, and their use of proposed tools.

What led / inspired you to carry out this research?

The few earlier works that have investigated practices in machine learning triggered my interest in the topic. Phil Agre’s Critical Technical Practice and the trickling-down/bubbling-up ideas in HCI were also motivations for my work, as they made me reflect on the current research/practice gap in machine learning.

What kind of skills / background would you say one needs to perform this type of research?

I think one needs to be interested in interdisciplinary research. 

One needs a good understanding of the technical machine learning literature, especially of currently proposed solutions for debugging, in order to compare current research directions with the challenges faced by practitioners.

One also needs knowledge and practice of qualitative empirical research methodologies in order to set up the semi-structured interviews and analyze them.

(Oftentimes, in computer science, we only learn the former, but I believe that someone motivated by this research can also learn the latter.) We also need a lot of motivation to recruit the participants!

Any further reading you recommend?

  • Agathe Balayn, Natasa Rikalo, Christoph Lofi, Jie Yang, and Alessandro Bozzon. 2022. How can Explainability Methods be Used to Support Bug Identification in Computer Vision Models? In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 184, 1–16.
  • Q. Vera Liao, Daniel Gruen, and Sarah Miller. 2020. Questioning the AI: Informing Design Practices for Explainable AI User Experiences. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–15.

Your biography

I am a PhD candidate in Computer Science at Delft University of Technology. My research focuses on characterizing theories and practices for developing and evaluating machine learning models with regard to safety issues and societal harms, and on proposing supporting methods, user-interfaces, and workflows. 

Although I was primarily trained in computer science, throughout my PhD I have learned about and used qualitative methods to conduct my research, as I believe socio-technical work is sorely needed in machine learning nowadays.

I’m finishing my PhD next month and am currently open to new research opportunities, so I would be very happy to talk more with anyone interested in machine learning + HCI topics!

Follow me on Twitter 😊.

Agathe’s website:

CHI NL Read is a regular feature on the CHI NL blog, where board members and blog editors Lisa and Abdo invite a member of CHI NL to showcase a recent research paper they published to the wider SIGCHI community and world 🌍. One of the ideas behind CHI NL Read is to make research a bit more accessible to those outside of academic HCI.

Get updates about HCI activities in the Netherlands

CHI Nederland (CHI NL) is celebrating its 25th anniversary this year, and we have much in store to mark the occasion. Stay tuned!