What is it about?

Scene-graph-based Visual Question Answering (VQA) is a deep learning approach to answering questions about images. A user asks a question related to an image, and a deep learning model attempts to provide an answer. Instead of working directly on the image, however, this approach uses a scene graph: a graph whose nodes represent the objects in the image, together with their attributes and their relations to other objects. In this work, we propose a method to enhance the interpretability of this approach by making the decision-making process of the underlying deep learning model more understandable and transparent. By visualizing the model's internal information, users can gain insight into how it arrives at its predictions, identify issues that may lead to incorrect predictions, and make the necessary corrections.
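To make the scene-graph idea concrete, here is a minimal sketch of such a structure in Python. The dict-based encoding, the object IDs, and the toy lookup function are illustrative assumptions, not the representation used by the actual model:

```python
# Hypothetical scene graph for an image of a dog sitting on a sofa.
# Nodes (objects) carry attributes; relations link object IDs.
scene_graph = {
    "objects": {
        "o1": {"name": "dog", "attributes": ["brown", "small"]},
        "o2": {"name": "sofa", "attributes": ["red"]},
    },
    "relations": [
        {"subject": "o1", "predicate": "sitting on", "object": "o2"},
    ],
}

def answer_attribute_question(graph, object_name, attribute):
    """Toy lookup: does any object with this name carry this attribute?"""
    return any(
        obj["name"] == object_name and attribute in obj["attributes"]
        for obj in graph["objects"].values()
    )

print(answer_attribute_question(scene_graph, "dog", "brown"))  # True
```

A real scene-graph VQA model reasons over such a graph with learned representations rather than exact lookups, but the underlying data structure is of this shape.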


Why is it important?

We developed a visual analysis approach for graph-based Visual Question Answering (VQA) models and implemented it in an interactive tool. Users can browse a collection of scenes and apply filters to select a preferred scene. Once a scene is selected, the image and a visual representation of the scene graph are displayed. Users can add, remove, and edit nodes, edges, and attributes. We integrated the GraphVQA-GAT model (as an example of a model for scene-graph-based VQA) into the tool to perform VQA tasks. Our tool outputs the model predictions with a confidence score, and graph gate weights and edge scores are visualized for each node and edge of the scene graph, respectively. The results of our evaluation show that the graph gate weights (internal information of the deep learning model) are important intrinsic values of GraphVQA-GAT. Visualizing the graph gate weight per node lets users see which nodes GraphVQA-GAT focuses on. Using our tool, users can investigate and engineer node tokens and relations to understand and steer model predictions.
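As a rough illustration of the visualization idea, per-node gate weights can be normalized into relative importance scores for color-coding nodes. This is a hedged sketch under the assumption that the gate weights are non-negative scalars extracted from the model; it is not the actual GraphVQA-GAT implementation:

```python
# Assumed input: one non-negative gate weight per scene-graph node,
# extracted from the model's internals (hypothetical values below).
def node_importance(gate_weights):
    """Normalize gate weights into relative importance scores in [0, 1]."""
    total = sum(gate_weights.values())
    if total == 0:
        return {node: 0.0 for node in gate_weights}
    return {node: w / total for node, w in gate_weights.items()}

gates = {"dog": 2.0, "sofa": 1.0, "floor": 1.0}
print(node_importance(gates))  # {'dog': 0.5, 'sofa': 0.25, 'floor': 0.25}
```

Mapping these scores to a color scale on the graph view lets users see at a glance which nodes the model attends to most.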


This article contributes to the research field of explainable AI. While there are numerous models and architectures in machine learning, this approach focuses specifically on scene-graph-based Visual Question Answering. The research began as a Master's thesis conducted by one of my students, supervised by myself and the other authors, and resulted in this great piece of work.

Tanja Munz-Körner
Universität Stuttgart

Read the Original

This page is a summary of: Visual Analysis of Scene-Graph-Based Visual Question Answering, September 2023, ACM (Association for Computing Machinery),
DOI: 10.1145/3615522.3615547.