What is it about?
Given an aerial image and a question about it, the model produces an answer grounded in the image contents. We show that the Transformer architecture, which achieves strong results on general-domain Visual Question Answering, also works well in the Remote Sensing domain. Image and question features are concatenated and processed by the Transformer attention layers.
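To make the fusion step concrete, here is a minimal sketch of concatenating image and question features into one sequence and passing it through a single self-attention step. It is an illustration under stated assumptions, not the implementation from the paper: the names `self_attention`, `random_features`, `image_features`, and `question_features` are hypothetical, the features are random stand-ins for real encoder outputs, and identity projections replace the learned query, key, and value matrices (no multiple heads, feed-forward layers, or positional information).

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity projections stand in for learned Q/K/V matrices.
    Returns the attended tokens and the attention weight matrix.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        # Similarity of this token to every token in the joint sequence.
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
            for key in tokens
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of all token vectors.
        outputs.append(
            [sum(w[j] * tokens[j][i] for j in range(len(tokens))) for i in range(d)]
        )
    return outputs, weights


def random_features(n, d, rng):
    """Hypothetical stand-in for the output of an image or text encoder."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
image_features = random_features(4, 8, rng)     # e.g. 4 image region vectors
question_features = random_features(3, 8, rng)  # e.g. 3 question token vectors

# Concatenate along the sequence axis so attention spans both modalities.
joint_sequence = image_features + question_features
fused, attn = self_attention(joint_sequence)
```

Because both modalities sit in the same sequence, every question token can attend to every image feature and vice versa, which is the property that a concatenation-based multi-modal encoder relies on.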
Why is it important?
We show that Transformer-based systems outperform current baselines, which combine Convolutional Neural Networks and Recurrent Neural Networks, on the task of Remote Sensing Visual Question Answering.
Perspectives
Remote Sensing Visual Question Answering is an interesting way for users to interact with Earth Observation data. Users can ask for specific information about an image and obtain it. I hope this article encourages other researchers to develop systems for this task.
João Daniel Silva
Instituto Superior Técnico
Read the Original
This page is a summary of: Remote sensing visual question answering with a self-attention multi-modal encoder, November 2022, ACM (Association for Computing Machinery),
DOI: 10.1145/3557918.3565874.