About The Workshop

The joint understanding of language and vision poses a fundamental challenge in artificial intelligence. This problem is particularly relevant because combining images and text is a natural way for humans to learn. Progress on tasks like visual question answering, image captioning, and object referral would therefore provide a stepping stone towards new products and services. For example, a natural language interface between factory operators and control systems could streamline production processes, resulting in safer and more efficient working environments. In a different vein, being able to give commands in natural language to an autonomous car could eliminate the unsettling feeling of giving up all control. The possible applications are countless. This calls for efficient computational models that can address these tasks in a realistic environment.

In this workshop, we aim to identify and address the challenges of deploying vision-language models in practical applications (see list of topics). Prospective authors are invited to submit a 10-14 page paper, which they can present at the poster session during the workshop (see call for papers).

Additionally, this workshop will host a challenge where participants need to solve a visual grounding task. So far, progress on the visual grounding task has mostly been measured on popular datasets such as COCO-Ref and ReferIt. However, these benchmarks are not ideal test beds for building models that need to operate in the wild. Therefore, in this workshop, we will focus on performing the visual grounding task in a more realistic setting. More specifically, we consider a setting where a passenger can pass free-form natural language commands to a self-driving car. This scenario is particularly challenging, as the language is much less constrained than in existing benchmarks, and object references are often implicit. The challenge is based on the recent Talk2Car dataset. If you wish to receive updates about this workshop or challenge, click here.

Call For Papers

Authors are invited to submit a 10-14 page paper to the workshop (ECCV format; the page limit excludes references). All submissions will be peer-reviewed (single-blind). Note that papers longer than 4 pages (including references) may be considered a double submission if they share content with a paper accepted at ECCV (or any other conference). Accepted work will be presented as a poster or contributed talk during the workshop, and published in the workshop proceedings after the main conference. Authors are encouraged, but not obligated, to participate in the challenge. For more information about the dates click here.

List of topics

  • Visual Dialog
  • Multi-modal feature learning
  • Object Referral/Visual Grounding
  • Visual Question Answering
  • Embodied Question Answering
  • Zero-shot/Few-shot in multi-modal learning
  • Applications in joint text/image understanding
  • ...

Workshop Challenge

The challenge focuses on tackling a visual grounding task in a self-driving car scenario. Given a natural language command, the goal is to predict the referred object in the scene. More information about this challenge can be found here.
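This page does not say how predictions are scored. As a rough illustration only, the sketch below assumes that a prediction is a single bounding box and that it counts as correct when its intersection-over-union (IoU) with the ground-truth box exceeds 0.5, a convention that is common in visual grounding benchmarks. The corner-based box format, the names Box, iou and is_correct, and the threshold are all assumptions made for this example, not the official challenge protocol.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates (corner format is an assumption)."""
    x1: float
    y1: float
    x2: float
    y2: float


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box.x2 - box.x1) * max(0.0, box.y2 - box.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes, in the range [0, 1]."""
    # The overlap region; it is inverted (zero area) when the boxes are disjoint.
    overlap = Box(max(a.x1, b.x1), max(a.y1, b.y1),
                  min(a.x2, b.x2), min(a.y2, b.y2))
    inter_area = area(overlap)
    union_area = area(a) + area(b) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0


def is_correct(predicted: Box, ground_truth: Box, threshold: float = 0.5) -> bool:
    """Hypothetical scoring rule: correct if IoU exceeds the assumed threshold."""
    return iou(predicted, ground_truth) > threshold


if __name__ == "__main__":
    referred_object = Box(100, 120, 220, 260)   # made-up ground-truth box
    prediction = Box(110, 130, 230, 250)        # made-up model output
    print(is_correct(prediction, referred_object))
```

Under these assumptions, a model's score on a set of commands would be the fraction of commands for which is_correct returns True.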



Workshop Schedule


Important Dates

Important Dates (UTC-12, midnight)    Event
March 20 2020                         Release of the challenge
March 27 2020                         Opening of leaderboard and submissions
May 29 2020                           Call for papers opened
July 10 2020                          Paper submission deadline
July 18 2020                          Freezing of challenge leaderboard
August 1 2020                         End of challenge
August 23 2020                        Workshop @ ECCV2020 in Glasgow
September 14 2020                     Camera-ready version

