Quick description of what is demonstrated in the video:
- The robot starts with no knowledge of the labels for the objects in its field of view (on the table). The human interactor demonstrates this by asking the robot to point to those objects before giving it any training example; the robot responds by saying "But I don't know what that is."
- Training examples consist of an utterance paired with a list of the objects present in the scene.
- The human interactor changes the configuration of the objects on the table to provide different visual contexts (each consisting of the target object and one or more distracting objects), in conjunction with a new utterance (consisting of the label for the target object plus other non-referential words such as "look", "at", and "the").
- The model adaptively and incrementally learns:
a) the referential intentions of the speaker, filtering out the distracting objects;
b) which words are referential, distinguishing them from non-referential words;
c) the correct mappings between referential words and their object referents in the scene.
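The learning scheme described above can be sketched as a simple cross-situational co-occurrence model. This is an illustrative toy, not the actual model in the video: the example data, the 0.95 threshold, and the `best_referent` helper are all assumptions made here for demonstration.

```python
from collections import defaultdict

# Hypothetical training data: each example pairs an utterance (word list)
# with the set of objects visible in the scene, mimicking the demo setup.
examples = [
    (["look", "at", "the", "ball"], {"ball", "cup"}),
    (["look", "at", "the", "cup"],  {"cup", "book"}),
    (["look", "at", "the", "book"], {"book", "ball"}),
    (["look", "at", "the", "ball"], {"ball", "book"}),
    (["look", "at", "the", "cup"],  {"cup", "ball"}),
    (["look", "at", "the", "book"], {"book", "cup"}),
]

# Incrementally accumulate word-object co-occurrence counts across situations.
cooc = defaultdict(lambda: defaultdict(int))
word_count = defaultdict(int)
for words, objects in examples:
    for w in words:
        word_count[w] += 1
        for o in objects:
            cooc[w][o] += 1

def best_referent(word, threshold=0.95):
    """Return the most likely referent of `word`, or None if the word
    looks non-referential (its co-occurrence mass is spread out)."""
    probs = {o: c / word_count[word] for o, c in cooc[word].items()}
    obj, p = max(probs.items(), key=lambda kv: kv[1])
    # A referential word co-occurs with its referent every time it is
    # uttered, so its best score stays at 1.0; non-referential words
    # ("look", "the") co-occur with whatever objects happen to be on the
    # table, so their best score drops as the visual context varies.
    return obj if p >= threshold else None

print(best_referent("ball"))  # a referential word maps to its object
print(best_referent("the"))   # a function word maps to no object
```

Varying the distractor objects across examples is what makes this work: it breaks the spurious co-occurrences between function words and any single object, which is exactly the role of the changing table configurations in the demo.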