This is the result of running the DenseCap captioning system from the Stanford Vision Lab on a video of the Atlas humanoid robot designed by Boston Dynamics. (all links at bottom)
DenseCap captions salient regions of images using a fully convolutional localization network (FCLN) and a recurrent neural network. The FCLN processes an image, proposes regions of interest, and conditions a recurrent neural network that generates a caption for each region. The whole system is trained end-to-end on the Visual Genome dataset (~4M captions on ~100k images). It was designed and implemented by Justin Johnson, Andrej Karpathy, and Li Fei-Fei at the Stanford Vision Lab.
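To make the data flow concrete, here is a hypothetical toy sketch in PyTorch of that inference pipeline: conv features -> scored boxes -> per-region greedy RNN decoding. It is not the real FCLN (which is built on VGG-16 with a differentiable box-sampling layer); every layer size and name below is made up for illustration.

import torch
import torch.nn as nn

class TinyDenseCap(nn.Module):
    """Toy illustration of dense captioning, NOT the actual DenseCap model."""
    def __init__(self, vocab_size=1000, feat_dim=64, hidden=128):
        super().__init__()
        self.hidden = hidden
        self.backbone = nn.Sequential(            # stand-in for a conv backbone
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU())
        self.box_head = nn.Conv2d(feat_dim, 4, 1)    # (x, y, w, h) per cell
        self.score_head = nn.Conv2d(feat_dim, 1, 1)  # region confidence per cell
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.LSTMCell(hidden + feat_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    @torch.no_grad()
    def forward(self, image, top_k=5, max_len=10):
        feats = self.backbone(image)                 # (1, C, H', W')
        scores = self.score_head(feats).flatten()    # one confidence per cell
        keep = scores.topk(top_k).indices            # indices of top-k regions
        boxes = self.box_head(feats).flatten(2)[0].t()[keep]  # (k, 4)
        region_feats = feats.flatten(2)[0].t()[keep]          # (k, C)
        captions = []
        for rf in region_feats:                      # greedy word-by-word decode
            h = torch.zeros(1, self.hidden)
            c = torch.zeros(1, self.hidden)
            tok = torch.zeros(1, dtype=torch.long)   # token 0 plays <start>
            words = []
            for _ in range(max_len):
                # condition the RNN on the region's feature vector at each step
                x = torch.cat([self.embed(tok), rf.unsqueeze(0)], dim=1)
                h, c = self.rnn(x, (h, c))
                tok = self.out(h).argmax(dim=1)      # pick the most likely word
                words.append(int(tok))
            captions.append(words)
        return boxes, captions

model = TinyDenseCap()
boxes, captions = model(torch.randn(1, 3, 64, 64))  # 5 boxes, 5 token sequences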
The source video was filmed by Boston Dynamics and shows the most recent generation of Atlas, a bipedal humanoid robot that can navigate rough outdoor terrain and grasp and manipulate objects in its environment.
Captions are generated by DenseCap on individual video frames. The video is then assembled by a Python script that merges matching captions along sequences of consecutive frames using a set of (mostly greedy) heuristics (a sketch of one such heuristic is below). Presumably it would be possible to caption sequences of regions directly rather than rely on a naive merging algorithm, but I'm not sure how :)
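For illustration, here is a minimal sketch of one plausible greedy merging heuristic; this is an assumption about how such a script might work, not the actual code linked below. A region in frame t is linked to a region in frame t+1 when their captions match and their boxes overlap enough, and only tracks that persist for a minimum number of frames are kept. All names and thresholds are made up.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union)

def merge_tracks(frames, iou_thresh=0.5, min_len=3):
    """frames: one list per video frame of (caption, box) detections."""
    finished, active = [], []
    for t, detections in enumerate(frames):
        still_active = []
        for caption, box in detections:
            # Greedily extend the first live track whose caption matches
            # and whose most recent box overlaps this one enough.
            match = next((tr for tr in active
                          if tr["caption"] == caption
                          and iou(tr["box"], box) >= iou_thresh), None)
            if match:
                active.remove(match)
                match["box"] = box
                match["frames"].append(t)
                still_active.append(match)
            else:  # no continuation found: start a new track
                still_active.append({"caption": caption, "box": box, "frames": [t]})
        finished.extend(active)  # tracks nothing extended end here
        active = still_active
    finished.extend(active)
    # Keep only captions that stayed stable across enough consecutive frames.
    return [tr for tr in finished if len(tr["frames"]) >= min_len]

frames = [[("a red fire hydrant", (10, 10, 50, 80))],
          [("a red fire hydrant", (12, 11, 52, 82))],
          [("a red fire hydrant", (14, 12, 54, 84))]]
print(merge_tracks(frames))  # one track spanning frames [0, 1, 2]

Exact string matching on captions is brittle; the real heuristics might instead compare captions by word overlap or edit distance, and smooth the boxes over time.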
This video demonstrates both the impressive capabilities of neural captioning systems and the humorous (and maybe unsettling) limitations of such systems when their training data lack the vocabulary to fully describe the scene. Notably, no "robots" or "machines" appear in this video according to DenseCap, and the robot is variously labeled as a person, a man, a motorcycle, and a fire hydrant.
Atlas, The Next Generation (original video): youtube.com/watch?v=rVlhMGQgDkY
Code for merging captions and generating videos: github.com/genekogan/densecap-video
DenseCap applied to DeepDream: vimeo.com/173062236