This is the result of running the Densecap captioning system implemented at Stanford Vision Lab on a Deepdream video.
Densecap captions salient regions of images using a fully convolutional localization network (FCLN) paired with a recurrent neural network. The FCLN processes an image, proposes regions of interest, and conditions a recurrent language model which generates a caption for each region. The whole system is trained end-to-end on the Visual Genome dataset (~4M captions over ~100k images). It was designed and implemented by Justin Johnson, Andrej Karpathy, and Li Fei-Fei at the Stanford Computer Vision Lab.
Deepdream is a technique which "hallucinates" objects inside an image by detecting and amplifying the activations of a convolutional neural network. It was introduced by Alexander Mordvintsev, Chris Olah, and Mike Tyka (exactly 1 year ago as of this video's upload). I used Mike Tyka's technique for generating "infinite zoom" videos: starting from an image of white noise, repeatedly Deepdream a frame, crop it, then feed it back into Deepdream. The resulting video contains bursts of hallucinated objects amplified by Google's Inceptionism network, including many of the famous "puppyslugs."
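The feedback loop can be sketched roughly as below. This is just an illustration, not the actual script: `deepdream` is a placeholder for the real Caffe-based gradient-ascent step on Inception activations, and the zoom factor and helper names are my own assumptions.

```python
import numpy as np

def deepdream(frame):
    # Placeholder for the real Deepdream step (gradient ascent on
    # Inception activations); here it just returns the frame unchanged.
    return frame

def center_crop_and_upscale(frame, zoom=1.05):
    """Crop the center of the frame and scale it back up to full size.
    Repeating this each iteration produces the 'infinite zoom' effect."""
    h, w = frame.shape[:2]
    ch, cw = int(h / zoom), int(w / zoom)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    crop = frame[y0:y0 + ch, x0:x0 + cw]
    # Nearest-neighbor upscale back to (h, w) via index mapping.
    ys = np.arange(h) * ch // h
    xs = np.arange(w) * cw // w
    return crop[np.ix_(ys, xs)]

def infinite_zoom(n_frames, size=64, seed=0):
    # Start from white noise, as in the video.
    rng = np.random.default_rng(seed)
    frame = rng.random((size, size, 3))
    frames = []
    for _ in range(n_frames):
        frame = deepdream(frame)
        frames.append(frame)
        frame = center_crop_and_upscale(frame)
    return frames
```

Because each frame is a dreamed-then-cropped version of the previous one, the hallucinated objects appear to rush toward the viewer as the video plays.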
This video hints at a future in which multiple neural networks compete with or attempt to fool each other (not to be confused with generative adversarial networks, which are a bit different). The artifacts of such systems may be humorous, as they are here, but could also lead to serious misunderstandings in more consequential applications. It's plausible that such systems will be deployed for tasks like describing CCTV footage or crime scene photos, and it's easy to imagine many more.
Captions are generated by Densecap on individual video frames. The video is generated by a python script which merges matching captions along sequences of consecutive frames using a set of (mostly greedy) heuristics. Presumably, it would be possible to caption sequences of regions directly rather than relying on a naive merging algorithm, but I'm not sure how :)
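The actual heuristics live in the densecap-video repo linked below; as a rough illustration only, a minimal greedy merge of identical captions across runs of consecutive frames might look like this (the function name and the exact-match rule are my own simplification):

```python
def merge_captions(frame_captions):
    """Greedily merge identical captions appearing on consecutive frames
    into (caption, start_frame, end_frame) spans.

    frame_captions: list (one entry per frame) of lists of caption strings.
    """
    open_spans = {}  # caption -> start frame of its current run
    merged = []
    for i, captions in enumerate(frame_captions):
        present = set(captions)
        # Close spans whose caption disappeared on this frame.
        for cap in list(open_spans):
            if cap not in present:
                merged.append((cap, open_spans.pop(cap), i - 1))
        # Open spans for captions that just appeared.
        for cap in present:
            open_spans.setdefault(cap, i)
    # Close any spans still open at the end of the video.
    for cap, start in open_spans.items():
        merged.append((cap, start, len(frame_captions) - 1))
    return sorted(merged, key=lambda span: (span[1], span[0]))
```

For example, `merge_captions([["a dog", "a tree"], ["a dog"], ["a cat"]])` yields one span per caption run, so "a dog" covers frames 0 through 1. A real version would also need fuzzy matching and region overlap, since Densecap rarely emits the exact same string on adjacent frames.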
Original blog post by Google: research.googleblog.com/2015/07/deepdream-code-example-for-visualizing.html
Deepdream code: github.com/google/deepdream
Code for merging captions and generating videos: github.com/genekogan/densecap-video
Densecap applied on Boston Dynamics Atlas Robot: vimeo.com/173025372