Andrej Karpathy's "NeuralTalk" code (github.com/karpathy/neuraltalk2), slightly modified to run on a live webcam feed. I recorded this live while walking near the bridge at Damstraat and Oudezijds Voorburgwal in Amsterdam, while visiting for IDFA DocLab.
NeuralTalk is trained on the MS COCO dataset, which shapes the kind of captions that are generated (mscoco.org/dataset/#captions-challenge2015). MS COCO contains 100k image-caption pairs covering a wide variety of situations, but on a brief walk you really only run into a few of those situations.
All processing is done on my 2013 MacBook Pro with the NVIDIA 750M and only 2GB of GPU memory. I'm walking around with my laptop open, pointing it at things, hence the shaky footage and people staring at themselves. The openFrameworks code for streaming the webcam and reading from disk is available at gist.github.com/kylemcdonald/b02edbc33942a85856c8
Edit: I added a caption file that mirrors the burned-in captions. Although the system produces about four captions per second on my laptop, I generated the caption file with one caption per second to keep it more reasonable.
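The downsampling described above could be sketched roughly like this. This is a hypothetical illustration, not the actual code used for the video: the log format (one `seconds<TAB>caption` line per generated caption) and the function name `downsample_captions` are assumptions.

```python
# Hypothetical sketch: downsample a caption log from ~4 captions/second
# to 1 caption/second by keeping only the first caption that falls in
# each whole-second bucket. The input format is an assumption.

def downsample_captions(lines, interval=1.0):
    """lines: iterable of 'seconds<TAB>caption' strings, sorted by time."""
    kept = []
    next_time = 0.0
    for line in lines:
        t_str, caption = line.split("\t", 1)
        t = float(t_str)
        if t >= next_time:
            kept.append((t, caption))
            # skip ahead to the start of the next bucket
            next_time = (int(t / interval) + 1) * interval
    return kept

log = [
    "0.00\ta man riding a bike down a street",
    "0.25\ta man riding a bike down a street",
    "0.50\ta group of people walking down a street",
    "1.10\ta boat on a canal next to a building",
    "1.40\ta boat on a canal next to a building",
]
print(downsample_captions(log))
```

Keeping the first caption per second (rather than, say, the most frequent one) is the simplest choice and preserves the timing of when a caption first appeared.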