Stephen Gould discusses image understanding and pixel labeling, the topic of "Scene Understanding by Labeling Pixels" from the November 2014 Communications of the ACM (cacm.acm.org/magazines/2014/11/179821).
00:00-00:35 What do you see? How about now? And now? Because that's what your computer sees. Can we teach it to understand that these arbitrary pixels are a tree, a car, a chair? Then can we teach it to put it all together and describe an entire scene?
00:35-00:46 Join us as we talk with Stephen Gould about how computer vision goes from individual pixels to full scenes, and then back again, in Scene Understanding by Labeling Pixels.
00:46-00:57 [Intro graphics/music]
00:57-01:12 We move through dozens of settings every day. From street... to building... to stairs... to hallway... to office.
01:12-01:25 In the process, we automatically recognize hundreds of objects; interpret their relationships to each other, and to us; and summarize the entire scene so that we can make intelligent decisions about it.
01:25-01:31 DR. GOULD: For a human, this is very, very easy to do. But for a computer, it's a daunting task.
01:31-01:43 The process starts with training: the computer vision algorithm learns from standard sets of images that have been laboriously labeled with the names of objects and other features.
01:43-01:53 DR. GOULD: So when I was a graduate student, I spent a lot of my time labeling images and those are now available to the computer vision community to use.
01:53-01:58 The scene as a whole provides clues to what objects are likely to be in it.
01:58-02:13 DR. GOULD: So if we see a road scene, then within that road scene we'd expect to find cars and pedestrians. Whereas if we look at a scene and it's a farm or a rural scene, then we'd expect to find animals, you know, cows and sheep and so on.
02:13-02:22 Features such as color help determine what's what. But which features are important... well, that varies quite a lot.
02:22-02:41 DR. GOULD: In computer vision, we tend to break the world up into two different types of things. There are what we call "things" and what we call "stuff". So shape information is very, very important for identifying things. Humans have a very distinctive shape, cars have a very distinctive shape. Whereas texture and color are more important for background regions.
02:41-02:57 Understanding the entire scene requires what Dr. Gould calls "top-down" processing: recognizing whole objects and regions in the scene. In his article, he also considers the importance of "bottom-up" processing, which starts with individual pixels.
02:57-03:04 DR. GOULD: When we model the problem, we take every pixel and we treat it as a random variable.
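[Editor's note] The modeling idea Dr. Gould describes, each pixel as a random variable over a label set, can be sketched in a few lines. This is a hedged toy illustration, not the article's actual model: the `unary_scores` appearance rule (bright means sky, dark means road) is invented for the example, and a real model would also include pairwise terms encouraging adjacent pixels to share a label.

```python
import numpy as np

# Each pixel is a random variable over a set of labels; a "unary"
# score says how well that pixel's own appearance fits each label.
LABELS = ["sky", "road"]

def unary_scores(intensity):
    # Hypothetical appearance model for this sketch: bright pixels
    # look like sky, dark pixels look like road.
    return np.array([intensity, 1.0 - intensity])

# A tiny 2x2 grayscale image: bright top row, dark bottom row.
img = np.array([[0.9, 0.85],
                [0.2, 0.10]])

# Independently assign each pixel its best-scoring label.
pred = [[LABELS[int(np.argmax(unary_scores(v)))] for v in row]
        for row in img]
print(pred)  # [['sky', 'sky'], ['road', 'road']]
```

Scoring each pixel independently like this ignores spatial context; the superpixel grouping discussed next is one way the full approach exploits the fact that neighboring pixels tend to share a label.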
03:04-03:15 To keep the variable set small and to improve performance, an oversegmentation step groups similar neighboring pixels into superpixels, effectively lowering the image's resolution.
03:15-03:38 DR. GOULD: You want your superpixel algorithm or your oversegmentation algorithm to group together pixels that look similar, but that also belong together semantically. It's highly likely that adjacent pixels in an image are labeled with the same category. So sky pixels tend to appear close to each other, grass pixels tend to appear close to each other, and so on.
03:38-03:43 Then it's a matter of bringing together the top-down and bottom-up information.
03:43-04:02 DR. GOULD: So then at the same time you have this bottom up process which is labeling pixels as, you know, gray, and so probably road pixels. And if you find some road pixels that are below a bounding box that you think contains a pedestrian, then that gives you much more evidence for saying that that bounding box truly does contain a pedestrian.
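[Editor's note] The evidence-combination step in that quote can be sketched as a simple re-scoring rule. This is an illustrative toy, not the article's actual inference procedure: the `rescore_detection` function, the box format, and the fixed `boost` amount are all assumptions made for the example.

```python
import numpy as np

def rescore_detection(box, box_score, pixel_labels, boost=0.2):
    """Toy fusion of top-down and bottom-up evidence: if the strip of
    pixels just below a candidate pedestrian box is mostly labeled
    'road', raise our confidence in the detection."""
    x0, y0, x1, y1 = box                   # (col0, row0, col1, row1)
    strip = pixel_labels[y1:y1 + 2, x0:x1]  # rows just below the box
    if strip.size and (strip == "road").mean() > 0.5:
        return min(1.0, box_score + boost)
    return box_score

# Bottom-up pixel labels: sky on top, road on the bottom.
pixel_labels = np.array([["sky"] * 4, ["sky"] * 4,
                         ["road"] * 4, ["road"] * 4])

# A candidate pedestrian box whose bottom edge sits on the road.
score = rescore_detection((1, 0, 3, 2), 0.6, pixel_labels)
print(score)  # boosted above the original 0.6
```

A person standing on road pixels is more plausible than one floating in the sky, so the box's score rises; boxes without supporting ground evidence keep their original score.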
04:02-04:33 So what good is all this? Dr. Gould points to the possibility of robots that could recognize, arrange, and handle household objects. Scene understanding could also let us truly search online images, rather than relying on the text that surrounds them. And with assisted and autonomous driving becoming a reality, scene understanding could augment other systems to improve safety.
04:33-04:42 Find out more in this month's Communications of the ACM, in the contributed article, "Scene Understanding by Labeling Pixels".
04:42-04:51 [Outro and credits]