by André Filgueiras de Araujo.
If you are of the curious type, like me, you wonder a lot. Being able to search a database of billions of text documents in milliseconds is a relief – thanks to Google, Bing and the like. Sometimes, however, your question is too vague – “what’s this plant in front of me?”, or easier to convey with images and not words – “I wonder what people think of this Pinot Noir… if only there was a way I could easily search to learn more about it…”
You’ll be happy to know that there exists a way to answer such questions. Just as Google can answer our written questions, we can download an appropriate smartphone application, take a picture and have a computer answer our visual questions. The process of finding information using images or videos is referred to as Visual Search. Yes, it really works, and for a large collection of objects and scenery too. A not-so-recent New York Times article identified applications that recognize landmarks, text, book covers, CD/DVD covers, artwork, logos, barcodes, wine labels and even plants, meals and skin freckles.
Under the hood. It is very interesting to understand how visual recognition systems work. In an image, we can usually find what are called keypoints (or interest points). Think of these as some special points in the image – usually, we use corners and blobs, since they are clearly distinguishable. Keypoints can be found in images of an object, even if these images are different due to some extreme modifications. These modifications come from taking a picture with the camera in the horizontal versus in the vertical position, or very close versus very far from the object, or with just ambient lighting versus having a lamp close to the object. All these conditions might make the resulting image look very different (as in the example from the figure below) – but, in most cases, the keypoints are found in the same places. Isn’t this remarkable?
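To make the idea of a corner keypoint a little more concrete, here is a minimal sketch in Python. The article does not say which detector is used, so this assumes a Harris-style corner measure, one common choice. The function names, the window radius, the constant k and the threshold are illustrative assumptions, not the settings of any particular application.

```python
def make_square_image(size=16, lo=5, hi=10):
    """Synthetic test image: a bright square (value 1.0) on a dark background."""
    return [[1.0 if lo <= x <= hi and lo <= y <= hi else 0.0
             for x in range(size)] for y in range(size)]


def gradients(img, x, y):
    """Horizontal and vertical central differences at pixel (x, y)."""
    gx = (img[y][x + 1] - img[y][x - 1]) / 2.0
    gy = (img[y + 1][x] - img[y - 1][x]) / 2.0
    return gx, gy


def corner_response(img, x, y, radius=1, k=0.05):
    """Harris-style score: large and positive where the image changes in two
    directions at once (a corner), negative along a straight edge, and zero
    in a flat region."""
    sxx = syy = sxy = 0.0
    for j in range(y - radius, y + radius + 1):
        for i in range(x - radius, x + radius + 1):
            gx, gy = gradients(img, i, j)
            sxx += gx * gx
            syy += gy * gy
            sxy += gx * gy
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace * trace


def find_corners(img, threshold=0.6, radius=1):
    """Return (x, y) positions whose corner score exceeds the threshold.
    A margin keeps the gradient window inside the image."""
    height, width = len(img), len(img[0])
    margin = radius + 1
    return [(x, y)
            for y in range(margin, height - margin)
            for x in range(margin, width - margin)
            if corner_response(img, x, y, radius) > threshold]


if __name__ == "__main__":
    print(find_corners(make_square_image()))
```

On the synthetic square, only the four corner pixels pass the threshold, while points along the sides and inside the square do not. Real detectors add further steps (smoothing, multiple scales, non-maximum suppression) that this sketch leaves out.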
The next step is to describe these keypoints. This is accomplished by extracting pixels around them. Those pixels are used to calculate gradients (which can be thought of as differences of pixel values) in the horizontal and vertical directions, which are used to describe that image portion. In setting up a visual recognition system, this process needs to be repeated for all images in your database. When the application is running, and the user takes a picture, the process of keypoint detection and description is repeated for that picture (called the “query”), and the resulting descriptions are compared against those in your database in order to find the most similar image.
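The describe-then-compare step can also be sketched in a few lines of Python. The article only says that horizontal and vertical gradients are used. Summarizing them as a normalized histogram of gradient directions (in the spirit of SIFT-like descriptors) and letting each query keypoint vote for the database image with the nearest descriptor are assumptions made here for illustration. All function names and parameters are hypothetical.

```python
import math


def describe(img, x, y, radius=4, bins=8):
    """Describe the patch around (x, y) as a histogram of gradient directions,
    weighted by gradient strength and scaled to unit length."""
    hist = [0.0] * bins
    for j in range(y - radius, y + radius + 1):
        for i in range(x - radius, x + radius + 1):
            gx = (img[j][i + 1] - img[j][i - 1]) / 2.0   # horizontal difference
            gy = (img[j + 1][i] - img[j - 1][i]) / 2.0   # vertical difference
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.atan2(gy, gx) % (2.0 * math.pi)
            index = int(angle / (2.0 * math.pi) * bins) % bins
            hist[index] += magnitude
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]


def euclidean(a, b):
    """Straight-line distance between two descriptors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def search(query_descriptors, database):
    """database maps an image name to its list of descriptors. Each query
    descriptor votes for the image holding its nearest descriptor; the image
    with the most votes is returned (None if there is nothing to compare)."""
    votes = {}
    for q in query_descriptors:
        best_name, best_dist = None, float("inf")
        for name, descriptors in database.items():
            for d in descriptors:
                dist = euclidean(q, d)
                if dist < best_dist:
                    best_name, best_dist = name, dist
        if best_name is not None:
            votes[best_name] = votes.get(best_name, 0) + 1
    return max(votes, key=votes.get) if votes else None


if __name__ == "__main__":
    size = 16
    # Two toy database images: one vertical edge, one horizontal edge.
    vertical = [[1.0 if x >= 8 else 0.0 for x in range(size)] for y in range(size)]
    horizontal = [[1.0 if y >= 8 else 0.0 for x in range(size)] for y in range(size)]
    database = {"vertical": [describe(vertical, 8, 8)],
                "horizontal": [describe(horizontal, 8, 8)]}
    # The query is the vertical edge under brighter lighting.
    query = [[2.0 * v + 3.0 for v in row] for row in vertical]
    print(search([describe(query, 8, 8)], database))
```

Because the histogram is built from differences of pixel values and then scaled to unit length, making the whole picture uniformly brighter leaves the descriptor unchanged, which is one reason gradient-based descriptions tolerate lighting changes. A production system would compare many keypoints per image and use an index rather than this brute-force loop.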
Challenges. It would be too good to be true if we could recognize all different objects and scenes using the same method. Although the method mentioned above works for many types of visual elements, it fails in many cases: for example, cars, furniture and apparel. The method is excellent for objects that do not change much due to viewpoint variation (Van Gogh’s The Starry Night, or the Eiffel Tower, for example, look very similar in most pictures one takes of them), and are “textured” (in other words, they contain dense variations in pixels). That leaves out many of the things that would be nice to automatically recognize in our day-to-day experience, and motivates research on many different kinds of image and video properties that would be able to overcome current recognition limitations.
Searching by category. While it is possible to recognize many types of specific images, a very difficult related problem is to find out which category they belong to. Are there common properties that we can use to find out if a given object is a chair, or a car, or a table? Rather than constructing a database with all existing chairs, cars and tables known to human existence, the objective in this case is to try to use properties of these objects to allow a system to know that a chair it has never encountered is… a chair. This is one of the most significant unsolved research problems in this area. Think of how many different types of chairs you see every day. They are all chairs, but they are so different! A wide diversity of chairs is illustrated in the figure below (and you can have fun navigating more than 21,000 image categories here): they can be of all colors, of many different materials, have legs that differ significantly…
In Computer Vision, techniques rely on common appearances to determine that two things are similar. However, the notion of a chair contains only a very relaxed definition of shape and appearance – not to mention the fact that, from different viewpoints, the same chair will look very different. Thus, current algorithms still cannot do a decent job of assigning a category to these types of objects.
20 years from now… While typing to access information is great, it is not the most natural way when you’re interested in finding out something about what you’re looking at. Using images to describe what’s currently in front of you is one step closer to a more natural recognition experience.
In 10-20 years’ time it is quite possible that we’ll all have access to devices that can be attached to our heads, and automatically understand and augment the physical world around us – truly linking the physical world to the information world. You’re right, I’m thinking of Google Glass – and if you have not yet watched Google’s promotional video, you only have to click here to (kind of) have an idea of what it feels like. In 20 years, Visual Search technology will be much more robust, and it will then be very easy to just say, “Okay, Glass, what’s this?” – and have it answer you.
We can only speculate what will come after that – and it can only be exciting.
André Filgueiras de Araujo is a 2010 fellow of the Fulbright International Science & Technology Award, from Brazil, and a PhD Candidate in Electrical Engineering at Stanford University.