Cameras that understand: portrait mode and Google Lens

I've talked quite a lot about the impact of machine learning and computer vision on everything from e-commerce recommendations to social media to all kinds of cool industrial applications, but it's also interesting just to look at the effect that machine learning is having on actual cameras.

For both Apple and Google, most of the advances in smartphone cameras now happen in software. The marketing term for this is ‘computational photography’, which really just means that as well as trying to make a better lens and sensor, which are subject to the rules of physics and the size of the phone, we use software (now, mostly, machine learning or ‘AI’) to try to get a better picture out of the raw data coming from the hardware. Hence, Apple launched ‘portrait mode’ on a phone with a dual-lens camera, but uses software to assemble the data from those two lenses into a single refocused image, and it now offers a version of this on a single-lens phone (as did Google when it copied the feature). In the same way, Google’s new Pixel phone has a ‘night sight’ capability that is all about software, not radically different hardware. The technical quality of the picture you see gets better because of new software as much as because of new hardware.
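
A minimal sketch of the dual-lens idea, assuming OpenCV and two already-aligned frames - the file names and the depth threshold are placeholders, and this illustrates the principle rather than Apple’s or Google’s actual pipeline:

```python
# Sketch of dual-lens 'portrait mode': estimate rough depth from the two
# lenses, then blur everything that isn't near the subject.
# Illustrative only - file names and the 0.6 threshold are placeholders.
import cv2
import numpy as np

left = cv2.imread("left_lens.jpg")    # frame from one lens
right = cv2.imread("right_lens.jpg")  # frame from the second lens, pre-aligned

# Estimate a rough disparity (inverse depth) map from the stereo pair
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(
    cv2.cvtColor(left, cv2.COLOR_BGR2GRAY),
    cv2.cvtColor(right, cv2.COLOR_BGR2GRAY),
).astype(np.float32)
disparity = cv2.normalize(disparity, None, 0.0, 1.0, cv2.NORM_MINMAX)

# Treat the nearest pixels as the subject, everything else as background
subject_mask = (disparity > 0.6).astype(np.float32)[..., None]

# Composite a sharp subject over a heavily blurred background
blurred = cv2.GaussianBlur(left, (51, 51), 0)
portrait = (left * subject_mask + blurred * (1 - subject_mask)).astype(np.uint8)
cv2.imwrite("portrait.jpg", portrait)
```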

Most of how this is done will be invisible to the user. HDR went from a garish novelty, to a camera setting that sometimes worked, to something automatic that you never need to know about. I expect the separate ‘portrait mode’ and ‘night sight’ options will disappear in the same way, just as the ‘HDR’ button did.

This will probably also go several levels further in, as the camera gets better at working out what you’re actually taking a picture of. When you take a photo on a ski slope it will come out perfectly exposed and colour-balanced, because the camera knows this is snow and adjusts accordingly. Today, portrait mode does face detection as well as depth mapping to work out what to focus on; in the future, it will know which of the faces in the frame is your child and set the focus on them.
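
A sketch of what ‘knows this is snow and adjusts’ might look like, where `classify_scene` stands in for a real scene-classification model and the EV offsets are illustrative rather than tuned values:

```python
# Scene-aware exposure sketch: a meter that assumes 'average grey' will
# under-expose a white scene, so a known 'snow' label pushes exposure up.
# classify_scene is a stand-in for a real classifier; offsets are illustrative.
from typing import Callable
import numpy as np

EV_COMPENSATION = {
    "snow": +1.0,    # white scenes fool the meter into under-exposing
    "beach": +0.7,
    "night": -0.3,
    "default": 0.0,
}

def scene_aware_exposure(image: np.ndarray,
                         classify_scene: Callable[[np.ndarray], str]) -> np.ndarray:
    """Apply a simple EV shift based on what the scene appears to be."""
    label = classify_scene(image)                 # e.g. "snow"
    ev = EV_COMPENSATION.get(label, EV_COMPENSATION["default"])
    gain = 2.0 ** ev                              # +1 EV doubles brightness
    out = np.clip(image.astype(np.float32) * gain, 0, 255)
    return out.astype(np.uint8)
```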

So, we are clearly well on the way to the point at which any photograph a normal consumer takes will be technically perfect. However, there’s a second step here - not just “what is this picture and how should we focus it?” but “why did you take the picture?”

One of the desire paths of the smartphone camera is that since we have it with us all the time and we can take unlimited pictures for free, and have them instantly, we don’t just take more pictures of our children and dogs but also pictures of things that we’d never have taken pictures of before. We take pictures of posters and books and things we might want to buy - we take pictures of recipes, catalogues, conference schedules, train timetables (Americans, ask a foreigner) and fliers. The smartphone image sensor has become a notebook. (Something similar has happened with smartphone screenshots, another desire path that no-one thought would become a normal consumer behavior.) 

Machine learning means that the computer will be able to unlock a lot of this. If there's a date in this picture, what might that mean? Does this look like a recipe? Is there a book in this photo, and can we match it to an Amazon listing? Can we match the handbag to Net-a-Porter? And so you can imagine a suggestion from your phone: “do you want to add the date in this photo to your diary?”, in much the same way that email programs today extract flights or meetings or contact details from emails.
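
As a rough sketch of the ‘date in this photo’ idea - assuming the pytesseract and python-dateutil libraries, with the file name as a placeholder:

```python
# Sketch: OCR the photo, try to pull a date out of the recognised text, and
# turn it into a calendar suggestion. Assumes pytesseract (Tesseract OCR)
# and python-dateutil are installed; 'flyer.jpg' is a placeholder file name.
from PIL import Image
import pytesseract
from dateutil import parser

def suggest_calendar_entry(photo_path: str):
    text = pytesseract.image_to_string(Image.open(photo_path))
    try:
        # fuzzy=True lets the parser find a date inside surrounding text
        when = parser.parse(text, fuzzy=True)
    except (ValueError, OverflowError):
        return None  # no usable date found - stay quiet rather than guess
    return f"Add an event on {when:%A %d %B %Y} from this photo?"

print(suggest_calendar_entry("flyer.jpg"))
```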

This is an interesting product design challenge. Some of this can be passive, as with automatically detecting flights in email - you wait until you know you have something. Machine learning means we now have this with face recognition and object classification: every image on your phone is indexed by default, and you can ask for ‘all pictures of my son at the beach’ or ‘every picture of a dog’. But there are many more analyses you could run, we take a lot of photos, and there would be something worth analysing in almost all of them. You could perhaps index or translate all of the text in all the photos you take (presuming that isn’t resource-prohibitive), but should you do a product search on every object in every picture on the phone? At some point, you probably need some sort of ‘tell me about this’ mode, where you explicitly ask the computer to do ‘magic’.
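
As a toy illustration of that passive index - the labels and photo names here are invented stand-ins for what face-recognition and object-classification models would produce:

```python
# Toy photo index: every photo gets a set of labels, and a query like
# 'my son at the beach' becomes a set intersection over those labels.
from collections import defaultdict

index: dict[str, set[str]] = defaultdict(set)   # label -> photo ids

def add_photo(photo_id: str, labels: set[str]) -> None:
    for label in labels:
        index[label].add(photo_id)

def query(*labels: str) -> set[str]:
    """Return photos tagged with every requested label."""
    sets = [index.get(label, set()) for label in labels]
    return set.intersection(*sets) if sets else set()

add_photo("IMG_001", {"son", "beach", "dog"})
add_photo("IMG_002", {"dog", "park"})
add_photo("IMG_003", {"son", "birthday"})

print(query("son", "beach"))   # {'IMG_001'}
print(query("dog"))            # {'IMG_001', 'IMG_002'}
```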

Asking a computer to ‘tell me about this picture’ poses other problems, though. We do not have HAL 9000, nor any path to it, and we cannot recognise any arbitrary object, but we can make a guess, of varying quality, in quite a lot of categories. So how should the user know what would work, and how does the system know what kind of guess to make? Should this all happen in one app with a general promise, or in many apps with specific promises? Should you have a poster mode, a ‘solve this equation’ mode, a date mode, a books mode and a product search mode? Or should you just have a mode for ‘wave the phone’s camera at things and something good will probably happen’?

This last is the approach Google is taking with ‘Lens’, which is integrated into the Android camera app next to ‘Portrait’ - point it at things and magic happens. Mostly.

These three screenshots actually show quite a lot of moving parts:

  1. In the first, text is recognized (and I can copy it), and then the book itself is recognized (by the text or the image?) and Lens delivers a product match. Success.

  2. In the second, the app can’t recognise the object itself, so the photo is passed on to Google Image Search and a match is found on a bunch of web pages - but Google doesn’t know what this actually is. This works, from the consumer’s perspective, but there’s no knowledge graph behind it.

  3. In the third, what should be a highly recognizable product (an Alvar Aalto vase) is taken from an angle that probably doesn’t match any image on a website, and Google’s object detection thinks it’s a free-standing bath. If I manually give the image to Google Image Search, it suggests ‘club chair’. (Technically, the phone might be able to work out how big this object is and do something with that, but that’s probably for next year.)

These illustrate questions of both discoverability and expectation. What can it do, what should I not expect it to do, and how should I react when I don’t get a good result? This is in fact another manifestation of the challenge seen in voice assistants - they can do enough different things that you don’t want to give the user a list of all of them, but not nearly enough that you can expect them to handle anything you throw at them. So how do you communicate, and how do people discover, what your ‘AI’ system can actually do?

In the second example here we are falling back to Google Image Search, much as voice assistants sometimes fall back to reading out the top result of a Google web search - and here that tactic worked. In the third example Google is confident (going straight to product search rather than image search), but wrong - how do I react to that? Would no suggestion at all have been better? I would have lost respect for the product if it hadn’t found the book, but I understand that matching the vase was a lot harder, so I give it a pass - and I can see why the vase might look like a bath to a ‘dumb’ computer. Conversely, I suspect that one of the problems with Siri was that Apple’s marketing gave the impression that you really could ask this thing anything: consumer expectations did not match the product’s capabilities.

In a sense, these questions are also brand questions. We know that Shazam only does recorded music. Amazon’s app got a better match for the Khrushchev book, linking not to a modern reprint, as Google did, but to a second-hand copy of the exact edition with the same cover. But it failed totally on the lamp and the vase, even though they’re both for sale on Amazon. Do I have different expectations of Amazon? How intelligent do I expect the AI to be?

The alternative approach, as with Shazam, is to go vertical. Suppose there were an app that you could wave at a recipe in a book and that would generate a shopping list, or maybe give you nutrition information. You could make that really reliable and you would have no ‘AI discoverability’ problem at all, but the app itself would have a discovery problem (even if it came from Google) - how would people find out about it? Either way, this approach isn’t workable for Google (or Amazon): if they can recognise 50 categories now and 200 in two years, they can’t have 200 apps or 200 modes in the camera app any more than they can have 200 modes on the search page. You need either a general-purpose front end or to make the whole thing passive or invisible (face recognition, HDR, putting flight details into your calendar).

Language translation is another of these possible modes - and Google Translate does have its own app, for now. Google Translate’s camera mode is a visual Babelfish, and of course the Babelfish was a wearable. The long-term context for many of these questions is not a sensor in your pocket but a sensor that you wear. In, say, five years’ time, you might be able to buy, as a consumer product, a pair of ‘glasses’ that combine a transparent, colour 3D display with a cluster of image sensors. Those image sensors map the space around you, so you can make the wall a display or play Minecraft on the table, but they also recognise things around you. At that point we won’t be taking photos-as-notes at all. You won’t take a picture of the conference schedule - you’ll just look at it, and then later that day say ‘hey Google, what’s the next session?’ Or, ‘I met someone at an event last week and their name badge said they worked for a Hollywood studio - who were they?’ So what suggestions will you get, and what will be remembered for you? How do you know what the glasses can do (and what someone else’s glasses might be doing)? And how do the brands associated with all this map against intelligence and discovery on one hand and privacy and trust on the other?