Google Research Blog
The latest news from Research at Google
TensorFlow - Google’s latest machine learning system, open sourced for everyone
Monday, November 09, 2015
Posted by Jeff Dean, Senior Google Fellow, and Rajat Monga, Technical Lead
Deep Learning has had a huge impact on computer science, making it possible to explore new frontiers of research and to develop amazingly useful products that millions of people use every day. Our internal deep learning infrastructure DistBelief, developed in 2011, has allowed Googlers to build ever larger neural networks and scale training to thousands of cores in our datacenters. We’ve used it to demonstrate that concepts like “cat” can be learned from unlabeled YouTube images, to improve speech recognition in the Google app by 25%, and to build image search in Google Photos. DistBelief also trained the Inception model that won ImageNet’s Large Scale Visual Recognition Challenge in 2014, and drove our experiments in automated image captioning
as well as
While DistBelief was very successful, it had some limitations. It was narrowly targeted to neural networks, it was difficult to configure, and it was tightly coupled to Google’s internal infrastructure -- making it nearly impossible to share research code externally.
Today we’re proud to announce the open source release of TensorFlow -- our second-generation machine learning system, specifically designed to correct these shortcomings. TensorFlow is general, flexible, portable, easy-to-use, and completely open source. We added all this while improving upon DistBelief’s speed, scalability, and production readiness -- in fact, on some benchmarks, TensorFlow is twice as fast as DistBelief (see the
for details of TensorFlow’s programming model and implementation).
TensorFlow has extensive built-in support for deep learning, but is far more general than that -- any computation that you can express as a computational flow graph, you can compute with TensorFlow (see some
). Any gradient-based machine learning algorithm will benefit from TensorFlow’s
and suite of first-rate optimizers. And it’s easy to express your new ideas in TensorFlow via the flexible Python interface.
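To make the idea of a computational flow graph a little more concrete, here is a minimal sketch in plain Python rather than in TensorFlow itself. The `Node` class and the `backward` function are hypothetical names invented for this illustration and are not part of TensorFlow's API. The sketch builds a tiny graph for `y = w * x + b` and then walks it in reverse to collect gradients, which is roughly the information a gradient-based optimizer needs.

```python
class Node:
    """One value in a tiny dataflow graph, with links to the nodes it was built from."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent_node, local_gradient)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(output):
    """Propagate d(output)/d(node) to every node, in reverse topological order."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad


# y = w * x + b, with a gradient for every input
w, x, b = Node(3.0), Node(2.0), Node(1.0)
y = w * x + b
backward(y)
```

After `backward(y)`, each input's `grad` field holds the derivative of `y` with respect to that input, so an optimizer could nudge `w` and `b` in whichever direction lowers a loss.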
Inspecting a model with TensorBoard, the visualization tool
TensorFlow is great for research, but it’s ready for use in real products too. TensorFlow was built from the ground up to be fast, portable, and ready for production service. You can move your idea seamlessly from training on your desktop GPU to running on your mobile phone. And you can get started quickly with powerful machine learning tech by using our state-of-the-art example model architectures. For example, we plan to release our complete, top shelf ImageNet computer vision model on TensorFlow soon.
But the most important thing about TensorFlow is that it’s yours. We’ve open-sourced TensorFlow as a standalone library and associated tools, tutorials, and examples with the Apache 2.0 license so you’re free to use TensorFlow at your institution (no matter where you work).
Our deep learning researchers all use TensorFlow in their experiments. Our engineers use it to infuse Google Search with signals derived from deep neural networks, and to power the magic features of tomorrow. We’ll continue to use TensorFlow to serve machine learning in products, and our research team is committed to sharing TensorFlow implementations of our published ideas. We hope you’ll join us at
Computer, respond to this email.
Tuesday, November 03, 2015
Posted by Greg Corrado, Senior Research Scientist
Machine Intelligence for You
What I love about working at Google is the opportunity to harness cutting-edge machine intelligence for users’ benefit. Two recent Research Blog posts talked about how we’ve used machine learning in the form of deep neural networks. Today we can share something even wilder -- Smart Reply, a deep neural network that writes email.
I get a lot of email, and I often peek at it on the go with my phone. But replying to email on mobile is a real pain, even for short replies. What if there were a system that could automatically determine if an email was answerable with a short reply, and compose a few suitable responses that I could edit or send with just a tap?
Some months ago, Bálint Miklós from the Gmail team asked me if such a thing might be possible. I said it sounded too much like passing the Turing Test to get our hopes up... but having collaborated before on machine learning improvements to spam detection and email categorization, we thought we’d give it a try.
There’s a long history of research on both understanding and generating natural language for applications like machine translation. Last year, Google researchers Oriol Vinyals, Ilya Sutskever, and Quoc Le proposed fusing these two tasks in what they called sequence-to-sequence learning. This end-to-end approach has many possible applications, but one of the most unexpected that we’ve experimented with is conversational synthesis.
showed that we could use sequence-to-sequence learning to power a chatbot that was remarkably fun to play with, despite having included no explicit knowledge of language in the program.
Obviously, there’s a huge gap between a cute research chatbot and a system that I want helping me draft email. It was still an open question if we could build something that was actually useful to our users. But one engineer on our team, Anjuli Kannan, was willing to take on the challenge. Working closely with both Machine Intelligence researchers and Gmail engineers, she elaborated and experimented with the sequence-to-sequence research ideas. The result is the industrial strength neural network that runs at the core of the Smart Reply feature we’re launching this week.
How it works
A naive attempt to build a response generation system might depend on hand-crafted rules for common reply scenarios. But in practice, any engineer’s ability to invent “rules” would be quickly outstripped by the tremendous diversity with which real people communicate. A machine-learned system, by contrast, implicitly captures diverse situations, writing styles, and tones. These systems generalize better, and handle completely new inputs more gracefully than brittle, rule-based systems ever could.
Diagram by Chris Olah
Like other sequence-to-sequence models, the Smart Reply System is built on a pair of recurrent neural networks, one used to encode the incoming email and one to predict possible responses. The encoding network consumes the words of the incoming email one at a time, and produces a vector (a list of numbers). This vector, which Geoff Hinton calls a “thought vector,” captures the gist of what is being said without getting hung up on diction -- for example, the vector for "Are you free tomorrow?" should be similar to the vector for "Does tomorrow work for you?" The second network starts from this thought vector and synthesizes a grammatically correct reply one word at a time, like it’s typing it out. Amazingly, the detailed operation of each network is entirely learned, just by training the model to predict likely responses.
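The shape of that encode-then-decode loop can be sketched in a few lines of plain Python. This is only a structural illustration under loud assumptions: the vocabulary, the vector size, the simple `tanh` recurrence, and every weight are made up, and the weights are random rather than trained, so the "reply" it produces is meaningless. The real system uses learned LSTM networks; the function names below are hypothetical.

```python
import math
import random

random.seed(0)
VOCAB = ["<end>", "are", "you", "free", "tomorrow", "yes", "no", "sounds", "good"]
SIZE = 4  # length of the thought vector (arbitrary for this sketch)


def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Untrained, random parameters; a real system would learn these from data.
EMBED = {word: [random.uniform(-1, 1) for _ in range(SIZE)] for word in VOCAB}
W_ENC = random_matrix(SIZE, 2 * SIZE)
W_DEC = random_matrix(SIZE, 2 * SIZE)
W_OUT = random_matrix(len(VOCAB), SIZE)


def step(weights, state, inputs):
    """One recurrent update: new_state = tanh(W applied to [state; inputs])."""
    joined = state + inputs
    return [math.tanh(sum(w * v for w, v in zip(row, joined))) for row in weights]


def encode(words):
    """Consume the email one word at a time and return the final state."""
    state = [0.0] * SIZE
    for word in words:
        state = step(W_ENC, state, EMBED[word])
    return state


def decode(thought, max_words=5):
    """Emit a reply one word at a time, starting from the thought vector."""
    # "<end>" doubles as the start token to keep the sketch small.
    state, previous, reply = thought, "<end>", []
    for _ in range(max_words):
        state = step(W_DEC, state, EMBED[previous])
        scores = [sum(w * s for w, s in zip(row, state)) for row in W_OUT]
        previous = VOCAB[scores.index(max(scores))]  # greedy choice
        if previous == "<end>":
            break
        reply.append(previous)
    return reply


thought = encode(["are", "you", "free", "tomorrow"])
reply = decode(thought)
```

The point is the data flow: the whole email is squeezed into one fixed-length list of numbers, and the second loop sees nothing of the email except that list.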
One challenge of working with emails is that the inputs and outputs of the model can be hundreds of words long. This is where the particular choice of recurrent neural network type really matters. We used a variant of a "long short-term memory" network (or LSTM for short), which is particularly good at preserving long-term dependencies, and can home in on the part of the incoming email that is most useful in predicting a response, without being distracted by less relevant sentences before and after.
Of course, there's another very important factor in working with email, which is privacy. In developing Smart Reply we adhered to the same rigorous user privacy standards we’ve always held -- in other words, no humans reading your email. This means researchers have to get machine learning to work on a data set that they themselves cannot read, which is a little like trying to solve a puzzle while blindfolded -- but a challenge makes it more interesting!
Getting it right
Our first prototype of the system had a few unexpected quirks. We wanted to generate a few candidate replies, but when we asked our neural network for the three most likely responses, it’d cough up triplets like “How about tomorrow?” “Wanna get together tomorrow?” “I suggest we meet tomorrow.” That’s not really much of a choice for users. The solution was provided by Sujith Ravi, whose team developed a great machine learning system for mapping natural language responses to semantic intents. This was instrumental in several phases of the project, and was critical to solving the "response diversity problem": by knowing how semantically similar two responses are, we can suggest responses that are different not only in wording, but in their underlying meaning.
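One way to picture the fix for the response diversity problem is to keep only the best-scoring reply for each semantic intent. The sketch below assumes the intent for each candidate is already known; the intent labels, the scores, and the last two candidate replies are invented for illustration, and `pick_diverse` is a hypothetical helper rather than the method the team actually used.

```python
def pick_diverse(candidates, limit=3):
    """Keep the highest-scoring reply for each intent, then return the top few.

    `candidates` is a list of (reply_text, score, intent) tuples.
    """
    best_per_intent = {}
    for text, score, intent in candidates:
        current = best_per_intent.get(intent)
        if current is None or score > current[1]:
            best_per_intent[intent] = (text, score)
    ranked = sorted(best_per_intent.values(), key=lambda item: item[1], reverse=True)
    return [text for text, _ in ranked[:limit]]


candidates = [
    ("How about tomorrow?", 0.30, "propose_tomorrow"),
    ("Wanna get together tomorrow?", 0.25, "propose_tomorrow"),
    ("I suggest we meet tomorrow.", 0.20, "propose_tomorrow"),
    ("Sorry, I can't make it.", 0.15, "decline"),
    ("What time works for you?", 0.10, "ask_time"),
]
suggestions = pick_diverse(candidates)
```

Taking the three most likely replies directly would return three ways of saying "tomorrow"; grouping by intent first returns one proposal, one decline, and one question.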
Another bizarre feature of our early prototype was its propensity to respond with “I love you” to seemingly anything. As adorable as this sounds, it wasn’t really what we were hoping for. Some analysis revealed that the system was doing exactly what we’d trained it to do, generate likely responses -- and it turns out that responses like “Thanks”, “Sounds good”, and “I love you” are super common -- so the system would lean on them as a safe bet if it was unsure. Normalizing the likelihood of a candidate reply by some measure of that response's prior probability forced the model to predict responses that were not just highly likely, but also had high affinity to the original message. This made for a less lovey, but far more useful, email assistant.
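The post does not say which measure of prior probability was used, so the following is just one plausible reading: score each reply by its log-likelihood given the email minus a weighted log of how common that reply is overall. The probabilities, the `alpha` weight, and the `affinity_score` name are all assumptions made for this sketch.

```python
import math


def affinity_score(log_likelihood, log_prior, alpha=1.0):
    """Score a reply by log P(reply | email) - alpha * log P(reply)."""
    return log_likelihood - alpha * log_prior


# Hypothetical probabilities: (P(reply | email), P(reply) across all email).
candidates = {
    "I love you": (0.20, 0.20),
    "Thanks": (0.15, 0.15),
    "Tomorrow works for me": (0.10, 0.01),
}

by_likelihood = max(candidates, key=lambda r: candidates[r][0])
by_affinity = max(
    candidates,
    key=lambda r: affinity_score(math.log(candidates[r][0]), math.log(candidates[r][1])),
)
```

With these made-up numbers, ranking by raw likelihood picks “I love you”, while the normalized score picks the reply that is unusually likely for this particular email.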
Give it a try
We’re actually pretty amazed at how well this works. We’ll be rolling this feature out on Inbox for Android and iOS later this week, and we hope you’ll try it for yourself! Tap on a Smart Reply suggestion to start editing it. If it’s perfect as is, just tap send. Two-tap email on the go -- just like Bálint envisioned.
This blog post may or may not have actually been written by a neural network.
How to measure translation quality in your user interfaces
Friday, October 30, 2015
Posted by Javier Bargas-Avila, User Experience Research at Google
Worldwide, there are about 200 languages that are spoken by at least 3 million people. In this global context, software developers are required to translate their user interfaces into many languages. While graphical user interfaces have evolved substantially when compared to text-based user interfaces, they still rely heavily on textual information. The perceived language quality of translated user interfaces (UIs) can have a significant impact on the overall quality and usability of a product. But how can software developers and product managers learn more about the quality of a translation when they don’t speak the language themselves?
Key information in interaction elements and content is mostly conveyed through text. This aspect can be illustrated by removing text elements from a UI, as shown in the figure below.
Three versions of the YouTube UI: (a) the original, (b) YouTube without text elements, and (c) YouTube without graphic elements. It becomes apparent how the textless version is stripped of the most useful information: it is almost impossible to choose a video to watch, and navigating the site is impossible.
In “Measuring user rated language quality: Development and validation of the user interface Language Quality Survey (LQS)”, recently published in the International Journal of Human-Computer Studies, we describe the development and validation of a survey that enables users to provide feedback about the language quality of the user interface.
UIs are generally developed in one source language and translated afterwards string by string. The process of translation is prone to errors and might introduce problems that are not present in the source. These problems are most often due to difficulties in the translation process. For example, the word “auto” can be translated to French as
(car), which obviously has a different meaning. Translators might choose the wrong term if context is missing during the process. Another problem arises from words that behave as a verb when placed in a button or as a noun if part of a label. For example, “access” can stand for “you have access” (as a label) or “you can request access” (as a button).
Further pitfalls are gender, prepositions without context or other characteristics of the source text that might influence translation. These problems sometimes even get aggravated by the fact that translations are made by different linguists at different points in time. Such mistranslations might not only negatively affect trustworthiness and brand perception, but also the acceptance of the product and its perceived usefulness.
This work was motivated by the fact that in 2012, the YouTube internationalization team had anecdotal evidence which suggested that some language versions of YouTube might benefit from improvement efforts. While expert evaluations led to significant improvements of text quality, these evaluations were expensive and time-consuming. Therefore, it was decided to develop a survey that enables users to provide feedback about the language quality of the user interface to allow a scalable way of gathering quantitative data about language quality.
The Language Quality Survey (LQS) contains 10 questions about language quality. The first five questions form the factor “Readability”, which describes how natural and smooth to read the used text is. For instance, one question targets ease of understanding (“How easy or difficult to understand is the text used in the [product name] interface?”). Questions 6 to 9 summarize the frequency of (in)consistencies in the text, called “Linguistic Correctness”. The full survey can be found in the publication.
Case study: applying the LQS in the field
As the LQS was developed to discover problematic translations of the YouTube interface and allow focused quality improvement efforts, it was made available in over 60 languages and data were gathered for all these versions of the YouTube interface. To understand the quality of each UI version, we compared the results for the translated versions to the source language (here: US-English). We inspected first the global item, in combination with Linguistic Correctness and Readability. Second, we inspected each item separately, to understand which notion of Linguistic Correctness or Readability showed worse (or better) values. Here are some results:
The data revealed that about one third of the languages showed subpar language quality levels, when compared to the source language.
To understand the source of these problems and fix them, we analyzed the qualitative feedback users had provided (every time someone selected the lower two end scale points, pointing at a problem in the language, a text box was surfaced, asking them to provide examples or links to illustrate the issues).
The analysis of these comments provided linguists with valuable feedback of various kinds. For instance, users pointed to confusing terminology, untranslated words that were missed during translation, typographical or grammatical problems, words that were translated but are commonly used in English, or screenshots in help pages that were in English but needed to be localized. Some users also pointed to readability aspects such as sections with old fashioned or too formal tone as well as too informal translations, complex technical or legal wordings, unnatural translations or rather lengthy sections of text. In some languages users also pointed to text that was too small or criticized the readability of the font that was used.
In parallel, in-depth expert reviews (so-called “language find-its”) were organized. In these sessions, a group of experts for each language met and screened all of YouTube to discover aspects of the language that could be improved and decided on concrete actions to fix them. By using the LQS data to select target languages, it was possible to reduce the number of language find-its to about one third of the original estimation (if all languages had been screened).
LQS has since been successfully adapted and used for various Google products such as Docs, Analytics, or AdWords. We have found the LQS to be a reliable, valid and useful tool to approach language quality evaluation and improvement. The LQS can be regarded as a small piece in the puzzle of understanding and improving localization quality. Google is making this survey broadly available, so that everyone can start improving their products for everyone around the world.
Improving YouTube video thumbnails with deep neural nets
Thursday, October 08, 2015
Posted by Weilong Yang and Min-hsuan Tsai, Video Content Analysis team and the YouTube Creator team
Video thumbnails are often the first things viewers see when they look for something interesting to watch. A strong, vibrant, and relevant thumbnail draws attention, giving viewers a quick preview of the content of the video, and helps them to find content more easily. Better thumbnails lead to more clicks and views for video creators.
Inspired by the recent remarkable advances of deep neural networks (DNNs) in computer vision, such as
classification, our team has recently launched an improved automatic YouTube "thumbnailer" in order to help creators showcase their video content. Here is how it works.
The Thumbnailer Pipeline
While a video is being uploaded to YouTube, we first sample frames from the video at one frame per second. Each sampled frame is evaluated by a quality model and assigned a single quality score. The frames with the highest scores are selected, enhanced and rendered as thumbnails with different sizes and aspect ratios. Among all the components, the quality model is the most critical and turned out to be the most challenging to develop. In the latest version of the thumbnailer algorithm, we used a DNN for the quality model. So, what is the quality model measuring, and how is the score calculated?
The main processing pipeline of the thumbnailer.
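The sample, score, and select steps of the pipeline can be sketched as follows. The frame representation, the `toy_quality_model` stand-in, and the function names are hypothetical; the real quality model is a DNN, and the enhancement and rendering steps are left out here.

```python
def sample_frames(frames, fps):
    """Keep one frame per second from a list of decoded frames."""
    step = max(1, round(fps))
    return frames[::step]


def pick_thumbnails(frames, fps, quality_model, count=3):
    """Score each sampled frame and return the highest-scoring ones."""
    sampled = sample_frames(frames, fps)
    scored = sorted(sampled, key=quality_model, reverse=True)
    return scored[:count]


# Stand-in for the DNN: frames are dicts with a made-up "sharpness" field.
def toy_quality_model(frame):
    return frame["sharpness"]


# Four seconds of video at 30 frames per second, with invented sharpness values.
video = [{"index": i, "sharpness": (i * 7) % 11} for i in range(120)]
thumbnails = pick_thumbnails(video, fps=30, quality_model=toy_quality_model, count=2)
```

Because the quality model is passed in as a plain function, swapping a hand-tuned scorer for a learned one does not change the rest of the pipeline.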
(Training) The Quality Model
Unlike the task of identifying if a video contains your favorite animal, judging the visual quality of a video frame can be very subjective - people often have very different opinions and preferences when selecting frames as video thumbnails. One of the main challenges we faced was how to collect a large set of well-annotated training examples to feed into our neural network. Fortunately, on YouTube, in addition to having algorithmically generated thumbnails, many YouTube videos also come with carefully designed custom thumbnails uploaded by creators. Those thumbnails are typically well framed, in-focus, and center on a specific subject (e.g. the main character in the video). We consider these custom thumbnails from popular videos as positive (high-quality) examples, and randomly selected video frames as negative (low-quality) examples. Some examples of the training images are shown below.
Example training images.
The visual quality model essentially solves a problem we call "binary classification": given a frame, is it of high quality or not? We trained a DNN on this set using a similar architecture to the Inception network in
that achieved the top performance in the ImageNet 2014 competition.
Compared to the previous automatically generated thumbnails, the DNN-powered model is able to select frames with much better quality. In a human evaluation, the thumbnails produced by our new models are preferred to those from the previous thumbnailer in more than 65% of side-by-side ratings. Here are some examples of how the new quality model performs on YouTube videos:
Example frames with low and high quality score from the DNN quality model, from video “Grand Canyon Rock Squirrel”.
Thumbnails generated by old vs. new thumbnailer algorithm.
We recently launched this new thumbnailer across YouTube, which means creators can start to choose from higher quality thumbnails generated by our new thumbnailer. Next time you see an awesome YouTube thumbnail, don’t hesitate to give it a
Google voice search: faster and more accurate
Thursday, September 24, 2015
Posted by Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays and Johan Schalkwyk – Google Speech Team
Back in 2012, we announced that Google voice search had taken a new turn by adopting Deep Neural Networks (DNNs) as the core technology used to model the sounds of a language. These replaced the 30-year-old standard in the industry: the Gaussian Mixture Model (GMM). DNNs were better able to assess which sound a user is producing at every instant in time, and with this they delivered greatly increased speech recognition accuracy.
Today, we’re happy to announce we built even better neural network acoustic models using Connectionist Temporal Classification (CTC) and sequence discriminative training techniques. These models are a special extension of recurrent neural networks (RNNs) that are more accurate, especially in noisy environments, and they are blazingly fast!
In a traditional speech recognizer, the waveform spoken by a user is split into small consecutive slices or “frames” of 10 milliseconds of audio. Each frame is analyzed for its frequency content, and the resulting feature vector is passed through an acoustic model such as a DNN that outputs a probability distribution over all the phonemes (sounds) in the model. A Hidden Markov Model (HMM) helps to impose some temporal structure on this sequence of probability distributions. This is then combined with other knowledge sources such as a Pronunciation Model that links sequences of sounds to valid words in the target language and a Language Model that expresses how likely given word sequences are in that language. The recognizer then reconciles all this information to determine the sentence the user is speaking. If the user speaks the word “museum” for example - /m j u z i @ m/ in
- it may be hard to tell where the /j/ sound ends and where the /u/ starts, but in truth the recognizer doesn’t care where exactly that transition happens: All it cares about is that these sounds were spoken.
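The first step of that traditional pipeline, cutting the waveform into 10 millisecond frames, is easy to sketch. The 16 kHz sample rate and the `split_into_frames` name are assumptions for this example, and the frames here do not overlap, which is a simplification of what production recognizers typically do.

```python
def split_into_frames(samples, sample_rate, frame_ms=10):
    """Cut a waveform into consecutive, non-overlapping frames of `frame_ms` each."""
    frame_len = sample_rate * frame_ms // 1000
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]


# One second of (silent) audio at a hypothetical 16 kHz sample rate.
waveform = [0.0] * 16000
frames = split_into_frames(waveform, sample_rate=16000)
```

Each of these frames would then be turned into a feature vector and passed to the acoustic model.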
Our improved acoustic models rely on Recurrent Neural Networks (RNN). RNNs have feedback loops in their topology, allowing them to model temporal dependencies: when the user speaks /u/ in the previous example, their articulatory apparatus is coming from a /j/ sound and from an /m/ sound before. Try saying it out loud - “museum” - it flows very naturally in one breath, and RNNs can capture that. The type of RNN used here is a Long Short-Term Memory (LSTM) RNN which, through memory cells and a sophisticated gating mechanism, memorizes information better than other RNNs. Adopting such models already improved the quality of our recognizer significantly.
The next step was to train the models to recognize phonemes in an utterance without requiring them to make a prediction for each time instant. With Connectionist Temporal Classification, the models are trained to output a sequence of “spikes” that reveals the sequence of sounds in the waveform. They can do this in any way as long as the sequence is correct.
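A common way to read a phoneme sequence off CTC-style output is to merge repeated labels and then drop the "no output" symbol. The sketch below shows that collapsing rule on the “museum” example; the `_` blank symbol and the two hand-written frame sequences are assumptions for illustration, not output from the actual recognizer.

```python
BLANK = "_"  # hypothetical symbol for "no phoneme emitted at this frame"


def collapse(frame_labels):
    """Merge repeated labels, then drop blanks, to recover the phoneme sequence."""
    phonemes = []
    previous = None
    for label in frame_labels:
        if label != previous and label != BLANK:
            phonemes.append(label)
        previous = label
    return phonemes


# Two different frame-level outputs that spell the same sequence for "museum".
early_spikes = ["m", "_", "j", "u", "_", "_", "z", "i", "@", "m", "_", "_"]
late_spikes = ["_", "_", "m", "m", "j", "_", "u", "z", "_", "i", "@", "m"]
```

Both frame sequences collapse to /m j u z i @ m/, which is why the model is free to place its spikes wherever it likes as long as their order is right.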
The tricky part, though, was how to make this happen in real time. After many iterations, we managed to train streaming, unidirectional models that consume the incoming audio in larger chunks than conventional models, but do actual computations less often. With this, we drastically reduced computations and made the recognizer much faster. We also added artificial noise and reverberation to the training data, making the recognizer more robust to ambient noise. You can watch a model learning a sentence
We now had a faster and more accurate acoustic model and were excited to launch it on real voice traffic. However, we had to solve another problem - the model was delaying its phoneme predictions by about 300 milliseconds: it had just learned it could make better predictions by listening further ahead in the speech signal! This was smart, but it would mean extra latency for our users, which was not acceptable. We solved this problem by training the model to output phoneme predictions much closer to the ground-truth timing of the speech.
The CTC recognizer outputs spikes as it identifies various phonetic units (in various colors) in the input speech signal. The x-axis shows the acoustic input timing for phonemes and y-axis shows the posterior probabilities as predicted by the neural network. The dotted line shows where the model chooses not to output a phoneme.
We are happy to announce that our new acoustic models are now used for voice searches and commands in the Google app (on Android and iOS), and for dictation on Android devices. In addition to requiring much lower computational resources, the new models are more accurate, robust to noise, and faster to respond to voice search queries - so give it a try, and happy (voice) searching!
A Beginner’s Guide to Deep Neural Networks
Tuesday, September 22, 2015
Posted by Natalie Hammel and Lorraine Yurshansky, creators of Nat & Lo’s 20% Project
Last year, we (a couple of people who knew nothing about how voice search works) set out to make a video about the research that’s gone into teaching computers to recognize speech and understand language.
Making the video was eye-opening and brain-opening. It introduced us to concepts we’d never heard of – like machine learning and artificial neural networks – and ever since, we’ve been kind of fascinated by them. Machine learning, in particular, is a very active area of Computer Science research, with far-ranging applications beyond voice search – like image recognition and description and Google Voice transcription.
So... still curious to know more (and having just started
) we found Google researchers
and ambushed them with our machine learning questions.
This video is our attempt to distill what we learned from talking with them, but if anything in it piques your curiosity, or you have other questions, you’re in luck! On Friday, September 25, at 1 PM PDT / 4 PM EDT, Greg and Chris will be doing an Ask Me Anything on Reddit (see the calendar) to answer your deep learning questions.
Everyone who’s curious is welcome to join, ask questions, and hopefully gain a better understanding of the world of machine learning and deep neural networks. (And we’ll be hanging out with them, too...in case you have any questions about video making or dogs.) We hope to see you this Friday!
Information sharing for more efficient network utilization and management
Thursday, September 17, 2015
Posted by Andreas Terzis, Software Engineer
As Internet traffic has grown and changed, Google and other content and application providers have worked cooperatively with Internet service providers (ISPs) so that services can be delivered quickly, efficiently and cost-effectively. For example, rather than content having to traverse a long distance and many different networks to reach an Internet access provider’s network, a content provider might store (cache) the data close by and interconnect (‘peer’) directly with the access provider. Google has invested billions of dollars in the network and infrastructure necessary to bring our services as close to your Internet access provider’s front door as possible, for free – which both reduces ISPs’ costs and improves the user experience.
Content and application providers can also tune their services for congested and/or lower bandwidth environments. For instance, YouTube detects how smoothly a video is playing and adjusts the quality to account for temporary fluctuations in bandwidth or congestion. In the Google Video Quality Report, we transparently reveal the speeds YouTube is experiencing on different networks.
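As a rough illustration of what adjusting quality to match bandwidth can look like, here is a small sketch that picks a rendition from a measured throughput. It is not YouTube's algorithm: the quality ladder, the bitrates, the safety margin, and the `choose_quality` name are all invented for this example.

```python
# Hypothetical quality ladder: (label, required kilobits per second).
LADDER = [("1080p", 5000), ("720p", 2500), ("480p", 1000), ("240p", 400)]


def choose_quality(measured_kbps, safety=0.8):
    """Pick the best rendition whose bitrate fits within a safety margin."""
    budget = measured_kbps * safety
    for label, required in LADDER:
        if required <= budget:
            return label
    return LADDER[-1][0]  # fall back to the lowest rendition
```

For example, `choose_quality(3000)` leaves a budget of 2400 kbps, which is just short of the 720p requirement, so the sketch drops to 480p rather than risk rebuffering.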
As more of Internet traffic becomes encrypted, some network operators have expressed concern about the effect encryption might have on their ability to manage their networks. We don’t think there has to be a trade-off here – there are ways to do effective network management of encrypted traffic today, and, through further cooperation between content and application providers and ISPs, we believe this could be made easier while still respecting encryption.
To spur discussion and collaboration on this front, we recently submitted a short paper to a workshop organized by the Internet Architecture Board outlining some ideas. We advocate for a model where ISPs selectively share network state with content and application providers, enabling them to adapt to available network resources.
For example, we recently proposed to the Internet Engineering Task Force the concept of Throughput Guidance (TG), whereby mobile network operators could share information about the throughput of a radio downlink. Preliminary field tests in a production LTE network showed that TG reduces YouTube join latency, defined as the amount of time until the video starts playing, by 8% on average, rebuffering time by 20% on average, and rebuffer count by 2% on average. In addition to improving quality of experience for users, this mechanism improves the utilization of providers’ networks. Encryption of traffic would have no impact on the efficacy of this approach; it works equally well with encrypted and unencrypted traffic.
Throughput Guidance is one possible solution and many questions remain unanswered. It’s still relatively early days in our exploration of this and the other measures in our short paper, and we’re looking forward to getting feedback and collaborating with network operators and others.