Like many breakthroughs in scientific discovery, the one that spurred an artificial intelligence revolution came from a moment of serendipity.
In early 2017, two Google research scientists, Ashish Vaswani and Jakob Uszkoreit, were in a hallway of the search giant’s Mountain View campus, discussing a new idea for how to improve machine translation, the AI technology behind Google Translate.
The AI researchers had been working with another colleague, Illia Polosukhin, on a concept they called “self-attention” that could radically speed up and augment how computers understand language.
Polosukhin, a science fiction fan from Kharkiv in Ukraine, believed self-attention was a bit like the alien language in the film Arrival, which had recently been released. The extraterrestrials’ fictional language did not contain linear sequences of words. Instead, they generated entire sentences using a single symbol that represented an idea or a concept, which human linguists had to decode as a whole.
The cutting-edge AI translation methods at the time involved scanning each word in a sentence and translating it in turn, in a sequential process. The idea of self-attention was to read an entire sentence at once, analysing all its parts and not just individual words. You could then garner better context, and generate a translation in parallel.
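For readers who want to see the idea in miniature, here is a rough sketch of self-attention in plain Python. It is an illustration under simplifying assumptions, not the code the Google team wrote: the word vectors are made up, and each vector acts as its own query, key and value, whereas real transformers learn separate projections for each role.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    Assumption for illustration: every input vector serves as its own
    query, key and value.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Compare this word with every word in the sentence, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Blend all word vectors according to those weights.
        mixed = [sum(w * vec[i] for w, vec in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
    return outputs

# Hypothetical 2-dimensional vectors standing in for a three-word sentence.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(sentence)
```

Each output vector is computed from the whole sentence and does not depend on the other outputs, which is why the work can in principle be done for all words in parallel rather than one after another.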
The three Google scientists surmised this would be much faster and more accurate than existing methods. They started playing around with some early prototypes on English-German translations, and found it worked.
During their chat in the hallway, Uszkoreit and Vaswani were overheard by Noam Shazeer, a Google veteran who had joined the company back in 2000 when Google had roughly 200 employees.
Shazeer, who had helped build the “Did You Mean?” spellcheck function for Google Search, among several other AI innovations, was frustrated by existing language-generating methods, and looking for fresh ideas.
So when he heard his colleagues talking about this idea of “self-attention”, he decided to jump in and help. “I said, I’m with you . . . let’s do it, this is going to make life much, much better for all AI researchers,” Shazeer says.
The chance conversation formalised a months-long collaboration in 2017 that eventually produced an architecture for processing language, known simply as the “transformer”. The eight research scientists who eventually played a part in its creation described it in a short paper with a snappy title: “Attention Is All You Need.”
One of the authors, Llion Jones, who grew up in a tiny Welsh village, says the title was a nod to the Beatles song “All You Need Is Love”. The paper was first published in June 2017, and it kick-started an entirely new era of artificial intelligence: the rise of generative AI.
Today, the transformer underpins most cutting-edge applications of AI in development. Not only is it embedded in Google Search and Translate, for which it was originally invented, but it also powers all large language models, including those behind ChatGPT and Bard. It drives autocomplete on our mobile keyboards, and speech recognition by smart speakers.
Its real power, however, comes from the fact that it works in areas far beyond language. It can generate anything with repeating motifs or patterns, from images with tools such as Dall-E, Midjourney and Stable Diffusion, to computer code with generators like GitHub Copilot, or even DNA.
Vaswani, who grew up in Oman in an Indian family, has a particular interest in music and wondered if the transformer could be used to generate it. He was amazed to discover it could generate classical piano music as well as the state-of-the-art AI models of the time.
“The transformer is a way to capture interaction very quickly all at once between different parts of any input, and once it does that, it can . . . learn features from it,” he says. “It’s a general method that captures interactions between pieces in a sentence, or the notes in music, or pixels in an image, or parts of a protein. It can be purposed for any task.”
The genesis of the transformer and the story of its creators help to account for how we got to this moment in artificial intelligence: an inflection point, comparable to our transition on to the web or to smartphones, that has seeded a new generation of entrepreneurs building AI-powered consumer products for the masses.
But it also highlights how Google’s evolution into a large bureaucratic incumbent has stifled its ability to let entrepreneurialism flourish, and to launch new consumer products quickly. All eight authors, seven of whom spoke to the Financial Times, have now left the company.
It is a stark illustration of the “innovator’s dilemma”, a term coined by Harvard Business School professor Clayton Christensen that explores why industry leaders get overtaken by small, emerging players. Despite gathering the world’s leading talent in deep learning and AI and creating a fertile research environment for them, Google was unable to retain the scientists it helped to train.
In a statement, Google said it was “proud of our industry-defining, breakthrough work on transformers and [was] energised by the AI ecosystem it’s created.” It acknowledged the “bittersweet” reality that, in such a dynamic environment, talented staff might choose to move on.
The intellectual capital created has resulted in an explosion of innovation, experts say. “What came out of ‘Attention is All You Need’ is the basis for effectively every generative AI company using a large language model. I mean it’s in everything. That’s the most insane thing about it,” says Jill Chase, a partner at CapitalG, Alphabet’s growth fund, where she focuses on AI investments. “All these products exist because of the transformer.”
Birth of an innovation
Like all scientific advances, the transformer was built on decades of work that came before it, from the labs of Google itself, as well as its subsidiary DeepMind, the Facebook owner Meta and university researchers in Canada and the US, among others.
But over the course of 2017, the pieces clicked together through the serendipitous assembly of a group of scientists spread across Google’s research divisions.
The final team included Vaswani, Shazeer, Uszkoreit, Polosukhin and Jones, as well as Aidan Gomez, an intern then studying at the University of Toronto, and Niki Parmar, a recent master’s graduate on Uszkoreit’s team, from Pune in western India. The eighth author was Lukasz Kaiser, who was also a part-time academic at France’s National Centre for Scientific Research.
Each was drawn towards what was widely seen as an emerging field of AI research: natural language processing. The group’s educational, professional and geographic diversity — coming from backgrounds as varied as Ukraine, India, Germany, Poland, Britain, Canada and the US — made them unique. “To have that diverse set of people was absolutely essential for this work to happen,” says Uszkoreit, who grew up between the US and Germany.
Uszkoreit was initially adamant he would never work in language understanding, because his father was a professor of computational linguistics. But when he came to Google as an intern, he found, much to his annoyance, that the most interesting problems in AI at the time were in language translation. Grudgingly, he followed in his father’s footsteps and ended up focusing on machine translation too.
As they all remember it, they were originally working as three separate groups on various aspects of self-attention, but then decided to combine forces. While some of the group worked on writing the initial code, cleaning data and testing it, others were responsible for creating an architecture around the models, integrating it into Google’s infrastructure to make it work efficiently, and ultimately make it easy to deploy.
“The idea for the transformer formed organically as we worked and collaborated in the office,” says Jones. Google’s colourful open-plan working environment, complete with campus bicycles, proved fruitful. “I recall Jakob [Uszkoreit] cycling up to my desk and scribbling a picture of a model on a whiteboard behind me and gathering the thoughts of whoever was in earshot.”
The binding forces between the group were their fascination with language, and their motivation for using AI to better understand it. As Shazeer, the veteran engineer, says: “Text is really our most concentrated form of abstract thought. I always felt that if you wanted to build something really intelligent, you should do it on text.”
The model published in the paper was a simple, pared-back version of the original idea of self-attention. Shazeer found it worked even better this way, when stripped of any bells and whistles they had tried to add on. The model code provided the starting point, but extensive fine-tuning was required to make it run on graphics processing units, the hardware best suited to deep learning technology like the transformer.
“In deep learning, nothing is ever just about the equations. It is how you . . . put them on the hardware, it’s a giant bag of black magic tricks that only very few people have truly mastered,” Uszkoreit says.
Once these were applied, primarily by Shazeer, whom one of his co-authors calls “a wizard”, the transformer began to improve every task it was thrown at, in leaps and bounds.
Its benefit was that it allowed computations to be made in parallel, and packed them into far fewer mathematical operations than other methods, making them faster and more efficient. “It is just very simple and overall, the model is very compact,” says Polosukhin.
A peer-reviewed version of the paper was published in December 2017, just in time for NeurIPS, one of the most prestigious machine learning conferences, held in southern California that year. Many of the transformer’s authors remember being mobbed by researchers at the event when displaying a poster of their work. Soon, scientists from organisations outside Google began to use transformers in applications from translation to AI-generated answers, image labelling and recognition. At present, the paper has been cited more than 82,000 times in research papers.
“There was a Cambrian explosion in both research and practical applications of the transformer,” Vaswani says, referring to the moment 530mn years ago when animal life rapidly flourished. “We saw it advancing neural machine translation, [language model] BERT appeared, which made its way to Search — that was a very important moment for practical AI, when the transformer entered Google Search.”
After the paper was published, Parmar found the transformer could generate long Wikipedia-like pages of text, which previous models had struggled with. “And we already knew [then] that you could never have done anything like that before,” she says.
Parmar also recognised one of the key properties of transformers: that when you scaled them up, by giving them more and more data, “they were able to learn much better”. They pointed the way to the advent of large models such as GPT-4, which have far better reasoning and language capabilities than their predecessors.
“The general theme was that transformers just seemed to work much better than [previous models] right out of the box on whatever people threw them at,” says Jones. “This is what I think caused the snowball effect.”
Life beyond Google
In the aftermath of the transformer model being published widely, the researchers began to feel impatient about pushing their ideas out into the market.
The pace of AI research was picking up, particularly in areas such as generating text and images using transformers, but many of the contributions were coming from outside Google, from start-ups like OpenAI.
Each of the co-authors who spoke to the FT said they wanted to discover what the toolbox they had created was capable of. “The years after the transformer were some of the most fertile years in research. It became apparent . . . the models would get smarter with more feedback,” Vaswani says. “It was too compelling not to pursue this.”
But they also found that Google was not structured in a way that allowed for risk-taking entrepreneurialism, or launching new products quickly. It would require building a “new kind of software . . . computers you can talk to,” Vaswani adds. “It seemed easier to bring that vision to light outside of Google.” He would eventually leave in 2021.
Polosukhin left early on, in 2017, to found a start-up called Near, whose original idea was to use AI to teach computers to code but which has since pivoted to blockchain payments.
Gomez, the youngest and most inexperienced, was the next to get restless. The Canadian undergraduate, who has a passion for fashion and design, interned for Kaiser (who has since left to join OpenAI), and found himself at the forefront of exciting new research on language understanding.
“The reason why I left Google was that I actually didn’t see enough adoption in the products that I was using. They weren’t changing. They weren’t modernising. They weren’t adopting this tech. I just wasn’t seeing this large language model tech actually reach the places that it needed to reach,” he says.
In 2019, he quit Google to found Cohere, a generative AI start-up that is valued at more than $2bn, with investment from Nvidia, Oracle and Salesforce, among others. Gomez is interested in applying large language models to business problems from banking and retail to customer service. “For us, it’s about lowering the barrier to access,” he says. “Every developer should be able to build with this stuff.”
Uszkoreit, meanwhile, decided to use the transformer in an entirely different field. His start-up Inceptive is a biotech company that is designing “biological software” using deep learning techniques. “If you think of computer software, it’s programming something executable . . . there’s a program that’s then converted into software that runs on your computer,” he says. “We want to do that but with cells in your body.”
The company has already delivered AI-designed molecules for infectious disease vaccines to a large pharmaceutical company. “I am convinced it is by far the best way to build on what I had been working on over the last decade to improve and maybe even save people’s lives,” Uszkoreit says.
Shazeer moved on from Google in 2021 after two decades to co-found Character.ai, a company that allows users to build chatbots of their own characters, from the Buddha to Julius Caesar or Japanese anime. “It seems that it’s kind of difficult to launch products at a large company . . . start-ups can move faster,” he says. The company, where he is chief executive, was recently valued at $1bn.
Vaswani and Parmar left at the same time in 2021, and have since partnered to found a new company called Essential.ai, which works on AI applications in business. The start-up is still in stealth, although it has raised $8mn from Thrive Capital, an early investor in Instagram, Slack and Stripe.
“Google was an amazing place, but they wanted to optimise for the existing products . . . so things were moving very slowly,” says Parmar. “I wanted to take this very capable technology, and build new novel products out of it. And that was a big motivation to leave.”
Many of the co-authors still communicate frequently, celebrating each other’s successes and supporting one another through the unique challenges of being start-up entrepreneurs.
If the transformer was a big bang moment, now a universe is expanding around it, from DeepMind’s AlphaFold, which predicted the structure of almost every known protein, to ChatGPT, which Vaswani calls a “black swan event”.
This has led to a period that Silicon Valley insiders call a technology overhang — time that industries will spend integrating the latest AI developments into products, even if research does not progress at all.
“You are seeing the aftermath — AI is drawing in researchers, technologists, builders and product folks. Now we believe there is a tech overhang . . . and there is a lot of value to be realised in various products,” Vaswani says. “In some sense that is why we all dispersed and tried to put this technology directly in people’s hands.”