Artificial intelligence companies are exploring a new avenue to obtain the massive amounts of data needed to develop powerful generative models: creating the information from scratch.
Microsoft, OpenAI and Cohere are among the groups testing the use of so-called “synthetic data” — computer-generated information to train their AI systems known as large language models (LLMs) — as they reach the limits of human-made data that can further improve the cutting-edge technology.
The launch of Microsoft-backed OpenAI’s ChatGPT last November has led to a flood of products, rolled out publicly this year by companies including Google and Anthropic, that can produce plausible text, images or code in response to simple prompts.
The technology, known as generative AI, has driven a surge of investor and consumer interest, with the world’s biggest technology companies including Google, Microsoft and Meta racing to dominate the space.
Currently, LLMs that power chatbots like OpenAI’s ChatGPT and Google’s Bard are trained primarily by scraping the internet. Data used to train these systems includes digitised books, news articles, blogs, search queries, Twitter and Reddit posts, YouTube videos and Flickr images, among other content.
Human annotators then provide feedback and fill gaps in the information in a process known as reinforcement learning from human feedback (RLHF).
But as generative AI software becomes more sophisticated, even deep-pocketed AI companies are running out of easily accessible and high-quality data to train on. Meanwhile, they are under fire from regulators, artists and media organisations around the world over the volume and provenance of personal data consumed by the technology.
At an event in London in May, OpenAI’s chief executive Sam Altman was asked whether he was worried about regulatory probes into ChatGPT’s potential privacy violations. Altman brushed it off, saying he was “pretty confident that soon all data will be synthetic data”.
Generic data from the web is no longer good enough to push the performance of AI models, according to developers.
“If you could get all the data that you needed off the web, that would be fantastic,” said Aidan Gomez, chief executive of $2bn LLM start-up Cohere. “In reality, the web is so noisy and messy that it’s not really representative of the data that you want. The web just doesn’t do everything we need.”
Currently, the most cutting-edge models, such as OpenAI’s GPT-4, are approaching human-level performance in areas like writing and coding, and can pass benchmarks including the US bar exam.
To dramatically improve their performance and be able to address challenges in science, medicine or business, AI models will require unique and sophisticated data sets. These will either have to be created by world experts such as scientists, doctors, authors, actors or engineers, or acquired as proprietary data from large corporations such as pharmaceutical companies, banks and retailers. However, “human-created data . . . is extremely expensive”, Gomez said.
The new trend of using synthetic data sidesteps this costly requirement. Instead, companies can use AI models to produce text, code or more complex information related to healthcare or financial fraud. This synthetic data is then used to train advanced LLMs to become ever more capable.
According to Gomez, Cohere as well as several of its competitors already use synthetic data which is then fine-tuned and tweaked by humans. “[Synthetic data] is already huge . . . even if it’s not broadcast widely,” he said.
For example, to train a model on advanced mathematics, Cohere might use two AI models talking to each other, where one acts as a maths tutor and the other as the student.
“They’re having a conversation about trigonometry . . . and it’s all synthetic,” Gomez said. “It’s all just imagined by the model. And then the human looks at this conversation and goes in and corrects it if the model said something wrong. That’s the status quo today.”
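Gomez did not detail Cohere’s internal tooling, but the mechanics of such a set-up can be sketched in a few lines of Python. The snippet below uses OpenAI’s public chat API purely for illustration; the prompts, model name and number of turns are assumptions, not a description of any company’s actual pipeline.

```python
# Illustrative sketch only: two personas, a tutor and a student, generated by the
# same model, take alternating turns to produce a fully synthetic conversation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TUTOR_PROMPT = "You are a maths tutor. Explain trigonometry step by step and correct the student."
STUDENT_PROMPT = "You are a student learning trigonometry. Ask questions and attempt answers."

def next_turn(system_prompt: str, transcript: list[str]) -> str:
    """Generate the next turn of the synthetic tutor-student conversation."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "\n".join(transcript) or "Begin the lesson."},
    ]
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content

transcript: list[str] = []
for turn in range(6):  # alternate tutor and student turns
    speaker, prompt = ("Tutor", TUTOR_PROMPT) if turn % 2 == 0 else ("Student", STUDENT_PROMPT)
    transcript.append(f"{speaker}: {next_turn(prompt, transcript)}")

# In the workflow Gomez describes, a human reviewer would now read the transcript
# and correct any mathematical mistakes before it is added to a training set.
print("\n".join(transcript))
```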
Two recent studies from Microsoft Research showed that synthetic data could be used to train models that were smaller and simpler than state-of-the-art software like OpenAI’s GPT-4 or Google’s PaLM-2.
One paper described a synthetic data set of short stories generated by GPT-4, which contained only words that a typical four-year-old might understand. This data set, known as TinyStories, was then used to train a simple LLM that was able to produce fluent and grammatically correct stories. The other paper showed that a model could be trained on synthetic Python code in the form of textbooks and exercises, and that it performed relatively well on coding tasks.
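Neither paper’s code is reproduced here, but the second half of that recipe, fine-tuning a small model on a synthetic corpus, can be sketched with the open-source Hugging Face libraries. The model size, file name and hyperparameters below are illustrative placeholders, not the settings used by Microsoft Research.

```python
# Sketch: train a deliberately tiny GPT-2-style model on a file of synthetic stories.
# Assumes the transformers and datasets libraries and a local file "stories.txt".
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling, GPT2Config,
                          GPT2LMHeadModel, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# A small model, in the spirit of training "smaller and simpler" systems.
config = GPT2Config(n_layer=4, n_head=4, n_embd=256, vocab_size=tokenizer.vocab_size)
model = GPT2LMHeadModel(config)

dataset = load_dataset("text", data_files={"train": "stories.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tiny-stories-model", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```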
Start-ups such as Scale AI and Gretel.ai have sprung up to provide synthetic data as a service. Gretel, set up by former US intelligence analysts from the National Security Agency and the CIA, works with companies including Google, HSBC, Riot Games and Illumina to augment their existing data with synthetic versions that can help train better AI models.
The key component of synthetic data, according to Gretel chief executive Ali Golshan, is that it preserves the privacy of all individuals in a data set, while still maintaining its statistical integrity.
Well-crafted synthetic data can also remove biases and imbalances in existing data, he added. “Hedge funds can look at black swan events and, say, create a hundred variations to see if our models crack,” Golshan said. For banks, where fraud typically constitutes less than a 100th of a per cent of total data, Gretel’s software can generate “thousands of edge case scenarios on fraud and train [AI] models with it.”
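Gretel’s own methods are proprietary, but the underlying idea of rebalancing rare events with statistically faithful synthetic rows can be illustrated with a simple generative model from scikit-learn. The column names, file paths and mixture model below are assumptions for demonstration, and a production system would add formal privacy guarantees that this sketch does not.

```python
# Illustration of the concept, not Gretel's product: fit a simple generative model to
# the scarce fraud rows of a tabular data set (numeric features assumed) and sample
# synthetic look-alikes to rebalance the training data.
import pandas as pd
from sklearn.mixture import GaussianMixture

transactions = pd.read_csv("transactions.csv")        # hypothetical file
fraud = transactions[transactions["is_fraud"] == 1]   # the rare minority class
feature_columns = fraud.drop(columns=["is_fraud"]).columns
features = fraud[feature_columns].to_numpy()

# Fit a Gaussian mixture to the handful of real fraud examples ...
gmm = GaussianMixture(n_components=5, covariance_type="full", random_state=0)
gmm.fit(features)

# ... and draw thousands of synthetic fraud-like rows that follow the same joint
# distribution, so downstream fraud models see far more positive examples.
synthetic_features, _ = gmm.sample(10_000)
synthetic = pd.DataFrame(synthetic_features, columns=feature_columns)
synthetic["is_fraud"] = 1

augmented = pd.concat([transactions, synthetic], ignore_index=True)
augmented.to_csv("transactions_augmented.csv", index=False)
```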
Critics point out that not all synthetic data will be carefully curated to reflect or improve on real-world data. As AI-generated text and images start to fill the internet, it is likely that AI companies crawling the web for training data will inevitably end up using raw data produced by primitive versions of their own models — a phenomenon known as “dog-fooding”.
Research from universities including Oxford and Cambridge recently warned that training AI models on their own raw outputs, which may contain falsehoods or fabrications, could corrupt and degrade the technology over time, causing “irreversible defects.”
Golshan agrees that training on poor synthetic data could impede progress. “The content on the web is more and more AI-generated, and I do think that will lead to degradation over time [because] LLMs are producing regurgitated knowledge, without any new insights,” he said.
Despite these risks, AI researchers like Cohere’s Gomez say that synthetic data has the potential to accelerate the path to superintelligent AI systems.
“What you really want is models to be able to teach themselves. You want them to be able to . . . ask their own questions, discover new truths and create their own knowledge,” he said. “That’s the dream.”