The Promise and Perils of Synthetic Data


Is it possible to train an AI only on data generated by other AIs? It might sound like a far-fetched idea. But it is one that has been around for quite some time, and it is gaining traction as new, real data becomes harder to come by.

Anthropic used some synthetic data to train one of its flagship models, Claude 3.5 Sonnet. Meta fine-tuned its Llama 3.1 models using AI-generated data. And OpenAI is said to be sourcing synthetic training data from o1, its "reasoning" model, for the upcoming Orion.

But why does AI need data in the first place, and what kind of data does it need? And can that data really be replaced by synthetic data?

The importance of annotations

AI systems are statistical machines. Trained on many examples, they learn the patterns in those examples to make predictions, such as that "it may concern" typically follows "to whom" in an email.
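To make that idea concrete, here is a toy sketch of the statistical principle at work: a bigram model that counts which word follows which in a tiny corpus and then predicts the most likely continuation. (This miniature example is ours, for illustration; production models are vastly more sophisticated.)

```python
from collections import Counter, defaultdict

# Toy corpus of email openers.
corpus = [
    "to whom it may concern",
    "to whom it may concern",
    "to whom this letter finds",
]

# Count how often each word follows each preceding word (a bigram model).
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follow_counts[prev][nxt] += 1

# Predict the most frequent continuation of "whom".
prediction = follow_counts["whom"].most_common(1)[0][0]
print(prediction)  # -> "it"
```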

Annotations, usually text labeling the meaning or parts of the data these systems ingest, are a key piece of these examples. They act as guideposts, "teaching" a model to distinguish among things, places, and ideas.

Consider a photo-classification model shown many pictures of kitchens labeled with the word "kitchen." As it trains, the model begins to associate "kitchen" with general characteristics of kitchens (e.g., that they contain fridges and countertops). After training, given a photo of a kitchen that wasn't included in the initial examples, the model should be able to identify it as such. (Of course, if the pictures of kitchens were labeled "cow," it would identify them as cows, which underscores the importance of good annotation.)
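For readers who want to see that principle in code, here is a minimal, hypothetical sketch using scikit-learn, with hand-made feature vectors (has a fridge, has a countertop, has a bed) standing in for photos; a real classifier would learn such features from raw pixels.

```python
from sklearn.linear_model import LogisticRegression

# Toy feature vectors standing in for photos: [has_fridge, has_countertop, has_bed]
X_train = [
    [1, 1, 0],  # labeled "kitchen"
    [1, 1, 0],  # labeled "kitchen"
    [0, 0, 1],  # labeled "bedroom"
    [0, 1, 1],  # labeled "bedroom"
]
y_train = ["kitchen", "kitchen", "bedroom", "bedroom"]

model = LogisticRegression().fit(X_train, y_train)

# A new "photo" with a fridge and countertop, unseen during training.
print(model.predict([[1, 1, 0]]))  # -> ['kitchen']

# If the same photos had been mislabeled "cow", the model would just as
# confidently predict "cow": the labels drive what is learned.
```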

The appetite for AI, and the need to provide labeled data for its development, has ballooned the market for annotation services. Dimension Market Research estimates that it is worth $838.2 million today and will be worth $10.34 billion within the next 10 years. While there are no precise estimates of how many people work in labeling, a 2022 paper pegs the number in the "millions."

Companies large and small rely on workers employed by data annotation firms to create labels for AI training sets. Some of these jobs pay decently, especially if the labeling requires specialized knowledge (e.g., math expertise). Others can be backbreaking. Annotators in developing countries are paid only a few dollars per hour on average, without benefits or any guarantee of future work.

A drying data well

So there are humanistic reasons to seek alternatives to human-generated labels. Uber, for example, is expanding its fleet of gig workers to work on AI annotation and data labeling. But there are practical reasons, too.

Humans can only label so fast. Annotators also have biases that can show up in their annotations and, subsequently, in any models trained on them. Annotators make mistakes, or get tripped up by labeling instructions. And paying humans to do things is expensive.

Data in general is expensive, for that matter. Shutterstock is charging AI vendors tens of millions of dollars to access its archives, while Reddit has made hundreds of millions from licensing data to Google, OpenAI, and others.

Finally, data is becoming harder to acquire.

Most models are trained on massive collections of public data, data whose owners are increasingly choosing to gate over fears that it will be plagiarized or that they won't receive credit or attribution for it. More than 35% of the world's top 1,000 websites now block OpenAI's web scraper. And around 25% of data from "high-quality" sources has been restricted from the major datasets used to train models, one study found.
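For context, sites typically shut out crawlers through robots.txt, and OpenAI's scraper identifies itself as GPTBot. A small illustration using Python's standard-library robots.txt parser (the rules shown are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that disallows OpenAI's crawler ("GPTBot").
robots_txt = """
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/article"))       # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```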

Should the current access-blocking trend continue, the research group Epoch AI projects that developers will run out of data to train generative AI models between 2026 and 2032. That, combined with fears of copyright litigation and objectionable material making their way into open datasets, has forced a reckoning for AI vendors.

Synthetic alternatives

At first glance, synthetic data looks like the solution to all these problems. Need annotations? Generate them. Need more example data? No problem. The sky's the limit.

And to a certain extent, this is true.

"If 'data is the new oil,' synthetic data pitches itself as biofuel, creatable without the negative externalities of the real thing," Os Keyes, a PhD candidate at the University of Washington who studies the ethical implications of emerging technologies, told TechCrunch.

The AI industry has taken the concept and run with it.

This month, Writer, an enterprise-focused generative AI company, debuted Palmyra X 004, a model trained almost entirely on synthetic data. It cost just $700,000 to develop, Writer claims, compared with an estimated $4.6 million for a comparably sized OpenAI model.

Microsoft's Phi open models were trained in part on synthetic data. So were Google's Gemma models. Nvidia this summer unveiled a model family designed to generate synthetic training data, and AI startup Hugging Face recently released what it claims is the largest AI training dataset of synthetic text.

Synthetic data generation has become a business in its own right, one that could be worth $2.34 billion by 2030. Gartner predicts that 60% of the data used for AI and analytics projects this year will be synthetically generated.

Luca Soldaini, a senior research scientist at the Allen Institute for AI, noted that synthetic data techniques can be used to generate training data in a format not easily obtained through scraping (or even content licensing). For example, in training its video generator Movie Gen, Meta used Llama 3 to create captions for footage in the training data, which humans then refined to add more detail, such as descriptions of the lighting.
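The general pattern looks something like the sketch below. To be clear, this is an illustrative mock-up, not Meta's actual pipeline; `caption_model` and `human_review` are hypothetical placeholders for a multimodal captioning model and an annotator pass.

```python
# Hypothetical sketch of model-assisted annotation (not Meta's real pipeline).

def caption_model(clip_path: str) -> str:
    """Placeholder for a multimodal model that drafts a caption."""
    return f"A draft caption for {clip_path}"

def human_review(draft: str) -> str:
    """Placeholder for an annotator adding detail, e.g. lighting notes."""
    return draft + " (soft window light, handheld camera)"

unlabeled_clips = ["clip_001.mp4", "clip_002.mp4"]

training_pairs = []
for clip in unlabeled_clips:
    draft = caption_model(clip)   # the model does the first pass
    final = human_review(draft)   # a human refines the synthetic label
    training_pairs.append((clip, final))

print(training_pairs[0])
```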

Along the same lines, OpenAI says it fine-tuned GPT-4o using synthetic data to build the sketchpad-like Canvas feature for ChatGPT. And Amazon has said it generates synthetic data to supplement the real-world data it uses to train speech recognition models for Alexa.

"Synthetic data models can be used to quickly expand upon human intuition of which data is needed to achieve a specific model behavior," Soldaini said.

Synthetic risks

Synthetic data is no panacea, however. It suffers from the same "garbage in, garbage out" problem as all AI. Models create synthetic data, and if the data used to train those models has biases and limitations, their outputs will be similarly tainted. For instance, groups poorly represented in the base data will be poorly represented in the synthetic data, too.

"The problem is, you can only do so much," Keyes said. "Say you only have 30 Black people in a dataset. Extrapolating out might help, but if those 30 people are all middle-class or all light-skinned, that's what the 'representative' data will all look like."

To this point, a 2023 study by researchers at Rice University and Stanford found that over-reliance on synthetic data during training can create models whose "quality or diversity progressively decrease." Sampling bias, that is, poor representation of the real world, causes a model's diversity to worsen after a few generations of training, according to the researchers (although they also found that mixing in a bit of real-world data helps mitigate this).
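Here is a minimal sketch of that mitigation, keeping a fixed share of real examples in each training generation; the 30% real fraction and the dataset names are illustrative assumptions, not values from the study:

```python
import random

def build_training_set(real_data, synthetic_data, real_fraction=0.3, size=1000):
    """Mix real and synthetic examples; the real slice anchors each generation.

    real_fraction is an illustrative knob, not a value from the study.
    """
    n_real = int(size * real_fraction)
    n_synth = size - n_real
    sample = (random.choices(real_data, k=n_real)
              + random.choices(synthetic_data, k=n_synth))
    random.shuffle(sample)
    return sample

real = [f"real_{i}" for i in range(100)]
synth = [f"synth_{i}" for i in range(100)]
mixed = build_training_set(real, synth)
print(sum(x.startswith("real_") for x in mixed), "real examples of", len(mixed))
```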

Keyes sees additional risks in complex models such as OpenAI's o1, which they think could produce harder-to-spot hallucinations in their synthetic data. These, in turn, could reduce the accuracy of models trained on the data, especially if the hallucinations' sources aren't easy to identify.

"Complex models hallucinate; data produced by complex models contain hallucinations," Keyes added. "And with a model like o1, the developers themselves can't necessarily explain why artefacts appear."

Compounding hallucinations can lead to gibberish-spewing models. A study published in the journal Nature describes how models trained on error-ridden data generate even more error-ridden data, and how this feedback loop degrades future generations of models. Models lose their grasp of more esoteric knowledge over generations, the researchers found, becoming more generic and often producing answers irrelevant to the questions they are asked.

Image credits: Ilia Shumailov et al.
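The dynamic is easy to caricature in a few lines of code. In this toy simulation (our illustration, not the Nature study's setup), each "generation" fits a normal distribution to samples drawn from the previous generation's fit; on average, the fitted spread drifts toward zero, a miniature analogue of collapse.

```python
import random
import statistics

random.seed(0)

# Generation 0: a small "real" dataset from a standard normal distribution.
data = [random.gauss(0, 1) for _ in range(10)]

for gen in range(1, 101):
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    # Each new generation trains only on samples from the previous model,
    # so estimation noise compounds and the spread tends to shrink.
    data = [random.gauss(mu, sigma) for _ in range(10)]
    if gen % 20 == 0:
        print(f"generation {gen}: fitted sigma = {sigma:.4f}")
```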

A follow-up study shows that other types of models, such as image generators, aren't immune to this kind of collapse:

Image credits: Ilia Shumailov et al.

Soldaini agrees that "raw" synthetic data isn't to be trusted, at least if the goal is to avoid training forgetful chatbots and homogenous image generators. Using it "safely," he says, requires thoroughly reviewing, curating, and filtering it, and ideally pairing it with fresh, real data, just as you would with any other dataset.

Failing to do so could eventually lead to model collapse, in which a model becomes less "creative," and more biased, in its outputs, eventually seriously compromising its functionality. Though this process can be detected and arrested before it gets serious, it is a risk.

"Researchers need to examine the generated data, iterate on the generation process, and identify safeguards to remove low-quality data points," Soldaini said. "Synthetic data pipelines are not a self-improving machine; their output must be carefully inspected and improved before being used for training."
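One hedged sketch of such a safeguard: score each generated sample with a simple quality heuristic and drop anything below a threshold before it reaches training. The heuristic and cutoff here are illustrative assumptions, not a published pipeline.

```python
def quality_score(sample: str) -> float:
    """Toy heuristic: penalize very short or highly repetitive text."""
    words = sample.split()
    if len(words) < 5:
        return 0.0
    return len(set(words)) / len(words)  # lexical diversity in [0, 1]

generated = [
    "The kitchen contains a fridge, a countertop, and a small window.",
    "data data data data data data",
    "ok",
]

THRESHOLD = 0.5  # illustrative cutoff
kept = [s for s in generated if quality_score(s) >= THRESHOLD]
print(kept)  # only the first sample survives the filter
```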

OpenAI CEO Sam Altman has argued that AI will someday generate synthetic data good enough to effectively train itself. But, assuming that's even feasible, the technology doesn't exist yet: no major AI lab has released a model trained on synthetic data alone.

At least for the foreseeable future, it seems we'll need humans in the loop somewhere to make sure a model's training doesn't go awry.


Update: This story was originally published on October 23 and was updated on December 24 with additional information.


