Generative AI needs synthetic data. Can we trust it?


Today's generative AI models, such as those behind ChatGPT and Gemini, are trained on reams of real-world data, but even all the content on the internet is not enough to prepare a model for every possible situation.

To continue to grow, these models need to be trained on simulated or synthetic data: scenarios that are plausible but not real. AI developers need to do this responsibly, experts said on a panel at South by Southwest, or things could quickly go haywire.

The use of simulated training data for AI models has attracted new attention this year since the launch of DeepSeek AI, a new model produced in China that was trained using more synthetic data than other models, saving money and processing power.

But experts say it's about more than saving on data collection and processing. Synthetic data, computer-generated and often produced by AI itself, can teach a model about scenarios that don't exist in the real-world information it was given but that it could face in the future. That one-in-a-million possibility doesn't have to come as a surprise to an AI model if it has seen a simulation of it.

“With simulated data, you can get rid of the idea of edge cases, assuming you can trust it,” said Oji Udezue, who has led product teams at Twitter, Atlassian, Microsoft and other companies. He and the other panelists spoke Sunday at the SXSW conference in Austin, Texas. “We can build a product that works for 8 billion people, in theory, as long as we can trust it.”

The hard part is making sure you can trust it.

The problem with simulated data

Simulated data has plenty of benefits. For one, it costs less to produce. You can crash-test thousands of simulated cars using software, but to get the same results in real life, you'd have to actually smash cars, which costs a lot of money, Udezue said.

If you're training a self-driving car, for example, you'd need to capture some less common scenarios that the vehicle might encounter on the road, even if they aren't in the training data, said Tahir Ekin, a professor of business analytics at Texas State University. He used the case of the bats that make a spectacular emergence from Austin's Congress Avenue Bridge. That may not show up in training data, but a self-driving car will need some sense of how to respond to a swarm of bats. A minimal sketch of that blending of real and synthetic examples follows.
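To make the idea concrete, here is a minimal Python sketch of how a developer might blend synthetic edge cases into an otherwise real dataset. Everything in it, the field names, the scenario generator and the 10-to-1 mix, is hypothetical, invented for illustration rather than taken from any system the panelists described.

```python
import random

random.seed(42)

# Real observations: routine daytime driving scenes.
# (All fields here are hypothetical, for illustration only.)
real_scenes = [
    {"time": "day", "obstacle": "car", "weather": "clear"}
    for _ in range(1000)
]

def make_synthetic_scene() -> dict:
    """Generate a plausible but rare scenario the real data lacks."""
    return {
        "time": random.choice(["dusk", "night"]),
        "obstacle": random.choice(["bat_swarm", "deer", "road_debris"]),
        "weather": random.choice(["fog", "heavy_rain", "clear"]),
    }

# Blend a small fraction of synthetic edge cases into the training set,
# keeping it grounded in mostly real examples.
training_data = real_scenes + [make_synthetic_scene() for _ in range(100)]
print(f"{len(training_data)} scenes, of which 100 are synthetic")
```

The ratio matters: the synthetic slice fills in rare scenarios without letting generated data dominate what the model learns.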

The risks come from how a machine trained on synthetic data responds to real-world changes. It can't exist in an alternate reality, or it becomes less useful, or even dangerous, Ekin said. “How would you feel,” he asked, “getting into a self-driving car that wasn't trained on the road, that was only trained on simulated data?” Any system using simulated data needs to “be grounded in the real world,” he said, including feedback on how its simulated reasoning squares with what's actually happening.

Udezue compared the problem to the creation of social media, which began as a way to expand communication worldwide, a goal it achieved. But social media has also been misused, he said, noting that “now despots use it to control people, and people use it to tell jokes at the same time.”

As AI tools grow in scale and popularity, a scenario made easier by the use of synthetic training data, the potential real-world impacts of untrustworthy training, and of models drifting away from reality, become more significant. “The burden is on us builders, scientists, to be double, triple sure the system is reliable,” Udezue said. “It's not a fantasy.”

How to keep simulated data in check

One way to ensure models are trustworthy is to make their training transparent, letting users choose which model to use based on their evaluation of that information. The panelists repeatedly used the analogy of a nutrition label, which is easy for a user to understand.

Some transparency exists already, such as the model cards available through the developer platform Hugging Face, which break down the details of different systems. That information needs to be as clear and transparent as possible, said Mike Hollinger, director of product management for generative AI at chipmaker Nvidia. “Those types of things must be in place,” he said.
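For readers who want to see what a model card looks like in practice, Hugging Face cards can be read programmatically. Below is a brief sketch using the real huggingface_hub Python library; the repo id "gpt2" is just an example, and the fields available vary from model to model.

```python
# pip install huggingface_hub
from huggingface_hub import ModelCard

# Load the card for a public model; "gpt2" is an example repo id.
card = ModelCard.load("gpt2")

# Structured metadata (license, tags, datasets, ...) lives in card.data;
# the free-form description the authors wrote lives in card.text.
print(card.data.to_dict())
print(card.text[:300])
```

The structured metadata is the closest thing the ecosystem currently has to the nutrition label the panelists described.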

Hollinger said that, ultimately, it won't just be AI developers but also AI users who define the industry's best practices.

The industry also needs to keep ethics and risks in mind, Udezue said. “Synthetic data will make a lot of things easier to do,” he said. “It will bring down the cost of building things. But some of those things will change society.”

Udezue said that grounding in reality, transparency and trust must be built into models to ensure their reliability. That includes updating the training models so they reflect accurate data and don't magnify the errors in synthetic data. One concern is model collapse, when an AI model trained on data produced by other AI models drifts ever further from reality, to the point of becoming useless.
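Model collapse is easy to demonstrate in miniature. The toy simulation below, a minimal sketch rather than anything the panelists presented, fits a simple "model" (a Gaussian) to data, then trains each new generation only on samples drawn from the previous generation's fit. With small samples, estimation error compounds, and the estimates wander away from the original real-world distribution.

```python
import random
import statistics

random.seed(0)

# Generation 0: "real world" data from a standard normal distribution.
data = [random.gauss(0.0, 1.0) for _ in range(50)]

for generation in range(20):
    # "Train" a model: estimate the distribution from the current data.
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    print(f"gen {generation:2d}: mean={mu:+.3f}, stdev={sigma:.3f}")
    # The next generation sees only the model's outputs, not real data.
    data = [random.gauss(mu, sigma) for _ in range(50)]
```

Each pass discards the original data entirely, which is exactly the feedback loop the grounding and error correction described here are meant to break.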

“The more you shy away from capturing the diversity of the real world, the more unhealthy the responses can be,” Udezue said. The solution is error correction, he said. “These don't feel like unsolvable problems if you combine the idea of trust, transparency and error correction into them.”




