How a lack of data threatens the future of artificial intelligence

Artificial intelligence is running short of key training data, and developers are turning to synthetic alternatives. Is "fake" data the future of AI, or a risk to the quality and performance of models?

The world of artificial intelligence is facing a shortage of its most valuable raw material – data. This has sparked discussions about an increasingly popular alternative: synthetic or even “fake” data. For years, companies like OpenAI and Google have been mining data from the internet to train the large language models (LLMs) that power their AI solutions. These models have digested vast amounts of human-generated content, from research papers and novels to YouTube videos.

Now that supply is slowly running out. Some major players in the field, such as OpenAI CEO Sam Altman, believe that self-learning models will be able to use synthetic data, which would provide a cheap and almost endless source of training material.

However, researchers warn of risks. Synthetic data could reduce the quality of models, as they can be “poisoned” with their own errors. Research from the universities of Oxford and Cambridge showed that training models exclusively on synthetic data leads to poor results and “nonsense.” According to the researchers, a balanced mix of synthetic and real data is key.

More and more companies are creating synthetic data

The lack of data is leading companies to look for alternatives, such as synthetic data generated by AI systems based on real data. Tech companies, including OpenAI and Google, are already paying millions to access data from platforms like Reddit and various media outlets, as websites increasingly restrict the free use of their content. Still, resources are limited.

Nvidia, Tencent, and startups Gretel and SynthLabs are developing tools to create synthetic data, which is often cleaner and more specific than human-generated data. Meta, with its Llama 3.1 model, has used synthetic data to improve skills such as programming and mathematical problem-solving. Synthetic data also offers the potential to reduce the bias inherent in real-world data, although researchers warn that ensuring accuracy and impartiality remains a major challenge.

“Habsburg” artificial intelligence

Although synthetic data brings advantages, it also poses serious risks. Meta's research on the Llama 3.1 model showed that training a model on its own synthetic data can actually degrade its performance. Similarly, a study in the journal Nature warned that the uncontrolled use of synthetic data leads to “model collapse,” which the researchers compared to genetic degeneration. The phenomenon has been symbolically called “Habsburg artificial intelligence,” a term coined by researcher Jathan Sadowski.

The main question remains: how much synthetic data is too much? Some experts suggest using hybrid data, where synthetic data is combined with real data to prevent model degradation. Companies like Scale AI are exploring this approach, and their CEO Alexandr Wang believes that the hybrid approach is “the real future.”
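The difference between training only on a model's own output and training on a hybrid mix can be illustrated with a toy simulation. The sketch below is not the method used in the Nature study or by Meta or Scale AI. It assumes the simplest possible "model", a bell curve fitted to numbers, and every name in it (simulate, real_fraction, sample_size) is hypothetical. Each generation fits the curve to a small dataset, then the next generation's dataset is sampled partly from that fitted curve (synthetic) and partly from the original distribution (real).

```python
import random
import statistics


def simulate(real_fraction, generations=500, sample_size=20, seed=0):
    """Toy stand-in for repeated retraining; returns the fitted spread per generation.

    The "model" is just a Gaussian (mean, standard deviation). This is an
    illustrative assumption, not how large language models are trained.
    """
    rng = random.Random(seed)
    real_mu, real_sigma = 0.0, 1.0   # the "real world" distribution
    mu, sigma = real_mu, real_sigma  # the first model matches reality
    history = [sigma]
    n_real = round(sample_size * real_fraction)
    n_synth = sample_size - n_real
    for _ in range(generations):
        # Fresh human-generated data, drawn from the real distribution.
        data = [rng.gauss(real_mu, real_sigma) for _ in range(n_real)]
        # Synthetic data, drawn from the previous generation's model.
        data += [rng.gauss(mu, sigma) for _ in range(n_synth)]
        # "Retrain": refit the Gaussian to the combined dataset.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        history.append(sigma)
    return history


pure = simulate(real_fraction=0.0)    # synthetic data only
hybrid = simulate(real_fraction=0.5)  # half real, half synthetic

print(f"final spread, synthetic only: {pure[-1]:.6f}")
print(f"final spread, hybrid mix:     {hybrid[-1]:.6f}")
```

In this toy setting, the synthetic-only run steadily loses the variety of the original data, because each generation's small estimation errors are inherited by the next, while the run that keeps receiving real samples stays close to the original spread. How much real data a production-scale model needs is exactly the open question raised above, and this sketch does not answer it.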

Finding new solutions

In January, Google DeepMind unveiled AlphaGeometry, a system that solves geometric problems at an extremely high level using a “neuro-symbolic” approach. It combines the advantages of data-intensive deep learning and rule-based reasoning. The model was trained entirely on synthetic data and is seen as a potential step towards artificial general intelligence.

The field of neuro-symbolic AI is still young, but it could offer a promising direction for the future of AI development. Under pressure to monetize, companies like OpenAI, Google, and Microsoft will try all possible solutions to overcome the data crisis.