AI training data runs out in 2026

Epoch AI warns with 80% certainty: high-quality training data will be exhausted between 2026 and 2032. An analysis of the 300 trillion token limit, overtraining economics, and the model collapse research published in Nature (2024).

By AI Twerp • Estimated reading time: 14 min
Tags: AI Business · AI Personal · AI Technology · AI Premise · AI Signals

Epoch AI data exhaustion 2026: when the world runs out of human text

Imagine building a machine that can read everything humanity has ever written. Now imagine that machine growing hungrier than you ever expected. Hungrier than all the books, articles, forum posts, and chat messages ever stored can feed. That’s exactly where the AI industry stands right now: on the brink of a crisis few see coming, but one that could reshape the entire race toward artificial intelligence, a question of strategy as much as scale. This is a deeper dive into the data constraint behind the core signal: Why AGI Won’t Happen by 2037: The Hard Limits of Data & Energy.

Between 2026 and 2032, the world’s supply of usable human-written text will run out. That’s not speculation but the conclusion of Epoch AI, a research institute projecting with 80 percent certainty that we’ll hit the bottom of the data well within that window [1]. Roughly 300 trillion tokens of high-quality public text form the total supply. At current scaling and overtraining rates, that limit could be reached as early as 2026 [2]. After that begins a new era in which artificial intelligence is no longer fed by human words but by its own depleting echoes.

The Core of the Signal

The next wave of AI progress will be decided less by bigger models and more by who controls clean human data. This is not a distant academic debate; it is a near-term constraint shaping budgets, licensing deals, and product quality. If the public web is running dry, where will the next training set come from, and what gets quietly excluded? The answer will determine whether AI stays grounded or drifts into self-trained sameness.

  • Prioritize data strategy now: map which human sources you can legally access and refresh.
  • Negotiate licensing early: lock in reliable pipelines before competitors push prices and restrictions higher.
  • Audit synthetic feedback loops: measure contamination risk and protect long-tail and minority signals.

The irony cuts deep. Just as AI systems start resembling something that thinks, the fuel they’re built on threatens to dry up.

The mathematics behind scarcity

Epoch AI’s estimate rests on solid ground. The research institute presented its findings at the International Conference on Machine Learning 2024 in Vienna, after thorough peer review. The effective supply of human-generated public text amounts to roughly 300 trillion tokens according to their calculations, with a 90 percent confidence interval between 100 and 1,000 trillion tokens. If “300 trillion” sounds like a made-up number, it helps to see what a token actually is and why it becomes the meter that quietly limits scale: AI Tokens, The Secret Economics.
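For intuition about the unit itself: a token is roughly three-quarters of an English word. The snippet below is a minimal sketch using OpenAI’s open-source tiktoken library; the cl100k_base encoding is chosen here purely as an example, since every model family uses its own tokenizer.

```python
# pip install tiktoken  -- OpenAI's open-source tokenizer library
import tiktoken

# cl100k_base is the encoding used by GPT-3.5/GPT-4-era models,
# chosen here purely as an example; other model families tokenize differently.
enc = tiktoken.get_encoding("cl100k_base")

text = "High-quality public text amounts to roughly 300 trillion tokens."
token_ids = enc.encode(text)

print(f"{len(text.split())} words -> {len(token_ids)} tokens")
print([enc.decode([t]) for t in token_ids])  # the individual token strings
```

On ordinary English prose this works out to roughly 1.3 tokens per word, the rough conversion hiding behind every “trillion tokens” figure in this piece.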

Common Crawl, the largest collection of publicly available internet text, yields approximately 130 trillion tokens. The fully indexed web brings that to 510 trillion tokens. The complete web, including content behind login screens and private networks, reaches about 3,100 trillion tokens. But reality proves more stubborn than theory. Not all data is usable. Quality filters and deduplication of repeated content shrink the stock, and even allowing for multi-epoch reuse, what’s actually available for training is a fraction of the raw total.
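As a back-of-the-envelope sketch of how a raw crawl shrinks to an effective supply, the arithmetic below starts from the indexed-web figure above and applies illustrative retention rates for quality filtering and deduplication plus a modest multi-epoch reuse factor. The specific percentages are assumptions for illustration, not Epoch AI’s exact methodology.

```python
# Rough, illustrative arithmetic: raw token stock -> effective training supply.
# The retention rates and epoch multiplier are assumptions for illustration,
# not Epoch AI's published adjustments.

INDEXED_WEB_TOKENS = 510e12   # ~510 trillion tokens in the fully indexed web

QUALITY_RETENTION = 0.30      # assume ~30% survives quality filtering
DEDUP_RETENTION   = 0.65      # assume ~65% of that survives deduplication
EPOCH_MULTIPLIER  = 3         # assume data can be reused for ~3 training epochs

effective = INDEXED_WEB_TOKENS * QUALITY_RETENTION * DEDUP_RETENTION * EPOCH_MULTIPLIER
print(f"Effective supply: ~{effective / 1e12:.0f} trillion tokens")
# -> ~298 trillion, in the ballpark of the 300 trillion headline figure
```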

Pablo Villalobos, lead author of the Epoch AI study, confirmed that while the timeline has shifted somewhat due to new techniques, the fundamental scarcity remains real. With compute-optimal training, there’s sufficient data for a model trained with 5×10²⁸ floating-point operations, a scale expected to be reached in 2028. But with overtraining, the practice of training a smaller model on far more data than is compute-optimal so that it’s cheaper to run at inference time, the supply runs out sooner. Meta’s Llama 3 was overtrained by a factor of ten. If other labs move toward overtraining factors of one hundred, the data could run out as early as 2025.
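To see why overtraining pulls the exhaustion date forward, here is a rough sketch. The C ≈ 6·N·D approximation and the roughly 20-tokens-per-parameter compute-optimal ratio are standard rules of thumb used here as assumptions, not figures taken from the Epoch AI paper; under them, token demand at a fixed compute budget grows with the square root of the overtraining factor.

```python
import math

# Assumptions (standard rules of thumb, not taken from the Epoch AI paper):
#   training compute   C ≈ 6 * N * D   (N parameters, D training tokens)
#   compute-optimal    D ≈ 20 * N      (Chinchilla-style ratio)
#   an overtraining factor k scales that ratio: D = 20 * k * N
# Solving for token demand at a fixed compute budget:
#   D = sqrt(20 * k * C / 6), i.e. demand grows with the square root of k.

def tokens_needed(compute_flop: float, overtraining_factor: float) -> float:
    """Token demand for a given compute budget and overtraining factor."""
    return math.sqrt(20 * overtraining_factor * compute_flop / 6)

C = 5e28  # the compute scale cited above for roughly 2028

for k in (1, 10, 100):
    demand = tokens_needed(C, k)
    print(f"overtraining x{k:>3}: ~{demand / 1e12:,.0f} trillion tokens")

# overtraining x  1: ~408 trillion   (inside Epoch AI's 100T-1,000T interval)
# overtraining x 10: ~1,291 trillion (already past the upper bound)
# overtraining x100: ~4,082 trillion
```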

The billion-dollar deals to buy time

The major AI companies see the crisis coming and are responding with an unprecedented wave of content licenses [4]. News Corp closed a deal with OpenAI worth more than $250 million over five years of access. Reddit has signed licensing deals with Google and OpenAI reportedly worth around $203 million in total contract value. Associated Press, Financial Times, Condé Nast, and dozens of other publishers are all signing deals worth billions of dollars in total, turning training data into a test of governance as much as capital.

Reddit has become the most cited source in AI models, three times more often than Wikipedia, according to analysis by Profound AI. The platform is negotiating dynamic pricing where the value of specific datasets is measured by their contribution to benchmark scores. The message is clear. Authentic human data is becoming scarce and therefore valuable. What was once free through web scraping is now sold to the highest bidder.

But these deals don’t solve the fundamental problem. They buy time, not an infinite supply. The amount of new content published daily simply can’t keep pace with the hunger of exponentially growing models.

The synthetic alternative and its curse

When human data runs out, synthetic data seems obvious. AI models can generate text themselves, after all. Why not use that output to train the next generation? Sam Altman of OpenAI is experimenting with it. Ilya Sutskever, OpenAI co-founder, signaled a paradigm shift when he said the 2010s were the era of scaling, but we’re now back in the era of wonder and discovery.

What comes next turns out to be a trap. In July 2024, Nature published groundbreaking research by Oxford and Cambridge researchers Ilia Shumailov and Zakhar Shumaylov demonstrating that AI models trained on output from other AI models progressively degrade to the point of collapse [3]. The mechanism is subtle but merciless. Model collapse unfolds in two phases. Early collapse begins almost imperceptibly when the model loses information about the tails of the distribution, especially minority data and rare events. This is hard to detect because overall performance can remain stable or even improve.

Late collapse leaves no room for doubt. The model converges to a distribution with dramatically reduced variance. Outputs bear little resemblance to the original data. Concepts become confused and the model becomes effectively worthless. A concrete example from the study shows a language model that, after nine generations of training on its own output, produced completely unrelated text about jackrabbits when asked about medieval architecture.

Related work has dubbed the phenomenon Model Autophagy Disorder. When generative models are iteratively trained on their own or other models’ output, subsequent generations inevitably lose quality or diversity unless each round contains sufficient fresh real data. Margaret Mitchell, Chief Ethics Scientist at Hugging Face, noted that the solution isn’t to reject synthetic data but to regulate its use with intelligent sampling, human oversight, and provenance tracking.
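The dynamic is easy to reproduce in miniature. The toy simulation below is a hedged illustration, not the Nature paper’s actual experiment: each “generation” is a Gaussian fitted to a small sample of the previous generation’s output, with no fresh human data mixed in, and the fitted spread tends to shrink generation after generation, which is exactly the loss of tails that early collapse describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration of model collapse, not the Nature paper's experiment:
# each "generation" is a Gaussian fitted to a finite sample drawn from the
# previous generation's model, with no fresh real data mixed back in.
REAL_MEAN, REAL_STD = 0.0, 1.0   # the original "human" data distribution
SAMPLES_PER_GEN = 50             # small samples make the finite-sample error visible
GENERATIONS = 100

mean, std = REAL_MEAN, REAL_STD
for gen in range(1, GENERATIONS + 1):
    synthetic = rng.normal(mean, std, SAMPLES_PER_GEN)  # this generation's output
    mean, std = synthetic.mean(), synthetic.std()       # the next generation "trains" on it
    if gen % 25 == 0:
        print(f"generation {gen:>3}: fitted std ≈ {std:.3f}")

# Typical behaviour: the fitted spread drifts toward zero, so rare events and
# minority data in the tails vanish first (early collapse), and eventually the
# distribution bears little resemblance to the original (late collapse).
```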

As AI-generated content floods the internet, the problem becomes self-reinforcing in a form of digital ouroboros. Datasets scraped from the web inevitably contain increasing amounts of AI-generated content, creating feedback loops impossible to untangle without radical transparency about the origin of every sentence.

Signs of slowdown and the data wall

[Illustration: the training data crisis, when the web runs out of fresh human text]

The first signs that the AI industry is hitting a wall are already visible [5]. OpenAI’s upcoming flagship model, internally codenamed Orion, reportedly shows a much smaller performance improvement over previous models than the leap GPT-4 made over GPT-3, according to sources within the company. Google is also struggling to achieve comparable performance jumps for Gemini. The improvement is there, but falls short of expectations.

Dario Amodei, CEO of Anthropic, estimated in 2023 a ten percent chance that AI system scaling could stagnate due to insufficient data. He also noted that training costs are rising explosively. Current models cost around $100 million, models now in training cost roughly $1 billion, and models expected between 2025 and 2027 could reach $10 to $100 billion. Those costs aren’t rising because compute is getting more expensive but because models need ever more data to achieve incremental improvements while that data becomes scarcer.

Nicolas Papernot, assistant professor at the University of Toronto and a co-author of the Nature study, noted that we don’t necessarily need to train ever larger models. Building more capable AI systems can also come from training models that are more specialized for specific tasks. That’s a fundamental shift from “bigger is better” to “better data is better”, an acknowledgment that the scaling strategy has reached its limits and that innovation must now come from efficiency instead of expansion.

Possible escape routes and alternative paths

Despite the grim outlook, potential solutions exist. Epoch AI’s own analysis suggests that even as public text data runs out, AI developers likely won’t run completely dry. Two sources offer a way out: synthetic data, particularly for reasoning training, and multimodal data such as images, video, and audio.

Multimodal learning could potentially triple the available training data. After correcting for uncertainties around data quality, Epoch AI estimates the equivalent of 400 trillion to 20 quadrillion tokens available for training by 2030. Video and audio contain enormous untapped data sources barely explored until now. Jensen Huang of Nvidia noted that every company’s database is its gold mine, referring to the private data companies already own.

Training techniques and algorithms for large language models are improving at a rate of roughly 0.4 orders of magnitude per year, according to research by Ho et al. (2024). That means models learn more from less data, reducing pressure on the supply. Research from NYU’s Center for Data Science suggests the problem of model collapse can be mitigated by using reinforcement techniques to select high-quality synthetic data, the kind of innovation that matters more than another parameter bump.
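For a sense of scale, 0.4 orders of magnitude per year compounds to roughly a 2.5-fold efficiency gain annually. The sketch below simply applies that rate mechanically; treating the trend as constant is an assumption for illustration, not a guarantee that it holds.

```python
# 0.4 orders of magnitude per year means a 10**0.4 ≈ 2.5x gain annually.
# Treating the trend as constant is an assumption for illustration only.
RATE_OOM_PER_YEAR = 0.4

for years in (1, 2, 5):
    multiplier = 10 ** (RATE_OOM_PER_YEAR * years)
    print(f"after {years} year(s): equivalent results from ~{multiplier:.1f}x less data/compute")
```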

Other strategies include data accumulation, where new synthetic data is added to the existing real data rather than replacing it, watermarking of AI content to prevent it from flowing back into training sets, and negative guidance training. These show promise, but none have been tested at the scale needed to save the entire industry from data hunger.
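Extending the toy simulation above, the sketch below contrasts pure replacement with the accumulation idea, keeping a fixed share of original real data in every round. The 50 percent mix and the other numbers are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
REAL_POOL = rng.normal(0.0, 1.0, 10_000)   # a fixed pool of "human" data
SAMPLES_PER_GEN, GENERATIONS = 50, 100

def run(real_fraction: float) -> float:
    """Fitted std after GENERATIONS of self-training.

    real_fraction: share of each generation's training sample drawn from the
    original real pool (0.0 reproduces the pure-replacement collapse scenario).
    """
    mean, std = 0.0, 1.0
    n_real = int(SAMPLES_PER_GEN * real_fraction)
    for _ in range(GENERATIONS):
        synthetic = rng.normal(mean, std, SAMPLES_PER_GEN - n_real)
        real = rng.choice(REAL_POOL, n_real, replace=False)
        sample = np.concatenate([synthetic, real])
        mean, std = sample.mean(), sample.std()
    return std

print(f"replace    (0% real data): fitted std ≈ {run(0.0):.3f}")
print(f"accumulate (50% real):     fitted std ≈ {run(0.5):.3f}")
# Typical behaviour: keeping fresh real data in every round keeps the spread
# close to the original, while pure replacement lets it drift toward collapse.
```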

As scarce as truth is, the supply has always been in excess of the demand.

– Josh Billings

A growing number of copyright lawsuits threaten to make large portions of valuable training data retroactively inaccessible. The New York Times sued OpenAI and Microsoft, followed by Raw Story, The Intercept, and newspaper groups. If courts or legislators determine that permission was required to use copyrighted texts as training data, much of what Elon Musk once called the cumulative sum of human knowledge could be retroactively banned.

The moment when usable data runs out could then suddenly lie behind us rather than ahead of us. This legal risk intensifies the urgency of the data crisis and shifts the discussion from a technical problem to a question of governance and intellectual property rights in the era of machine learning. The outcome of these lawsuits will determine whether AI companies retain access to decades of archived human knowledge or are forced to rely on licensed and synthetic sources that are qualitatively and quantitatively inferior.

The turning point and the transition ahead

The AI industry is approaching a turning point where exponential growth driven simply by more data and more compute can’t continue forever. The comparison to a gold rush depleting a finite natural resource fits. Just as with fossil fuels, the situation requires a transition to more sustainable sources, or in this case, smarter ways to use data.

The question isn’t whether limitations will come but how the industry will respond. Will companies shift toward smaller specialized models that need less data? Will they invest in human data generation at scale, perhaps by paying millions of people for high quality content? Will they break through to multimodal intelligence that learns from video and sensory input instead of text alone? Or will they hit the wall of model collapse while desperately trying to recycle synthetic data until nothing authentic remains?

The coming years will provide answers that reach beyond technology alone, because they touch on fundamental questions about what it means when machines learn from machines instead of from people, and whether intelligence can keep growing when the source of truth runs dry.

References

[1] Villalobos P, Ho A, Sevilla J, Besiroglu T, Heim L, Hobbhahn M. Will we run out of data? Limits of LLM scaling based on human-generated data. Epoch AI. 2024. Available from: https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data

[2] PBS News. AI ‘gold rush’ for chatbot training data could run out of human-written text as early as 2026. 2024. Available from: https://www.pbs.org/newshour/economy/ai-gold-rush-for-chatbot-training-data-could-run-out-of-human-written-text-as-early-as-2026

[3] Shumailov I, Shumaylov Z, Zhao Y, Papernot N, Anderson R, Gal Y. AI models collapse when trained on recursively generated data. Nature. 2024;631:755-759. Available from: https://www.nature.com/articles/s41586-024-07566-y

[4] Digiday. A timeline of the major deals between publishers and AI tech companies in 2025. 2025. Available from: https://digiday.com/media/a-timeline-of-the-major-deals-between-publishers-and-ai-tech-companies-in-2025/

[5] SuperAnnotate. AI data wall: Why experts predict AI slowdown and how to break through the plateau. 2024. Available from: https://www.superannotate.com/blog/ai-data-wall