How AI training data works
- Training data is the raw input AI models learn from. For modern large language models, it's roughly the public web plus licensed corpora plus, increasingly, user conversations. Trillions of words.
- Your personal data is almost certainly in there if you've ever had a public web presence — a profile, a blog, a news mention, a public LinkedIn. The major vendors don't publish exact training-data lists.
- Removing your data from a trained model is technically hard. The data lives compressed into model weights, not in a database. What vendors offer is opt-out from future training plus output suppression for the current model. See the AI removal hub.
What "training data" actually is
A large language model like the ones behind ChatGPT, Claude, or Gemini is a function with billions of parameters, up to hundreds of billions or more in the largest models. Each parameter is a number. Training is the process of adjusting those numbers until the model can do something useful: answer questions, write code, summarize text.
The adjustment is done by showing the model massive amounts of text and asking it to predict the next word. When it predicts wrong, the numbers shift slightly. After trillions of word-prediction tries, the model has internalized statistical patterns about language, facts, code, reasoning — everything that was in the training data.
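The predict-and-adjust loop described above can be sketched at toy scale. This is a minimal illustration, not how any vendor trains: it assumes a made-up one-sentence corpus, a word-level vocabulary, and a model that looks only at the previous word. The names `train_bigram` and `predict_next` are hypothetical.

```python
import math
import random

def train_bigram(text, epochs=200, lr=0.1, seed=0):
    """Toy next-word predictor: one row of scores per previous word."""
    words = text.split()
    vocab = sorted(set(words))
    index = {w: i for i, w in enumerate(vocab)}
    rng = random.Random(seed)
    # The "parameters": a V x V table of numbers, initialised near zero.
    scores = [[rng.uniform(-0.01, 0.01) for _ in vocab] for _ in vocab]
    pairs = [(index[a], index[b]) for a, b in zip(words, words[1:])]

    for _ in range(epochs):
        for prev, actual_next in pairs:
            row = scores[prev]
            # Softmax turns the row of scores into predicted probabilities.
            top = max(row)
            exps = [math.exp(s - top) for s in row]
            total = sum(exps)
            probs = [e / total for e in exps]
            # Nudge the numbers: raise the score of the word that actually
            # came next, lower the others in proportion to how strongly
            # they were predicted (the cross-entropy gradient).
            for j in range(len(row)):
                target = 1.0 if j == actual_next else 0.0
                row[j] -= lr * (probs[j] - target)
    return vocab, index, scores

def predict_next(word, vocab, index, scores):
    """Return the word with the highest learned score after `word`."""
    row = scores[index[word]]
    return vocab[max(range(len(row)), key=row.__getitem__)]

corpus = "the cat sat on the mat and the cat ate the fish"
vocab, index, scores = train_bigram(corpus)
print(predict_next("the", vocab, index, scores))
print(predict_next("sat", vocab, index, scores))
```

After training, the table favours "cat" after "the" and "on" after "sat", yet the corpus sentence is stored nowhere in it. Only the adjusted numbers remain, which is the same property that makes deletion from real models difficult.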
The "training data" is that input text. Broadly, more high-quality data yields a more capable model. The biggest publicly disclosed training runs as of 2026 used somewhere in the range of 10-30 trillion tokens, where a token is roughly 0.75 of an English word. That is far more text than any deliberately curated collection of books and articles could supply, so most of it is necessarily scraped, ingested, and processed at scale.
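To put the token figures in word terms, here is the arithmetic as a short sketch. It assumes the rough 0.75 words-per-token ratio quoted above, which varies by language and tokenizer; `tokens_to_words` is an illustrative helper name.

```python
WORDS_PER_TOKEN = 0.75  # rough ratio for English text; an approximation

def tokens_to_words(tokens, words_per_token=WORDS_PER_TOKEN):
    """Convert a token count to an approximate English word count."""
    return tokens * words_per_token

for tokens in (10e12, 30e12):
    words = tokens_to_words(tokens)
    print(f"{tokens / 1e12:.0f} trillion tokens ~ {words / 1e12:.1f} trillion words")
```

Under that assumption, 10-30 trillion tokens works out to roughly 7.5-22.5 trillion words.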
Where the data comes from
Five main sources, roughly in order of volume:
- Web crawls. Most foundational. Common Crawl is the main public, openly available corpus: petabytes of scraped web pages going back to 2008. Most major vendors use some form of crawl-based corpus, either Common Crawl directly or their own proprietary version. Your personal website, your LinkedIn profile, your published essays, your forum posts: if it was publicly accessible during a crawl, it was likely ingested.
- Wikipedia and reference corpora. Massive coverage of named entities (people, places, events). High signal density. Mostly clean, mostly accurate.
- Books and academic papers. Either scraped without a license (LibGen-style shadow-library scrapes were a 2022-2023 era issue), licensed through publisher deals (the post-NYT-lawsuit norm), or open-access (arXiv, public-domain works).
- Code repositories. GitHub public repos, Stack Overflow Q&A. Powers the code-completion behavior of major chatbots.
- User-generated content from the vendor's own platform. If the vendor runs a search engine, social platform, or chatbot, the user data is often used for future training, either with explicit consent or under language buried in the terms of service.
The "where it came from" question is increasingly important because of litigation. The 2023-2025 lawsuits (NYT v. OpenAI, Authors Guild, Getty v. Stability AI) are testing whether each of these sources requires an explicit license. The outcomes will reshape how vendors source training data going forward.
The pipeline: from raw text to a model
The path from a scraped web page to a fact a chatbot knows:
- Collection. Vendor's crawler hits a URL, downloads the page, stores the raw HTML.
- Cleaning. HTML stripped to text, boilerplate removed, low-quality content filtered (spam, machine-generated junk, NSFW content, duplicates).
- Tokenization. Text converted into tokens (word-fragment IDs). Trillions of tokens get indexed into a training corpus.
- Training. The model sees batches of tokens. For each batch, it predicts the next token at every position, measures the error, and adjusts its parameters. This repeats until the model has worked through trillions of tokens, running across thousands of GPUs over weeks or months.
- Reinforcement. Post-training, human raters (and now AI raters) score the model's outputs. The model is further fine-tuned to prefer high-scored responses. This is where vendors install safety filters and tone preferences.
- Deployment. The final model is deployed. Your queries go in, the model generates a response based on the patterns it learned during training.
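What this pipeline does to your personal data: somewhere around collection or cleaning (the first two steps), it gets pulled in alongside everything else. By the end of training (the fourth step), it has been compressed into adjustments spread across billions of parameters. For the most part the data is not stored in the model in any identifiable form; it has been distilled into statistical signals about language, facts, and patterns. The exception is memorization: text that appears many times in the corpus can sometimes be reproduced verbatim.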
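The cleaning and tokenization steps can be sketched with the Python standard library. This is a simplified illustration under several assumptions: the two sample pages and the name in them are invented, duplicates are detected by exact hash match only, and tokens are whole words, whereas production tokenizers split text into subword fragments. `TextExtractor`, `clean`, `deduplicate`, and `tokenize` are hypothetical names.

```python
import hashlib
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def clean(html):
    """Cleaning: strip markup and collapse whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()

def deduplicate(docs):
    """Cleaning: drop exact duplicates by hashing the lowercased text."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def tokenize(docs):
    """Tokenization: map each word or punctuation mark to an integer ID."""
    vocab, encoded = {}, []
    for doc in docs:
        pieces = re.findall(r"\w+|[^\w\s]", doc.lower())
        encoded.append([vocab.setdefault(p, len(vocab)) for p in pieces])
    return vocab, encoded

# Two invented pages with the same visible text; one carries a tracking script.
pages = [
    "<html><body><h1>Jane Doe</h1><p>Jane writes about hiking.</p></body></html>",
    "<html><body><h1>Jane Doe</h1>  <p>Jane writes about hiking.</p>"
    "<script>track()</script></body></html>",
]
texts = deduplicate([clean(page) for page in pages])
vocab, encoded = tokenize(texts)
print(texts)
print(encoded)
```

By the time text reaches training, a name on a page is just a run of integer IDs in a very long sequence. Nothing in the corpus marks it as personal data unless a filter was written specifically to look for it.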
If you've had any public web presence, you're in some training corpus. The free Delist scan shows what is also visible to the data brokers that feed future training runs.
Why deletion is technically hard
"Delete my data from the AI" sounds like it should mean: find the row in a database and remove it. With AI models, there is no row.
Once training is complete, the original text is no longer needed and is typically discarded (or kept as a corpus for the next model version). The model's "knowledge" of you exists as small adjustments to billions of parameters — not as text you can search. Removing the influence of one source from a trained model is closer to extracting one ingredient from baked bread than to deleting a file.
What vendors actually do when you request deletion:
- Suppress the output. Use reinforcement learning to make the model decline to discuss specific entities or facts. This is patchwork — the model still "knows," but is trained to refuse.
- Add an output filter. A post-generation layer that scans the response for named entities and blocks them. Imperfect — rephrasings often slip through.
- Retrain a new model without the disputed source. The cleanest fix — remove the source from the corpus, train v(n+1). Expensive (millions of dollars in compute per training run), so only done for major disputes or when the corpus is reaching its natural refresh cycle anyway.
- Stop future training on the source. Easy. Tell the crawler to skip your URL, don't include you in fine-tuning data. Applies to next model version, not current.
For most consumer deletion requests, vendors do the easy ones (stop future training, add some output suppression) and don't do the hard one (retrain). They are usually honest about this if you read the policy.
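The output-filter approach, and the reason rephrasings slip through it, can be shown with a minimal sketch. It assumes a simple name-matching filter; real vendor filters are not public and are likely more elaborate. The blocked name, the refusal text, and `filter_output` are all hypothetical.

```python
import re

BLOCKED_NAMES = ("Jane Doe",)  # hypothetical suppression list
REFUSAL = "I can't share information about that person."

def filter_output(response, blocked=BLOCKED_NAMES):
    """Return a refusal if the response contains a blocked full name."""
    for name in blocked:
        # Match the name's words in order, tolerating extra whitespace and case.
        parts = (re.escape(part) for part in name.split())
        pattern = r"\b" + r"\s+".join(parts) + r"\b"
        if re.search(pattern, response, flags=re.IGNORECASE):
            return REFUSAL
    return response

print(filter_output("Jane Doe lives in Springfield."))
print(filter_output("J. Doe, the hiker from Springfield, lives there."))
```

The first response is blocked. The second passes through unchanged because "J. Doe" never matches the listed name, even though a reader would identify the same person. The filter sits outside the model and changes nothing about what the model learned.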
The 2024-2026 regulatory pressure
Three forces are reshaping training-data practices as of 2026:
- The NYT v. OpenAI case. Filed December 2023. The New York Times sued OpenAI for using NYT content without license in training. The case is testing whether news content training falls under fair use. Has prompted OpenAI and others to start licensing news content explicitly.
- The EU AI Act. Entered into force in August 2024, with provisions phasing in through 2026-2027. Includes training-data transparency requirements and an obligation to respect copyright opt-outs, especially for providers of "general-purpose AI" (i.e., LLMs).
- State-level opt-outs. California's CCPA and other state laws have been read by some vendors to require an "AI training" opt-out for residents. This is the basis for OpenAI's, Anthropic's, and Google's user-facing opt-out flows.
The trend is clear: training data is becoming a regulated input, not a free one. Vendors are increasingly disclosing what they trained on, licensing instead of scraping, and offering opt-outs as a standard feature.
What you can actually do
Pragmatic actions, in order of effectiveness:
- Opt out of future training at each major vendor. See our AI removal hub for the per-vendor flows.
- Remove your data from the broker layer. Future trainings will pull from the open web again. Brokers like Spokeo and Whitepages will be in those crawls. Removing your data from brokers indirectly reduces your exposure in future model versions.
- Audit what's about you on the public web. Old blog posts, dead social accounts, forum posts — deleting these reduces the source material future training has to work with.
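- Use no-training-by-default services. Defaults differ by vendor and change over time, so check the current policy before relying on one. Anthropic, for example, long stated that it did not train on consumer Claude conversations by default, then moved to a user-controlled setting in 2025. Enterprise-tier services from most vendors have stronger no-training guarantees.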
- Accept the limits. Models already trained have already learned what they learned. The point of opting out is to control future versions, not unwrite past ones.
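If you run your own website, one concrete way to ask crawlers to skip it is a robots.txt file. The sketch below checks a sample file with Python's standard `urllib.robotparser`. It assumes the crawler tokens GPTBot, ClaudeBot, Google-Extended, and CCBot, which their operators have documented, but token lists change, so verify them against each vendor's current documentation. `example.com` is a placeholder, and compliance with robots.txt is voluntary on the crawler's side.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt: block AI-training crawlers, allow everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/about"  # placeholder URL
for agent in ("GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "Googlebot"):
    status = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent}: {status}")
```

With this file, the four AI-related tokens are blocked while ordinary search crawling stays allowed. As with the other opt-outs, this affects only future crawls, not pages already collected.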