How AI training data works

In short
  • Training data is the raw input AI models learn from. For modern large language models, it's roughly the public web plus licensed corpora plus, increasingly, user conversations. Trillions of words.
  • Your personal data is almost certainly in there if you've ever had a public web presence — a profile, a blog, a news mention, a public LinkedIn. The major vendors don't publish exact training-data lists.
  • Removing your data from a trained model is technically hard. The data lives compressed into model weights, not in a database. What vendors offer is opt-out from future training plus output suppression for the current model. See the AI removal hub.
Last reviewed May 2026

What "training data" actually is

A large language model like ChatGPT, Claude, or Gemini is a function with hundreds of billions of parameters. Each parameter is a number. Training is the process of adjusting those numbers until the model can do something useful — answer questions, write code, summarize text.

The adjustment is done by showing the model massive amounts of text and asking it to predict the next word. When it predicts wrong, the numbers shift slightly. After trillions of word-prediction tries, the model has internalized statistical patterns about language, facts, code, reasoning — everything that was in the training data.
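To make the predict-and-adjust loop concrete, here is a minimal sketch in Python. It is a toy, not how any vendor actually trains: the "model" is a small table of numbers (one per pair of words), the ten-word corpus is made up, and the names train_bigram and predict_next are illustrative assumptions. Real models use neural networks and sub-word tokens, but the loop has the same shape: predict, measure the error, nudge the numbers.

```python
import math

def softmax(logits):
    """Turn a row of raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_bigram(tokens, steps=300, lr=0.1):
    """Learn weights[prev][next] by repeatedly predicting the next word."""
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    size = len(vocab)
    # The parameters: plain numbers, all starting at zero.
    weights = [[0.0] * size for _ in range(size)]
    pairs = [(index[a], index[b]) for a, b in zip(tokens, tokens[1:])]
    for _ in range(steps):
        # Errors are accumulated over the whole toy corpus, then applied at once.
        grads = [[0.0] * size for _ in range(size)]
        for prev, target in pairs:
            probs = softmax(weights[prev])
            for j in range(size):
                # Cross-entropy gradient: predicted probability minus the truth.
                grads[prev][j] += probs[j] - (1.0 if j == target else 0.0)
        for i in range(size):
            for j in range(size):
                weights[i][j] -= lr * grads[i][j]
    return weights, vocab

def predict_next(weights, vocab, prev_word):
    """Return the word the trained table considers most likely to follow."""
    row = weights[vocab.index(prev_word)]
    best = max(range(len(row)), key=lambda j: row[j])
    return vocab[best]

corpus = "the cat sat on the mat and the cat ate".split()
weights, vocab = train_bigram(corpus)
print(predict_next(weights, vocab, "the"))  # "cat" follows "the" twice, "mat" once
```

Notice that after training, the corpus itself is gone from the picture. All that remains is the table of numbers, which is the point the rest of this article builds on.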

The "training data" is that input text. Other things being equal, more of it tends to produce a more capable model. The largest publicly disclosed training runs as of 2026 used somewhere in the range of 10-30 trillion tokens, where a token is roughly 0.75 of an English word. No deliberately curated collection comes close to that volume, so most of the text is necessarily scraped, ingested, and processed at scale.
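As a quick sanity check on those numbers, the token-to-word conversion can be worked through in a few lines of Python. The 0.75 ratio is the rough figure quoted above for English text; the function name is just for illustration.

```python
WORDS_PER_TOKEN = 0.75  # rough ratio for English text, as quoted above

def tokens_to_words(tokens):
    """Convert a token count into an approximate English word count."""
    return tokens * WORDS_PER_TOKEN

for tokens in (10e12, 30e12):
    words = tokens_to_words(tokens)
    print(f"{tokens / 1e12:.0f} trillion tokens is about {words / 1e12:.1f} trillion words")
```

So the 10-30 trillion token range works out to roughly 7.5 to 22.5 trillion words.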

Where the data comes from

Five main sources, roughly in order of volume:

  1. Web crawls. The foundation of most training sets. Common Crawl is the best-known openly available corpus: petabytes of scraped web pages going back to 2008. Most major vendors use some form of crawl-based corpus, either Common Crawl directly or their own proprietary version. Your personal website, your LinkedIn profile, your published essays, your forum posts: if a page was publicly accessible during a crawl, it was very likely ingested.
  2. Wikipedia and reference corpora. Massive coverage of named entities (people, places, events). High signal density. Mostly clean, mostly accurate.
  3. Books and academic papers. Either scraped without a license (LibGen-style scrapes were a 2022-2023 era issue), licensed through publisher deals (the post-NYT-lawsuit norm), or open-access (arXiv, public-domain).
  4. Code repositories. GitHub public repos, Stack Overflow Q&A. Powers the code-completion behavior of major chatbots.
  5. User-generated content from the vendor's own platform. If the vendor runs a search engine, social platform, or chatbot, the user data is often used for future training, either with explicit consent or under language buried in the terms of service.

The "where it came from" question is increasingly important because of litigation. The 2023-2025 lawsuits (NYT v. OpenAI, Authors Guild, Getty v. Stability AI) are testing whether each of these sources requires explicit license. The outcomes will reshape how vendors source training data going forward.

The pipeline: from raw text to a model

The path from a scraped web page to a fact a chatbot knows:

  1. Collection. Vendor's crawler hits a URL, downloads the page, stores the raw HTML.
  2. Cleaning. HTML stripped to text, boilerplate removed, low-quality content filtered (spam, machine-generated junk, NSFW content, duplicates).
  3. Tokenization. Text converted into tokens (word-fragment IDs). Trillions of tokens get indexed into a training corpus.
  4. Training. The model sees batches of tokens. For each batch, it predicts the next token, measures the error, and adjusts its parameters. This repeats until trillions of tokens have been processed, typically across hundreds to thousands of GPUs over weeks or months.
  5. Reinforcement. Post-training, human raters (and now AI raters) score the model's outputs. The model is further fine-tuned to prefer high-scored responses. This is where vendors install safety filters and tone preferences.
  6. Deployment. The final model is deployed. Your queries go in, the model generates a response based on the patterns it learned during training.
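Steps 2 and 3 can be sketched in a few lines of Python using only the standard library. This is an illustration under simplifying assumptions, not any vendor's pipeline: the sample pages and the name "Jane Doe" are invented, deduplication here only catches exact copies, and the tokenizer assigns one ID per whole word, whereas production tokenizers split text into sub-word fragments.

```python
import hashlib
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def html_to_text(html):
    """Cleaning: strip markup and collapse whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

def deduplicate(docs):
    """Cleaning: drop exact duplicates (case-insensitive) by content hash."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def tokenize(text, vocab):
    """Tokenization (toy): give each new word the next free integer ID."""
    return [vocab.setdefault(word, len(vocab)) for word in text.lower().split()]

pages = [
    "<html><body><h1>Jane Doe</h1><script>track();</script><p>Lives in Springfield.</p></body></html>",
    "<html><body><h1>Jane Doe</h1><script>track();</script><p>Lives in Springfield.</p></body></html>",
    "<p>Another page</p>",
]
texts = deduplicate([html_to_text(page) for page in pages])
vocab = {}
token_ids = [tokenize(text, vocab) for text in texts]
print(texts)      # two documents survive; the duplicate page is dropped
print(token_ids)  # each document is now just a list of integers
```

By this point a page about a person has already become a list of integers mixed in with everything else, which is what the training step consumes.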

What this pipeline does to your personal data: somewhere around step 1 or 2, it gets pulled in alongside everything else. By the end of step 4, it has been compressed into adjustments spread across hundreds of billions of parameters. For the most part the data no longer sits in the model in a searchable, identifiable form; it has been distilled into statistical signals about language, facts, and patterns, although models can sometimes reproduce passages they saw many times.

If you've had any public web presence, you're in some training corpus. The free Delist scan tells you what's also visible to the data brokers that feed future trainings.


Why deletion is technically hard

"Delete my data from the AI" sounds like it should mean: find the row in a database and remove it. With AI models, there is no row.

Once training is complete, the original text is no longer needed and is typically discarded (or kept as a corpus for the next model version). The model's "knowledge" of you exists as small adjustments to billions of parameters — not as text you can search. Removing the influence of one source from a trained model is closer to extracting one ingredient from baked bread than to deleting a file.

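What vendors can do when you request deletion falls into three categories: stop using your data in future training, suppress outputs about you in the current model, or retrain the model without your data.

For most consumer deletion requests, vendors do the two easy ones (stop future training, add some output suppression) and don't do the hard one (retrain). They are usually honest about this if you read the policy.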
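Output suppression is the easiest of these to picture in code. The sketch below is a hypothetical filter, not any vendor's actual implementation: the function name, the refusal message, and the example name are all assumptions made for illustration. It checks a model's response against a list of blocked names and swaps in a refusal if one appears.

```python
import re

# Hypothetical refusal text; real products word this differently.
REFUSAL = "I can't share information about that person."

def build_suppression_filter(blocked_names):
    """Return a function that hides any response mentioning a blocked name."""
    patterns = [
        re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for name in blocked_names
    ]

    def apply(model_output):
        # The model still produced the text; the filter only decides what is shown.
        if any(pattern.search(model_output) for pattern in patterns):
            return REFUSAL
        return model_output

    return apply

suppress = build_suppression_filter(["Jane Q. Example"])
print(suppress("Jane Q. Example lives on Elm Street."))
print(suppress("The weather is nice today."))
```

The design makes the limitation visible: the filter sits in front of the model and changes what you see, while the parameters behind it are untouched. That is why suppression is cheap and retraining is not.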

The 2024-2026 regulatory pressure

Three forces are reshaping training-data practices as of 2026:

  1. The NYT v. OpenAI case. Filed December 2023. The New York Times sued OpenAI for using NYT content without license in training. The case is testing whether news content training falls under fair use. Has prompted OpenAI and others to start licensing news content explicitly.
  2. The EU AI Act. Entered into force 2024, with provisions phasing in through 2026-2027. Includes specific transparency and opt-out requirements for training data, especially for "general-purpose AI" (i.e., LLMs).
  3. State-level opt-outs. California's CCPA and other state laws have been read by some vendors to require an "AI training" opt-out for residents. This is the basis for OpenAI's, Anthropic's, and Google's user-facing opt-out flows.

The trend is clear: training data is becoming a regulated input, not a free one. Vendors are increasingly disclosing what they trained on, licensing instead of scraping, and offering opt-outs as a standard feature.

What you can actually do

Pragmatic actions, in order of effectiveness:

  1. Opt out of future training at each major vendor. See our AI removal hub for the per-vendor flows.
  2. Remove your data from the broker layer. Future trainings will pull from the open web again. Brokers like Spokeo and Whitepages will be in those crawls. Removing your data from brokers indirectly reduces your exposure in future model versions.
  3. Audit what's about you on the public web. Old blog posts, dead social accounts, forum posts — deleting these reduces the source material future training has to work with.
  4. Use no-training-by-default services. Anthropic has stated a policy of not training on user conversations by default, but vendor defaults change, so confirm the current policy before relying on it. Enterprise-tier services from most vendors have stronger no-training guarantees.
  5. Accept the limits. Models already trained have already learned what they learned. The point of opting out is to control future versions, not unwrite past ones.

Frequently asked questions

Can a trained AI model actually "forget" someone?
Not in the way most people imagine. Model weights compress training data into a high-dimensional statistical representation. There is no row to delete. What vendors do instead is (1) use reinforcement learning to train the model not to surface specific facts, (2) add output filters that catch the named entity, or (3) retrain a new model without the disputed data. Option 3 is the closest to actual deletion, and it only applies to future model versions.
Was my data definitely in ChatGPT's or Claude's training data?
If you have ever had a public web presence — a personal website, a LinkedIn profile, a Twitter/X account, mentions in news articles — almost certainly yes. The major models were trained on web crawls (Common Crawl, etc.) that swept up most of the public web. Vendors do not publish exact training data lists, so individual confirmation is impossible. The safe assumption is yes.
Is web-scraping for AI training legal?
Unsettled. US copyright law has fair-use defenses that vendors invoke for training-data use. The NYT v. OpenAI case (filed Dec 2023) and the Authors Guild v. OpenAI case are testing whether those defenses hold. The EU AI Act (2024) imposes more specific transparency and opt-out obligations. The legal landscape is shifting: vendors are increasingly licensing data rather than relying purely on scrape-and-pray.
Do AI vendors train on my conversations with their chatbot?
Varies. Some vendors train on free-tier conversations by default and require you to opt out (this has been OpenAI's default for consumer ChatGPT). Some vendors do not train on conversations by default (Anthropic's stated policy). Enterprise/API users typically have stronger no-training guarantees. Defaults change over time, so read the specific vendor's current data-handling policy.
What's the difference between fine-tuning and training from scratch?
Training from scratch means starting with a randomly-initialized model and feeding it the full training corpus (billions to trillions of tokens). Fine-tuning takes an already-trained base model and adapts it on a smaller, task-specific dataset. Fine-tuning data is often where the most personally-sensitive corpora appear — proprietary support transcripts, customer-service logs, internal documents.

You can't unwrite training data. You can shape what's next.

Delist files AI opt-outs across major vendors plus the broker layer that feeds future trainings. Free scan first.
