TL;DR
- Wikipedia made up roughly 3% of GPT-3's weighted training data, but English Wikipedia alone is only 0.6% of the raw tokens by volume. The disproportionate weight is the giveaway: model builders trust it more than they trust the open web.
- Meta's LLaMA paper pulls about 4.5% of its training tokens from Wikipedia, despite Common Crawl supplying ~67% of the raw data. Quality beats quantity in the weighting.
- Approximately 1,000 to 1,500 articles get deleted from Wikipedia every single day. Most are about companies, people, or products that fail the notability bar.
- The Wikimedia Foundation reported that human pageviews dropped roughly 8% in October 2025 compared to a year earlier, which it blames on AI summaries. So traffic is down, but citation influence is up.
- Wikipedia notability for companies requires "significant coverage in multiple reliable secondary sources that are independent of the subject". Press releases, founder interviews, and trade magazine puffery do not count.
- Articles for Creation (the proper submission route for anyone with a conflict of interest) currently has thousands of pending drafts. Reviews can take days to months.
- If you have a real chance at notability, the page is one of the most defensible AI-search assets you can own. If you don't, attempting it will burn budget and make future attempts harder.
Why I'm writing this (and who it's actually for)
I run a link-building agency that focuses on AI search visibility, and Wikipedia comes up in almost every strategy call with a B2B founder right now. The question is some version of: "Can you get us a Wikipedia page so ChatGPT mentions us?"
It is the right instinct. It is also the question with the highest mismatch between client expectation and actual outcome that I deal with. I have helped two clients pursue Wikipedia pages in the last 18 months. One survived. One got speedily deleted within 36 hours and the editor who flagged it left a polite, withering summary on the talk page that I still think about.
This post is the honest version. Not the agency pitch. Not the "10 easy steps" listicle. The mechanics of why Wikipedia matters so much for LLMs, what counts as notability for a company, how the deletion queue actually works, and the specific situations where you should not bother.
If you are a founder, marketer, or PR lead thinking about this, read it before you spend a dollar.
The training-data math that makes Wikipedia matter
Here is the part that explains the obsession. When OpenAI published the GPT-3 paper in 2020, the training data breakdown they disclosed looked like this:
- Common Crawl (filtered): 60% of weighted tokens
- WebText2: 22%
- Books1: 8%
- Books2: 8%
- Wikipedia: 3%
Three percent looks small until you put it next to the raw volume. English Wikipedia, all 7 million articles and roughly 5 billion words (per Pew Research's 25th anniversary breakdown), made up only about 0.6% of GPT-3's raw token pool. In effect, OpenAI oversampled it by roughly 5x.
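The arithmetic is easy to check from the paper's own table. A quick sketch using the disclosed token counts (in billions); the only assumption is that the published figures are complete:

```python
# Raw token counts (billions) and sampling weights as disclosed in the GPT-3 paper
datasets = {
    "Common Crawl (filtered)": (410, 0.60),
    "WebText2": (19, 0.22),
    "Books1": (12, 0.08),
    "Books2": (55, 0.08),
    "Wikipedia": (3, 0.03),
}

total_raw = sum(tokens for tokens, _ in datasets.values())  # ~499B raw tokens

for name, (tokens, weight) in datasets.items():
    raw_share = tokens / total_raw
    print(f"{name}: raw {raw_share:.1%} -> weighted {weight:.0%} "
          f"({weight / raw_share:.1f}x oversampling)")

# Wikipedia comes out at: raw 0.6% -> weighted 3% (5.0x oversampling)
```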
That sampling decision is the entire story. Wikipedia gets oversampled because models trained on raw Common Crawl alone produce garbage. Encyclopaedic, neutral, well-cited prose teaches models how to talk about entities. Brands. People. Places. Concepts.
Meta's LLaMA paper, which Mozilla Foundation analysed in detail in its Common Crawl training-data report, shows a similar pattern. Common Crawl supplies about 67% of LLaMA's training data. Wikipedia supplies 4.5%. The Mozilla report goes further and notes that "at least 64% of 47 text generation models reviewed (30 models) used filtered versions of Common Crawl" between 2019 and October 2023. Common Crawl is the volume. Wikipedia is the quality signal.
This is why a Wikipedia page is the closest thing to a structural moat in AI search. The page itself is in the training data. The redirects, infobox, and Wikidata entity feed Google's Knowledge Graph. When ChatGPT or Gemini answers "who is [your brand]", a Wikipedia entry massively biases the model toward producing coherent, factual output about you instead of hallucinating.
I have written about the entity graph mechanics in more depth here, and the practical implications for LLM citations sit in this companion post.
What counts as notability (the actual policy, not the listicle version)
Every agency blog about Wikipedia says "you need notability". Almost none of them quote the actual policy. So let me do that, because it changes how you think about it.
The general notability guideline says: "A topic is presumed to be suitable for a stand-alone article or list when it has received significant coverage in reliable sources that are independent of the subject."
Three phrases do almost all of the work in that sentence.
Significant coverage. The policy defines this as coverage that "addresses the topic directly and in detail, so that no original research is needed to extract the content". A two-line quote from your founder in a roundup article is not significant coverage. A 1,200-word feature in a national newspaper analysing your business model is.
Reliable sources. The reliable sources guideline is brutal: it explicitly rejects "self-published materials", "user-generated content sites", "predatory journals", and any content from large language models. Most trade-magazine pieces that look authoritative on the surface fail the editorial-independence test once a reviewer pokes at them.
Independent of the subject. Press releases, sponsored content, paid-for awards, and founder interviews (where the founder is just talking about themselves) do not count. The company-specific notability guideline explicitly excludes "press releases, press kits, or similar public relations materials", "paid or sponsored articles", and "content written or published by the organization, its members, or sources closely associated with it".
When I audit a client's media coverage before deciding whether a Wikipedia attempt is even worth it, I run through this filter (sketched as code below the list):
- Strip out anything from a press wire (PRWeb, PRNewswire, BusinessWire).
- Strip out anything that quotes only the founder or company spokespeople.
- Strip out anything from sponsored content sections (the "BrandVoice" on Forbes is the most common offender).
- Strip out anything in trade publications that does not show editorial independence (most of them fail).
- Strip out anything where the journalist clearly worked from a press release with one quote added.
What is left is your real notability evidence. For most B2B SaaS companies I have audited, the answer is somewhere between zero and three pieces. The bar for a successful Wikipedia article is usually five or more solid, independent secondary sources covering the subject directly and in depth.
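If you would rather run that audit over a spreadsheet export than by eyeball, here is a minimal sketch of the filter as code. The fields and flags are my own heuristics, not anything Wikipedia publishes; real reviewers apply judgment, not booleans:

```python
from dataclasses import dataclass

@dataclass
class CoverageItem:
    outlet: str
    is_wire_or_release: bool       # PRWeb, PRNewswire, BusinessWire, etc.
    is_sponsored: bool             # "BrandVoice" and similar paid sections
    founder_quotes_only: bool      # no reporting beyond company voices
    editorially_independent: bool  # outlet has real editorial standards
    in_depth: bool                 # addresses the subject directly and in detail

def notability_evidence(items: list[CoverageItem]) -> list[CoverageItem]:
    """Apply the five strip-out rules; what survives is your evidence."""
    return [
        item for item in items
        if not item.is_wire_or_release
        and not item.is_sponsored
        and not item.founder_quotes_only
        and item.editorially_independent
        and item.in_depth
    ]

coverage = [
    CoverageItem("National newspaper feature", False, False, False, True, True),
    CoverageItem("Forbes BrandVoice post", False, True, False, False, False),
    CoverageItem("PRNewswire release", True, False, True, False, False),
]
print(f"{len(notability_evidence(coverage))} piece(s) of real evidence")  # 1
```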
This is also where digital PR becomes load-bearing. Earned coverage that survives this filter is the foundation. I wrote about how to build that kind of coverage in my 2026 digital PR and link-building strategy post, and the unlinked-brand-mentions piece is also relevant because notability-grade mentions almost always come without a backlink.
The deletion queue is real, and brutal
I mentioned the speedy-delete on a client earlier. Let me unpack what actually happened, because the mechanics surprise most founders.
Wikipedia processes roughly 1,000 to 1,500 article deletions every day. That is not an estimate I made up; it is roughly what the deletion log records show across a typical week. A meaningful chunk of those are company and brand pages flagged within hours of creation.
There are several ways a new article gets killed:
Speedy deletion (CSD). An administrator deletes it within minutes or hours. Common triggers include unambiguous promotion (G11), copyright violation (G12), or no credible claim of significance (A7, which covers companies). The deletion is unilateral; you do not get a debate.
Proposed deletion (PROD). A single editor proposes deletion, and if nobody contests it within seven days, the article is gone. This is most common for borderline cases.
Articles for Deletion (AfD). A full debate runs for seven days. Other editors weigh in with "keep" or "delete" votes citing policy. An administrator closes the discussion based on the strength of arguments, not vote counts.
The client of mine that got speedy-deleted had committed three of the classic mistakes. The article had promotional language ("a leading provider of", "innovative platform", "trusted by"). It cited sources that included two press releases, the company's own About page, and a sponsored article. And it was created from an account that had been making edits exclusively to that one article, which is a five-alarm signal for paid editing.
The second client survived. The differences were specific and instructive: independent journalist features in two national publications and one major industry publication, all written without input from the company. The draft went through Articles for Creation, which is the slow-but-survivable path. The first reviewer declined for tone. We rewrote in encyclopaedic register. The second reviewer accepted. From submission to live page took roughly six weeks.
If your draft does survive, you are not done. New articles sit in the New Pages Patrol queue, where established editors check specifically for promotional content. I have seen brand pages survive creation and then get gutted three months later when a sharp-eyed editor strips out the marketing copy and challenges half the sources.
This is also where the connection to E-E-A-T thinking becomes useful, although in reverse: Wikipedia's editorial standards are essentially E-E-A-T enforced at maximum strictness by humans rather than algorithms. If your content cannot stand up to that scrutiny, it should not be there.
Conflict of interest, paid editing, and why agencies that promise Wikipedia pages are a red flag
The Wikipedia conflict of interest policy is unambiguous. If you are paid to edit (or even to coordinate edits), you must disclose three things: your employer, your client, and any other relevant affiliation. The disclosure has to appear on your user page, on the talk page of any article you affect, or in your edit summaries, and the COI guideline expects you to disclose "whenever you discuss the topic".
Undisclosed paid editing is a policy violation. Editors who do it get their accounts blocked and their articles deleted, and in some cases their entire edit history is reviewed and reverted. There are public investigations into agencies running undisclosed editing services. These investigations occasionally result in mass blocks and the deletion of every article the agency ever touched.
Which brings me to the agencies promising "guaranteed" Wikipedia pages for a flat fee. Many of them are running exactly the model that policy bans. I have seen pitches offering "Wikipedia listing service" for $1,500 to $10,000, where the playbook is a sockpuppet account creating the page, a friendly-looking edit history to disguise the conflict, and zero disclosure. When the article gets killed (and they do, eventually), the agency keeps the money and the brand loses the chance to ever try again cleanly.
The legitimate path is slower and not guaranteed. You disclose. You submit through Articles for Creation. You write in neutral tone. You let independent editors decide. If you cannot get to notability honestly, you do not have a Wikipedia page. That is the entire system functioning as designed.
I cover this trade-off more broadly in my defensive AI brand-narrative post, because it is the same logic: shortcut tactics that look like wins compound into long-term liabilities once platforms get smarter.
How to actually write a draft that survives review
Assume you have done the work. You have five or more genuinely independent, in-depth pieces of coverage. The notability bar is plausibly met. The draft itself still has to survive review. Here is the structure I use, based on what reviewers actually look for.
1. Open with a definitional, neutral lead
The first sentence should define what the subject is in encyclopaedic terms, in one tight sentence. Not "The company is a leading provider of". More like: "[Company name] is a [country]-based [type of company] founded in [year], headquartered in [city]". That is it.
2. Use sentence case in headings and follow Manual of Style
The Manual of Style requires sentence case for headings (not title case), straightforward language, and consistency within the article. Reviewers spot Title Case Headings instantly and read them as promotional.
3. Cite secondary sources at the sentence level, not the paragraph level
Every factual claim should have an inline citation. If you write "the company raised a Series B in 2024", the citation should sit on that sentence. Reviewers scan for unsupported claims and remove them.
4. Avoid every banned promotional word
Words like "leading", "innovative", "best-in-class", "award-winning", "renowned", "trusted", "world-class", "cutting-edge". Strip every single one. If a fact is notable, the cited source will say it; you do not need to label it.
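A trivial pre-submission check you can script. The word list below mirrors the examples above; extend it with anything from Wikipedia's "words to watch" guideline:

```python
import re

# Promotional "peacock" terms that reviewers flag on sight
PUFFERY = [
    "leading", "innovative", "best-in-class", "award-winning",
    "renowned", "trusted", "world-class", "cutting-edge",
]

def flag_puffery(draft: str) -> list[tuple[str, int]]:
    """Return (term, character offset) for every puffery hit in the draft."""
    hits = []
    for term in PUFFERY:
        for match in re.finditer(rf"\b{re.escape(term)}\b", draft, re.IGNORECASE):
            hits.append((term, match.start()))
    return sorted(hits, key=lambda hit: hit[1])

draft = "Acme is a leading provider of innovative, award-winning software."
for term, offset in flag_puffery(draft):
    print(f"strip '{term}' at offset {offset}")
```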
5. Cover the negative and the boring
Encyclopaedic articles cover controversies, criticism, failed product lines, and ordinary corporate facts. If your draft reads like a brochure (only positives, no inconvenient facts), reviewers know. Include the lawsuit. Include the discontinued product. Include the CEO transition. It is what makes the article feel neutral.
6. Disclose conflicts on your user page and the article talk page
This is non-negotiable. The disclosure templates are public. Use them. Reviewers check.
7. Submit through Articles for Creation, not directly to the mainspace
For a conflict-of-interest editor, AfC is the only legitimate path. Direct mainspace creation by a COI account is the fastest way to get deleted and flagged.
What success actually does for your AI search visibility
This is the question that pays the bills, so I want to be careful with it.
A Wikipedia page is not a ranking factor in the classical sense. Google does not hand you a ranking bonus for having one. What it does is feed three other systems that compound:
The Knowledge Graph. Google's Knowledge Graph pulls structured data from Wikidata, which in turn pulls structured data from Wikipedia infoboxes. A clean Wikipedia article almost guarantees a Knowledge Panel for your brand in Google search. That panel is what shows up on the right-hand side of branded SERPs and is the most prominent way Google says "this is a real entity".
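You can check whether Google already treats you as an entity via the Knowledge Graph Search API. A minimal sketch; it assumes you have an API key from Google Cloud, and `YOUR_BRAND` is a placeholder:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: a Google Cloud API key
BRAND = "YOUR_BRAND"      # placeholder: your brand name

# Knowledge Graph Search API: returns entities Google recognises
resp = requests.get(
    "https://kgsearch.googleapis.com/v1/entities:search",
    params={"query": BRAND, "key": API_KEY, "limit": 3},
    timeout=10,
)
resp.raise_for_status()

for element in resp.json().get("itemListElement", []):
    result = element["result"]
    print(result.get("name"), "|", result.get("description", "no description"),
          "| score:", element.get("resultScore"))
```

No result for your brand name is a strong signal the entity work has not been done yet.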
LLM training corpora. Future model retrains will include the article. So will all the downstream derivatives, fine-tunes, and embedding datasets that draw from Wikipedia. Once you are in, you are in for a long time. I wrote about why models lean so heavily on this kind of early-positioned content in my first-500-words study.
RAG retrieval at inference time. This is the underrated one. ChatGPT, Perplexity, and Gemini all run live retrieval as part of answering queries. Wikipedia is one of the most frequently retrieved sources because the models trust it. When a user asks about your brand, a Wikipedia hit gets pulled into the context window and informs the answer. That is the closest thing to a direct citation effect.
I have seen real examples of this in client work. After one client's Wikipedia article went live, their share of voice in ChatGPT responses to category-level queries ("best [their category] companies") roughly doubled over the following two months. Causation is hard to prove cleanly, but the timing was unambiguous. For more on how AI citation visibility actually moves, this post on AI search platform citation strategy is the deeper dive.
The traffic story is more complicated. Wikipedia traffic to brand pages is small. People searching your brand name on Google rarely click through to Wikipedia; they click your site. So you should not expect direct referral traffic to move. The win is upstream, in how AI systems and Knowledge Graph treat you as an entity.
And worth noting: Wikipedia's own pageviews are softening. The Wikimedia Foundation reported that human pageviews were down roughly 8% in October 2025 versus the same month in 2024, with the foundation attributing the trend to generative AI and AI search summaries. So in a sense, fewer humans read Wikipedia directly, but more AI systems read it and re-serve it. The leverage shifted.
When you should not bother
This is the part nobody talks about. Most brands should not attempt a Wikipedia page. Here are the situations where I tell clients to drop the idea.
You have fewer than three pieces of significant, independent coverage. Without that foundation, the article will not pass review. Spending time on the draft is wasted time. Build the coverage first; if you can earn five or more substantial features in the next 12 months, revisit.
Your only coverage is in trade publications. Wikipedia treats trade publications as borderline. If 100% of your coverage is in industry-only outlets, the notability case is fragile. Mix in mainstream business or news coverage first.
Your category is heavily contested by Wikipedia editors. Some categories (crypto projects, MLM-adjacent businesses, supplement brands, some consumer SaaS) are under heightened scrutiny because of past abuse. Editors are faster to delete, slower to approve, and more sceptical of sources. If your category fits this profile, expect a tougher ride.
You operate in a YMYL space without serious editorial coverage. Wikipedia is even more careful about medical, legal, and financial topics, in line with the same logic Google uses for YMYL editorial standards. If you are a health or finance brand, your sources have to be exceptional.
You have active legal or reputational controversies. Wikipedia articles attract editors who will surface every negative fact. If your goal is brand polish, a Wikipedia article will not give you that and may give you the opposite. Some brands have looked at their potential Wikipedia article and concluded they preferred not to have one.
You cannot commit to long-term maintenance. A Wikipedia article is not a fire-and-forget asset. Things change. Editors update content (sometimes badly). Vandalism happens. Someone needs to watch the page and engage on the talk page when issues arise. If nobody on your team can own this, the article will drift.
In each of these cases, your AI search visibility is better served by other tactics first: building citations on Reddit and forums (covered here), getting referenced in YouTube videos and transcripts, securing high-quality earned media, and making your own site easier for AI agents to cite.
Wikipedia is the capstone, not the foundation.
The Grokipedia and AI-encyclopedia question
A quick note on the new entrants. Grokipedia and other AI-generated encyclopedia projects have started to appear, with claims they will rival Wikipedia for brand visibility in AI answers. My honest read, based on what I have seen actually cited in LLM outputs, is that none of them have come close to displacing Wikipedia yet. The training-data inertia is too strong, and the existing trust signals (decades of editorial scrutiny, Wikidata integration, Knowledge Graph hooks) are not replicable in a year.
I wrote a longer breakdown of the Grokipedia question and what it means for SEO here. The short version: keep watching it, but do not redirect Wikipedia budget toward it yet.
A specific workflow for assessing and executing
If you have read this far and still want to attempt it, here is the seven-step workflow I run for clients. This is not theoretical; it is the actual order I work through.
Audit existing coverage against the notability filter. Strip out press releases, sponsored content, founder-quoted-only pieces, and trade publications without clear editorial independence. Count what remains. You need a minimum of three to five pieces, ideally more.
Identify the gaps and run earned-media outreach to close them. This is usually a 6 to 12 month effort. Real journalism takes time. Build relationships, pitch substantive angles, not press-release dressing.
Build out the Wikidata entity first. Wikidata is more permissive than Wikipedia. You can create an entity for your company, link it to authoritative sources, and start feeding the Knowledge Graph even before a Wikipedia article exists. This is a useful early move.
Draft in the Articles for Creation sandbox, not the mainspace. Use a registered account with your conflict of interest disclosed on the user page. Write in encyclopaedic tone. Cite secondary sources sentence-by-sentence.
Pre-review with an experienced editor. There are legitimate Wikipedia consultants (not the volume-output agencies) who will review your draft for tone, sourcing, and policy compliance before you submit. This is worth the cost.
Submit and wait. Expect at least one decline. Address the feedback specifically. Resubmit. Repeat. Most successful articles go through two or three review cycles.
Set up monitoring once live. Add the article to your watchlist. Subscribe to talk-page notifications. When edits happen, evaluate whether they are legitimate improvements (often yes) or vandalism (occasionally). Respond on the talk page, not by reverting directly, when you have a COI.
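If you want monitoring beyond the built-in watchlist, the MediaWiki API exposes revision history directly. A minimal sketch that pulls the latest edits to an article; the title and the contact address in the User-Agent are placeholders:

```python
import requests

ARTICLE = "Example"  # placeholder: your article's exact title

# MediaWiki Action API: fetch the ten most recent revisions
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "revisions",
        "titles": ARTICLE,
        "rvlimit": 10,
        "rvprop": "timestamp|user|comment",
        "format": "json",
        "formatversion": 2,
    },
    headers={"User-Agent": "article-watch/0.1 (you@example.com)"},
    timeout=10,
)
resp.raise_for_status()

for page in resp.json()["query"]["pages"]:
    for rev in page.get("revisions", []):
        print(rev["timestamp"], rev["user"], "-", rev.get("comment", ""))
```

Pipe that into a daily job and diff against yesterday's output, and you will know about edits before a client does.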
The whole process from "start of earned-media campaign" to "live, stable Wikipedia article" typically runs 9 to 18 months for a B2B company without massive pre-existing coverage. Anyone promising faster on a flat-fee basis is selling something that will break.
What to do this week
If you are considering Wikipedia for AI search visibility, do these three things before you spend any other money on it.
Run the notability audit yourself. Pull every piece of media coverage you have. Apply the filter (independent, secondary, in-depth). Count what is left. Be honest. If the count is under three, your priority is not Wikipedia, it is earned media.
Create a Wikidata entity if you do not have one. This is allowed, lower-friction, and starts feeding the Knowledge Graph immediately. You can do this without notability for a full Wikipedia article. Add your founding date, headquarters, key people, official identifiers, and authoritative source links.
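Before creating anything, check whether an item already exists; duplicates get merged or deleted. A minimal sketch against Wikidata's public search endpoint, with `YOUR_BRAND` as a placeholder:

```python
import requests

BRAND = "YOUR_BRAND"  # placeholder: your brand name

# Wikidata's search endpoint: lists existing items matching the label
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "wbsearchentities",
        "search": BRAND,
        "language": "en",
        "format": "json",
    },
    headers={"User-Agent": "entity-check/0.1 (you@example.com)"},
    timeout=10,
)
resp.raise_for_status()

for item in resp.json().get("search", []):
    print(item["id"], "|", item.get("label"), "|", item.get("description", ""))
```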
Audit your existing AI search visibility before assuming Wikipedia is the missing piece. Run brand-name queries in ChatGPT, Perplexity, Gemini, and Google AI Mode. See what they currently say about you and which sources they cite. If you are already getting cited well from your own site and earned media, a Wikipedia article will compound that. If you are getting nothing or misinformation, fix the foundation first. A free audit on this kind of thing is here if that helps.
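For the ChatGPT leg of that audit you can script the queries rather than typing them by hand. A minimal sketch using the OpenAI Python SDK; the model name and the query set are assumptions, so swap in whatever you actually use:

```python
from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY in the env

client = OpenAI()
BRAND = "YOUR_BRAND"  # placeholder: your brand name

queries = [
    f"What is {BRAND}?",
    f"What does {BRAND} do, and who are its competitors?",
    f"Is {BRAND} a reputable company?",
]

for query in queries:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any current chat model works here
        messages=[{"role": "user", "content": query}],
    )
    print(f"Q: {query}\nA: {response.choices[0].message.content}\n")
```

Run it monthly and keep the transcripts; visibility changes are easier to argue for when you can show the before and after.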
Wikipedia is a long, slow, high-skill play. When it works, it is one of the strongest AI search assets you can own. When it does not, it absorbs budget and energy and produces nothing.
Know which situation you are in before you start.