AI Search · 2 April 2026 · 17 min read

Wikipedia Is the #1 Training Source for ChatGPT. Here's How to Get Your Brand Listed Without Getting Reverted.

Priyanshu Bisht

SEO Executive

Almost every B2B founder we get on a strategy call right now asks us some version of the same thing. "Can you get us a Wikipedia page so ChatGPT mentions us?"

It is a smart instinct. It is also the question with the biggest gap between what clients expect and what actually happens. We have helped two clients chase a Wikipedia page in the last 18 months. One survived. One got speedily deleted inside 36 hours, and the editor who flagged it left a polite, withering note on the talk page that one of our team still quotes occasionally.

So this is the honest version of the Wikipedia, brand SEO and LLM citations conversation. Not the agency pitch. Not a "10 easy steps" listicle written by someone who has never submitted a draft. Why Wikipedia matters so much to large language models, what notability actually means for a company, how the deletion queue really works, and the specific situations where you should not bother at all.

The training-data maths that makes Wikipedia matter

Here is the bit that explains the obsession. When OpenAI published the GPT-3 paper, the training-data breakdown was right there in the open. According to the figures on the GPT-3 Wikipedia entry, the weighted training mix was Common Crawl at 60%, WebText2 at 22%, two book corpora at 8% each, and Wikipedia at 3%.

Three percent sounds tiny. Then you look at the raw token counts. Wikipedia supplied 3 billion tokens out of roughly 499 billion in the pool. By raw volume it is well under one percent. OpenAI multiplied its sampling weight by roughly five. They deliberately fed the model more Wikipedia than its size warranted.
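
The arithmetic is easy to check for yourself. A minimal sketch in Python, using only the figures quoted above (3 billion of roughly 499 billion tokens, and a 3% sampling weight); the variable names are ours, chosen for illustration:

```python
# Figures as quoted above for the GPT-3 training mix.
wikipedia_tokens = 3e9    # tokens contributed by Wikipedia
total_tokens = 499e9      # approximate size of the whole pool
sampling_weight = 0.03    # Wikipedia's share of the weighted training mix

raw_share = wikipedia_tokens / total_tokens   # share by raw volume
oversampling = sampling_weight / raw_share    # how far the weight exceeds raw volume

print(f"raw share: {raw_share:.2%}")               # raw share: 0.60%
print(f"oversampling factor: {oversampling:.1f}x")  # oversampling factor: 5.0x
```

A raw share of about 0.6% against a 3% weight is where the "roughly five" comes from.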

That weighting decision is the whole story. Models trained on raw Common Crawl alone produce mush. Neutral, well-cited, encyclopaedic prose teaches a model how to talk about entities. Brands, people, places, concepts. Wikipedia is the cleanest large source of exactly that, so it gets oversampled.

You see the same pattern in Meta's LLaMA. The LLaMA paper's data table shows CommonCrawl at 67% and Wikipedia at 4.5% of a roughly 1.4 trillion token dataset, with Wikipedia and books being the only sources run through about two epochs rather than one. Translation: the model was shown Wikipedia twice while it saw most of the web once. Quality gets the second look.

To be clear about the bigger picture, the Mozilla Foundation's Common Crawl research found that Common Crawl made up more than 80% of the raw tokens in GPT-3's training pool (the 60% figure above is its sampling weight, not its raw volume), and that at least 30 of 47 text-generation models reviewed between 2019 and October 2023 used filtered versions of it. Common Crawl is the volume. Wikipedia is the trust signal layered on top.

This is why a Wikipedia page is the closest thing to a structural moat in AI search. The page sits in the training data. The infobox and Wikidata entity feed Google's Knowledge Graph. When ChatGPT or Gemini gets asked "who is [your brand]", a Wikipedia entry heavily biases the model towards coherent, factual output about you instead of a confident hallucination. We dig into the wiring of this in our piece on knowledge graphs and entity optimisation for AI search, and the practical citation mechanics live in our guide to getting your brand into AI answers.

What is Wikipedia notability for a company?

Notability is the bar a subject has to clear to deserve its own article. For a company, it means the world has written about you independently and in depth, not that you have a logo and a funding round.

Every agency blog says "you need notability" and then never quotes the policy. So here it is. The general notability guideline states that a topic is "presumed to be suitable for a stand-alone article or list when it has received significant coverage in reliable sources that are independent of the subject." Three phrases do almost all the work.

Significant coverage means material that "addresses the topic directly and in detail, so that no original research is needed to extract the content," and it is explicitly "more than a trivial mention." A two-line founder quote in a roundup does not count. A 1,200-word feature pulling your business model apart does.

Independent of the subject excludes anything "produced by the article's subject or someone affiliated with it." The guideline names "advertising, press releases, autobiographies, and the subject's website" outright.

The company-specific guideline is even less forgiving. The organisations and companies notability guideline rules out "press releases, press kits, or similar public relations materials," states that "only unpaid sources count," and specifically calls out paid or sponsored articles, including the contributor platforms on outlets like Forbes "that do not provide meaningful editorial oversight." If that is where your best coverage lives, you do not have coverage.

When we audit a client's media before deciding whether a Wikipedia attempt is even worth attempting, we run this filter:

  1. Strip out anything from a press wire (PRWeb, PRNewswire, BusinessWire).
  2. Strip out anything that only quotes the founder or company spokespeople.
  3. Strip out sponsored sections (the "BrandVoice"-style contributor slots are the usual culprits).
  4. Strip out trade pieces with no real editorial independence (most fail).
  5. Strip out anything where a journalist clearly worked from a press release with one quote bolted on.

Whatever survives is your actual notability evidence. For most B2B SaaS companies we have assessed, that number lands between zero and three. A page that holds up usually needs five or more solid, independent, in-depth secondary sources.
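
If you keep your coverage list in a spreadsheet, the five-step filter above can be sketched in a few lines of Python. This is an illustration, not a tool we ship: the `Coverage` record and its field names are hypothetical, and every flag is a human judgement made by whoever read the article, not something the code detects.

```python
from dataclasses import dataclass

# Hypothetical record for one piece of coverage. Each flag is set by hand
# after reading the piece; nothing here inspects the article itself.
@dataclass
class Coverage:
    title: str
    from_press_wire: bool = False             # step 1: PRWeb, PRNewswire, BusinessWire
    only_quotes_company: bool = False         # step 2: founder or spokesperson quotes only
    sponsored: bool = False                   # step 3: paid or contributor slot
    trade_without_independence: bool = False  # step 4: no real editorial independence
    press_release_rewrite: bool = False       # step 5: release plus one bolted-on quote

def survives_filter(item: Coverage) -> bool:
    """True only if none of the five strip-out conditions applies."""
    return not any([
        item.from_press_wire,
        item.only_quotes_company,
        item.sponsored,
        item.trade_without_independence,
        item.press_release_rewrite,
    ])

def notability_evidence(items: list[Coverage]) -> list[Coverage]:
    """Return the pieces that are left after all five steps."""
    return [item for item in items if survives_filter(item)]

# Invented example data.
coverage = [
    Coverage("Funding announcement on a press wire", from_press_wire=True),
    Coverage("Roundup with a two-line founder quote", only_quotes_company=True),
    Coverage("Independent 1,200-word feature on the business model"),
]
evidence = notability_evidence(coverage)
print(len(evidence), "of", len(coverage), "pieces survive")
```

The count that comes out is only as honest as the flags that go in, so be harsh when you set them.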

This is exactly where digital PR stops being a nice-to-have and becomes load-bearing. Earned coverage that survives the filter is the foundation. We covered how to build it in our link building framework, and our take on white hat link building is relevant too, because notability-grade mentions almost always arrive without a link attached.

The deletion queue is real, and it is brutal

Back to the client we lost in 36 hours, because the mechanics catch most founders off guard.

There are several ways a new article dies. Speedy deletion lets a single administrator bin it within minutes, with no debate, for promotional tone, copyright issues or a clearly non-notable subject. Proposed deletion gives a seven-day window for anyone to object before it vanishes. Articles for Deletion runs a full seven-day discussion where editors argue keep or delete on policy grounds and an admin weighs the arguments, not the vote count.

Our deleted client managed three classic mistakes at once. The article was stuffed with promotional language ("a leading provider of", "innovative platform", "trusted by"). Its sources included two press releases, the company's own About page and one sponsored article. And it was created from an account whose entire edit history was that one page, which is a five-alarm signal for undisclosed paid editing.

The one that survived did the opposite. Independent journalist features in two national publications and one major industry title, all written without company input. The draft went through Articles for Creation, the slow but survivable route, where new and conflicted editors submit drafts for review rather than publishing straight to the live encyclopaedia. The first reviewer declined for tone. We rewrote in plain encyclopaedic register. The second reviewer accepted. Six weeks from submission to live page.

Worth setting expectations on the queue. When we last checked, Articles for Creation had 4,355 pending submissions, with hundreds sitting for two months or more. There is no guaranteed turnaround. As the page itself puts it, "getting a review can take a while, but your draft will be reviewed eventually." And here is a 2026 wrinkle: the guideline now states flatly that "articles that are generated entirely by LLMs will be rejected." If you were planning to have ChatGPT write your Wikipedia draft, do not.

Surviving creation is not the finish line either. New articles sit on watchlists, and patrollers strip promotional copy and challenge weak sources long after publication. We have watched a brand page survive its debut and then get gutted three months later by a sharp-eyed editor. In a way, Wikipedia is E-E-A-T enforced at maximum strictness by humans instead of an algorithm. If your content cannot stand up to that, it should not be there.

Conflict of interest, paid editing, and why "guaranteed Wikipedia page" agencies are a red flag

The conflict of interest policy is blunt. If you are paid to edit, you "must disclose who is paying you, on whose behalf the edits are made, and any other relevant affiliation," and that disclosure has to appear on your user page, on affected talk pages, and "whenever you discuss the topic." No disclosure, no legitimacy.

Undisclosed paid editing is a straight policy violation. Accounts get blocked, articles deleted, and edit histories reverted en masse. Wikipedia runs public investigations into agencies pushing undisclosed editing, and those occasionally end with every article the agency ever touched being wiped.

Which brings us to the outfits selling "guaranteed" Wikipedia pages for a flat fee. A lot of them are running precisely the model the policy bans. We have seen pitches for a "Wikipedia listing service" priced anywhere from $1,500 to $10,000, built on a fresh account creating the page, a sprinkle of unrelated edits to disguise the conflict, and zero disclosure. When the page dies, and it does, the agency keeps the money and the brand has burned its one clean shot. There is no appeals process that rewards "but I paid for it."

The legitimate path is slower and carries no guarantee. You disclose. You submit through Articles for Creation. You write neutrally. You let unrelated editors decide. If you cannot reach notability honestly, you do not get a page, and that is the system working as intended. We make the same argument in our defensive SEO and AI brand-narrative piece, because it is the same logic across the board: shortcut tactics that look like wins compound into liabilities once the platform gets smarter.

How to write a draft that actually survives review

Assume you have done the hard part. Five or more genuinely independent, in-depth sources. The notability case is plausible. The draft still has to clear review, and reviewers reject for tone as readily as for sourcing. Here is the structure our team uses.

1. Open with a flat, definitional lead

First sentence defines what the subject is, in one tight line. Not "the company is a leading provider of". More like "[Company] is a [country]-based [type of company] founded in [year] and headquartered in [city]." That is the entire job of sentence one.

2. Use sentence case in headings

The Manual of Style requires section headings "in sentence case ... not title case." Reviewers clock Title Case Headings instantly and read them as marketing.

3. Cite at the sentence level, not the paragraph

Every factual claim wants its own inline citation. "The company raised a Series B in 2024" should carry the reference on that sentence. Reviewers scan for unsupported claims and remove them.
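
A rough way to spot gaps before a reviewer does is to scan the draft's wikitext for sentences with no `<ref>` directly after them. The sketch below is a heuristic under stated assumptions: references sit immediately after the closing punctuation, and the text contains no abbreviations or decimals that would confuse the naive sentence split. The function name and the sample text are ours.

```python
import re

REF_MARK = "\x00"  # placeholder character that should not occur in normal wikitext

# Self-closing <ref name="x" /> is tried first, then paired <ref>...</ref>.
_REF = re.compile(r"<ref\b[^>/]*/>|<ref\b[^>]*>.*?</ref>", re.DOTALL | re.IGNORECASE)
# One sentence: text up to terminal punctuation, plus any ref marks right after it.
_SENTENCE = re.compile(r"([^.!?\x00]+[.!?])((?:\s*\x00)*)")

def uncited_sentences(wikitext: str) -> list[str]:
    """Return sentences that are not immediately followed by a reference."""
    collapsed = _REF.sub(REF_MARK, wikitext)  # drop ref bodies so their punctuation is ignored
    return [
        match.group(1).strip()
        for match in _SENTENCE.finditer(collapsed)
        if REF_MARK not in match.group(2)
    ]

# Invented sample draft.
sample = (
    "Acme was founded in 2015.<ref>Smith, 2016</ref> "
    "It raised a Series B in 2024. "
    'It moved to Leeds in 2020.<ref name="times" />'
)
print(uncited_sentences(sample))
```

Treat the output as a list of sentences to look at, not a verdict; a human still has to decide whether each one states a fact that needs a source.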

4. Strip every promotional word

Leading, innovative, best-in-class, award-winning, renowned, trusted, world-class, cutting-edge. Delete the lot. If a fact is notable, the cited source already says it. You do not need to add the adjective.
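
A quick scan catches most of these before a reviewer does. The sketch below checks a draft against the words listed above; the function name and sample draft are invented for illustration, and the match is deliberately dumb, so it will also flag legitimate uses such as "leading to".

```python
import re

# The words listed above; extend the tuple with your own repeat offenders.
PROMOTIONAL_TERMS = (
    "leading", "innovative", "best-in-class", "award-winning",
    "renowned", "trusted", "world-class", "cutting-edge",
)

# Word boundaries stop "leading" from matching inside "misleading".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in PROMOTIONAL_TERMS) + r")\b",
    re.IGNORECASE,
)

def find_promotional_terms(draft: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_term) pairs, with line numbers starting at 1."""
    hits = []
    for number, line in enumerate(draft.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((number, match.group(0).lower()))
    return hits

# Invented sample draft.
draft = (
    "Acme is a leading provider of invoicing software.\n"
    "It was founded in 2015 and is headquartered in Leeds.\n"
    "Its innovative platform is trusted by retailers."
)
for number, term in find_promotional_terms(draft):
    print(f"line {number}: {term}")
```

An empty result does not mean the tone is neutral. It only means the most obvious adjectives are gone.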

5. Cover the boring and the negative

Real encyclopaedic articles include controversies, discontinued products, leadership changes and dull corporate facts. A draft that reads like a brochure, all upside and no inconvenient truths, tells the reviewer exactly what it is. Include the lawsuit. Include the CEO transition. Neutrality is the point.

6. Disclose your conflict, properly

User page and article talk page, using the public templates. This is not optional, and reviewers check.

7. Submit through Articles for Creation, not the live encyclopaedia

For a conflicted editor, AfC is the only honest route. Creating directly in the mainspace from a conflict-of-interest account is the fastest way to get deleted and flagged.

One more thing reviewers will not tell you but absolutely act on: sources you cannot use to build notability. The reliable sources guideline says self-published sources "are largely not acceptable," excludes user-generated content like forums and social media, and states that content produced by LLMs such as ChatGPT "is generally unreliable" because models hallucinate citations that look real and do not exist. Citing a chatbot to prove your own notability is a special kind of own goal.

What a live Wikipedia page actually does for AI search visibility

This is the question that pays the invoices, so we will be careful with it. A Wikipedia page is not a classical ranking factor. Google does not hand you a search bonus for owning one. What it does is feed three other systems that compound.

The Knowledge Graph. Google's Knowledge Graph draws on structured data from Wikidata and Wikipedia, and a Wikipedia article is normally linked to a Wikidata item. A clean article almost guarantees a Knowledge Panel for your brand on branded searches, the right-hand box that is Google's way of saying "this is a real entity."

Training corpora. Future model retrains will include the article, along with the fine-tunes and embedding datasets that derive from Wikipedia. Once you are in, you tend to stay in for a long time. We looked at why models lean so hard on this kind of well-positioned content in our study on ChatGPT citations and the first 500 words.

Live retrieval at inference time. The underrated one. ChatGPT, Perplexity and Gemini all run retrieval while answering, and Wikipedia is one of the most frequently pulled sources because the models trust it. Ask about your brand and a Wikipedia hit lands in the context window and shapes the answer. That is the nearest thing to a direct citation effect. Our deeper breakdown of how that works sits in our guide to getting cited in ChatGPT and AI Overviews.

We have watched this play out. After one client's article went live, their share of voice in ChatGPT responses to category-level queries ("best [their category] companies") roughly doubled over the next two months. Causation is hard to prove cleanly, but the timing was not subtle.

The traffic story is duller, and you should hear it anyway. People searching your brand name on Google click your site, not your Wikipedia page, so do not expect referral traffic to move. The win is upstream, in how AI systems and the Knowledge Graph treat you as an entity. And there is a fresh twist worth knowing. The Wikimedia Foundation reported in October 2025 that human pageviews were down roughly 8% year on year, attributing it to "search engines ... increasingly using generative AI to provide answers directly to searchers." Fewer humans read Wikipedia directly. More machines read it and re-serve it. The leverage simply shifted.

When you should not bother at all

Most brands should not attempt a Wikipedia page. Nobody selling the service will tell you that, so we will. Here is where we tell clients to drop it.

  • You have fewer than three pieces of significant, independent coverage. The draft will not pass. Build the coverage first. If you can earn five or more substantial features in the next 12 months, revisit then.
  • Your only coverage is trade press. Wikipedia treats trade publications as borderline. If 100% of your coverage is industry-only, the case is fragile. Mix in mainstream business or news first.
  • Your category is under heavy editor scrutiny. Crypto, MLM-adjacent businesses, supplements and some consumer SaaS attract faster deletions and pickier sourcing because of past abuse. Expect a harder ride.
  • You operate in a YMYL space without serious editorial coverage. Medical, legal and financial topics get extra caution, the same logic Google applies to YMYL editorial standards. Your sources have to be exceptional.
  • You have active legal or reputational controversies. Wikipedia articles attract editors who surface every negative fact. If your goal is brand polish, a Wikipedia page may hand you the opposite.
  • You cannot commit to maintenance. This is not a fire-and-forget asset. Editors update it, sometimes badly, and vandalism happens. Someone has to watch the page and engage on the talk page. If nobody on your team will own it, it drifts.

In every one of those cases, your AI search visibility is better served by other moves first: building credible citations, as covered in our original AI search visibility research, earning real media, and making your own site easy for the engines to cite in ChatGPT and AI Overviews. If pulling that whole programme together is the actual job, that is what our AI search visibility service exists to do. Wikipedia is the capstone, not the foundation.

The Grokipedia and AI-encyclopedia question

Quick note on the new entrants. Grokipedia and other AI-generated encyclopaedia projects have turned up promising to rival Wikipedia for brand visibility in AI answers. Our honest read, based on what we actually see cited in LLM outputs, is that none of them have come close to displacing Wikipedia yet. The training-data inertia is too strong, and decades of editorial scrutiny plus Wikidata and Knowledge Graph integration are not things you replicate in a year. We wrote a fuller breakdown in our Grokipedia SEO case study. Short version: keep watching, do not redirect Wikipedia budget towards it yet.

The workflow we actually run

If you have read this far and still want to go for it, this is the order we work through for clients. Not theoretical. The actual sequence.

  1. Audit existing coverage against the notability filter. Strip press releases, sponsored content, founder-only quotes and trade pieces with no editorial independence. Count what is left. You want three to five at the absolute minimum, ideally more.
  2. Run earned-media outreach to close the gaps. Usually a 6 to 12 month effort, because real journalism takes time. Build relationships and pitch substantive angles, not press-release dressing.
  3. Build the Wikidata entity first. Wikidata is more permissive than Wikipedia. You can create a company entity, link authoritative sources, and start feeding the Knowledge Graph before any article exists. A genuinely useful early move with low downside.
  4. Draft in the AfC sandbox, not the mainspace. Registered account, conflict of interest disclosed on the user page, encyclopaedic tone, sentence-level citations.
  5. Pre-review with an experienced editor. There are legitimate Wikipedia consultants, not the volume-output agencies, who will check your draft for tone, sourcing and policy before submission. Worth the cost.
  6. Submit and wait. Expect at least one decline. Address the feedback specifically, resubmit, repeat. Most accepted articles go through two or three cycles.
  7. Monitor once live. Add the page to your watchlist, subscribe to talk-page notifications, and when edits arrive, judge whether they are legitimate improvements (often) or vandalism (occasionally). With a conflict of interest, raise issues on the talk page rather than reverting directly.
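
For step 3, it helps to check whether a Wikidata item for the company already exists before creating one. A minimal sketch, assuming Python's standard library and the public `wbsearchentities` action of the Wikidata API; the helper names, the placeholder User-Agent string and the example company name are ours. The property IDs listed are the standard Wikidata properties for the basic company facts.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://www.wikidata.org/w/api.php"

# Wikidata property IDs for the basics of a company item.
COMPANY_PROPERTIES = {
    "P571": "inception",
    "P159": "headquarters location",
    "P112": "founded by",
    "P169": "chief executive officer",
    "P856": "official website",
}

def search_url(name: str, language: str = "en") -> str:
    """Build a wbsearchentities request URL for a label search."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"

def find_entities(name: str) -> list[dict]:
    """Return candidate items for a name. Needs network access."""
    # Wikimedia asks API clients to identify themselves; this value is a placeholder.
    headers = {"User-Agent": "entity-check-sketch/0.1 (you@example.com)"}
    with urlopen(Request(search_url(name), headers=headers), timeout=10) as response:
        return json.load(response).get("search", [])

print(search_url("Example Company"))
```

If `find_entities` returns a match that is clearly your company, improve that item instead of creating a duplicate, and work through `COMPANY_PROPERTIES` as a checklist of statements to source.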

The stretch from the start of an earned-media campaign to a live, stable article typically runs 9 to 18 months for a B2B company without big pre-existing coverage. Anyone promising faster on a flat fee is selling something that will break.

What to do this week

Three things before you spend another penny on this.

  1. Run the notability audit yourself. Pull every piece of coverage you have. Apply the filter: independent, secondary, in-depth. Count what survives. Be honest. Under three, and your priority is earned media, not Wikipedia.
  2. Create a Wikidata entity if you do not have one. Lower friction, allowed without full-article notability, and it starts feeding the Knowledge Graph straight away. Add founding date, headquarters, key people, identifiers and source links.
  3. Check your current AI visibility before assuming Wikipedia is the missing piece. Run brand-name queries in ChatGPT, Perplexity, Gemini and Google's AI Mode. See what they say and which sources they cite. If you are already cited well, a page compounds it. If you are getting silence or nonsense, fix the foundation first.
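
For the third check, a small tally makes the results easier to compare over time. The sketch below assumes you have run the queries by hand and noted, for each assistant, whether it named the brand and which URLs it cited. The record layout, function names and sample data are hypothetical, and nothing here calls a live API.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical hand-collected results from brand-name queries.
results = [
    {"assistant": "ChatGPT", "mentions_brand": True,
     "cited_urls": ["https://en.wikipedia.org/wiki/Example",
                    "https://www.example.com/about"]},
    {"assistant": "Perplexity", "mentions_brand": True,
     "cited_urls": ["https://en.wikipedia.org/wiki/Example"]},
    {"assistant": "Gemini", "mentions_brand": False, "cited_urls": []},
]

def cited_domains(rows):
    """Count how often each domain is cited across all saved answers."""
    counts = Counter()
    for row in rows:
        for url in row["cited_urls"]:
            counts[urlparse(url).netloc.lower()] += 1
    return counts

def mention_rate(rows):
    """Fraction of saved answers that named the brand at all."""
    if not rows:
        return 0.0
    return sum(row["mentions_brand"] for row in rows) / len(rows)

print(cited_domains(results).most_common())
print(f"mention rate: {mention_rate(results):.0%}")
```

Rerun the same queries every month or so and keep the tallies. The trend tells you more than any single snapshot.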

If you would rather we ran that audit with you and told you straight whether a Wikipedia attempt is worth it for your brand, get in touch with our team. We would genuinely rather talk you out of a doomed attempt than take the budget for one.

Wikipedia is a long, slow, high-skill play. When it works it is one of the strongest AI search assets you can own. When it does not, it quietly eats budget and produces nothing. Know which situation you are in before you start.
