robots.txt Wildcard That Blocks AI Citations

The robots.txt file is two lines on most property management websites: allow everything, point to the sitemap. When something goes wrong with AI citations, robots.txt is rarely the first place anyone looks. It should be.

A single wildcard rule, one that looks completely reasonable, can silently block every AI citation crawler on the web from accessing your pages. ChatGPT, Perplexity, Claude, Google AI Mode: all of them, at once, without a single error message.

How robots.txt wildcard rules work

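The User-agent: * directive in robots.txt is a catch-all. Any rule you place under it applies to every crawler that doesn’t have its own User-agent group somewhere in the file. Position does not matter: a crawler follows the most specific group that names it, wherever that group appears.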

Most sites use it simply: User-agent: * followed by Allow: /. That allows all crawlers to access all pages. Uncomplicated.

The problem starts when sites add a specific crawler block, usually GPTBot, using a pattern that looks like this:

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /

This is the intended structure: block GPTBot specifically, allow everything else. And it works, for a plain crawl-all-or-block-one setup.

But many sites that have been through a “block AI training crawlers” guide don’t stop there. They end up with something like:

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

That second configuration blocks everything by default, then allows Googlebot back in. The intent is usually to limit crawling to Google only. The effect is that OAI-SearchBot, ClaudeBot, PerplexityBot, and Google-Extended are all blocked: the wildcard Disallow: / catches them because none of them has an explicit allow rule.

The site looks fine in Google Search. AI citations stop without any visible signal.

The crawlers this affects

For AI citation purposes, the crawlers that matter are:

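OAI-SearchBot: ChatGPT live-search citations (distinct from GPTBot, which is training only)
ClaudeBot: Claude web search citations
PerplexityBot: Perplexity inline citations
Google-Extended: Google’s control token for Gemini (the token is spelled Google-Extended, not Googlebot-Extended; Google documents AI Overviews and AI Mode in Search as following Googlebot rules)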

None of these are Googlebot. A robots.txt that allows only Googlebot and blocks everything else will block all four.
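The effect can be reproduced offline. The following is a minimal sketch using Python’s standard-library urllib.robotparser against the Googlebot-only configuration shown earlier. The function name access_report and the example.com URL are hypothetical, and Google’s token is written as Google-Extended. Python’s parser matches user-agent tokens by substring and is only an approximation of how each vendor’s crawler reads the file, so treat the result as an indication rather than proof.

```python
from urllib.robotparser import RobotFileParser

# The "Google only" configuration discussed above.
RESTRICTIVE = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /
"""

# Tokens named in this article (Google's is spelled Google-Extended).
CITATION_TOKENS = ["OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def access_report(robots_txt, tokens, url="https://example.com/"):
    """Map each user-agent token to True (allowed) or False (blocked).

    `url` is a placeholder; any page on the site under test will do.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in tokens}


if __name__ == "__main__":
    report = access_report(RESTRICTIVE, ["Googlebot", "GPTBot"] + CITATION_TOKENS)
    for token, allowed in report.items():
        print(f"{token:16} {'allowed' if allowed else 'BLOCKED'}")
```

Under these assumptions, only Googlebot comes back as allowed. Every citation token falls through to the wildcard group and is reported as blocked.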

Why this is easy to miss

Google Search Console doesn’t surface AI crawler blocks as errors. If Googlebot can access your site, the Coverage report will be clean. There’s no GSC equivalent for OAI-SearchBot coverage, no dashboard showing “ChatGPT attempted to access 47 pages and was blocked on all of them.”

The block is silent. Rankings don’t change. Traffic doesn’t drop. The only signal is that AI citations don’t appear, and that’s easy to attribute to other causes.

What to check

Open your robots.txt file at yourdomain.com/robots.txt and answer three questions:

1. Is there a User-agent: * rule with Disallow: /? If yes, every crawler without its own User-agent group is blocked. The citation crawlers listed above need explicit Allow: / groups of their own.

2. Are OAI-SearchBot, ClaudeBot, PerplexityBot, and Google-Extended mentioned anywhere? If they’re absent and the wildcard is restrictive, they inherit the block.

3. Does each citation crawler match a group of its own? Position in the file does not decide precedence. A crawler follows the most specific User-agent group that matches its name, and uses the wildcard group only when no specific group matches. If a citation crawler has no group of its own, it will always fall through to User-agent: *.
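The first two questions can be answered mechanically. Below is a rough sketch, not a full robots.txt parser: the helper names parse_groups and audit are hypothetical, the grouping logic is simplified (it ignores path wildcards and any fields other than User-agent, Allow, and Disallow), and Google’s token is written as Google-Extended.

```python
def parse_groups(robots_txt):
    """Split robots.txt text into (user_agents, rules) groups.

    Simplified: consecutive User-agent lines share one group, and a
    User-agent line that follows a rule line starts a new group.
    """
    groups, agents, rules = [], [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # previous group is complete
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def audit(robots_txt, tokens):
    """Answer questions 1 and 2 for the given crawler tokens."""
    groups = parse_groups(robots_txt)
    # Question 1: does the wildcard group disallow the whole site?
    wildcard_blocks_all = any(
        "*" in agents and ("disallow", "/") in rules and ("allow", "/") not in rules
        for agents, rules in groups
    )
    # Question 2: which tokens are not named in any group?
    named = {agent for agents, _ in groups for agent in agents}
    unnamed = [t for t in tokens if t.lower() not in named]
    return {
        "wildcard_blocks_all": wildcard_blocks_all,
        "unnamed_tokens": unnamed,
        "inherits_block": unnamed if wildcard_blocks_all else [],
    }


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /\n\nUser-agent: Googlebot\nAllow: /\n"
    tokens = ["OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]
    print(audit(sample, tokens))
```

Any token that shows up under inherits_block has no group of its own and sits behind a restrictive wildcard, which is the situation described in question 2.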

The fix

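Add an explicit allow group for each citation crawler. Where the groups sit relative to the wildcard does not matter, because a named group always takes precedence for the crawler it names. Even if your wildcard already allows everything, naming them explicitly makes intent clear and prevents a future edit from accidentally recapturing them: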

User-agent: *
Allow: /

# AI citation crawlers, explicitly allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

# Training crawlers, block separately if desired
User-agent: GPTBot
Disallow: /

This configuration allows all standard crawling, explicitly whitelists the citation crawlers, and, if desired, blocks model training crawlers separately.
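Before deploying a change like this, it can help to check the file against what you expect it to do. Here is a small sketch using Python’s standard-library urllib.robotparser. The EXPECTED table, the mismatches helper, and the example.com URL are illustrative assumptions, the embedded file writes Google’s token as Google-Extended, and Python’s parser is only a stand-in for each vendor’s own robots.txt handling.

```python
from urllib.robotparser import RobotFileParser

# Copy of the fixed configuration, with Google's token spelled Google-Extended.
FIXED = """\
User-agent: *
Allow: /

# AI citation crawlers, explicitly allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

# Training crawlers, block separately if desired
User-agent: GPTBot
Disallow: /
"""

# What the site owner intends: True = may crawl, False = blocked.
EXPECTED = {
    "Googlebot": True,
    "OAI-SearchBot": True,
    "ClaudeBot": True,
    "PerplexityBot": True,
    "Google-Extended": True,
    "GPTBot": False,
}


def mismatches(robots_txt, expected, url="https://example.com/"):
    """Return (token, wanted, actual) for every token that deviates from intent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    results = {token: parser.can_fetch(token, url) for token in expected}
    return [
        (token, wanted, results[token])
        for token, wanted in expected.items()
        if results[token] != wanted
    ]


if __name__ == "__main__":
    problems = mismatches(FIXED, EXPECTED)
    print("robots.txt matches intent" if not problems else problems)
```

Running the same check after every robots.txt edit would catch a future change that puts the citation crawlers back under a restrictive rule.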

One thing this check won’t catch

robots.txt controls crawler access. It doesn’t control whether AI engines will actually cite your pages once they have access. Pages also need to rank in the top results for the query, be structured so the answer can be extracted, and have entity signals that let AI engines attribute the citation correctly.

Fixing a robots.txt block opens the door. Whether AI engines walk through it depends on the content and structure of what’s on the other side.

Technical audit of AI crawler access is included in every site review I run. If you’d like to know which crawlers can reach your pages, and what’s preventing AI engines from citing you, a free audit is the next step.
