Every robots.txt directive plus the index and snippet controls beyond it, in one reference. robots.txt (User-agent, Disallow, Allow, Sitemap, Crawl-delay) controls crawling, not indexing. To remove a page from the index, use a noindex robots meta on a crawlable page. llms.txt and IndexNow are the AI-era additions.
robots.txt directives
robots.txt sits at your site root and controls what crawlers may fetch. It controls crawling, not indexing.
| Directive | What it does | Example |
|---|---|---|
| User-agent | Targets a specific bot (or * for all) | User-agent: GPTBot |
| Disallow | Blocks a path from crawling | Disallow: /private/ |
| Allow | Permits a path inside a disallowed one (the more specific rule wins) | Allow: /private/public/ |
| Sitemap | Points crawlers to your sitemap | Sitemap: https://site.com/sitemap.xml |
| Crawl-delay | Seconds between requests (Bing honors it; Google ignores it) | Crawl-delay: 10 |
| Wildcards | * matches any sequence, $ matches URL end | Disallow: /*.pdf$ |
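To make the wildcard and Allow/Disallow precedence rules concrete, here is a minimal Python sketch of how a crawler might evaluate one user-agent group. It assumes the longest-match convention (the longest matching pattern wins, and Allow wins a tie) and skips details such as percent-encoding normalization. The names `pattern_to_regex`, `is_allowed`, and `RULES` are hypothetical and used only for illustration.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored_end = pattern.endswith("$")
    if anchored_end:
        pattern = pattern[:-1]
    # Escape everything, then turn the escaped '*' back into "any sequence".
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored_end else ""))

def is_allowed(rules, path: str) -> bool:
    """rules: list of ("allow" | "disallow", pattern) pairs for one group."""
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            length = len(pattern)
            # Longest pattern wins; on a tie, Allow wins (assumed convention).
            if length > best_len or (length == best_len and kind == "allow"):
                best_len = length
                allowed = (kind == "allow")
    return allowed

# Illustrative rule set built from the examples in the table above.
RULES = [
    ("disallow", "/private/"),
    ("allow", "/private/public/"),
    ("disallow", "/*.pdf$"),
]

for path in ["/private/report.html", "/private/public/page.html", "/docs/file.pdf"]:
    print(path, "->", "allowed" if is_allowed(RULES, path) else "blocked")
```

Real crawlers differ in edge cases, so treat this as a way to reason about your rules rather than a replacement for a search engine's own robots.txt tester.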
Index and snippet control (beyond robots.txt)
robots.txt cannot remove a page from the index. To deindex, the page must be crawlable and carry a noindex.
| Signal | Where it lives | What it does |
|---|---|---|
| noindex robots meta | <head> | Keeps the page out of the index (must be crawlable to be seen) |
| nofollow robots meta | <head> | Tells engines not to follow links on the page |
| X-Robots-Tag | HTTP response header | noindex or nofollow for non-HTML files like PDFs |
| rel=canonical | <head> or header | Consolidates duplicate URLs to one preferred version |
| data-nosnippet | Inline attribute | Excludes that text from search snippets |
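Because a noindex can arrive through either the robots meta tag or the X-Robots-Tag header, it helps to check both when auditing a page. The sketch below uses only Python's standard library and works on an HTML string plus a dict of response headers that you supply. The names `RobotsMetaParser`, `index_directives`, and `is_indexable` are hypothetical, and the header handling is simplified (it ignores bot-specific prefixes such as `googlebot: noindex`).

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)

def index_directives(html: str, headers: dict) -> set:
    """Return the union of robots meta and X-Robots-Tag directives."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = set(parser.directives)
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            # Simplification: bot-specific prefixes are not handled.
            found.update(t.strip().lower() for t in value.split(",") if t.strip())
    return found

def is_indexable(html: str, headers: dict) -> bool:
    """False when noindex (or its shorthand 'none') is present."""
    directives = index_directives(html, headers)
    return "noindex" not in directives and "none" not in directives

page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(is_indexable(page, {}))
```

Note that this check only tells you what the page says. The directive has no effect unless crawlers are allowed to fetch the URL in the first place.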
AI-era files
| File | Purpose |
|---|---|
| llms.txt | A plain-markdown guide at root that tells LLMs who you are and what to read |
| llms-full.txt | The full-content version of llms.txt |
| IndexNow key file | Lets you instantly notify Bing and Yandex of new or changed URLs |
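As a rough illustration of how the IndexNow key file is used, the sketch below builds the JSON body for a bulk URL submission without sending it. The field names (`host`, `key`, `keyLocation`, `urlList`) follow the IndexNow protocol as publicly documented, but verify them against the current specification before relying on this. The function name `build_indexnow_payload`, the example host, and the example key are all hypothetical, and the key file is assumed to live at the site root as `<key>.txt`.

```python
import json

def build_indexnow_payload(host: str, key: str, urls: list) -> str:
    """Build the JSON body for an IndexNow bulk submission (nothing is sent)."""
    payload = {
        "host": host,
        "key": key,
        # Assumption: the key file sits at the site root and is named <key>.txt.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }
    return json.dumps(payload, indent=2)

# Hypothetical host and key, for illustration only.
body = build_indexnow_payload(
    "www.example.com",
    "0123456789abcdef0123456789abcdef",
    ["https://www.example.com/new-page", "https://www.example.com/updated-page"],
)
print(body)
```

In practice you would POST this body as JSON to a participating engine's IndexNow endpoint, and the engine would fetch the key file to confirm you control the host.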
llms.txt format: markdown, an H1 with your name, a blockquote summary, then sections of linked pages. See the truth about llms.txt and the AI crawler cheatsheet for allow/block guidance.
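The format described above is simple enough to generate from a script. This is a minimal sketch, assuming sections are written as H2 headings containing markdown link lists with an optional note after each link, which matches the llms.txt proposal as commonly described. The function name `build_llms_txt` and the sample company, URLs, and notes are hypothetical.

```python
def build_llms_txt(name: str, summary: str, sections: dict) -> str:
    """sections maps a section title to a list of (link text, url, note) tuples."""
    lines = [f"# {name}", "", f"> {summary}", ""]
    for title, links in sections.items():
        lines.append(f"## {title}")  # assumed: one H2 per section
        lines.append("")
        for text, url, note in links:
            entry = f"- [{text}]({url})"
            if note:
                entry += f": {note}"
            lines.append(entry)
        lines.append("")
    # Collapse trailing blank lines into a single final newline.
    return "\n".join(lines).rstrip() + "\n"

# Hypothetical content, for illustration only.
llms_txt = build_llms_txt(
    "Example Co",
    "We sell widgets.",
    {"Docs": [("Guide", "https://example.com/guide.md", "Start here")]},
)
print(llms_txt)
```

Write the result to a file named llms.txt at your site root, and keep the link list short so it points at the pages you most want read.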
Frequently asked questions
Does robots.txt stop a page from being indexed?
No. robots.txt stops crawling, not indexing. A blocked URL can still be indexed from external links, shown without a snippet. To deindex, let the page be crawled and add a noindex robots meta, or use X-Robots-Tag.
What is llms.txt?
A proposed standard: a plain-markdown file at your site root that tells large language models who you are and which pages to read. It does not replace robots.txt; it guides LLMs to your best content.
How do I block AI crawlers in robots.txt?
Target the bot by user-agent and disallow everything, for example User-agent: GPTBot then Disallow: /. Keep search and assistant bots allowed, because blocking them removes you from AI answers.
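Before deploying a rule like this, you can sanity-check it locally. The sketch below feeds a sample robots.txt to Python's standard `urllib.robotparser` and asks whether a given bot may fetch a URL. The sample file, the example.com URLs, and the helper name `can_crawl` are illustrative assumptions. Python's parser uses simple prefix matching and does not implement the * and $ wildcards, so it is only a rough check for rules like these.

```python
from urllib.robotparser import RobotFileParser

# Sample file: block GPTBot everywhere, keep other bots out of /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

for bot in ["GPTBot", "Googlebot"]:
    verdict = can_crawl(ROBOTS_TXT, bot, "https://example.com/blog/")
    print(bot, "->", "allowed" if verdict else "blocked")
```

Running the same check against each bot you care about is a quick way to confirm that a block aimed at one crawler has not caught the search and assistant bots you meant to keep.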
Does Google respect Crawl-delay?
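No. Google ignores Crawl-delay and adjusts its crawl rate automatically based on how your server responds. Bing honors the directive; for other engines, check their current documentation, since support has changed over time.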