Robots.txt Generator
Generate a properly formatted robots.txt file for your website. Block or allow specific bots and paths, and reference your sitemap.
What is robots.txt?
Robots.txt is a plain text file placed at the root of your website that tells web crawlers (Googlebot, Bingbot, GPTBot, etc.) which pages or sections of your site they may or may not access. It is part of the Robots Exclusion Protocol, developed in 1994, and remains the standard way to control crawler behavior. Robots.txt is a request, not enforcement: well-behaved crawlers honor it, while malicious bots ignore it. With AI companies now training on web content, blocking AI crawlers via robots.txt has become a common way to opt out – GPTBot, Google-Extended (Gemini, formerly Bard), ClaudeBot, and Common Crawl's CCBot can all be blocked. This tool generates a properly formatted robots.txt with custom paths, AI scraper blocking, and a sitemap reference.
How to use this tool
- Choose default behavior: Allow all bots (standard) or Disallow all bots (rare – only for staging/private sites).
- Add disallowed paths: one per line. Common: /admin/, /wp-admin/, /private/. End a path with a slash to block a directory and everything under it. Rules match by prefix, so /private (no trailing slash) also matches /private.html and /private-files/.
- Add specifically allowed paths: exceptions to a disallow rule, e.g. /wp-admin/admin-ajax.php (needed for site functionality).
- Block AI scrapers (optional): toggle the checkboxes to block GPTBot, ClaudeBot, Google-Extended, and Common Crawl (CCBot). Important for content creators.
- Add sitemap URL: tell crawlers where your sitemap.xml is. Improves indexing efficiency.
- Copy or download: the generated robots.txt is ready to upload to your website root (e.g. https://yoursite.com/robots.txt).
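The steps above can be sketched as a small function. This is only an illustration of how such a file might be assembled, not this tool's actual implementation; the function name `build_robots_txt` and all of its parameters are hypothetical.

```python
def build_robots_txt(allow_all=True, disallow=(), allow=(),
                     blocked_bots=(), sitemap=None):
    """Assemble robots.txt text from form-style inputs (hypothetical helper)."""
    rules = []
    if not allow_all:
        rules.append("Disallow: /")            # default: block everything
    rules += [f"Disallow: {path}" for path in disallow]
    rules += [f"Allow: {path}" for path in allow]
    if not rules:
        rules.append("Allow: /")               # nothing restricted at all

    lines = ["User-agent: *"] + rules
    # One separate group per blocked bot, each preceded by a blank line.
    for bot in blocked_bots:
        lines += ["", f"User-agent: {bot}", "Disallow: /"]
    if sitemap:
        lines += ["", f"Sitemap: {sitemap}"]
    return "\n".join(lines) + "\n"


text = build_robots_txt(
    disallow=["/admin/"],
    allow=["/admin/public/"],
    blocked_bots=["GPTBot"],
    sitemap="https://example.com/sitemap.xml",
)
print(text)
```

The returned string can be saved as robots.txt and uploaded to the site root.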
robots.txt syntax
Basic format:
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Sitemap: https://example.com/sitemap.xml
Directives:
- User-agent: * – rules apply to all bots
- User-agent: Googlebot – rules specifically for Google
- Disallow: /path/ – block bots from this path
- Allow: /path/file – explicit allow (the more specific rule overrides a broader Disallow)
- Crawl-delay: 10 – wait 10 seconds between requests (only some bots respect this; Googlebot ignores it)
- Sitemap: URL – point to your sitemap
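To make the Allow/Disallow interaction concrete, here is a minimal sketch of the "most specific rule wins" resolution used by major search crawlers: the longest matching path prefix decides, and Allow wins a tie. The function `is_allowed` is hypothetical and deliberately simplified (plain prefix matching, no `*` or `$` wildcards). Some parsers instead apply the first matching rule in file order, so results can differ between crawlers.

```python
def is_allowed(path, rules):
    """Decide whether `path` may be crawled.

    `rules` is a list of (directive, prefix) tuples from one User-agent group.
    Simplified sketch: plain prefix matching only, no wildcards.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix is more specific; on equal length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/admin/"), ("Allow", "/admin/public/")]
print(is_allowed("/admin/settings", rules))         # False
print(is_allowed("/admin/public/logo.png", rules))  # True
print(is_allowed("/blog/", rules))                  # True
```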
Common AI bot User-Agents:
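- GPTBot – OpenAI (crawler that collects training data for ChatGPT models)
- Google-Extended – Google (controls use of your content for Gemini, formerly Bard, training)
- ClaudeBot, anthropic-ai – Anthropic Claude
- CCBot – Common Crawl (used by many AI companies)
- Bingbot – Microsoft Bing (a search crawler; blocking it also removes you from Bing search)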
Examples
Standard WordPress site:
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://yoursite.com/wp-sitemap.xml
E-commerce blocking checkout pages and search:
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /search?
Disallow: /account/
Block AI scrapers while keeping search engines:
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /
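One way to sanity-check a file like the AI-blocking example before uploading it is Python's standard-library `urllib.robotparser`. The sketch below parses the rules from a string and asks whether a few crawlers may fetch a page; the URL https://example.com/article is a placeholder. Note that this parser is only an approximation of how any particular crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network request

for agent in ("Googlebot", "GPTBot", "ClaudeBot"):
    print(agent, parser.can_fetch(agent, "https://example.com/article"))
```

Under these rules the check reports that Googlebot may fetch the page while GPTBot and ClaudeBot may not.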
Tips & best practices
- Robots.txt MUST be at root: https://yoursite.com/robots.txt (not in a subdirectory)
- Test your robots.txt in Google Search Console: Settings > robots.txt report
- Don’t use robots.txt to hide sensitive data – listing a path there advertises it, because anyone can read robots.txt
- Use a noindex meta tag (in the page’s HTML) to keep a page out of search results – robots.txt prevents CRAWLING, not indexing. The page must stay crawlable, or the crawler never sees the tag
- Block internal search, filter, and sort URLs (/search, /filter, /sort and their parameters) – they create near-infinite URL variations that waste crawl budget
- Always reference your sitemap.xml in robots.txt – helps crawlers find pages efficiently
- Block AI scrapers ONLY if you don’t want your content used for AI training – some content creators welcome it for citations
Limitations & notes
Robots.txt is a REQUEST, not enforcement – malicious bots ignore it. For sensitive data, use authentication, not robots.txt. Search engines may still INDEX disallowed URLs they discover through external links (without crawling the content) – to keep a page out of results, use a ‘noindex’ meta tag and leave the page crawlable so the tag can be read. The major AI crawlers (Google, OpenAI, Anthropic) state that they respect robots.txt, but newer or lesser-known scrapers may not.
Frequently Asked Questions
Where should robots.txt be located?
At the root of your domain: https://yoursite.com/robots.txt, NOT in a subdirectory. The filename must be all lowercase (robots.txt, not Robots.txt), because URL paths are case-sensitive. Each subdomain needs its own robots.txt (https://blog.yoursite.com/robots.txt is separate from the main site’s).
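As a small illustration of the "one file per host, always at the root" rule, the sketch below derives the robots.txt URL that governs any given page. The helper name `robots_txt_url` is hypothetical; it uses only the standard-library `urllib.parse` module.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that governs `page_url` (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including subdomain and port); the path is
    # always /robots.txt, and any query string or fragment is dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://yoursite.com/shop/item?id=3"))
# https://yoursite.com/robots.txt
print(robots_txt_url("https://blog.yoursite.com/2024/post/"))
# https://blog.yoursite.com/robots.txt
```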
Does blocking in robots.txt prevent indexing?
Not reliably. Robots.txt prevents CRAWLING (Google won’t read the page content). But if Google finds the URL linked from somewhere else, it MAY still appear in search results with limited info. To keep a page out of the index, use a ‘noindex’ meta tag in the HTML and do not disallow the page, so the crawler can read the tag.
Should I block AI scrapers like GPTBot?
Personal choice. Block if: you sell content/courses/articles for a living, your content is your unique competitive advantage, or you simply don’t want AI to train on it. Don’t block if: you want maximum exposure including AI citation, you publish public knowledge that benefits from broad access, or you want to be referenced by ChatGPT/Claude/Gemini.
What’s the difference between disallow and noindex?
Disallow (in robots.txt): tells the crawler not to FETCH the page. The page may still appear in search results based on external link info. Noindex (HTML meta tag): tells the crawler it may fetch the page but must NOT INDEX it, so the page stays out of results. Do not combine the two on the same URL: a disallowed page is never fetched, so its noindex tag is never seen. For real privacy, use authentication.
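For reference, the noindex directive lives in the page itself as `<meta name="robots" content="noindex">`, which is why a crawler has to be able to fetch the page in order to see it. The sketch below checks an HTML string for that tag with the standard-library `html.parser`; the class `RobotsMetaParser` and function `has_noindex` are hypothetical names, and the check covers only the generic `robots` meta tag, not crawler-specific tags or HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        # The content attribute is a comma-separated list of directives.
        directives = [token.strip() for token in content.split(",")]
        if name == "robots" and "noindex" in directives:
            self.noindex = True


def has_noindex(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


page = ('<html><head><meta name="robots" content="noindex, follow">'
        '</head><body>Hi</body></html>')
print(has_noindex(page))  # True
```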
How do I block all bots from my site?
Put this in robots.txt:
User-agent: *
Disallow: /
This tells all crawlers not to access any URL. Good for staging/development sites. NEVER use it on production – search engines will stop crawling your site and your search traffic will disappear.
Can I block specific bots while allowing others?
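Yes – use separate User-agent groups. For example, to block OpenAI’s GPTBot but allow Google and everyone else:
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
A crawler follows the most specific User-agent group that matches it, so GPTBot obeys its own group instead of the * group.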
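The group selection described here can be sketched as follows. This is a simplified illustration rather than any crawler's real matching code: the function `select_group` is hypothetical and uses case-insensitive substring matching on the crawler name, falling back to the `*` group when no named group matches.

```python
def select_group(user_agent, groups):
    """Pick the rule group a crawler should obey (hypothetical helper).

    `groups` maps a User-agent token from robots.txt to its list of rules.
    """
    ua = user_agent.lower()
    # Named groups whose token appears in the crawler's name.
    matches = [token for token in groups
               if token != "*" and token.lower() in ua]
    if matches:
        # The longest token is treated as the most specific match.
        return groups[max(matches, key=len)]
    return groups.get("*", [])  # fall back to the catch-all group


groups = {
    "GPTBot": ["Disallow: /"],
    "*": ["Allow: /"],
}
print(select_group("GPTBot", groups))     # ['Disallow: /']
print(select_group("Googlebot", groups))  # ['Allow: /']
```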
Does robots.txt help SEO?
Indirectly. By blocking duplicate URLs (filtered listings, search results), tracking parameters, and low-value pages, you concentrate crawl budget on important pages. This improves overall SEO. But robots.txt is not a direct ranking factor – it’s about crawl efficiency.
