Robots.txt Implementation Guide
What is robots.txt?
The robots.txt file is a text file placed in your website’s root directory that tells web crawlers (bots) which pages or sections of your site they can or cannot access. It’s a standard used by responsible bots to understand your preferences.
Important: robots.txt is a voluntary standard. Well-behaved bots will respect it, but malicious bots may ignore it completely.
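To make the mechanism concrete, here is a minimal sketch of what a well-behaved crawler does before fetching a page, using Python's standard-library robots.txt parser. The domain and bot name are placeholders, not real values:

from urllib.robotparser import RobotFileParser

# A polite crawler downloads robots.txt first and honors its rules.
# Placeholder domain and bot name, for illustration only.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("ExampleBot", "https://example.com/some-page"):
    print("robots.txt permits crawling this URL")
else:
    print("robots.txt asks this bot to stay away")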
Where to Place Your robots.txt File
The file must be located at the root of your domain:
https://example.com/robots.txt
NOT in a subdirectory like /content/robots.txt
Base Template
Here’s a starting template that blocks known bad bots while allowing legitimate search engines:
# Allow major search engines
User-agent: Googlebot
User-agent: Bingbot
User-agent: Slurp
User-agent: DuckDuckBot
User-agent: Baiduspider
User-agent: YandexBot
User-agent: facebot
User-agent: ia_archiver
User-agent: Applebot
Allow: /
# Block known bad bots (SEO crawlers and scrapers)
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: MJ12bot
User-agent: DotBot
User-agent: BLEXBot
User-agent: PetalBot
User-agent: DataForSeoBot
User-agent: Serpstatbot
User-agent: SEOkicks
User-agent: AspiegelBot
User-agent: CCBot
User-agent: GPTBot
Disallow: /
# Default rule for all other bots
User-agent: *
# Crawl-delay is nonstandard: Googlebot ignores it, but some crawlers (e.g. Bingbot) honor it
Crawl-delay: 10
Disallow: /
# Sitemap location (update with your actual sitemap URL)
Sitemap: https://example.com/sitemap.xml
Customizing for Your Site’s Functionality
Adding Required Service Bots
Many websites use third-party services that require bot access to function properly. Because the template above blocks all unlisted bots by default, you'll need to explicitly allow these bots.
Site Search Services
If you use AddSearch, Algolia, Swiftype, or similar:
# Allow site search bot
User-agent: AddSearchBot
Allow: /
Social Media Preview Bots
For proper link previews on social platforms:
# Social media preview bots
User-agent: Twitterbot
User-agent: LinkedInBot
User-agent: Slackbot
User-agent: facebookexternalhit
Allow: /
Monitoring and Analytics Services
If you use uptime monitoring or analytics:
# Monitoring services
User-agent: Pingdom
User-agent: UptimeRobot
Allow: /
AI Training Bots (Optional)
If you want to allow or block AI training bots:
# Block AI training bots
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: CCBot
User-agent: anthropic-ai
User-agent: Claude-Web
User-agent: Google-Extended
Disallow: /
# OR allow them
User-agent: GPTBot
Allow: /
Where to Add Your Custom Bots
Add the service bots you need BEFORE the "User-agent: *" line.
Here’s the correct placement:
# Allow major search engines
User-agent: Googlebot
Allow: /
# YOUR CUSTOM SERVICE BOTS GO HERE
User-agent: AddSearchBot
Allow: /
User-agent: Twitterbot
Allow: /
# Block known bad bots
User-agent: AhrefsBot
Disallow: /
# Keep this catch-all rule last
User-agent: *
Disallow: /
Step-by-Step Customization Process
Step 1: Identify Required Bots
Make a list of all third-party services your website uses:
Site search (AddSearch, Algolia, etc.)
Social media platforms
Monitoring services
CDN services
Analytics tools
Marketing automation
Any other external services that crawl your site
Step 2: Find Bot User-Agent Names
For each service, find the bot’s User-agent name:
Check the service’s documentation
Search for “[Service Name] bot user-agent”
Check your server logs for the bot's identifier (see the log-scanning sketch after this list)
Common format: ServiceNameBot or ServiceName-Bot
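If you'd rather pull identifiers straight from real traffic, a short script can tally the user-agent strings in your access log. This is a minimal sketch that assumes the common Apache/Nginx "combined" log format, where the user-agent is the last quoted field; the log path is a placeholder:

import re
from collections import Counter

# Placeholder path -- point this at your actual access log.
LOG_PATH = "/var/log/nginx/access.log"

# In the "combined" log format the user-agent is the last quoted field.
ua_pattern = re.compile(r'"([^"]*)"\s*$')

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = ua_pattern.search(line)
        if match:
            counts[match.group(1)] += 1

# Bot identifiers such as "AddSearchBot" or "Twitterbot" usually
# stand out near the top of this list.
for agent, hits in counts.most_common(20):
    print(f"{hits:6d}  {agent}")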
Step 3: Add Bots to Your robots.txt
Add each required bot between the search engines section and the “User-agent: *” line:
User-agent: [BotName]
Allow: /
Step 4: Test Your Configuration
Save your robots.txt file
Upload it to the root of your site's file cache
Test it at https://example.com/robots.txt
Use Google Search Console's robots.txt Tester, or spot-check the file locally (see the sketch after this list)
Monitor your site to ensure functionality isn’t broken
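Before uploading, you can spot-check the file locally with Python's standard-library parser. This is a small sketch; the file path and user-agent names are examples to adapt to your own setup:

from urllib.robotparser import RobotFileParser

# Parse the local file (path is an example) and check a few agents.
rp = RobotFileParser()
with open("robots.txt") as f:
    rp.parse(f.read().splitlines())

for agent in ("Googlebot", "AddSearchBot", "AhrefsBot"):
    verdict = "allowed" if rp.can_fetch(agent, "https://example.com/") else "blocked"
    print(f"{agent}: {verdict}")

With the base template above, Googlebot should come back allowed and AhrefsBot blocked; AddSearchBot will show blocked until you add its Allow group.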
Advanced Customization
Blocking Specific Directories
To allow a bot in general but block certain directories (Google applies the most specific matching rule, so these Disallow lines override the broader Allow):
User-agent: Googlebot
Disallow: /admin/
Disallow: /private/
Disallow: /temp/
Allow: /
Allowing Only Specific Paths
User-agent: Googlebot
Allow: /public-content/
Disallow: /
Uploading a robots.txt file
The robots.txt file needs to be accessible at the root of your website, e.g. www.example.com/robots.txt. Once you've created the file, you'll need to upload it to the root of your site's file cache.
You can then point search engines to this file.
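Once it's uploaded, it's worth confirming the file is actually being served from the root. A quick check with Python's standard library, using the example domain as a placeholder:

from urllib.request import urlopen

# Example URL -- substitute your own domain.
with urlopen("https://example.com/robots.txt") as resp:
    print(resp.status)                   # expect 200
    print(resp.read().decode()[:200])    # first lines of the file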
Sitemap
You can also add the location of your sitemap to the robots.txt file, which lets some search engine crawlers pick it up automatically. Upload your sitemap to the root of the file cache (the sitemap.xml file referenced above), then add a Sitemap line to your robots.txt file pointing to it. The Sitemap directive sits outside the User-agent groups, so you can append it to whatever rules you already have; the minimal example below allows all crawlers:
User-agent: *
Disallow:
Sitemap: https://www.example.com/sitemap.xml

