What is Robots.txt?¶
Think of robots.txt as a "Do Not Enter" sign for search engines.
Real-World Analogy¶
- Your Website = Your House
- Search Engines (Google) = Visitors
- Robots.txt = Signs on your doors
Your House (Website):
- 🚪 Living Room ✅ "Welcome! Come in" (public pages: /about, /products)
- 🚪 Bedroom 🚫 "Private - Stay Out" (admin pages: /admin, /login)
- 🚪 Storage 🚫 "Messy - Don't Look" (staging pages: /staging, /test)
Robots.txt tells Google: "You can visit the living room, but please stay out of the bedroom and storage."
Location: Your robots.txt file lives at: https://yoursite.com/robots.txt
Why Do Robots.txt Conflicts Matter?¶
A conflict happens when you send MIXED SIGNALS to Google:

- 🚫 Robots.txt says: "Don't enter this room"
- 🗺️ BUT your sitemap says: "Hey Google, please index this important page!"
- 🔗 OR your internal links say: "This page is part of our site navigation"

This is like putting a "Do Not Enter" sign on a door, but then:

- Listing it on your house tour guide (sitemap)
- Putting arrows pointing to it from every room (internal links)
Result: Google is confused!
Real Consequences¶
1. Wasted Crawl Budget¶
Google tries to crawl these pages but can't → wasted time
2. Broken Link Equity¶
Your internal links point to blocked pages → SEO value lost
3. Indexing Problems¶
Important pages might not get indexed
4. Poor User Experience¶
Users can reach blocked pages but Google can't
How Robots.txt Works (Simple Explanation)¶
Your robots.txt file contains RULES:
Example robots.txt file¶
```
User-agent: *           # Applies to everyone
Disallow: /admin/       # Block the /admin/ folder
Disallow: /private/     # Block /private/ too
Disallow: /*.pdf$       # Block all PDF files
Allow: /admin/public/   # Exception: allow this
Sitemap: https://site.com/sitemap.xml
```
Understanding the Rules¶
| Rule Type | Meaning | Example |
|---|---|---|
| User-agent: * | Who this applies to | * = all search engines |
| Disallow: /admin/ | Don't crawl this | Blocks everything in /admin/ folder |
| Allow: /admin/pub/ | Exception to Disallow | Allow this specific path |
| Sitemap: [URL] | Where to find sitemap | Helps Google find pages |
Pattern Matching Examples¶
Pattern: /admin/¶
- ✅ Blocks: /admin/dashboard, /admin/users, /admin/settings/profile
- ❌ Doesn't block: /administrator, /public/admin
Pattern: /*.pdf$¶
- ✅ Blocks: /guide.pdf, /docs/manual.pdf, /files/report.pdf
- ❌ Doesn't block: /pdf-viewer, /guide.html
Pattern: /temp¶
- ✅ Blocks: /temp, /temporary, /temp/file.html
- ❌ Doesn't block: /contemplate
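You can try the prefix patterns above yourself with Python's standard-library robots.txt parser. A caveat worth knowing: `urllib.robotparser` does simple prefix matching and does not support the `*` and `$` wildcards Google understands, so the `/*.pdf$` pattern is left out of this sketch; the `site.com` URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Minimal robots.txt using only prefix rules
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /temp
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# /admin/ blocks everything inside the folder, but not /administrator
print(rp.can_fetch("*", "https://site.com/admin/dashboard"))  # False
print(rp.can_fetch("*", "https://site.com/administrator"))    # True

# /temp is a bare prefix, so it also catches /temporary
print(rp.can_fetch("*", "https://site.com/temporary"))        # False
print(rp.can_fetch("*", "https://site.com/contemplate"))      # True
```

`can_fetch(user_agent, url)` answers the same question a crawler asks: "Am I allowed to fetch this URL under these rules?"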
Types of Conflicts Explained¶
Conflict Type 1: In Sitemap (CRITICAL) 🔴¶
What it means: Your sitemap.xml tells Google "Please index this page!" BUT robots.txt says "Don't crawl it"
Example:

Sitemap.xml:

```xml
<url>
  <loc>https://site.com/admin/dashboard</loc>
  <priority>0.8</priority>  <!-- High priority! -->
</url>
```

Robots.txt:

```
Disallow: /admin/    # BLOCKED!
```
Why it's critical: This is the WORST type of mixed signal. You're actively asking Google to index a page you won't let it crawl.
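This kind of conflict is easy to detect automatically: parse the sitemap, then ask the robots.txt rules about each URL. Below is a minimal sketch using only the standard library; `find_sitemap_conflicts` and the sample URLs are illustrative, and the stdlib parser only understands prefix rules (no `*`/`$` wildcards).

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def find_sitemap_conflicts(robots_lines, sitemap_xml):
    """Return sitemap URLs that robots.txt blocks (a critical mixed signal)."""
    rp = RobotFileParser()
    rp.parse(robots_lines)
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text for loc in root.findall(".//sm:loc", NS)]
    return [u for u in urls if not rp.can_fetch("*", u)]

sitemap = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://site.com/about</loc></url>
  <url><loc>https://site.com/admin/dashboard</loc></url>
</urlset>"""

conflicts = find_sitemap_conflicts(["User-agent: *", "Disallow: /admin/"], sitemap)
print(conflicts)  # ['https://site.com/admin/dashboard']
```

Any URL this returns is one you are simultaneously advertising (sitemap) and blocking (robots.txt).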
Conflict Type 2: Has Internal Links (HIGH) 🟠¶
What it means: Other pages on your site link to this blocked page
Example: your homepage contains the link

```html
<a href="/admin/dashboard">Admin Login</a>
```

but robots.txt blocks /admin/.

Why it matters:

- Link equity (SEO value) flows into a dead end
- Confusing for site structure
- Users can reach it, but Google can't
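Finding these dead-end links can also be scripted: collect the `<a href>` targets from a page, then check each against the robots.txt rules. A standard-library sketch, with the HTML snippet, rule, and `site.com` domain as made-up examples:

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /admin/"])

page = '<footer><a href="/admin/dashboard">Admin Login</a> <a href="/about">About</a></footer>'
collector = LinkCollector()
collector.feed(page)

# Internal links whose targets the crawler is not allowed to fetch
dead_ends = [href for href in collector.links
             if not rp.can_fetch("*", "https://site.com" + href)]
print(dead_ends)  # ['/admin/dashboard']
```

Running something like this over every page of a site is essentially what a crawl-audit tool does to produce a conflict report.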
Conflict Type 3: BOTH (MAXIMUM PRIORITY) 🔴🔴¶
When a page is BOTH in sitemap AND has internal links = Triple mixed signal = Fix immediately!
How to Fix Conflicts (Step-by-Step)¶
For EACH conflict, ask yourself ONE question:
"Should this page be PUBLIC or PRIVATE?"
```
      Should this page be PUBLIC?
                │
         ┌──────┴──────┐
      YES ✅         NO 🚫
         │              │
         ▼              ▼
  Make it public   Keep it private
```
Option A: Page SHOULD BE PUBLIC 🌐¶
Fix: Remove robots.txt block
Example:
BEFORE (robots.txt):

```
Disallow: /admin/
```

Problem: /admin/public-info is blocked.

AFTER (robots.txt):

```
Disallow: /admin/
Allow: /admin/public-info/   # ADD EXCEPTION
```
Steps:

1. Edit your robots.txt file
2. Add an Allow rule OR remove the Disallow rule
3. Keep the page in the sitemap ✅
4. Keep internal links ✅
5. Wait for Google to recrawl (1-2 weeks)
Option B: Page SHOULD BE PRIVATE 🔒¶
Fix: Remove from sitemap AND remove internal links
Example:
Page: /admin/dashboard

Step 1: Remove it from sitemap.xml
→ Delete the `<loc>.../admin/dashboard</loc>` entry

Step 2: Remove internal links
→ Remove them from the header, footer, and homepage
→ The page stays reachable via direct URL or a login form

Step 3: Keep the robots.txt block
→ `Disallow: /admin/` stays as is
Steps:

1. Open your sitemap.xml file
2. Remove the blocked URL from the sitemap
3. Find all pages linking to the blocked page
4. Remove those links (especially from the header/footer)
5. Keep the robots.txt block ✅
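Removing a URL from a large sitemap by hand is error-prone, so here is a hedged sketch of scripting step 2 with Python's `xml.etree.ElementTree`; the function name and sample URLs are illustrative, not part of any tool mentioned here.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", NS)  # keep the default namespace on output

def remove_from_sitemap(sitemap_xml, blocked_url):
    """Return sitemap XML with every <url> entry for blocked_url removed."""
    root = ET.fromstring(sitemap_xml)
    for url in list(root.findall(f"{{{NS}}}url")):
        loc = url.find(f"{{{NS}}}loc")
        if loc is not None and loc.text == blocked_url:
            root.remove(url)
    return ET.tostring(root, encoding="unicode")

sitemap = f"""\
<urlset xmlns="{NS}">
  <url><loc>https://site.com/about</loc></url>
  <url><loc>https://site.com/admin/dashboard</loc></url>
</urlset>"""

cleaned = remove_from_sitemap(sitemap, "https://site.com/admin/dashboard")
print("admin/dashboard" in cleaned)  # False
```

After writing the cleaned XML back, resubmit the sitemap in Google Search Console so Google picks up the change.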
Real-World Examples¶
Example 1: Admin Login Page 🔒¶
URL: /admin/login
Conflict:

- ❌ Blocked by robots.txt: /admin/
- ⚠️ Footer has an "Admin Login" link on every page

Should it be public? NO 🚫

Fix:

- ✅ Remove the "Admin Login" link from the footer
- ✅ Admins can bookmark or type the URL directly
- ✅ Keep the robots.txt block
Why: Admin pages should NOT be in public navigation
Example 2: Product PDF Catalog 📄¶
URL: /catalog.pdf
Conflict:

- ❌ Blocked by robots.txt: /*.pdf$
- ⚠️ In sitemap.xml
- ⚠️ Linked from 10 product pages

Should it be public? YES ✅ (it's a valuable resource customers need)

Fix:

- ✅ Remove the /*.pdf$ rule from robots.txt, OR add `Allow: /catalog.pdf`
- ✅ Keep it in the sitemap
- ✅ Keep the internal links
Why: PDFs can rank in Google and drive traffic
Example 3: Staging/Test Pages 🧪¶
URL: /staging/test-page
Conflict:

- ❌ Blocked by robots.txt: /staging/
- ⚠️ In sitemap (accidentally)

Should it be public? NO 🚫 (it's a work-in-progress test page)

Fix:

- ✅ Remove it from sitemap.xml
- ✅ Remove any internal links
- ✅ Keep the robots.txt block
- ✅ Better: use password protection instead
Why: Test pages should never be discovered by search engines
Example 4: "Thank You" Pages 🙏¶
URL: /thank-you-download
Conflict:

- ❌ Blocked by robots.txt: /thank-you*
- ⚠️ Has 5 internal links from blog posts

Should it be public? MAYBE 🤔

Two schools of thought:

Option A: Keep it private

- Only accessible after form submission
- Remove it from navigation
- Keep the robots.txt block

Option B: Make it public

- Can rank for "download X" searches
- Shows social proof
- Remove the robots.txt block
Common Mistakes to Avoid¶
Mistake 1: Blocking Everything by Accident¶
Bad robots.txt:

```
User-agent: *
Disallow: /    # BLOCKS THE ENTIRE SITE!
```
Result: Your entire website disappears from Google
How to avoid: Test your robots.txt carefully before deploying
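One cheap way to test for this mistake before deploying: if the homepage itself is disallowed for all crawlers, something is almost certainly wrong. A small sketch using the standard library, with `example.com` as a placeholder domain:

```python
from urllib.robotparser import RobotFileParser

def blocks_whole_site(robots_lines):
    """True if even the homepage is disallowed for every crawler."""
    rp = RobotFileParser()
    rp.parse(robots_lines)
    return not rp.can_fetch("*", "https://example.com/")

print(blocks_whole_site(["User-agent: *", "Disallow: /"]))  # True
print(blocks_whole_site(["User-agent: *", "Disallow:"]))    # False
```

A check like this makes a good pre-deploy step in CI: fail the build if the new robots.txt would block the homepage.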
Mistake 2: Blocking CSS and JavaScript¶
Bad:

```
Disallow: /css/
Disallow: /js/
Disallow: /images/
```
Result: Google can't render your pages properly
Fix: NEVER block CSS, JS, or images
Mistake 3: Using Robots.txt for Security¶
Bad idea:

```
Disallow: /admin/
```

(thinking this protects admin pages)
Reality: Anyone can still access /admin/ by typing URL. Robots.txt is PUBLIC - anyone can read it!
Fix: Use proper authentication (passwords, login required)
Mistake 4: Forgetting to Update Sitemap¶
You block pages in robots.txt but forget to remove from sitemap = Creates conflicts!
Fix: When you block pages, also update your sitemap
Mistake 5: Over-Blocking¶
Bad:

```
Disallow: /products/    # Blocks ALL products!
```

(You only wanted to block /products/test/)

Fix: be specific with your rules:

```
Disallow: /products/test/    # Only blocks test products
```
Quick Fix Checklist¶
For EACH conflict found:

- [ ] Step 1: Decide - should this page be public or private?
- [ ] Step 2: If PUBLIC:
    - [ ] Remove the robots.txt block (or add an Allow exception)
    - [ ] Keep it in the sitemap ✅
    - [ ] Keep internal links ✅
- [ ] Step 3: If PRIVATE:
    - [ ] Remove it from sitemap.xml
    - [ ] Remove internal links (especially header/footer)
    - [ ] Keep the robots.txt block ✅
    - [ ] Consider adding password protection
- [ ] Step 4: Test your changes
    - [ ] Check robots.txt at yoursite.com/robots.txt
    - [ ] Verify sitemap.xml is updated
    - [ ] Check that unwanted links are removed
- [ ] Step 5: Submit to Google (optional but recommended)
    - [ ] Go to Google Search Console
    - [ ] Submit the updated sitemap
    - [ ] Request reindexing for the changed pages

Estimated time: 30-60 minutes to fix all conflicts
How to Test Robots.txt¶
Before making changes, TEST first!
Method 1: Google Search Console (Free Tool)¶
- Go to: search.google.com/search-console
- Select your website
- Go to: Settings → robots.txt report (the old standalone robots.txt Tester has been retired; this report shows the file Google last fetched and any errors)
- Use the URL Inspection tool to check whether a specific URL is blocked
Method 2: Manual Check¶
- Visit: yoursite.com/robots.txt
- Check if your URLs match the Disallow rules
- Remember: `Disallow: /admin/` blocks /admin/anything
Method 3: Online Tools¶
Search for: "robots.txt tester" (many free tools available)
What to Expect After Fixing¶
Timeline¶
| Timeframe | What Happens |
|---|---|
| Immediate | Changes take effect once robots.txt is updated |
| 1-2 weeks | Google recrawls and notices changes |
| 2-4 weeks | Indexing reflects your updates |
What you'll see¶
- ✅ No more conflicts in this report
- ✅ Pages you unblocked start appearing in Google
- ✅ Cleaner site architecture
- ✅ Better crawl efficiency
Monitor¶
- Google Search Console → Coverage (Page indexing) report
- Check for "Blocked by robots.txt" warnings
- Verify intended pages are indexed
FAQ: Common Questions¶
Q: "Do I need a robots.txt file?"
A: Not required, but recommended. If you don't have one, Google can crawl everything. Better to have one that explicitly allows all:
```
User-agent: *
Disallow:
```
Q: "Will robots.txt protect my private pages?"
A: NO! Anyone can still access blocked pages by typing the URL. Robots.txt only stops search engines from CRAWLING, not ACCESSING. Use password protection for actual security.
Q: "I fixed conflicts but still see them in the report. Why?"
A: This report shows a snapshot from your last crawl. Run a new crawl to see updated results.
Q: "Can I block specific search engines (like Bing but not Google)?"
A: Yes! Use separate User-agent groups; each bot follows the most specific group that names it:

```
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow: /
```

This example allows Google everywhere but blocks Bing from the whole site.
Q: "I accidentally blocked my whole site! How do I fix it?"
A: Edit robots.txt immediately:
- Change: `Disallow: /` (blocks everything)
- To: `Disallow:` (blocks nothing)
- Then request re-indexing in Google Search Console.
Further Learning¶
Want to learn more? Search for:

- "robots.txt tutorial"
- "How to create robots.txt"
- "robots.txt best practices"
- "Google robots.txt tester"
Official documentation:

- Google: developers.google.com/search/docs/crawling-indexing/robots/
- Robots Exclusion Protocol: robotstxt.org