As an online entrepreneur or blogger, you're facing a new challenge:
Web crawlers from OpenAI, Anthropic, or Google are searching the web and collecting training data for LLMs and other AI models.
Your valuable blog posts, which you created with great effort, could be used without your knowledge or consent as raw material for the texts that ChatGPT & Co. generate.
This can not only violate your copyrights but also jeopardize your competitive position. Somewhat unsettling, isn't it?
Perhaps you're already asking yourself: How can I protect my work? How do I prevent my content from being used for AI training without my consent?
No problem!
In this article, I'll show you simply and step by step how to configure your robots.txt to protect your content.
- Create or edit your robots.txt to block specific AI crawlers like GPTBot, ClaudeBot, and Google-Extended
- Use selective blocking to protect only certain areas while keeping others accessible
- Test your configuration with Google Search Console and combine robots.txt with additional protection measures for maximum security
1. Preparation
Before we get started protecting your website from curious AI crawlers, you need to make a few preparations. Don't worry, it's easier than you might think!
Access to the web server
First, you need access to your web server. This sounds technical but is often just a login to your hosting account.
If you're using WordPress, you can access your files directly via FTP or the File Manager Plugin.
Backup of existing robots.txt
Safety first! If you already have a robots.txt file, make sure to create a copy. This way you can always revert to the old version in case of emergency:
- Find the robots.txt file in your website's root directory
- Download it to your computer or copy the content into a text document (or script the download, as shown right after this list)
- Store this backup in a safe place
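If you're comfortable running a small script, you can also automate the backup. Here's a minimal sketch in Python using only the standard library; "www.yourwebsite.com" is a placeholder you'd replace with your own domain:

# backup_robots.py - download the current robots.txt before editing it
# Note: the domain below is a placeholder; replace it with your own site.
from datetime import date
from urllib.request import urlopen

url = "https://www.yourwebsite.com/robots.txt"

# Fetch the live file from your site
with urlopen(url) as response:
    content = response.read().decode("utf-8")

# Save a dated copy next to the script
backup_name = f"robots-backup-{date.today()}.txt"
with open(backup_name, "w", encoding="utf-8") as f:
    f.write(content)

print(f"Saved {len(content)} characters to {backup_name}")

That way you get a dated copy you can restore at any time.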
2. Creating/Editing robots.txt
You don't need to be a programming genius to create or edit your robots.txt file.
Only a few steps are required:
2.1 Opening or Creating the File
First, you need to check if a robots.txt already exists on your website. There's a simple trick for this:
- Open your browser
- Enter your domain followed by "/robots.txt" (e.g., www.yourwebsite.com/robots.txt)
- Do you see text? Great, the file already exists. If not, we'll create a new one. (A small script for this check follows right after this list.)
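If you'd rather check from the command line than in the browser, the same test can be scripted. A minimal Python sketch; again, the domain is a placeholder:

# has_robots.py - check whether a robots.txt already exists
from urllib.error import HTTPError
from urllib.request import urlopen

url = "https://www.yourwebsite.com/robots.txt"  # placeholder domain

try:
    with urlopen(url) as response:
        print("robots.txt found - first lines:")
        print(response.read().decode("utf-8")[:300])
except HTTPError as err:
    if err.code == 404:
        print("No robots.txt yet - you'll need to create one.")
    else:
        print(f"The server answered with HTTP {err.code}.")

A 404 means there's no file yet and you can move on to creating one.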
If you need to create a new file:
- Open a simple text editor (Notepad, TextEdit, etc.)
- Create a new, empty document
- Save it as "robots.txt" (note: all lowercase, and make sure your editor doesn't quietly append a second extension so you end up with something like robots.txt.txt)
2.2 Setting Up the Basic Structure
The robots.txt follows a specific syntax (structure). Here are the basics:
User-agent: [Name of the crawler]
Disallow: [Path to be blocked]

For starters, you could write something like this:

User-agent: *
Disallow:

This means: all crawlers (*) may crawl everything (the empty "Disallow"). This is our starting point, from which we'll customize the file further.
3. Blocking Specific AI Crawlers
To block common AI crawlers, you need to add the following blocks to your robots.txt:
OpenAI (ChatGPT)
OpenAI has a total of three different crawlers that serve different functions: GPTBot collects training data, ChatGPT-User fetches pages on behalf of ChatGPT users, and OAI-SearchBot indexes content for ChatGPT's search results. To protect your content as effectively as possible, you should exclude all of them:
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: GPTBot
Disallow: /

Anthropic (Claude)
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /

Google (Bard/Gemini)
User-agent: Google-Extended
Disallow: /

Common Crawl
User-agent: CCBot
Disallow: /

Perplexity
User-agent: PerplexityBot
Disallow: /

Meta AI / Facebook
User-agent: FacebookBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Meta-ExternalFetcher
Disallow: /

Webz.io
User-agent: OmgiliBot
Disallow: /

Cohere
User-agent: cohere-ai
Disallow: /

5. Selective Blocking
Sometimes you don't want to completely lock out AI crawlers, but only protect certain areas of your website.
No problem!
Blocking specific directories/pages for AI crawlers
If you have an area with exclusive content, you can exclude this from crawlers with the following code:
User-agent: GPTBot
Disallow: /exclusive/
User-agent: anthropic-ai
Disallow: /premium-content/

In this example, you're blocking GPTBot from your "/exclusive/" directory and Anthropic's crawler from "/premium-content/".
Defining exceptions
Sometimes you might want to block most of your site but make certain areas accessible to AI crawlers. Here's an example:
User-agent: GPTBot
Disallow: /
Allow: /blog/
User-agent: anthropic-ai
Disallow: /
Allow: /public/

In this case, you first block everything with Disallow: /, then open specific areas back up with Allow. Most major crawlers apply the most specific (longest) matching rule, so the Allow path wins over the blanket Disallow: /.
As a result, GPTBot may only crawl your blog, while Anthropic's crawler may only access the public area.
6. Verification and Testing
Everything set up? Great!
But before you sit back, you should make sure your robots.txt is actually doing what it's supposed to.
Google provides you with a helpful tool for this: the robots.txt report in Google Search Console (it replaced the older robots.txt Tester).
There you can see whether Google can fetch your robots.txt properly and whether it reports any errors.
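If you'd like a second opinion outside of Search Console, Python's built-in robots.txt parser can tell you how a given crawler would interpret your live file. A small sketch, assuming you've blocked the crawlers from section 3 and using a placeholder domain:

# check_robots.py - see which crawlers your live robots.txt blocks
from urllib import robotparser

site = "https://www.yourwebsite.com"  # placeholder domain

parser = robotparser.RobotFileParser()
parser.set_url(f"{site}/robots.txt")
parser.read()  # fetches and parses the live file

# AI crawlers we expect to be blocked, plus Googlebot, which should stay allowed
agents = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "PerplexityBot", "Googlebot"]

for agent in agents:
    allowed = parser.can_fetch(agent, f"{site}/")
    print(f"{agent:20} {'allowed' if allowed else 'blocked'}")

If an AI crawler shows up as "allowed" here even though you blocked it, double-check the spelling of the corresponding User-agent line in your robots.txt.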