The rise of Large Language Models (LLMs) in 2026 has changed the rules of web crawling forever. In the past, we wanted every bot to index our site. Today, "Greedy" AI crawlers are hitting websites with unprecedented aggression, scraping data to train models without providing any value back to the creator. If you are a website owner, you need to decide: are you a partner to these AI companies, or are you just their free data source?

AI Crawlers: Should You Allow or Block These Bots?
Initially, I kept Bot Fight Mode enabled in Cloudflare; however, I later noticed that it was inadvertently blocking several beneficial bots, such as SSL verifiers, so I disabled it. One day, I noticed an overwhelming volume of incoming requests on our server, with OpenAI alone generating 2 to 3 requests every single second. This activity had been going on for several hours, so I immediately stepped in to halt it. By then, the bot had already been sending roughly 80,000 requests per day for three to four consecutive days.
Since then, I have used custom WAF rules with custom expressions; I analyze the logs myself and act based on a careful assessment of which requests are beneficial and which are harmful. I am sharing some of those experiences here.
The Illusion of robots.txt
Many beginners believe that adding a "Disallow" line to their robots.txt file is enough. In practice, this is a mistake. Aggressive bots like Bytespider (ByteDance) or ClaudeBot (Anthropic) often ignore these directives, and even when they do honor them, they still hit your server to read the file first. This consumes your bandwidth, fills your server's request quota, and uses up CPU resources. In 2026, relying on robots.txt is like leaving your door unlocked and hanging a "Please don't enter" sign.
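For reference, this is roughly what that polite request looks like. It is a minimal sketch; the user-agent tokens below are the commonly documented ones, and a crawler that chooses to ignore robots.txt will simply read this file and keep crawling anyway:

```
# robots.txt - a request, not an enforcement mechanism
# Tokens shown are the commonly documented ones; verify against your own logs
User-agent: Bytespider
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /
```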
Meet the Aggressive Scrapers: Bytespider, Meta, and Beyond
Through real-world log analysis on my own sites, I have identified several bots that are particularly aggressive. These entities crawl thousands of pages per day, scraping every bit of text and clinical data they can find (their typical user-agent tokens are listed after the bullets below).
- Bytespider: Owned by ByteDance (TikTok), this is perhaps the most aggressive scraper. It rarely gives citations and is purely focused on data harvesting for AI training.
- ClaudeBot: While Anthropic claims to be "AI Safety" focused, ClaudeBot hits servers hard. It might provide citations in its chat interface, but the "Zero-Click" nature of AI means users rarely click through to your site.
- Meta-Webindexer and Meta-ExternalAgent: Meta’s bots are massive. They scrape content to fuel Meta AI across Facebook, Instagram, and WhatsApp, often without direct traffic benefits to you.
- OpenAI and Amazonbot: These bots are constantly scanning. While search-focused crawling might occasionally lead to discovery, training crawlers like GPTBot are simply building a database that replaces the need for your website.
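For quick reference, these are the user-agent substrings I filter on when these crawlers show up in my logs. Treat the exact strings as assumptions to verify against your own log entries, since vendors rename their bots from time to time:

```
Bytespider            (ByteDance)
ClaudeBot             (Anthropic)
meta-externalagent    (Meta)
meta-webindexer       (Meta)
GPTBot                (OpenAI)
Amazonbot             (Amazon)
```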
The "Citation" Trap: Why It Isn't Enough
A common argument is that we should allow these bots because they "cite" our websites. In practice, these citations are often buried at the bottom of a generated response. If an AI provides a full answer based on your research or your poetry, the user has no reason to visit your blog. You lose the ad revenue, you lose the affiliate click, and you lose the loyal reader. You are essentially paying the server costs while the AI company takes the profit.
Practical Guide: Block at the Edge (Cloudflare WAF)
The only effective way to manage these crawlers is to stop them before they ever reach your origin server. This is what I do, and what I strongly recommend you do as well: block them at the Cloudflare edge.
By using Cloudflare WAF (Web Application Firewall) rules, you can identify these bots by their User-Agent strings. When a bot attempts to connect, Cloudflare sees the "ClaudeBot" or "Bytespider" token and issues a "Managed Challenge" or an outright "Block." A sample rule expression follows the list below.
- Zero Server Load: The request is killed at the Cloudflare server. Your actual website server never even knows the bot tried to visit.
- Preserve Quotas: This prevents your hosting plan’s request limits from being exhausted by non-human traffic.
- Total Control: You can choose to allow "Good" search bots like Googlebot and Bingbot while keeping "Greedy" AI scrapers out.
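As a starting point, a single custom WAF rule can cover all of these scrapers. Below is a minimal sketch of such a rule expression, assuming Cloudflare's standard http.user_agent field and the user-agent tokens listed earlier; you would set the rule's action to "Managed Challenge" or "Block" in the dashboard:

```
(http.user_agent contains "Bytespider") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "meta-externalagent") or
(http.user_agent contains "GPTBot") or
(http.user_agent contains "Amazonbot")
```

Note that the matching here is case-sensitive, so copy the casing exactly as it appears in your logs. Googlebot and Bingbot never match any of these substrings, so legitimate search traffic passes through untouched.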
My Practical Recommendation
Don't be a passive observer. In 2026, your content is your currency. If an AI company wants your data to build a multi-billion dollar model, they should at least respect your server's resources. Use the "Power of Three" custom WAF strategy to identify these bots by their digital signatures and block them proactively. Your server speed will improve, your data will stay yours, and you will finally have a clean, human-focused traffic log.
In my experience, the decision of which AI crawlers to allow, and which to block, should rest on how rapidly they crawl your pages, how many citations they generate, and how much click-through traffic they actually drive. If a crawler hammers your site yet sends no traffic back, what is the point of allowing it? It merely places an unnecessary load on the server and consumes valuable resources for no productive gain.
In the next post, we will look at specific WAF code snippets you can copy and paste to stop these AI scrapers instantly.