First, these bots showed up in our top 15 user agents on Cloudflare:
> ImagesiftBot - Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com)
>
> heritrix - Mozilla/5.0 (compatible; heritrix/3.4.0 +https://www.kb.dk/netarkivindsamling/) Firefox/57
>
> Bytespider - Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.0.0 Safari/537.36
>
> AhrefsBot - Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
>
> MJ12bot - Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)
>
> Barkrowler - Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)
Then we tried blocking them all via robots.txt; now we have these:
> Mozilla/5.0 (compatible; DataForSeoBot/1.0;
> +https://dataforseo.com/dataforseo-bot)
>
> Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)
> AppleWebKit/537.36
> (KHTML, like Gecko) Chrome/70.0.0.0 Safari/537.36
>
> Mozilla/5.0 (compatible; SemrushBot/7~bl;
> +http://www.semrush.com/bot.html)
>
> Mozilla/5.0 (compatible; DotBot/1.2;
> +https://opensiteexplorer.org/dotbot; help@moz.com)
>
> Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko;
> compatible; bingbot/2.0;
> +http://www.bing.com/bingbot.htm)
> Chrome/116.0.1938.76 Safari/537.36
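For reference, blocking by user agent in robots.txt looks roughly like this (abbreviated sketch; the `User-agent` tokens are taken from the first list, and each bot's site documents the exact token it matches on):

```
# Block each crawler by its documented User-agent token
User-agent: ImagesiftBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: Barkrowler
Disallow: /
```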
I see that some bigger sites only allow-list the specific agents they know and then disallow everything else, but some of these bots' websites say they disregard any rule that is not a disallow targeted at their specific user agent.
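The allow-list approach those bigger sites use looks roughly like this (a sketch; Googlebot and bingbot here are just examples of agents you might keep, and an empty `Disallow:` means "allow everything"):

```
# Allow known-good crawlers, disallow everyone else
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: *
Disallow: /
```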
Mostly I'd love a solution where I wouldn't have to monitor and manually block things all the time.
Additionally, some bots like Bytespider don't care about robots.txt at all, so they just spam the site no matter what.
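For those, the block has to happen at the server or CDN level rather than in robots.txt — e.g. a Cloudflare WAF custom rule matching on `http.user_agent`, or in nginx something like this (a sketch; the token list is illustrative and would need extending as new bots appear):

```nginx
# Return 403 to crawlers that ignore robots.txt, matched by
# User-Agent substring (case-insensitive regex).
if ($http_user_agent ~* "(Bytespider|MJ12bot|DotBot|Barkrowler)") {
    return 403;
}
```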