Help reduce high bandwidth caused by numerous bots
# help-with-other
j
We have a site with 9 country domains. In our Google Analytics we see 100-200 visitors a day across all 9, but tens of thousands of requests from a long list of bots/crawlers. We have tried disallowing several of them in our robots.txt, but new ones keep popping up, and some of them also completely ignore robots.txt. Has anyone else experienced something similar, and does anyone have a good way to prevent these bots from hitting your site? PS: This is a Cloud site, so the process of reaching out to support, getting a list of user agents, deploying a robots.txt disallow rule based on that, and then waiting for more data is very slow and frustrating.
a
Is this search engine crawlers, or spam bots checking various suspicious URLs?
j
First we had this list in the top 15 user agents on Cloudflare:

> ImagesiftBot - Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com)
> heritrix - Mozilla/5.0 (compatible; heritrix/3.4.0 +https://www.kb.dk/netarkivindsamling/) Firefox/57
> Bytespider - Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.0.0 Safari/537.36
> AhrefsBot - Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
> MJ12bot - Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)
> Barkrowler - Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)

Then we tried blocking them all via robots.txt; now we have these:

> Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot)
> Mozilla/5.0 (compatible; Bytespider; spiderfeedback@bytedance.com) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.0.0 Safari/537.36
> Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
> Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)
> Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36

I see that some bigger sites only have specific allow lists for agents they know and then disallow everything else, but I can also see on some of these bots' websites that they disregard anything that is not a specific disallow for their own user agent. Mostly I'd love a solution where I wouldn't have to monitor and manually block things all the time. Additionally, some bots like Bytespider don't care about robots.txt at all, so they just spam the site no matter what.
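For reference, a robots.txt targeting crawlers like these — assuming, as some of the bots' sites state, that each crawler only honors a Disallow placed under its own exact User-agent token rather than a wildcard — might look like this (bot names taken from the list above):

```
User-agent: AhrefsBot
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: Barkrowler
Disallow: /
```

This only helps with crawlers that honor robots.txt at all; bots like Bytespider that ignore it need server-side or edge blocking instead.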
u
Are you configuring this manually or using Cloudflare's bot protection? I don't know much about it, other than we have it configured at work and I've never heard of anyone having to update it or it not doing what it's meant to do. We have an enterprise account, so that might be a factor, but it is available at different account levels
j
We don't have our own Cloudflare, we are hosting on Umbraco Cloud which has CF automatically, and all our knowledge of this is based on their support answering my questions. It is looking more and more like adding our own CF in front of Umbraco's may be the move though..
a
I remember the kb.dk one being quite bad. I ended up blocking their IP addresses, and then opened up again when they promised to not flood our client's site with way too many requests over a short time ... which they then of course did once more 🤦‍♂️
j
Yep that one also doesn't care about robots.txt and had more requests than all our GA users on its own. It was also only the 3rd worst bot 😬
a
d
That's quite interesting. We also have a client where this is happening right now, and we're unsure if it's causing some of the issues we're having.
j
The more I look into it the more outright blocking them via IIS Rewrites seem to be the answer: https://stackoverflow.com/questions/44862158/how-to-block-bots-with-iis
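The linked approach comes down to a URL Rewrite rule in Web.config that matches on the User-Agent header. A minimal sketch, assuming the IIS URL Rewrite module is available and using bot names mentioned in this thread:

```xml
<configuration>
  <system.webServer>
    <rewrite>
      <rules>
        <!-- Abort any request whose User-Agent matches a known bad bot -->
        <rule name="BlockBadBots" stopProcessing="true">
          <match url=".*" />
          <conditions>
            <add input="{HTTP_USER_AGENT}"
                 pattern="Bytespider|MJ12bot|AhrefsBot|DotBot|heritrix" />
          </conditions>
          <!-- AbortRequest drops the connection without a response;
               a CustomResponse action returning 403 also works -->
          <action type="AbortRequest" />
        </rule>
      </rules>
    </rewrite>
  </system.webServer>
</configuration>
```

Since IIS evaluates this before the application runs, it is cheaper than filtering in application code, though the request still reaches your server, unlike an edge block.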
a
Just looked back at an old email. One time back in 2022 when they crawled our client's site, they made a bit over 26k requests the first day, and a bit over 45k requests the second day. The peak the second day was 794 requests in a single minute 😱 🙃
@Jemayn for a non-headless site, yes.
The earlier you can block the request, the less it will require of the server. But I haven't really worked a lot with Cloud, so not sure if you can modify the Web.config file
j
You can, so it's down to either that or setting up our own Cloudflare in front, as they have a bot-blocking capability, which is probably better and periodically updated. On the other hand it will cost subscription money and require the client to move their nameservers to CF, so it's a lot more "involved"..
a
Yeah, we're looking into some of Azure's products for the same purpose
I think most of our clients won't be too happy about moving their DNS to CF 🤔
u
do you have to move your DNS to Cloudflare? I thought you only had to do that on their free tier.
j
Don't think I've gone beyond the free tier before, so you may be right 🤔
s
Apparently you are required by law to let the kb.dk bot crawl your site 🙈 https://pro.kb.dk/pligtaflevering/digital-pligtaflevering
r
For CloudFlare free tier you get 5 custom WAF rules you can use to stop bots. It has an “or” or “contains in list” functionality so you can sweep up a lot in a single rule. You can also set it to hard block / redirect / have a JS captcha style preloader.
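One such custom WAF rule can sweep up several crawlers in a single expression using the Rules language's `contains` operator on the user agent. A sketch, using bot names mentioned earlier in the thread:

```
(http.user_agent contains "Bytespider") or
(http.user_agent contains "MJ12bot") or
(http.user_agent contains "AhrefsBot") or
(http.user_agent contains "DotBot")
```

Paired with the Block action this returns an error page at the edge, so the request never reaches the origin; Managed Challenge is the gentler, captcha-style option mentioned above.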
n
Also watch out for AI bots. We had a site recently that was being hammered by AI bots; Cloudflare bot protection helped with that