
Anubis is awesome! Stopping (AI)crawlbots (lemmy.librebun.com)

submitted 11 months ago* (last edited 11 months ago) by sailorzoop@lemmy.librebun.com to c/selfhosted@lemmy.world


Incoherent rant.

I've, once again, noticed Amazon and Anthropic absolutely hammering my Lemmy instance to the point of the lemmy-ui container crashing. Multiple IPs all over the US.

So I've decided to do some restructuring of how I run things. Ditched Fedora on my VPS in favour of Alpine, just to start with a clean slate. And started looking into different options on how to combat things better.

Behold, Anubis.

"Weighs the soul of incoming HTTP requests to stop AI crawlers"

As I understand it, it works like a reverse proxy in front of each service. It took me a while to actually understand how it's supposed to integrate, but once I figured it out all bot activity instantly stopped. Not a single one has gotten through yet.

My setup is basically just a home server -> tailscale tunnel (not funnel) -> VPS -> caddy reverse proxy, now with anubis integrated.

I'm not really sure why I'm posting this, but I hope at least one other goober trying to find a possible solution to these things finds this post.

Anubis Github, Anubis Website

Edit: Further elaboration for those who care, since I realized that might be important.

You don't have to use caddy/nginx/whatever as your reverse proxy in the first place, it's just how my setup works.
My Anubis sits between my local server and Caddy, inside the Caddy reverse proxy docker compose stack. So when a request is made, Caddy forwards it to Anubis as configured in its Caddyfile, and Anubis decides whether to pass the request on to the service or stop it in its tracks.
There are some minor issues, like it requiring javascript enabled, which might get a bit annoying for NoScript/Librewolf/whatever users, but considering most crawlbots don't do js at all, I believe this is a great tradeoff.
The most confusing part was the docs and understanding what it's supposed to do in the first place.
There's an option to apply your own rules via json/yaml, but I haven't figured out how to do that properly in docker yet. As in, there's a main configuration file you can override, but there's apparently also a way to add additional bots to block in separate files in a subdirectory. I'm sure I'll figure that out eventually.
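
For anyone wondering why JavaScript is required: my understanding (an assumption, not something stated in the post) is that the browser has to solve a small SHA-256 proof-of-work puzzle before the request is let through. The sketch below is a generic illustration of that idea, not Anubis's actual code; the challenge string, the difficulty value and the function names are all made up.

```python
import hashlib
from itertools import count


def digest_for(challenge: str, nonce: int) -> str:
    """Hex SHA-256 of the challenge string followed by the nonce."""
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's answer."""
    return digest_for(challenge, nonce).startswith("0" * difficulty)


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force nonces until the digest has enough leading zeros."""
    for nonce in count():
        if verify(challenge, nonce, difficulty):
            return nonce


challenge = "example-challenge"  # hypothetical value issued by the proxy
difficulty = 3                   # hypothetical; each extra zero costs ~16x more work
nonce = solve(challenge, difficulty)
print(nonce, verify(challenge, nonce, difficulty))
```

The asymmetry is the point: solving takes thousands of hashes, checking takes one, which is cheap for a real visitor but adds up for a crawler requesting every page.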

Edit 2 for those who care: Well crap, turns out lemmy-ui crashing wasn't due to crawlbots, but something else entirely.
I've just spent maybe 14 hours troubleshooting this thing, since after a couple of minutes of running, lemmy-ui container healthcheck would show "unhealthy" and my instance couldn't be accessed from anywhere (lemmy-ui, photon, jerboa, probably the api as well).
After some digging, I disabled anubis to check if it had anything to do with it. It didn't. But I also noticed my host ulimit -n was set to like 1000... (I've been on the same install for years and swear an update must have changed it)
After changing ulimit -n (nofile) and setting shm_size to 2G in docker compose, it hasn't crashed yet. Fingers crossed.
Boss, I'm tired and I want to get off Mr. Bones' wild ride.
I'm very sorry for not being able to reply to you all, but it's been hectic.
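
If you want to check the open-file limit described above without remembering the shell incantation, here is a small sketch using Python's standard `resource` module (Unix only). It reports the limit of the process it runs in, so run it on the host or inside the container you care about. The 4096 threshold is an arbitrary assumption of mine, not a recommended value.

```python
import resource


def nofile_limits() -> tuple[int, int]:
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def looks_low(soft: int, threshold: int = 4096) -> bool:
    """Flag soft limits below an arbitrary threshold; unlimited never counts as low."""
    return soft != resource.RLIM_INFINITY and soft < threshold


soft, hard = nofile_limits()
print(f"nofile soft={soft} hard={hard} low={looks_low(soft)}")
```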

Cheers and I really hope someone finds this as useful as I did.


[–] blob42@lemmy.ml 11 points 11 months ago* (last edited 11 months ago) (1 children)

I am planning to try it out, but for caddy users I came up with a solution that works after being bombarded by AI crawlers for weeks.

It is a custom caddy CEL expression filter coupled with caddy-ratelimit and caddy-defender.

Now here's the fun part: the defender plugin can respond with garbage, so when a request matches an AI crawler it will poison their training dataset.

Originally I only relied on the rate limiter and noticed that AI bots kept trying whenever the limit was reset. Once I introduced data poisoning they all stopped :)

git.blob42.xyz {
    @bot <<CEL
        header({'Accept-Language': 'zh-CN'}) || header_regexp('User-Agent', '(?i:(.*bot.*|.*crawler.*|.*meta.*|.*google.*|.*microsoft.*|.*spider.*))')
    CEL

    abort @bot

    defender garbage {
        ranges aws azurepubliccloud deepseek gcloud githubcopilot openai 47.0.0.0/8
    }

    rate_limit {
        zone dynamic_botstop {
            match {
                method GET
                # to use with defender
                #header X-RateLimit-Apply true
                #not header LetMeThrough 1
            }
            key {remote_ip}
            events 1500
            window 30s
            #events 10
            #window 1m
        }
    }

    reverse_proxy upstream.server:4242

    handle_errors 429 {
        respond "429: Rate limit exceeded."
    }
}
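
To get a feel for what the @bot matcher in the config above catches, here is a rough Python approximation. It reuses the same User-Agent regex and assumes the Accept-Language check is an exact comparison against "zh-CN"; Caddy evaluates these with its own matcher and regex engine, so treat this as a sketch. The sample User-Agent strings are made up for illustration.

```python
import re

# Same pattern as in the Caddyfile matcher, compiled with Python's re module.
BOT_UA = re.compile(
    r"(?i:(.*bot.*|.*crawler.*|.*meta.*|.*google.*|.*microsoft.*|.*spider.*))"
)


def is_blocked(user_agent: str, accept_language: str = "") -> bool:
    """Approximate the @bot matcher: Accept-Language rule OR User-Agent regex."""
    return accept_language == "zh-CN" or BOT_UA.search(user_agent) is not None


samples = [
    ("Mozilla/5.0 (compatible; ExampleBot/1.0)", ""),             # hypothetical crawler
    ("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0", ""),        # ordinary browser
    ("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0", "zh-CN"),   # browser, matched by language
    ("curl/8.5.0", ""),                                           # script with no bot-like name
]
for user_agent, language in samples:
    print(is_blocked(user_agent, language), user_agent, language)
```

Note that the last sample is not matched: a filter like this only catches clients that identify themselves, which is why it is paired with the rate limiter and the defender ranges.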

If I am not mistaken the 47.0.0.0/8 ip block is for Alibaba cloud

[–] azertyfun@sh.itjust.works 0 points 11 months ago (1 children)

If I am not mistaken the 47.0.0.0/8 ip block is for Alibaba cloud

That's an ARIN block according to Wikipedia, so North America, under Northern Telecom until 2010. It does look like Alibaba operate many networks under that /8, but I very much doubt it's the whole /8, which would be worth a lot; a /16 is apparently worth around $3-4M, so a /8 (256 of those) can be extrapolated to be worth around a billion dollars! I doubt they put all their eggs into that particular basket. So you're probably matching a lot of innocent North American IPs with this.
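
The extrapolation above is easy to check with Python's standard `ipaddress` module. The $3-4M per /16 figure is taken from the comment as-is; I have not verified it.

```python
import ipaddress

block = ipaddress.ip_network("47.0.0.0/8")

# How much address space a blanket /8 ban covers.
num_addresses = block.num_addresses
num_slash16 = sum(1 for _ in block.subnets(new_prefix=16))

# Price range per /16 as quoted in the comment (unverified).
low_per_16, high_per_16 = 3_000_000, 4_000_000
estimate = (num_slash16 * low_per_16, num_slash16 * high_per_16)

print(f"{num_addresses:,} addresses, {num_slash16} /16 blocks")
print(f"estimated value: ${estimate[0]:,} to ${estimate[1]:,}")
```

So the ban covers about 16.8 million addresses, and the price extrapolation lands between roughly $0.77B and $1.02B.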

[–] blob42@lemmy.ml 2 points 11 months ago (1 children)

Right, I must have just blanket banned the whole /8 to be sure Alibaba cloud is included. Did it some time ago, so I forgot.

[–] Cozog@feddit.dk 2 points 11 months ago (1 children)

When I blocked Alibaba, the AI crawlers immediately started coming from a different cloud provider (Huawei, I believe), and when I blocked that, it happened again. Eventually the crawlers started coming from North American and then European cloud providers.

Due to lack of time to change my setup to accommodate Anubis, I had to temporarily move my site behind Cloudflare (where it sadly still is).

[–] blob42@lemmy.ml 2 points 11 months ago

We need a decentralized, community-owned Cloudflare alternative. Anubis looks to be on the right track.