this post was submitted on 12 Jul 2025
577 points (98.0% liked)

Selfhosted

60024 readers
832 users here now

A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don't control.

Rules:

  1. Be civil: we're here to support and learn from one another. Insults won't be tolerated. Flame wars are frowned upon.

  2. No spam.

  3. Posts here are to be centered around self-hosting. Please ensure it is clear in your post how it relates to self-hosting.

  4. Don't duplicate the full text of your blog or git here. Just post the link for folks to click.

  5. Submission headline should match the article title.

  6. No trolling.

Resources:

Any issues on the community? Report it using the report flag.

Questions? DM the mods!

founded 3 years ago
MODERATORS
 

Incoherent rant.

I've, once again, noticed Amazon and Anthropic absolutely hammering my Lemmy instance to the point of the lemmy-ui container crashing. Multiple IPs all over the US.

So I've decided to do some restructuring of how I run things. Ditched Fedora on my VPS in favour of Alpine, just to start with a clean slate. And started looking into different options on how to combat things better.

Behold, Anubis.

"Weighs the soul of incoming HTTP requests to stop AI crawlers"

From how I understand it, it works like a reverse proxy per each service. It took me a while to actually understand how it's supposed to integrate, but once I figured it out all bot activity instantly stopped. Not a single one got through yet.

My setup is basically just a home server -> tailscale tunnel (not funnel) -> VPS -> caddy reverse proxy, now with anubis integrated.

I'm not really sure why I'm posting this, but I hope at least one other goober trying to find a possible solution to these things finds this post.

Anubis Github, Anubis Website

Edit: Further elaboration for those who care, since I realized that might be important.

  • You don't have to use caddy/nginx/whatever as your reverse proxy in the first place, it's just how my setup works.
  • My Anubis sits between my local server and inside Caddy reverse proxy docker compose stack. So when a request is made, Caddy redirects to Anubis from its Caddyfile and Anubis decides whether or not to forward the request to the service or stop it in its tracks.
  • There are some minor issues, like it requiring javascript enabled, which might get a bit annoying for NoScript/Librewolf/whatever users, but considering most crawlbots don't do js at all, I believe this is a great tradeoff.
  • The most confusing part were the docs and understanding what it's supposed to do in the first place.
  • There's an option to apply your own rules via json/yaml, but I haven't figured out how to do that properly in docker yet. As in, there's a main configuration file you can override, but there's apparently also a way to add additional bots to block in separate files in a subdirectory. I'm sure I'll figure that out eventually.

Edit 2 for those who care: Well crap, turns out lemmy-ui crashing wasn't due to crawlbots, but something else entirely.
I've just spent maybe 14 hours troubleshooting this thing, since after a couple of minutes of running, lemmy-ui container healthcheck would show "unhealthy" and my instance couldn't be accessed from anywhere (lemmy-ui, photon, jerboa, probably the api as well).
After some digging, I've disabled anubis to check if that had anything to do with it, it didn't. But, I've also noticed my host ulimit -n was set to like 1000.... (I've been on the same install for years and swear an update must have changed it)
After changing ulimit -n (nofile) and shm_size to 2G in docker compose, it hasn't crashed yet. fingerscrossed
Boss, I'm tired and I want to get off Mr. Bones' wild ride.
I'm very sorry for not being able to reply to you all, but it's been hectic.

Cheers and I really hope someone finds this as useful as I did.

you are viewing a single comment's thread
view the rest of the comments
[–] blob42@lemmy.ml 11 points 11 months ago* (last edited 11 months ago) (1 children)

I am planning to try it out, but for caddy users I came up with a solution that works after being bombarded by AI crawlers for weeks.

It is a custom caddy CEL expression filter coupled with caddy-ratelimit and caddy-defender.

Now here's the fun part, the defender plugin can produce garbage as response so when a matching AI crawler fits it will poison their training dataset.

Originally I only relied on the rate limiter and noticed that AI bots kept trying whenever the limit was reset. Once I introduced data poisoning they all stopped :)

git.blob42.xyz {
    @bot <<CEL
        header({'Accept-Language': 'zh-CN'}) || header_regexp('User-Agent', '(?i:(.*bot.*|.*crawler.*|.*meta.*|.*google.*|.*microsoft.*|.*spider.*))')
    CEL


    abort @bot
    

    defender garbage {

        ranges aws azurepubliccloud deepseek gcloud githubcopilot openai 47.0.0.0/8
      
    }

    rate_limit {
        zone dynamic_botstop {
            match {
                method GET
                 # to use with defender
                 #header X-RateLimit-Apply true
                 #not header LetMeThrough 1
            }
            key {remote_ip}
            events 1500
            window 30s
            #events 10
            #window 1m
        }
    }

    reverse_proxy upstream.server:4242

    handle_errors 429 {
        respond "429: Rate limit exceeded."
    }

}

If I am not mistaken the 47.0.0.0/8 ip block is for Alibaba cloud

[–] azertyfun@sh.itjust.works 0 points 11 months ago (1 children)

If I am not mistaken the 47.0.0.0/8 ip block is for Alibaba cloud

That's an ARIN block according to Wikipedia so North America, under Northen Telecom until 2010. It does look like Alibaba operate many networks under that /8, but I very much doubt it's the whole /8 which would be worth a lot; a /16 is apparently worth around $3-4M, so a /8 can be extrapolated to be worth upwards of a billion dollars! I doubt they put all their eggs into that particular basket. So you're probably matching a lot of innocent North American IPs with this.

[–] blob42@lemmy.ml 2 points 11 months ago (1 children)

Right I must have just blanket banned the whole /8 to be sure alibaba cloud is included. Did some time ago so I forgot

[–] Cozog@feddit.dk 2 points 11 months ago (1 children)

When I blocked Alibaba, the AI crawlers immediately started coming from a different cloud provider (Huawei, I believe), and when I blocked that, it happened again. Eventually the crawlers started coming from North American and then European cloud providers.

Due to lack of time to change my setup to accommodate Anubis, I had to temporarily move my site behind Cloudflare (where it sadly still is).

[–] blob42@lemmy.ml 2 points 11 months ago

We need a decentralized community owned cloudflare alternative. Anubis looks on good track.