this post was submitted on 13 Feb 2026
584 points (98.8% liked)

Selfhosted

I really hope they die soon, this is unbearable…

[–] Ephera@lemmy.ml 18 points 4 months ago (2 children)

My best guess is that they don't just index things, but rather download straight from the internet when they need fresh training data. They can't really cache the whole internet after all...

[–] Techlos@lemmy.dbzer0.com 15 points 4 months ago

Bingo, modern datasets are a list of URLs with metadata rather than the files themselves. Every new team/individual wanting to work with the dataset becomes another DDoS participant.

[–] spicehoarder@lemmy.zip 8 points 4 months ago

The sad thing is that they could cache the whole internet if there was a checksum protocol.

Now that I'm thinking about it, I actually hate the idea that there are several companies out there with graph databases of the entire internet.
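
For anyone wondering what the checksum idea could look like in practice, here is a rough sketch, not anything an existing crawler or dataset is known to do. It assumes each dataset entry carries a SHA-256 of the content next to the URL, so a shared cache can serve the bytes by checksum and only hit the origin server once. `ChecksumCache`, `FAKE_WEB`, and the injected `fetch` function are all made up for illustration.

```python
import hashlib
from typing import Callable, Dict


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


class ChecksumCache:
    """Content-addressed store: bodies are keyed by their SHA-256 digest."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch  # hypothetical downloader, injected by the caller
        self._blobs: Dict[str, bytes] = {}
        self.downloads = 0  # how many times the origin server was hit

    def get(self, url: str, expected_sha256: str) -> bytes:
        # Cache hit: the checksum alone identifies the content, no request needed.
        if expected_sha256 in self._blobs:
            return self._blobs[expected_sha256]
        body = self._fetch(url)
        self.downloads += 1
        # Refuse content that has changed since the dataset entry was written.
        if sha256_hex(body) != expected_sha256:
            raise ValueError(f"content at {url} no longer matches the dataset entry")
        self._blobs[expected_sha256] = body
        return body


# Stand-in for the network (assumption: two fake pages, no real requests).
FAKE_WEB = {
    "https://example.org/a": b"hello",
    "https://example.org/b": b"world",
}

# A "dataset" in the sense described above: URLs plus metadata, here a checksum.
dataset = [(url, sha256_hex(body)) for url, body in FAKE_WEB.items()]

cache = ChecksumCache(fetch=FAKE_WEB.__getitem__)
for _ in range(3):  # three "teams" reading the same dataset
    for url, digest in dataset:
        cache.get(url, digest)

print(cache.downloads)  # 2 downloads instead of 6
```

The catch is that this only helps if the teams actually share the cache and the dataset publishes checksums in the first place, which is the missing protocol part.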