Data centers contain 90% crap data (gerrymcgovern.com)

submitted 14 hours ago by fantawurstwasser@feddit.org to c/technology@lemmy.world

[–] nyan@lemmy.cafe -4 points 14 hours ago (2 children)

Massive deduplication across all accounts on all servers of image, audio, and video data would theoretically be possible, but ain't gonna happen. Or we could just discourage people from posting cat videos and bad memes (even less likely to happen).

[–] Brkdncr@lemmy.world 2 points 12 hours ago (1 children)

Deduplication is trivial when applied at the block level, as long as the data is either unencrypted or encrypted at rest by the storage system itself, so that the storage layer can still recognize identical blocks.
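For anyone who hasn't seen block-level deduplication before, here is a rough sketch of the idea on a single machine. It is only an illustration: the `DedupStore` class, its method names, and the fixed 4096-byte block size are all made up for this example, and real storage systems differ in many details (variable-size chunking, on-disk indexes, reference counting, and so on).

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration


class DedupStore:
    """Hypothetical in-memory store that keeps one copy of each unique block."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}  # SHA-256 hex digest -> block bytes (stored once)
        self.files = {}   # file name -> ordered list of block digests

    def put(self, name, data):
        """Split data into fixed-size blocks and store only unseen blocks."""
        digests = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # setdefault leaves an existing identical block untouched
            self.blocks.setdefault(digest, block)
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        """Reassemble a file from its list of block digests."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        """Bytes physically held, counting each unique block once."""
        return sum(len(block) for block in self.blocks.values())
```

If two files share blocks, `stored_bytes()` grows only by the blocks that have not been seen before. This also suggests why client-side encryption gets in the way: identical plaintext blocks generally no longer produce identical bytes for the storage layer to hash.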

[–] nyan@lemmy.cafe 1 points 9 hours ago

If the storage all belongs to one machine, yes. If it's spread across multiple machines with similar setups that share a LAN, then you need to put in a little thought to make sure that there's only one copy for all machines, but it's still doable.

In this case, we're talking millions of machines with different owners, OSs, network security setups, etc. that are only connected across the Internet. The logistics are enough to make a hardened sysadmin blanch.

[–] lemmyng@lemmy.ca 3 points 13 hours ago (2 children)

I would argue that duplication of content is a feature, not a bug. It adds resilience, and is explicitly built into systems like CDNs, git, and blockchain (yes I know, blockchains suck at being useful, but nevertheless the point is that duplication of data is intentional and serves a purpose).

[–] nyan@lemmy.cafe 3 points 13 hours ago

If the data has value, then yes, duplication is a good thing up to a point. The thesis is that only 10% of the data has value, though, and therefore duplicating the other 90% is a waste of resources.

The real problem is figuring out which 10% of the data has value, which may be more obvious in some cases than others.

[–] muntedcrocodile@lemm.ee -1 points 13 hours ago

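Technically git is a blockchain, at least in the loose sense that each commit's ID is a hash that covers the IDs of its parent commits.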
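The hash-chaining idea behind that comparison can be sketched in a few lines. To be clear, this is not git's actual object format: `commit_id` and `build_chain` are hypothetical helpers that just hash the parent's ID together with some content, which is enough to show why changing an early entry changes every ID after it.

```python
import hashlib


def commit_id(parent_id, content):
    """Hypothetical ID: SHA-1 over the parent's ID plus this entry's content."""
    h = hashlib.sha1()
    h.update(parent_id.encode("ascii"))
    h.update(b"\0")  # separator between parent ID and content
    h.update(content)
    return h.hexdigest()


def build_chain(contents):
    """Return the list of IDs for a linear history of content blobs."""
    ids = []
    parent = ""  # the first entry has no parent
    for content in contents:
        parent = commit_id(parent, content)
        ids.append(parent)
    return ids
```

Building the chain twice from the same contents gives the same IDs, while altering only the first entry gives a different ID for every later entry too. What this sketch leaves out is any consensus mechanism between untrusted parties, which is usually what people mean when they distinguish a blockchain from a plain hash chain.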