this post was submitted on 21 Nov 2024
Technology
I am a bit confused about how it is legal for them to have the training data here.
Like is there anything a corpo can't do?
Like why can't Subway Jared and the Catholic Church "train the AI"?
Only halfway joking, what's the catch here?
There are laws around it. Law enforcement doesn't just delete any digital CSAM they seize.
Known CSAM is archived and analyzed rather than destroyed, and is used to recognize additional instances of the same files in the wild, wherever file scanning is possible.
Institutions and corporations can request licenses to access the database, or just the metadata that lets software tell whether a given file might be a copy of known CSAM.
This is the first time an attempt is being made at using the database to create software able to recognize CSAM that isn't already known.
I'm personally quite sceptical of the merit. It may well be useful for scanning the public internet, but I'm guessing the plan is to push for it to be somehow implemented for private communication, no matter how badly that compromises the integrity of encryption.
So doesn't that mean law enforcement has the biggest CP collection of anybody? This sounds kinda dangerous...
It does. Kinda.
The police are seldom allowed to be in possession of CSAM, except when seizing hardware that contains it during an arrest. The database used in modern detection tools is maintained by NCMEC, which has special permission to do so.
And of course there are risks, but it's just digital data. Unless you are creating more, you're not actively harming anyone. And law enforcement absolutely needs that data to take some of the most obvious steps to prevent it being spread further.
Obviously, someone has access, but getting to the actual media files wouldn't be simple. What typically happens is that anyone wanting to detect CSAM is given a hashed version of the database. They can then scan their systems by hashing any media they are hosting and checking for matches.
Whenever possible, people aren't handling the actual media. But for any detection to be possible to begin with, the database of the actual media does need to be maintained somewhere.
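For what it's worth, the hash matching described above boils down to something like this. It's only a minimal sketch, assuming plain SHA-256 as the hash; real detection tools generally rely on perceptual hashes that survive resizing and re-encoding, which this does not. The names `file_digest`, `find_matches` and `known_hashes` are made up for illustration and aren't taken from any real tool.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List, Set


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_matches(paths: Iterable[Path], known_hashes: Set[str]) -> List[Path]:
    """Return the paths whose digest appears in the supplied hash list.

    Assumption: known_hashes is a set of hex SHA-256 strings. The scanner
    only ever sees the hashes, never the media they were computed from.
    """
    return [p for p in paths if file_digest(p) in known_hashes]
```

The point is that the host only needs the list of hashes, not the files behind them. The flip side is that an exact hash like this one stops matching as soon as a single byte of the file changes.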
AI is a touchier subject: hashes only match files already in the database, so you can't use them to train a model to recognize CSAM that isn't in there yet. In those cases you have to work with the actual media. This is only recently becoming a thing.
It also leaves open the possibility of false positives. An oft-cited example is parents taking pictures of their own children for innocent reasons, or doctors and parents handling images for valid medical reasons. If a system flagged such content, someone else would end up seeing that "private" content.
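To put a rough shape on the false positive worry, here's a back-of-the-envelope sketch. Every number in it is hypothetical and picked purely for illustration, not measured from any real classifier, and the function names are made up too.

```python
def expected_flags(total_items, prevalence, true_positive_rate, false_positive_rate):
    """Return (true_flags, false_flags) expected from scanning total_items.

    prevalence: fraction of scanned items that really are illegal.
    true_positive_rate: fraction of illegal items the classifier flags.
    false_positive_rate: fraction of innocent items the classifier flags.
    """
    positives = total_items * prevalence
    negatives = total_items - positives
    return positives * true_positive_rate, negatives * false_positive_rate


def precision(true_flags, false_flags):
    """Fraction of flagged items that are genuine matches."""
    flagged = true_flags + false_flags
    return true_flags / flagged if flagged else 0.0


# Hypothetical inputs: 1,000,000 images, 1 in 10,000 actually illegal,
# a classifier that catches 99% of those and wrongly flags 0.1% of the rest.
true_flags, false_flags = expected_flags(1_000_000, 0.0001, 0.99, 0.001)
print(true_flags, false_flags, precision(true_flags, false_flags))
```

With those made-up numbers you'd get about 99 correct flags against about 1,000 wrong ones, so only around 9 in 100 flagged images would be real hits. Everything else is somebody's private photo landing in front of a reviewer.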