this post was submitted on 14 Jun 2026
28 points (91.2% liked)

Technology

85392 readers
4188 users here now

This is a most excellent place for technology news and articles.


Our Rules


  1. Follow the lemmy.world rules.
  2. Only tech related news or articles.
  3. Be excellent to each other!
  4. Mod approved content bots can post up to 10 articles per day.
  5. Threads asking for personal tech support may be deleted.
  6. Politics threads may be removed.
  7. No memes allowed as posts, OK to post as comments.
  8. Only approved bots from the list below, this includes using AI responses and summaries. To ask if your bot can be added please contact a mod.
  9. Check for duplicates before posting, duplicates may be removed
  10. Accounts 7 days and younger will have their posts automatically removed.

Approved Bots


founded 3 years ago
MODERATORS
 

when datasets are scaled up to the volume of (partial) internet, together with the idea that scale will average out the noise, large dataset builders came up with a human-not-in-the-loop, cheaper-than-cheap-labor method to clean the datasets: heuristic filtering. Heuristics in this context are basically a set of rules came up by the engineers with their imagination and estimation to work best for their perspective of “cleaning”. Most datasets use heuristics adopted from existing ones, then add some extra filtering rules for specific characteristics of the datasets. I would like to invite you to have a taste together of these silent, anonymous yet upheld estimations and not-guaranteed rationalities in current sociotechnical artifacts, and on for whom these estimations are good-enough, as it will soon be part our technological infrastructures.

In 1980s, non-white women’s body size data was categorized as dirty data when establishing the first women's sizing system in US. Now in the age of GPT, what is considered as dirty data and how are they removed from massive training materials?

Datasets nowadays for training large models have been expanded to the volume of (partial) internet, with the idea of “scale averages out noise”, these datasets were scaled up by scrabbling whatever available data on the internet for free then “cleaned” with a human-not-in-the-loop, cheaper-than-cheap-labor method: heuristic filtering. Heuristics in this context are basically a set of rules came up by the engineers with their imagination and estimation that are “good enough” to remove “dirty data” of their perspective, not guaranteed to be optimal, perfect, or rational.

The talk will show some intriguing patterns of “dirty data” from 23 extraction-based datasets, like how NSFW gradually equals to NSFTM (not safe for training model), and reflect on these silent, anonymous yet upheld estimations and not-guaranteed rationalities in current sociotechnical artifacts, and ask for whom these estimations are good-enough, as it will soon be part our technological infrastructures.

Licensed to the public under http://creativecommons.org/licenses/by/4.0

top 1 comments
sorted by: hot top controversial new old
[–] atrielienz@lemmy.world 2 points 8 hours ago

You copy and pasted the first paragraph a second time in the body of the post.