this post was submitted on 25 Feb 2026

194 points (96.2% liked)

Large-scale online deanonymization with LLMs (arxiv.org)

submitted 3 months ago by Beep@lemmus.org to c/technology@lemmy.world

54 comments

PDF.

We show that large language models can be used to perform at-scale deanonymization. With full Internet access, our agent can re-identify Hacker News users and Anthropic Interviewer participants at high precision, given pseudonymous online profiles and conversations alone, matching what would take hours for a dedicated human investigator. We then design attacks for the closed-world setting. Given two databases of pseudonymous individuals, each containing unstructured text written by or about that individual, we implement a scalable attack pipeline that uses LLMs to: (1) extract identity-relevant features, (2) search for candidate matches via semantic embeddings, and (3) reason over top candidates to verify matches and reduce false positives. Compared to prior deanonymization work (e.g., on the Netflix prize) that required structured data or manual feature engineering, our approach works directly on raw user content across arbitrary platforms. We construct three datasets with known ground-truth data to evaluate our attacks. The first links Hacker News to LinkedIn profiles, using cross-platform references that appear in the profiles. Our second dataset matches users across Reddit movie discussion communities; and the third splits a single user's Reddit history in time to create two pseudonymous profiles to be matched. In each setting, LLM-based methods substantially outperform classical baselines, achieving up to 68% recall at 90% precision compared to near 0% for the best non-LLM method. Our results show that the practical obscurity protecting pseudonymous users online no longer holds and that threat models for online privacy need to be reconsidered.
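The candidate-search stage (step 2 in the abstract) can be sketched roughly as below. This is only a toy illustration: the paper uses LLM-extracted features and semantic embeddings, whereas this sketch swaps in a bag-of-words vector so it runs with the standard library alone. The function names (`embed`, `top_candidates`) and the profile data are hypothetical, invented for the example.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a semantic embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_candidates(query_text, database, k=2):
    """Rank profiles in `database` (name -> text) by similarity to the query."""
    q = embed(query_text)
    scored = [(cosine(q, embed(text)), name) for name, text in database.items()]
    return sorted(scored, reverse=True)[:k]

# Hypothetical data, invented for illustration only.
profiles = {
    "profile_a": "embedded firmware engineer in denver who restores old motorcycles",
    "profile_b": "pastry chef writing about sourdough and croissants",
    "profile_c": "firmware developer who likes motorcycles and hiking near denver",
}
query = "I write firmware for a living and spend weekends fixing motorcycles in denver"
ranked = top_candidates(query, profiles)
print(ranked)
```

In the pipeline the abstract describes, the short list returned here would then go to a reasoning step that verifies matches and discards false positives; that step is not modelled in this sketch.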

LessWrong;
Hacker News.

[–] doug@lemmy.today 100 points 3 months ago* (last edited 3 months ago) (8 children)

I think it was a Reddit scraper years ago that taught me that I should probably lie more often on the internet about my work, friends, family details, etc.

Just like, little lies that don’t really matter in the comment, but would misdirect an AI or investigator into things that aren’t true.

It’s just so much woooooork to think about this shit. And to come up with different screen names everywhere? And to like, sub to a city I don’t live in and comment there about shit I know nothing about? Exhausting.

Thankfully my brothers and three uncles are here to support me. And my alligator.

[–] frongt@lemmy.zip 28 points 3 months ago (1 children)

Aha! By posting this comment, I know you don't have an alligator!

[–] P1nkman@lemmy.world 18 points 3 months ago (2 children)

But I do! I know they're illegal in Denmark, but they seem to love the snow!

[–] DrunkenPirate@feddit.org 6 points 3 months ago

That’s funny, I do as well. Unfortunately, I flushed my alligator down the toilet into the harbor where I live. Now I've bought a green parrot. My three sisters love it.

[–] deacon@lemmy.world 14 points 3 months ago (1 children)

I call it salting and I do it religiously.

Or do I?

[–] Jakeroxs@sh.itjust.works 3 points 3 months ago (1 children)

Haha perfect username too

[–] deacon@lemmy.world 2 points 3 months ago

Ah my namesake and fellow gandy dancer.

[–] surewhynotlem@lemmy.world 10 points 3 months ago (2 children)

The trick is to pick someone else's identity and use that. I'm Dale from Ohio.

[–] MrQuallzin@lemmy.world 2 points 3 months ago

Mom said it's my turn to be Dale!

[–] papertowels@mander.xyz 1 points 3 months ago

Rusty shackleford, checking in

[–] Insekticus@aussie.zone 5 points 3 months ago (1 children)

Yeah exactly, like if you're 25, say you're 27. Then in another post, 24. You're still around that age, but the exact age is muddied.

You can also use Americanized spelling in some sentences or, if you're American, use British English and become Unamericanised. Say you're a half-Brit half-American dual citizen even though you're from South Africa or something.

[–] MountingSuspicion@reddthat.com 3 points 3 months ago

I feel like that may be worse. Kind of like how if you have certain security measures while browsing the web it's almost easier to fingerprint you. It'll get a good idea of your age and that'll be enough rather than sticking to a specific lie. Just always be 3 years older with one additional sibling or a sibling of the opposite sex. If the sex of your sibling is relevant just describe them as a close family friend or close cousin in that instance. I can't say for sure, but if I had to guess having a static lie is maybe more obfuscation than a variable one. Though even posting on this thread is bad opsec.

[–] Anarki_@lemmy.blahaj.zone 4 points 3 months ago

Oh hey my dearest friend. Say, did you end up moving to Perth or were you just thinking out loud? Well if you're ever in the area let me know and we can meet up at that restaurant we enjoyed so much!

xoxo

[–] stickly@lemmy.world 4 points 3 months ago* (last edited 3 months ago)

The solution is simple, just launder each comment through an LLM to fudge the style and details a bit

Edit, tried it for fun:

lowkey just run every comment through an llm and let it switch up the words and details a bit so it dosnt sound like you wrote it

[–] couldhavebeenyou@lemmy.zip 3 points 3 months ago

Maybe get an AI agent to post misdirections

[–] Old_Jimmy_Twodicks@sh.itjust.works 24 points 3 months ago

[–] XLE@piefed.social 17 points 3 months ago

The doxxing efforts will be funded by venture capital.

What can LLM providers do? Refusal guardrails and usage monitoring can help, but both have significant limitations. Our deanonymization framework splits an attack into seemingly benign tasks – summarizing profiles, computing embeddings, ranking candidates – that individually look like normal usage, making misuse hard to detect. Refusals can be bypassed through task decomposition.

"Guardrails" are a joke and we all know Sam Altman and Elon Musk care about ethics as much as they care about not abusing their siblings or employees.

[–] Goodman@discuss.tchncs.de 13 points 3 months ago (4 children)

Every day the internet gets a little worse. I hate it here in this technological hellscape. I have more to say, but this bullshit makes me so so tired. Goodnight.

[–] silverneedle@lemmy.ca 2 points 3 months ago (1 children)

Don't hate the technology. It's great. Just how people organize themselves around technology is not up to date. Markets are not meant to coexist with an extremely fast global communication network that everyone can access, why do you think economies restrict internet access?

Let the internet as a social activity die. It's got to in order to be reborn haha

[–] Goodman@discuss.tchncs.de 1 points 3 months ago (1 children)

The internet can mostly die as far as I'm concerned. Just roll it back to file servers again, or something like gemspace. But being able to talk freely with people across cultures and borders is really important. It's a tragedy that all these people will be hurt by the dystopification of the web. The new web needs a way to converse socially that is safe and easy enough for lay people to use. I have so much more to say on this, but real life is calling so I'll leave it at this.

I don't really get your point about markets though. I'm genuinely trying to understand, so bear with me. This is what I got from your post:

Our market has coexisted with an extremely fast global communication network for decades now. Given that the market feels like a quite organic thing, on what authority is the market not meant to coexist with the internet?

I think that internet access is restricted because of technological constraints, a technological lag in rolling out higher speed infrastructure, and a lack of demand for that access which is driven by technological and practical constraints. Some complex function of those factors haha. Still, I don't really know what you are trying to get across.

[–] silverneedle@lemmy.ca 2 points 3 months ago (1 children)

Our market has coexisted with an extremely fast global communication network for decades now. Given that the market feels like a quite organic thing, on what authority is the market not meant to coexist with the internet?

I'll try to explain my thought.

The condition for markets to exist as self-reproducing and self-stabilizing objects is government, usually in the form of a state entity, which itself is an economic actor that exists in competition with other states and in cooperation within free trade zones. Important note: government forms from market activity, specifically from the control of estates. Taxation is a form of rent, for example. I am not putting the state before the market.

There is an interest for governments to:

Maximize economic output.
Do so by cleverly tricking other economic actors outside their own taxation system, i.e. through trade agreements with built-in asymmetries.
Minimize damage to domestic production. Outsourcing can lead to cornerstones of the economy eroding.

Throw in the internet. We can now communicate and exchange with actors that are not in the same tax system. First and foremost this leads to issues with intellectual property. I'd cite geolocked internet radio stations and piracy. Japan doesn't care about its citizens pirating manhwas, and vice-versa, Korea doesn't care about anime piracy, and so on and so on. Then there is trade of physical objects. Say you need a laptop battery for your Linuxed MacBook M1 and a Chinese seller has batteries in stock that are cheaper and better than Apple's own (happens rather frequently); with taxation at the border factored in you are still getting the best deal. Some might find ways of circumventing customs, which sweetens the pot further. Obviously there are issues for the domestic economy that can arise from this.

Trade speeds up and global supply chains gain importance as cross-border communication speeds up. At the level of national governments there is a distinct threat presenting itself. There is less control over market activity, leading to a speedup of the self-polluting nature of trade; in other words, the boom and bust cycle shortens. As a national government you'd want to lengthen the boom and bust cycle, as crises are the natural killer of states, along with expansionist nations.

Everything you are seeing, from Chat Control to China's firewall, is an attempt to stabilize economies. The internet enables one to build structures that are wholly outside of state control. The state fails to direct the economy as planning starts happening between turfs. The internet, due to its nation-decentralized function, can aid in forming structures that oppose the state, should it falter.

Let's not forget one of the biggest threats to the economy, which is open source. Patents and DRM are threatened by the unstoppable pace of Blender, OpenOffice and co. It's as if people said YOLO, let's stop exchanging goods and services and at the same time solve very real and pressing issues, some of the biggest problems in fact. It works with much less friction than anything before, it exists as this hobbyist thing that we cannot call economic in any sense of the current understanding of the word, and it would not exist if it wasn't for the internet.

I think that internet access is restricted because of technological constraints, a technological lag in rolling out higher speed infrastructure, and a lack of demand for that access which is driven by technological and practical constraints. Some complex function of those factors haha. Still, I don’t really know what you are trying to get across.

India and China have smartphone ownership rates of over 85%. There are no significant technological constraints if you are not someone who needs exorbitant download and upload speeds and low latency. The Chinese have pretty decent internet speeds, faster than most European countries. I also do not at all believe that there is a lack of demand for practical access. The internet is most generally a sensible thing to have access to no matter who you are.

[–] Goodman@discuss.tchncs.de 2 points 3 months ago (1 children)

Thanks for explaining your thoughts. So to paraphrase you: you are saying that the market and, by proxy, nations too are still adapting to the concept of the internet. One way to cope with the effects is to restrict access?

[–] silverneedle@lemmy.ca 1 points 3 months ago* (last edited 3 months ago) (1 children)

I am saying that the internet, as an international object, is antithetical to nations because its control panel sits not in one nation but in all of them, and that nations therefore seek to nerf it, only for it to return stronger and even more difficult to regulate as more and more people adapt to internationalized organizational patterns. As a corollary, there is a real cultural unification happening across borders as a secondary effect. I've read people terming it a "discordization" because people are starting to talk the way people talk in Discord chatrooms.

Yes, so you do have to restrict access and notably deanonymize users. California is trying to force OSes to implement age checking, which is of course a way to unmask people online. Protectionism cannot merely be understood as a set of possible tax policies, it is exactly the regaining of nation-centralized control in any sphere of life. States do not want people to be able to choose who to hang out with if the pool is the entire world, states do not have an interest in letting subjects learn about reality beyond a certain threshold where the scope of a person's understanding exceeds the boundaries of countries.

What I am getting at exactly is the social structure that humans find themselves in. When relations/hierarchies are on the brink of flattening, that is, when everyone is linked to the next in a symmetrical fashion, like in a family or within small communities 5000 years ago, states, companies and even small businesses will feel compelled to work in a way that preserves their asymmetrical stance in society. As it happens the internet is extremely good at producing flat social structures: anonymity, reach, openness and near-infinite scalability make it possible. You may be able to neutralize one netizen or manipulate one online community, but by the time that has happened five hundred heads of the hydra have regrown. Cost and expenses don't work out.

[–] Goodman@discuss.tchncs.de 2 points 3 months ago (1 children)

You have a lot to say on this. It's good that someone thinks about these things. I'm sorry that I can't really provide you with a good discussion. I don't know enough about markets etc. and I don't want to spend too long online.

I agree that you can't really stamp out openness and anonymity online (which is beautiful in a way) but I think that will mostly be reserved to technically capable users in the cracks and niches of the web who can navigate the restrictions. This is a massive tragedy.

This brings us to the current state of the web, with age restrictions popping up everywhere, deanonymization, etc. I think that we are in agreement regarding where it is going. Where do you think we should be heading? I'm sure you have opinions on that.

[–] silverneedle@lemmy.ca 2 points 3 months ago

You have a lot to say on this. It’s good that someone thinks about these things. I’m sorry that I can’t really provide you with a good discussion. I don’t know enough about markets etc. and I don’t want to spend too long online.

I mean I have a lot to say. I don't expect people to engage in discussions nor do I really want to create discussion as it eats a lot of time on my end as well.

I agree that you can’t really stamp out openness and anonymity online (which is beautiful in a way) but I think that will mostly be reserved to technically capable users in the cracks and niches of the web who can navigate the restrictions. This is a massive tragedy.

You're right, but we don't know if the more technically capable users will create elegant solutions for the rest.

I’m sure you have opinions on that

Opinions probably. I try not to judge things though or impose expectations.

[–] daychilde@lemmy.world 1 points 3 months ago (2 children)

Shut up, Anthony.

(in case your name happens to actually be Anthony, I did pick it at pseudo-random just for a stupid joke!)

[–] Goodman@discuss.tchncs.de 2 points 3 months ago

It took me an embarrassingly long time to get the joke. I forgot what the original thread was.

[–] BlindFrog@lemmy.world 2 points 3 months ago

Lmao, you are now Peter

[–] silverneedle@lemmy.ca 10 points 3 months ago

I call BS. We'll see false positives go through the roof. Just another tool to arbitrarily harass opponents.

[–] cerebralhawks@lemmy.dbzer0.com 10 points 3 months ago (1 children)

It is absolutely possible to identify users who post a lot on a public forum with a real name (e.g. Facebook or the like) as well as Reddit. So say you have some politician who claims to have X, Y, Z values and a Reddit user who has A, B, and C values that are antonymous to X, Y, and Z. By comparing common phrases, as well as by charting when the two seemingly separate users are online, you could say with reasonable certainty that the two people are one and the same, especially if you prompt them carefully to say the kinds of things they would say about neutral topics on both accounts. It would be hard to get 100% certainty, but you'd be close enough to imply it's them.
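The two signals mentioned above (when accounts are online, and which phrases they share) can be sketched in a few lines. This is a minimal illustration under assumptions, not a real attribution tool: the timestamps, texts and function names are hypothetical, and a high score on toy data like this says nothing about reliability at scale.

```python
import math
from collections import Counter
from datetime import datetime

def hour_histogram(timestamps):
    """Count posts per hour of day (0-23) from ISO 8601 timestamp strings."""
    counts = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    return [counts.get(h, 0) for h in range(24)]

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def shared_trigrams(text_a, text_b):
    """Return the three-word phrases that appear in both texts."""
    def trigrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}
    return trigrams(text_a) & trigrams(text_b)

# Hypothetical posting times, invented for illustration only.
account_x = ["2026-02-01T22:15:00", "2026-02-02T23:05:00", "2026-02-03T22:40:00"]
account_y = ["2026-02-01T22:50:00", "2026-02-04T23:30:00"]
account_z = ["2026-02-01T09:10:00", "2026-02-02T10:20:00"]

sim_xy = cosine(hour_histogram(account_x), hour_histogram(account_y))
sim_xz = cosine(hour_histogram(account_x), hour_histogram(account_z))
common = shared_trigrams("at the end of the day it works",
                         "well at the end of the day nobody cares")
print(sim_xy, sim_xz, common)
```

Neither signal is conclusive on its own; as the comment says, they only add up to "reasonable certainty" when several of them agree.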

AIs (LLMs) just make it faster.

Don't post about controversial politics if you also post under your real name. It's not a matter of "mask yourself better." There will always be tells.

[–] Iconoclast@feddit.uk 6 points 3 months ago (4 children)

For the past 10 years or so I've pretty much lived under the assumption that at some point someone figures out a system that digs through the entire internet and everything anyone has ever posted gets linked back to them.

At the same time, it's both great and absolutely horrifying.

What's horrifying is that everything you've ever posted gets linked back to you.

What's great is that none of it can really be used against you anymore - because we now know that absolutely everyone is a massive hypocrite and nobody is without sin.

[–] KnitWit@lemmy.world 5 points 3 months ago

The Private Eye by Brian K. Vaughan used that as a premise (set in 2076) for a comic run about a decade ago.

[–] Jrockwar@feddit.uk 3 points 3 months ago (1 children)

Some really good advice that someone gave me once is that the internet doesn't exist.

Sure, it obviously does exist, but this was about communication style. When you send an email, you change codes and don't write in the same way as a WhatsApp - you can expand your points more... But you should never forget you're talking to a person - just because it's internet, you shouldn't talk any different to them.

You shouldn't assume that the message is anonymous just because it's internet. You shouldn't assume certain things are okay "just because it's internet".

I don't think they were 100% right because they were disregarding that code changing between different mediums and audiences is normal (you don't talk the same way to your boss and your partner, or in written form vs spoken), but I do stand by the point that you shouldn't change code or make assumptions just because "internet".

[–] Scrollone@feddit.it 2 points 3 months ago

I mean, there's even a website (don't remember the name) that lets you upload a photo of a person and it will show all pictures of that person that are on the web.

Like a Google search but for your face. Super creepy.

[–] silverneedle@lemmy.ca 1 points 3 months ago* (last edited 3 months ago)

That'll never work. The internet is messy like a jungle, I might find bird crap somewhere but it will not get me the bird. I might find a turned leaf, but what turned the leaf will never be known to me. All despite me being able to reason and investigate phenomena that occur.

I view all things like I view particle systems: There are general trends, sometimes we can observe how single particles travel and we can derive rules from their behavior. Yet we are never able to see everything at full resolution, let alone know everyone in the way the "evil" "AI" thought experiments portray all-knowing bots. What people say about Palantir is very similar and falls into the category of we-don't-know-the-rest-of-it.

No use going paranoid over preliminary results from a tool we readily use but don't fully comprehend the limitations of (in the meaning of: we don't know how shitty and unreliable they are in actuality).

[–] daychilde@lemmy.world 5 points 3 months ago (2 children)

Too late for me, I've been Daychilde since 1996, didn't keep it separate from my real name, and I'm on wikipedia, so it's trivial to find me. lol.

The good is that I can report that it's pretty safe to have an open identity. So far. heh

[–] BreadstickNinja@lemmy.world 3 points 3 months ago* (last edited 3 months ago) (1 children)

Your adversaries right now...

[–] daychilde@lemmy.world 4 points 3 months ago

haha, oh man. That actually reminds me. I know I mentioned the wiki thing - this is me: https://en.wikipedia.org/wiki/Beck_v._Eiland-Hall

Basically, back in 2009, I created glennbeckrapedandmurderedayounggirlin1990.com. It was largely in response to Glenn Beck's stupid technique of interviewing people - like to our first sitting Muslim member of Congress: "Now, I wouldn't say this, but some people are asking: Are you working for our enemies?" - to an elected member of Congress!

Of course, this was back in the Muslim-scare days after 9/11 still in 2009… and now we definitely have people in Congress working for our enemies.

But anyway. So the parody site.

My wife found a forum where some idiots were trying to track me down. I mean, my real name and address was out there, but they were looking for more information about me and the site. They were talking about what organizations must be funding this attack on their beloved Beck.

There was controversy at the time because an organization called ACORN was trying to get people to register to vote and supposedly signing up on behalf of people. IIRC the allegations were either bullshit or it wasn't a big deal or maybe it was and it was dealt with. All I remember for sure is that I thought it would be hilarious to offer these chucklefucks "evidence" for their conspiracies.

So I went out and copied the raw HTML from a 404 page on the ACORN website and made that the custom 404 page for my site. And then, to help these idiots "find" it, I made a "mistake" - I announced something on the main page and linked to a page that supposedly had the full story, only I intentionally put a typo in the link so the 404 page would come up. lol.

Oh, man, they went N U T S over in the forum "HOLY SHIT ITS ACORN BEHIND THIS" lolololol........

But anyway, your gif absolutely reminded me of those morons. That's how I envisioned their "hacking" of me. lol

[–] Goodman@discuss.tchncs.de 2 points 3 months ago (1 children)

I read one of your blog posts about empathy this morning. It resonated with me and my recent views on the world.

[–] daychilde@lemmy.world 1 points 3 months ago

Ahhhh, sometimes I forget the implications of me opening my big mouth in cases like this. lol. Well, I'm glad it was a positive experience. :) It's something I need to remind myself constantly of as I am bad about getting sucked into responding to perceived rudeness with rudeness of my own - I definitely have RSD and it fucking sucks.

[–] Bruncvik@lemmy.world 4 points 3 months ago

I have an account where I only post after I've translated my writing through three different languages and back to English. The original input and the output convey the same message, but have very distinct styles. Randomizing the three languages in my translation sequence introduces enough variety that I doubt current LLMs can identify me. (Full disclosure: I don't post any sensitive information under any account; I do it just for fun.)
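The round-trip idea can be sketched as follows. The comment does not say which translation tool is used, so the `translate(text, source, target)` callable here is a hypothetical placeholder to be supplied by the reader; `fake_translate` only tags the text so the control flow can be run and inspected without any external service.

```python
import random

def round_trip(text, translate, languages, hops=3, rng=None):
    """Pass `text` through `hops` randomly chosen languages, then back to English.

    `translate` is any callable taking (text, source_lang, target_lang);
    it is an assumption of this sketch, not a real library API.
    """
    rng = rng or random.Random()
    chain = rng.sample(languages, hops)  # distinct languages, random order
    current, source = text, "en"
    for target in chain:
        current = translate(current, source, target)
        source = target
    return translate(current, source, "en"), chain

def fake_translate(text, source, target):
    # Placeholder that only tags the text; a real translator would go here.
    return f"{text}>{target}"

result, chain = round_trip("hello", fake_translate, ["de", "fr", "ja", "fi"],
                           rng=random.Random(0))
print(chain, result)
```

Whether this actually defeats style matching is the commenter's guess, not something the sketch demonstrates; content-level details (places, jobs, events) survive translation unchanged.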

[–] thedeadwalking4242@lemmy.world 4 points 3 months ago (1 children)

There's no quality of an LLM that would make this possible. It's just more hallucinations and poor tool use.

[–] thinkercharmercoderfarmer@slrpnk.net 2 points 3 months ago* (last edited 3 months ago) (1 children)

Why not? If LLMs are good at predicting mean outcomes for the next symbol in a string, and humans have idiosyncrasies that deviate from that mean in a predictable way, I don't see why you couldn't detect and correlate certain language features that map to a specific user. You could use things like word choice, punctuation, slang, common misspellings, sentence structure... For example, I started with a contradicting question, I used "idiosyncrasies", I wrote "LLMs" without an apostrophe, "language features" is a term of art, as is "map" as a verb, etc. None of these are indicative on their own, but unless people are taking exceptional care to either hyper-normalize their style, or explicitly spiking their language with confounding elements, I don't see why an LLM wouldn't be useful for this kind of espionage.
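A few of the surface features listed above (punctuation habits, sentence structure, the "LLM's" apostrophe) can be counted with plain regular expressions. This is a minimal sketch of classic stylometry, not an LLM-based method; the feature names and the sample text are made up for illustration.

```python
import re

def style_features(text):
    """Extract a handful of crude stylistic measurements from a text."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    return {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "semicolons_per_100_words": 100 * text.count(";") / n_words,
        "ellipses": text.count("..."),
        # Acronym followed by 's (e.g. "LLM's"): a crude proxy that also
        # catches genuine possessives.
        "acronym_apostrophe_s": len(re.findall(r"\b[A-Z]{2,}'s\b", text)),
    }

sample = "I doubt current LLM's can identify me. Why not? It seems fine; really..."
features = style_features(sample)
print(features)
```

As the comment notes, no single feature is indicative on its own; a vector like this only becomes useful when compared across many comments from the same two profiles.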

I wonder if this will have a homogenizing effect on the anonymous web. It might become an accepted practice to communicate in a highly formalized style to make this kind of style fingerprinting harder.

[–] thedeadwalking4242@lemmy.world 1 points 3 months ago (1 children)

It's a language model, not a classification model. People have already tried a similar experiment to have LLMs detect if an LLM wrote text or not and it couldn't.

[–] thinkercharmercoderfarmer@slrpnk.net 2 points 3 months ago

This is in some ways an easier problem than classifying LLM vs non-LLM authorship. That only has two possible outcomes, and it's pretty noisy because LLMs are trained to emulate the average human. Here, you can generate an agreement score based on language features per comment, and cluster the comments by how they disagree with the model. Comments that disagree in particular ways (never uses semicolons, claims to live in Canada, calls interlocutors "buddy", writes run-on sentences, etc.) would be clustered together more tightly. The more comments two profiles have in the same cluster(s), the more confident the match becomes. I'm not saying this attack is novel or couldn't be accomplished without an LLM, but it seems like a good fit for what LLMs actually do.
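The clustering idea above can be illustrated with a toy version. Here simple string rules stand in for the per-comment "agreement score" (an assumption of this sketch; the comment imagines deriving it from a language model), comments are grouped by which quirks they show, and two profiles score higher the more comments they have in the same groups. All quirk rules, names and example comments are hypothetical.

```python
from collections import Counter

# Hypothetical quirk detectors standing in for model-derived language features.
QUIRKS = {
    "says_buddy": lambda t: "buddy" in t.lower(),
    "mentions_canada": lambda t: "canada" in t.lower(),
    "uses_semicolon": lambda t: ";" in t,
}

def signature(comment):
    """The set of quirks a comment exhibits; comments sharing it form a cluster."""
    return frozenset(name for name, test in QUIRKS.items() if test(comment))

def cluster_counts(comments):
    """How many of a profile's comments fall into each quirk cluster."""
    return Counter(signature(c) for c in comments)

def match_score(comments_a, comments_b):
    """Fraction of the smaller profile's comments that share a non-empty cluster."""
    a, b = cluster_counts(comments_a), cluster_counts(comments_b)
    shared = sum(min(a[s], b[s]) for s in a if s and s in b)
    return shared / max(min(len(comments_a), len(comments_b)), 1)

# Invented example profiles.
profile_1 = ["Listen buddy, it gets cold up here in Canada",
             "Sure thing buddy", "No idea what you mean"]
profile_2 = ["Nice try buddy", "Canada has great winters, buddy"]
profile_3 = ["I disagree; the data says otherwise", "That is fine; moving on"]

print(match_score(profile_1, profile_2), match_score(profile_1, profile_3))
```

Comments with no detected quirk are ignored, so the score depends entirely on how distinctive the chosen features are; with features this coarse, false positives would be common.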

[–] Supervisor194@lemmy.world 4 points 3 months ago

I've never once posted on the Internet using a real name. I've never been a member of any social anything other than Reddit and Lemmy. I only even found Reddit because an IRC link aggregator I used to browse for news/memes went tits up.

[–] MalReynolds@slrpnk.net 4 points 3 months ago

So, pretty much what Meta/Facebook (and the three-letter agencies / GovInt) have been doing with deterministic code for ages (like they're not scraping Reddit et al., including Lemmy), but probabilistic, with more errors and new improved hallucination.

Competition, filling in gaps or just looking to be bought out. Evil.

[–] MonkderVierte@lemmy.zip 2 points 3 months ago* (last edited 3 months ago)

[–] Kissaki@feddit.org 1 points 3 months ago

Germans with a website: well, it's in clear text in the Impressum already, required by law
