AI Loophole #1; Your GitHub README.md (lemmy.world)

submitted 1 year ago* (last edited 1 year ago) by elias_griffin@lemmy.world to c/technology@lemmy.world


I used to be the Security Team Lead for Web Applications at one of the largest government data centers in the world, but now I mostly do "source available" security, mainly focusing on BSD. I'm on GitHub, but I also run a self-hosted Gogs (which Gitea came from) git repo at Quadhelion Engineering Dev.

Well, on that server I tried to deny AI with Suricata, robots.txt, "NO AI" licenses, Human Intelligence (HI) License links in the software, and "NO AI" comments in posts everywhere on the Internet where my software was posted. Here is what I found today after correlating all my logs of git clones or scrapes and tracing them all back to IP/Company/Server.
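A minimal sketch of this kind of log correlation, not the setup actually used on that server. It assumes the web server writes the common "combined" access log format and that crawlers identify themselves honestly in the user-agent string, which many do not. The `AI_AGENT_HINTS` list and the function names are illustrative assumptions, and the list is not exhaustive.

```python
import re
from collections import Counter

# Assumption: an illustrative, non-exhaustive list of user-agent substrings
# that some AI-related crawlers are known to send. Maintain your own list.
AI_AGENT_HINTS = ("GPTBot", "CCBot", "ClaudeBot", "PerplexityBot", "Bytespider")

# Assumption: "combined" log format:
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def classify(line):
    """Return (ip, label) for one log line, or None if it does not parse."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    agent = match.group("agent").lower()
    is_ai = any(hint.lower() in agent for hint in AI_AGENT_HINTS)
    return match.group("ip"), "ai-crawler" if is_ai else "other"


def summarize(lines):
    """Count parsed log lines per label, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        result = classify(line)
        if result is not None:
            counts[result[1]] += 1
    return counts
```

This only catches self-declared crawlers. Anything that spoofs a browser or git user agent still has to be traced back by IP and network owner, as described above.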

Having formerly been loath to even give my thinking pattern to a potential enemy, I asked Perplexity AI questions specifically about BSD security, a very niche topic. Although there is a huge data pool here in general, built up over many decades, my type of software is pretty unique. It is buried, as it does not come up in a GitHub search for BSD Security within the first two pages, which is all most users will click through; it is very recent compared to the "dead pool" of old knowledge; and it is fairly well received, yet not generally popular, so GitHub Traffic Analysis is very useful.

The traceback and AI result analysis shows the following:

1. GitHub cloning vs. visitor activity in the Traffic tab DOES NOT MATCH any useful pattern for me, the Engineer. Rough estimate of the likelihood of AI training on my own repositories: 60% of clones are AI/Automata.
2. A GitHub README.md is not licensable material and is a public document able to be trained on no matter what software license, copyright, statements, or technical measures are used to dissuade/defeat it.
   a. I'm trying to see whether tracking down if any README.md, no matter what the context, is trainable is a solvable engineering project considering my life constraints.
3. Plagiarism of technical writing: Probable
4. Theft of programming "snippets", or perhaps "single lines of code", and the overall logic design pattern for that solution: Probable
5. Supremely interesting choice of datasets used vs. available, both in summary use and in checking for validation against other software, seemingly weighted on reputation factors: "Coq"-like proofing, GitHub "Stars", Employer History?
6. Even though I can see my own writing and formatting right out of my README.md, the citation was to "Phoronix Forum", but that isn't true. That's like citing your post as something "Tick Tock" said. I wrote that; a real flesh and blood human being took comparatively massive amounts of time to do that. My birth name is there in the post 2 times [EDIT: post signature with my name no longer? Name not in "about" either hmm], in the repo, in the comments, all over the Internet.

[EDIT continued] Did it choose the Phoronix vector to that information because it was less attributable? It found my other repos in other ways. My Phoronix handle is the same as my GitHub username, and my handle is my name, easily inferable in either, and there is also a biography link with my full name in the about. [EDIT cont end]

You should test this out for yourself, as I'm not going to take days or a week making a great presentation of a technical case. Check your own niche code, ask a specific code question about an application, or make a mock repo with super niche stuff and lots of code in the README.md, and then check it against AI every day until you see it.
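One way to make that daily check less subjective is to measure verbatim overlap between your README.md and whatever the AI answers. This is only a sketch under assumptions: the function names are hypothetical, the 8-word window is an arbitrary choice, and shared word sequences suggest copying but do not prove where the model got the text.

```python
import re


def word_ngrams(text, n=8):
    """Return the set of n-word sequences in text, lowercased, punctuation dropped."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_passages(readme, answer, n=8):
    """Return the n-word sequences that appear in both the README and the answer."""
    return word_ngrams(readme, n) & word_ngrams(answer, n)


def overlap_ratio(readme, answer, n=8):
    """Fraction of the answer's n-word sequences that also occur in the README."""
    answer_grams = word_ngrams(answer, n)
    if not answer_grams:
        return 0.0
    return len(answer_grams & word_ngrams(readme, n)) / len(answer_grams)
```

Paste the README text and the AI's answer in as strings and log `overlap_ratio` each day. A mock repo with deliberately unusual phrasing makes any nonzero overlap much harder to explain away as coincidence.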

P.S. I pulled up TabNine and tried to write Ruby so complicated and magically mashed that AI could offer me nothing, just as an AI obfuscation/smartness test. You should try something similar to see what results you get.


[–] AlexanderESmith@social.alexanderesmith.com 0 points 1 year ago* (last edited 1 year ago) (5 children)

It's not paranoia if you have proof that they're stealing your content without permission or compensation.

You come off as an AI bro apologist. What they're doing isn't okay.

[–] wizardbeard@lemmy.dbzer0.com 14 points 1 year ago (1 children)

These concepts are not mutually exclusive. You can be right about AI considerably overstepping boundaries and still be exhibiting classic signs of paranoia issues, which OP is.

Their response to people not reacting to this post and their comments is to immediately jump to the idea that they're being targeted by their designated enemy. That's not particularly healthy.

I'm worried that AI is becoming the new gangstalking for tech-aligned people predisposed to disordered thinking.

[–] AlexanderESmith@social.alexanderesmith.com 6 points 1 year ago

I agree that their replies are a little... over the top. That's all kind of a distraction from the main topic though, isn't it? Do we really need to be rendering armchair diagnoses about someone we know very little about?

I mean, if I posted a legitimate concern - with evidence - and I was dog-piled with a bunch of responses saying I was a nutter, I'd probably go on the defensive too. Some people don't know how to handle criticism or stressful interactions; that doesn't mean we should necessarily write them (or their verified concerns) off.

[–] catloaf@lemm.ee 5 points 1 year ago (1 children)

Just because they are out to get you doesn't mean you're not paranoid, and vice versa.

I have nothing for or against AI/ML as a tool; my issue with it is when companies scrape huge amounts of data in violation of the author's rights, as in OP's example. Although I'm not quite sure why he's keeping code in the README.md file; usually that's for basic installation and usage, and full examples are kept in full documentation. That said, I highly doubt README.md files are public domain, so they shouldn't be automatically used as training materials.

[–] AlexanderESmith@social.alexanderesmith.com 2 points 1 year ago (1 children)

I'm not quite sure whose argument you're making here. It reads like you agree with OP and me (e.g. "LLMs shouldn't be using other people's content without permission", et al).

But you called OP paranoid... I assumed because you thought OP thought their content was being used without their permission. And it's extremely clear that this is what is happening...

What am I missing?

[–] catloaf@lemm.ee 1 points 1 year ago

You're not missing anything. Both things can be true.

[–] DudeDudenson@lemmings.world 2 points 1 year ago

Frankly, OP replied to his own post multiple times with no prompting whatsoever; just reading through this stuff, I'm concerned about him as well. LLM stuff notwithstanding, and even if he's right, he seems somewhat obsessed with this in an unhealthy way.