I am concerned about LLM code in Python

I am concerned about LLM code https://github.com/python/cpython/commit/951675c18a1f97513f495b9ec604054e0702eaaf in Python. This isn’t legal advice. I’ll merely link and quote you the sources that made me concerned, and you can draw your own conclusions.

Here’s a video clip of what appears to be a lawyer reviewing what looks like plagiarism of a single source, triggered by a prompt as simple as `function isEven() {`:

https://github.com/mastodon/mastodon/issues/38072#issuecomment-4105681567

Here’s a high profile incident https://www.pcgamer.com/software/ai/microsoft-uses-plagiarized-ai-slop-flowchart-to-explain-how-github-works-removes-it-after-original-creator-calls-it-out-careless-blatantly-amateuristic-and-lacking-any-ambition-to-put-it-gently/ concerning Microsoft themselves.

The following study seems to suggest memorization may correlate with model performance https://www.sciencedirect.com/science/article/pii/S2949719123000213#b7:

We found that the models that consistently output the highest-quality text are also the ones that have the highest memorization rate.

This seems to suggest, together with study on lack of reasoning ability https://machinelearning.apple.com/research/illusion-of-thinking, that LLMs can’t draw their own conclusions:

We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles. We also investigate the reasoning traces in more depth, […] ultimately raising crucial questions about their true reasoning capabilities.

Takeaway of Forbes: “even the most sophisticated reasoning models fundamentally lack genuine cognitive abilities.”

Takeaway of The Atlantic: “Large language models don’t “learn”—they copy.”

There’s also this field study https://dl.acm.org/doi/10.1145/3543507.3583199 suggesting a plagiarism rate for generative AI of at least 2–5%, and that’s only the part they could pin down.

The above sources seem to suggest that LLMs plagiarizing their training data may be common, may happen even when not baited, and may involve single identifiable sources at significant length. But check the sources yourself.

I’m guessing the AI user will usually be unaware whenever this happens, and therefore likely unable to stop it.


Therefore, I suggest that, even just for ethics and out of respect for the FOSS licenses of other projects, Python should consider banning LLM code submissions.

5 Likes

I think you’ve got to be a bit realistic - the commit you link to is from a fairly long-standing contributor to Python so they probably deserve the benefit of the doubt that they understand what they’re submitting. It’s also pretty specific to the internals of Python’s JIT so it’s unlikely to be too “plagiarised”.

My personal view is that drive-by slop contributions from people with no real link to the project are a bigger issue (for open source generally). They’re cheap to generate, take a decent amount of time to review, are likely not even read by the people submitting them, and may force you to break the basic principle of being nice to new contributors in order to handle them.

But that particular example that you link to really does seem like “least concern”.

(I think this is likely my first and last contribution to this particular conversation here… I’m not sure it’s likely to end up being hugely productive…)

11 Likes

We could go down the rabbit hole of calculating the exact percentage of plagiarism, but that’s like conceding that a non-zero amount is okay.

Even if a given contribution is such a highly technical and specific part of the code that it exists nowhere else, that won’t hold for long, because newer people will take shortcuts, which will lead to either:

1. Discounting new contributions.

2. Accepting more plagiarized contributions.

Also, for low-income countries (like Indonesia, where I come from), using LLMs for coding still feels expensive.

Well, my current research is about how to keep AI, especially LLMs, runnable and fixable locally, and I try to sidestep the question of copyright by limiting the usage mostly to the education sector (yes, I know it is still probably illegal).

But I also still have that guilty feeling, especially about the environment (water conflicts are becoming more frequent), society (my artist and writer friends have a hard time finding jobs nowadays), and the lack of usefulness of AI itself (even after I led some non-technical people in trying to find uses for AI via the buildclub.ai Indonesia chapter).

I think we are getting the same problem that the music industry has had way before AI: “What is plagiarism vs inspiration”?
You can use AI and mix your original content and decisions in with the result, via specific prompting, code, pseudo-code, algorithm selection, architecture steering,…
It’s literally “Blurred Lines” :slight_smile:

1 Like

Everything every human has ever done is plagiarism. The most brilliant ideas ever are all copies plus slop that happened to fit a purpose. When they didn’t fit we forget about them.

That fraud Mozart just copying the symmetries of Bach plus some flash to make his posts more meme.

4 Likes

This is either intentionally inflammatory or else a little bit sad. I find it to be a reductive view if truly held, but doubt I will convince you of anything. But if this is just trolling, please stop.

I’m not sure what your goal is in opening this discussion. With that unclear, I don’t see what I or anyone else should do with this thread.

I am extremely concerned about the influence of LLMs on FOSS myself, but I am also careful to draw a distinction between slop (which is junk) and LLM-assisted code (which I consider a derived work from all model inputs) as a broader category. A square (slop) is a rectangle (LLM-assisted code). I think it is therefore unhelpful to mix citations about the two categories without acknowledging the difference.

If you want to push for cultural change in the broader Python community, I would welcome that and be happy to be part of that conversation. One of the current community values, which I share, is tolerance, so it is important to do this in a way that welcomes opposing viewpoints.


I’m going to move this to the Help subforum, which is (unintuitively) the general-chat category. Ideas is for specific ideas for changes to the language.


EDIT: I see now that I missed the note at the end (not sure how, maybe I’m just tired…) about suggesting a ban.

Some of the above is therefore a little off-base, but I still don’t think this goes in Ideas. Maybe Core Development would have been better.

Bans which are unenforceable are a tactical mistake. You guarantee that the rule will be broken silently, behind your back.

7 Likes

No. Sometimes the codebase is very big and the important parts are scattered across various files, so that a human being cannot read and understand them on a first pass (especially when the coder is not yet a contributor or maintainer). In this case, using an LLM to understand and write some code is useful.

The real requirement is to ask the coder to understand how the code written by the LLM works.

Isn’t this something for the core team to decide for themselves?

What is the PSF’s legal team saying? I’m not a lawyer, but if there is ambiguity about whether a contributor actually owns the copyright to the code they want to add, it seems to me the contributor is not legally able to grant the necessary license terms to Python users, as required by the PSF’s CLA, for the LLM-generated code. I would conclude they are in breach.

If useless (except for marketing) LLM-co-signed commits are allowed, then what’s the point of the CLA at all going forward?

6 Likes

I recommend you watch the video demo. It seems like you may not have. My apologies if I’m mistaken.

This seems to me like saying “we can’t ensure people don’t (manually) steal and paste GPL code without a license, so we might as well allow it”.

2 Likes

Agreed. I don’t think that there can be any going back to a time when no one uses AI tools and having people lie about using them is also a problem.

Anthropic did a study in January where they compared people doing a task that involved learning to use a new library (Trio) in a randomised control trial with one group using an LLM chatbot and the control group not using any AI tools. The headline conclusions are fairly predictable but the part I found interesting was what it says in the paper about the pilot studies they did before the main study.

In the first pilot study they recruited 39 junior software engineers and divided them into the AI group and the control group; the people in the control group were asked to spend up to 30 minutes writing code without AI. The authors concluded that one third of the people in the control group had used AI anyway, even though they were told not to.

In the second pilot study they recruited 107 people. This time they tried to make the instructions clearer like “literally we are just paying you to write code for 30 minutes without using AI so please do NOT use AI”. Apparently still a quarter of the people in the control group just said “yeah, but I’m gonna use AI for that”.

There was no material incentive in these studies for people to cheat using AI, and it was literally the one thing that people in the control group needed to not do, because it would poison the data and undermine the entire purpose of the study.

This shows basically what I would expect to happen if banning the use of AI tools. People would just do it anyway and lie about it and then there would be endless arguments and accusations.

I think any large open source project is going to need to have some kind of explicit policy statement about generative AI use though.

It is not the same. The LLM is a tool that can do various things. The fact that it is trained by looking at lots of code does not mean that someone using it is always copying that code. The linked commit is not going to be something that can be copied from anywhere, except in so far as the code is largely just more examples of the repetitive patterns already seen in the surrounding code. Whether or not someone used an LLM to produce that commit, it does not constitute copying from anything non-generic. I expect that the author would have told the LLM quite precisely what to do, and the LLM then automated repeating the patterns needed to complete the work.

It is true that someone could use an LLM-based tool and end up unknowingly having something that is a copy of something else. A policy should aim to prevent that and other problems that can come from LLM-generated code but an outright ban is just impossible and discouraging something like the particular commit highlighted would seem like a bad outcome to me.

2 Likes

Here are projects that have banned AI, so it seems to be possible:

- Asahi Linux https://asahilinux.org/docs/project/policies/slop/
- elementaryOS https://docs.elementary.io/contributor-guide/development/generative-ai-policy
- Gentoo https://wiki.gentoo.org/wiki/Project:Council/AI_policy
- GIMP https://gitlab.gnome.org/GNOME/gimp/-/blob/master/.gitlab/merge_request_templates/default.md?plain=1#L11-12
- GoToSocial https://codeberg.org/superseriousbusiness/gotosocial/src/branch/main/CODE_OF_CONDUCT.md#code-of-conduct
- Löve2D https://github.com/love2d/love/commit/147d39251c2618852c026f8cadf95f0ffd6a746f
- Loupe https://discourse.gnome.org/t/loupe-no-longer-allows-generative-ai-contributions/27327
- NetBSD https://www.netbsd.org/developers/commit-guidelines.html
- postmarketOS https://docs.postmarketos.org/policies-and-processes/development/contributing-and-ai.html
- Qemu https://www.qemu.org/docs/master/devel/code-provenance.html#use-of-ai-generated-content
- RedoxOS https://gitlab.redox-os.org/redox-os/redox/-/blob/master/CONTRIBUTING.md#ai-policy
- Servo https://book.servo.org/contributing/getting-started.html#ai-contributions
- stb libraries https://github.com/nothings/stb/blob/master/CONTRIBUTING.md#ai-and-llm-are-forbidden
- Zig https://ziglang.org/code-of-conduct/#strict-no-llm-no-ai-policy

“A policy should aim to prevent [copying].” How, if the coder typically won’t know? Isn’t a ban more actionable?

2 Likes

It would be nice if big-tech AI providers offered copyright-infringement checking, but unfortunately it seems to be an enterprise-only solution. I wonder if we could afford it.

It’s likely not possible to reliably detect any infringements in an automated way.

And a ban seems crystal clear and removes the guessing for good actors. Ethically, it moves the responsibility onto the malicious actors. You can’t stop those anyway, but at least you condemn the practice. More AI bans will mean less acceptance of ignoring them.

5 Likes

I didn’t say impossible. I said that it’s unenforceable.

The projects you cite have chosen to set policies which they know they cannot enforce. I consider that to be an error in judgement, although I think the motives are laudable.

You clearly don’t agree. I may or may not convince you, but I am about to try.

I don’t think that contribution policies are the right place to send this message. The contribution policy and CoC is about ensuring the integrity – technical, legal, social – of the project.

I think the right way to condemn things is to condemn them.

I resent “AI” vendors for stealing my work, for reproducing the works of others without attribution, for intentionally conflating criticism of their product with criticism of their users, for poisoning the Earth and its inhabitants, for sucking up resources which could be devoted to more noble causes, and too many other things to list.

There’s my condemnation.

If your concern is the legal one, the US courts have spoken: the power of the US Dollar has determined that we don’t own IP once it goes into the meat-grinder of LLM training. It has nothing to do with “justice” or “right”; it’s merely what US courts have decided. My disdain for the US legal system does not help FOSS communities.

If your concern is the ethical one, regarding copying the work of others without attribution, and without obeying the original author’s copyright[1], I’m with you. Can we shore up policies to require attribution of work, where that is not already sufficiently clear? Can we directly address the ethical problem?


  1. Copyright and copyright law is ridiculous in so many ways, but I don’t have a better system on offer. It’s what we’ve got. ↩︎

2 Likes

Most of the AI/LLM bans I see are mostly about generating the contribution, not about understanding the codebase itself.

Even as an AI/LLM user myself (not for open-source projects, because I am still not okay with polluting the commons), I sometimes find that my understanding of the codebase, based on what the AI/LLM told me, is wrong, and I need to compensate for my mistakes, which is possible because I have experience debugging things even without an AI/LLM.

But I haven’t been able to see how beginners can gather that experience. Starting small will seem useless when there is the temptation to take shortcuts.

There may come a day when the frameworks around AI usage are good enough, and hopefully all the externalities, including energy/water usage, job loss, and uselessness, are at least mostly accounted for. I am not really sure that day is coming soon.

At the very least, we will need to prepare for what happens if copyright infringement does occur. Do we have enough resources to compensate, enough capacity to rework the infringing parts? We will be better prepared if we have already asked contributors not to use generated code.

If contribution volume increases, are we prepared to review it?

If we review it using AI, will we be prepared for increased false positives (accepting broken code, which will decrease our reliability) and false negatives (rejecting working code, which will decrease people’s motivation to contribute)?

If we accept the code, will we be prepared for model collapse, since AI can get worse when trained increasingly on its own output?

As someone researching AI usage myself, I find that most companies and organizations underestimate the work needed to adopt AI. I don’t have the answers, but I do have the questions above.

5 Likes

How about applying a bot to the task of identifying potential copyright (etc.) violations? I asked Copilot how it would assess its own abilities in this area, and found its reply informative and initially promising, but I haven’t tried it yet:

In fact, literal text duplication isn’t usually in play, so I was encouraged to hear that’s not the level it works at.

I know from experience that it’s good at recognizing “Tim’s coding style”, and over time has “learned” to write Python code more and more in that style. Indeed, at times it’s suggested “more-Tim-like” Python code than I would have written on the first try :wink:

1 Like

This is the solution that the industry intent on selling these tools has repeatedly pitched, but I do not find it believable. (Partly because none of them are selling a product which people want which does this.)

Since the tools are fundamentally untrustworthy and nondeterministic – they make things up all the time, and not always the same things – how am I supposed to trust this on my projects?
Suppose it says “This submission is an unlicensed copy of GPL code from X.” Then I run the tool again and it says “This submission is a copy of MIT-non-attribution licensed code from Y.”

What do I do with such hallucinations? It’s not useful or actionable. All I’ve done is poison my own experience as a maintainer.

I’m not sure that a bot does any better than a human at looking at a patch and guessing whether or not there’s an issue. And a human can be held accountable for their decisions, actions, and reasoning, while a bot cannot.

This is at least somewhat contested; I would not present it as fact.

This ventures a bit OT, but I think the fact that LLMs hold compressed and retrievable copies of their training sources raises some questions about what it means to duplicate text – if I rot13 something, paste it in a buffer, and then rot13 it again, I’d call that copying. I’m not sure that it matters that the manner of copying is highly obfuscated.

(And yes, this does open up the whole question of what human learning is. Like I said, it goes off-topic.)
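The rot13 point above is easy to demonstrate concretely. A minimal Python sketch (the "copied" snippet is a made-up example, not from any real submission):

```python
import codecs

# A hypothetical piece of source text someone wants to reproduce.
original = "def is_even(n):\n    return n % 2 == 0\n"

# Obfuscate it with rot13, "store" it, then rot13 again to recover it.
encoded = codecs.encode(original, "rot13")
decoded = codecs.encode(encoded, "rot13")

# The intermediate form looks nothing like the source...
assert encoded != original
# ...yet the round trip reproduces it byte for byte: still copying.
assert decoded == original
```

The analogy being drawn is that an obfuscated intermediate representation (whether rot13 text or model weights) doesn't change whether the end result is a copy; only whether the copying is easy to see.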

7 Likes

I would find this far more compelling if you actually tried it and got bad results. Else it’s just another instance of the ages-old genetic fallacy: things should be judged on their own merits (or lack thereof), not by their source.

The same is true of many people, you know :wink:

“You will know them by their fruits”

Things Copilot repeatedly said it could not do. It does not (or so it says) have access to its training data.

You consider the source and give it more or less weight as you judge appropriate. Copilot also repeatedly said that it should not be taken as a final arbiter, but could play a role in initial triage. Not the end of the process, but a possibly helpful start.

In the hypothetical case at hand, take a look yourself at the “GPL code from X” it pointed you at.

Are people good at that? Not particularly that I’ve seen. The multi-billion dollar copyright lawsuit between Google and Oracle ended in the US Supreme Court, with conflicting opinions (aka “guesses”) along the way by lower courts.

Nobody is suggesting that a bot make the final decision on anything. A human would do that regardless.

But I would (and did), because that’s my own experience over decades. Could be wrong! But I doubt that it is.

I recall only one instance where a CPython patch contained a verbatim copy of some glibc code. They weren’t trying to “cheat”; they were a capable and principled contributor who was simply unaware of GPL licensing terms. They were a mathematician at heart, not a software geek.

How did we detect it? The patch text had a comment plainly stating that the next stretch was copied from glibc source code. Else I doubt a human reviewer would have noticed. I certainly would not have.

In other cases I, as a subject-domain expert, asked for a “clean room” reimplementation of code that was clearly (to my eyes) a minor respelling of clever code I recognized from other projects, that went beyond (to my eyes) “fair use”.

Copilot said it has no access to its training data. Other systems may; I wouldn’t know. I care primarily about “is the result helpful or not?”, which can only be answered by trying it. If it too often pointed me at code bases with implausible connections to patches, I’d give it a rest and try again after another year passes. Assistants are moving targets that are improving rapidly now.

1 Like