LLMs are definitionally plagiaristic

5Arz...2PCy

30 Mar 2024

One of the things I’m most afraid of is leaving the future of technology (a.k.a. the future of civilisation) to technologists.
There are very few places where a humanities degree is a useful qualification (I should know: I have a couple of them) but I’ve found it a helpful tool for considering Artificial Intelligence (AI). This is not least because, at present, the public-facing uses for AI seem to be assaults on low-level workers with humanities qualifications, whether that’s Dall-E rendering a generation of graphic designers almost entirely useless, or generative language models cannibalising the work of anyone dumb enough to do an English degree.
I noted with interest yesterday that CoinDesk, a trade publication focused on Bitcoin and other crypto, had announced that it would be using AI in some of its articles. This, in itself, wasn’t that interesting; even though it’s only recently hitting the news, the truth is that many publications (particularly B2B ones) have been utilising some form of AI-generated text for a while. What was more interesting about CoinDesk’s announcement was the set of rules it spelled out for its use. Its five rules, as reported by my pal Ian Silvera, were:

1) Run all text through plagiarism-detection software.
2) Check sources are reliable.
3) Carefully fact-check all writing, including quotes, by a writer and editor.
4) Edit pieces with an eye toward adding the “human” element.
5) Make explicit AI’s role in the creation of the article.

These are, in my opinion, sensible rules, albeit nonsensical ones. They are nonsensical for a simple reason: Large Language Models (LLMs) are definitionally plagiaristic.
I don’t want to get into the weeds of undergraduate linguistic theory, but there’s a sense in which all language is plagiarism. The fact that I can see a grey, puffy bird (GREY + PUFFY + BIRD) and call that bird a PIGEON, is all down to the fact that someone, somewhere, pointed at that grey, puffy bird and called it a pigeon. And so any time, since that moment, that we have used the word PIGEON, we are, to some extent, plagiarising the uniqueness of the coinage. Of course, there is an accepted fair use argument for language, which is not just legally important but one that underscores the entirety of human development. Language is, after all, the most valuable invention in the history of pea-brained humanity.
Is this blog plagiarised? Not to the best of my knowledge, although, of course, everything that I write is, to some extent, a synthesis of all my learning, all my knowledge accumulation. The fact that I know about LLMs (even in this very rudimentary way) is down to sources like, say, the Financial Times or the Hard Fork podcast. The fact that I might write the words “with all my heart” is probably a by-product of late nights reading Jane Austen. The truth, and I believe this with all my heart, is that we output what we input.
But I am also organic matter, built of soft tissue and strange dreams. Tracing my plagiarism would be impossible; I am not coded to plagiarise. LLMs, which are trained on specific data sets (for example, you might train an LLM on Wikipedia, the works of James Joyce, or, increasingly, the entire internet), are not soft tissue or strange dreams. They learn in a fundamentally inorganic manner. When generative text is output, the pathway is always hypothetically traceable: this is not alchemy, there is no magic moment, only the creation of an alloy. AI is not capable of true imaginative generation (at the moment) and therefore it must, definitionally, utilise the existing corpus.
And so I find it strange when a publication like CoinDesk puts, as its primary rule for the use of AI in its articles, that it will run its pieces through plagiarism-detection software. This is, surely, only to cover their arses: they know full well that generative AI can fool most rudimentary plagiarism-detection models. It is like the most rubbish, easy version of the Turing test, asking incredible, brand-new tools to beat a fellow, decade-old piece of software. What the rules say to me is that they, as a publication, are not interested in interrogating the intellectual and existential questions of AI. If you are a publication that is so uninterested in questions of veracity, sourcing and originality, you are not much of a publication at all.
I don’t doubt that the folks at CoinDesk, and many others in the AI (and pro-AI) community, would define plagiarism, in the written word, as a stolen string of words (“the practice of taking someone else’s work or ideas and passing them off as one’s own,” according to Google’s define function). A + stolen + string + of + words = plagiarism. And that’s certainly true, but it’s a narrow interpretation, not legalistic even, but mathematical. In reality, plagiarism exists in many forms: plagiarism of ideas, of course, but also plagiarism of data and discovery, plagiarism of research and citation, plagiarism of images and metaphors and funny little turns of phrase. And if you can’t create you are bound, forever, to plagiarise.
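To see how narrow the “stolen string of words” definition is, here is a minimal sketch of a verbatim-overlap check. It is my assumption about how a rudimentary detector might work, not a description of any real product or of CoinDesk’s tooling; the names `word_ngrams` and `shared_ngrams`, the three-word window and the sample sentences are all hypothetical.

```python
import re

def word_ngrams(text, n=3):
    """Return the set of n-word sequences in text, lowercased."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_ngrams(candidate, source, n=3):
    """Return the word sequences that appear verbatim in both texts."""
    return word_ngrams(candidate, n) & word_ngrams(source, n)

# Illustrative sentences only (assumed for this sketch).
source = "Language is the most valuable invention in the history of humanity."
copied = "Some say language is the most valuable invention ever made."
reworded = "Nothing humans have ever invented is worth more than speech."

# A lifted string of words shares several three-word runs with the source.
print(len(shared_ngrams(copied, source)))
# The same idea in different words shares none, so this check stays silent.
print(len(shared_ngrams(reworded, source)))
```

A check like this can only ever see the stolen string. The borrowed idea, metaphor or turn of phrase passes straight through it.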
I am not against the use of AI in journalism. (In the same way that when the spaceships land I will declare myself “not against” our alien overlords.) But I think there is a dishonesty in pretending that, at present, LLMs, which are trained on human-generated language and only capable of producing new versions based on statistical probability, rather than creativity, aren’t plagiaristic. The minute I hit publish on this blog, it will enter the public realm and be absorbed into some large language model, somewhere. And down the line, when a chatbot spits out the word “spits” after the word “chatbot”, it might be because this blog gave it that option. Suddenly the intelligence knows that “spits” might follow “chatbot”. And I will likely never know that I’ve been plagiarised, that I’ve provided the requisite intelligence for the AI, and it would never show up in plagiarism-detection software. Until an AI is capable of true, uncoached creativity, its output must, definitionally, use and repurpose human intelligence.
And we’re doing humanities graduates everywhere a disservice if we pretend otherwise.
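For anyone who wants the “spits” after “chatbot” point made concrete, here is a toy sketch of next-word prediction by counting. It is a deliberate simplification: real LLMs are neural networks working on tokens, not tables of word pairs. The function names `train_bigrams` and `most_likely_next` and the tiny corpus are hypothetical, invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count, for every word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows

def most_likely_next(follows, word):
    """Return the most frequent follower of word, or None if unseen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Before this blog is absorbed, the model has only seen "chatbot answers".
corpus = ["the chatbot answers the question"]
model = train_bigrams(corpus)
print(most_likely_next(model, "chatbot"))  # answers

# After absorbing sentences like mine, "spits" becomes the likeliest option.
corpus.append("a chatbot spits out the word")
corpus.append("the chatbot spits out text")
model = train_bigrams(corpus)
print(most_likely_next(model, "chatbot"))  # spits
```

Nothing in the second output is a copied string of my words, which is why a verbatim plagiarism check would never flag it.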