The Problem

AI learns by being trained on material that is out there in the world. It then produces outputs to prompts based on that learning. On a good day, with skilled prompting, the output may come close to what a human can produce, but it typically falls short in a few areas. Although mediocre, much of the generated content may still be usable, and given the time saved, is often good enough. But the use of AI for content generation is growing at an exponential rate, which means more and more content will be produced by AI over time. That content is then put out into the world, and at some point it will get scraped for use as training material for future AI models.

Right now, AI is learning mostly from content that humans have created, because that is mostly what’s out there. But that is changing rapidly. As the interwebs fill up with more and more AI-produced content, AI will soon be learning mostly from content that other AIs have created. Today that output might be on par with low-to-average human-produced content. But tomorrow?

The Consequences

The inevitable result of current AIs being trained on the output of earlier-generation AIs is a slow but consistent dumbing-down. Outputs will drift further and further from what a human can produce, until the result is almost garbage.
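A toy simulation makes the mechanism concrete. In the sketch below, the "model" is just a Gaussian fitted by maximum likelihood rather than an LLM (an assumption made purely for illustration), and each generation is trained entirely on a corpus produced by its predecessor. The fitted statistics wander away from the original human data, and in the long run the spread collapses, which is the statistical analogue of the dumbing-down described above.

```python
# Toy model-collapse simulation. Assumptions: the "model" is a Gaussian
# fitted by maximum likelihood (mean + stddev), not an LLM, and each
# generation trains only on a corpus produced by the previous one.
import random
import statistics

random.seed(0)

CORPUS_SIZE = 100   # documents per generation (arbitrary)
GENERATIONS = 100

# Generation 0: "human" data, drawn from the true distribution N(0, 1).
data = [random.gauss(0.0, 1.0) for _ in range(CORPUS_SIZE)]

for gen in range(1, GENERATIONS + 1):
    # "Train" on the current corpus: estimate mean and stddev.
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    # The next corpus is generated entirely by the fitted model.
    data = [random.gauss(mu, sigma) for _ in range(CORPUS_SIZE)]
    if gen % 20 == 0:
        print(f"generation {gen:3d}: mean={mu:+.3f}, stddev={sigma:.3f}")

# The true values are mean 0 and stddev 1. The estimates wander a little
# each generation, and the errors compound instead of averaging out; over
# enough generations the stddev collapses toward zero and the "corpus"
# loses the variety of the original human data.
```

Run it a few times with different seeds: the direction of the drift varies, but the loss of fidelity to the original distribution is the constant.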

Side note: we worry about the singularity, when a super-intelligent AI achieves sentience and begins making decisions that are not good for humanity. But has anyone considered how dangerous a super-stupid AI could be, if it achieves sentience?

The Solution

One possible solution is to watermark all AI-produced material so that it can be reliably excluded from LLM training sets. That would hold the problem off for many years, but it would eventually have to be faced once the amount of human-produced content becomes so vanishingly small that it no longer suffices as training material. Another possible solution is to train only on internet content scraped up to 2022 and simply ignore anything after that. This is not sustainable either, as AIs would drift further and further from reality as time passes.
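To make both fixes concrete, here is a minimal sketch of a corpus filter combining a watermark check with a 2022 cutoff. Everything in it is assumed for illustration: the document format, the detect_watermark function, and the zero-width-space "watermark" are hypothetical stand-ins, not any real detection scheme, which would instead run a statistical test tied to the watermarking key.

```python
# Minimal sketch of a training-set filter (hypothetical, for illustration).
from datetime import date

CUTOFF = date(2022, 1, 1)  # the "pre-AI-flood" cutoff suggested above

def detect_watermark(text: str) -> bool:
    """Hypothetical detector: returns True if `text` carries an
    AI-generation watermark. Placeholder logic only; a real detector
    would test the text against the watermarking scheme's key."""
    return "\u200b" in text  # stand-in: zero-width-space marker

def keep_for_training(doc: dict) -> bool:
    """Keep a document only if it predates the cutoff AND carries no
    AI watermark. `doc` is assumed to look like:
    {"text": str, "published": datetime.date}."""
    if doc["published"] >= CUTOFF:
        return False
    return not detect_watermark(doc["text"])

corpus = [
    {"text": "Handwritten essay.", "published": date(2019, 5, 4)},
    {"text": "Model output\u200b.", "published": date(2023, 8, 1)},
]
training_set = [d for d in corpus if keep_for_training(d)]
print(len(training_set))  # -> 1: only the 2019 human-written document
```

Note that the two rules compound the scarcity problem: each one shrinks the usable pool, which is exactly why neither is sustainable on its own.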

Unless there is some way to tag or watermark AI-generated content, AI will inevitably dumb itself down over time, until the entire corpus of knowledge on which machine learning is based becomes nothing but an empty husk.