The Great Smoothing: Another Reason Not To Fear (or trust) AI

Bad news, friends. Some people—none of you, I am sure—pass off AI as if it were their own work. There is a certain profession, which I won't reveal except to say it starts with "JOURNA", that does this more or less routinely now. Perhaps they learned this in college, where, we hear, student cheating with AI is rampant.

The bad side of this is obvious, so I say nothing more about it. But there is a good side!

I bring you the terrific news (again) that there is no reason to panic about AI. As we discussed many times, what will happen is this: computer models—which is to say, AI—will begin to use AI output as part of their training data. And the output from those duly trained computer models (AI) will again be passed off as genuine by scurrilous operators.

Repeat the cycle.

This will lead to what I have been calling The Great Smoothing.

And the realization of this is the good news: there is no reason to panic over AI.

Again, we have discussed this before, too, but the nice lady I live with told me I ought to repeat myself to make myself understood. So here today is a simple example. If I had access to the code of, say, ChatGPT or some other LLM (the last word, we shall not forget, is Model), I would demonstrate it with that. Alas, I do not. I'm just one (canceled) man on the outer edge of the known internet.

Here is some data we used in Class. It is a time series of a satellite-derived measure of Arctic sea ice extent, in millions of square kilometers. That it is derived means we are already looking at a model of sea ice, and not actual sea ice, which nobody knows. But let that pass. Ignore this fundamental twist.

Let's now AI-ify this; i.e., build a model. We want it to answer the prompt "What is the Arctic sea ice extent for date X?", where X is anything within a reasonable range.

All models smooth. Models cut down the highest peaks and fill the lowest valleys. They are like balloons in a certain way. We can do a better job modeling peaks, push the balloon higher there, but at some expense elsewhere, where the balloon distorts. No model is perfect. No model bats a thousand. No model exactly reproduces Reality. We use Reality for that.

But models have uses, and can be good. Here’s an AI of the data in red. (I used a loess with a span of 0.02, from R’s stock library.)

The original data is included. You can see the model smooths, but it’s not at all bad for many purposes. Success!
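The author's model was R's loess with a span of 0.02; lacking the actual data and code, here is a rough Python sketch of the same idea. A centered moving average stands in for loess, and a made-up seasonal series stands in for the sea-ice record. Everything here (the `smooth` function, the 73-point window, the synthetic numbers) is an assumption for illustration only.

```python
import math
import random

def smooth(y, window=73):
    """Centered moving average: a crude stand-in for the author's loess.
    Edges average over whatever neighbors are available."""
    n, half, out = len(y), window // 2, []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(y[lo:hi]) / (hi - lo))
    return out

# Made-up stand-in for the sea-ice series: an annual cycle plus noise.
# (The real data are satellite-derived; these numbers are invented.)
random.seed(1)
ice = [10 + 5 * math.sin(2 * math.pi * t / 365) + random.gauss(0, 0.5)
       for t in range(3650)]  # ten "years" of daily values

gen1 = smooth(ice)  # the "first generation AI": a model of the data
```

The smoothed series tracks the seasonal swing well while shaving the noise, which is what makes it "not at all bad for many purposes."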

Now suppose some scientist comes along and tries to pass off this model data as if it were genuine.

“Briggs, that never happens. Scientists are honest, and Science itself is self-correcting.”

That so? Then how do we explain headlines like this recent one: "A medical journal says the case reports it has published for 25 years are, in fact, fiction". Some 138 case reports, some used in legal decisions, all faked. Hilarious, right?

“That’s just Science self-correcting itself.”

Well, have it your way. For now, this is proof that data used as genuine may indeed be fake. And we know that is certainly true in areas like “JOURNA”.

That red data is taken as if it were real. Then the second generation AI comes along and trains on it. Then it produces output, shown in green in this picture.

I also left in the original data and the red (first generation AI). The green, second generation AI is smoother still. But it's also still not terrible. We've seen many worse models. It would pass.

Yet, and I hope you saw this coming, here is bad boy scientist number two who passes off this second generation AI model data as his real data.

Then guess what happens. Yes, due to the pressures of publish-or-perish, another guy does it. Then another. And so on.

Here’s the result of 50 guys doing it:

The blue line is the fiftieth generation AI; that is, AI trained on data successively passed off as genuine. I don’t think 50 is too many, either, especially when you consider how rife cheating is becoming.
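The fifty-generation cycle can be sketched the same way: each generation "trains on" (here, simply smooths) the previous generation's output. As before, the moving-average model and the synthetic seasonal series are stand-ins of my own, not the author's actual code or data.

```python
import math
import random
import statistics

def smooth(y, window=73):
    """Centered moving average: an assumed stand-in for fitting a model."""
    n, half, out = len(y), window // 2, []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(y[lo:hi]) / (hi - lo))
    return out

# Made-up stand-in for the sea-ice series: an annual cycle plus noise.
random.seed(1)
ice = [10 + 5 * math.sin(2 * math.pi * t / 365) + random.gauss(0, 0.5)
       for t in range(3650)]

# Each "generation" trains on only the previous generation's output,
# mimicking AI output passed off as genuine and fed back in.
gens = [ice]
for _ in range(50):
    gens.append(smooth(gens[-1]))

# Spread (standard deviation) of each generation, from raw data to gen 50.
spread = [statistics.pstdev(g) for g in gens]
```

With the original data nowhere in the loop, the spread shrinks generation after generation: the fiftieth generation has lost most of the seasonal swing and heads for the flat average, like the blue line.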

We’ve seen this with AI-generated images many times. Somebody starts with a picture and asks AI to generate the same picture again. The output looks similar to the input picture. So similar it takes a sharp man to see any departures. Maybe even the second generation, like ours above, looks close enough to the original nobody would notice.

But by the 50th? It heads right for the average, like our blue line. The peaks, the details in the image (especially behind the faces), have all been shaved away, and valleys all filled. The colors collapse to a murky medium brown. The Great Smoothing wins again.

This is guaranteed to happen. Perhaps some of it can be slowed if training data suspected of being model output is rejected, but some of the slop will slip through. It is inevitable. Some of this can be mitigated by hard-coding rules, as in the successive-picture example (say, maintaining background integrity, as is done for models which make "movies"), but you can't code for all contingencies. It's a never-ending race to keep up.

The AI Slop Smoothing Test

This is easily tested. I mean the claim itself. Simply take an LLM and train it on known good data, i.e. data containing no AI model output. Then put the model through a range of tests, examining output across a variety of topics.

Then include that output as part of the new training data (we also keep the original training data). Do the same tests. And iterate some small number of times, like our fifty.

Then look at the final output, particularly compared with the first. The Great Smoothing will win.
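A toy version of this test, under the same stand-in assumptions as before (a moving-average "model", a made-up seasonal series): each round's training set keeps the originals and mixes in the previous model's output, and we compare the first generation's output with the fiftieth's.

```python
import math
import random
import statistics

def smooth(y, window=73):
    """Moving-average 'model': an assumed stand-in for an LLM fit."""
    n, half, out = len(y), window // 2, []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(y[lo:hi]) / (hi - lo))
    return out

# Invented stand-in data: an annual cycle plus noise.
random.seed(1)
real = [10 + 5 * math.sin(2 * math.pi * t / 365) + random.gauss(0, 0.5)
        for t in range(3650)]

train = real[:]          # the first generation trains on clean data only
outputs = []
for gen in range(50):
    out = smooth(train)
    outputs.append(out)
    # Next round's training set keeps the originals and mixes in the
    # model's output (a pointwise average is a crude stand-in for
    # pooling two corpora).
    train = [(r + o) / 2 for r, o in zip(real, out)]

first, last = outputs[0], outputs[-1]
```

Keeping the originals in the training mix slows the collapse: in this sketch the fiftieth generation still retains most of the seasonal swing. But its output is nonetheless smoother than both the raw data and the first generation. The Smoothing still wins, just more slowly.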

Again, this is good news if one of your fears was AI becoming AGI and taking over the world.

Discover more from William M. Briggs
