AI Cannot Hallucinate Nor Lie

I have admitted many times, grudgingly, beginning long ago, that computer science guys are brilliant at marketing. That adjective is a woeful understatement. Genius, though so overused as to be almost drained of meaning, is far better. The term artificial intelligence itself ought to win every advertising award going.

Now they have come up with, ta da, AI hallucinations. Well, I could insult and berate this, because the idea is asinine, but that would only be jealousy speaking. I could only wish I had a fraction of the talent to come up with catchy tags or titles.

So. Not only can AI not hallucinate, nor lie, it cannot tell the truth, either. Indeed, AI cannot tell anything. Further, using language like this to describe the output of AI, even in a metaphorical sense, causes too many to fall into the Deadly Sin of Reification.

Take this story “OpenAI’s new reasoning AI models hallucinate more”:

OpenAI’s recently launched o3 and o4-mini AI models are state-of-the-art in many respects. However, the new models still hallucinate, or make things up — in fact, they hallucinate more than several of OpenAI’s older models.

Hallucinations have proven to be one of the biggest and most difficult problems to solve in AI, impacting even today’s best-performing systems. Historically, each new model has improved slightly in the hallucination department, hallucinating less than its predecessor. But that doesn’t seem to be the case for o3 and o4-mini.

According to OpenAI’s internal tests, o3 and o4-mini, which are so-called reasoning models, hallucinate more often than the company’s previous reasoning models — o1, o1-mini, and o3-mini — as well as OpenAI’s traditional, “non-reasoning” models, such as GPT-4o.

Here is a simple AI model: “x + 1”, allowing the user to input any integer x. You chat with the AI, ask it to run the “add 1” program, and give it an x. Say 17. The output reads “18”. Has AI told you the truth? No. It has sent some electrons through some switches and spit out a pattern on a screen which you, my dear reader, bring meaning to. To you it is true. Truth is a judgement. It requires a mind to make a judgement. To the machine it is nothing, for no thing is ever a thing for a machine.
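If it helps to see the machine stripped to its bones, here is a minimal sketch of that “x + 1” model in Python, my own illustration and nobody’s product: the machine maps input to output, and any “truth” in the “18” is supplied by you.

```python
# A minimal sketch of the "x + 1" model: the machine only maps input to output.
# Whether "18" is true is a judgment the reader makes, not the machine.

def add_one_model(x: int) -> int:
    """The whole 'AI': shove electrons through switches, return a pattern."""
    return x + 1

if __name__ == "__main__":
    x = 17
    print(add_one_model(x))  # prints 18; the machine "knows" nothing about it
```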

What about “hallucinating”? This is when, they say, the model gives an incorrect answer. Complex models are never, that I heard tell of, perfect. They will err.

Example: I, being lazy, last week demanded the Grok model produce code to calculate the characteristics of a certain RF bandpass filter. (I used to get code from searching sites like Stack Overflow.) At one point, and the key point, it spit out a “<=” (less than or equal to), the exact wrong answer, where what was wanted was “>=” (greater than or equal to). This was tied to a decision to be made with the filter. So it was no small thing. This was easy to correct, because I knew in advance what I was after. But if you didn’t and trusted the model, you’d be in deep kimchi. (Incidentally, I “told” the model it made a mistake and where, and it spit out apologies and corrections. So the model does well at handling some input text.)
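To make the nature of that error concrete, here is a hypothetical sketch, not Grok’s actual output and not my filter’s real numbers, showing how one flipped comparison changes the decision:

```python
# Hypothetical sketch (not the actual generated code): a single flipped comparison
# changes the pass/fail decision for a bandpass filter spec check.

def meets_rejection_spec_wrong(attenuation_db: float, required_db: float = 40.0) -> bool:
    # Buggy version: "<=" where ">=" was wanted.
    return attenuation_db <= required_db

def meets_rejection_spec_right(attenuation_db: float, required_db: float = 40.0) -> bool:
    # Corrected version: stopband attenuation must meet or exceed the requirement.
    return attenuation_db >= required_db

measured = 35.0  # dB of stopband attenuation, a made-up figure
print(meets_rejection_spec_wrong(measured))   # True  -- the buggy check approves a bad filter
print(meets_rejection_spec_right(measured))   # False -- the correct check rejects it
```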

If you knew you were dealing with a model, like say a weather forecast, you’d know there is a possibility of error in the model output, even if you didn’t know the precise characteristics of that error. You would never give total trust to a weather forecast. Yet with AI, because of the vibe of the model’s name, far too many people are too trusting.

Long ago, in pre-AI hype days, among modelers there was the lore against “over-fitting”. As I have tried to teach you in Class, for any observed set of data an infinite number of models can be discovered which fit that data to whatever degree of accuracy one likes, even perfectly. This being so, it becomes simplicity itself to find a model to fit your data.

Problem is, once you take this wonder model, fit to old data, and use it to predict new data, you are extremely likely to see it produce nonsensical answers, especially on the “outskirts” of the data or when that data itself is far from simple. And even more especially if the model is forced to give answers. As most AI is.
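Here is a toy demonstration, with made-up data, of what an over-fit model does when you push it past the old data:

```python
# Toy demonstration of over-fitting with made-up data: a degree-9 polynomial
# fits ten old observations perfectly, then goes haywire on the outskirts.
import numpy as np

rng = np.random.default_rng(0)
x_old = np.linspace(0, 9, 10)
y_old = 2.0 * x_old + rng.normal(0, 1, size=10)   # roughly linear data with noise

overfit = np.polyfit(x_old, y_old, deg=9)          # one coefficient per data point
simple  = np.polyfit(x_old, y_old, deg=1)          # the "smoother" model

x_new = 12.0                                       # a point outside the old data
print(np.polyval(overfit, x_old) - y_old)          # ~zero residuals: a "perfect" fit
print(np.polyval(overfit, x_new))                  # typically a wild prediction, nowhere near 24
print(np.polyval(simple, x_new))                   # close to the sensible 2 * 12 = 24
```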

Language, except in places like DIE lectures and Harvard’s simplified math classes, is far from simple. Models fit too tightly will thus be more likely to produce error or over-certainty. The simple expedient of coding AI models to output “I don’t know” when, internally, answers are too improbable, would do wonders, as this snippet proves:

A recent study from researchers at Cornell, the universities of Washington and Waterloo, and the nonprofit research institute AI2 sought to benchmark hallucinations by fact-checking models like GPT-4o against authoritative sources on topics ranging from law and health to history and geography. They found that no model performed exceptionally well across all topics, and that models that hallucinated the least did so partly because they refused to answer questions they’d otherwise get wrong.

AI workers are having a blast writing things like this, speaking as if the machines they themselves code and tell what to do are somehow alive. That paragraph speaks of entities, not machines. Writers ought to check this behavior. It’s going to eventually come back to bite them in the keisters. Or maybe this is only me hoping it will, because I find the Deadly Sin of Reification so grating.
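Back to the mechanics of that refusal expedient. Here is a hypothetical sketch, assuming, which no vendor need grant us, that the model’s best-answer probability is available, of outputting “I don’t know” when that probability is too low:

```python
# Hypothetical sketch of the "I don't know" expedient: refuse to answer when the
# model's own probability for its best answer falls below a chosen threshold.
# The toy "model" below is a stand-in, not any vendor's actual API.
from typing import Callable, Tuple

def answer_or_abstain(
    question: str,
    model: Callable[[str], Tuple[str, float]],
    threshold: float = 0.7,
) -> str:
    answer, prob = model(question)          # model returns (best answer, its probability)
    return answer if prob >= threshold else "I don't know."

# A toy stand-in model: confident about one fact, guessing about everything else.
def toy_model(question: str) -> Tuple[str, float]:
    if "capital of France" in question:
        return "Paris", 0.98
    return "A guess", 0.31

print(answer_or_abstain("What is the capital of France?", toy_model))  # Paris
print(answer_or_abstain("Who won the 1911 county fair?", toy_model))   # I don't know.
```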

As proof that over-fitting is at work:

There’s been other academic attempts at probing the “factuality” of models, including one by a separate AI2-affiliated team. But Zhao notes that these earlier tests asked models questions with answers easily found on Wikipedia — not exactly the toughest ask, considering most models are trained on Wikipedia data.

Asking an over-fit model to reproduce the data it was fit to works well, as expected. It’s like I said: when the model is made to predict new data, it flubs.

What’s the fix? Dial back over-fitting; i.e., simplify the model. But that comes at the price of “smoother” output. Which is less interesting than sparkling predictions. All correlational models, which includes most of AI (which has partial causal aspects built in, like ending sentences with punctuation), smooth data. If you’ve ever seen a line drawn over a time series, you have seen smoothing in action. All the peaks and valleys disappear into the smooth soft model. Which, alas, so distracts many that they come to believe the model is Reality. I’ll cover this in Class.
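For anyone who has never watched it happen, here is a minimal smoothing sketch with made-up numbers: a moving average drawn over a “time series”, pulling the peaks and valleys toward the mean.

```python
# Minimal smoothing sketch with made-up numbers: a moving average flattens the
# peaks and valleys of a "time series" toward the local mean.
import numpy as np

series = np.array([3.0, 9.0, 1.0, 8.0, 2.0, 10.0, 0.0, 7.0])
window = 3
smoothed = np.convolve(series, np.ones(window) / window, mode="valid")

print(series.max() - series.min())       # 10.0 -- the raw peaks and valleys
print(smoothed.max() - smoothed.min())   # 3.0 -- a much smaller spread after smoothing
```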

That AI smooths is why, for instance, if you ask it to edit text it all comes out sounding the same. The peaks and valleys of the language are being drawn to the mean, which is some function of its input training data.

What’s not obvious is that models can be over-fit and smooth simultaneously! This is when the model is itself heterogeneous, say, by containing many different modules devoted to different tasks. Some of these can be overfit and produce goofiness and others can be underfit (as it were) and sandpaper the results.

Here’s an interesting visual example of both things happening to the same prediction. Someone asked an AI model to do a feedback loop: “Generate the same photo 5 seconds into the future.” It began with the distracted boyfriend meme, and ended after 13 seconds with what you see (click the link to see the video). Horrible, hilarious over-fitting.

Some complained the angling toward diversity reflects the AI’s training data bias. Well of course it does! All AI does! All models only say what they are told to say. The limited color palette, disappearance of the background, and lack of all other details reveal the smoothing.

As long as you know you’re working with a model, and not an “entity”, you’ll be fine.

5 Comments

  1. Russo

    I have said to anyone who will listen that “artificial intelligence” is a misnomer. It would be more aptly named “Simulated Intelligence”.

  2. Brad Tittle

    In Fluid Dynamics back in 1990, my professor told the story of a researcher who came in with a polynomial model of fluid flow. He was attempting to sell it to the engineering market. He had 10,000 parameters fitting a polynomial curve… Back then, they told him to go jump in a lake.

    I like to imagine that jumping in a lake was a way to reconnect this researcher with fluids.

    I was ejected from one Skeptic community because I suggested that 2x4s be part of physicists’ tool set. Every once in a while it is necessary to hit such a person with a 2×4 or something else solid to remind him HE IS IN A PHYSICAL WORLD… the 2 x 4 in question is now 1.5″ x 3.5″ … it is now like 1 and 7/16 by 3 and 7/16… Which leads me to the time we set up a Pergola, and put in our 6″ x 6″ posts and went to put the joining material in place and were scratching our heads as to why the measurements were off… Someone took his tape to the 6×6… “HOLY EFFING TAMALES BATMAN… These are true 6×6 posts… ” Everyone standing around thought they were 5.5 x 5.5…

    NEVER forget your tape measure… Having a stud finder is also useful…

    Then I ran into “Junk Science Judo”. I got there via John Brignell. He sent me here. RIP Numberwatch.co.uk.. You were a bright spot in the nightmare of the modern world. Between Brignell, Milloy and our beloved host, I gained a great appreciation of the evil that is contained in epidemiology. Please forgive me for sullying Epidemiology a little by saying it has some connection to the LLMs that we like to call AI…

    Grok tells me I am not off my rocker when I suggest that AI is little more than epidemiology on steroids. I suspect it suffers from the same fatal flaw. The data that is the most important is the data that is not there, because it has been thrown away.

    The true knowledge is the summation of all that we know that is not true… But that knowledge bank is not generally findable, because we call it the wastebin..

  3. DWSWesVirginny

    Contra the respondent “Russo” neither “artificial intelligence” nor “simulated intelligence” is apt. As Briggs points out, we are really talking about machines, not “entities.” What is going on is that results are being obtained from a machine that has been programmed, and in no way are they the products of an intellect.

  4. Douglas Skinner

    Contra the respondent “Russo” neither “artificial intelligence” nor “simulated intelligence” is apt. According to Briggs (and I agree with him) we are dealing with machines that are programmed, not entities. The results that are obtained from these models are machine output, not the products of an intellect.

  5. Uncle Mike

    Wrong word: AI isn’t hallucinating, it has dementia. Dementia is in, it’s hip, it’s cool. All the wokies have dementia. Drooling, vacancy, early early onset, autopens, car scratching, outbursts, manic depression — it’s the new behavioral model favored by the Left. It’s not a lie if you have dementia.
