My Mental Model of LLMs

Yesterday’s AI sell-off (and today’s weakness in Google, because they didn’t get the message that the street no longer wants to see massive AI investments) is not surprising. I’ve thought some version of this has been coming, for a few reasons. First, we’re not seeing data that shows productivity improvements. Second, we’re not seeing cost reductions. Third, my mental model of LLMs precludes there being intelligence there at all.

The first two points are the most straightforward. Let’s start with the actual impact on productivity. When I hear someone say they used an LLM to write code in a couple of minutes that would have taken them a couple of hours, I believe them. But the question is not whether the individual task is more productive; it’s the actual impact on overall productivity. This is a variation of Amdahl’s Law: the total impact on productivity is the time saved on the one task divided by the total time for all tasks. That has to be weighed against the cost of providing the AI. We’ve only seen anecdotal stories of AI improving specific tasks, which may be central to a job but occupy only a small fraction of the time in a given month. It’s quite possible a sustainable AI subscription price (not subsidized by investors) is well above the actual productivity gain over a month.
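The Amdahl’s Law point can be made concrete with a small sketch. The numbers here are hypothetical, chosen only for illustration:

```python
def overall_speedup(task_fraction, task_speedup):
    """Amdahl's Law: overall speedup when only a fraction of the work is accelerated."""
    return 1.0 / ((1.0 - task_fraction) + task_fraction / task_speedup)

# Hypothetical: coding is 20% of the month, and the LLM makes that part 4x faster.
print(overall_speedup(0.20, 4.0))  # ≈ 1.18, i.e. only an 18% overall gain
```

Even a dramatic 4x speedup on one task yields a modest gain over the whole month, and that modest gain is what a subscription price has to beat.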

The second item is also straightforward: improvements in performance require using multiple stages of “reasoning.” For example, I prompt the AI to write code and it writes the code. It then runs the compiler and tests. It then fixes all the bugs. It then re-runs the tests. It then audits the code. It rewrites some of it, producing new bugs, fixes those bugs, and so on. After a few minutes, it’s finished. I go back in and make some comments about the structure, and the AI starts churning again. The original code may have been hundreds of tokens, but the entire loop might be tens of thousands of tokens. Even if the cost of one token has dropped to 1/10 of what it was in 2022, we now use hundreds of times more tokens. That pushes up the profitable subscription price and runs into my first point: the actual value of any savings.
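The token arithmetic above is worth writing out. The specific prices and counts here are made-up round numbers, but they follow the shape of the argument:

```python
# Hypothetical numbers: per-token price fell to 1/10 of its 2022 level,
# but the agentic loop burns ~100x the tokens of a one-shot completion.
price_2022 = 1.0           # arbitrary cost units per token in 2022
tokens_2022 = 500          # a single completion: hundreds of tokens
price_now = price_2022 / 10
tokens_now = 50_000        # multi-stage reasoning loop: tens of thousands

cost_2022 = price_2022 * tokens_2022   # 500 units
cost_now = price_now * tokens_now      # 5,000 units
print(cost_now / cost_2022)  # 10.0 -> the cheaper token still costs 10x more per job
```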

But let’s move on to the central reason I don’t believe LLMs are intelligent (which is not to say they aren’t useful – I use them at work). Start by imagining a function (a mathematical machine that produces outputs for a given input) that produces the most likely text given some input text. If we give it the phrase “to be or not to be,” it spits out a meaningful response, such as the rest of Hamlet’s soliloquy. We will never be able to write down this function, because we don’t have a complete model of all human speech and the meaning around that speech. But while we can’t know this function, we can estimate it, given a lot of example inputs and outputs.

Enter the neural network. One property of multi-layer neural networks is that they are excellent function estimators. If you have some function (a math machine that makes outputs from inputs) and all you have are example inputs and outputs, a neural network can be trained to act like that function. The bigger the neural network, the better it can estimate a given function1. But the bigger the network, the more data it requires to train all the “neurons.” (The neurons are actually weights by which the inputs to the “neuron” are multiplied and then “squashed” to form the neuron’s output for the next neuron in the chain.) If you like, each neuron is a small function we tune when training the network, so that together they estimate the internals of the actual function (which we do not know).
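The parenthetical description of a neuron — multiply inputs by weights, sum, then squash — can be sketched directly. The weights here are arbitrary placeholders; training is what would tune them:

```python
import math

def neuron(inputs, weights, bias):
    """One 'neuron': multiply inputs by weights, sum, then squash (sigmoid)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # squashed into (0, 1) for the next neuron

# Hypothetical weights; a trained network would have tuned these values
# so that layers of such neurons mimic the unknown target function.
print(neuron([0.5, -1.2], [0.8, 0.3], 0.1))
```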

What the LLM estimates is a probability density function over language. This is a fancy way of saying it’s estimating the probability of producing a specific piece of text, given some input text. I send my text into the function and it spits out some output. What context does is make that probability function conditional. Conditional probabilities are statements like the probability it has rained, given the ground is wet. The “ground is wet” conditions the probability of it having rained. Irrespective of the ground being wet, on any given day there might be a 3% chance of rain. But if the ground is wet when we go out, there might have been a 65% chance of rain and maybe a 35% chance my neighbor watered their lawn.
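The rain example is Bayes’ rule in miniature. With hypothetical likelihoods (how often rain leaves wet ground, how often something else — like the neighbor’s sprinkler — does), conditioning on “the ground is wet” moves the probability of rain far above its 3% prior:

```python
# Bayes' rule with hypothetical numbers.
p_rain = 0.03                 # prior: chance of rain on any given day
p_wet_given_rain = 0.95       # rain almost always leaves the ground wet
p_wet_given_no_rain = 0.02    # e.g. the neighbor watering their lawn

p_wet = p_wet_given_rain * p_rain + p_wet_given_no_rain * (1 - p_rain)
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet
print(round(p_rain_given_wet, 2))  # ≈ 0.59: the condition reshapes the probability
```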

When we feed an LLM a prompt plus some context, what we’re asking is: what is the most probable text that follows this text, given the context? It’s no longer examining all possible answers, only the most likely answers given the context. Hence, the conversation and my prompt files help tune the answer into a better answer. By giving a coding LLM specific examples of what I want, I condition its output to provide the most likely additional text one would expect to see if other code similar to the examples were already present. That is why I need to spend not just a few minutes, but sometimes significant effort, writing and curating the prompt injection files.

Fortunately for our world, things are rarely unique and novel. If I have to add Passport authentication to an application, it is rarely a completely novel exercise. Chances are, it looks like the vast majority of similar integrations. A machine that generates the most likely text for some input text, given a context, may well produce a working implementation. Even in my day job, where I work on a novel processor, assembly language and C code for things like interrupt handlers have not changed much in the last 40 years. For some tasks, an LLM will probably perform to the standards set for an average human practitioner, simply because we have an estimate of what we would expect an average human practitioner to produce.

If we move outside the range of the training data, neural networks can generalize, but they lose accuracy. A neural network can be trained to within some level of error, but only on the training data and test data. Outside that data, or at its extremes, performance falls off. That is why novel languages or truly novel problems are more difficult for a neural network. It can mimic thought as it produces text from its statistical model, because that’s what thought looks like to us; that’s how we expect thinking around that topic to appear. That is why LLMs can even produce mathematical proofs. Just because it completes the idea prompted by “men are mortal and Socrates is a man” does not mean it reasoned at all.

There is a baseline randomness to LLMs, without which they would be useless. As they produce their output, it might not be exactly what was in the training data, because other training data expands or contradicts any given example. In addition, without a little injection of randomness (intentional or not), the output would lose the illusion of intelligence. If I give it “to be or not to be,” I might get 10 variations on the standard Shakespeare and one odd-ball response, or the wrong play, or a critical analysis of Hamlet. I might get what I view as the “wrong” response. However I use the LLM, I have to accommodate the possibility of getting a wrong answer as part of my cost of using it, which is why I have to review, and often refactor, the LLM’s code before adding it to my code-base.
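That injected randomness is, mechanically, sampling from the estimated probability distribution rather than always taking the most likely token. A minimal sketch, with made-up logits standing in for a model’s output — most draws pick the “standard” continuation, but the odd-ball answers still occur:

```python
import math
import random

def sample_next_token(logits, rng):
    """Softmax over logits, then draw one token id from the distribution."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]   # subtract max for stability
    total = sum(exps)
    r = rng.random()
    cum = 0.0
    for token_id, e in enumerate(exps):
        cum += e / total
        if r < cum:
            return token_id
    return len(logits) - 1

# Hypothetical logits: token 0 is the expected Shakespeare continuation;
# tokens 1 and 2 are the rarer odd-ball responses.
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(1000):
    counts[sample_next_token([4.0, 1.0, 0.5], rng)] += 1
print(counts)  # token 0 dominates, but not every time
```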

This is where we come back to my first two points, and perhaps the central issue. Is it profitable to use an LLM-based system to improve the productivity of a worker to the point it pays for the system? While individual circumstances vary, we need to look broadly. Even with the detailed mental model I painted, we come back to a basic problem of economics: the benefit through productivity must exceed the cost of running the model (or paying for API tokens). That answer is not clear cut. One benefit of writing the prompt files is that I work through various ideas before writing software, usually in much greater detail than a design document. I have to work in small steps. I need to refactor the code and do more critical reviews. If someone tracking my time said I didn’t actually save three hours, because I spent two hours writing prompt files, doing code reviews, and refactoring, I might believe them. And if they then said the cost of the tokens I burned was equivalent to that remaining hour in savings, that might also be true. And if that’s the case, there isn’t much of a case for using LLMs.
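The break-even scenario in that paragraph is easy to write down. The rate and token cost are hypothetical; only the hours come from the example above:

```python
def net_benefit(hours_saved, extra_hours_spent, hourly_rate, token_cost):
    """Net value of using the LLM: time saved, minus overhead and token spend."""
    return (hours_saved - extra_hours_spent) * hourly_rate - token_cost

# Hypothetical: 3 hours "saved," 2 hours spent on prompt files, reviews,
# and refactoring, and token costs equal to the remaining hour's value.
print(net_benefit(hours_saved=3, extra_hours_spent=2, hourly_rate=100, token_cost=100))  # 0
```

When that expression comes out at or below zero, the LLM hasn’t paid for itself, which is exactly the scenario the time-tracker describes.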

  1. It is possible to over-train networks for small problems, but all of human language is not a small problem. ↩︎
