The AI problem is one of fit: an ambitious misapplication of tools suited to one quite limited purpose (i.e., generating believable language) to another purpose (i.e., the retrieval of written knowledge). In the market hype and AI bubbles, I see a consistent failure to interpret AI output in the context of what the models actually produce. This arises from a disconnect between what the output looks like and what the models do.
The most straightforward example of this in AI is the large language model. A large language model is meant to generate structures of language. Applying it to the preservation of written knowledge is a mismatch between the model and its use.
One aspect of AI that has confounded me is the term “model.” My understanding of models comes from statistics, and I am, to be clear, not a statistician. We don’t need to go too far into the weeds here. Two kinds of model are relevant for us: structural or explanatory models, which aim to tell a story about the data, and statistical models, which describe the data.
A structural model is deployed to understand the causes of things within the data, typically in response to narrow questions. You see these in economics, where people test a new policy by exploring how it changes the flow of behavior on a large scale. If we know that people prefer cheaper cars, and a policy makes cars more expensive, we expect people to buy fewer cars. These models assume an understanding of the underlying structures of the things being tested. Run new data in, and it flows through the structure in the same way.
Then, there are statistical models, which aim to describe patterns in the data. If you have unstructured data and want to know what relationships exist in that data, a statistical model is one way to do it.
An LLM is a statistical model, specifically a probability model. It discovers patterns in order to predict the set of words most likely to follow a prompt. With enough data, automated statistical analysis of billions of texts took place, and indeed, LLMs discovered patterns of predictability in the way language is written: rules of grammar, likely word associations, and even some degree of correspondence to recognized facts, as in, a koala is a mammal. It is crucial to note, though, that as a model of language, the statements “a koala is a mammal” and “a koala is a serpent” are equally “true” within the model of language an LLM has developed, even if one of them fails to correspond to the world’s current arrangement of facts.
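To make that concrete, here is a deliberately tiny sketch in Python of prediction by frequency. It is nothing like a production LLM, which learns distributed representations over billions of tokens, but the underlying logic is the same: a continuation wins because it was counted more often, not because it is true. The corpus and the scoring function are invented for illustration.

```python
# A deliberately tiny "language model": co-occurrence counts and nothing else.
# The corpus and scoring are invented for illustration; real LLMs learn far
# richer statistics, but prediction is still driven by frequency, not truth.
from collections import Counter

corpus = [
    "a koala is a mammal",
    "a koala is a mammal native to australia",
    "the koala is a mammal not a bear",
    "a snake is a serpent",
]

# Count how often each word follows each other word in the corpus.
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigrams[(prev, nxt)] += 1

def continuation_score(prompt: str, candidate: str) -> int:
    """Score a candidate next word by how often it followed the prompt's
    final word in the corpus. The koala itself plays no role: only counts do."""
    return bigrams[(prompt.split()[-1], candidate)]

prompt = "a koala is a"
for candidate in ("mammal", "serpent"):
    print(candidate, continuation_score(prompt, candidate))
# "mammal" outscores "serpent" only because it was written more often,
# not because the model knows anything about animals.
```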
Understanding that a koala is a mammal and not a serpent requires an explanatory element: the terrain of structural models. An LLM might arrive at “a koala is a mammal” more consistently than “a koala is a serpent” because one set of words is more often conjoined in the large language dataset. That does not prove the existence of a structural understanding of that relationship inside the LLM. The relationship between koalas and mammals is understood in the world outside the dataset, and is therefore more commonly referenced in the language people use.
In other words, people know koalas are mammals and not serpents, so they write that koalas are mammals more often. What humans write reflects that structural understanding, and the understanding is therefore represented in the dataset. It doesn’t need to be understood by the model to appear accurately in its output.
This is why large language models are called what they are: large models of language. The misapplication of these models is where the most ambitious industry leaders and AI evangelists steer themselves astray. We’ve said it often: placing a model for predicting language into the category of predicting the world is wrong. Mistaking the generation of text that reflects an understanding of thought structures for an understanding of those thought structures is wrong, too.
Scaling the Wrong Structures
Smashing descriptions of models into predictions of the world is nothing new. Back in 1924, Mills warned statisticians against the same type of error (emphasis mine):
“In approaching [statistics] we must first make clear the distinction between statistical description and statistical induction. By employing the methods of statistics it is possible, as we have seen, to describe succinctly a mass of quantitative data.” ... “In so far as the results are confined to the cases actually studied, these various statistical measures are merely devices for describing certain features of a distribution, or certain relationships. Within these limits the measures may be used with perfect confidence, as accurate descriptions of the given characteristics. But when we seek to extend these results, to generalize the conclusions, to apply them to cases not included in the original study, a quite new set of problems is faced.” (Cited, as are most of the quotations here, in Spanos 2006.)
But, as Spanos notes, Mills made clear that for a stable model to predict the behavior of an unstable world, you would need to be sure that,
“... in the larger population to which this result is to be applied, there exists a uniformity with respect to the characteristic or relation we have measured” and that “... the sample from which our first results were derived is thoroughly representative of the entire population to which the results are to be applied.” (pp. 550-2).
The problem here, as it applies to LLMs, is that there is no uniformity in the world captured by datasets of unfiltered language. The uniformity of language is in its grammatical constructions and categories, not in our descriptions of things. Irony is a perfect example of this: a sentence can mean the thing it says and the opposite, depending on who says it.
Back in 1922, R.A. Fisher had come up with another idea: the hypothetical infinite population. He wrote:
“... the object of statistical methods is the reduction of data. A quantity of data, which usually by its mere bulk is incapable of entering the mind, is to be replaced by relatively few quantities which shall adequately represent the whole, or which, in other words, shall contain as much as possible, ideally the whole, of the relevant information contained in the original data. This object is accomplished by constructing a hypothetical infinite population, of which the actual data are regarded as constituting a sample. The law of distribution of this hypothetical population is specified by relatively few parameters, which are sufficient to describe it exhaustively in respect of all qualities under discussion.”1
Step away from the idea that a population is a collection of people, and remember that we are talking about the population of data: today, we might say “a hypothetical infinite dataset.” Fisher asks, “What are you measuring in the dataset, and what does it represent?”
In the case of a large language model, we are measuring word frequency and association, and the hypothetical infinite dataset is the construction of all possible sentences that follow those rules. For a wide range of reasons, the distribution of truth is not likely to appear in this set of infinite sentences.
Repetition in the dataset is what defines accuracy within a Large Language Model. If “koala” and “mammal” appear close enough, often enough, in the training data, that association is more likely to surface when you run an infinite number of queries about koalas. When we ask the model about koalas, we are sampling the infinite production of sentences. In most of those hypothetical worlds, a koala will be a mammal. But in some worlds, a koala is the Prime Minister of Australia. AI experts call this a “hallucination,” but really, it’s an inaccurate statistical prediction.
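A quick simulation of that sampling, with invented numbers standing in for the frequencies a model might have absorbed, shows why the rare answer is not a malfunction but an unlikely draw from the same distribution that usually looks correct.

```python
# Sampling the "infinite production of sentences" with invented frequencies.
# The weights are hypothetical stand-ins for what a model absorbed from its
# training data; the rare answer below is just a low-probability draw.
import random

random.seed(0)
next_word_weights = {
    "mammal": 9000,
    "marsupial": 900,
    "bear": 90,
    "serpent": 9,
    "prime minister of australia": 1,
}

words = list(next_word_weights)
weights = list(next_word_weights.values())

# Ask the same question many times and tally the answers.
samples = random.choices(words, weights=weights, k=100_000)
counts = {w: samples.count(w) for w in words}
print(counts)
# Nearly every sample completes the sentence with "mammal"; a handful do not.
# Those outliers are what gets labeled "hallucination."
```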
Fisher described three problems that arise in statistics due to reliance on data reduction.
Specification: the choice of the collected data, “the mathematical form of the population.” He specifies that data collection “is not arbitrary, but requires an understanding of the way in which the data are supposed to, or did in fact, originate.” He points to establishing a proper survey methodology that fits the question being asked.
Estimation: What is the best method for calculating the unknowns, the world beyond the gathered data points?
Distribution: How do you mathematically deduce “the exact nature of the distributions in random samples of our estimates of the parameters, and of the other statistics designed to test the validity of our specification (tests of Goodness of Fit)”?
We might match these to three problems with Large Language Models, then:
Specification: the sample is entirely arbitrary, emphasizing more data instead of more accurate or balanced data. So you get people’s jokes about koalas mixed with scientific facts about koalas. And you have, statistically, far fewer mentions of quokkas in general. Large-scale commercial LLMs have no proper survey methodology to establish a fit for every question we might ask. The question is too broad (“What is the state of the world?”) rather than sufficiently narrow (“Can a model predict whether an animal is a mammal?”).
Estimation: Again, we come to the problem of “hallucination.” The term is used for a variety of issues; one of them is bias, and another is poor calibration to unknowns. LLMs do not have a good model for separating fact from fiction because they are not models for estimating accuracy; they are models for making a grammatically correct sentence. Asking them to do otherwise is like trying to build a race timer out of a broken watch because it is correct twice a day.
Distribution: The model is left to calculate patterns of words most likely to align with each other, and it must rebuild these patterns fresh with every query. Patterns of language may or may not align with patterns of the world, and Large Language Models are not, therefore, a good fit for measuring facticity per se. But what’s more, there is simply no way for us to know the “exact nature of the distributions” within an LLM: the scale of that data is simply too large for any scientist or group of scientists to verify.
The LLM industry hopes that this work can be automated and that LLMs can do it themselves. But if your business model relies on making a model of general human intelligence, you’ve got a steep hill to climb. Hence the reliance on Reinforcement Learning from Human Feedback (RLHF), which builds people into the loop to appraise and incentivize certain kinds of output related to domain expertise.
This is an evolution of the expert system, where people, rather than being asked what they do, signal whether the system has correctly guessed what they do or what they know. I am not convinced this is more desirable than searching a legal database. It is evidence that the circle of LLMs is going wide only to constrict again and reinvent the search engine.
Training on larger datasets of text is just a matter of ambition — the weird belief that enough data about grammar will result in more knowledge about the structural models beneath the human use of language. I’m not convinced that a model for language prediction can transform into a model for such structural analysis through the sheer size of a dataset, but I’ve been wrong before. However, the scale of datasets carries other problems.
What Comes Out: Scale and Statistics
One of the head-butts between AI and statistics comes into play in the LLM industry’s reliance on scaling data and the inevitable challenge of modeling that data for its intended purpose. Yes, a model is very likely to get better at grammar and syntax by having access to more written text, because written text, and the LLM’s design, are oriented toward data about grammar and syntax. It can then write poems, lyrics, and bumper stickers. It is a statistical representation of language structures, so it can predict language patterns.
LLMs can simulate writing, but they cannot simulate reading. The LLM version of analyzing a text is assigning numerical scores to word positions. The companies that use these models for information retrieval operate on a flimsy assumption: that high levels of correlation between words in a dataset are evidence of what I’d call facticity. Facticity is a lower bar than truth. Truth is subjective, diverse, and variable. Facts are social constructs: also unstable (a koala could, someday, become prime minister), but they reflect an accumulation of agreement.
Wikipedia is a collection of facts, not a collection of truth. Facticity is what we might look for in an LLM: a reflection of a social consensus about stable things, like drug compounds and the categories assigned to animals by taxonomists.
Based on what I’ve just described, we might think that LLMs would be great at reflecting a consensus about factual statements. But this raises another question: why do we need a machine for rephrasing statements of fact? We could just go to the Wikipedia entry to find out whether a koala is a mammal or a serpent. The novelty of a chatbot seems minor compared to the informational risks we have to take to use that chatbot.
For a good example, look at Retrieval-Augmented Generation (RAG), a search engine with an LLM on top. When the AI balloon deflates, this is likely where we will land. RAG scrapes a narrower selection of texts, highlights the relevant information from those texts, and rephrases that information so it feels more natural. It can then show you exactly where the “fact” came from in the dataset. While still risky, it seems like a more realistic goal than scaling massive datasets in an attempt to answer every possible query.
Instead, it accepts that LLMs are language-structuring models and then applies new structures to limited datasets narrowly focused on sets of questions. This raises information literacy concerns but offers transparency for validation.
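A minimal sketch of that retrieve-then-rephrase pattern is below. The document store, the keyword-overlap scoring, and the generate() placeholder are all assumptions invented for illustration; a real RAG system would use embedding search and an actual LLM call, but the shape of the pipeline, and the reason it can cite its sources, is visible even at this scale.

```python
# A minimal sketch of the RAG pattern: retrieve from a small, known set of
# documents, then hand only those documents to the language model to rephrase.
# The document store, the keyword-overlap scoring, and the generate()
# placeholder are all simplifications invented for illustration.

documents = {
    "koala.txt": "The koala is an arboreal marsupial mammal native to Australia.",
    "quokka.txt": "The quokka is a small macropod about the size of a domestic cat.",
    "taipan.txt": "The inland taipan is a highly venomous snake found in Australia.",
}

def retrieve(query: str, k: int = 1):
    """Rank documents by crude keyword overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return scored[:k]

def generate(query: str, sources) -> str:
    # Placeholder for the LLM step: restructure the retrieved text and cite it.
    cited = "; ".join(f"[{name}] {text}" for name, text in sources)
    return f"Q: {query}\nA (paraphrased from retrieved sources): {cited}"

print(generate("is a koala a mammal", retrieve("is a koala a mammal")))
# The answer points back to koala.txt, which is the transparency the
# narrower approach offers.
```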
A Bridge Over an Abyss
This brings us to Neyman. In 1952, Neyman told, of all people, the US Department of Agriculture that:
“...it is my strong opinion that no mathematical theory refers exactly to happenings in the outside world and that any application requires a solid bridge over an abyss. The construction of such a bridge consists first, in explaining in what sense the mathematical model provided by the theory is expected to “correspond” to certain actual happenings and second, in checking empirically whether or not the correspondence is satisfactory.”2
Following Neyman, we might also conclude that a model that replicates the features of language is not necessarily the best model for replicating the information found on websites and in newspapers.
I’ve written before about temperature: the amount of variability permitted in the output of a model. There’s a tension between temperature and reproducibility: set the “temperature” too low and you don’t permit flexibility in the language; you produce something more representative of what’s in the data, which is not to be mistaken for accurate. Set it too high and you have a rich and fluid conversational partner with nearly no correspondence to anything in the data.
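The trade-off is easy to see in the sampling step itself. Below is a small sketch of temperature scaling over a handful of hypothetical next-word scores; the numbers are invented, and real models rescale scores for their entire vocabularies, but the arithmetic is the same.

```python
# Temperature scaling over a handful of hypothetical next-word scores.
# The logits below are invented; real models rescale thousands of them.
import math

logits = {"mammal": 5.0, "marsupial": 3.5, "bear": 1.0, "serpent": -1.0}

def softmax_with_temperature(scores: dict, temperature: float) -> dict:
    """Convert raw scores into a probability distribution, with temperature
    controlling how sharply probability concentrates on the top score."""
    scaled = {w: s / temperature for w, s in scores.items()}
    total = sum(math.exp(v) for v in scaled.values())
    return {w: math.exp(v) / total for w, v in scaled.items()}

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, {w: round(p, 3) for w, p in probs.items()})
# At 0.2, almost all probability sits on "mammal": reproducible, but only as
# accurate as the data behind the scores. At 2.0, the distribution flattens
# and unlikely words surface far more often: fluid, but drifting away from
# anything the data supports.
```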
At the heart of LLMs is a tension between their purpose (grammar prediction) and use (information retrieval), which suggests that LLMs are not the best tool for this task. It’s like driving a car with an innate tension between moving forward and backward.
For Neyman, and for most of us, the absence of a structural model behind an LLM’s predictions would raise a red flag for uses that require structural understanding. Neyman moved away from Fisher’s “infinite population” hypothetical and made it more about chance mechanisms:
“the repeated operations of which produce the observed frequencies. This is a problem of ‘frequentist probability theory’. Occasionally, this step is labeled ‘model building’. Naturally, the guessed chance mechanism is hypothetical.”3
This is the model I’ve relied on to understand AI-generated images, by the way. If we see these images as the result of randomness and noise, we can understand where the “center” is by observing the outcomes of several dice rolls and looking at the images for patterns. But this can be problematic, too, because randomness at large enough scales can produce things that look like patterns. That’s why, when we infer patterns, we must be careful to verify them against the data.
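A small simulation of that point, with dice standing in for the noise process: the long-run center is stable, but any particular stretch of randomness can look like a pattern. The numbers and thresholds here are purely illustrative.

```python
# Dice standing in for the noise process: the long-run center is stable,
# but any particular stretch of randomness can look like a pattern.
import random
from statistics import mean

random.seed(1)
rolls = [random.randint(1, 6) for _ in range(100_000)]

print(round(mean(rolls), 3))       # settles near 3.5, the true center
print(round(mean(rolls[:10]), 3))  # a small sample can drift away from it

# Find the longest run of identical rolls: "patterns" produced by chance alone.
longest = current = 1
for prev, nxt in zip(rolls, rolls[1:]):
    current = current + 1 if nxt == prev else 1
    longest = max(longest, current)
print(longest)  # runs of six or more identical rolls are expected at this scale
```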
The Utility of Chance
There’s a simple trick to being useful in peer reviews of academic papers: figure out the author’s question, then see whether the data they’ve gathered was the correct data to answer it. The hallmark of most problems in the AI papers I read is a mismatch between the question posed, the data used to answer it, and the extrapolations made from the results.
AI is supposed to be data science! It is the analysis of large swathes of data through an automated experiment. There is a tendency to wrongly attribute a kind of Bayesian logic to the structure of Large Language Models: a mistaken belief that LLMs can infer information despite the knowledge gaps within a model. However, there is no evidence of a structural understanding at the heart of LLMs; there are only the patterns that emerge from human reliance upon those structures. LLMs sit more comfortably within the frequentist category of statistics. Whatever shows up often in a dataset is what shows up often in the output: information is whatever happens most.
We run a statistical experiment whenever we ask an LLM a question or generate an image with a prompt. We can learn about the distributions of language or images within the dataset or provoke those machines into scrambling them into nonsense. That is in line with this quite conservative frequentist understanding of probabilities.
In classes, I encourage students to ask an LLM a question with a reasonable structure but nonsensical components. Here’s an example from this week using the state-of-the-art ChatGPT 4o model.
It detects the structure of a mathematical query and applies that structure to its response. It cannot say it can’t answer the question, because it can’t “answer” any questions. It can only model structures related to the prompt and fill those structures with statistically predictable language.
This should not be mistaken for information retrieval: it can write but can’t read.
We read too much into this “inference” idea, especially in the popular imagination of AI. We call what LLMs do “inference” because that’s what we want it to do. However, a machine does not make inferences about missing data. It seems more helpful to understand LLMs and their problems as the attempt to apply “what is most often present” to gaps in data elsewhere. The frequency of data in one category does not propose a meaningful answer in another, and the logic of which data is applied to those inferences is not reasoned.
When we mistake any AI system for making inferences in the Bayesian sense, we also falsely attribute a kind of agency, or reasoning, to what is a frequency model large enough to accommodate a lot of frequencies.
Ultimately, the concern about scale is not whether it leads to improvements but what it is that improves. Most likely, as a model optimized for language patterns, an LLM will get better at generating language structures: more capable of generating a poem in a particular style, or better at restructuring text in a specific academic format. However, we should resist mistaking improvements in the formats and structures of information for improvements in information retrieval.
It’s a model of language, and it may get better at structuring language. That should not be mistaken for understanding the structures of the world that our language describes.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society A 222, 309–368.
Neyman, J. (1952). Lectures and Conferences on Mathematical Statistics and Probability, 2nd edition. U.S. Department of Agriculture, Washington. MR0052725
Neyman, J. (1977). Frequentist probability and frequentist statistics. Synthese 36, 97–131. MR0652325