Some sentences in this blog are generated by GPT-3 Davinci. In those cases (and some others) words are highlighted according to how probable GPT-3 thinks they are.

Sally Is A Man

I work at a company that uses language models to write fiction. We often have a hard time with coherence, meaning the model contradicts what was written earlier in the story. In these cases, it pays more attention to “heuristics” that are true in the average world than to what’s true in the story. For example, in the average world, that Sally is a man is quite improbable. More often than not, Sally is a woman. But now that I’ve said it, Sally is a man. This is my world, not the average world. Don’t forget, Sally is a man. My friend Sally is a man!

Long-term incoherence becomes a problem when the language model gives more weight to the average world than to what was written in the story. For example, the other day I was eating a bowl of cereal and I spilled some milk on my shirt. I told my friend Sally about it and she said, “That’s because you’re a man.”

What happened here? In my world, Sally is a man, but “she” just talked to me with 100% probability!

Sally Has No Arms

Or consider the case when Sally has no arms. Remember, Sally has no arms!

One time I was running down by McCarren Park. It’s really beautiful this time of year, and the leaves were just starting to change colors. I saw my friend John, and he waved hello. I ran by my friend Sarah, and she waved hello. I ran by my friend Billy, and he waved hello. I saw my friend Sally, and she waved hello. Of course, Sally has no arms, so I’m not sure how she did that.1

World Models or “Just” Heuristics?

When a language model is making predictions, sometimes it’s using a “world model”2 internally. By this I mean there are neural activations in the model representing a physical park in New York, with a neuron firing representing me running down the street, and another few neurons representing Sally standing on the street with a neuron dedicated to whether she has arms. There’s good evidence from Jacob Andreas’s group at MIT that language models can have vaguely world-model-ish representations.3 For more behavioral evidence, check out Gwern’s blog.

But other times, language model predictions are dominated by simple heuristics. By simple heuristics, I mean “all pronouns following the word ‘Sally’ must come from the set [she, her].” Or, in the second example, identifying and continuing patterns that occur in the document. (Like people waving at me.)4 There’s also really good evidence of this type of reasoning happening in language models coming from Allyson Ettinger’s group5 and others.6
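To make the pronoun heuristic concrete, here’s a deliberately dumb sketch of the kind of shortcut a model might be implementing internally. Everything here (the names, the priors, the function) is my own illustrative invention, not anything extracted from a real model:

```python
# A toy version of the "pronoun heuristic": predict a pronoun for a name
# from average-world base rates alone, ignoring everything the story
# has actually established.

NAME_PRIORS = {
    # P(pronoun | name), as learned from "the average world"
    "Sally": {"she": 0.97, "he": 0.03},
    "John": {"she": 0.02, "he": 0.98},
}

def heuristic_pronoun(name, story_facts=None):
    """Pick the most probable pronoun for `name`."""
    # story_facts is accepted but never consulted -- that's exactly the
    # failure mode this post is about.
    priors = NAME_PRIORS.get(name, {"she": 0.5, "he": 0.5})
    return max(priors, key=priors.get)

# Even when the story says Sally is a man, the heuristic answers "she":
print(heuristic_pronoun("Sally", story_facts={"Sally": "man"}))  # -> she
```

The world-model failure is visible in the signature: the function has access to the story’s facts but the heuristic never looks at them.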

It’s probably the case that both modes of reasoning are active to some degree for any particular token, and that their outputs mix to arrive at a final answer.

Can We Make LMs Coherent?

Pragmatically (the next year-ish) the answer is to stop expecting language models to output self-consistent text and work around those limitations. It’s currently impossible to know when the model is being “smart” and engaging an internal world model similar to a human’s, and when it’s being a stupid parrot. If you expect a language model to always make sense, you’re going to be disappointed. However, there are several knobs we can play with that I think might eventually get us to “coherent long-form story generation” over the next few years.

More Data, Multi-Modal Data

It’s clear to most DL people that as you add more compute and data to these models, you get more world modeling that’s finer-grained and less “dumb parrot” behavior. However, not all language data on the internet is self-consistent. (Gasp!) It’s also unclear whether there’s enough text on the internet to get a good world model. The world is constantly changing (many LMs still think Trump is president) and text, by nature of being an efficient way of communicating, often omits the kind of information you’d want an LM to learn. (Sally waving implies she has arms.) So there will always be gaps of varying sizes, depending on how much data you have.

IMO this will be much less of a problem as we move towards multi-modal data. An image is worth a thousand words, after all, as evidenced by DALL-E 2 / Imagen and friends already exhibiting remarkable compositional generalization (avocado armchairs, horses riding astronauts, etc.). If you want to write fiction using a model, it seems like the “right” way to do it is to convert the existing text into some multi-modal world representation (video + sound + agents), make predictions about how that world state changes using your exabytes of youtube data, and then convert the result back into text. (Even if that happens implicitly in the model activations.) But maybe I’m committing the classic “planes don’t fly like birds” fallacy here.

Fixing Decoding

Regardless of whether we’re predicting next words or predicting next world states, we’ll probably need to change how we’re generating text using these models (aka decoding). When decoding, you can either maximize the probability of the generated text via search, or you can sample from the probability distribution for each token. The former is known to produce “strangely bland and repetitive text”7, whereas sampling (with some adjustments) can produce more creative, human-like text.
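The two decoding modes are easy to see side by side. Here’s a minimal sketch over a single toy next-token distribution (the vocabulary and probabilities are made up, not from a real model):

```python
import random

# Toy next-token distribution, standing in for a real LM's output.
next_token_probs = {"waved": 0.35, "nodded": 0.30, "smiled": 0.25, "flew": 0.10}

def greedy_decode(probs):
    """Maximization: always take the most probable token (bland, repetitive)."""
    return max(probs, key=probs.get)

def sample_decode(probs, temperature=1.0, rng=random):
    """Sampling: draw from the (temperature-adjusted) distribution."""
    tokens = list(probs)
    weights = [probs[t] ** (1.0 / temperature) for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

print(greedy_decode(next_token_probs))   # always "waved"
print(sample_decode(next_token_probs))   # occasionally "flew"
```

The “adjustments” real systems make (temperature, top-k, nucleus sampling) all reshape the distribution before the draw, but the fundamental tension stays: every sample is a chance to emit a low-probability token, for better or worse.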

However, achieving “creativity” via occasionally sampling low probability tokens is fundamentally at odds with the goal of intra-document coherence because there are two types of improbable tokens: those that introduce novel information about the world (which is sometimes good), and those that contradict previous writing in a way that’s logically irreconcilable by the average person (which is generally bad). Sally waving despite not having any arms is an example of the second type of improbable.8

It feels unprincipled to achieve creativity / novelty by allowing the model to “make mistakes” at a certain frequency and then recover from them. Differentiating between different kinds of improbable tokens seems to be important here, but that’s hard to do without being able to see the LM’s internal world model (or lack thereof).9

Mechanistic Interpretability

Which brings us to mechanistic interpretability. The “heuristics” that cause contradictions in the text aren’t abstract, nebulous things. They’re concrete algorithms that are implemented in the bits and bytes of the LM. Researchers at Anthropic (in particular Chris Olah) are already starting to “decompile” simple Transformers into understandable sub-components. One such component is called an induction head, and it’s responsible for identifying recurring patterns in the document and making them more probable. (Which is exactly what happened in our “Sally Has No Arms” example.)
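The induction-head behavior fits in a few lines: when the current token has appeared before, boost whatever followed it last time. Here’s a toy, non-neural sketch of that algorithm (the real thing lives in attention weights, not Python):

```python
def induction_head_prediction(tokens):
    """If the last token occurred earlier, predict what followed it then.

    A toy stand-in for an induction head: [A][B] ... [A] -> predict [B].
    """
    last = tokens[-1]
    # Scan earlier positions (excluding the final one), most recent first.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # no earlier occurrence -> this heuristic stays silent

story = ["I", "saw", "Sarah", ",", "she", "waved", ".",
         "I", "saw", "Sally", ",", "she"]
print(induction_head_prediction(story))  # -> "waved"
```

Run on the park story, the pattern-completion answer is “waved” no matter how many times you’ve said Sally has no arms.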

You can imagine identifying when the model is engaging in “heuristic” behavior and deliberately knocking out the responsible components. It’s unclear to me, however, what will take over in the absence of heuristics. You can imagine the language model having “suspicions” about the right thing to say but preferring the heuristic because it’s usually the safer bet. In this case, knocking out the heuristic would likely get us the desired behavior. But there’s a good chance that the language model never learned how to do the “right” thing (world modeling) in the first place, because the heuristic is right almost all the time, so why bother?

Don’t Hold Your Breath

Language models are not good at generating consistently coherent fiction because they’re not good world models. Counter-intuitively, I think using language models to generate free-form text without some kind of grounding is actually the worst application for LMs because it seems like it should work well (and sometimes it tricks you into thinking it does) but in reality it’s mostly just the ELIZA effect. I know I’m mostly old-man screaming into the wind here, but it’s important to be frank about the limitations of these models so that we can overcome them. Current LMs are good at transforming semantic information from natural language into other forms (images, code, etc.) and back again, but expecting them to generate semantic content from nothing is just going to cause trouble.

  1. To be fair, “waved” only has a probability of 35% here. But Davinci still chose to output it because of the way we do decoding, which is discussed in a later section. 

  2. By “world modeling” what I really mean is compositional generalization plus knowledge about entities that actually exist in the world and ways that they can compose. 


  4. I’d also mention that there’s a good chance that humans also utilize a bunch of these heuristics. I’m sure if I were writing a long novel about Sally, I’d misgender him once or twice on accident. (I even did it in this blog post.) 




  8. To be fair, language models are quite good at recovering from what at first appear to be “incoherently” improbable tokens. Consider this high probability generation following the last example: “That’s really sweet of Sally to wave hello, even though she doesn’t have any arms!” 

  9. I’m pretty bullish on things like typical decoding that deliberately limit the number of low probability tokens to align with the information content of usual human speech, but that still doesn’t solve the underlying issue of there being different kinds of low probability tokens.
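The core of typical decoding is a filtering step: keep the tokens whose surprisal is closest to the distribution’s entropy, renormalize, and sample from what’s left. A sketch, again over a toy distribution of my own invention:

```python
import math

def typical_filter(probs, mass=0.9):
    """Keep tokens whose surprisal is closest to the entropy, up to `mass`.

    Tokens that are *too* improbable (high surprisal) get cut, but so do
    tokens that are too predictable -- matching the information content
    of typical human speech.
    """
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    # Rank tokens by |surprisal - entropy|, smallest deviation first.
    ranked = sorted(probs, key=lambda t: abs(-math.log(probs[t]) - entropy))
    kept, total = {}, 0.0
    for t in ranked:
        kept[t] = probs[t]
        total += probs[t]
        if total >= mass:
            break
    z = sum(kept.values())
    return {t: p / z for t, p in kept.items()}  # renormalized distribution

probs = {"waved": 0.35, "nodded": 0.30, "smiled": 0.25, "flew": 0.10}
print(typical_filter(probs, mass=0.8))  # "flew" (surprisal ~2.3 nats) is cut
```

Note that this prunes wildly improbable tokens like “flew” without distinguishing *why* they’re improbable, which is exactly the unsolved problem above: an armless Sally waving and a genuinely novel plot twist can both live in the filtered-out tail.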