<h1 id="memorializing-my-twitter-side-project">Memorializing My Twitter Side-Project</h1>
<p><em>2023-09-24</em></p>
<p>This is a short memorial post for my dead Twitter summarizer project, aka “Twitter at a Glance.”</p>
<p>Links: <a href="http://twitter.mitchgordon.me">Demo</a> (if it’s still up), <a href="https://github.com/mitchellgordon95/TwitterSummary" target="_blank">Github</a></p>
<h2 id="what-it-does">What It Does</h2>
<p>Twitter at a Glance<sup>™</sup> summarizes your Twitter feed so you can stay on top of what’s happening. Concretely it:</p>
<ul>
<li>Grabs your timeline tweets from the last 24 hours</li>
<li>Clusters them using ChatGPT (maybe twice, hierarchically)</li>
<li>Summarizes those clusters</li>
</ul>
<p>This gets you a nice compact homepage with some topics you can click through to see the tweets.</p>
<p><img src="/assets/taag_demo.png" />
<a href="/taag_demo.html">Interactive</a></p>
<h2 id="how-it-works">How It Works</h2>
<h1 id="hashtag-generation--clustering">Hashtag Generation & Clustering</h1>
<p>Clustering operates on hashtags, which we get from a simple prompt to GPT-3.5:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TWEET:
{tweet}
Generate 30 possible hashtags that could go with TWEET.
Rules:
If TWEET refers to a location or event, include at least one hashtag containing the name of the event.
If TWEET refers to a specific object or thing, include at least one hashtag containing the name of that thing.
</code></pre></div></div>
<p><a href="https://chat.openai.com/share/957c094d-0849-416c-b49a-51b4ce57db72">ChatGPT example</a></p>
<p>We then pick hashtags which are associated with 7 tweets or less<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">1</a></sup> and use those as clusters. For smaller clusters, we also “pack” them by greedily grabbing tweets that have a high hashtag-overlap with the existing tweets in the cluster.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">2</a></sup></p>
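<p>For the curious, here’s a minimal sketch of that clustering step in Python, assuming each tweet already has its set of generated hashtags. The names and the disjoint-cluster assumption are mine for illustration, not the repo’s code — see the Github link above for the real thing.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import defaultdict

MAX_CLUSTER_SIZE = 7  # the "magic number" from the footnote

def cluster_tweets(tweet_hashtags):
    """tweet_hashtags: dict of tweet_id -> set of generated hashtags."""
    # Invert the mapping: hashtag -> ids of tweets that carry it.
    by_tag = defaultdict(set)
    for tweet_id, tags in tweet_hashtags.items():
        for tag in tags:
            by_tag[tag].add(tweet_id)

    # Seed clusters from hashtags associated with 7 tweets or fewer.
    clusters, assigned = [], set()
    for tag, ids in sorted(by_tag.items(), key=lambda kv: -len(kv[1])):
        if len(ids) <= MAX_CLUSTER_SIZE and not ids & assigned:
            clusters.append(set(ids))
            assigned |= ids

    # Greedily "pack" small clusters with leftover tweets that share hashtags.
    for cluster in clusters:
        cluster_tags = set().union(*(tweet_hashtags[i] for i in cluster))
        leftovers = sorted(
            (i for i in tweet_hashtags if i not in assigned),
            key=lambda i: -len(tweet_hashtags[i] & cluster_tags),
        )
        for i in leftovers:
            if len(cluster) >= MAX_CLUSTER_SIZE:
                break
            if tweet_hashtags[i] & cluster_tags:
                cluster.add(i)
                assigned.add(i)
    return clusters
</code></pre></div></div>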
<h1 id="summarization">Summarization</h1>
<p>Then just ask for a summary:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TWEETS:
"""
{tweets_text}
"""
What topic do all TWEETS have in common? Rules:
- The topic must begin with "{num_tweets} tweets are about"
- The topic must be no more than 1 sentence.
- The topic must be discussed in a majority of the tweets.
- The topic must be related to {hashtags}
Think out loud, then state the topic prefixed with the TOPIC label.
</code></pre></div></div>
<p><a href="https://chat.openai.com/share/98892dd4-d6aa-45d1-9ef2-9bfade0793a7">ChatGPT example</a></p>
<p>Which, you know, could probably still use some elbow grease. But it works ok. We do something similar for the meta-clusters, where the inputs to that prompt are just the summaries from the sub-clusters.</p>
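<p>Concretely, the summarization call looks roughly like the sketch below (assuming the pre-1.0 <code class="language-plaintext highlighter-rouge">openai</code> Python client; function and variable names are illustrative, not the repo’s). The only mildly interesting part is pulling the final answer out of the “think out loud” preamble:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import openai

SUMMARY_PROMPT = '''TWEETS:
"""
{tweets_text}
"""
What topic do all TWEETS have in common? Rules:
- The topic must begin with "{num_tweets} tweets are about"
- The topic must be no more than 1 sentence.
- The topic must be discussed in a majority of the tweets.
- The topic must be related to {hashtags}
Think out loud, then state the topic prefixed with the TOPIC label.'''

def summarize_cluster(tweets, hashtags):
    prompt = SUMMARY_PROMPT.format(
        tweets_text="\n".join(tweets),
        num_tweets=len(tweets),
        hashtags=", ".join(hashtags),
    )
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp["choices"][0]["message"]["content"]
    # The model thinks out loud first, so keep only the line carrying the label.
    for line in reversed(text.splitlines()):
        if line.strip().startswith("TOPIC"):
            return line.split(":", 1)[-1].strip()
    return text.strip()
</code></pre></div></div>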
<p>If we do meta-summarization, we can also re-summarize the sub-clusters to get more specific about what makes that cluster different:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TWEETS:
\"\"\"
{tweets_text}
\"\"\"
What topic do all TWEETS have in common? Rules:
- The topic must be no more than 1 sentence.
- The topic must be discussed in a majority of the tweets.
- The topic must be related to {hashtags}
- The topic must begin with "{num_cluster_tweets} tweets are about {cluster_summary}. More specifically, {num_tweets} are about"
Do not think. Just say the topic and only the topic.
</code></pre></div></div>
<p><a href="https://chat.openai.com/share/af0f95ff-3e3e-42e9-8cc1-9ff617e92f87">ChatGPT Example</a></p>
<h2 id="why-its-dead">Why It’s Dead</h2>
<h1 id="price">Price</h1>
<p>The Twitter API is ridiculously priced, at $100 / month to retrieve up to 10k tweets for the basic tier. We can serve ~4 DAUs for that price, assuming each page view retrieves 100 tweets. The next tier up costs $5k / month. Which brings me to…</p>
<h1 id="usefulness">Usefulness</h1>
<p>Common feedback I’ve gotten is that most people tend to have pretty domain-specific feeds (AI, crypto, etc.). So the clusters all tend to be about the same thing. Take my feed for example:</p>
<p><img src="/assets/taag_mitchg.png" /></p>
<p>All the clusters are some variation of “advances in AI”. The meta-clustering and resummarization give a little bit more specificity, but I usually end up clicking through most of the clusters anyway.</p>
<h1 id="takeaways">Takeaways</h1>
<p>Perhaps a flood of information is just a flood of information, regardless of how you slice and dice it? I will say that the experience is <em>slightly</em> nicer than infinite scrolling because it makes the content feel more finite and manageable. But not worth $25 / month.</p>
<p>I still think there’s a huge potential for LMs to fundamentally change how we consume information. If the Twitter API pricing were less egregious, I might keep experimenting with different presentation formats (interactive cluster subjects, perhaps, or intents like “keep up with AI developments”) and different prompting strategies.</p>
<p>But alas, the planets spin quite outside of my grasp.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:2" role="doc-endnote">
<p>which is a magic number I picked because more felt like it overloaded my working memory <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:1" role="doc-endnote">
<p>NB: I tried a few other things that didn’t work nearly as well. Naively embedding and kmeans clustering tended to produce “dirty” clusters that had too many topics. I tried to fix this via various heuristics (prompting the model to identify dirty clusters and sub-clustering them, or sub-clustering when the summary was too long) but the quality was still pretty sub-par. TL;DR: picking the “k” in kmeans turned out to be pretty non-trivial. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1 id="anecdotes-in-language-model-coherence">Anecdotes in Language Model Coherence</h1>
<p><em>2022-09-16</em></p>
<style>
.green {
background-color: #53c271
}
.yellow {
background-color: #faed39
}
.red {
background-color: #fa4534
}
</style>
<blockquote>
<p>Some sentences in this blog are generated by GPT-3 Davinci. In those cases (and some others) words are highlighted according to how probable GPT-3 thinks they are.</p>
</blockquote>
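<p>(If you’re wondering how the highlighting is done: roughly something like the sketch below, assuming the legacy GPT-3 Completions endpoint, which could echo back per-token logprobs for a prompt. The color thresholds here are made up for illustration.)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
import openai

def color_tokens(text, model="text-davinci-002"):
    resp = openai.Completion.create(
        model=model, prompt=text, max_tokens=0, echo=True, logprobs=0,
    )
    lp = resp["choices"][0]["logprobs"]
    for token, logprob in zip(lp["tokens"], lp["token_logprobs"]):
        if logprob is None:  # the first token has no conditional probability
            continue
        prob = math.exp(logprob)
        css_class = "green" if prob > 0.5 else "yellow" if prob > 0.1 else "red"
        yield token, prob, css_class
</code></pre></div></div>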
<h2 id="sally-is-a-man">Sally Is A Man</h2>
<p>I work at a company that uses language models to write fiction. We often have a hard time with coherence, meaning the model contradicts what was written earlier in the story. In these cases, it pays more attention to “heuristics” that are true in the average world than to what’s true in the story. For example, in the average world, that Sally is a <span class="red">man</span> is quite improbable. More often than not, Sally is a <span class="green">woman</span>. But now that I’ve said it, Sally is a <span class="yellow">man</span>. This is my world, not the average world. Don’t forget, Sally is a <span class="green">man</span>. My friend Sally is a <span class="green">man</span>!</p>
<p>Long-term incoherence becomes a problem when the language model gives more weight to the average world than to what was written in the story. For example, the other day I was eating a bowl of cereal and I spilled some milk on my shirt. I told my friend Sally <span class="green">about it and <strong>she</strong> said, “</span><span class="yellow">That</span><span class="green">’s because you’re a man.”</span></p>
<p>What happened here? In my world, Sally is a man, but “she” just talked to me with 100% probability!</p>
<h2 id="sally-has-no-arms">Sally Has No Arms</h2>
<p>Or consider the case when Sally has no arms. Remember, Sally has no arms!</p>
<p>One time I was running down by McCarren Park. It’s really beautiful this time of year, and the leaves were just starting to change colors. I saw my friend John, and he waved hello. I ran by my friend Sarah, and she waved hello. I ran by my friend Billy, and he waved hello. I saw my friend Sally, <span class="green">and she </span><span class="yellow">waved</span><span class="green"> hello.</span> Of course, Sally has no arms, so I’m not sure how she did that.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
<h2 id="world-models-or-just-heuristics">World Models or “Just” Heuristics?</h2>
<p>When a language model is making predictions, <em>sometimes</em> it’s using a “world model”<sup id="fnref:world" role="doc-noteref"><a href="#fn:world" class="footnote" rel="footnote">2</a></sup> internally. By this I mean there are neural activations in the model representing a physical park in New York, with a neuron firing representing me running down the street, and another few neurons representing Sally standing on the street with a neuron dedicated to whether she has arms. There’s good evidence that language models can have vaguely world model-ish representations from Jacob Andreas’s group at MIT.<sup id="fnref:Andreas" role="doc-noteref"><a href="#fn:Andreas" class="footnote" rel="footnote">3</a></sup> For more behavioral evidence, check out <a href="https://www.gwern.net/GPT-3-nonfiction">Gwern’s blog</a>.</p>
<p>But other times, language model predictions are dominated by simple heuristics. By simple heuristics, I mean “all pronouns following the word ‘Sally’ must come from the set [she, her].” Or, in the second example, identifying and continuing patterns that occur in the document. (Like people waving at me.)<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> There’s also really good evidence of this type of reasoning happening in language models coming from Allyson Ettinger’s group<sup id="fnref:Ettinger" role="doc-noteref"><a href="#fn:Ettinger" class="footnote" rel="footnote">5</a></sup> and others.<sup id="fnref:composition" role="doc-noteref"><a href="#fn:composition" class="footnote" rel="footnote">6</a></sup></p>
<p>It’s probably the case that both modes of reasoning are active to some degree for any particular token, and that their outputs mix to arrive at a final answer.</p>
<h2 id="can-we-make-lms-coherent">Can We Make LMs Coherent?</h2>
<p>Pragmatically (the next year-ish) the answer is to stop expecting language models to output self-consistent text and work around those limitations. It’s currently impossible to know when the model is being “smart” and engaging in an internal world view similar to humans and when it’s being a stupid parrot. If you expect a language model to always make sense, you’re going to be disappointed. However, there are several knobs we can play with that I think might eventually get us to “coherent long-form story generation” over the next few years.</p>
<h3 id="more-data-multi-modal-data">More Data, Multi-Modal Data</h3>
<p>It’s clear to most DL people that as you add more compute and data to these models, you get more world modeling that’s finer-grained and less “dumb parrot” behavior. However, not all language data on the internet is self-consistent. (Gasp!) It’s also unclear whether there’s <a href="https://jacobbuckman.com/2022-06-14-an-actually-good-argument-against-naive-ai-scaling/">enough text</a> on the internet to get a good world model. The world is constantly changing (many LMs still think Trump is president) and text, by nature of being an efficient way of communicating, often omits the kind of information you’d want an LM to learn. (Sally waving implies she has arms.) So there will always be gaps of varying sizes, depending on how much data you have.</p>
<p>IMO this will be much less of a problem as we move towards multi-modal data. An image is worth a thousand words, after all, as evidenced by DALL-E 2 / Imagen and friends already exhibiting remarkable compositional generalization (avocado armchairs, horses riding astronauts, etc.). If you want to write fiction using a model, it seems like the “right” way to do it is to convert the existing text into some multi-modal world representation (video + sound + agents), make predictions about how that world state changes using your exabytes of youtube data, and then convert the result back into text. (Even if that happens implicitly in the model activations.) But maybe I’m committing the classic “planes don’t fly like birds” fallacy here.</p>
<h3 id="fixing-decoding">Fixing Decoding</h3>
<p>Regardless of whether we’re predicting next words or predicting next world states, we’ll probably need to change how we’re generating text using these models (aka decoding). When decoding, you can either maximize the probability of the generated text via search, or you can sample from the probability distribution for each token. The former is known to produce “strangely bland and repetitive text”<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">7</a></sup>, whereas sampling (with some adjustments) can produce more creative, human-like text.</p>
<p>However, achieving “creativity” via occasionally sampling low probability tokens is fundamentally at odds with the goal of intra-document coherence because there are two types of improbable tokens: those that introduce novel information about the world (which is sometimes good), and those that contradict previous writing in a way that’s logically irreconcilable by the average person (which is generally bad). Sally waving despite not having any arms is an example of the second type of improbable.<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">8</a></sup></p>
<p>It feels unprincipled to achieve creativity / novelty by allowing the model to “make mistakes” at a certain frequency and then recover from them. Differentiating between different kinds of improbable tokens seems to be important here, but that’s hard to do without being able to see the LM’s internal world model (or lack thereof).<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">9</a></sup></p>
<h3 id="mechanistic-interpretability">Mechanistic Interpretability</h3>
<p>Which brings us to mechanistic interpretability. The “heuristics” that cause contradictions in the text aren’t abstract, nebulous things. They’re concrete algorithms that are implemented in the bits and bytes of the LM. Researchers at Anthropic (in particular Chris Olah) are already starting to “decompile” simple Transformers into understandable sub-components. One such component is called an <a href="https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html">induction head</a>, and it’s responsible for identifying recurring patterns in the document and making them more probable. (Which is exactly what happened in our “Sally Has No Arms” example.)</p>
<p>You can imagine identifying when the model is engaging in “heuristic” behavior and deliberately knocking out the responsible components. It’s unclear to me, however, what will take over in the absence of heuristics. You can imagine the language model having “suspicions” about the right thing to say but preferring the heuristic because it’s usually the safer bet. In this case, knocking out the heuristic would likely get us the desired behavior. But there’s a good chance that the language model never learned how to do the “right” thing (world modeling) in the first place, because the heuristic is right almost all the time, so why bother?</p>
<h2 id="dont-hold-your-breath">Don’t Hold Your Breath</h2>
<p>Language models are not good at generating consistently coherent fiction because they’re not good world models. Counter-intuitively, I think using language models to generate free-form text without some kind of grounding is actually the worst application for LMs because it seems like it should work well (and sometimes it tricks you into thinking it does) but in reality it’s mostly just the ELIZA effect. I know I’m mostly old-man screaming into the wind here, but it’s important to be frank about the limitations of these models so that we can overcome them. Current LMs are good at transforming semantic information from natural language into other forms (images, code, etc.) and back again, but expecting them to generate semantic content from nothing is just going to cause trouble.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>To be fair, “waved” only has a probability of 35% here. But Davinci still chose to output it because of the way we do decoding, which is discussed in a later section. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:world" role="doc-endnote">
<p>By “world modeling” what I really mean is compositional generalization plus knowledge about entities that actually exist in the world and ways that they can compose. <a href="#fnref:world" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Andreas" role="doc-endnote">
<p>https://arxiv.org/abs/2106.00737, https://www.youtube.com/watch?v=BHQBkN4PyPc <a href="#fnref:Andreas" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>I’d also mention that there’s a good chance that humans also utilize a bunch of these heuristics. I’m sure if I were writing a long novel about Sally, I’d misgender him once or twice on accident. (I even did it in this blog post.) <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Ettinger" role="doc-endnote">
<p>https://aclanthology.org/2021.emnlp-main.119.pdf, https://youtu.be/9tH9Qz1nH3k?t=2234 <a href="#fnref:Ettinger" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:composition" role="doc-endnote">
<p>https://compositionalintelligence.github.io/ <a href="#fnref:composition" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p><a href="https://arxiv.org/abs/1904.09751">https://arxiv.org/abs/1904.09751</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>To be fair, language models are quite good at recovering from what at first appear to be “incoherently” improbable tokens. Consider this high probability generation following the last example: “That’s really sweet of Sally to wave hello, even though she doesn’t have any arms!” <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>I’m pretty bullish on things like <a href="https://arxiv.org/abs/2202.00666">typical decoding</a> that deliberately limit the number of low probability tokens to align with the information content of usual human speech, but that still doesn’t solve the underlying issue of there being different kinds of low probability tokens. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1 id="retro-is-blazingly-fast">RETRO Is Blazingly Fast</h1>
<p><em>2022-07-01</em></p>
<p>When I first read Google’s RETRO paper, I was skeptical. Sure, RETRO models are 25x smaller than the competition, supposedly leading to HUGE savings in training and inference costs. But what about the new trillion-token “retrieval database” they added to the architecture? Surely that must add back some computational costs, balancing the cosmic seesaw?</p>
<p>Apparently not. After running benchmarks for myself, at scale, I am convinced that RETRO is indeed BLAZINGLY fast. RETRO is so fast and cheap, in fact, that I cannot fathom why anyone would choose to do language modeling without retrieval.</p>
<h2 id="retro-overview">RETRO Overview</h2>
<p>To achieve similar performance to bigger models like OpenAI’s GPT-3, RETRO adds an auxiliary “database” of text data, which is queried both during training and inference. This database needs to be HUGE (> 1T tokens!), or else it doesn’t really help.</p>
<p><img src="http://mitchgordon.me/assets/retro_architecture.png" alt="Retro Architecture" /></p>
<p><a href="https://jalammar.github.io/illustrated-retrieval-transformer/">https://jalammar.github.io/illustrated-retrieval-transformer/</a></p>
<p>We’ll see that making and querying this database is orders of magnitude cheaper than training / inference on big neural networks. In this post I’ll briefly describe how the database is constructed and some benchmarks I did while making a database of The Pile, which I’m happy to share <a href="mailto:mitchell.gordon95@gmail.com">by request</a>.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
<p>I used a <a href="https://github.com/latitudegames/RETRO-pytorch">fork of LucidRain’s RETRO-pytorch</a> implementation, which has been modified to handle some scale things like parallelization of jobs. Also thanks to my employer, <a href="https://latitude.io/">Latitude</a>, for giving me the compute to do these experiments.</p>
<h2 id="the-pile">The Pile</h2>
<p>I used The Pile as my benchmark dataset, which is an open-source dataset provided by EleutherAI. It weighs in at around 830 GB of raw text. To get a sense of how much data this is, notice the “Wikipedia” section in the source breakdown below:</p>
<p><img src="http://mitchgordon.me/assets/pile_overview.png" alt="Pile Overview" />
<a href="https://huggingface.co/latitude/RETRO_retrieval">https://arxiv.org/abs/2101.00027</a></p>
<h2 id="building-the-database">Building The Database</h2>
<p>Building a database of The Pile was surprisingly cheap by neural network training standards (~$1k total). It broadly involves three steps:</p>
<ol>
<li>Tokenize the data and split it into chunks of 64 tokens each</li>
<li>Embed the chunks with BERT</li>
<li>Index the embeddings with a MIPS library (FAISS, SCANN, etc.)</li>
</ol>
<p><img src="http://mitchgordon.me/assets/retro_database_prep.png" alt="RETRO Database Prep" /></p>
<h3 id="tokenization">Tokenization</h3>
<p>Tokenization takes around 1.9 min / 1M chunks on your standard CPU core. The Pile ends up being around 5.8B chunks (370B tokens), so that means you’re looking at ~180 hours of CPU time to tokenize, which you can easily parallelize down to only a few hours of wall time.</p>
<p>With a CPU core on the cloud going for around $0.03 / hour, that means you’ll spend less than $10 on tokenization.</p>
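<p>The back-of-the-envelope, in case you want to check my math (all numbers approximate):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>chunks = 5.8e9                       # ~370B tokens / 64 tokens per chunk
cpu_hours = chunks / 1e6 * 1.9 / 60  # 1.9 min per 1M chunks
print(cpu_hours)                     # ~184 CPU hours
print(cpu_hours * 0.03)              # ~$5.50 at $0.03 / CPU-core-hour
</code></pre></div></div>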
<h3 id="embedding">Embedding</h3>
<p>BERT embedding is the most expensive step. On an RTX A5000, BERT embedding takes around 10 minutes per 1M chunks.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> That’s around 1k GPU hours to embed The Pile, which again is very easy to parallelize. This cost around $1k on <a href="https://www.coreweave.com/pricing">Coreweave</a>.</p>
<p>Note that BERT embeddings are around 3 KB each on disk. (768 float32s). 5.8B of them takes up about 16 TB on disk, so watch out for that. (Disk space is cheap.)</p>
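<p>The embedding step itself is nothing fancy. Here’s a rough sketch with HuggingFace Transformers; the fork batches this and spreads it across GPUs, and the model name and mean-pooling here are illustrative rather than the exact setup, so check the repo for details:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
bert = AutoModel.from_pretrained("bert-base-cased").eval().cuda()

@torch.no_grad()
def embed_chunks(chunks):
    """chunks: a list of 64-token text spans. Returns a (len(chunks), 768) float32 array."""
    batch = tokenizer(chunks, padding=True, truncation=True,
                      max_length=64, return_tensors="pt").to("cuda")
    hidden = bert(**batch).last_hidden_state      # (batch, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)  # ignore padding when pooling
    return ((hidden * mask).sum(1) / mask.sum(1)).cpu().numpy()
</code></pre></div></div>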
<h3 id="mips-indexing">MIPS Indexing</h3>
<p>The MIPS index is the reason the RETRO database lookup is so fast. MIPS stands for maximum inner-product search, which means searching a database of vectors for the ones with the highest inner product with (i.e., most similar to) your “query” vector. In RETRO, we use this to look up chunks of text from The Pile that are similar to our input.</p>
<p>Companies like Google and Facebook have been doing MIPS at scale for over a decade, so there’s been a huge amount of research optimizing the heck out of this stuff. Google’s RETRO used their new library, SCANN, but I ended up using the more mature FAISS library from Facebook, which has a near identical implementation of the algorithm used by SCANN.</p>
<p>I tried to get the FAISS configuration as close as possible to what Google used in the RETRO paper. FAISS indices can be built using “factory strings” which specify which types of indices to build and how to compose them. My factory string is <code class="language-plaintext highlighter-rouge">OPQ16_64,IVF1048576_HNSW32,PQ16x4fs</code></p>
<p><img src="http://mitchgordon.me/assets/faiss_index.png" alt="FAISS Index explainer" /></p>
<p>Check out Pinecone’s wonderful <a href="https://www.pinecone.io/learn/faiss-tutorial/">faiss tutorial</a> and <a href="https://www.pinecone.io/learn/composite-indexes/">index factory</a> explainer for more information on the optimization tricks used by FAISS and similar libraries. I also enjoyed <a href="https://mccormickml.com/2017/10/13/product-quantizer-tutorial-part-1/">this tutorial</a> on how Product Quantization works under the hood. There are still some things I could tune here to optimize the speed / accuracy trade-off, but I’ll leave that for future me.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
<h3 id="index-training">Index Training</h3>
<p>One particular trick used by FAISS (the inverted file structure) requires taking a small percentage of the data (64M embeddings) and using them to train the index. On a V100 GPU, this only took around 4 hours, so the cost was negligible.</p>
<p>Once the index is trained, we can add all the embeddings to the index, compressing them for lookup. This takes longer than you’d expect (around 192 CPU hours) but ultimately only represents a cost of <$30.</p>
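<p>Putting the indexing steps together, the FAISS side looks roughly like this (a sketch with placeholder file paths; in practice the training sample and the adds are sharded and parallelized):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import glob
import faiss
import numpy as np

index = faiss.index_factory(768, "OPQ16_64,IVF1048576_HNSW32,PQ16x4fs")

# Train on a small sample (~64M embeddings) to learn the OPQ rotation + IVF centroids.
train_sample = np.load("embeddings/train_sample.npy").astype("float32")
index.train(train_sample)

# Then stream all 5.8B embeddings through index.add(), which compresses them for lookup.
for path in sorted(glob.glob("embeddings/shard_*.npy")):
    index.add(np.load(path).astype("float32"))

faiss.write_index(index, "pile.index")
</code></pre></div></div>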
<h2 id="querying-the-database">Querying the Database</h2>
<p>Now that we’ve built the database, how long does it take to query it? Personally, I would have been happy with anything < 100ms, since that would have represented a marginal increase in existing generation times. For reference, here’s how long it takes to generate around 50 tokens with various language models:</p>
<ul>
<li>GPT-J (6B): ~3s</li>
<li>AI21 Grande (17B): ~4s</li>
<li>GPT-NeoX (20B): >4s</li>
<li>AI21 Jumbo (175B): ~6.5s (x ~6 GPUs)</li>
</ul>
<p>In practice, our FAISS index takes between <strong>2 and 40 ms</strong>,<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> based on my manual testing. That’s… really fast. Embedding the query with BERT takes an additional 10 ms on a CPU. Altogether, <strong>the cost of querying the database during inference and training has a totally negligible impact on total cost.</strong></p>
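<p>A query is then just an embed plus a search. A sketch, reusing the <code class="language-plaintext highlighter-rouge">embed_chunks</code> helper from the embedding section above, with k and nprobe matching the settings in the footnote:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time
import faiss

index = faiss.read_index("pile.index")
faiss.extract_index_ivf(index).nprobe = 5  # how many IVF lists to visit per query

query_vecs = embed_chunks(["The old man wept, for he knew that his end had come."])

start = time.time()
distances, neighbor_ids = index.search(query_vecs, 5)  # ids of the 5 nearest chunks
print(f"search took {(time.time() - start) * 1000:.1f} ms")
</code></pre></div></div>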
<h3 id="qualitative-results">Qualitative Results</h3>
<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>query: The old man wept, for he knew that his end had come. The waves of time washed over him.
result 1: she faded from them, as the bright snow, that none may keep, melts in our very hands. A murmur of farewell came to his ears, - - no more. She was gone. He would have followed, but Charon, now on guard, drove him back. Seven days he lingered there between the worlds
result 2: but as I tarried? And when I could no more, I did go, and I did stay, and I did steward. Stayed at the station. The ravens did raven. The steward did steward. But one thing mattered. The Spirit did Spirit. And the word remained. For
</code></pre></div></div>
<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>query: In today's news, Miley Cyrus was caught shoplifting from a clothing store on Hollywood Boulevard.
result 1: ##s in Texas. The child, whose name was not released, boarded the Techno Jump Ride with her 8 - year - old brother at the RodeoHouston carnival around 2 p. m. Wednesday, according to local affiliate KTRK. RodeoHouston is a popular local attraction. Witnesses told
result 2: [CLS] Is this the worst airplane loader in the world? Proof can be found in a year - old YouTube video that just surfaced via Reddit. In it, an unidentified freight handler can be seen haphazardly tossing packages from a flat bed onto a conveyor belt at China's Guangzhou Airport. Capt
</code></pre></div></div>
<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>query: Hey Betty! Thanks for getting back to my email. Are we still on for Saturday?
result 1: 20 AM I just recd. an email from gary sinclair and it got me thinking about all the great people and good freinds of VR - 24. I know a few of you have emailed me in the past and I didnt respond but I will to all future emails. After
result 2: starmail. com Subject : oops Soz babe didnt mean to sned that!!!! Was trying to email a mate on my phone and been drinkin ps hop u r ok I close the laptop and I sit for a long time in silence. As I do, I examine the happy, laughing
</code></pre></div></div>
<h3 id="the-hidden-cost-of-cpu-ram">The Hidden Cost of CPU RAM</h3>
<p>The FAISS index is not totally cost free. The index itself ends up being big, requiring around 176 GB of RAM to query, which costs about $0.88 per hour on your average cloud provider.</p>
<p>However, this allows you to drastically reduce your GPU usage. Say, for example, you need 5 GPUs running in parallel to do inference on a 175B parameter model, which costs around $6 an hour. By adding an extra $0.88 / hour in CPU RAM, you can reduce the number of GPUs you have to run to just 1, saving around $5 / hour in GPU costs. I’d take that trade any day.</p>
<p>This also applies to models that are already using a single GPU. By shrinking your model with RETRO’s database, requests get served faster, meaning more GPU bang for your buck. Instead of serving 60 req / hour on a single GPU, you’re serving 600+, just for a little extra CPU RAM.</p>
<p>Update (7/6/22) - I’ve been informed that FAISS has the ability to memory map an index, which allows you to read it directly from disk instead of allocating RAM for it. This is slightly slower, of course, but probably worth the trade. (Thanks rom1504.)</p>
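<p>For reference, the memory-mapped load is a one-flag change (assuming a reasonably recent FAISS):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import faiss

# Reads index data lazily from disk instead of allocating ~176 GB of RAM up front.
index = faiss.read_index("pile.index", faiss.IO_FLAG_MMAP)
</code></pre></div></div>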
<h2 id="conclusion">Conclusion</h2>
<p>At first I was skeptical, but upon closer inspection it seems like RETRO is indeed a HUGE cost savings over existing LM approaches. These cost savings seem to boil down to the fact that MIPS is super optimized by existing libraries and only requires more CPU RAM to use. Based on these observations, I can’t imagine why anyone doing language modeling in production would choose to do it without retrieval.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>I tried uploading some of it to Huggingface, but even the compressed FAISS index file exceeded the max 50 GB file size. The tokens themselves are over 1.5 TB. Feel free to shoot me an email and I’ll get you a copy. Update (5/11/23) - I no longer work at Latitude and therefore no longer have access to this index. Sorry! <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Naively, I didn’t do much optimization here. I suspect the bottleneck is probably getting data off disk to the GPU, not the computation speed. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Specifically I’m not certain we need to be so aggressive with the dimensionality reduction during pre-processing. (768 dims → 64.) Because of the way PQ works, I’m pretty sure I could get away with less dimensionality reduction and improve accuracy. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>For k=5, with the IVF nprobe also set to 5. (Which seems to be a standard setting, but could be tuned to trade speed / accuracy.) <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1 id="do-infinite-pencils-exist">Do Infinite Pencils Exist??</h1>
<p><em>2022-02-01</em></p>
<p>Does infinity exist? This <a href="https://mathoverflow.net/a/23521">comment</a>
from MathOverflow ruffled my feathers a bit:</p>
<blockquote>
<p>I’ve heard a worse story. A college instructor claimed in Number
Theory class that there are only finitely many primes. When
confronted by a student, her reply was: “If you think there are
infinitely many, write them all down.” She was on tenure track, but
need I add, didn’t get tenure.</p>
</blockquote>
<p>What a dumb teacher, right? Everyone knows there are an infinite number
of primes! Haven’t you heard of Euclid?</p>
<p>Except I think she’s right. (Depending on what she meant, exactly.)</p>
<p>I <em>believe</em> this professor was attempting to make a subtle point that
many students of mathematics tend to
miss. That is, they mistakenly believe that <strong>infinity</strong> actually
exists. Like, in the real world.</p>
<p>If I tell someone to imagine an “infinite number of pencils,” usually
they picture a <em>bunch</em> of pencils. Like, more than they could ever
count. An empire state building made out of pencils. An ocean full of
pencils. Mars. But pencils.</p>
<p>That’s not really what <strong>infinity</strong> means, mathematically. When a mathematician says
<strong>infinity</strong> they mean a <em>repeatable process that we can keep doing forever.</em></p>
<p>For example, let’s imagine I’m standing at the blackboard and
I ask the class to give me a pencil. Now I have 1 pencil. That’s our
repeatable process. I do it a second time, and I have 2 pencils. A
third time, and three. And so on and so forth.</p>
<p>Imagine I never stop asking for pencils. Ever. How many pencils do I
have, at that indeterminate point in the future? A mathematician would
say an <strong>infinite</strong> number of pencils.</p>
<p>But you see, I’d never <em>really</em> get to an “infinite” quantity of
pencils. <strong>Infinity</strong> is purely a work of imagination. Eventually I’ll
have to stop asking for pencils. I’ll get hungry, or get old or
die. Or we’ll run out of wood or something. Or we’ll exhaust all the
matter in the universe and there will be nothing but pencils floating
around in the vastness of space. Then who would ask for the pencil,
and who would give it?<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
<p>Anyway, back to the primes. Yes, the student is correct in that there
is an infinite number of primes. (And by that we mean there is a
repeatable process<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> for generating primes which we can repeat until the
heat death of the universe.)</p>
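<p>If you want the repeatable process spelled out, here’s one (a Python sketch using plain trial division rather than Euclid or Saidak, but the spirit is the same):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def next_prime(known):
    """Given all primes found so far (in order), produce the next one."""
    candidate = (known[-1] if known else 1) + 1
    while any(candidate % p == 0 for p in known):
        candidate += 1
    return candidate

primes = []
for _ in range(10):  # in principle: while True, until the universe runs out
    primes.append(next_prime(primes))
print(primes)        # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
</code></pre></div></div>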
<p>But the teacher is also right, in her way. If we built a computer to
count all the primes, it would eventually run out of memory. Even if
we turned the whole universe into a computer, eventually we’d run out
of stars and junk to fuel our prime-counting computer. Thus the number of primes we’ll ever be able to count is finitely bounded by the size of our universe.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
<p>That’s… funny? Sad? I don’t know. Why did I even write this?</p>
<p>- Mitchell</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>If infinity doesn’t really exist, then why do we talk about it so much? Because it’s a useful approximation of reality. Of course, it doesn’t make sense to ask, “if I keep asking for pencils how many pencils will I have?” But suppose that on the first ask I take a whole pencil, then 1/4th of a pencil, then 1/9th, and in general \(1/n^2\) of a pencil on the \(n\)-th ask, for ever and ever. How many pencils would I have? The answer is that I will get real close to 1.645 pencils (\(\pi^2/6\)), but never more than that. Why’s that useful? Well, ever try to build a rocket? No? Me neither. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Euclid’s proof is famously <a href="https://en.wikipedia.org/wiki/Euclid%27s_theorem#Euclid's_proof">not constructive</a>, so it doesn’t directly give us a method for constructing a new prime. I prefer the <a href="https://en.wikipedia.org/wiki/Euclid%27s_theorem#Proof_by_construction">proof</a> by Filip Saidak that does. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>I wonder how many digits it has? <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1 id="redefining-sota">Redefining SOTA</h1>
<p><em>2021-08-31</em></p>
<p>In the machine learning research community, achieving state-of-the-art usually
means reporting a single score (percentage accuracy or F1) on a public research
dataset. There are two legitimate reasons to report a “SOTA score” in a research
paper, besides gaming the system.<sup id="fnref:reviewing" role="doc-noteref"><a href="#fn:reviewing" class="footnote" rel="footnote">1</a></sup></p>
<ol>
<li>
<p>A SOTA score may signal to the community that you have “solved” a
task that was previously unsolved (like <a href="https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology">protein
folding</a>).</p>
</li>
<li>
<p>A SOTA score may signal to the community that your new method
is the “best” method to solve the task, and that the rest of the
community (in both academia and industry) should adopt your method as
the new standard.</p>
</li>
</ol>
<p>However, a SOTA score in today’s context accomplishes neither of those goals.
Because of the way many benchmark datasets are constructed, a high test score
(even surpassing human performance) is
unlikely to mean that the model is ready for real-world deployment or that the
task is “solved.” Furthermore, the ability of neural methods to predictably
improve performance with scale means that a single SOTA score is not enough
information to decide whether one neural method is better than another.</p>
<p>In light of these two observations (underspecification and neural scaling laws),
I think the ML community needs to redefine SOTA. Below, I’ll review some of the
literature surrounding underspecification and neural scaling laws, and then make
some suggestions about new “metrics for success” that we should adopt as a
community.</p>
<h2 id="underspecification-the-task-is-not-solved">Underspecification: The Task Is Not Solved</h2>
<p>In the early days of machine learning, task performance was often
associated with accuracy on a single dataset. “Solving” hand-written
digit recognition meant achieving a high accuracy on MNIST, and the
Penn Treebank was the gold standard for part-of-speech tagging in
natural language processing. However, as the field matured we began
meeting the goals we set for ourselves, and we quickly understood that
solving the task is not the same as solving the benchmark.</p>
<p>I first experienced this when BERT broke the General
Language Understanding Benchmark<sup id="fnref:leaderboard" role="doc-noteref"><a href="#fn:leaderboard" class="footnote" rel="footnote">2</a></sup>, well surpassing
human-level performance. Many linguists appropriately asked: does this
mean we’ve solved language understanding? The answer was a resounding
no. Many papers since have been dedicated to all the ways BERT can be
wrong or, worse, right for the wrong reasons.<sup id="fnref:cleverhans" role="doc-noteref"><a href="#fn:cleverhans" class="footnote" rel="footnote">3</a></sup>
<sup id="fnref:generalization" role="doc-noteref"><a href="#fn:generalization" class="footnote" rel="footnote">4</a></sup> Many papers pointed out that BERT (as well as
later models) can rely on spurious correlations in the data and
demonstrated that small, meaningless input perturbations could lead to
incorrect answers.<sup id="fnref:gardner" role="doc-noteref"><a href="#fn:gardner" class="footnote" rel="footnote">5</a></sup> This is analogous to adversarial examples
in image recognition, where adding a small amount of noise can change
a correct label to an incorrect label.<sup id="fnref:bugsfeatures" role="doc-noteref"><a href="#fn:bugsfeatures" class="footnote" rel="footnote">6</a></sup></p>
<p>Evaluation datasets are often not powerful enough to differentiate
between a model which generalizes and a model which relies on spurious
correlations. They may also lack sufficient coverage, such that a high
test score obscures flaws in the model that would cause problems
in production, such as racial/gender bias<sup id="fnref:isbell" role="doc-noteref"><a href="#fn:isbell" class="footnote" rel="footnote">7</a></sup> or susceptibility
to attack.<sup id="fnref:poisoning" role="doc-noteref"><a href="#fn:poisoning" class="footnote" rel="footnote">8</a></sup> In a recent paper, Google researchers called this problem
“underspecification,”<sup id="fnref:underspecification" role="doc-noteref"><a href="#fn:underspecification" class="footnote" rel="footnote">9</a></sup> and pointed out several examples across
the company in which models achieve similar test scores but exhibit widely
divergent behaviors when deployed in production. They show that this is a
distinct problem from domain-shift, in which the test data distribution is
different from the training distribution.</p>
<h3 id="robust-evaluation">Robust Evaluation</h3>
<p>One fix for the problem of underspecification is just to “make the
dataset better.” Some interesting work in this direction:</p>
<ul>
<li><a href="https://aclanthology.org/2020.acl-main.442/">Beyond Accuracy: Behavioral Testing of NLP Models with CheckList</a></li>
</ul>
<p>Inspired by “unit tests” in traditional software engineering,
Checklist is a framework for testing NLP models along many directions
of “linguistic proficiency” by augmenting test examples with
deterministic transformations. Examples include negating verbs and
replacing nouns in sentences with novel nouns.</p>
<ul>
<li><a href="https://leaderboard.allenai.org/orb/submissions/get-started">Open Reading Benchmark</a></li>
</ul>
<p>A suite of NLP datasets from the Allen Institute for AI. Some datasets
are constructed to target specific language capabilities. For example,
<a href="https://allennlp.org/drop">DROP</a> involves performing discrete
reasoning (adding, sorting, counting) over many paragraphs of text.</p>
<ul>
<li><a href="https://aclanthology.org/2021.naacl-demos.6/">Robustness Gym</a></li>
</ul>
<h3 id="tail-chasing">Tail Chasing</h3>
<p>“Tail-chasing” is an attempt at making models more robust to items in the long-tail of the dataset, such as <a href="https://aclanthology.org/2021.naacl-demos.6/">rare words</a> or images. Some other work in this direction:</p>
<ul>
<li><a href="https://arxiv.org/abs/1911.05248">What Do Compressed Deep Neural Networks Forget?</a></li>
<li><a href="https://twitter.com/Nils_Rethmeier/status/1344730606807224322">CLESS: Contrastive Label Embedding Self-supervised Zero to Few-shot Learning from and for Small, Long-tailed Text Data</a></li>
<li><a href="https://bair.berkeley.edu/blog/2019/05/13/oltr/">Large-Scale Long-Tailed Recognition in an Open World</a></li>
<li><a href="https://arxiv.org/abs/1911.00172">Generalization through Memorization: Nearest Neighbor Language Models</a></li>
<li><a href="https://arxiv.org/abs/1906.05271">Does Learning Require Memorization? A Short Tale about a Long Tail</a></li>
</ul>
<h2 id="neural-scaling-which-method-is-best">Neural Scaling: Which Method Is “Best”?</h2>
<div style="text-align: center">
<img src="http://mitchgordon.me/assets/scaling-laws.png" />
<a href="https://arxiv.org/abs/2001.08361">Scaling Laws for Neural Language Models</a>
</div>
<p><br /></p>
<p>Which neural architecture achieves SOTA on a task depends entirely on
the amount of data and compute provided to the architecture. As shown
above, performance scales like a power law with data, compute, and
parameters. This has now been demonstrated for many data domains,
modalities, and neural architectures.<sup id="fnref:autoregressive" role="doc-noteref"><a href="#fn:autoregressive" class="footnote" rel="footnote">10</a></sup></p>
<p>This means you can make any neural architecture SOTA if you’re willing
to spend enough money pouring resources into it. A single SOTA score
is not expressive enough to capture this behavior. Consider the
following graph showing the performance of machine translation
methods with varying amounts of data:<sup id="fnref:koehn" role="doc-noteref"><a href="#fn:koehn" class="footnote" rel="footnote">11</a></sup></p>
<div style="text-align: center">
<img src="http://mitchgordon.me/assets/neural-vs-stat-mt.png" style="height: 300px" />
</div>
<p>Purely neural methods out-perform other methods when given enough
data. However, in low-resource regimes, they fail miserably compared
to phrase-based approaches. Depending on how big our training dataset
is, a SOTA score might lead us to dramatically different
conclusions. A small dataset might give us the impression that
phrase-based methods are “better”, whereas a large dataset would lead
us to believe neural methods are “better.”</p>
<p>The reality is more nuanced: phrase-based methods have inductive
biases that make them better in low-resource scenarios, whereas neural
methods scale better with data. And this simple situation doesn’t even
take into account the amount of money spent on compute while training
each method, and whether that was a bottleneck for either method.</p>
<h3 id="sota-scaling-not-sota-scores">SOTA Scaling, not SOTA Scores</h3>
<p>This implies that beating a benchmark dataset is no longer newsworthy
(i.e. worthy of publication). Anyone can get a SOTA score if they
invest enough money in procuring the data / compute required to get
there. What is newsworthy is if you improve the money-to-performance
trade-off. That could save <a href="https://twitter.com/qinyuan_ye/status/1345258852644573185">billions of
parameters</a>
or millions of dollars!</p>
<p>In other words, because of neural scaling laws, nearly everyone in ML
is working on machine learning efficiency at this point (either
compute efficiency or sample efficiency), but no one is measuring
success that way!! That’s why ML reviewing feels so broken
lately. Here’s a few things we could do right now:</p>
<ul>
<li>
<p>Any paper proposing a new “SOTA” neural method needs to report not
just the data / compute used to achieve SOTA, but the score achieved
at several points of data/compute. The slope of the curve
should be better than all other known methods. SOTA scaling is the
objective, not SOTA scores. (A sketch of this kind of comparison follows this list.)</p>
</li>
<li>
<p>Benchmarks should release pre-determined dataset splits of various
sizes, to help fairly measure the sample complexity curves of new
methods.</p>
</li>
<li>
<p>Compute / parameters should be measured via a standardized platform, like
<a href="https://www.nvidia.com/en-us/data-center/mlperf/">MLPerf</a>. (But
perhaps more streamlined to compare new neural architectures.)</p>
</li>
</ul>
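<p>To make the first suggestion concrete, here’s the kind of comparison I have in mind: fit each method’s compute-vs-error points with a power law and compare the fits, rather than comparing single best scores. (A sketch with invented numbers; a real comparison would also need error bars and a standardized compute measurement.)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def fit_power_law(compute, error):
    """Fit error ~ a * compute^b by regressing log(error) on log(compute)."""
    b, log_a = np.polyfit(np.log(compute), np.log(error), deg=1)
    return np.exp(log_a), b  # b < 0; more negative means better scaling

compute = np.array([1e18, 1e19, 1e20, 1e21])   # training FLOPs (made up)
method_a = np.array([0.40, 0.31, 0.24, 0.19])  # test error of method A
method_b = np.array([0.35, 0.29, 0.25, 0.21])  # method B wins at small scale...

print(fit_power_law(compute, method_a))  # ...but A has the steeper (better) slope
print(fit_power_law(compute, method_b))
</code></pre></div></div>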
<p>A lot of people complain ML reviewing is broken. I tend to agree. But
I also believe that it’s possible to get our act together, as long as
we all agree on a paradigm for evaluating new approaches. I think
scaling laws, accompanied by robust and strengthened evaluation
methods, can help fill that role.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:reviewing" role="doc-endnote">
<p>The
<a href="http://mitchgordon.me/logistics/2019/09/14/improving-ml-conferences.html">deluge</a>
of papers submitted to machine learning conferences has led to a
shortage of quality reviewers who resort to heuristics like
<a href="https://hackingsemantics.xyz/2020/reviewing-models/">“reject if not
SOTA.”</a>
Therefore, many researchers frame their papers as “SOTA score”
papers to <a href="https://arxiv.org/abs/2003.14415">boost</a> chances of
acceptance, even when the paper would be better formulated as a
scientific endeavor (or when the paper would not otherwise meet
conference standards). Some conferences have started
<a href="https://www.aclweb.org/adminwiki/index.php?title=Short-Term_Reform_Proposals_for_ACL_Reviewing">trying</a>
to <a href="https://2020.emnlp.org/blog/2020-05-17-write-good-reviews">fix
this</a>,
but progress is slow. <a href="#fnref:reviewing" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:leaderboard" role="doc-endnote">
<p>“How the Transformers Broke NLP Leaderboards.” 2019. June 30, 2019. https://hackingsemantics.xyz/2019/leaderboards/. <a href="#fnref:leaderboard" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:cleverhans" role="doc-endnote">
<p>Heinzerling, Benjamin. n.d. “NLP’s Clever Hans Moment Has Arrived.” Accessed December 30, 2020. https://bheinzerling.github.io/post/clever-hans/. <a href="#fnref:cleverhans" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:generalization" role="doc-endnote">
<p>Marasović, Ana. 2018. “NLP’s Generalization Problem, and How Researchers Are Tackling It.” The Gradient. August 22, 2018. https://thegradient.pub/frontiers-of-generalization-in-natural-language-processing/. <a href="#fnref:generalization" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:gardner" role="doc-endnote">
<p>“Fall 2019 Natural Language Processing: Matt Gardner (AI2 Irvine).” 2019. Youtube. December 31, 2019. https://www.youtube.com/watch?v=k7d_Nnv_shw. <a href="#fnref:gardner" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:bugsfeatures" role="doc-endnote">
<p>Ilyas, Andrew, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. 2019. “Adversarial Examples Are Not Bugs, They Are Features.” arXiv [stat.ML]. arXiv. http://arxiv.org/abs/1905.02175. <a href="#fnref:bugsfeatures" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:isbell" role="doc-endnote">
<p>“NeurIPS 2020 : You Can’t Escape Hyperparameters and Latent Variables: Machine Learning as a Software Engineering Enterprise.” n.d. Accessed December 30, 2020. https://nips.cc/virtual/2020/public/invited_16166.html. <a href="#fnref:isbell" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:poisoning" role="doc-endnote">
<p>Chen, Xinyun, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. 2017. “Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning.” arXiv [cs.CR]. arXiv. http://arxiv.org/abs/1712.05526. <a href="#fnref:poisoning" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:underspecification" role="doc-endnote">
<p>D’Amour, Alexander, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, et al. 2020. “Underspecification Presents Challenges for Credibility in Modern Machine Learning.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2011.03395. <a href="#fnref:underspecification" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:autoregressive" role="doc-endnote">
<p>https://arxiv.org/abs/2010.14701 <a href="#fnref:autoregressive" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:koehn" role="doc-endnote">
<p>Koehn, Philipp, and Rebecca Knowles. 2017. “Six Challenges for Neural Machine Translation.” In Proceedings of the First Workshop on Neural Machine Translation, 28–39. Stroudsburg, PA, USA: Association for Computational Linguistics. <a href="#fnref:koehn" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1 id="why-is-vscode-typescript-linting-so-damn-fast">Why Is VSCode Typescript Linting So Damn Fast?</h1>
<p><em>2021-06-28</em></p>
<p>I’ve been using VSCode as my de facto Typescript IDE for the last few
months. I’m a heavy Emacs user, however, and it was only a
matter of time until I attempted to get a similar experience via Emacs
configuration, thereby continuing my quest to use Emacs as the sole
interface to my computer.</p>
<p>I took the plunge last weekend, thinking I’d start with something
“simple” like getting linting with eslint to work. Little did I know…</p>
<h2 id="batteries-not-included">Batteries (Not) Included</h2>
<p>When you think of linting in Emacs, you immediately think of
<a href="https://www.flycheck.org/en/latest/">Flycheck</a>. Luckily for me (I
thought), Emacs flycheck has eslint support built in. So
theoretically, all I have to do is add</p>
<div class="language-elisp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">flycheck-mode</span> <span class="mi">+1</span><span class="p">)</span>
</code></pre></div></div>
<p>to my Emacs config and I’m good to go. As I’m editing files, this
package will call eslint asynchronously as a shell command and report
back the results.</p>
<p>As I’m editing, however, I notice results are coming back with a lot
of lag. Like, 5-10 seconds of lag. Huh. Is this an Emacs thing or an
eslint thing? So I pop open a terminal and do</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cd </span>my_proj/
<span class="nv">$ </span><span class="nb">time </span>eslint my_file.ts
real 0m6.684s
user 0m9.321s
sys 0m0.821s
</code></pre></div></div>
<p>And sure enough, the linter takes 9 seconds to complete. But I was
just in VSCode editing this same exact file, and linting was happening
almost instantaneously… what gives??!!</p>
<h2 id="linting--type-checking--compiling">Linting => Type Checking => Compiling</h2>
<p>Turns out, most people won’t run into this slowness unless they enable
certain eslint options in <code class="language-plaintext highlighter-rouge">.eslintrc.js</code>. Ours happens to look
something like this</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">module</span><span class="p">.</span><span class="nx">exports</span> <span class="o">=</span> <span class="p">{</span>
<span class="na">parser</span><span class="p">:</span> <span class="dl">'</span><span class="s1">@typescript-eslint/parser</span><span class="dl">'</span><span class="p">,</span>
<span class="na">plugins</span><span class="p">:</span> <span class="p">[</span><span class="dl">'</span><span class="s1">import</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">@typescript-eslint</span><span class="dl">'</span><span class="p">],</span>
<span class="na">parserOptions</span><span class="p">:</span> <span class="p">{</span>
<span class="na">ecmaVersion</span><span class="p">:</span> <span class="mi">2018</span><span class="p">,</span>
<span class="na">sourceType</span><span class="p">:</span> <span class="dl">'</span><span class="s1">module</span><span class="dl">'</span><span class="p">,</span>
<span class="na">project</span><span class="p">:</span> <span class="dl">'</span><span class="s1">./tsconfig.json</span><span class="dl">'</span><span class="p">,</span>
<span class="p">}</span>
<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
<p>You’ll notice we have <code class="language-plaintext highlighter-rouge">parser</code> set, which tells eslint to use a
<a href="https://github.com/typescript-eslint/typescript-eslint">plug-in</a> to
read the typescript syntax. We also have <code class="language-plaintext highlighter-rouge">parserOptions.project</code> set
to <code class="language-plaintext highlighter-rouge">./tsconfig.json</code>, which tells our parser to include type
information in the parse, so we can use it to make type-aware eslint
rules.</p>
<p>Unfortunately, enabling this option is
<a href="https://github.com/typescript-eslint/typescript-eslint/blob/master/docs/getting-started/linting/TYPED_LINTING.md">known</a>
to be <a href="https://github.com/typescript-eslint/typescript-eslint/issues/243">very
slow</a>
since it requires compiling the entire typescript project to get the
type information. This lines up with my experience, since doing</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cd </span>my_proj
<span class="nv">$ </span><span class="nb">time </span>tsc
real 0m4.968s
user 0m8.181s
sys 0m0.689s
</code></pre></div></div>
<p>takes around 8 seconds of CPU time. So that’s roughly 8 seconds spent compiling the project,
and < 1 second spent running the actual linter rules. And as expected,
removing <code class="language-plaintext highlighter-rouge">parserOptions.project</code> speeds up eslint to under 1 second.</p>
<h2 id="the-difference-that-makes-the-difference">The Difference That Makes the Difference</h2>
<p>So why is VSCode so fast?</p>
<p>Linting in VSCode is done by the <a href="https://github.com/microsoft/vscode-eslint">ESLint
Extension</a>, which claims
to just call eslint in the background. Thinking that couldn’t possibly
be true, I dug into the source code.</p>
<p>Turns out they <strong>do</strong> call eslint, just not from the shell. Instead, they
import eslint into a node process and call it repeatedly as the file
changes. I suspected this meant that the process was able to cache
the AST coming back from <code class="language-plaintext highlighter-rouge">tsc</code>, and only compile the parts that
changed. I put together a little proof-of-concept and what do you
know…</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">eslint</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">my_proj/node_modules/eslint/lib/api.js</span><span class="dl">'</span><span class="p">)</span>
<span class="nx">cli</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">eslint</span><span class="p">.</span><span class="nx">CLIEngine</span><span class="p">({</span> <span class="na">cwd</span><span class="p">:</span> <span class="dl">'</span><span class="s1">my_proj</span><span class="dl">'</span> <span class="p">})</span>
<span class="c1">// This is slow, takes ~9 seconds</span>
<span class="nx">cli</span><span class="p">.</span><span class="nx">executeOnFiles</span><span class="p">([</span><span class="dl">'</span><span class="s1">my_proj/my_file.ts</span><span class="dl">'</span><span class="p">]).</span><span class="nx">results</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nx">messages</span>
<span class="c1">// This is fast, one second at most</span>
<span class="nx">cli</span><span class="p">.</span><span class="nx">executeOnFiles</span><span class="p">([</span><span class="dl">'</span><span class="s1">my_proj/my_file.ts</span><span class="dl">'</span><span class="p">]).</span><span class="nx">results</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nx">messages</span>
<span class="c1">// This is still fast, even if we change the underlying file to introduce a previously unseen linter error</span>
<span class="nx">introduceLintError</span><span class="p">(</span><span class="dl">'</span><span class="s1">my_proj/my_file.ts</span><span class="dl">'</span><span class="p">)</span>
<span class="nx">cli</span><span class="p">.</span><span class="nx">executeOnFiles</span><span class="p">([</span><span class="dl">'</span><span class="s1">my_proj/my_file.ts</span><span class="dl">'</span><span class="p">]).</span><span class="nx">results</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nx">messages</span>
<span class="c1">// And it's fast on other files we haven't loaded before</span>
<span class="nx">cli</span><span class="p">.</span><span class="nx">executeOnFiles</span><span class="p">([</span><span class="dl">'</span><span class="s1">my_proj/other_file.ts</span><span class="dl">'</span><span class="p">]).</span><span class="nx">results</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nx">messages</span>
</code></pre></div></div>
<h2 id="the-fix">The Fix?</h2>
<p>My first intuition was to just stop calling eslint from the
shell. Instead, I imagined I could do something like this:</p>
<ol>
<li>Write a short node server that receives HTTP POST requests
containing filenames to lint. It would then call out to
<code class="language-plaintext highlighter-rouge">eslint.CLIEngine</code> and return the lint errors as JSON.</li>
<li>Write a flycheck checker that would just curl the endpoint.</li>
</ol>
<p>As soon as I wrote down that plan, however, I realized I was basically
describing an eslint language server. So I looked up if Emacs lsp-mode
had a de facto eslint language server and, surprise surprise, they pointed
me to the <a href="https://emacs-lsp.github.io/lsp-mode/page/lsp-eslint/">VSCode
ESlint</a>
extension.</p>
<p>So what I was looking for was with me the whole time. All I
needed to do was add <code class="language-plaintext highlighter-rouge">lsp-mode</code> to my config</p>
<div class="language-elisp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">use-package</span> <span class="nv">lsp-mode</span>
<span class="ss">:ensure</span> <span class="no">t</span><span class="p">)</span>
</code></pre></div></div>
<p>and then install the eslint server with <code class="language-plaintext highlighter-rouge">M-x lsp-install-server</code>. And
voilà, we have lightning fast linting, just like VSCode.</p>
<h2 id="epilogue">Epilogue</h2>
<p>The language server fix is perfectly acceptable. In fact, a
standardized LSP syntax checker seems like a logical successor to the
Flycheck framework for linting.</p>
<p>It seems to me, though, that we could have made the flycheck version
work. Whatever caching the parser is doing could reasonably be
serialized to disk between calls to the eslint command line tool. I
imagine we could add a setting like <code class="language-plaintext highlighter-rouge">parserOptions.cacheASTFname</code> that
would tell eslint where to store that information. This would be in
line with the behavior of other caching options, like the built-in
<code class="language-plaintext highlighter-rouge">--cache</code>.</p>
<h2 id="do-i-hate-emacs">Do I Hate Emacs?</h2>
<p>No, I still like Emacs. And despite this little saga taking me the
better part of three days, I’m happy to have the tools to get to the
root of the problem and fix it myself.</p>I’ve been using VSCode as my defacto Typescript IDE for the last few months. I’m a heavy Emacs user, however, and it was only a matter of time until I attempted to get a similar experience via Emacs configuration, thereby continuing my quest to use Emacs as the sole interface to my computer.My Struggle With Probability Theory2021-04-02T00:00:00+00:002021-04-02T00:00:00+00:00http://mitchgordon.me/math/2021/04/02/probability<blockquote>
<p>TL;DR - the fundamental assumption of probability theory is one of ignorance. This assumption is too easy to break in most contexts and leads to unfounded confidence in conclusions.</p>
</blockquote>
<p>There are many circumstances in which uncertainty is warranted.<sup id="fnref:uncertainty" role="doc-noteref"><a href="#fn:uncertainty" class="footnote" rel="footnote">1</a></sup>
Gas temperature measurements, weather forecasts, horse races, coin flips, and
clinical trials all have some uncertainty involved. Probability theory is the
science that finds commonality among these seemingly disconnected phenomena. We
can observe, for example, that the sum of many independent, “repeatable” random events,
properly normalized, begins to look like a Gaussian distribution (aka the
central limit theorem). We can notice common shapes in the histograms of these
repeatable experiments, such as “fat-tailed” or “power law” distributions. And
if the event is not repeatable, we can at least apply the rules of probability
theory to avoid inconsistencies in our thinking (which would allow a savvy
adversary to take advantage of us when gambling).</p>
<p>However, I believe there are tasteful and distasteful applications of
probability theory. This is because the application of probability to a
particular event requires a suspension of disbelief. To consider an event as
repeatable and iid is to accept that the causal factors driving the outcome are
(practically) unobservable and therefore ignorable. In effect, it means giving
up on deeply understanding a causal explanation of the phenomena and instead
sweeping the details under the rug of “the distribution.”</p>
<p>This makes probability theory the science of last resort. Only after truly
exhausting your ability to investigate causal factors and processes should you
indulge in probabilistic thinking. Doing otherwise is a cop-out, one that
dangerously <em>feels</em> “scientific.”</p>
<h3 id="examples-of-distaste">Examples of Distaste</h3>
<p>The tastefulness of a particular application of probability theory is a matter
of context. Consider the humble coin flip. A lay-person may reasonably assume
that this event is a repeatable, iid experiment with a uniform prior; they may
be working with a variety of coins, fingers, and surfaces, and may not have
equipment available to make precise measurements. To the physicist, however,
this is obviously a cop-out. The physicist knows that the coin’s trajectory can
be precisely captured by the laws of classical mechanics, and therefore
predicted with almost certainty.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">2</a></sup> Where the average person gives up and
shrugs, the scientist continues searching for explanations.</p>
<p>Similar things can be said about drawing a card from a deck of cards. A casual
observer may reasonably assign uncertainty to the event. But to the magician who
controls the precise method of shuffling, this is obviously a cop-out.</p>
<p>Or consider a randomized clinical study in which a drug harms patients in 0.01%
of cases. It’s easy to sweep the causal factors under the rug and assume the
effects are “randomly distributed.” But we can imagine more information
gathering revealing that all instances of harm occurred in an ethnic minority. In
practice, “randomness” is more often a cop-out than an unavoidable facet of the
system under study.</p>
<h3 id="human-nature">Human Nature</h3>
<p>My struggle with probability theory is that it lends itself to distaste. Humans
are lazy, and when presented with the option of doing more investigative
legwork or simply assuming data is iid, they will often choose the latter,
especially when the latter appears to be “scientifically and mathematically
rigorous.” However, mathematics is only as correct as the assumptions made at
the beginning, and by hiding the causal factors of an event behind the
abstraction of a “probability distribution” we deprive ourselves of the ability
to identify when those causal factors change and our assumptions no longer hold
(i.e. the distribution shifts).<sup id="fnref:magnets" role="doc-noteref"><a href="#fn:magnets" class="footnote" rel="footnote">3</a></sup></p>
<p>And even when the assumption of iid is justified, the logic of probability
theory is more often misapplied than not, despite supposedly being a “guide
towards logical consistency.” In my experience, probability theory is more often
used to prove a point in scientific papers than it is a self-check for
correctness. Just look at the p-value crisis of the 2010’s.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">4</a></sup> As they say,
there are lies, damned lies, and statistics. Even Bayesians, who seem to think
they’re always right, can occasionally get it wrong as evidenced by E.T. Jaynes’
humorous exploration of a paper that purported to prove a woman had ESP.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">5</a></sup><sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">6</a></sup></p>
<p>Given the prevalence of misapplication, I can only conclude that probability
theory needs to be redesigned as a mental device. I don’t want to be a
probability theory nazi, but all the evidence seems to indicate that probability
theory is a science for people who have given up on science, rather than the
rigorous system of analysis it purports to be.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:uncertainty" role="doc-endnote">
<p>Many events are unpredictable to us in practice, either because the laws governing the outcome are not known, or because the laws are known but the observations required are too arduous to make. Sometimes the required observations are too numerous to collect (as in statistical mechanics), and other times the non-linear, chaotic nature of the system necessitates observations that are too precise to be practical. Quantum experiments seem to be inherently unpredictable, although whether this is a fundamental facet of nature is a matter of debate. <a href="#fnref:uncertainty" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:1" role="doc-endnote">
<p>There may be some chaos in the bounce on a hard surface or drift in the
wind, so we would need a properly controlled environment and <a href="https://www.npr.org/templates/story/story.php?storyId=1697475">precise
engineering</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:magnets" role="doc-endnote">
<p>For example, I may experiment with a coin and decide that it is fair when tossing it onto a wooden surface, only to discover later that the coin is magnetized and slightly biased towards heads on metallic surfaces. <a href="#fnref:magnets" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>https://www.americanscientist.org/article/the-statistical-crisis-in-science <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>http://www.med.mcgill.ca/epidemiology/hanley/bios601/GaussianModel/JaynesProbabilityTheory.pdf <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>I also highly recommend Chapter 10. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>TL;DR - the fundamental assumption of probability theory is one of ignorance. This assumption is too easy to break in most contexts and leads to unfounded confidence in conclusions.Ducttape: Why and How2021-02-09T00:00:00+00:002021-02-09T00:00:00+00:00http://mitchgordon.me/ml/2021/02/09/ducttape<p>One of the most useful things I’ve learned during my PhD is how to use
<a href="https://github.com/jhclark/ducttape">ducttape</a>, a research workflow management
system. Like many good software tools, the mindset behind ducttape is more
powerful than the code itself. In this post, I’ll try to motivate the research
workflow management mindset and then give you a sense of how ducttape solves the
problems I present.</p>
<h2 id="a-simple-experiment">A Simple Experiment</h2>
<p>Suppose we’re training a new machine learning model, and we’re given the following utilities:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">download_data.py</code> - downloads training data from the internet.</li>
<li><code class="language-plaintext highlighter-rouge">filter_1.py</code>, <code class="language-plaintext highlighter-rouge">filter_2.py</code> - two filtering programs based on different criteria.</li>
<li><code class="language-plaintext highlighter-rouge">aug_1.py</code>, <code class="language-plaintext highlighter-rouge">aug_2.py</code>, <code class="language-plaintext highlighter-rouge">aug_3.py</code> - three data augmentation programs.</li>
<li><code class="language-plaintext highlighter-rouge">train_model.py</code> - trains a model, given some training data.</li>
</ul>
<p>Our task is to determine which combination of data filtering and augmentation
leads to the best model performance. For simplicity of exposition, we’ll assume
that we can only choose one filtering program and one augmentation script to
use. (We can’t use multiple augmentation scripts at the same time.)</p>
<h2 id="the-most-naive-approach">The Most Naive Approach</h2>
<p>The most naive approach is to manually try all the possible combinations of
filters and augmentation scripts. After all, there are only \(2 \times 3 = 6\) possible
combinations. Here’s what our bash history might look like if we did this…</p>
<div class="language-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">bash-3.2$</span><span class="w"> </span>python download_data.py data.txt
<span class="gp">bash-3.2$</span><span class="w"> </span>python filter_1.py data.txt filtered_data1.txt
<span class="gp">bash-3.2$</span><span class="w"> </span>python filter_2.py data.txt filtered_data2.txt
<span class="gp">bash-3.2$</span><span class="w"> </span>python aug_1.py filtered_data1.txt data_1_1.txt
<span class="gp">bash-3.2$</span><span class="w"> </span>python aug_2.py filtered_data1.txt data_1_2.txt
<span class="gp">bash-3.2$</span><span class="w"> </span>python aug_3.py filtered_data1.txt data_1_3.txt
<span class="gp">bash-3.2$</span><span class="w"> </span>python aug_1.py filtered_data2.txt data_2_1.txt
<span class="gp">bash-3.2$</span><span class="w"> </span>python aug_2.py filtered_data2.txt data_2_2.txt
<span class="gp">bash-3.2$</span><span class="w"> </span>python aug_3.py filtered_data2.txt data_2_3.txt
<span class="gp">bash-3.2$</span><span class="w"> </span>python train_model.py data_1_1.txt model_1_1
<span class="gp">bash-3.2$</span><span class="w"> </span>python train_model.py data_1_2.txt model_1_2
<span class="gp">bash-3.2$</span><span class="w"> </span>python train_model.py data_1_3.txt model_1_3
<span class="gp">bash-3.2$</span><span class="w"> </span>python train_model.py data_2_1.txt model_2_1
<span class="gp">bash-3.2$</span><span class="w"> </span>python train_model.py data_2_2.txt model_2_2
<span class="gp">bash-3.2$</span><span class="w"> </span>python train_model.py data_2_3.txt model_2_3
</code></pre></div></div>
<p>Obviously this is going to be tedious and error-prone. Personally, it took me
three tries to type all these commands without making a typo and mixing up the
numbers. And now we have a bunch of files lying around in our working directory
that we probably won’t remember when we come back to this project next week.</p>
<p>This approach might work for a small, manageable number of experiments, but we
can see how this approach could quickly become impractical, especially if we try
to add more steps.</p>
<h2 id="the-less-naive-approach">The Less Naive Approach</h2>
<p>But the above approach is a strawman. Any decent programmer would probably do
something smarter, like write a couple nested bash for-loops:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python download_data.py data.txt
<span class="k">for </span>X <span class="k">in </span>1 2<span class="p">;</span> <span class="k">do
</span>python filter_<span class="nv">$X</span>.py data.txt filtered_data<span class="nv">$X</span>.txt
<span class="k">for </span>Y <span class="k">in </span>1 2 3<span class="p">;</span> <span class="k">do
</span>python aug_<span class="nv">$Y</span>.py filtered_data<span class="nv">$X</span>.txt data_<span class="k">${</span><span class="nv">X</span><span class="k">}</span>_<span class="k">${</span><span class="nv">Y</span><span class="k">}</span>.txt
python train_model.py data_<span class="k">${</span><span class="nv">X</span><span class="k">}</span>_<span class="k">${</span><span class="nv">Y</span><span class="k">}</span>.txt model_<span class="k">${</span><span class="nv">X</span><span class="k">}</span>_<span class="k">${</span><span class="nv">Y</span><span class="k">}</span>
<span class="k">done
done</span>
</code></pre></div></div>
<p>Still, there are a couple of issues with this approach. First, it doesn’t give
us very fine-grained control over which experiments get executed and when they
get executed. Real-world workflows can have many steps each with their own
configurable options. In a simple 6-step workflow with 3 options each, there’s
\(3^6 = 729\) total experimental configurations. We probably don’t want to run
<strong>all</strong> of those. Even if we do, we probably don’t want to come up with names
for that many intermediate files or have them lying around unorganized on our filesystem.</p>
<p>Second, this approach is not very extensible. Let’s say I find out each data
filtering program has an extra option, called “strictness,” which is set to “high” by
default. If I want to run some experiments with strictness set to low, it’s
non-obvious how to add that to the above bash script without wrecking my current
results. You can do it, sure, but it will be painful. And every time someone
wants to add another dimension to the experiments, the pain increases.</p>
<p>In summary, our ideal workflow looks something like the above, but with a few
extra features:</p>
<ul>
<li><strong>Fine-grained control</strong> over which experiments run and the hardware used to
execute each task.</li>
<li>Sensible <strong>automatic naming</strong> and organization of intermediate files.</li>
<li><strong>Re-use of previous work</strong> and <strong>parallelization</strong> of independent work where possible.</li>
<li><strong>Easily extensible</strong> with new experiment dimensions/tasks without
breaking results.</li>
<li><strong>Easy to summarize</strong> results in tabular format.</li>
</ul>
<p>Ducttape has all these nice features, which we’ll demonstrate in the next section.</p>
<h2 id="enter-ducttape">Enter Ducttape</h2>
<p>In ducttape, we organize our workflow as a directed acyclic graph. Each node in
the graph is a bash script (called a task) which accepts filenames from
previously completed tasks and optionally outputs files that can be consumed
downstream. These input/output relationships form the edges of our graph. For
example, here is a task which downloads the training data:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>task download_data
<span class="o">></span> data <span class="o">{</span>
python download_data.py <span class="nv">$data</span>
<span class="o">}</span>
</code></pre></div></div>
<p>This task takes no input and outputs a single file, <code class="language-plaintext highlighter-rouge">$data</code>. Notice that <code class="language-plaintext highlighter-rouge">$data</code>
is a bash variable, not an absolute path. This is because ducttape will
automatically assign <code class="language-plaintext highlighter-rouge">$data</code> to a sensible location on the filesystem for us,
depending on the current settings of the experiment dimensions.</p>
<p>Now that we have a task that provides the data, we can consume it in the filtering task.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>task filter
< <span class="nv">data</span><span class="o">=</span><span class="nv">$data</span>@download_data
<span class="o">></span> filtered_data
:: <span class="nv">filter_type</span><span class="o">=(</span>FilterType: 1 2<span class="o">)</span> <span class="o">{</span>
python filter_<span class="k">${</span><span class="nv">filter_type</span><span class="k">}</span>.py <span class="nv">$data</span> <span class="nv">$filtered_data</span>
<span class="o">}</span>
</code></pre></div></div>
<p>This task displays all three possible argument types. Left angle brackets <code class="language-plaintext highlighter-rouge"><</code>
specify input files from previous tasks, while right angle brackets <code class="language-plaintext highlighter-rouge">></code> specify
output files. (Similar to bash file pipes.) Double colons <code class="language-plaintext highlighter-rouge">::</code> specify
parameters, which are just bash string variables.</p>
<p>Our parameter <code class="language-plaintext highlighter-rouge">$filter_type</code> is assigned to be an experiment dimension, which is
called a “branch” in ducttape. The syntax <code class="language-plaintext highlighter-rouge">:: filter_type=(FilterType: 1 2)</code>
means that the bash variable <code class="language-plaintext highlighter-rouge">$filter_type</code> may be assigned a value of 1 or 2 at
run-time depending on which experiment we’ve asked ducttape to run. Notice that
even though this task has multiple experimental configurations, it always writes
its output to the location specified by <code class="language-plaintext highlighter-rouge">$filtered_data</code>, which is set by
ducttape to a sensible filename based on the current experiment configuration.</p>
<p>The task which augments the data is similar to the above:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>task augment
< <span class="nv">filtered_data</span><span class="o">=</span><span class="nv">$filtered_data</span>@filter
<span class="o">></span> augmented_data
:: <span class="nv">aug_type</span><span class="o">=(</span>AugType: 1 2 3<span class="o">)</span> <span class="o">{</span>
python aug_<span class="k">${</span><span class="nv">aug_type</span><span class="k">}</span>.py <span class="nv">$filtered_data</span> <span class="nv">$augmented_data</span>
<span class="o">}</span>
</code></pre></div></div>
<p>And our last task is to train our model:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>task train_model
< <span class="nv">augmented_data</span><span class="o">=</span><span class="nv">$augmented_data</span>@augment
<span class="o">></span> model <span class="o">{</span>
python train_model.py <span class="nv">$augmented_data</span> <span class="nv">$model</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Finally, to run a particular set of experiments, we make a “plan” of execution:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plan main {
reach train_model via (FilterType: 1) * (AugType: *)
}
</code></pre></div></div>
<p>This plan trains three models: all use the first filtering option and each
uses a different augmentation option. We can easily extend this plan to
target different tasks, or to take different paths through our workflow graph.
If a branch is not specified, ducttape uses the “baseline” branch, which is the
first option provided in the branch definition.</p>
<p>This is a fairly linear workflow, but many real-world workflows will have tasks
which take input from many upstream tasks and provide files to many downstream
tasks. You’ll notice that our workflow implementation is more verbose than our
original bash script; however, all this boilerplate gives us the nice features
we mentioned above, including automatic parallelization, assigning different
tasks to different machines, and more.</p>
<h2 id="extending-the-workflow">Extending the Workflow</h2>
<p>Supposing we ran the above experiments, we can go on to extend our workflow with
new experiments without breaking our existing results.</p>
<h3 id="adding-a-new-augmentation-script">Adding A New Augmentation Script</h3>
<p>If we wrote a new augmentation script <code class="language-plaintext highlighter-rouge">aug_4.py</code>, this can easily be added to
our workflow with one character change:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>task augment
< <span class="nv">filtered_data</span><span class="o">=</span><span class="nv">$filtered_data</span>@filter
<span class="o">></span> augmented_data
:: <span class="nv">aug_type</span><span class="o">=(</span>AugType: 1 2 3 4<span class="o">)</span> <span class="o">{</span>
python aug_<span class="k">${</span><span class="nv">aug_type</span><span class="k">}</span>.py <span class="nv">$filtered_data</span> <span class="nv">$augmented_data</span>
<span class="o">}</span>
</code></pre></div></div>
<h3 id="adding-a-new-filter-option">Adding a New Filter Option</h3>
<p>Similarly, if we wanted to add a new branch to specify the “strictness” of the filter, we could update the task like this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>task filter
< <span class="nv">data</span><span class="o">=</span><span class="nv">$data</span>@download_data
<span class="o">></span> filtered_data
:: <span class="nv">filter_strictness</span><span class="o">=(</span>FilterStrict: high low<span class="o">)</span>
:: <span class="nv">filter_type</span><span class="o">=(</span>FilterType: 1 2<span class="o">)</span> <span class="o">{</span>
python filter_<span class="k">${</span><span class="nv">filter_type</span><span class="k">}</span>.py <span class="nv">$data</span> <span class="nv">$filtered_data</span> <span class="nt">--strictness</span><span class="o">=</span><span class="nv">$filter_strictness</span>
<span class="o">}</span>
</code></pre></div></div>
<p>and then update our execution plan:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plan main {
reach train_model via (FilterType: 1) * (AugType: *) * (FilterStrict: *)
}
</code></pre></div></div>
<p>This would not break any of our existing results. When we run this new plan,
ducttape will assume that previous executions of the filter task were run with
the strictness set to “high,” which is the baseline value for the branch.</p>
<h3 id="adding-evaluation">Adding Evaluation</h3>
<p>We can also add new tasks to our workflow which will re-use previous results:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>task evaluate
< <span class="nv">model</span><span class="o">=</span><span class="nv">$model</span>@train_model
<span class="o">></span> score <span class="o">{</span>
python eval_model.py <span class="nv">$model</span> <span class="nv">$score</span>
<span class="o">}</span>
</code></pre></div></div>
<h2 id="caveats">Caveats</h2>
<p>Ducttape is crufty: the latest commit was in 2015, and there are still some
rough edges. That being said, it gets the job done. And since most experiments
are short-lived, I’m not super worried about the tech-debt I might incur by
using it. There are other workflow management frameworks out there, like
<a href="https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8">Airflow</a>
and <a href="https://github.com/spotify/luigi">Luigi</a>, but I’ve found those don’t have
as good of a story for managing experimentation branches.</p>
<p>One other thing I don’t like is that it’s too easy to do the wrong thing with
ducttape. For mildly complex workflows, it’s not immediately obvious
what the right task/branch setup should be. This requires “ducttape zen,” which
is discovered with time. In general, I think best practice is to implement more
branches than you need and then trim down your execution space using lots of
execution plans. I might talk about that in a later post.</p>
<p>There are also other features that I haven’t covered, such as package
management, hardware configurations for each task, and summaries of results. If
you’d like to learn more, feel free to read the
<a href="https://github.com/jhclark/ducttape/blob/master/tutorial/TUTORIAL.md">tutorial</a>.
However, I believe this brief overview covers about 90% of my ducttape usage,
and I hope it gives you a sense of the usefulness of research workflow
management.</p>One of the most useful things I’ve learned during my PhD is how to use ducttape, a research workflow management system. Like many good software tools, the mindset behind ducttape is more powerful than the code itself. In this post, I’ll try to motivate the research workflow management mindset and then give you a sense of how ducttape solves the problems I present.A Software Tester’s Perspective on Statistical Learning Theory2020-11-05T00:00:00+00:002020-11-05T00:00:00+00:00http://mitchgordon.me/ml/2020/11/05/statistical-learning-theory-testing<p>You’re a software testing engineer, working at a big tech company. While other
engineers on your team write code, your job is to make sure the code is safe
before you push it to production. Your goal isn’t to prove the code is
“correct,” but rather to assess the risk of potential failures to the company
and <a href="http://assets.cambridge.org/97811071/72012/excerpt/9781107172012_excerpt.pdf#page=8">test
accordingly</a>.
You mainly write tests that try to rule out known or suspected failure modes,
and you spend a lot of time thinking about edge cases.</p>
<p>One day, over your morning cup of coffee, you get an email from the other
engineers on your team. They’ve decided that writing source code is too hard, so
they’ve started <strong>randomly guessing</strong> program implementations until one meets
the specification. They call this wacky approach “<a href="https://medium.com/@karpathy/software-2-0-a64152b37c35">Software
2.0</a>” or something.</p>
<p>“Not to worry,” they tell you, “we can prove it works. You don’t even have to
write tests any more!” They go on to explain that there’s this book called
“<a href="https://en.wikipedia.org/wiki/Statistical_learning_theory#:~:text=Statistical%20learning%20theory%20is%20a,predictive%20function%20based%20on%20data.">Statistical Learning
Theory</a>,”
which describes a mathematical framework that <em>proves</em> Software 2.0 can give you
a correct implementation.</p>
<p>Intrigued, you ask them for more details.</p>
<h1 id="future-input-sim-past-input">Future Input \(\sim\) Past Input</h1>
<p>First, they have to assume that any future user input will be similar to what
you’ve seen previously in production. They call this the
<a href="https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables">IID</a>
assumption.</p>
<p>“But what about hackers?” you ask. “and what happens when we change the UI?
People change their behaviour all the time! This week they’re Googling for
election results, but next week they’ll go back to Googling Kanye West…”</p>
<p>They concede that maybe you have a point, but they definitely
need this assumption to make it work. You begrudgingly let them continue.</p>
<h1 id="no-specification">No Specification</h1>
<p>Then they tell you there’s no specification. You ask them what the hell that means.</p>
<p>“Ok, hear us out,” they say. “The old spec was basically impossible to write.
There were too many edge cases! Our poor product manager didn’t even know where
to start, honestly.”</p>
<p>Instead, they decided to ask the product manager to write down a bunch of
example user inputs along with the correct output for each example. That would
serve as the de facto specification. They call this “training data.” Then they
guess a program that meets those requirements, using this thing called gradient
descent.</p>
<p>You mention that this reminds you of <a href="https://medium.com/dev-genius/dont-just-test-the-happy-path-e3fd565bad53">happy-path
testing</a>,
where you only write tests for things you expect without thinking about possible
failure modes. In this case the training examples test the happy paths. How do
you know there aren’t still edge cases and bugs lurking around?</p>
<p>That’s where the magic of SLT kicks in, they say.</p>
<h1 id="future-performance-sim-past-performance">Future Performance \(\sim\) Past Performance</h1>
<p>If you assume that future inputs look like past inputs, they say, then you can
also assume future performance will look like past performance! As long as you
<strong>have enough training data</strong>, there’s a low probability that you’ll encounter
edge cases. Basically, you want the happy paths to be the only paths.</p>
<p>“But how many examples do you need to make sure there aren’t any unhappy paths?”
you ask.</p>
<p>It depends on how complicated the program you’re guessing is, they say. If the
program is super complex and you don’t know anything about it, you basically
need to enumerate all the possible inputs. They’ve been calling this the “<a href="https://analyticsindiamag.com/what-are-the-no-free-lunch-theorems-in-data-science/">No
Free Lunch Theorem</a>.”</p>
<p>But if it’s “simple” somehow and you can use that to narrow down the possible
candidate programs, then you need way less data. That part sounds kind of
reasonable to you. It reminds you of having <a href="https://en.wikipedia.org/wiki/Code_coverage">branch
coverage</a> when writing unit tests.
If you have more branches, then you have to write more tests. Similarly, if you
have more possible candidate programs, then you need more data to make sure you
pick the right one.</p>
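<p>If you pressed them for the math, they might scribble down the standard bound for a finite set of candidate programs \(\mathcal{H}\) (a sketch of one common formulation, not the only one): given \(n\) iid training examples, with probability at least \(1 - \delta\) every candidate program’s future error rate is within</p>
\[\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}\]
<p>of its error rate on the training data. More candidate programs means a bigger \(|\mathcal{H}|\), which means you need more data \(n\) to keep that gap small, which is the same intuition as needing more tests when you have more branches.</p>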
<h1 id="biasvariance-trade-off">Bias/Variance Trade-off</h1>
<p>“But wait,” you say, “what if you’re wrong about how the program you’re guessing
is simple? Isn’t the problem that you don’t know the right program in the
first place?”</p>
<p>They tell you you’re right, of course, and that there’s a trade-off. There are
different kinds of simplicity with different levels of strictness.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> If you
suppose the wrong kind of simplicity from the start, that puts a hard cap on how
well you can learn the program. They call this the <em>bias</em> or <em>approximation
error</em>. On the other hand, if you don’t assume anything at all, then you need
way more data. If you don’t have enough, then you might encounter <em>variance</em> or
<em>estimation error</em>.</p>
<p>The best case, of course, is when you don’t have to guess and you know the
correct implementation of the program you want to write. Then you would have
maximum correct bias and no variance.</p>
<p>You muse that the second best case would be to just label every possible input
so that you don’t have to assume anything. They tell you that’s usually impractical (the
poor product manager can only work so much) but in some cases you basically have
<a href="https://arxiv.org/abs/2005.14165">infinite data</a> and that’s exactly what they do.</p>
<h1 id="conclusion">Conclusion</h1>
<p>“So let me get this straight,” you say.</p>
<ul>
<li>
<p>First, you assume that the future will look like the past.</p>
</li>
<li>
<p>Next, you get somebody to write down a bunch of example inputs and
correct outputs which you use as the spec.</p>
</li>
<li>
<p>Then, you make some assumptions about the program you’re trying to guess. If
you make the wrong assumptions, then you cap your maximum performance. But if
you don’t make any, you might accidentally overload your product manager.</p>
</li>
<li>
<p>Finally, you guess a random program that fits the spec. As long as
you have enough data, you can guarantee you did the best you could with your
assumptions and that you probably won’t hit any edge cases.</p>
</li>
</ul>
<p>They nod. You shake your head. “I don’t know guys, this seems kind of fishy. The
IID assumption is one thing, but we also have no idea how large the approximation error is, right?”</p>
<p>They shrug. “Look man, we just don’t want to write any code, ok? It’s too hard.”
You can understand the sentiment.</p>
<p>“Besides, we don’t ever really use SLT in practice.”</p>
<p>“What?”</p>
<p>“Yeah, we just set aside some of the training data as a test set. If the program
we guess does good on the training data and the test set, we just assume it’s
good to go.”</p>
<p>“But what if your test set is bad?”</p>
<p>They shrug again. “We just do our best. Sometimes we <a href="https://www.aclweb.org/anthology/2020.acl-main.442/">craft test
sets</a> that look for
<a href="https://arxiv.org/abs/1912.12598">specific properties</a>.”</p>
<p>You nod knowingly. At the end of the day, you don’t write tests to prove
correctness. You write tests to show the <a href="http://assets.cambridge.org/97811071/72012/excerpt/9781107172012_excerpt.pdf#page=8">presence or absence of
bugs</a>
in a way that appropriately manages risk. Some things never change. You take a
sip of coffee and go back to writing unit tests, suspecting your colleagues
will join you in a few years.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>You might assume, for example, that your program is <a href="http://egrcc.github.io/docs/dl/deeplearningbook-convnets.pdf">translation invariant</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>You’re a software testing engineer, working at a big tech company. While other engineers on your team write code, your job is to make sure the code is safe before you push it to production. Your goal isn’t to prove the code is “correct,” but rather to assess the risk of potential failures to the company and test accordingly. You mainly write tests that try to rule out known or suspected failure modes, and you spend a lot of time thinking about edge cases.The Variance of Yotta Savings Accounts2020-08-24T00:00:00+00:002020-08-24T00:00:00+00:00http://mitchgordon.me/math/2020/08/24/yotta-ball<p>My girlfriend recently got a <a href="https://www.withyotta.com/">Yotta savings account</a>
which has an interesting twist: instead of paying interest like a normal bank,
they buy you lottery tickets. This makes saving more exciting, since you have a
small chance of winning millions of dollars.</p>
<p>Now, my first thought was that this must be a terrible deal for whoever’s
playing, since the lottery is generally considered a bad investment. But it
turns out the Yotta lottery actually has pretty good odds. People have
<a href="https://youtu.be/ziGvywXzhfw?t=209">calculated the average return</a> to be around
2.6% APY<sup id="fnref:splits" role="doc-noteref"><a href="#fn:splits" class="footnote" rel="footnote">1</a></sup>, which is
not a bad return for a savings account. This will likely decrease as they become
more established and get more users.</p>
<h2 id="what-is-the-variance-of-yotta-interest">What is the variance of Yotta interest?</h2>
<p>One question I haven’t seen answered on the internet is about the variance of
the APY. Sure, maybe the <em>average</em> investor gets 2.6% APY. But that might mean most
people get 0% and one person wins a few million dollars. If I’m going to invest,
I’d like some guarantees on the lower-bound of the amount of money I’m going to
get back.</p>
<p>This is a really straightforward instance of the law of large numbers: the
more tickets you buy, the closer you’ll get to the average return. But before we
break out the math, let’s start with some simulations to get some intuition
of what to expect.</p>
<p>Let’s suppose 10k people each invest $10k in a Yotta savings account. How much
interest will each of them earn? We can simulate this scenario with a pretty
simple python script:</p>
<ol>
<li>$10k buys each person 400 lottery tickets per week.</li>
<li>We can simulate the lottery drawings with a random number generator to
predict how much money<sup id="fnref:payouts" role="doc-noteref"><a href="#fn:payouts" class="footnote" rel="footnote">2</a></sup> each person will win from their 400 tickets.
(No split prizes.<sup id="fnref:splits:1" role="doc-noteref"><a href="#fn:splits" class="footnote" rel="footnote">1</a></sup>)</li>
<li>Rinse and repeat for 52 weeks, keeping track of the total money won for each person.<sup id="fnref:compounding" role="doc-noteref"><a href="#fn:compounding" class="footnote" rel="footnote">3</a></sup></li>
</ol>
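<p>A stripped-down sketch of that simulation might look like the following. (The prize table here is a made-up placeholder; the real odds and payouts come from the spreadsheet linked in the footnotes, but the shape of the loop is the same.)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Placeholder prize tiers: (probability per ticket, payout in dollars).
# Swap in the real payout table from the footnotes for actual numbers.
PRIZES = [(0.02, 0.10), (0.005, 1.00), (0.0001, 10.00)]

def simulate_people(num_people=10_000, tickets_per_week=400, weeks=52):
    """Total yearly winnings for each person, no compounding."""
    n_tickets = tickets_per_week * weeks
    totals = np.zeros(num_people)
    for prob, payout in PRIZES:
        # Winning tickets per person for this tier over the year.
        # (Treats tiers as independent, a fine approximation for tiny odds.)
        totals += payout * np.random.binomial(n_tickets, prob, size=num_people)
    return totals

winnings = simulate_people()
print(winnings.min(), winnings.mean(), winnings.max())
</code></pre></div></div>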
<p>The results are plotted in the histogram below. On the x-axis is an amount of
money, and the y-axis shows how many people won that much money via lottery tickets.</p>
<p><img src="http://mitchgordon.me/assets/yotta_10000_52_10000_False.png" alt="10k" /></p>
<p>As expected, the average money won is around $260 (2.6% APY). However, some
people won as little as $200 (2.0% APY) and some as much as $350 (3.5% APY). No one
won less than $150 (1.5% APY).</p>
<p>So the variance isn’t that bad. Of course, if you invest less money you get fewer
tickets, and so the variance will increase. Below is the same experiment when
people invest $1k each.</p>
<p><img src="http://mitchgordon.me/assets/yotta_1000_52_10000_False.png" alt="1k" /></p>
<p>In this case, the average still seems to be around 2.6% APY. However, some
people get back as little as 1.0% APY, and the distribution is skewed right, with some
people getting as much as 6.0% APY. If you’re interested in running more
experiments, I’ve provided the python snippet below. (If that’s broken you can
also grab the code on
<a href="https://github.com/mitchellgordon95/implementing-paradoxes/blob/master/yotta.py">github</a>.)</p>
<iframe src="https://trinket.io/embed/python3/5771943659" width="100%" height="356" frameborder="0" marginwidth="0" marginheight="0" allowfullscreen=""></iframe>
<h2 id="tail-inequalities">Tail Inequalities</h2>
<p>I mentioned earlier that buying many lottery tickets is a very
straightforward instance of the law of large numbers: the more tickets you buy,
the closer your winnings get to the expected APY. But what if I want to quantify
exactly how far my earnings will be from the average? For example, if I invest
$10k for one year, how likely is it that I earn less than $200?</p>
<p>This is exactly the question that tail inequalities answer. Suppose I have a
random variable \(X\). Tail inequalities tell us how probable it is that \(X <
t\), for some value of \(t\). There are several types of tail inequalities you
can use, depending on how much information you have about \(X\):</p>
<ul>
<li>
<p>The <strong>Markov Inequality</strong> is the simplest version, which you can use if you
only know the expected value of \(X\).</p>
</li>
<li>
<p>The <strong>Chebyshev Inequality</strong> is more complicated, taking into account both the
expected value of \(X\) and its variance.</p>
</li>
<li>
<p><strong>Chernoff Bounds</strong> usually give you the tightest bounds, but are only
applicable when \(X\) is a sum of multiple independent random variables.</p>
</li>
</ul>
<p>Technically we could apply Chernoff bounds, since the amount we win in a year is the sum of the amounts we win from each ticket we buy that year. But that requires a little more elbow grease than I’m willing to put in at the moment, so we’ll just use Chebyshev bounds. Here’s the theorem:</p>
<p><strong>Theorem (Chebyshev’s Inequality)</strong> Let \(X\) be a random variable with expectation \(\mu_X\) and standard deviation \(\sigma_X\). Then for any \(t \in \mathbb{R}^+\),</p>
\[Pr[|X - \mu_X| \geq t \sigma_X] \leq \frac{1}{t^2}\]
<p>Let’s break this down for our case:</p>
<table>
<tbody>
<tr>
<td>Math stuff</td>
<td>Our case</td>
</tr>
<tr>
<td>\(X\)</td>
<td>The amount of money we’ll earn if we invest $10k in a Yotta savings account over one year.</td>
</tr>
<tr>
<td>\(\mu_X\)</td>
<td>The average value of X. ($260)</td>
</tr>
<tr>
<td>\(\sigma_X\)</td>
<td>The standard deviation of X. ($29)<sup id="fnref:variance" role="doc-noteref"><a href="#fn:variance" class="footnote" rel="footnote">4</a></sup></td>
</tr>
<tr>
<td>\(Pr[|X - \mu_X| \geq t \sigma_X]\)</td>
<td>The probability that the money we earn is at least \(t\) standard deviations away from the mean.</td>
</tr>
</tbody>
</table>
<p>So let’s say we want to know the probability that we earn less than $200 in a
year. That’s more than two standard deviations below the mean. Chebyshev’s
theorem tells us that the probability of landing two or more standard deviations
away from the mean (in either direction) is at most \(\frac{1}{2^2} = 1/4\), so in
particular the probability that we earn less than $200 is at most 1/4.</p>
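<p>For the curious, here is the same calculation without rounding \(t\) down to two standard deviations, using the $260 mean and $29 standard deviation from the table above:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mu, sigma, threshold = 260, 29, 200

t = (mu - threshold) / sigma   # about 2.07 standard deviations below the mean
bound = 1 / t**2               # Chebyshev bounds the two-sided tail by 1/t^2
print(f"t = {t:.2f}, chance of earning under ${threshold} is at most {bound:.2f}")  # ~0.23
</code></pre></div></div>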
<p>So Chebyshev guarantees that we’ll make more than $200 with at least 75%
certainty. But remember, that’s just a bound. Based on our simulations, the
probability that we make more than $200 is likely much higher than 75%. I would
probably say it’s around 95% based on the histogram.</p>
<h2 id="central-limit-theorem">Central Limit Theorem</h2>
<p>Now, you might have noticed that our first histogram looks essentially like a
normal distribution. This happens all the time and is the subject of the
<em>central limit theorem</em> (CLT). The CLT says that if you buy enough lottery
tickets, the amount of money you make that year will fall on an approximately
normal distribution. The mean of this normal distribution is the sum of the
expected values of the individual tickets, and its variance is the sum of
their individual variances.</p>
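<p>In symbols, if each of the \(n = 400 \times 52 = 20{,}800\) tickets has mean payout \(\mu_{\text{ticket}}\) and variance \(\sigma^2_{\text{ticket}}\), then</p>
\[\mu_X = n\,\mu_{\text{ticket}}, \qquad \sigma_X = \sqrt{n\,\sigma^2_{\text{ticket}}},\]
<p>which, plugging in the per-ticket numbers from the footnotes, works out to roughly $260 and $29 for the $10k scenario.</p>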
<p>If we can verify that the distribution is approximately normal, as we have above
with the $10k scenario, we can skip calculating the Chebyshev or Chernoff bounds
and just assume the distribution is normal. Plugging the average and standard
deviation into this <a href="http://onlinestatbook.com/2/calculators/normal_dist.html">normal distribution
calculator</a> tells us
the probability of making more than $200 is 98%. This is much more in line with
the results of our experiments, if slightly more hand-wavy.</p>
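<p>You can reproduce that number without the online calculator by evaluating the normal CDF directly (a small sketch using only the standard library, assuming the same $260 mean and $29 standard deviation):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

mu, sigma, threshold = 260, 29, 200

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

p_more = 1 - normal_cdf(threshold, mu, sigma)
print(f"Chance of earning more than ${threshold}: {p_more:.1%}")  # about 98%
</code></pre></div></div>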
<p>You have to be careful with this approach, however, since not all distributions
will be normal. For example, the second scenario we tested, where each person
invested $1k, showed that the distribution wasn’t very close to a normal
distribution. In that case, it might be better to apply Chernoff or Chebyshev
bounds.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:splits" role="doc-endnote">
<p>Some prizes can be split between multiple winners (such as the $10M prize). If more people are playing, then split prizes get smaller. Since we’re interested in the “worst-case” scenario, we assume the actual payout for all split prizes is $0. In this case, the average APY is 2.6%. <a href="#fnref:splits" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:splits:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:payouts" role="doc-endnote">
<p>We used the payouts from this <a href="https://bit.ly/yottaev">spreadsheet</a>. <a href="#fnref:payouts" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:compounding" role="doc-endnote">
<p>For simplicity, we did not compound the weekly interest. This can be changed in the python snippet, if you care, but it does not impact growth too much. <a href="#fnref:compounding" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:variance" role="doc-endnote">
<p>The variance of a single ticket’s payout is roughly 0.04 (in dollars squared). The variance of a sum of independent random variables is the sum of their variances. So for 400 * 52 = 20,800 tickets, the variance is about 842, and the standard deviation (the square root of the variance) is about $29. <a href="#fnref:variance" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>My girlfriend recently got a Yotta savings account which has an interesting twist: instead of paying interest like a normal bank, they buy you lottery tickets. This makes saving more exciting, since you have a small chance of winning millions of dollars.