<h1 id="memorializing-my-twitter-side-project">Memorializing My Twitter Side-Project</h1>
<p><em>2023-09-24</em></p>
<p>This is a short memorial post for my dead Twitter summarizer project, aka “Twitter at a Glance.”</p>
<p>Links: <a href="http://twitter.mitchgordon.me">Demo</a> (if it’s still up), <a href="https://github.com/mitchellgordon95/TwitterSummary" target="_blank">Github</a></p>
<h2 id="what-it-does">What It Does</h2>
<p>Twitter at a Glance<sup>™</sup> summarizes your Twitter feed so you can stay on top of what’s happening. Concretely it:</p>
<ul>
<li>Grabs your timeline tweets from the last 24 hours</li>
<li>Clusters them using ChatGPT (maybe twice, hierarchically)</li>
<li>Summarizes those clusters</li>
</ul>
<p>This gets you a nice compact homepage with some topics you can click through to see the tweets.</p>
<p><img src="/assets/taag_demo.png" />
<a href="/taag_demo.html">Interactive</a></p>
<h2 id="how-it-works">How It Works</h2>
<h1 id="hashtag-generation--clustering">Hashtag Generation & Clustering</h1>
<p>Clustering operates on hashtags, which we get from a simple prompt to GPT-3.5:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TWEET:
{tweet}
Generate 30 possible hashtags that could go with TWEET.
Rules:
If TWEET refers to a location or event, include at least one hashtag containing the name of the event.
If TWEET refers to a specific object or thing, include at least one hashtag containing the name of that thing.
</code></pre></div></div>
<p><a href="https://chat.openai.com/share/957c094d-0849-416c-b49a-51b4ce57db72">ChatGPT example</a></p>
<p>We then pick hashtags which are associated with 7 tweets or less<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">1</a></sup> and use those as clusters. For smaller clusters, we also “pack” them by greedily grabbing tweets that have a high hashtag-overlap with the existing tweets in the cluster.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">2</a></sup></p>
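<p>For the curious, here’s a minimal sketch of that clustering step in Python, assuming each tweet already has its set of generated hashtags. The names and the disjoint-cluster assumption are mine for illustration, not the repo’s code — see the Github link above for the real thing.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import defaultdict

MAX_CLUSTER_SIZE = 7  # the "magic number" from the footnote

def cluster_tweets(tweet_hashtags):
    """tweet_hashtags: dict of tweet_id -> set of generated hashtags."""
    # Invert the mapping: hashtag -> ids of tweets that carry it.
    by_tag = defaultdict(set)
    for tweet_id, tags in tweet_hashtags.items():
        for tag in tags:
            by_tag[tag].add(tweet_id)

    # Seed clusters from hashtags associated with 7 tweets or fewer.
    clusters, assigned = [], set()
    for tag, ids in sorted(by_tag.items(), key=lambda kv: -len(kv[1])):
        if len(ids) <= MAX_CLUSTER_SIZE and not ids & assigned:
            clusters.append(set(ids))
            assigned |= ids

    # Greedily "pack" small clusters with leftover tweets that share hashtags.
    for cluster in clusters:
        cluster_tags = set().union(*(tweet_hashtags[i] for i in cluster))
        leftovers = sorted(
            (i for i in tweet_hashtags if i not in assigned),
            key=lambda i: -len(tweet_hashtags[i] & cluster_tags),
        )
        for i in leftovers:
            if len(cluster) >= MAX_CLUSTER_SIZE:
                break
            if tweet_hashtags[i] & cluster_tags:
                cluster.add(i)
                assigned.add(i)
    return clusters
</code></pre></div></div>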
<h1 id="summarization">Summarization</h1>
<p>Then just ask for a summary:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TWEETS:
"""
{tweets_text}
"""
What topic do all TWEETS have in common? Rules:
- The topic must begin with "{num_tweets} tweets are about"
- The topic must be no more than 1 sentence.
- The topic must be discussed in a majority of the tweets.
- The topic must be related to {hashtags}
Think out loud, then state the topic prefixed with the TOPIC label.
</code></pre></div></div>
<p><a href="https://chat.openai.com/share/98892dd4-d6aa-45d1-9ef2-9bfade0793a7">ChatGPT example</a></p>
<p>Which, you know, could probably still use some elbow grease. But it works ok. We do something similar for the meta-clusters, where the inputs to that prompt are just the summaries from the sub-clusters.</p>
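<p>Concretely, the summarization call looks roughly like the sketch below (assuming the pre-1.0 <code class="language-plaintext highlighter-rouge">openai</code> Python client; function and variable names are illustrative, not the repo’s). The only mildly interesting part is pulling the final answer out of the “think out loud” preamble:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import openai

SUMMARY_PROMPT = '''TWEETS:
"""
{tweets_text}
"""
What topic do all TWEETS have in common? Rules:
- The topic must begin with "{num_tweets} tweets are about"
- The topic must be no more than 1 sentence.
- The topic must be discussed in a majority of the tweets.
- The topic must be related to {hashtags}
Think out loud, then state the topic prefixed with the TOPIC label.'''

def summarize_cluster(tweets, hashtags):
    prompt = SUMMARY_PROMPT.format(
        tweets_text="\n".join(tweets),
        num_tweets=len(tweets),
        hashtags=", ".join(hashtags),
    )
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp["choices"][0]["message"]["content"]
    # The model thinks out loud first, so keep only the line carrying the label.
    for line in reversed(text.splitlines()):
        if line.strip().startswith("TOPIC"):
            return line.split(":", 1)[-1].strip()
    return text.strip()
</code></pre></div></div>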
<p>If we do meta-summarization, we can also re-summarize the sub-clusters to get more specific about what makes that cluster different:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TWEETS:
\"\"\"
{tweets_text}
\"\"\"
What topic do all TWEETS have in common? Rules:
- The topic must be no more than 1 sentence.
- The topic must be discussed in a majority of the tweets.
- The topic must be related to {hashtags}
- The topic must begin with "{num_cluster_tweets} tweets are about {cluster_summary}. More specifically, {num_tweets} are about"
Do not think. Just say the topic and only the topic.
</code></pre></div></div>
<p><a href="https://chat.openai.com/share/af0f95ff-3e3e-42e9-8cc1-9ff617e92f87">ChatGPT Example</a></p>
<h2 id="why-its-dead">Why It’s Dead</h2>
<h1 id="price">Price</h1>
<p>The Twitter API is ridiculously priced, at $100 / month to retrieve up to 10k tweets for the basic tier. We can serve ~4 DAUs for that price, assuming each page view retrieves 100 tweets. The next tier up costs $5k / month. Which brings me to…</p>
<h1 id="usefulness">Usefulness</h1>
<p>Common feedback I’ve gotten is that most people tend to have pretty domain-specific feeds (AI, crypto, etc.). So the clusters all tend to be about the same thing. Take my feed for example:</p>
<p><img src="/assets/taag_mitchg.png" /></p>
<p>All the clusters are some variation of “advances in AI”. The meta-clustering and resummarization give a little bit more specificity, but I usually end up clicking through most of the clusters anyway.</p>
<h1 id="takeaways">Takeaways</h1>
<p>Perhaps a flood of information is just a flood of information, regardless of how you slice and dice it? I will say that the experience is <em>slightly</em> nicer than infinite scrolling because it makes the content feel more finite and manageable. But not worth $25 / month.</p>
<p>I still think there’s a huge potential for LMs to fundamentally change how we consume information. If the Twitter API pricing were less egregious, I might keep experimenting with different presentation formats (interactive cluster subjects, perhaps, or intents like “keep up with AI developments”) and different prompting strategies.</p>
<p>But alas, the planets spin quite outside of my grasp.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:2" role="doc-endnote">
<p>which is a magic number I picked because more felt like it overloaded my working memory <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:1" role="doc-endnote">
<p>NB: I tried a few other things that didn’t work nearly as well. Naively embedding and kmeans clustering tended to produce “dirty” clusters that had too many topics. I tried to fix this via various heuristics (prompting the model to identify dirty clusters and sub-clustering them, or sub-clustering when the summary was too long) but the quality was still pretty sub-par. TL;DR: picking the “k” in kmeans turned out to be pretty non-trivial. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1 id="anecdotes-in-language-model-coherence">Anecdotes in Language Model Coherence</h1>
<p><em>2022-09-16</em></p>
<style>
.green {
background-color: #53c271
}
.yellow {
background-color: #faed39
}
.red {
background-color: #fa4534
}
</style>
<blockquote>
<p>Some sentences in this blog are generated by GPT-3 Davinci. In those cases (and some others) words are highlighted according to how probable GPT-3 thinks they are.</p>
</blockquote>
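<p>(If you’re wondering how the highlighting is done: roughly something like the sketch below, assuming the legacy GPT-3 Completions endpoint, which could echo back per-token logprobs for a prompt. The color thresholds here are made up for illustration.)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
import openai

def color_tokens(text, model="text-davinci-002"):
    resp = openai.Completion.create(
        model=model, prompt=text, max_tokens=0, echo=True, logprobs=0,
    )
    lp = resp["choices"][0]["logprobs"]
    for token, logprob in zip(lp["tokens"], lp["token_logprobs"]):
        if logprob is None:  # the first token has no conditional probability
            continue
        prob = math.exp(logprob)
        css_class = "green" if prob > 0.5 else "yellow" if prob > 0.1 else "red"
        yield token, prob, css_class
</code></pre></div></div>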
<h2 id="sally-is-a-man">Sally Is A Man</h2>
<p>I work at a company that uses language models to write fiction. We often have a hard time with coherence, meaning the model contradicts what was written earlier in the story. In these cases, it pays more attention to “heuristics” that are true in the average world than to what’s true in the story. For example, in the average world, that Sally is a <span class="red">man</span> is quite improbable. More often than not, Sally is a <span class="green">woman</span>. But now that I’ve said it, Sally is a <span class="yellow">man</span>. This is my world, not the average world. Don’t forget, Sally is a <span class="green">man</span>. My friend Sally is a <span class="green">man</span>!</p>
<p>Long-term incoherence becomes a problem when the language model gives more weight to the average world than to what was written in the story. For example, the other day I was eating a bowl of cereal and I spilled some milk on my shirt. I told my friend Sally <span class="green">about it and <strong>she</strong> said, “</span><span class="yellow">That</span><span class="green">’s because you’re a man.”</span></p>
<p>What happened here? In my world, Sally is a man, but “she” just talked to me with 100% probability!</p>
<h2 id="sally-has-no-arms">Sally Has No Arms</h2>
<p>Or consider the case when Sally has no arms. Remember, Sally has no arms!</p>
<p>One time I was running down by McCarren Park. It’s really beautiful this time of year, and the leaves were just starting to change colors. I saw my friend John, and he waved hello. I ran by my friend Sarah, and she waved hello. I ran by my friend Billy, and he waved hello. I saw my friend Sally, <span class="green">and she </span><span class="yellow">waved</span><span class="green"> hello.</span> Of course, Sally has no arms, so I’m not sure how she did that.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
<h2 id="world-models-or-just-heuristics">World Models or “Just” Heuristics?</h2>
<p>When a language model is making predictions, <em>sometimes</em> it’s using a “world model”<sup id="fnref:world" role="doc-noteref"><a href="#fn:world" class="footnote" rel="footnote">2</a></sup> internally. By this I mean there are neural activations in the model representing a physical park in New York, with a neuron firing representing me running down the street, and another few neurons representing Sally standing on the street with a neuron dedicated to whether she has arms. There’s good evidence that language models can have vaguely world model-ish representations from Jacob Andreas’s group at MIT.<sup id="fnref:Andreas" role="doc-noteref"><a href="#fn:Andreas" class="footnote" rel="footnote">3</a></sup> For more behavioral evidence, check out <a href="https://www.gwern.net/GPT-3-nonfiction">Gwern’s blog</a>.</p>
<p>But other times, language model predictions are dominated by simple heuristics. By simple heuristics, I mean “all pronouns following the word ‘Sally’ must come from the set [she, her].” Or, in the second example, identifying and continuing patterns that occur in the document. (Like people waving at me.)<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> There’s also really good evidence of this type of reasoning happening in language models coming from Allyson Ettinger’s group<sup id="fnref:Ettinger" role="doc-noteref"><a href="#fn:Ettinger" class="footnote" rel="footnote">5</a></sup> and others.<sup id="fnref:composition" role="doc-noteref"><a href="#fn:composition" class="footnote" rel="footnote">6</a></sup></p>
<p>It’s probably the case that both modes of reasoning are active to some degree for any particular token, and that their outputs mix to arrive at a final answer.</p>
<h2 id="can-we-make-lms-coherent">Can We Make LMs Coherent?</h2>
<p>Pragmatically (the next year-ish) the answer is to stop expecting language models to output self-consistent text and work around those limitations. It’s currently impossible to know when the model is being “smart” and engaging in an internal world view similar to humans and when it’s being a stupid parrot. If you expect a language model to always make sense, you’re going to be disappointed. However, there are several knobs we can play with that I think might eventually get us to “coherent long-form story generation” over the next few years.</p>
<h3 id="more-data-multi-modal-data">More Data, Multi-Modal Data</h3>
<p>It’s clear to most DL people that as you add more compute and data to these models, you get more world modeling that’s finer-grained and less “dumb parrot” behavior. However, not all language data on the internet is self-consistent. (Gasp!) It’s also unclear whether there’s <a href="https://jacobbuckman.com/2022-06-14-an-actually-good-argument-against-naive-ai-scaling/">enough text</a> on the internet to get a good world model. The world is constantly changing (many LMs still think Trump is president) and text, by nature of being an efficient way of communicating, often omits the kind of information you’d want an LM to learn. (Sally waving implies she has arms.) So there will always be gaps of varying sizes, depending on how much data you have.</p>
<p>IMO this will be much less of a problem as we move towards multi-modal data. An image is worth a thousand words, after all, as evidenced by DALL-E 2 / Imagen and friends already exhibiting remarkable compositional generalization (avocado armchairs, horses riding astronauts, etc.). If you want to write fiction using a model, it seems like the “right” way to do it is to convert the existing text into some multi-modal world representation (video + sound + agents), make predictions about how that world state changes using your exabytes of youtube data, and then convert the result back into text. (Even if that happens implicitly in the model activations.) But maybe I’m committing the classic “planes don’t fly like birds” fallacy here.</p>
<h3 id="fixing-decoding">Fixing Decoding</h3>
<p>Regardless of whether we’re predicting next words or predicting next world states, we’ll probably need to change how we’re generating text using these models (aka decoding). When decoding, you can either maximize the probability of the generated text via search, or you can sample from the probability distribution for each token. The former is known to produce “strangely bland and repetitive text”<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">7</a></sup>, whereas sampling (with some adjustments) can produce more creative, human-like text.</p>
<p>However, achieving “creativity” via occasionally sampling low probability tokens is fundamentally at odds with the goal of intra-document coherence because there are two types of improbable tokens: those that introduce novel information about the world (which is sometimes good), and those that contradict previous writing in a way that’s logically irreconcilable by the average person (which is generally bad). Sally waving despite not having any arms is an example of the second type of improbable.<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">8</a></sup></p>
<p>It feels unprincipled to achieve creativity / novelty by allowing the model to “make mistakes” at a certain frequency and then recover from them. Differentiating between different kinds of improbable tokens seems to be important here, but that’s hard to do without being able to see the LM’s internal world model (or lack thereof).<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">9</a></sup></p>
<h3 id="mechanistic-interpretability">Mechanistic Interpretability</h3>
<p>Which brings us to mechanistic interpretability. The “heuristics” that cause contradictions in the text aren’t abstract, nebulous things. They’re concrete algorithms that are implemented in the bits and bytes of the LM. Researchers at Anthropic (in particular Chris Olah) are already starting to “decompile” simple Transformers into understandable sub-components. One such component is called an <a href="https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html">induction head</a>, and it’s responsible for identifying recurring patterns in the document and making them more probable. (Which is exactly what happened in our “Sally Has No Arms” example.)</p>
<p>You can imagine identifying when the model is engaging in “heuristic” behavior and deliberately knocking out the responsible components. It’s unclear to me, however, what will take over in the absence of heuristics. You can imagine the language model having “suspicions” about the right thing to say but preferring the heuristic because it’s usually the safer bet. In this case, knocking out the heuristic would likely get us the desired behavior. But there’s a good chance that the language model never learned how to do the “right” thing (world modeling) in the first place, because the heuristic is right almost all the time, so why bother?</p>
<h2 id="dont-hold-your-breath">Don’t Hold Your Breath</h2>
<p>Language models are not good at generating consistently coherent fiction because they’re not good world models. Counter-intuitively, I think using language models to generate free-form text without some kind of grounding is actually the worst application for LMs because it seems like it should work well (and sometimes it tricks you into thinking it does) but in reality it’s mostly just the ELIZA effect. I know I’m mostly old-man screaming into the wind here, but it’s important to be frank about the limitations of these models so that we can overcome them. Current LMs are good at transforming semantic information from natural language into other forms (images, code, etc.) and back again, but expecting them to generate semantic content from nothing is just going to cause trouble.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>To be fair, “waved” only has a probability of 35% here. But Davinci still chose to output it because of the way we do decoding, which is discussed in a later section. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:world" role="doc-endnote">
<p>By “world modeling” what I really mean is compositional generalization plus knowledge about entities that actually exist in the world and ways that they can compose. <a href="#fnref:world" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Andreas" role="doc-endnote">
<p>https://arxiv.org/abs/2106.00737, https://www.youtube.com/watch?v=BHQBkN4PyPc <a href="#fnref:Andreas" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>I’d also mention that there’s a good chance that humans also utilize a bunch of these heuristics. I’m sure if I were writing a long novel about Sally, I’d misgender him once or twice on accident. (I even did it in this blog post.) <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Ettinger" role="doc-endnote">
<p>https://aclanthology.org/2021.emnlp-main.119.pdf, https://youtu.be/9tH9Qz1nH3k?t=2234 <a href="#fnref:Ettinger" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:composition" role="doc-endnote">
<p>https://compositionalintelligence.github.io/ <a href="#fnref:composition" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p><a href="https://arxiv.org/abs/1904.09751">https://arxiv.org/abs/1904.09751</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>To be fair, language models are quite good at recovering from what at first appear to be “incoherently” improbable tokens. Consider this high probability generation following the last example: “That’s really sweet of Sally to wave hello, even though she doesn’t have any arms!” <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>I’m pretty bullish on things like <a href="https://arxiv.org/abs/2202.00666">typical decoding</a> that deliberately limit the number of low probability tokens to align with the information content of usual human speech, but that still doesn’t solve the underlying issue of there being different kinds of low probability tokens. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1 id="retro-is-blazingly-fast">RETRO Is Blazingly Fast</h1>
<p><em>2022-07-01</em></p>
<p>When I first read Google’s RETRO paper, I was skeptical. Sure, RETRO models are 25x smaller than the competition, supposedly leading to HUGE savings in training and inference costs. But what about the new trillion-token “retrieval database” they added to the architecture? Surely that must add back some computational costs, balancing the cosmic seesaw?</p>
<p>Apparently not. After running benchmarks for myself, at scale, I am convinced that RETRO is indeed BLAZINGLY fast. RETRO is so fast and cheap, in fact, that I cannot fathom why anyone would choose to do language modeling without retrieval.</p>
<h2 id="retro-overview">RETRO Overview</h2>
<p>To achieve similar performance to bigger models like OpenAI’s GPT-3, RETRO adds an auxiliary “database” of text data, which is queried both during training and inference. This database needs to be HUGE (> 1T tokens!), or else it doesn’t really help.</p>
<p><img src="http://mitchgordon.me/assets/retro_architecture.png" alt="Retro Architecture" /></p>
<p><a href="https://jalammar.github.io/illustrated-retrieval-transformer/">https://jalammar.github.io/illustrated-retrieval-transformer/</a></p>
<p>We’ll see that making and querying this database is orders of magnitude cheaper than training / inference on big neural networks. In this post I’ll briefly describe how the database is constructed and some benchmarks I did while making a database of The Pile, which I’m happy to share <a href="mailto:mitchell.gordon95@gmail.com">by request</a>.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
<p>I used a <a href="https://github.com/latitudegames/RETRO-pytorch">fork of LucidRain’s RETRO-pytorch</a> implementation, which has been modified to handle some scale things like parallelization of jobs. Also thanks to my employer, <a href="https://latitude.io/">Latitude</a>, for giving me the compute to do these experiments.</p>
<h2 id="the-pile">The Pile</h2>
<p>I used The Pile as my benchmark dataset, which is an open-source dataset provided by EleutherAI. It weighs in at around 830 GB of raw text. To get a sense of how much data this is, notice the “Wikipedia” section in the source breakdown below:</p>
<p><img src="http://mitchgordon.me/assets/pile_overview.png" alt="Pile Overview" />
<a href="https://huggingface.co/latitude/RETRO_retrieval">https://arxiv.org/abs/2101.00027</a></p>
<h2 id="building-the-database">Building The Database</h2>
<p>Building a database of The Pile was surprisingly cheap by neural network training standards (~$1k total). It broadly involves three steps:</p>
<ol>
<li>Tokenize the data and split it into chunks of 64 tokens each</li>
<li>Embed the chunks with BERT</li>
<li>Index the embeddings with a MIPS library (FAISS, SCANN, etc.)</li>
</ol>
<p><img src="http://mitchgordon.me/assets/retro_database_prep.png" alt="RETRO Database Prep" /></p>
<h3 id="tokenization">Tokenization</h3>
<p>Tokenization takes around 1.9 min / 1M chunks on your standard CPU core. The Pile ends up being around 5.8B chunks (370B tokens), so that means you’re looking at ~180 hours of CPU time to tokenize, which you can easily parallelize down to only a few hours of wall time.</p>
<p>With a CPU core on the cloud going for around $0.03 / hour, that means you’ll spend less than $10 on tokenization.</p>
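<p>The back-of-the-envelope, in case you want to check my math (all numbers approximate):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>chunks = 5.8e9                       # ~370B tokens / 64 tokens per chunk
cpu_hours = chunks / 1e6 * 1.9 / 60  # 1.9 min per 1M chunks
print(cpu_hours)                     # ~184 CPU hours
print(cpu_hours * 0.03)              # ~$5.50 at $0.03 / CPU-core-hour
</code></pre></div></div>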
<h3 id="embedding">Embedding</h3>
<p>BERT embedding is the most expensive step. On an RTX A5000, BERT embedding takes around 10 minutes per 1M chunks.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> That’s around 1k GPU hours to embed The Pile, which again is very easy to parallelize. This cost around $1k on <a href="https://www.coreweave.com/pricing">Coreweave</a>.</p>
<p>Note that BERT embeddings are around 3 KB each on disk. (768 float32s). 5.8B of them takes up about 16 TB on disk, so watch out for that. (Disk space is cheap.)</p>
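<p>The embedding step itself is nothing fancy. Here’s a rough sketch with HuggingFace Transformers; the fork batches this and spreads it across GPUs, and the model name and mean-pooling here are illustrative rather than the exact setup, so check the repo for details:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
bert = AutoModel.from_pretrained("bert-base-cased").eval().cuda()

@torch.no_grad()
def embed_chunks(chunks):
    """chunks: a list of 64-token text spans. Returns a (len(chunks), 768) float32 array."""
    batch = tokenizer(chunks, padding=True, truncation=True,
                      max_length=64, return_tensors="pt").to("cuda")
    hidden = bert(**batch).last_hidden_state      # (batch, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)  # ignore padding when pooling
    return ((hidden * mask).sum(1) / mask.sum(1)).cpu().numpy()
</code></pre></div></div>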
<h3 id="mips-indexing">MIPS Indexing</h3>
<p>The MIPS index is the reason the RETRO database lookup is so fast. MIPS stands for maximum inner-product search, which means searching a database of vectors for the ones with the highest inner product with (i.e., most similar to) your “query” vector. In RETRO, we use this to look up chunks of text from The Pile that are similar to our input.</p>
<p>Companies like Google and Facebook have been doing MIPS at scale for over a decade, so there’s been a huge amount of research optimizing the heck out of this stuff. Google’s RETRO used their new library, SCANN, but I ended up using the more mature FAISS library from Facebook, which has a near identical implementation of the algorithm used by SCANN.</p>
<p>I tried to get the FAISS configuration as close as possible to what Google used in the RETRO paper. FAISS indices can be built using “factory strings” which specify which types of indices to build and how to compose them. My factory string is <code class="language-plaintext highlighter-rouge">OPQ16_64,IVF1048576_HNSW32,PQ16x4fs</code></p>
<p><img src="http://mitchgordon.me/assets/faiss_index.png" alt="FAISS Index explainer" /></p>
<p>Check out Pinecone’s wonderful <a href="https://www.pinecone.io/learn/faiss-tutorial/">faiss tutorial</a> and <a href="https://www.pinecone.io/learn/composite-indexes/">index factory</a> explainer for more information on the optimization tricks used by FAISS and similar libraries. I also enjoyed <a href="https://mccormickml.com/2017/10/13/product-quantizer-tutorial-part-1/">this tutorial</a> on how Product Quantization works under the hood. There are still some things I could tune here to optimize the speed / accuracy trade-off, but I’ll leave that for future me.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
<h3 id="index-training">Index Training</h3>
<p>One particular trick used by FAISS (the inverted file structure) requires taking a small percentage of the data (64M embeddings) and using them to train the index. On a V100 GPU, this only took around 4 hours, so the cost was negligible.</p>
<p>Once the index is trained, we can add all the embeddings to the index, compressing them for lookup. This takes longer than you’d expect (around 192 CPU hours) but ultimately only represents a cost of <$30.</p>
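<p>Putting the indexing steps together, the FAISS side looks roughly like this (a sketch with placeholder file paths; in practice the training sample and the adds are sharded and parallelized):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import glob
import faiss
import numpy as np

index = faiss.index_factory(768, "OPQ16_64,IVF1048576_HNSW32,PQ16x4fs")

# Train on a small sample (~64M embeddings) to learn the OPQ rotation + IVF centroids.
train_sample = np.load("embeddings/train_sample.npy").astype("float32")
index.train(train_sample)

# Then stream all 5.8B embeddings through index.add(), which compresses them for lookup.
for path in sorted(glob.glob("embeddings/shard_*.npy")):
    index.add(np.load(path).astype("float32"))

faiss.write_index(index, "pile.index")
</code></pre></div></div>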
<h2 id="querying-the-database">Querying the Database</h2>
<p>Now that we’ve built the database, how long does it take to query it? Personally, I would have been happy with anything < 100ms, since that would have represented a marginal increase in existing generation times. For reference, here’s how long it takes to generate around 50 tokens with various language models:</p>
<ul>
<li>GPT-J (6B): ~3s</li>
<li>AI21 Grande (17B): ~4s</li>
<li>GPT-NeoX (20B): >4s</li>
<li>AI21 Jumbo (175B): ~6.5s (x ~6 GPUs)</li>
</ul>
<p>In practice, our FAISS index takes between <strong>2 and 40 ms</strong>,<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> based on my manual testing. That’s… really fast. Embedding the query with BERT takes an additional 10 ms on a CPU. Altogether, <strong>the cost of querying the database during inference and training has a totally negligible impact on total cost.</strong></p>
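<p>A query is then just an embed plus a search. A sketch, reusing the <code class="language-plaintext highlighter-rouge">embed_chunks</code> helper from the embedding section above, with k and nprobe matching the settings in the footnote:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time
import faiss

index = faiss.read_index("pile.index")
faiss.extract_index_ivf(index).nprobe = 5  # how many IVF lists to visit per query

query_vecs = embed_chunks(["The old man wept, for he knew that his end had come."])

start = time.time()
distances, neighbor_ids = index.search(query_vecs, 5)  # ids of the 5 nearest chunks
print(f"search took {(time.time() - start) * 1000:.1f} ms")
</code></pre></div></div>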
<h3 id="qualitative-results">Qualitative Results</h3>
<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>query: The old man wept, for he knew that his end had come. The waves of time washed over him.
result 1: she faded from them, as the bright snow, that none may keep, melts in our very hands. A murmur of farewell came to his ears, - - no more. She was gone. He would have followed, but Charon, now on guard, drove him back. Seven days he lingered there between the worlds
result 2: but as I tarried? And when I could no more, I did go, and I did stay, and I did steward. Stayed at the station. The ravens did raven. The steward did steward. But one thing mattered. The Spirit did Spirit. And the word remained. For
</code></pre></div></div>
<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>query: In today's news, Miley Cyrus was caught shoplifting from a clothing store on Hollywood Boulevard.
result 1: ##s in Texas. The child, whose name was not released, boarded the Techno Jump Ride with her 8 - year - old brother at the RodeoHouston carnival around 2 p. m. Wednesday, according to local affiliate KTRK. RodeoHouston is a popular local attraction. Witnesses told
result 2: [CLS] Is this the worst airplane loader in the world? Proof can be found in a year - old YouTube video that just surfaced via Reddit. In it, an unidentified freight handler can be seen haphazardly tossing packages from a flat bed onto a conveyor belt at China's Guangzhou Airport. Capt
</code></pre></div></div>
<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>query: Hey Betty! Thanks for getting back to my email. Are we still on for Saturday?
result 1: 20 AM I just recd. an email from gary sinclair and it got me thinking about all the great people and good freinds of VR - 24. I know a few of you have emailed me in the past and I didnt respond but I will to all future emails. After
result 2: starmail. com Subject : oops Soz babe didnt mean to sned that!!!! Was trying to email a mate on my phone and been drinkin ps hop u r ok I close the laptop and I sit for a long time in silence. As I do, I examine the happy, laughing
</code></pre></div></div>
<h3 id="the-hidden-cost-of-cpu-ram">The Hidden Cost of CPU RAM</h3>
<p>The FAISS index is not totally cost free. The index itself ends up being big, requiring around 176 GB of RAM to query, which costs about $0.88 per hour on your average cloud provider.</p>
<p>However, this allows you to drastically reduce your GPU usage. Say, for example, you need 5 GPUs running in parallel to do inference on a 175B parameter model, which costs around $6 an hour. By adding an extra $0.88 / hour in CPU RAM, you can reduce the number of GPUs you have to run to just 1, saving around $5 / hour in GPU costs. I’d take that trade any day.</p>
<p>This also applies to models that are already using a single GPU. By shrinking your model with RETRO’s database, requests get served faster, meaning more GPU bang for your buck. Instead of serving 60 req / hour on a single GPU, you’re serving 600+, just for a little extra CPU RAM.</p>
<p>Update (7/6/22) - I’ve been informed that FAISS has the ability to memory map an index, which allows you to read it directly from disk instead of allocating RAM for it. This is slightly slower, of course, but probably worth the trade. (Thanks rom1504.)</p>
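<p>For reference, the memory-mapped load is a one-flag change (assuming a reasonably recent FAISS):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import faiss

# Reads index data lazily from disk instead of allocating ~176 GB of RAM up front.
index = faiss.read_index("pile.index", faiss.IO_FLAG_MMAP)
</code></pre></div></div>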
<h2 id="conclusion">Conclusion</h2>
<p>At first I was skeptical, but upon closer inspection it seems like RETRO is indeed a HUGE cost savings over existing LM approaches. These cost savings seem to boil down to the fact that MIPS is super optimized by existing libraries and only requires more CPU RAM to use. Based on these observations, I can’t imagine why anyone doing language modeling in production would choose to do it without retrieval.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>I tried uploading some of it to Huggingface, but even the compressed FAISS index file exceeded the max 50 GB file size. The tokens themselves are over 1.5 TB. Feel free to shoot me an email and I’ll get you a copy. Update (5/11/23) - I no longer work at Latitude and therefore no longer have access to this index. Sorry! <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Naively, I didn’t do much optimization here. I suspect the bottleneck is probably getting data off disk to the GPU, not the computation speed. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Specifically I’m not certain we need to be so aggressive with the dimensionality reduction during pre-processing. (768 dims → 64.) Because of the way PQ works, I’m pretty sure I could get away with less dimensionality reduction and improve accuracy. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>For k=5, with the IVF nprobe also set to 5. (Which seems to be a standard setting, but could be tuned to trade speed / accuracy.) <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1 id="do-infinite-pencils-exist">Do Infinite Pencils Exist??</h1>
<p><em>2022-02-01</em></p>
<p>Does infinity exist? This <a href="https://mathoverflow.net/a/23521">comment</a>
from MathOverflow ruffled my feathers a bit:</p>
<blockquote>
<p>I’ve heard a worse story. A college instructor claimed in Number
Theory class that there are only finitely many primes. When
confronted by a student, her reply was: “If you think there are
infinitely many, write them all down.” She was on tenure track, but
need I add, didn’t get tenure.</p>
</blockquote>
<p>What a dumb teacher, right? Everyone knows there are an infinite number
of primes! Haven’t you heard of Euclid?</p>
<p>Except I think she’s right. (Depending on what she meant, exactly.)</p>
<p>I <em>believe</em> this professor was attempting to make a subtle point that
many students of mathematics tend to
miss. That is, they mistakenly believe that <strong>infinity</strong> actually
exists. Like, in the real world.</p>
<p>If I tell someone to imagine an “infinite number of pencils,” usually
they picture a <em>bunch</em> of pencils. Like, more than they could ever
count. An empire state building made out of pencils. An ocean full of
pencils. Mars. But pencils.</p>
<p>That’s not really what <strong>infinity</strong> means, mathematically. When a mathematician says
<strong>infinity</strong> they mean a <em>repeatable process that we can keep doing forever.</em></p>
<p>For example, let’s imagine I’m standing at the blackboard and
I ask the class to give me a pencil. Now I have 1 pencil. That’s our
repeatable process. I do it a second time, and I have 2 pencils. A
third time, and three. And so on and so forth.</p>
<p>Imagine I never stop asking for pencils. Ever. How many pencils do I
have, at that indeterminate point in the future? A mathematician would
say an <strong>infinite</strong> number of pencils.</p>
<p>But you see, I’d never <em>really</em> get to an “infinite” quantity of
pencils. <strong>Infinity</strong> is purely a work of imagination. Eventually I’ll
have to stop asking for pencils. I’ll get hungry, or get old or
die. Or we’ll run out of wood or something. Or we’ll exhaust all the
matter in the universe and there will be nothing but pencils floating
around in the vastness of space. Then who would ask for the pencil,
and who would give it?<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
<p>Anyway, back to the primes. Yes, the student is correct in that there
is an infinite number of primes. (And by that we mean there is a
repeatable process<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> for generating primes which we can repeat until the
heat death of the universe.)</p>
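<p>If you want the repeatable process spelled out, here’s one (a Python sketch using plain trial division rather than Euclid or Saidak, but the spirit is the same):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def next_prime(known):
    """Given all primes found so far (in order), produce the next one."""
    candidate = (known[-1] if known else 1) + 1
    while any(candidate % p == 0 for p in known):
        candidate += 1
    return candidate

primes = []
for _ in range(10):  # in principle: while True, until the universe runs out
    primes.append(next_prime(primes))
print(primes)        # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
</code></pre></div></div>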
<p>But the teacher is also right, in her way. If we built a computer to
count all the primes, it would eventually run out of memory. Even if
we turned the whole universe into a computer, eventually we’d run out
of stars and junk to fuel our prime-counting computer. Thus the number of primes we’ll ever be able to count is finitely bounded by the size of our universe.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
<p>That’s… funny? Sad? I don’t know. Why did I even write this?</p>
<p>- Mitchell</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>If infinity doesn’t really exist, then why do we talk about it so much? Because it’s a useful approximation of reality. Of course, it doesn’t make sense to ask, “if I keep asking for pencils how many pencils will I have?” But suppose that on the first ask I take a whole pencil, then 1/4th of a pencil, then 1/9th, and in general \(1/n^2\) of a pencil on the \(n\)-th ask, for ever and ever. How many pencils would I have? The answer is that I will get real close to 1.645 pencils (\(\pi^2/6\)), but never more than that. Why’s that useful? Well, ever try to build a rocket? No? Me neither. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Euclid’s proof is famously <a href="https://en.wikipedia.org/wiki/Euclid%27s_theorem#Euclid's_proof">not constructive</a>, so it doesn’t directly give us a method for constructing a new prime. I prefer the <a href="https://en.wikipedia.org/wiki/Euclid%27s_theorem#Proof_by_construction">proof</a> by Filip Saidak that does. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>I wonder how many digits it has? <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1 id="redefining-sota">Redefining SOTA</h1>
<p><em>2021-08-31</em></p>
<p>In the machine learning research community, achieving state-of-the-art usually
means reporting a single score (percentage accuracy or F1) on a public research
dataset. There are two legitimate reasons to report a “SOTA score” in a research
paper, besides gaming the system.<sup id="fnref:reviewing" role="doc-noteref"><a href="#fn:reviewing" class="footnote" rel="footnote">1</a></sup></p>
<ol>
<li>
<p>A SOTA score may signal to the community that you have “solved” a
task that was previously unsolved (like <a href="https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology">protein
folding</a>).</p>
</li>
<li>
<p>A SOTA score may signal to the community that your new method
is the “best” method to solve the task, and that the rest of the
community (in both academia and industry) should adopt your method as
the new standard.</p>
</li>
</ol>
<p>However, a SOTA score in today’s context accomplishes neither of those goals.
Because of the way many benchmark datasets are constructed, a high test score
(even surpassing human performance) is
unlikely to mean that the model is ready for real-world deployment or that the
task is “solved.” Furthermore, the ability of neural methods to predictably
improve performance with scale means that a single SOTA score is not enough
information to decide whether one neural method is better than another.</p>
<p>In light of these two observations (underspecification and neural scaling laws),
I think the ML community needs to redefine SOTA. Below, I’ll review some of the
literature surrounding underspecification and neural scaling laws, and then make
some suggestions about new “metrics for success” that we should adopt as a
community.</p>
<h2 id="underspecification-the-task-is-not-solved">Underspecification: The Task Is Not Solved</h2>
<p>In the early days of machine learning, task performance was often
associated with accuracy on a single dataset. “Solving” hand-written
digit recognition meant achieving a high accuracy on MNIST, and the
Penn Treebank was the gold standard for part-of-speech tagging in
natural language processing. However, as the field matured we began
meeting the goals we set for ourselves, and we quickly understood that
solving the task is not the same as solving the benchmark.</p>
<p>I first experienced this when BERT broke the General
Language Understanding Benchmark<sup id="fnref:leaderboard" role="doc-noteref"><a href="#fn:leaderboard" class="footnote" rel="footnote">2</a></sup>, well surpassing
human-level performance. Many linguists appropriately asked: does this
mean we’ve solved language understanding? The answer was a resounding
no. Many papers since have been dedicated to all the ways BERT can be
wrong or, worse, right for the wrong reasons.<sup id="fnref:cleverhans" role="doc-noteref"><a href="#fn:cleverhans" class="footnote" rel="footnote">3</a></sup>
<sup id="fnref:generalization" role="doc-noteref"><a href="#fn:generalization" class="footnote" rel="footnote">4</a></sup> Many papers pointed out that BERT (as well as
later models) can rely on spurious correlations in the data and
demonstrated that small, meaningless input perturbations could lead to
incorrect answers.<sup id="fnref:gardner" role="doc-noteref"><a href="#fn:gardner" class="footnote" rel="footnote">5</a></sup> This is analogous to adversarial examples
in image recognition, where adding a small amount of noise can change
a correct label to an incorrect label.<sup id="fnref:bugsfeatures" role="doc-noteref"><a href="#fn:bugsfeatures" class="footnote" rel="footnote">6</a></sup></p>
<p>Evaluation datasets are often not powerful enough to differentiate
between a model which generalizes and a model which relies on spurious
correlations. They may also lack sufficient coverage, such that a high
test score obscures flaws in the model that would cause problems
in production, such as racial/gender bias<sup id="fnref:isbell" role="doc-noteref"><a href="#fn:isbell" class="footnote" rel="footnote">7</a></sup> or susceptibility
to attack.<sup id="fnref:poisoning" role="doc-noteref"><a href="#fn:poisoning" class="footnote" rel="footnote">8</a></sup> In a recent paper, Google researchers called this problem
“underspecification,”<sup id="fnref:underspecification" role="doc-noteref"><a href="#fn:underspecification" class="footnote" rel="footnote">9</a></sup> and pointed out several examples across
the company in which models achieve similar test scores but exhibit widely
divergent behaviors when deployed in production. They show that this is a
distinct problem from domain-shift, in which the test data distribution is
different from the training distribution.</p>
<h3 id="robust-evaluation">Robust Evaluation</h3>
<p>One fix for the problem of underspecification is just to “make the
dataset better.” Some interesting work in this direction:</p>
<ul>
<li><a href="https://aclanthology.org/2020.acl-main.442/">Beyond Accuracy: Behavioral Testing of NLP Models with CheckList</a></li>
</ul>
<p>Inspired by “unit tests” in traditional software engineering,
Checklist is a framework for testing NLP models along many directions
of “linguistic proficiency” by augmenting test examples with
deterministic transformations. Examples include negating verbs and
replacing nouns in sentences with novel nouns.</p>
<ul>
<li><a href="https://leaderboard.allenai.org/orb/submissions/get-started">Open Reading Benchmark</a></li>
</ul>
<p>A suite of NLP datasets from the Allen Institute for AI. Some datasets
are constructed to target specific language capabilities. For example,
<a href="https://allennlp.org/drop">DROP</a> involves performing discrete
reasoning (adding, sorting, counting) over many paragraphs of text.</p>
<ul>
<li><a href="https://aclanthology.org/2021.naacl-demos.6/">Robustness Gym</a></li>
</ul>
<h3 id="tail-chasing">Tail Chasing</h3>
<p>“Tail-chasing” is an attempt at making models more robust to items in the long-tail of the dataset, such as <a href="https://aclanthology.org/2021.naacl-demos.6/">rare words</a> or images. Some other work in this direction:</p>
<ul>
<li><a href="https://arxiv.org/abs/1911.05248">What Do Compressed Deep Neural Networks Forget?</a></li>
<li><a href="https://twitter.com/Nils_Rethmeier/status/1344730606807224322">CLESS: Contrastive Label Embedding Self-supervised Zero to Few-shot Learning from and for Small, Long-tailed Text Data</a></li>
<li><a href="https://bair.berkeley.edu/blog/2019/05/13/oltr/">Large-Scale Long-Tailed Recognition in an Open World</a></li>
<li><a href="https://arxiv.org/abs/1911.00172">Generalization through Memorization: Nearest Neighbor Language Models</a></li>
<li><a href="https://arxiv.org/abs/1906.05271">Does Learning Require Memorization? A Short Tale about a Long Tail</a></li>
</ul>
<h2 id="neural-scaling-which-method-is-best">Neural Scaling: Which Method Is “Best”?</h2>
<div style="text-align: center">
<img src="http://mitchgordon.me/assets/scaling-laws.png" />
<a href="https://arxiv.org/abs/2001.08361">Scaling Laws for Neural Language Models</a>
</div>
<p><br /></p>
<p>Which neural architecture achieves SOTA on a task depends entirely on
the amount of data and compute provided to the architecture. As shown
above, performance scales like a power law with data, compute, and
parameters. This has now been demonstrated for many data domains,
modalities, and neural architectures.<sup id="fnref:autoregressive" role="doc-noteref"><a href="#fn:autoregressive" class="footnote" rel="footnote">10</a></sup></p>
<p>This means you can make any neural architecture SOTA if you’re willing
to spend enough money pouring resources into it. A single SOTA score
is not expressive enough to capture this behavior. Consider the
following graph showing the performance of machine translation
methods with varying amounts of data:<sup id="fnref:koehn" role="doc-noteref"><a href="#fn:koehn" class="footnote" rel="footnote">11</a></sup></p>
<div style="text-align: center">
<img src="http://mitchgordon.me/assets/neural-vs-stat-mt.png" style="height: 300px" />
</div>
<p>Purely neural methods out-perform other methods when given enough
data. However, in low-resource regimes, they fail miserably compared
to phrase-based approaches. Depending on how big our training dataset
is, a SOTA score might lead us to dramatically different
conclusions. A small dataset might give us the impression that
phrase-based methods are “better”, whereas a large dataset would lead
us to believe neural methods are “better.”</p>
<p>The reality is more nuanced: phrase-based methods have inductive
biases that make them better in low-resource scenarios, whereas neural
methods scale better with data. And this simple situation doesn’t even
take into account the amount of money spent on compute while training
each method, and whether that was a bottleneck for either method.</p>
<h3 id="sota-scaling-not-sota-scores">SOTA Scaling, not SOTA Scores</h3>
<p>This implies that beating a benchmark dataset is no longer newsworthy
(i.e. worthy of publication). Anyone can get a SOTA score if they
invest enough money in procuring the data / compute required to get
there. What is newsworthy is if you improve the money-to-performance
trade-off. That could save <a href="https://twitter.com/qinyuan_ye/status/1345258852644573185">billions of
parameters</a>
or millions of dollars!</p>
<p>In other words, because of neural scaling laws, nearly everyone in ML
is working on machine learning efficiency at this point (either
compute efficiency or sample efficiency), but no one is measuring
success that way!! That’s why ML reviewing feels so broken
lately. Here’s a few things we could do right now:</p>
<ul>
<li>
<p>Any paper proposing a new “SOTA” neural method needs to report not
just the data / compute used to achieve SOTA, but the score achieved
at several points of data/compute. The slope of the curve
should be better than all other known methods. SOTA scaling is the
objective, not SOTA scores. (A sketch of this kind of comparison follows this list.)</p>
</li>
<li>
<p>Benchmarks should release pre-determined dataset splits of various
sizes, to help fairly measure the sample complexity curves of new
methods.</p>
</li>
<li>
<p>Compute / parameters should be measured via a standardized platform, like
<a href="https://www.nvidia.com/en-us/data-center/mlperf/">MLPerf</a>. (But
perhaps more streamlined to compare new neural architectures.)</p>
</li>
</ul>
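<p>To make the first suggestion concrete, here’s the kind of comparison I have in mind: fit each method’s compute-vs-error points with a power law and compare the fits, rather than comparing single best scores. (A sketch with invented numbers; a real comparison would also need error bars and a standardized compute measurement.)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def fit_power_law(compute, error):
    """Fit error ~ a * compute^b by regressing log(error) on log(compute)."""
    b, log_a = np.polyfit(np.log(compute), np.log(error), deg=1)
    return np.exp(log_a), b  # b < 0; more negative means better scaling

compute = np.array([1e18, 1e19, 1e20, 1e21])   # training FLOPs (made up)
method_a = np.array([0.40, 0.31, 0.24, 0.19])  # test error of method A
method_b = np.array([0.35, 0.29, 0.25, 0.21])  # method B wins at small scale...

print(fit_power_law(compute, method_a))  # ...but A has the steeper (better) slope
print(fit_power_law(compute, method_b))
</code></pre></div></div>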
<p>A lot of people complain ML reviewing is broken. I tend to agree. But
I also believe that it’s possible to get our act together, as long as
we all agree on a paradigm for evaluating new approaches. I think
scaling laws, accompanied by robust and strengthened evaluation
methods, can help fill that role.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:reviewing" role="doc-endnote">
<p>The
<a href="http://mitchgordon.me/logistics/2019/09/14/improving-ml-conferences.html">deluge</a>
of papers submitted to machine learning conferences has led to a
shortage of quality reviewers who resort to heuristics like
<a href="https://hackingsemantics.xyz/2020/reviewing-models/">“reject if not
SOTA.”</a>
Therefore, many researchers frame their papers as “SOTA score”
papers to <a href="https://arxiv.org/abs/2003.14415">boost</a> chances of
acceptance, even when the paper would be better formulated as a
scientific endeavor (or when the paper would not otherwise meet
conference standards). Some conferences have started
<a href="https://www.aclweb.org/adminwiki/index.php?title=Short-Term_Reform_Proposals_for_ACL_Reviewing">trying</a>
to <a href="https://2020.emnlp.org/blog/2020-05-17-write-good-reviews">fix
this</a>,
but progress is slow. <a href="#fnref:reviewing" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:leaderboard" role="doc-endnote">
<p>“How the Transformers Broke NLP Leaderboards.” 2019. June 30, 2019. https://hackingsemantics.xyz/2019/leaderboards/. <a href="#fnref:leaderboard" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:cleverhans" role="doc-endnote">
<p>Heinzerling, Benjamin. n.d. “NLP’s Clever Hans Moment Has Arrived.” Accessed December 30, 2020. https://bheinzerling.github.io/post/clever-hans/. <a href="#fnref:cleverhans" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:generalization" role="doc-endnote">
<p>Marasović, Ana. 2018. “NLP’s Generalization Problem, and How Researchers Are Tackling It.” The Gradient. August 22, 2018. https://thegradient.pub/frontiers-of-generalization-in-natural-language-processing/. <a href="#fnref:generalization" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:gardner" role="doc-endnote">
<p>“Fall 2019 Natural Language Processing: Matt Gardner (AI2 Irvine).” 2019. Youtube. December 31, 2019. https://www.youtube.com/watch?v=k7d_Nnv_shw. <a href="#fnref:gardner" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:bugsfeatures" role="doc-endnote">
<p>Ilyas, Andrew, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. 2019. “Adversarial Examples Are Not Bugs, They Are Features.” arXiv [stat.ML]. arXiv. http://arxiv.org/abs/1905.02175. <a href="#fnref:bugsfeatures" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:isbell" role="doc-endnote">
<p>“NeurIPS 2020 : You Can’t Escape Hyperparameters and Latent Variables: Machine Learning as a Software Engineering Enterprise.” n.d. Accessed December 30, 2020. https://nips.cc/virtual/2020/public/invited_16166.html. <a href="#fnref:isbell" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:poisoning" role="doc-endnote">
<p>Chen, Xinyun, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. 2017. “Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning.” arXiv [cs.CR]. arXiv. http://arxiv.org/abs/1712.05526. <a href="#fnref:poisoning" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:underspecification" role="doc-endnote">
<p>D’Amour, Alexander, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, et al. 2020. “Underspecification Presents Challenges for Credibility in Modern Machine Learning.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2011.03395. <a href="#fnref:underspecification" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:autoregressive" role="doc-endnote">
<p>https://arxiv.org/abs/2010.14701 <a href="#fnref:autoregressive" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:koehn" role="doc-endnote">
<p>Koehn, Philipp, and Rebecca Knowles. 2017. “Six Challenges for Neural Machine Translation.” In Proceedings of the First Workshop on Neural Machine Translation, 28–39. Stroudsburg, PA, USA: Association for Computational Linguistics. <a href="#fnref:koehn" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1 id="why-is-vscode-typescript-linting-so-damn-fast">Why Is VSCode Typescript Linting So Damn Fast?</h1>
<p><em>2021-06-28</em></p>
<p>I’ve been using VSCode as my de facto Typescript IDE for the last few
months. I’m a heavy Emacs user, however, and it was only a
matter of time until I attempted to get a similar experience via Emacs
configuration, thereby continuing my quest to use Emacs as the sole
interface to my computer.</p>
<p>I took the plunge last weekend, thinking I’d start with something
“simple” like getting linting with eslint to work. Little did I know…</p>
<h2 id="batteries-not-included">Batteries (Not) Included</h2>
<p>When you think of linting in Emacs, you immediately think of
<a href="https://www.flycheck.org/en/latest/">Flycheck</a>. Luckily for me (I
thought), Emacs flycheck has eslint support built in. So
theoretically, all I have to do is add</p>
<div class="language-elisp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">flycheck-mode</span> <span class="mi">+1</span><span class="p">)</span>
</code></pre></div></div>
<p>to my Emacs config and I’m good to go. As I’m editing files, this
package will call eslint asynchronously as a shell command and report
back the results.</p>
<p>As I’m editing, however, I notice results are coming back with a lot
of lag. Like, 5-10 seconds of lag. Huh. Is this an Emacs thing or an
eslint thing? So I pop open a terminal and do</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cd </span>my_proj/
<span class="nv">$ </span><span class="nb">time </span>eslint my_file.ts
real 0m6.684s
user 0m9.321s
sys 0m0.821s
</code></pre></div></div>
<p>And sure enough, the linter takes 9 seconds to complete. But I was
just in VSCode editing this same exact file, and linting was happening
almost instantaneously… what gives??!!</p>
<h2 id="linting--type-checking--compiling">Linting => Type Checking => Compiling</h2>
<p>Turns out, most people won’t run into this slowness unless they enable
certain eslint options in <code class="language-plaintext highlighter-rouge">.eslintrc.js</code>. Ours happens to look
something like this</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">module</span><span class="p">.</span><span class="nx">exports</span> <span class="o">=</span> <span class="p">{</span>
<span class="na">parser</span><span class="p">:</span> <span class="dl">'</span><span class="s1">@typescript-eslint/parser</span><span class="dl">'</span><span class="p">,</span>
<span class="na">plugins</span><span class="p">:</span> <span class="p">[</span><span class="dl">'</span><span class="s1">import</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">@typescript-eslint</span><span class="dl">'</span><span class="p">],</span>
<span class="na">parserOptions</span><span class="p">:</span> <span class="p">{</span>
<span class="na">ecmaVersion</span><span class="p">:</span> <span class="mi">2018</span><span class="p">,</span>
<span class="na">sourceType</span><span class="p">:</span> <span class="dl">'</span><span class="s1">module</span><span class="dl">'</span><span class="p">,</span>
<span class="na">project</span><span class="p">:</span> <span class="dl">'</span><span class="s1">./tsconfig.json</span><span class="dl">'</span><span class="p">,</span>
<span class="p">}</span>
<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
<p>You’ll notice we have <code class="language-plaintext highlighter-rouge">parser</code> set, which tells eslint to use a
<a href="https://github.com/typescript-eslint/typescript-eslint">plug-in</a> to
read the typescript syntax. We also have <code class="language-plaintext highlighter-rouge">parserOptions.project</code> set
to <code class="language-plaintext highlighter-rouge">./tsconfig.json</code>, which tells our parser to include type
information in the parse, so we can use it to make type-aware eslint
rules.</p>
<p>Unfortunately, enabling this option is
<a href="https://github.com/typescript-eslint/typescript-eslint/blob/master/docs/getting-started/linting/TYPED_LINTING.md">known</a>
to be <a href="https://github.com/typescript-eslint/typescript-eslint/issues/243">very
slow</a>
since it requires compiling the entire typescript project to get the
type information. This lines up with my experience, since doing</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cd </span>my_proj
<span class="nv">$ </span><span class="nb">time </span>tsc
real 0m4.968s
user 0m8.181s
sys 0m0.689s
</code></pre></div></div>
<p>takes around 8 seconds of CPU time. So that’s roughly 8 seconds spent compiling the project,
and < 1 second spent running the actual linter rules. And as expected,
removing <code class="language-plaintext highlighter-rouge">parserOptions.project</code> speeds up eslint to under 1 second.</p>
<h2 id="the-difference-that-makes-the-difference">The Difference That Makes the Difference</h2>
<p>So why is VSCode so fast?</p>
<p>Linting in VSCode is done by the <a href="https://github.com/microsoft/vscode-eslint">ESLint
Extension</a>, which claims
to just call eslint in the background. Thinking that couldn’t possibly
be true, I dug into the source code.</p>
<p>Turns out they <strong>do</strong> call eslint, just not from the shell. Instead, they
import eslint into a node process and call it repeatedly as the file
changes. I suspected this meant that the process was able to cache
the AST coming back from <code class="language-plaintext highlighter-rouge">tsc</code>, and only compile the parts that
changed. I put together a little proof-of-concept and what do you
know…</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">eslint</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">my_proj/node_modules/eslint/lib/api.js</span><span class="dl">'</span><span class="p">)</span>
<span class="nx">cli</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">eslint</span><span class="p">.</span><span class="nx">CLIEngine</span><span class="p">({</span> <span class="na">cwd</span><span class="p">:</span> <span class="dl">'</span><span class="s1">my_proj</span><span class="dl">'</span> <span class="p">})</span>
<span class="c1">// This is slow, takes ~9 seconds</span>
<span class="nx">cli</span><span class="p">.</span><span class="nx">executeOnFiles</span><span class="p">([</span><span class="dl">'</span><span class="s1">my_proj/my_file.ts</span><span class="dl">'</span><span class="p">]).</span><span class="nx">results</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nx">messages</span>
<span class="c1">// This is fast, one second at most</span>
<span class="nx">cli</span><span class="p">.</span><span class="nx">executeOnFiles</span><span class="p">([</span><span class="dl">'</span><span class="s1">my_proj/my_file.ts</span><span class="dl">'</span><span class="p">]).</span><span class="nx">results</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nx">messages</span>
<span class="c1">// This is still fast, even if we change the underlying file to introduce a previously unseen linter error</span>
<span class="nx">introduceLintError</span><span class="p">(</span><span class="dl">'</span><span class="s1">my_proj/my_file.ts</span><span class="dl">'</span><span class="p">)</span>
<span class="nx">cli</span><span class="p">.</span><span class="nx">executeOnFiles</span><span class="p">([</span><span class="dl">'</span><span class="s1">my_proj/my_file.ts</span><span class="dl">'</span><span class="p">]).</span><span class="nx">results</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nx">messages</span>
<span class="c1">// And it's fast on other files we haven't loaded before</span>
<span class="nx">cli</span><span class="p">.</span><span class="nx">executeOnFiles</span><span class="p">([</span><span class="dl">'</span><span class="s1">my_proj/other_file.ts</span><span class="dl">'</span><span class="p">]).</span><span class="nx">results</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nx">messages</span>
</code></pre></div></div>
<h2 id="the-fix">The Fix?</h2>
<p>My first intuition was to just stop calling eslint from the
shell. Instead, I imagined I could do something like this:</p>
<ol>
<li>Write a short node server that receives HTTP POST requests
containing filenames to lint. It would then call out to
<code class="language-plaintext highlighter-rouge">eslint.CLIEngine</code> and return the lint errors as JSON.</li>
<li>Write a flycheck checker that would just curl the endpoint.</li>
</ol>
<p>As soon as I wrote down that plan, however, I realized I was basically
describing an eslint language server. So I looked up if Emacs lsp-mode
had a de facto eslint language server and, surprise surprise, they pointed
me to the <a href="https://emacs-lsp.github.io/lsp-mode/page/lsp-eslint/">VSCode
ESlint</a>
extension.</p>
<p>So what I was looking for was with me the whole time. All I
needed to do was add <code class="language-plaintext highlighter-rouge">lsp-mode</code> to my config</p>
<div class="language-elisp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">use-package</span> <span class="nv">lsp-mode</span>
<span class="ss">:ensure</span> <span class="no">t</span><span class="p">)</span>
</code></pre></div></div>
<p>and then install the eslint server with <code class="language-plaintext highlighter-rouge">M-x lsp-install-server</code>. And
voilà, we have lightning fast linting, just like VSCode.</p>
<h2 id="epilogue">Epilogue</h2>
<p>The language server fix is perfectly acceptable. In fact, a
standardized LSP syntax checker seems like a logical successor to the
Flycheck framework for linting.</p>
<p>It seems to me, though, that we could have made the flycheck version
work. Whatever caching the parser is doing could reasonably be
serialized to disk between calls to the eslint command line tool. I
imagine we could add a setting like <code class="language-plaintext highlighter-rouge">parserOptions.cacheASTFname</code> that
would tell eslint where to store that information. This would be in
line with the behavior of other caching options, like the built-in
<code class="language-plaintext highlighter-rouge">--cache</code>.</p>
<h2 id="do-i-hate-emacs">Do I Hate Emacs?</h2>
<p>No, I still like Emacs. And despite this little saga taking me the
better part of three days, I’m happy to have the tools to get to the
root of the problem and fix it myself.</p>I’ve been using VSCode as my defacto Typescript IDE for the last few months. I’m a heavy Emacs user, however, and it was only a matter of time until I attempted to get a similar experience via Emacs configuration, thereby continuing my quest to use Emacs as the sole interface to my computer.My Struggle With Probability Theory2021-04-02T00:00:00+00:002021-04-02T00:00:00+00:00http://mitchgordon.me/math/2021/04/02/probability<blockquote>
<p>TL;DR - the fundamental assumption of probability theory is one of ignorance. This assumption is too easy to break in most contexts and leads to unfounded confidence in conclusions.</p>
</blockquote>
<p>There are many circumstances in which uncertainty is warranted.<sup id="fnref:uncertainty" role="doc-noteref"><a href="#fn:uncertainty" class="footnote" rel="footnote">1</a></sup>
Gas temperature measurements, weather forecasts, horse races, coin flips, and
clinical trials all have some uncertainty involved. Probability theory is the
science that finds commonality among these seemingly disconnected phenomena. We
can observe, for example, that the sum of many independent, “repeatable” random events,
properly normalized, begins to look like a Gaussian distribution (aka the
central limit theorem). We can notice common shapes in the histograms of these
repeatable experiments, such as “fat-tailed” or “power law” distributions. And
if the event is not repeatable, we can at least apply the rules of probability
theory to avoid inconsistencies in our thinking (which would allow a savvy
adversary to take advantage of us when gambling).</p>
<p>However, I believe there are tasteful and distasteful applications of
probability theory. This is because the application of probability to a
particular event requires a suspension of disbelief. To consider an event as
repeatable and iid is to accept that the causal factors driving the outcome are
(practically) unobservable and therefore ignorable. In effect, it means giving
up on deeply understanding a causal explanation of the phenomena and instead
sweeping the details under the rug of “the distribution.”</p>
<p>This makes probability theory the science of last resort. Only after truly
exhausting your ability to investigate causal factors and processes should you
indulge in probabilistic thinking. Doing otherwise is a cop-out, one that
dangerously <em>feels</em> “scientific.”</p>
<h3 id="examples-of-distaste">Examples of Distaste</h3>
<p>The tastefulness of a particular application of probability theory is a matter
of context. Consider the humble coin flip. A lay-person may reasonably assume
that this event is a repeatable, iid experiment with a uniform prior; they may
be working with a variety of coins, fingers, and surfaces, and may not have
equipment available to make precise measurements. To the physicist, however,
this is obviously a cop-out. The physicist knows that the coin’s trajectory can
be precisely captured by the laws of classical mechanics, and therefore
predicted with almost certainty.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">2</a></sup> Where the average person gives up and
shrugs, the scientist continues searching for explanations.</p>
<p>Similar things can be said about drawing a card from a deck of cards. A casual
observer may reasonably assign uncertainty to the event. But to the magician who
controls the precise method of shuffling, this is obviously a cop-out.</p>
<p>Or consider a randomized clinical study in which a drug harms patients in 0.01%
of cases. It’s easy to sweep the causal factors under the rug and assume the
effects are “randomly distributed.” But we can imagine more information
gathering revealing that all instances of harm occurred in an ethnic minority. In
practice, “randomness” is more often a cop-out than an unavoidable facet of the
system under study.</p>
<h3 id="human-nature">Human Nature</h3>
<p>My struggle with probability theory is that it lends itself to distaste. Humans
are lazy, and when presented with the option of doing more investigative
legwork or simply assuming data is iid, they will often choose the latter,
especially when the latter appears to be “scientifically and mathematically
rigorous.” However, mathematics is only as correct as the assumptions made at
the beginning, and by hiding the causal factors of an event behind the
abstraction of a “probability distribution” we deprive ourselves of the ability
to identify when those causal factors change and our assumptions no longer hold
(i.e. the distribution shifts).<sup id="fnref:magnets" role="doc-noteref"><a href="#fn:magnets" class="footnote" rel="footnote">3</a></sup></p>
<p>And even when the assumption of iid is justified, the logic of probability
theory is more often misapplied than not, despite supposedly being a “guide
towards logical consistency.” In my experience, probability theory is more often
used to prove a point in scientific papers than it is a self-check for
correctness. Just look at the p-value crisis of the 2010’s.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">4</a></sup> As they say,
there are lies, damned lies, and statistics. Even Bayesians, who seem to think
they’re always right, can occasionally get it wrong as evidenced by E.T. Jaynes’
humorous exploration of a paper that purported to prove a woman had ESP.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">5</a></sup><sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">6</a></sup></p>
<p>Given the prevalence of misapplication, I can only conclude that probability
theory needs to be redesigned as a mental device. I don’t want to be a
probability theory nazi, but all the evidence seems to indicate that probability
theory is a science for people who have given up on science, rather than the
rigorous system of analysis it purports to be.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:uncertainty" role="doc-endnote">
<p>Many events are unpredictable to us in practice, either because the laws governing the outcome are not known, or because the laws are known but the observations required are too arduous to make. Sometimes the required observations are too numerous to collect (as in statistical mechanics), and other times the non-linear, chaotic nature of the system necessitates observations that are too precise to be practical. Quantum experiments seem to be inherently unpredictable, although whether this is a fundamental facet of nature is a matter of debate. <a href="#fnref:uncertainty" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:1" role="doc-endnote">
<p>There may be some chaos in the bounce on a hard surface or drift in the
wind, so we would need a properly controlled environment and <a href="https://www.npr.org/templates/story/story.php?storyId=1697475">precise
engineering</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:magnets" role="doc-endnote">
<p>For example, I may experiment with a coin and decide that it is fair when tossing it onto a wooden surface, only to discover later that the coin is magnetized and slightly biased towards heads on metallic surfaces. <a href="#fnref:magnets" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>https://www.americanscientist.org/article/the-statistical-crisis-in-science <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>http://www.med.mcgill.ca/epidemiology/hanley/bios601/GaussianModel/JaynesProbabilityTheory.pdf <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>I also highly recommend Chapter 10. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>TL;DR - the fundamental assumption of probability theory is one of ignorance. This assumption is too easy to break in most contexts and leads to unfounded confidence in conclusions.Ducttape: Why and How2021-02-09T00:00:00+00:002021-02-09T00:00:00+00:00http://mitchgordon.me/ml/2021/02/09/ducttape<p>One of the most useful things I’ve learned during my PhD is how to use
<a href="https://github.com/jhclark/ducttape">ducttape</a>, a research workflow management
system. Like many good software tools, the mindset behind ducttape is more
powerful than the code itself. In this post, I’ll try to motivate the research
workflow management mindset and then give you a sense of how ducttape solves the
problems I present.</p>
<h2 id="a-simple-experiment">A Simple Experiment</h2>
<p>Suppose we’re training a new machine learning model, and we’re given the following utilities:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">download_data.py</code> - downloads training data from the internet.</li>
<li><code class="language-plaintext highlighter-rouge">filter_1.py</code>, <code class="language-plaintext highlighter-rouge">filter_2.py</code> - two filtering programs based on different criteria.</li>
<li><code class="language-plaintext highlighter-rouge">aug_1.py</code>, <code class="language-plaintext highlighter-rouge">aug_2.py</code>, <code class="language-plaintext highlighter-rouge">aug_3.py</code> - three data augmentation programs.</li>
<li><code class="language-plaintext highlighter-rouge">train_model.py</code> - trains a model, given some training data.</li>
</ul>
<p>Our task is to determine which combination of data filtering and augmentation
leads to the best model performance. For simplicity of exposition, we’ll assume
that we can only choose one filtering program and one augmentation script to
use. (We can’t use multiple augmentation scripts at the same time.)</p>
<h2 id="the-most-naive-approach">The Most Naive Approach</h2>
<p>The most naive approach is to manually try all the possible combinations of
filters and augmentation scripts. After all, there are only \(2 \times 3 = 6\) possible
combinations. Here’s what our bash history might look like if we did this…</p>
<div class="language-terminal highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">bash-3.2$</span><span class="w"> </span>python download_data.py data.txt
<span class="gp">bash-3.2$</span><span class="w"> </span>python filter_1.py data.txt filtered_data1.txt
<span class="gp">bash-3.2$</span><span class="w"> </span>python filter_2.py data.txt filtered_data2.txt
<span class="gp">bash-3.2$</span><span class="w"> </span>python aug_1.py filtered_data1.txt data_1_1.txt
<span class="gp">bash-3.2$</span><span class="w"> </span>python aug_2.py filtered_data1.txt data_1_2.txt
<span class="gp">bash-3.2$</span><span class="w"> </span>python aug_3.py filtered_data1.txt data_1_3.txt
<span class="gp">bash-3.2$</span><span class="w"> </span>python aug_1.py filtered_data2.txt data_2_1.txt
<span class="gp">bash-3.2$</span><span class="w"> </span>python aug_2.py filtered_data2.txt data_2_2.txt
<span class="gp">bash-3.2$</span><span class="w"> </span>python aug_3.py filtered_data2.txt data_2_3.txt
<span class="gp">bash-3.2$</span><span class="w"> </span>python train_model.py data_1_1.txt model_1_1
<span class="gp">bash-3.2$</span><span class="w"> </span>python train_model.py data_1_2.txt model_1_2
<span class="gp">bash-3.2$</span><span class="w"> </span>python train_model.py data_1_3.txt model_1_3
<span class="gp">bash-3.2$</span><span class="w"> </span>python train_model.py data_2_1.txt model_2_1
<span class="gp">bash-3.2$</span><span class="w"> </span>python train_model.py data_2_2.txt model_2_2
<span class="gp">bash-3.2$</span><span class="w"> </span>python train_model.py data_2_3.txt model_2_3
</code></pre></div></div>
<p>Obviously this is going to be tedious and error-prone. Personally, it took me
three tries to type all these commands without making a typo and mixing up the
numbers. And now we have a bunch of files lying around in our working directory
that we probably won’t remember when we come back to this project next week.</p>
<p>This approach might work for a small, manageable number of experiments, but we
can see how this approach could quickly become impractical, especially if we try
to add more steps.</p>
<h2 id="the-less-naive-approach">The Less Naive Approach</h2>
<p>But the above approach is a strawman. Any decent programmer would probably do
something smarter, like write a couple nested bash for-loops:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python download_data.py data.txt
<span class="k">for </span>X <span class="k">in </span>1 2<span class="p">;</span> <span class="k">do
</span>python filter_<span class="nv">$X</span>.py data.txt filtered_data<span class="nv">$X</span>.txt
<span class="k">for </span>Y <span class="k">in </span>1 2 3<span class="p">;</span> <span class="k">do
</span>python aug_<span class="nv">$Y</span>.py filtered_data<span class="nv">$X</span>.txt data_<span class="k">${</span><span class="nv">X</span><span class="k">}</span>_<span class="k">${</span><span class="nv">Y</span><span class="k">}</span>.txt
python train_model.py data_<span class="k">${</span><span class="nv">X</span><span class="k">}</span>_<span class="k">${</span><span class="nv">Y</span><span class="k">}</span>.txt model_<span class="k">${</span><span class="nv">X</span><span class="k">}</span>_<span class="k">${</span><span class="nv">Y</span><span class="k">}</span>
<span class="k">done
done</span>
</code></pre></div></div>
<p>Still, there are a couple of issues with this approach. First, it doesn’t give
us very fine-grained control over which experiments get executed and when they
get executed. Real-world workflows can have many steps each with their own
configurable options. In a simple 6-step workflow with 3 options each, there’s
\(3^6 = 729\) total experimental configurations. We probably don’t want to run
<strong>all</strong> of those. Even if we do, we probably don’t want to come up with names
for that many intermediate files or have them lying around unorganized on our filesystem.</p>
<p>Second, this approach is not very extensible. Let’s say I find out each data
filtering program has an extra option, called “strictness,” which is set to “high” by
default. If I want to run some experiments with strictness set to low, it’s
non-obvious how to add that to the above bash script without wrecking my current
results. You can do it, sure, but it will be painful. And every time someone
wants to add another dimension to the experiments, the pain increases.</p>
<p>In summary, our ideal workflow looks something like the above, but with a few
extra features:</p>
<ul>
<li><strong>Fine-grained control</strong> over which experiments run and the hardware used to
execute each task.</li>
<li>Sensible <strong>automatic naming</strong> and organization of intermediate files.</li>
<li><strong>Re-use of previous work</strong> and <strong>parallelization</strong> of independent work where possible.</li>
<li><strong>Easily extensible</strong> with new experiment dimensions/tasks without
breaking results.</li>
<li><strong>Easy to summarize</strong> results in tabular format.</li>
</ul>
<p>Ducttape has all these nice features, which we’ll demonstrate in the next section.</p>
<h2 id="enter-ducttape">Enter Ducttape</h2>
<p>In ducttape, we organize our workflow as a directed acyclic graph. Each node in
the graph is a bash script (called a task) which accepts filenames from
previously completed tasks and optionally outputs files that can be consumed
downstream. These input/output relationships form the edges of our graph. For
example, here is a task which downloads the training data:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>task download_data
<span class="o">></span> data <span class="o">{</span>
python download_data.py <span class="nv">$data</span>
<span class="o">}</span>
</code></pre></div></div>
<p>This task takes no input and outputs a single file, <code class="language-plaintext highlighter-rouge">$data</code>. Notice that <code class="language-plaintext highlighter-rouge">$data</code>
is a bash variable, not an absolute path. This is because ducttape will
automatically assign <code class="language-plaintext highlighter-rouge">$data</code> to a sensible location on the filesystem for us,
depending on the current settings of the experiment dimensions.</p>
<p>Now that we have a task that provides the data, we can consume it in the filtering task.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>task filter
< <span class="nv">data</span><span class="o">=</span><span class="nv">$data</span>@download_data
<span class="o">></span> filtered_data
:: <span class="nv">filter_type</span><span class="o">=(</span>FilterType: 1 2<span class="o">)</span> <span class="o">{</span>
python filter_<span class="k">${</span><span class="nv">filter_type</span><span class="k">}</span>.py <span class="nv">$data</span> <span class="nv">$filtered_data</span>
<span class="o">}</span>
</code></pre></div></div>
<p>This task displays all three possible argument types. Left angle brackets <code class="language-plaintext highlighter-rouge"><</code>
specify input files from previous tasks, while right angle brackets <code class="language-plaintext highlighter-rouge">></code> specify
output files. (Similar to bash file pipes.) Double colons <code class="language-plaintext highlighter-rouge">::</code> specify
parameters, which are just bash string variables.</p>
<p>Our parameter <code class="language-plaintext highlighter-rouge">$filter_type</code> is assigned to be an experiment dimension, which is
called a “branch” in ducttape. The syntax <code class="language-plaintext highlighter-rouge">:: filter_type=(FilterType: 1 2)</code>
means that the bash variable <code class="language-plaintext highlighter-rouge">$filter_type</code> may be assigned a value of 1 or 2 at
run-time depending on which experiment we’ve asked ducttape to run. Notice that
even though this task has multiple experimental configurations, it always writes
its output to the location specified by <code class="language-plaintext highlighter-rouge">$filtered_data</code>, which is set by
ducttape to a sensible filename based on the current experiment configuration.</p>
<p>The task which augments the data is similar to the above:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>task augment
< <span class="nv">filtered_data</span><span class="o">=</span><span class="nv">$filtered_data</span>@filter
<span class="o">></span> augmented_data
:: <span class="nv">aug_type</span><span class="o">=(</span>AugType: 1 2 3<span class="o">)</span> <span class="o">{</span>
python aug_<span class="k">${</span><span class="nv">aug_type</span><span class="k">}</span>.py <span class="nv">$filtered_data</span> <span class="nv">$augmented_data</span>
<span class="o">}</span>
</code></pre></div></div>
<p>And our last task is to train our model:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>task train_model
< <span class="nv">augmented_data</span><span class="o">=</span><span class="nv">$augmented_data</span>@augment
<span class="o">></span> model <span class="o">{</span>
python train_model.py <span class="nv">$augmented_data</span> <span class="nv">$model</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Finally, to run a particular set of experiments, we make a “plan” of execution:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plan main {
reach train_model via (FilterType: 1) * (AugType: *)
}
</code></pre></div></div>
<p>This plan trains three models: all use the first filtering option and each
uses a different augmentation option. We can easily extend this plan to
target different tasks, or to take different paths through our workflow graph.
If a branch is not specified, ducttape uses the “baseline” branch, which is the
first option provided in the branch definition.</p>
<p>This is a fairly linear workflow, but many real-world workflows will have tasks
which take input from many upstream tasks and provide files to many downstream
tasks. You’ll notice that our workflow implementation is more verbose than our
original bash script; however, all this boilerplate gives us the nice features
we mentioned above, including automatic parallelization, assigning different
tasks to different machines, and more.</p>
<h2 id="extending-the-workflow">Extending the Workflow</h2>
<p>Supposing we ran the above experiments, we can go on to extend our workflow with
new experiments without breaking our existing results.</p>
<h3 id="adding-a-new-augmentation-script">Adding A New Augmentation Script</h3>
<p>If we wrote a new augmentation script <code class="language-plaintext highlighter-rouge">aug_4.py</code>, this can easily be added to
our workflow with one character change:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>task augment
< <span class="nv">filtered_data</span><span class="o">=</span><span class="nv">$filtered_data</span>@filter
<span class="o">></span> augmented_data
:: <span class="nv">aug_type</span><span class="o">=(</span>AugType: 1 2 3 4<span class="o">)</span> <span class="o">{</span>
python aug_<span class="k">${</span><span class="nv">aug_type</span><span class="k">}</span>.py <span class="nv">$filtered_data</span> <span class="nv">$augmented_data</span>
<span class="o">}</span>
</code></pre></div></div>
<h3 id="adding-a-new-filter-option">Adding a New Filter Option</h3>
<p>Similarly, if we wanted to add a new branch to specify the “strictness” of the filter, we could update the task like this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>task filter
< <span class="nv">data</span><span class="o">=</span><span class="nv">$data</span>@download_data
<span class="o">></span> filtered_data
:: <span class="nv">filter_strictness</span><span class="o">=(</span>FilterStrict: high low<span class="o">)</span>
:: <span class="nv">filter_type</span><span class="o">=(</span>FilterType: 1 2<span class="o">)</span> <span class="o">{</span>
python filter_<span class="k">${</span><span class="nv">filter_type</span><span class="k">}</span>.py <span class="nv">$data</span> <span class="nv">$filtered_data</span> <span class="nt">--strictness</span><span class="o">=</span><span class="nv">$filter_strictness</span>
<span class="o">}</span>
</code></pre></div></div>
<p>and then update our execution plan:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plan main {
reach train_model via (FilterType: 1) * (AugType: *) * (FilterStrict: *)
}
</code></pre></div></div>
<p>This would not break any of our existing results. When we run this new plan,
ducttape will assume that previous executions of the filter task were run with
the strictness set to “high,” which is the baseline value for the branch.</p>
<h3 id="adding-evaluation">Adding Evaluation</h3>
<p>We can also add new tasks to our workflow which will re-use previous results:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>task evaluate
< <span class="nv">model</span><span class="o">=</span><span class="nv">$model</span>@train_model
<span class="o">></span> score <span class="o">{</span>
python eval_model.py <span class="nv">$model</span> <span class="nv">$score</span>
<span class="o">}</span>
</code></pre></div></div>
<h2 id="caveats">Caveats</h2>
<p>Ducttape is crufty: the latest commit was in 2015, and there are still some
rough edges. That being said, it gets the job done. And since most experiments
are short-lived, I’m not super worried about the tech-debt I might incur by
using it. There are other workflow management frameworks out there, like
<a href="https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8">Airflow</a>
and <a href="https://github.com/spotify/luigi">Luigi</a>, but I’ve found those don’t have
as good of a story for managing experimentation branches.</p>
<p>One other thing I don’t like is that it’s too easy to do the wrong thing with
ducttape. For mildly complex workflows, it’s not immediately obvious
what the right task/branch setup should be. This requires “ducttape zen,” which
is discovered with time. In general, I think best practice is to implement more
branches than you need and then trim down your execution space using lots of
execution plans. I might talk about that in a later post.</p>
<p>There are also other features that I haven’t covered, such as package
management, hardware configurations for each task, and summaries of results. If
you’d like to learn more, feel free to read the
<a href="https://github.com/jhclark/ducttape/blob/master/tutorial/TUTORIAL.md">tutorial</a>.
However, I believe this brief overview covers about 90% of my ducttape usage,
and I hope it gives you a sense of the usefulness of research workflow
management.</p>One of the most useful things I’ve learned during my PhD is how to use ducttape, a research workflow management system. Like many good software tools, the mindset behind ducttape is more powerful than the code itself. In this post, I’ll try to motivate the research workflow management mindset and then give you a sense of how ducttape solves the problems I present.A Software Tester’s Perspective on Statistical Learning Theory2020-11-05T00:00:00+00:002020-11-05T00:00:00+00:00http://mitchgordon.me/ml/2020/11/05/statistical-learning-theory-testing<p>You’re a software testing engineer, working at a big tech company. While other
engineers on your team write code, your job is to make sure the code is safe
before you push it to production. Your goal isn’t to prove the code is
“correct,” but rather to assess the risk of potential failures to the company
and <a href="http://assets.cambridge.org/97811071/72012/excerpt/9781107172012_excerpt.pdf#page=8">test
accordingly</a>.
You mainly write tests that try to rule out known or suspected failure modes,
and you spend a lot of time thinking about edge cases.</p>
<p>One day, over your morning cup of coffee, you get an email from the other
engineers on your team. They’ve decided that writing source code is too hard, so
they’ve started <strong>randomly guessing</strong> program implementations until one meets
the specification. They call this wacky approach “<a href="https://medium.com/@karpathy/software-2-0-a64152b37c35">Software
2.0</a>” or something.</p>
<p>“Not to worry,” they tell you, “we can prove it works. You don’t even have to
write tests any more!” They go on to explain that there’s this book called
“<a href="https://en.wikipedia.org/wiki/Statistical_learning_theory#:~:text=Statistical%20learning%20theory%20is%20a,predictive%20function%20based%20on%20data.">Statistical Learning
Theory</a>,”
which describes a mathematical framework that <em>proves</em> Software 2.0 can give you
a correct implementation.</p>
<p>Intrigued, you ask them for more details.</p>
<h1 id="future-input-sim-past-input">Future Input \(\sim\) Past Input</h1>
<p>First, they have to assume that any future user input will be similar to what
you’ve seen previously in production. They call this the
<a href="https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables">IID</a>
assumption.</p>
<p>“But what about hackers?” you ask. “and what happens when we change the UI?
People change their behaviour all the time! This week they’re Googling for
election results, but next week they’ll go back to Googling Kanye West…”</p>
<p>They concede that maybe you have a point, but they definitely
need this assumption to make it work. You begrudgingly let them continue.</p>
<h1 id="no-specification">No Specification</h1>
<p>Then they tell you there’s no specification. You ask them what the hell that means.</p>
<p>“Ok, hear us out,” they say. “The old spec was basically impossible to write.
There were too many edge cases! Our poor product manager didn’t even know where
to start, honestly.”</p>
<p>Instead, they decided to ask the product manager to write down a bunch of
example user inputs along with the correct output for each example. That would
serve as the de facto specification. They call this “training data.” Then they
guess a program that meets those requirements, using this thing called gradient
descent.</p>
<p>You mention that this reminds you of <a href="https://medium.com/dev-genius/dont-just-test-the-happy-path-e3fd565bad53">happy-path
testing</a>,
where you only write tests for things you expect without thinking about possible
failure modes. In this case the training examples test the happy paths. How do
you know there aren’t still edge cases and bugs lurking around?</p>
<p>That’s where the magic of SLT kicks in, they say.</p>
<h1 id="future-performance-sim-past-performance">Future Performance \(\sim\) Past Performance</h1>
<p>If you assume that future inputs look like past inputs, they say, then you can
also assume future performance will look like past performance! As long as you
<strong>have enough training data</strong>, there’s a low probability that you’ll encounter
edge cases. Basically, you want the happy paths to be the only paths.</p>
<p>“But how many examples do you need to make sure there aren’t any unhappy paths?”
you ask.</p>
<p>It depends on how complicated the program you’re guessing is, they say. If the
program is super complex and you don’t know anything about it, you basically
need to enumerate all the possible inputs. They’ve been calling this the “<a href="https://analyticsindiamag.com/what-are-the-no-free-lunch-theorems-in-data-science/">No
Free Lunch Theorem</a>.”</p>
<p>But if it’s “simple” somehow and you can use that to narrow down the possible
candidate programs, then you need way less data. That part sounds kind of
reasonable to you. It reminds you of having <a href="https://en.wikipedia.org/wiki/Code_coverage">branch
coverage</a> when writing unit tests.
If you have more branches, then you have to write more tests. Similarly, if you
have more possible candidate programs, then you need more data to make sure you
pick the right one.</p>
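<p>If you pressed them for the math, they might scribble down the standard bound for a finite set of candidate programs \(\mathcal{H}\) (a sketch of one common formulation, not the only one): given \(n\) iid training examples, with probability at least \(1 - \delta\) every candidate program’s future error rate is within</p>
\[\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}\]
<p>of its error rate on the training data. More candidate programs means a bigger \(|\mathcal{H}|\), which means you need more data \(n\) to keep that gap small, which is the same intuition as needing more tests when you have more branches.</p>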
<h1 id="biasvariance-trade-off">Bias/Variance Trade-off</h1>
<p>“But wait,” you say, “what if you’re wrong about how the program you’re guessing
is simple? Isn’t the problem that you don’t know the right program in the
first place?”</p>
<p>They tell you you’re right, of course, and that there’s a trade-off. There are
different kinds of simplicity with different levels of strictness.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> If you
suppose the wrong kind of simplicity from the start, that puts a hard cap on how
well you can learn the program. They call this the <em>bias</em> or <em>approximation
error</em>. On the other hand, if you don’t assume anything at all, then you need
way more data. If you don’t have enough, then you might encounter <em>variance</em> or
<em>estimation error</em>.</p>
<p>The best case, of course, is when you don’t have to guess and you know the
correct implementation of the program you want to write. Then you would have
maximum correct bias and no variance.</p>
<p>You muse that the second best case would be to just label every possible input
so that you don’t have to assume anything. They tell you that’s usually impractical (the
poor product manager can only work so much) but in some cases you basically have
<a href="https://arxiv.org/abs/2005.14165">infinite data</a> and that’s exactly what they do.</p>
<h1 id="conclusion">Conclusion</h1>
<p>“So let me get this straight,” you say.</p>
<ul>
<li>
<p>First, you assume that the future will look like the past.</p>
</li>
<li>
<p>Next, you get somebody to write down a bunch of example inputs and
correct outputs which you use as the spec.</p>
</li>
<li>
<p>Then, you make some assumptions about the program you’re trying to guess. If
you make the wrong assumptions, then you cap your maximum performance. But if
you don’t make any, you might accidentally overload your product manager.</p>
</li>
<li>
<p>Finally, you guess a random program that fits the spec. As long as
you have enough data, you can guarantee you did the best you could with your
assumptions and that you probably won’t hit any edge cases.</p>
</li>
</ul>
<p>They nod. You shake your head. “I don’t know guys, this seems kind of fishy. The
IID assumption is one thing, but we also have no idea how large the approximation error is, right?”</p>
<p>They shrug. “Look man, we just don’t want to write any code, ok? It’s too hard.”
You can understand the sentiment.</p>
<p>“Besides, we don’t ever really use SLT in practice.”</p>
<p>“What?”</p>
<p>“Yeah, we just set aside some of the training data as a test set. If the program
we guess does good on the training data and the test set, we just assume it’s
good to go.”</p>
<p>“But what if your test set is bad?”</p>
<p>They shrug again. “We just do our best. Sometimes we <a href="https://www.aclweb.org/anthology/2020.acl-main.442/">craft test
sets</a> that look for
<a href="https://arxiv.org/abs/1912.12598">specific properties</a>.”</p>
<p>You nod knowingly. At the end of the day, you don’t write tests to prove
correctness. You write tests to show the <a href="http://assets.cambridge.org/97811071/72012/excerpt/9781107172012_excerpt.pdf#page=8">presence or absence of
bugs</a>
in a way that appropriately manages risk. Some things never change. You take a
sip of coffee and go back to writing unit tests, suspecting your colleagues
will join you in a few years.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>You might assume, for example, that your program is <a href="http://egrcc.github.io/docs/dl/deeplearningbook-convnets.pdf">translation invariant</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>You’re a software testing engineer, working at a big tech company. While other engineers on your team write code, your job is to make sure the code is safe before you push it to production. Your goal isn’t to prove the code is “correct,” but rather to assess the risk of potential failures to the company and test accordingly. You mainly write tests that try to rule out known or suspected failure modes, and you spend a lot of time thinking about edge cases.The Variance of Yotta Savings Accounts2020-08-24T00:00:00+00:002020-08-24T00:00:00+00:00http://mitchgordon.me/math/2020/08/24/yotta-ball<p>My girlfriend recently got a <a href="https://www.withyotta.com/">Yotta savings account</a>
which has an interesting twist: instead of paying interest like a normal bank,
they buy you lottery tickets. This makes saving more exciting, since you have a
small chance of winning millions of dollars.</p>
<p>Now, my first thought was that this must be a terrible deal for whoever’s
playing, since the lottery is generally considered a bad investment. But it
turns out the Yotta lottery actually has pretty good odds. People have
<a href="https://youtu.be/ziGvywXzhfw?t=209">calculated the average return</a> to be around
2.6% APY<sup id="fnref:splits" role="doc-noteref"><a href="#fn:splits" class="footnote" rel="footnote">1</a></sup>, which is
not a bad return for a savings account. This will likely decrease as they become
more established and get more users.</p>
<h2 id="what-is-the-variance-of-yotta-interest">What is the variance of Yotta interest?</h2>
<p>One question I haven’t seen answered on the internet is about the variance of
the APY. Sure, maybe the <em>average</em> investor gets 2.6% APY. But that might mean most
people get 0% and one person wins a few million dollars. If I’m going to invest,
I’d like some guarantees on the lower-bound of the amount of money I’m going to
get back.</p>
<p>This is a really straightforward instance of the law of large numbers: the
more tickets you buy, the closer you’ll get to the average return. But before we
break out the math, let’s start with some simulations to get some intuition
of what to expect.</p>
<p>Let’s suppose 10k people each invest $10k in a Yotta savings account. How much
interest will each of them earn? We can simulate this scenario with a pretty
simple python script:</p>
<ol>
<li>$10k buys each person 400 lottery tickets per week.</li>
<li>We can simulate the lottery drawings with a random number generator to
predict how much money<sup id="fnref:payouts" role="doc-noteref"><a href="#fn:payouts" class="footnote" rel="footnote">2</a></sup> each person will win from their 400 tickets.
(No split prizes.<sup id="fnref:splits:1" role="doc-noteref"><a href="#fn:splits" class="footnote" rel="footnote">1</a></sup>)</li>
<li>Rinse and repeat for 52 weeks, keeping track of the total money won for each person.<sup id="fnref:compounding" role="doc-noteref"><a href="#fn:compounding" class="footnote" rel="footnote">3</a></sup></li>
</ol>
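<p>A stripped-down sketch of that simulation might look like the following. (The prize table here is a made-up placeholder; the real odds and payouts come from the spreadsheet linked in the footnotes, but the shape of the loop is the same.)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Placeholder prize tiers: (probability per ticket, payout in dollars).
# Swap in the real payout table from the footnotes for actual numbers.
PRIZES = [(0.02, 0.10), (0.005, 1.00), (0.0001, 10.00)]

def simulate_people(num_people=10_000, tickets_per_week=400, weeks=52):
    """Total yearly winnings for each person, no compounding."""
    n_tickets = tickets_per_week * weeks
    totals = np.zeros(num_people)
    for prob, payout in PRIZES:
        # Winning tickets per person for this tier over the year.
        # (Treats tiers as independent, a fine approximation for tiny odds.)
        totals += payout * np.random.binomial(n_tickets, prob, size=num_people)
    return totals

winnings = simulate_people()
print(winnings.min(), winnings.mean(), winnings.max())
</code></pre></div></div>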
<p>The results are plotted in the histogram below. On the x-axis is an amount of
money, and the y-axis shows how many people won that much money via lottery tickets.</p>
<p><img src="http://mitchgordon.me/assets/yotta_10000_52_10000_False.png" alt="10k" /></p>
<p>As expected, the average money won is around $260 (2.6% APY). However, some
people won as little as $200 (2.0% APY) and some as much as $350 (3.5% APY). No one
won less than $150 (1.5% APY).</p>
<p>So the variance isn’t that bad. Of course, if you invest less money you get fewer
tickets, and so the variance will increase. Below is the same experiment when
people invest $1k each.</p>
<p><img src="http://mitchgordon.me/assets/yotta_1000_52_10000_False.png" alt="1k" /></p>
<p>In this case, the average still seems to be around 2.6% APY. However, some
people get back as little as 1.0% APY, and the distribution is skewed right, with some
people getting as much as 6.0% APY. If you’re interested in running more
experiments, I’ve provided the python snippet below. (If that’s broken you can
also grab the code on
<a href="https://github.com/mitchellgordon95/implementing-paradoxes/blob/master/yotta.py">github</a>.)</p>
<iframe src="https://trinket.io/embed/python3/5771943659" width="100%" height="356" frameborder="0" marginwidth="0" marginheight="0" allowfullscreen=""></iframe>
<h2 id="tail-inequalities">Tail Inequalities</h2>
<p>I mentioned earlier that buying many lottery tickets is a very
straightforward instance of the law of large numbers: the more tickets you buy,
the closer your winnings get to the expected APY. But what if I want to quantify
exactly how far my earnings will be from the average? For example, if I invest
$10k for one year, how likely is it that I earn less than $200?</p>
<p>This is exactly the question that tail inequalities answer. Suppose I have a
random variable \(X\). Tail inequalities tell us how probable it is that \(X <
t\), for some value of \(t\). There are several types of tail inequalities you
can use, depending on how much information you have about \(X\):</p>
<ul>
<li>
<p>The <strong>Markov Inequality</strong> is the simplest version, which you can use if you
only know the expected value of \(X\).</p>
</li>
<li>
<p>The <strong>Chebyshev Inequality</strong> is more complicated, taking into account both the
expected value of \(X\) and its variance.</p>
</li>
<li>
<p><strong>Chernoff Bounds</strong> usually give you the tightest bounds, but are only
applicable when \(X\) is a sum of multiple independent random variables.</p>
</li>
</ul>
<p>Technically we could apply Chernoff bounds, since the amount we win in a year is the sum of the amounts we win from each ticket we buy that year. But that requires a little more elbow grease than I’m willing to put in at the moment, so we’ll just use Chebyshev bounds. Here’s the theorem:</p>
<p><strong>Theorem (Chebyshev’s Inequality)</strong> Let \(X\) be a random variable with expectation \(\mu_X\) and standard deviation \(\sigma_X\). Then for any \(t \in \mathbb{R}^+\),</p>
\[Pr[|X - \mu_X| \geq t \sigma_X] \leq \frac{1}{t^2}\]
<p>Let’s break this down for our case:</p>
<table>
<tbody>
<tr>
<td>Math stuff</td>
<td>Our case</td>
</tr>
<tr>
<td>\(X\)</td>
<td>The amount of money we’ll earn if we invest $10k in a Yotta savings account over one year.</td>
</tr>
<tr>
<td>\(\mu_X\)</td>
<td>The average value of X. ($260)</td>
</tr>
<tr>
<td>\(\sigma_X\)</td>
<td>The standard deviation of X. ($29)<sup id="fnref:variance" role="doc-noteref"><a href="#fn:variance" class="footnote" rel="footnote">4</a></sup></td>
</tr>
<tr>
<td>\(Pr[|X - \mu_X| \geq t \sigma_X]\)</td>
<td>The probability that the money we earn is at least \(t\) standard deviations away from the mean.</td>
</tr>
</tbody>
</table>
<p>So let’s say we want to know the probability that we earn less than $200 in a
year. That’s more than two standard deviations below the mean. Chebyshev’s
theorem tells us that the probability of landing two or more standard deviations
away from the mean (in either direction) is at most \(\frac{1}{2^2} = 1/4\), so in
particular the probability that we earn less than $200 is at most 1/4.</p>
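<p>For the curious, here is the same calculation without rounding \(t\) down to two standard deviations, using the $260 mean and $29 standard deviation from the table above:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mu, sigma, threshold = 260, 29, 200

t = (mu - threshold) / sigma   # about 2.07 standard deviations below the mean
bound = 1 / t**2               # Chebyshev bounds the two-sided tail by 1/t^2
print(f"t = {t:.2f}, chance of earning under ${threshold} is at most {bound:.2f}")  # ~0.23
</code></pre></div></div>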
<p>So Chebyshev guarantees that we’ll make more than $200 with at least 75%
certainty. But remember, that’s just a bound. Based on our simulations, the
probability that we make more than $200 is likely much higher than 75%. I would
probably say it’s around 95% based on the histogram.</p>
<h2 id="central-limit-theorem">Central Limit Theorem</h2>
<p>Now, you might have noticed that our first histogram looks essentially like a
normal distribution. This happens all the time and is the subject of the
<em>central limit theorem</em> (CLT). The CLT says that if you buy enough lottery
tickets, the amount of money you make that year will fall on an approximately
normal distribution. The mean of this normal distribution is the sum of the
expected values of the individual tickets, and its variance is the sum of
their individual variances.</p>
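<p>In symbols, if each of the \(n = 400 \times 52 = 20{,}800\) tickets has mean payout \(\mu_{\text{ticket}}\) and variance \(\sigma^2_{\text{ticket}}\), then</p>
\[\mu_X = n\,\mu_{\text{ticket}}, \qquad \sigma_X = \sqrt{n\,\sigma^2_{\text{ticket}}},\]
<p>which, plugging in the per-ticket numbers from the footnotes, works out to roughly $260 and $29 for the $10k scenario.</p>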
<p>If we can verify that the distribution is approximately normal, as we have above
with the $10k scenario, we can skip calculating the Chebyshev or Chernoff bounds
and just assume the distribution is normal. Plugging the average and standard
deviation into this <a href="http://onlinestatbook.com/2/calculators/normal_dist.html">normal distribution
calculator</a> tells us
the probability of making more than $200 is 98%. This is much more in line with
the results of our experiments, if slightly more hand-wavy.</p>
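<p>You can reproduce that number without the online calculator by evaluating the normal CDF directly (a small sketch using only the standard library, assuming the same $260 mean and $29 standard deviation):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

mu, sigma, threshold = 260, 29, 200

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

p_more = 1 - normal_cdf(threshold, mu, sigma)
print(f"Chance of earning more than ${threshold}: {p_more:.1%}")  # about 98%
</code></pre></div></div>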
<p>You have to be careful with this approach, however, since not all distributions
will be normal. For example, the second scenario we tested, where each person
invested $1k, showed that the distribution wasn’t very close to a normal
distribution. In that case, it might be better to apply Chernoff or Chebyshev
bounds.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:splits" role="doc-endnote">
<p>Some prizes can be split between multiple winners (such as the $10M prize). If more people are playing, then split prizes get smaller. Since we’re interested in the “worst-case” scenario, we assume the actual payout for all split prizes is $0. In this case, the average APY is 2.6%. <a href="#fnref:splits" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:splits:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:payouts" role="doc-endnote">
<p>We used the payouts from this <a href="https://bit.ly/yottaev">spreadsheet</a>. <a href="#fnref:payouts" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:compounding" role="doc-endnote">
<p>For simplicity, we did not compound the weekly interest. This can be changed in the python snippet, if you care, but it does not impact growth too much. <a href="#fnref:compounding" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:variance" role="doc-endnote">
<p>The variance of a single ticket’s payout is roughly 0.04 (in dollars squared). The variance of a sum of independent random variables is the sum of their variances. So for 400 * 52 = 20,800 tickets, the variance is about 842, and the standard deviation (the square root of the variance) is about $29. <a href="#fnref:variance" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>My girlfriend recently got a Yotta savings account which has an interesting twist: instead of paying interest like a normal bank, they buy you lottery tickets. This makes saving more exciting, since you have a small chance of winning millions of dollars.