{"generator":"Jekyll","link":[{"@attributes":{"href":"https:\/\/lorenlugosch.github.io\/feed.xml","rel":"self","type":"application\/atom+xml"}},{"@attributes":{"href":"https:\/\/lorenlugosch.github.io\/","rel":"alternate","type":"text\/html"}}],"updated":"2023-04-06T09:19:59-07:00","id":"https:\/\/lorenlugosch.github.io\/feed.xml","title":"Loren Lugosch","subtitle":"personal description","author":{"name":"Loren Lugosch"},"entry":[{"title":"What does Hegel mean by \u201cReason\u201d?","link":{"@attributes":{"href":"https:\/\/lorenlugosch.github.io\/posts\/2021\/05\/hegel-reason\/","rel":"alternate","type":"text\/html","title":"What does Hegel mean by \u201cReason\u201d?"}},"published":"2021-05-15T00:00:00-07:00","updated":"2021-05-15T00:00:00-07:00","id":"https:\/\/lorenlugosch.github.io\/posts\/2021\/05\/hegel-reason","content":"<p>I finally finished reading the big and baffling <a href=\"https:\/\/www.google.ca\/books\/edition\/Phenomenology_of_Spirit\/xOnhG9tidGsC?hl=en&amp;gbpv=0\"><em>Phenomenology of Spirit<\/em><\/a> by Georg Wilhelm Friedrich Hegel.<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/hegel\/knives_out.jpg\" style=\"max-width:75%\" \/><\/center>\n<p><br \/><\/p>\n\n<p>The gist of the book is that something called \u201cSpirit\u201d (\u201cGeist\u201d) develops from simple consciousness of the \u201cHere and Now\u201d, to abstract concepts, to knowledge of self and the social\/ethical world, to more sophisticated forms of art, science, religion, and finally <em>Absolute Knowledge<\/em>. Along the way, Spirit invents big world-historical things like Stoicism, Skepticism, Christianity, and Kant. (See the German audiobook cover below, which incidentally cracks me up because it looks like something from a cult.)<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/hegel\/stages.jpeg\" style=\"max-width:75%\" \/><\/center>\n<p><br \/><\/p>\n\n<p>For me, as for other people, the most fruitful way to get through the book turned out to be <em>not<\/em> to try following the logic (if it exists!) of the text from section to section, but rather to sit back and enjoy the stream-of-consciousness of an extremely erudite man, who occasionally drops an interesting phrase or formulation that people like Marx and Sartre later picked up.<sup id=\"fnref:sadler\" role=\"doc-noteref\"><a href=\"#fn:sadler\" class=\"footnote\" rel=\"footnote\">1<\/a><\/sup><\/p>\n\n<p>Still, there was one aspect of the <em>Phenomenology of Spirit<\/em> that defied my stream-of-consciousness-style reading and gave me pause: namely, the way in which Hegel defines the word \u201cReason\u201d (\u201cVernunft\u201d).<\/p>\n\n<p>Here are a few selections from the beginning of the \u201cReason\u201d section of the book:<\/p>\n\n<blockquote>\n  <p>[Self-consciousness as Reason] is certain that it is itself reality, or that everything actual is none other than itself\u2026 [Sec. 232, p. 139]<\/p>\n<\/blockquote>\n\n<blockquote>\n  <p>Reason is the certainty of consciousness that it is all reality; thus does idealism express its Notion. [Sec. 233, p. 140]<\/p>\n<\/blockquote>\n\n<blockquote>\n  <p>Reason is the certainty of being all <em>reality<\/em>. [Sec. 235, p. 142]<\/p>\n<\/blockquote>\n\n<blockquote>\n  <p>Reason, as it <em>immediately<\/em> comes before us as the certainty of consciousness that is is all reality, \u2026 [Sec. 242, p. 146]<\/p>\n<\/blockquote>\n\n<p>This is an odd definition for \u201cReason\u201d. It sounds more like a definition of \u201cidealism\u201d. 
I think most philosophers, or AI people like me, would instead define \u201cReason\u201d as \u201cproducing (or the faculty of producing) valid new assertions given other assertions\u201d, or something like that. But Hegel is definitely very keen on his own unusual definition, as he makes a point of saying it at least four times.<\/p>\n\n<p>Yet Hegel\u2019s definition doesn\u2019t seem that useful or fitting even in his own book. Take, for example, this description of what Reason does:<\/p>\n\n<blockquote>\n  <p>Reason sets to work to <em>know<\/em> the truth, to find in the form of a Notion that which, for \u2018meaning\u2019 and \u2018perceiving\u2019, is a Thing; i.e. it seeks to possess in thinghood the consciousness only of itself. [Sec. 240, p. 145]<\/p>\n<\/blockquote>\n\n<p>Plugging in \u201cthe certainty of consciousness that it is all reality\u201d for \u201cReason\u201d does not really seem to work here:<\/p>\n\n<blockquote>\n  <p>[The certainty of consciousness that it is all reality] sets to work to <em>know<\/em> the truth\u2026<\/p>\n<\/blockquote>\n\n<p>Or here, in which Hegel starts to take one of his many potshots at \u201csound common sense\u201d:<\/p>\n\n<blockquote>\n  <p>Since self-consciousness knows itself to be a moment of the <em>being-for-self<\/em> of this substance, it expresses the existence of the law within itself as follows: sound Reason knows immediately what is right and good. [Sec. 422, p. 253]<\/p>\n<\/blockquote>\n\n<p>$\\rightarrow$<\/p>\n\n<blockquote>\n  <p>sound [certainty of consciousness that it is all reality] knows immediately what is right and good (?)<\/p>\n<\/blockquote>\n\n<p>One possibility is that Hegel is not actually using \u201cthe certainty of consciousness that it is all reality\u201d as a <em>definition<\/em> of \u201cReason\u201d. Instead, maybe he just wants to firmly establish the notion that \u201cconsciousness is all reality\u201d, and this advanced stage of the development of Spirit in his book seemed like a good place to do it. We <em>know<\/em> what Reason is; so instead of wasting the reader\u2019s time giving a \u201cdefinition\u201d they already have, why not use the gap where a definition would normally go as a free space to stick an idea he wants to make sure we share with him? Hegel seems to pull <a href=\"https:\/\/www.youtube.com\/watch?v=N5iKYhnPpV4\">many such tricks<\/a> in this book.<\/p>\n\n<p>Another possibility is that when Hegel says \u201cReason <em>is<\/em> the certainty of being all reality\u201d, the \u201c<em>is<\/em>\u201d is not indicating <em>identity<\/em> but rather just marking a <em>predicate<\/em>: that is, \u201cthe certainty of being all reality\u201d is <em>one<\/em> aspect of the thing Hegel calls \u201cReason\u201d, but not the complete definition (which he never provides). But the predicate of \u201cthe certainty of being all reality\u201d still does not really jive with our commonsensical notion of \u201cReason\u201d. So maybe Hegel is just casually asserting that \u201c<em>if<\/em> you\u2019re a thinking person, a <em>reasonable<\/em> German intellectual in the year 1807, and you\u2019ve read your Kant, your Fichte, and your Schelling, then <em>of course<\/em> you\u2019re an idealist and you already <em>know<\/em> that consciousness is all reality\u201d.<\/p>\n\n<p>A third possibility is that \u201cVernunft\u201d has overtones in German that \u201cReason\u201d does not have in English and that could make Hegel\u2019s odd usage more sensible. 
This is at least true of \u201cGeist\u201d (cognate with English \u201cghost\u201d), which we could translate in English as \u201cSpirit\u201d (as is done for Hegel), \u201cMind\u201d, or maybe \u201cintellectual activity\u201d (\u201chumanities\u201d in German is \u201cGeisteswissenschaften\u201d, in contrast to \u201cNaturwissenschaften\u201d, \u201cnatural science\u201d). My German isn\u2019t good enough for me to know whether something similar can be said of \u201cVernunft\u201d.<\/p>\n\n<p>(A fourth possibility is that Hegel is full of shit and we should just put the book down. In math, this is called \u201c<a href=\"https:\/\/en.wikipedia.org\/wiki\/Triviality_(mathematics)#Trivial_and_nontrivial_solutions\">the trivial solution<\/a>\u201d.)<\/p>\n\n<p>It\u2019s also interesting to look at how Hegel uses \u201cReason\u201d (\u201cVernunft\u201d) in some of his later writings. In the preface<sup id=\"fnref:owl\" role=\"doc-noteref\"><a href=\"#fn:owl\" class=\"footnote\" rel=\"footnote\">2<\/a><\/sup> to his 1820 <em>Philosophy of Right<\/em> Hegel attributes an idea about Reality and Reason to Plato:<\/p>\n\n<blockquote>\n  <p>Yet [Plato] has proved himself to be a great mind because the very principle and central distinguishing feature of his idea is the pivot upon which the world-wide revolution then in process turned: What is rational is real; And what is real is rational. (Was vern\u00fcnftig ist, das ist wirklich; und was wirklich ist, das ist vern\u00fcnftig.)<\/p>\n<\/blockquote>\n\n<p>Another famous instance is his phrase \u201cthe Cunning of Reason\u201d (\u201cdie List der Vernunft\u201d) in his <em>Lectures on the Philosophy of History<\/em>. The idea is that individual humans who think they are simply acting in their own self-interest, with no regard for whatever shenanigans Spirit is trying to pull with world history at the moment, are, by being <em>rational<\/em>, unwittingly carrying out what <em>Reason<\/em> wants to have happen in the world:<\/p>\n\n<blockquote>\n  <p>The particular interest linked to passion is thus inseparable from the actualization of the universal principle; for the universal is the outcome of the particular and determinate, and from its negation. \u2026 This may be called the <em>Cunning of Reason<\/em>, that it allows the passions to work for it, while what it brings into existence suffers loss and injury.<\/p>\n<\/blockquote>\n\n<p>Reason as logic; Reason as idealism; Reason as Spirit moving history forward by tricking us rational peasants into doing what it wants. Is there some simple notion underlying these conceptions for Hegel? You\u2019d probably have to get a philosophy degree to prove it. As a programmer, polysemy frustrates me. As a poet, it fascinates me.<\/p>\n\n<div class=\"footnotes\" role=\"doc-endnotes\">\n  <ol>\n    <li id=\"fn:sadler\" role=\"doc-endnote\">\n      <p>For a more systematic reading by someone who is <em>not<\/em> an amateur like me, I recommend <a href=\"https:\/\/www.youtube.com\/playlist?list=PL4gvlOxpKKIgR4OyOt31isknkVH2Kweq2\">Gregory Sadler\u2019s \u201cHalf Hour Hegel\u201d<\/a>, a years-long lecture series in which Sadler attempts to explain or interpret <em>every section<\/em> of the book. 
I\u2019m not that patient!\u00a0<a href=\"#fnref:sadler\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:owl\" role=\"doc-endnote\">\n      <p>At the end of that preface, Hegel gives a beautiful metaphor for his conception of philosophy: <em>\u201cWhen philosophy paints its grey in grey, one form of life has become old, and by means of grey it cannot be rejuvenated, but only known. The owl of Minerva takes its flight only when the shades of night are gathering.\u201d (\u201cWenn die Philosophie ihr Grau in Grau malt, dann ist eine Gestalt des Lebens alt geworden, und mit Grau in Grau l\u00e4\u00dft sie sich nicht verj\u00fcngen, sondern nur erkennen; die Eule der Minerva beginnt erst mit der einbrechenden D\u00e4mmerung ihren Flug.\u201d)<\/em>\u00a0<a href=\"#fnref:owl\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n  <\/ol>\n<\/div>","author":{"name":"Loren Lugosch"},"summary":"I finally finished reading the big and baffling Phenomenology of Spirit by Georg Wilhelm Friedrich Hegel."},{"title":"First PC build","link":{"@attributes":{"href":"https:\/\/lorenlugosch.github.io\/posts\/2021\/03\/pc\/","rel":"alternate","type":"text\/html","title":"First PC build"}},"published":"2021-03-03T00:00:00-08:00","updated":"2021-03-03T00:00:00-08:00","id":"https:\/\/lorenlugosch.github.io\/posts\/2021\/03\/pc","content":"<p>My everyday work computer is a MacBook Pro that I\u2019ve had since 2013. It\u2019s a great machine and continues to serve me well, but I was moved in a moment of pandemic malaise to treat myself to a little upgrade.<\/p>\n\n<p>I decided to build (for the first time!) a desktop PC instead of getting another laptop, for a few reasons:<\/p>\n<ul>\n  <li>I\u2019m working from home now, and I\u2019ll probably have to for the foreseeable future.<\/li>\n  <li>As a holder of electrical and computer engineering degrees, I felt a little embarrassed that I had never built my own computer. (Cf. the <a href=\"https:\/\/www.imdb.com\/title\/tt0582462\/\">engine repair episode of <em>Frasier<\/em><\/a>.)<\/li>\n  <li>I wanted to try modern PC games that require a little more horsepower than <em>Undertale<\/em> and that I can\u2019t play on my Nintendo Switch.<\/li>\n<\/ul>\n\n<p>So here\u2019s how my first PC build went.<\/p>\n\n<h2 id=\"the-gpu\">The GPU<\/h2>\n\n<p>As you might guess from my website\u2019s favicon, and my all-around Stack More Layers demeanor, I started with the GPU.<\/p>\n\n<p>I wanted a not-too-expensive GPU that I could use both for gaming and for training reasonably large neural nets. (I do have access to lots of powerful GPUs in the Mila cluster, but it\u2019s nice not to have to share with other people sometimes, and to be able to leave a model training for, like, a month.)<\/p>\n\n<p>The best option looked to be the Nvidia (<a href=\"https:\/\/en.wikipedia.org\/wiki\/Talk%3ANvidia#Naming_Conventions\">nVidia? NVIDIA?<\/a>) <a href=\"https:\/\/www.nvidia.com\/en-us\/geforce\/graphics-cards\/30-series\/rtx-3080\/\">RTX 3080<\/a>. But whether due to pandemic disruptions, or Bitcoin bros, or BERT bros, or something else, it\u2019s off shelves everywhere.<\/p>\n\n<p>I waited a month to see if the RTX 3080 would come back in stock anywhere, but no luck. 
Instead, I consulted Tim Dettmers\u2019s magisterial <a href=\"https:\/\/timdettmers.com\/2020\/09\/07\/which-gpu-for-deep-learning\/\">deep learning GPU guide<\/a> and found that, among not-unavailable GPUs, the <a href=\"https:\/\/www.nvidia.com\/en-us\/geforce\/products\/10series\/ultimate-4k\/\">GTX 1080 Ti<\/a> had the best performance per dollar, and good performance for training transformer models. So I picked one up from a guy nearby in Montreal off Kijiji.<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/pc\/gpu.png\" style=\"max-width:75%\" \/><\/center>\n<p><br \/><\/p>\n\n<h2 id=\"the-rest\">The rest<\/h2>\n\n<p>Using <a href=\"https:\/\/pcpartpicker.com\/builds\/\">PCPartPicker\u2019s Completed Builds<\/a> search feature, I found a <a href=\"https:\/\/pcpartpicker.com\/b\/BPgJ7P\">build<\/a> for a PC based around the RTX 3080 and followed it to the letter. I probably could have optimized the components more for what I want, but having not ever built a PC before, I didn\u2019t want to accidentally select components that were incompatible with each other. (PCPartPicker does have a compatibility checker, but I don\u2019t know how idiot-proof it is.)<\/p>\n\n<p>Then I bought all the components, put it together, and breathed a sigh of relief when I pushed the power button for the first time and the fans turned on.<\/p>\n\n<p>The build process felt like assembling a LEGO (Lego?) set, but with the instructions distributed across multiple boxes and the additional stress of unfamiliar sensitive electronic components that would cost a lot of money to replace.<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/pc\/build.png\" style=\"max-width:75%\" \/><\/center>\n<p><br \/><\/p>\n\n<p>I started by reading the motherboard manual, which told me to put in the CPU. That was a little nerve-wracking because I really had to pull the little whammy bar (or so I\u2019m inclined to call it) <em>tight<\/em> to get the chip in, which felt wrong.<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/pc\/cpu.png\" style=\"max-width:75%\" \/><\/center>\n<p><br \/><\/p>\n\n<p>Then I installed the fan, with the help of a <a href=\"https:\/\/www.youtube.com\/watch?v=Ascz5P0-jyU\">video guide<\/a>. I wasn\u2019t even sure what the orientation of the fan within the case was supposed to be, so I tried to infer it from a photo taken by the creator of the build. Because I hadn\u2019t put the motherboard into the case yet and had to operate solely on other landmarks, this was mystifying, until I realized that the photo showed a machine with glowy RAM sticks instead of the non-glowy RAM I had. Thus I learned that people use RAM that <em>glows<\/em>.<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/pc\/vidicus_build.jpg\" style=\"max-width:75%\" \/><\/center>\n<p><br \/><\/p>\n\n<p>The PC case itself was a complex enough beast that I had to follow yet another (thorough, excellent) <a href=\"https:\/\/www.youtube.com\/watch?v=2mMBA4lzDvo\">video guide<\/a> on how to get it open and wire things up to the motherboard.<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/pc\/complete.png\" style=\"max-width:75%\" \/><\/center>\n<p><br \/><\/p>\n\n<p>It\u2019s alive! And it <em>glows<\/em>. But the RGB effects are a little distracting, so I\u2019ll probably deactivate them. Besides, I don\u2019t need them to feel like a real <em>Gamer<\/em>. 
As it is written: <em>\u201cWhen you Game, do not be like the hypocrites, for they love to be seen as Gamers with their showy RGB effects. Instead, when you Game, go into your room, close the door, and Game in secret.\u201d<\/em><\/p>\n\n<h2 id=\"some-extras\">Some extras<\/h2>\n\n<p>The build I followed did not include any peripheral devices, so I picked some up, again using PCPartPicker. I\u2019ve never been picky about keyboards, so I just filtered for low prices, sorted by rating, and picked the top one. Same for the monitor. For the mouse, I bought this nice glowy one with some sort of fangly beast on it. More RGB!<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/pc\/mouse.png\" style=\"max-width:75%\" \/><\/center>\n<p><br \/><\/p>\n\n<p>Also, apparently, motherboards cannot connect to the Internet without help! As I discovered when I turned on the PC. (Stop laughing!) I don\u2019t have an easily-reachable Ethernet cable, so I had to wait a couple days more for a Wi-Fi PCIe card to arrive before I could actually use the computer.<\/p>\n\n<p>Incidentally, the package for the Wi-Fi card featured maybe my favorite acronym ever:<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/pc\/fast.png\" style=\"max-width:75%\" \/><\/center>\n<p><br \/><\/p>\n\n<h2 id=\"testing-er-out\">Testing \u2018er out<\/h2>\n\n<p>For my first game, I picked up <em>Star Wars: Squadrons<\/em>, the recently released spiritual successor to the old <em>Rogue Squadron<\/em> series of dogfighting games. I\u2019m bad at it. But it\u2019s fun! And it runs great on my new machine. It\u2019s very pretty, but I haven\u2019t looked up how to do a screenshot yet, so here\u2019s a picture swiped from the Wikipedia article on the game.<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/pc\/squadrons.jpg\" style=\"max-width:75%\" \/><\/center>\n<p><br \/><\/p>\n\n<p>I also downloaded the original <em>Half-Life<\/em>, a game I bought on Steam a while back that my MacBook rudely made unplayable with the Catalina macOS update.<\/p>\n\n<p>Yay! Almost done. The last step for me is getting Linux dual-booted so that I can use all my machine learning tools.<\/p>","author":{"name":"Loren Lugosch"},"summary":"My everyday work computer is a MacBook Pro that I\u2019ve had since 2013. It\u2019s a great machine and continues to serve me well, but I was moved in a moment of pandemic malaise to treat myself to a little upgrade."},{"title":"End-to-end models falling short on SLURP","link":{"@attributes":{"href":"https:\/\/lorenlugosch.github.io\/posts\/2020\/12\/slurp\/","rel":"alternate","type":"text\/html","title":"End-to-end models falling short on SLURP"}},"published":"2020-12-19T00:00:00-08:00","updated":"2020-12-19T00:00:00-08:00","id":"https:\/\/lorenlugosch.github.io\/posts\/2020\/12\/slurp","content":"<p>There\u2019s a hot new dataset for spoken language understanding: <a href=\"https:\/\/www.aclweb.org\/anthology\/2020.emnlp-main.588.pdf\">SLURP<\/a>. 
I\u2019m excited about SLURP for a couple reasons:<\/p>\n\n<ul>\n  <li>It\u2019s way bigger and more challenging than existing open-source SLU datasets:<\/li>\n<\/ul>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/slurp\/dataset-info.png\" style=\"max-width:75%\" \/><\/center>\n<p><br \/><\/p>\n\n<ul>\n  <li>The authors found that an end-to-end model did not work:<\/li>\n<\/ul>\n\n<blockquote>\n  <p>We have tested several SOTA E2E-SLU systems on SLURP, including (Lugosch et al., 2019b) which produces SOTA results on the FSC corpus. However, re-training these models on this more complex domain did not converge or result in meaningful outputs. Note that these models were developed to solve much easier tasks (e.g. a single domain).<\/p>\n<\/blockquote>\n\n<p>Having <a href=\"https:\/\/lorenlugosch.github.io\/posts\/2020\/12\/slu\/\">just written<\/a> a SpeechBrain recipe for end-to-end SLU with my much simpler Timers and Such dataset, I thought I\u2019d give that recipe a whirl on SLURP. The model used in the recipe looks like this:<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/slu\/direct.png\" style=\"max-width:75%\" \/><\/center>\n<p><br \/><\/p>\n\n<p>The output dictionaries for SLURP look a bit different from the ones in Timers and Such or Fluent Speech Commands:<\/p>\n<pre><code style=\"font-size:14px\">\n{\n  \"scenario\": \"alarm\",\n  \"action\": \"query\",\n  \"entities\": [\n    {\"type\": \"event_name\", \"filler\": \"dance class\"}\n  ]\n}\n<\/code>\n<\/pre>\n\n<p>But\u2014much like a honey badger\u2014our autoregressive sequence-to-sequence model doesn\u2019t care. It just generates the dictionary character-by-character, no matter what the format looks like.<\/p>\n\n<p>The authors of the SLURP paper provide baseline results with a more task-specific model, <a href=\"https:\/\/www.aclweb.org\/anthology\/W19-5931.pdf\">HerMiT<\/a>, which uses an ASR model to predict a transcript and a conditional random field for each of (scenario, action, entities). The outputs are generated using the Viterbi algorithm.<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/slurp\/hermit.png\" style=\"max-width:75%\" \/><\/center>\n<p><br \/><\/p>\n\n<p>The authors kindly <a href=\"https:\/\/github.com\/pswietojanski\/slurp\">released<\/a> their tool for computing performance metrics, including a new \u201cSLU-F1\u201d metric they propose. 
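<\/p>\n\n<p>(As an aside, \u201cgenerates the dictionary character-by-character\u201d just means the target sequence is the serialized dictionary string. Here is a minimal sketch of how such a target could be built from a SLURP-style dictionary; the field names simply mirror the example above, and the exact serialization the recipe uses may differ.)<\/p>\n\n<pre><code style=\"font-size:14px\">\nimport json\n\nsemantics = {\n    'scenario': 'alarm',\n    'action': 'query',\n    'entities': [{'type': 'event_name', 'filler': 'dance class'}],\n}\n\n# Serialize the dictionary to a string and treat each character as one output token.\ntarget_string = json.dumps(semantics)\ntarget_tokens = list(target_string)  # ['{', '\"', 's', 'c', 'e', 'n', ...]\n<\/code><\/pre>\n\n<p>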
I used this tool and got the following results <em>(EDIT: I left the model training a little longer and updated the numbers EDIT EDIT: I got rid of a coverage penalty term in the beam search and updated the numbers again)<\/em>:<\/p>\n\n<table>\n  <thead>\n    <tr>\n      <th>Model<\/th>\n      <th><code class=\"language-plaintext highlighter-rouge\">scenario<\/code> (accuracy)<\/th>\n      <th><code class=\"language-plaintext highlighter-rouge\">action<\/code> (accuracy)<\/th>\n      <th><code class=\"language-plaintext highlighter-rouge\">intent<\/code> (accuracy)<\/th>\n      <th>Word-F1<\/th>\n      <th>Char-F1<\/th>\n      <th>SLU-F1<\/th>\n    <\/tr>\n  <\/thead>\n  <tbody>\n    <tr>\n      <td>End-to-end<\/td>\n      <td><s>71.17<\/s><br \/>81.73<\/td>\n      <td><s>65.43<\/s><br \/>77.11<\/td>\n      <td><s>61.85<\/s><br \/>75.05<\/td>\n      <td><s>45.57<\/s><br \/>61.24<\/td>\n      <td><s>49.23<\/s><br \/>65.42<\/td>\n      <td><s>47.33<\/s><br \/>63.26<\/td>\n    <\/tr>\n    <tr>\n      <td>HerMiT<\/td>\n      <td><strong>85.69<\/strong><\/td>\n      <td><strong>81.42<\/strong><\/td>\n      <td><strong>78.33<\/strong><\/td>\n      <td><strong>69.34<\/strong><\/td>\n      <td><strong>72.39<\/strong><\/td>\n      <td><strong>70.83<\/strong><\/td>\n    <\/tr>\n  <\/tbody>\n<\/table>\n\n<p>Turns out HerMiT does way better than our end-to-end model. This isn\u2019t totally surprising because HerMiT has a lot of structure built in that our autoregressive model has to learn from scratch. For instance, I don\u2019t think it\u2019s possible for their model to output <code class=\"language-plaintext highlighter-rouge\">\"brussia\"<\/code> (one of the more charming output mistakes I noticed early in training).<\/p>\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/slurp\/brussia.png\" style=\"max-width:75%\" \/><\/center>\n<p><br \/>\nAnother difference is that their ASR model is trained on Multi-ASR, a massive <strong>24,000 hour<\/strong> dataset formed by Captain-Planeting the LibriSpeech, Switchboard, Fisher, CommonVoice, AMI, and ICSI datasets\u2014whereas my encoder is only pre-trained using the 1,000 hours of LibriSpeech.<\/p>\n\n<p>So: how much of the gap is due to more\/better audio? The authors also report some results when applying HerMiT to the gold transcripts instead of the ASR output; similarly, we can feed the gold transcripts into our sequence-to-sequence model instead of audio and compare the results <em>(EDIT again, I let the model train a bit longer\/got rid of the pesky coverage penalty and updated the numbers)<\/em>:<\/p>\n\n<table>\n  <thead>\n    <tr>\n      <th>Model<\/th>\n      <th><code class=\"language-plaintext highlighter-rouge\">scenario<\/code> (accuracy)<\/th>\n      <th><code class=\"language-plaintext highlighter-rouge\">action<\/code> (accuracy)<\/th>\n      <th><code class=\"language-plaintext highlighter-rouge\">intent<\/code> (accuracy)<\/th>\n    <\/tr>\n  <\/thead>\n  <tbody>\n    <tr>\n      <td>End-to-end (text input, gold transcripts)<\/td>\n      <td><s>89.91<\/s><br \/> <strong>90.81<\/strong><\/td>\n      <td><s>86.54<\/s><br \/> <strong>88.29<\/strong><\/td>\n      <td><s>85.43<\/s><br \/> <strong>87.28<\/strong><\/td>\n    <\/tr>\n    <tr>\n      <td>HerMiT (gold transcripts)<\/td>\n      <td>90.15<\/td>\n      <td>86.99<\/td>\n      <td>84.84<\/td>\n    <\/tr>\n  <\/tbody>\n<\/table>\n\n<p>Here the simple sequence-to-sequence model actually does about as well as HerMiT <em>(EDIT actually a bit better!)<\/em>. 
This suggests that the audio side of things is more where our problems lie.<\/p>\n\n<p><em>In summary,<\/em> I have attempted to defend the honor of end-to-end models: we can indeed train one on SLURP and get semi-reasonable outputs. Note, though, that I\u2019ve done no hyperparameter tuning on the model (except to increase the number of training epochs and getting rid of the coverage penalty term), so it\u2019s possible we could do better with a little elbow grease\u2014maybe starting by swapping out the now-unfashionable RNNs I used in the encoder and decoder with \u2728Transformers\u2728.<\/p>","author":{"name":"Loren Lugosch"},"category":{"@attributes":{"term":"sequence modeling"}},"summary":"There\u2019s a hot new dataset for spoken language understanding: SLURP. I\u2019m excited about SLURP for a couple reasons:"},{"title":"Siri from scratch! (Not really.)","link":{"@attributes":{"href":"https:\/\/lorenlugosch.github.io\/posts\/2020\/12\/slu\/","rel":"alternate","type":"text\/html","title":"Siri from scratch! (Not really.)"}},"published":"2020-12-11T00:00:00-08:00","updated":"2020-12-11T00:00:00-08:00","id":"https:\/\/lorenlugosch.github.io\/posts\/2020\/12\/slu","content":"<p>I make fairly heavy use of the voice assistant on my phone for things like setting timers while cooking. As a result, when I spent some time this summer at my in-laws\u2019 place\u2014where there was no cell signal and not-very-good Wi-Fi\u2014I often tried using Siri only to get a sad little \u201csorry, no Internet :(\u201c response. (#FirstWorldProblems.)<\/p>\n\n<p>This reminded me of a tweet I saw a while back:<\/p>\n\n<blockquote class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">like 90% of my voice assistant usage is setting timers, math, and unit conversions<br \/><br \/>can we ship an offline on-device model that does this well and nothing else?<\/p>&mdash; Arkadiy Kukarkin (@parkan) <a href=\"https:\/\/twitter.com\/parkan\/status\/1119334813960429569?ref_src=twsrc%5Etfw\">April 19, 2019<\/a><\/blockquote>\n<script async=\"\" src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script>\n\n<p>Now, it just so happens that I 1) am a big fan of doing things offline rather than in the cloud, 2) have some experience training speech models, and 3) wanted to procrastinate. (<em>The perfect storm.<\/em>) So here\u2019s the record of me taking a crack at it.<\/p>\n\n<h2 id=\"the-goal\">The goal<\/h2>\n<p>The goal is to make a box that can take as input the speech signal and output what the speaker wants. The output takes the form of a dictionary containing the semantics of the utterance<sup id=\"fnref:RL\" role=\"doc-noteref\"><a href=\"#fn:RL\" class=\"footnote\" rel=\"footnote\">1<\/a><\/sup>\u2014i.e. the intent, slots, and slot values, and maybe some other fancy things like Named Entities. I shall casually refer to this output simply as the \u201cintent\u201d, a common synecdoche. 
This is called \u201cspoken language understanding\u201d (SLU).<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/slu\/SLU.png\" style=\"max-width:100%\" \/><\/center>\n\n<h2 id=\"the-decoupled-approach\">The \u201cdecoupled\u201d approach<\/h2>\n\n<p>The first thing I tried was a straightforward \u201cdecoupled\u201d approach: I trained a general-purpose automatic speech recognition (ASR) part and a separate domain-specific natural language understanding (NLU) part.<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/slu\/decoupled.png\" style=\"max-width:100%\" \/><\/center>\n<p><br \/>\nTo train the ASR part, I used LibriSpeech, a big dataset of American English speakers reading audiobooks. (Or rather, my SpeechBrain colleagues did, and I just loaded their best model checkpoint. Thanks, Ju-Chieh + Mirco + Abdel + Peter!)<\/p>\n\n<p>To train the NLU part, we need text labeled with intents. For this, I wrote a script to generate a bunch of labeled random phrases for the four types of commands I wanted: setting timers, converting units (length, volume, temperature), setting alarms, and simple math. Here\u2019s a few examples:<\/p>\n\n<pre><code style=\"font-size:14px\">\n(\"how many inches are there in 256 centimeters\",\n{\n  'intent': 'UnitConversion', \n  'slots': {\n    'unit1': 'centimeter', \n    'unit2': 'inch', \n    'amount': 256\n  }\n})\n\n(\"set my alarm for 8:03AM\",\n{\n  'intent': 'SetAlarm', \n  'slots': {\n    'am_or_pm': 'AM', \n    'alarm_hour': 8, \n    'alarm_minute': 3\n  }\n})\n\n(\"what's 37.67 minus 75.7\",\n{\n  'intent': 'SimpleMath', \n  'slots': {\n    'number1': 37.67, \n    'number2': 75.7, \n    'op': ' minus '\n  }\n})\n<\/code><\/pre>\n\n<p>I trained an attention model to ingest the transcript and autoregressively predict these dictionaries as strings, one character at a time.<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/slu\/nlu.png\" style=\"max-width:75%\" \/><\/center>\n<p><br \/>\nThis model gets perfect accuracy on the test transcripts, which makes sense, since this is a pretty simple domain.<\/p>\n\n<p>Does that mean our system as a whole will get perfect accuracy on test audio? No, because there is the chance that the ASR part will incorrectly transcribe the input and that the NLU part will fail as a result. To measure how well our system works, we need some actual in-domain audio data.<\/p>\n\n<h3 id=\"some-end-to-end-test-and-training-data\">Some end-to-end test (and training) data<\/h3>\n\n<p>So I recorded a few friends and colleagues speaking the generated prompts. My recordees kindly gave me their consent to release their anonymized recordings, which you can find <a href=\"https:\/\/zenodo.org\/record\/4110812\">here<\/a>. I manually segmented and cleaned (= fixed the label, when someone misspoke) their recordings, yielding me a modest 271 audios. (This exercise builds character. I highly recommend it! 
<a href=\"https:\/\/karpathy.github.io\/2019\/04\/25\/recipe\/\">So does Andrej Karpathy<\/a>.)<\/p>\n\n<p>I split the recordings into train\/dev\/test sets so that each speaker was in only one of the sets, with:<\/p>\n<ul>\n  <li>144 audios (4 speakers) for the train set (<code class=\"language-plaintext highlighter-rouge\">train-real<\/code>),<\/li>\n  <li>72 audios (2 speakers) for the dev set (<code class=\"language-plaintext highlighter-rouge\">dev-real<\/code>), and<\/li>\n  <li>55 audios (5 speakers) for the test set (<code class=\"language-plaintext highlighter-rouge\">test-real<\/code>).<\/li>\n<\/ul>\n\n<p>(For the decoupled approach, we don\u2019t need audio during training, but we will shortly for an alternative approach.)<\/p>\n\n<p>It\u2019s hard to get meaningful accuracy estimates with only 55 test examples, so I also generated a bunch of <em>synthetic<\/em> audio (<code class=\"language-plaintext highlighter-rouge\">train-synth<\/code>, <code class=\"language-plaintext highlighter-rouge\">dev-synth<\/code>, <code class=\"language-plaintext highlighter-rouge\">test-synth<\/code>) by synthesizing all my NLU training text data with Facebook\u2019s <a href=\"https:\/\/github.com\/facebookarchive\/loop\/\">VoiceLoop<\/a> text-to-speech model.<sup id=\"fnref:TTS\" role=\"doc-noteref\"><a href=\"#fn:TTS\" class=\"footnote\" rel=\"footnote\">2<\/a><\/sup> Like I did for the real speakers, I split the 22 synthetic speakers into train\/dev\/test sets with no speaker overlap.<\/p>\n\n<p>In total, I generated around 200,000 audios. I couldn\u2019t figure out how to parallelize the generation process\u2014with simple multithreading I ran into some concurrency problems with the vocoder tool called by VoiceLoop\u2014so I just did it sequentially, which took more than a week to finish.<\/p>\n\n<p>Lucky for you, I\u2019ve uploaded both the real speech and synthesized speech <a href=\"https:\/\/zenodo.org\/record\/4110812\">here<\/a>! I call the complete dataset <strong>Timers and Such v0.1<\/strong>. It\u2019s v0.1 because the test set of real speakers is probably too small for it to be used to meaningfully compare results from different approaches. It would be nice to scale this up to more real speakers and make a v1.0 that people can actually use for R&amp;D.<\/p>\n\n<h3 id=\"first-results\">First results<\/h3>\n\n<p>I tested the system out on <code class=\"language-plaintext highlighter-rouge\">test-real<\/code> and <code class=\"language-plaintext highlighter-rouge\">test-synth<\/code> and measured the overall accuracy (= if any slot is wrong, the whole utterance is considered wrong). Here\u2019s the results, averaged over 5 seeds:<\/p>\n\n<table>\n  <thead>\n    <tr>\n      <th>Model<\/th>\n      <th><code class=\"language-plaintext highlighter-rouge\">test-real<\/code><\/th>\n      <th><code class=\"language-plaintext highlighter-rouge\">test-synth<\/code><\/th>\n    <\/tr>\n  <\/thead>\n  <tbody>\n    <tr>\n      <td>Decoupled<\/td>\n      <td><strong>23.6%<\/strong> $\\pm$ 7.3%<\/td>\n      <td><strong>18.7%<\/strong> $\\pm$ 5.1%<\/td>\n    <\/tr>\n  <\/tbody>\n<\/table>\n\n<p>Wow, it\u2019s only getting the answer right a fifth of the time. That sucks! 
What could the problem be?<\/p>\n\n<p>Let\u2019s look at some of the ASR outputs for <code class=\"language-plaintext highlighter-rouge\">test-real<\/code>:<\/p>\n\n<hr \/>\n<pre>\n<code style=\"font-size:14px\">\nTrue transcript: \"SET A TIMER FOR EIGHT MINUTES\"\nASR transcript: \"SAID A TIMID FOR EIGHT MINUTES\"\n\nTrue transcript: \"HOW MANY TEASPOONS ARE THERE IN SIXTY SEVEN TABLESPOONS\"\nASR transcript: \"AWMAN IN TEASPOILS ADHERED IN SEVEN ESSAYS SEVERN TABLESPOONS\"\n<\/code>\n<\/pre>\n<hr \/>\n\n<p>Now, this ASR model gets a WER of 3% (<a href=\"https:\/\/paperswithcode.com\/sota\/speech-recognition-on-librispeech-test-clean\">not SOTA, but good<\/a>) on the <code class=\"language-plaintext highlighter-rouge\">test-clean<\/code> subset of LibriSpeech. So why do the outputs look so bad here?<\/p>\n\n<ul>\n  <li>\n    <p>One issue is <strong>accent mismatch<\/strong>. LibriSpeech has only American English speakers, whereas only 3 of 11 real speakers in Timers and Such have American accents. There\u2019s not much we could do about this, save re-train the ASR model on a more diverse set of accents.<\/p>\n  <\/li>\n  <li>\n    <p>The other issue is <strong>language model (LM) mismatch<\/strong>. The ASR model has an LM trained on LibriSpeech\u2019s LM text data. That text data comes from Project Gutenberg books, which is a very different domain: for instance, \u201cSAID\u201d is more likely to appear at the beginning of a transcript from a book than \u201cSET\u201d.<\/p>\n  <\/li>\n<\/ul>\n\n<p>So I trained an LM on the Timers and Such transcripts, and used that as the ASR model\u2019s LM instead of the LibriSpeech LM. This ends up fixing a lot of ASR mistakes, but not all of them.<sup id=\"fnref:FST\" role=\"doc-noteref\"><a href=\"#fn:FST\" class=\"footnote\" rel=\"footnote\">3<\/a><\/sup><\/p>\n\n<table>\n  <thead>\n    <tr>\n      <th>Model<\/th>\n      <th><code class=\"language-plaintext highlighter-rouge\">test-real<\/code><\/th>\n      <th><code class=\"language-plaintext highlighter-rouge\">test-synth<\/code><\/th>\n    <\/tr>\n  <\/thead>\n  <tbody>\n    <tr>\n      <td>Decoupled (LibriSpeech LM)<\/td>\n      <td>23.6% $\\pm$ 7.3%<\/td>\n      <td>18.7% $\\pm$ 5.1%<\/td>\n    <\/tr>\n    <tr>\n      <td>Decoupled (Timers and Such LM)<\/td>\n      <td><strong>44.4%<\/strong> $\\pm$ 6.9%<\/td>\n      <td><strong>31.9%<\/strong> $\\pm$ 3.9%<\/td>\n    <\/tr>\n  <\/tbody>\n<\/table>\n\n<h2 id=\"sort-of-end-to-end-training-the-multistage-approach\">Sort-of end-to-end training: the \u201cmultistage\u201d approach<\/h2>\n\n<p>Instead of training on the true transcript, we might want to train the NLU model using the ASR transcript, to make the NLU part more resilient to ASR errors. Google <a href=\"https:\/\/arxiv.org\/abs\/1809.09190\">calls this<\/a> a \u201cmultistage\u201d end-to-end SLU model: it still uses distinct ASR and NLU parts, but the complete system is trained on audio data.<\/p>\n\n<!-- [^decoupled]: There are some [software engineering benefits](https:\/\/papers.nips.cc\/paper\/2015\/file\/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf) to _not_ doing things end-to-end. By separating the problem into two stages, the ASR people do not need to care about the intent structure and can focus on optimizing word error rate, and the NLU people do not need to care about sampling rates and FFTs and can focus on optimizing semantic accuracy. 
-->\n\n<p>Running this experiment, multistage training works a lot better than the decoupled approach, with or without an appropriate LM.<\/p>\n\n<table>\n  <thead>\n    <tr>\n      <th>Model<\/th>\n      <th><code class=\"language-plaintext highlighter-rouge\">test-real<\/code><\/th>\n      <th><code class=\"language-plaintext highlighter-rouge\">test-synth<\/code><\/th>\n    <\/tr>\n  <\/thead>\n  <tbody>\n    <tr>\n      <td>Decoupled (LibriSpeech LM)<\/td>\n      <td>23.6% $\\pm$ 7.3%<\/td>\n      <td>18.7% $\\pm$ 5.1%<\/td>\n    <\/tr>\n    <tr>\n      <td>Decoupled (Timers and Such LM)<\/td>\n      <td>44.4% $\\pm$ 6.9%<\/td>\n      <td>31.9% $\\pm$ 3.9%<\/td>\n    <\/tr>\n    <tr>\n      <td>Multistage (LibriSpeech LM)<\/td>\n      <td>69.8% $\\pm$ 3.5%<\/td>\n      <td>69.9% $\\pm$ 2.5%<\/td>\n    <\/tr>\n    <tr>\n      <td>Multistage (Timers and Such LM)<\/td>\n      <td><strong>75.3%<\/strong> $\\pm$ 4.2%<\/td>\n      <td><strong>73.1%<\/strong> $\\pm$ 8.7%<\/td>\n    <\/tr>\n  <\/tbody>\n<\/table>\n\n<h2 id=\"for-real-end-to-end-training-the-direct-approach\">For-real end-to-end training: the \u201cdirect\u201d approach<\/h2>\n\n<p>Still, there\u2019s a few disadvantages to the multistage ASR-NLU approach.<\/p>\n<ul>\n  <li>We need an intermediate search step to predict a transcript during training. The search is inherently sequential and ends up being the slowest part of training.<\/li>\n  <li>It\u2019s difficult<sup id=\"fnref:backprop\" role=\"doc-noteref\"><a href=\"#fn:backprop\" class=\"footnote\" rel=\"footnote\">4<\/a><\/sup> to backpropagate through the discrete search into the encoder, so the model can\u2019t learn to give more priority to recognizing words that are more relevant to the SLU task, as opposed to less informative words like \u201cthe\u201d and \u201cplease\u201d.<\/li>\n  <li>Ultimately, we don\u2019t actually care about the transcript for this application: we just want the intent. By predicting the transcript, we\u2019re wasting FLOPs.<sup id=\"fnref:non-transcript\" role=\"doc-noteref\"><a href=\"#fn:non-transcript\" class=\"footnote\" rel=\"footnote\">5<\/a><\/sup><\/li>\n<\/ul>\n\n<p>So: why not train a model to just map directly from speech to intent?<sup id=\"fnref:direct\" role=\"doc-noteref\"><a href=\"#fn:direct\" class=\"footnote\" rel=\"footnote\">6<\/a><\/sup> To quote Vapnik: <em>\u201cWhen solving a problem of interest, do not solve a more general problem as an intermediate step.\u201d<\/em><sup id=\"fnref:vapnik\" role=\"doc-noteref\"><a href=\"#fn:vapnik\" class=\"footnote\" rel=\"footnote\">7<\/a><\/sup><\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/slu\/direct.png\" style=\"max-width:75%\" \/><\/center>\n<p><br \/>\nThe direct approach faces an additional difficulty: it has to learn what speech sounds like from scratch. To make the comparison with the ASR-based models more fair, we can use transfer learning. This can be done simply by popping the encoder out of the pre-trained LibriSpeech ASR model and using it as a feature extractor in the SLU model.<sup id=\"fnref:pretrain\" role=\"doc-noteref\"><a href=\"#fn:pretrain\" class=\"footnote\" rel=\"footnote\">8<\/a><\/sup><\/p>\n\n<p>We get pretty good results with this approach: around the same performance as the multistage model with an appropriate language model (slightly worse, within a standard deviation). 
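<\/p>\n\n<p>(To make the \u201cpopping the encoder out\u201d idea a bit more concrete, here is a minimal sketch of what a direct model can look like. The module names are made up for illustration and do not correspond to the actual SpeechBrain recipe.)<\/p>\n\n<pre><code style=\"font-size:14px\">\nimport torch\n\nclass DirectSLU(torch.nn.Module):\n    def __init__(self, pretrained_encoder, num_characters, hidden_dim=256):\n        super().__init__()\n        # Encoder lifted from a pre-trained LibriSpeech ASR model\n        # (assumed here to output hidden_dim features per frame).\n        self.encoder = pretrained_encoder\n        # Autoregressive decoder that emits the semantics dictionary\n        # one character at a time.\n        self.embedding = torch.nn.Embedding(num_characters, hidden_dim)\n        self.decoder_rnn = torch.nn.GRU(hidden_dim, hidden_dim, batch_first=True)\n        self.output_layer = torch.nn.Linear(hidden_dim, num_characters)\n\n    def forward(self, speech_features, previous_characters):\n        # A real model would attend over the encoder states;\n        # we just mean-pool here to keep the sketch short.\n        encoder_states = self.encoder(speech_features)     # (batch, time, hidden_dim)\n        summary = encoder_states.mean(dim=1).unsqueeze(0)  # (1, batch, hidden_dim)\n        embedded = self.embedding(previous_characters)\n        decoder_states, _ = self.decoder_rnn(embedded, summary)\n        return self.output_layer(decoder_states)           # logits for the next character\n<\/code><\/pre>\n\n<p>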
The direct model also trained a lot faster: 1h 42m for one epoch (running on a beastly Quadro RTX 8000), compared with 2h 53m for the multistage model on the same machine.<\/p>\n\n<table>\n  <thead>\n    <tr>\n      <th>Model<\/th>\n      <th><code class=\"language-plaintext highlighter-rouge\">test-real<\/code><\/th>\n      <th><code class=\"language-plaintext highlighter-rouge\">test-synth<\/code><\/th>\n    <\/tr>\n  <\/thead>\n  <tbody>\n    <tr>\n      <td>Decoupled (LibriSpeech LM)<\/td>\n      <td>23.6% $\\pm$ 7.3%<\/td>\n      <td>18.7% $\\pm$ 5.1%<\/td>\n    <\/tr>\n    <tr>\n      <td>Decoupled (Timers and Such LM)<\/td>\n      <td>44.4% $\\pm$ 6.9%<\/td>\n      <td>31.9% $\\pm$ 3.9%<\/td>\n    <\/tr>\n    <tr>\n      <td>Multistage (LibriSpeech LM)<\/td>\n      <td>69.8% $\\pm$ 3.5%<\/td>\n      <td>69.9% $\\pm$ 2.5%<\/td>\n    <\/tr>\n    <tr>\n      <td>Multistage (Timers and Such LM)<\/td>\n      <td><strong>75.3%<\/strong> $\\pm$ 4.2%<\/td>\n      <td>73.1% $\\pm$ 8.7%<\/td>\n    <\/tr>\n    <tr>\n      <td>Direct<\/td>\n      <td>74.5% $\\pm$ 6.9%<\/td>\n      <td><strong>96.1%<\/strong> $\\pm$ 0.2%<\/td>\n    <\/tr>\n  <\/tbody>\n<\/table>\n\n<p>Notice anything interesting? The direct model performs a lot better on <code class=\"language-plaintext highlighter-rouge\">test-synth<\/code> than the other models. This makes sense: the direct model has access to the raw speech features, so it can learn the idiosyncrasies of the speech synthesizer and recognize the synthetic test speech more easily. (Of course, we don\u2019t care about the performance on synthetic speech; we only care about how well this works for human speakers.)<\/p>\n\n<h2 id=\"are-we-done\">Are we done?<\/h2>\n<p>Did we achieve our goal from the outset of \u201can offline on-device model that [recognizes numeric commands] well and nothing else\u201d?<\/p>\n\n<ul>\n  <li>\n    <p><em>As for \u201con-device\u201d:<\/em> The model has 170 million parameters, most of which live in the pre-trained ASR part. This requires 680MB of storage using single-precision floats; for reference, right now the maximum app download size for Android is 100MB, so we would probably have a hard time putting this on a phone. We would need to fiddle with the ASR hyperparameters a bit to shrink the encoder, but this is definitely doable. In fact, the direct SLU models I trained in my previous papers had a little over 1 million parameters\u2014this was one of the main selling points for end-to-end SLU models in the <a href=\"https:\/\/research.fb.com\/wp-content\/uploads\/2018\/02\/towards-end-to-end-spoken-language-understanding.pdf\">original paper<\/a>.<\/p>\n  <\/li>\n  <li>\n    <p><em>As for \u201cwell\u201d:<\/em> Does 75% accuracy count as good? Probably not, unless you\u2019re OK with your cooking timer being set for 10 hours instead of 10 minutes now and then. For starters, we saw that the training data\u2014LibriSpeech and the synthetic Timers and Such speech\u2014is almost entirely American English, so we would need to collect some more accents for the training data. But I\u2019m American, so it works well for me! (The linguistic equivalent of \u201cit runs on my machine\u201d.)<\/p>\n  <\/li>\n<\/ul>\n\n<h2 id=\"code-and-data\">Code and data<\/h2>\n<p>I wrote the code for all these experiments as a set of recipes to be included in the SpeechBrain toolkit. It\u2019s not available to the public yet, but it will be soon. 
In the meantime, you can train a model on <a href=\"https:\/\/zenodo.org\/record\/4110812\">Timers and Such v0.1<\/a> using my older end-to-end SLU code <a href=\"https:\/\/github.com\/lorenlugosch\/end-to-end-SLU\">here<\/a>\u2014though I would recommend waiting until SpeechBrain comes out, since my SpeechBrain SLU recipes are a lot cleaner and easier to use.<\/p>\n<div class=\"footnotes\" role=\"doc-endnotes\">\n  <ol>\n    <li id=\"fn:RL\" role=\"doc-endnote\">\n      <p>The disadvantage of formulating the problem this way is that we need someone to design this output format and write a program to map the intent to a sequence of actions. A better way to formulate the problem might be to use reinforcement learning: let the agent act in response to requests and learn to act to maximize some reward signal (\u201cdid the agent do what I want?\u201d), without the need for any hard-coded semantics, as was originally suggested in <a href=\"https:\/\/pdfs.semanticscholar.org\/9f62\/db97e65e042657d43b5739e9bbdba14ed159.pdf\">this paper<\/a>. The question then becomes: what should our action space look like? High-level actions, like pushing buttons in an app (inflexible, but easier to learn)? Or low-level actions, like reading and writing to locations in memory (flexible, but more difficult to learn)? And can we train the model with some sort of imitation learning or simulation, so that we don\u2019t have to wait forever for it to learn from human feedback? Interesting and challenging questions that I won\u2019t linger on here, but which I\u2019d like to think more about in the future.\u00a0<a href=\"#fnref:RL\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:TTS\" role=\"doc-endnote\">\n      <p>I wrote a <a href=\"https:\/\/ieeexplore.ieee.org\/abstract\/document\/9053063\">whole paper<\/a> about this idea!\u00a0<a href=\"#fnref:TTS\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:FST\" role=\"doc-endnote\">\n      <p>We could probably do better using an <a href=\"https:\/\/arxiv.org\/abs\/2010.01003\">FST<\/a>-based speech recognizer, which would allow us to perfectly constrain the model to only outputting sentences that fit a certain grammar (that is, the \u201cG\u201d part of \u201c<a href=\"https:\/\/cs.nyu.edu\/~mohri\/pub\/csl01.pdf\">HCLG<\/a>\u201d).\u00a0<a href=\"#fnref:FST\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:backprop\" role=\"doc-endnote\">\n      <p>Though not impossible, using tricks like Gumbel-Softmax and the straight-through estimator.\u00a0<a href=\"#fnref:backprop\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:non-transcript\" role=\"doc-endnote\">\n      <p>A fourth (more subtle) disadvantage of ASR-based SLU models is that the speech signal may contain information that is not present in the transcript. For example, sarcasm is not always apparent from just looking at a transcript. 
This is not really relevant for the simple numeric commands we\u2019re dealing with here, but for more general-purpose robust language understanding in robots of the future, non-transcript information might be crucial, in addition to other multimodal information like visual cues.\u00a0<a href=\"#fnref:non-transcript\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:direct\" role=\"doc-endnote\">\n      <p>This was proposed by my friend Dima Serdyuk in his <a href=\"https:\/\/research.fb.com\/wp-content\/uploads\/2018\/02\/towards-end-to-end-spoken-language-understanding.pdf\">2018 ICASSP paper<\/a>. A few other groups\u2014including <a href=\"fluent.ai\">Fluent.ai<\/a>, where I worked before my PhD\u2014had had similar ideas earlier, but as far as I\u2019m aware, Dima was the first to get a truly end-to-end SLU model to work, without any sort of ASR-based inductive bias or transfer learning.\u00a0<a href=\"#fnref:direct\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:vapnik\" role=\"doc-endnote\">\n      <p>Actually, there are tons of examples of solving a more general problem yielding better results for the problem of interest, like language model pre-training for text classification. But as the amount of data we have for the problem of interest goes to infinity, Vapnik is right.\u00a0<a href=\"#fnref:vapnik\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:pretrain\" role=\"doc-endnote\">\n      <p>In <a href=\"https:\/\/arxiv.org\/abs\/1904.03670\">this paper<\/a>, I proposed a somewhat more complicated way of pre-training the encoder using phoneme and word targets from a forced aligner. The idea was that using word targets would be ideal (and more amenable to an idea I had for using pre-trained word embeddings to help the model understand the meaning of synonyms not present in the SLU training set), but using too many word targets would be expensive, which is why we used phoneme targets as well. Using the pre-trained ASR model\u2019s encoder was a lot simpler to implement, though I haven\u2019t done a fair comparison with the forced alignment approach yet.\u00a0<a href=\"#fnref:pretrain\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n  <\/ol>\n<\/div>","author":{"name":"Loren Lugosch"},"summary":"I make fairly heavy use of the voice assistant on my phone for things like setting timers while cooking. As a result, when I spent some time this summer at my in-laws\u2019 place\u2014where there was no cell signal and not-very-good Wi-Fi\u2014I often tried using Siri only to get a sad little \u201csorry, no Internet :(\u201c response. (#FirstWorldProblems.)"},{"title":"Sequence-to-sequence learning with Transducers","link":{"@attributes":{"href":"https:\/\/lorenlugosch.github.io\/posts\/2020\/11\/transducer\/","rel":"alternate","type":"text\/html","title":"Sequence-to-sequence learning with Transducers"}},"published":"2020-11-16T00:00:00-08:00","updated":"2020-11-16T00:00:00-08:00","id":"https:\/\/lorenlugosch.github.io\/posts\/2020\/11\/transducer","content":"<p>The <strong>Transducer<\/strong> (sometimes called the \u201cRNN Transducer\u201d or \u201cRNN-T\u201d, though it need not use RNNs) is a sequence-to-sequence model proposed by Alex Graves in \u201c<a href=\"https:\/\/arxiv.org\/abs\/1211.3711\">Sequence Transduction with Recurrent Neural Networks<\/a>\u201d. 
The paper was published at the <a href=\"https:\/\/sites.google.com\/site\/representationworkshopicml2012\/\">ICML 2012 Workshop on Representation Learning<\/a>. Graves showed that the Transducer was a sensible model to use for speech recognition, achieving good results on a small dataset (TIMIT).<\/p>\n\n<p>Since then, the Transducer hasn\u2019t been used as much compared to <a href=\"https:\/\/www.cs.toronto.edu\/~graves\/icml_2006.pdf\">CTC<\/a> models (like <a href=\"https:\/\/arxiv.org\/abs\/1512.02595\">Deep Speech 2<\/a>) or <a href=\"https:\/\/arxiv.org\/abs\/1409.0473\">attention<\/a> models (like <a href=\"https:\/\/arxiv.org\/abs\/1508.01211\">Listen, Attend, and Spell<\/a>). Last year, however, the Transducer got some serious attention when Google researchers showed that it could enable <a href=\"https:\/\/ai.googleblog.com\/2019\/03\/an-all-neural-on-device-speech.html\">entirely on-device low-latency speech recognition<\/a> for Pixel phones. And more recently, the Transducer was used to achieve a <a href=\"https:\/\/arxiv.org\/pdf\/2010.10504.pdf\">new state-of-the-art<\/a> word error rate for the LibriSpeech benchmark.<sup id=\"fnref:foot\" role=\"doc-noteref\"><a href=\"#fn:foot\" class=\"footnote\" rel=\"footnote\">1<\/a><\/sup><\/p>\n\n<p>So what is the Transducer, and when might you want to use it? In this post, we will see where Transducer models fit in with other sequence-to-sequence models and a detailed explanation of how they work.<\/p>\n\n<p>This post also includes a Colab notebook with a PyTorch implementation of the Transducer for a toy problem\u2014which you can skip straight to <a href=\"https:\/\/github.com\/lorenlugosch\/transducer-tutorial\/blob\/main\/transducer_tutorial_example.ipynb\">here<\/a>.<\/p>\n\n<h2 id=\"attention-models\">Attention models<\/h2>\n<p>The problems we\u2019re interested in here are <strong>sequence transduction<\/strong> problems, where the goal is to map an input sequence $\\mathbf{x} = \\{x_1, x_2, \\dots x_T\\}$ to an output sequence $\\mathbf{y} = \\{y_1, y_2, \\dots, y_U\\}$.<\/p>\n\n<p>The go-to models for sequence transduction problems are attention-based sequence-to-sequence models, like <a href=\"https:\/\/arxiv.org\/abs\/1409.0473\">RNN encoder-decoder models<\/a> or <a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Transformers<\/a>.<\/p>\n\n<p>Here\u2019s a diagram of an attention model. (In the diagrams below, I\u2019ll use <span style=\"color:red\"><strong>red<\/strong><\/span> to indicate that a module has access to $\\mathbf{x}$, <span style=\"color:blue\"><strong>blue<\/strong><\/span> to indicate access to $\\mathbf{y}$, and <span style=\"color:purple\"><strong>purple<\/strong><\/span> to indicate access to both $\\mathbf{x}$ and $\\mathbf{y}$.)<\/p>\n\n<!-- <describe attention model> -->\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/transducer\/attention-model.png\" style=\"max-width:75%\" \/><\/center>\n<p><br \/>\nThe model encodes the input $\\mathbf{x}$ into a sequence of feature vectors, then computes the probability of the next output $y_u$ as a function of the encoded input and previous outputs. The attention mechanism allows the decoder to look at different parts of the input sequence when predicting each output. 
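<\/p>\n\n<p>(In code, a single attention step of the plain dot-product variety can be sketched roughly as follows; Bahdanau et al. actually use a small learned scoring network rather than a dot product, but the idea is the same.)<\/p>\n\n<pre><code style=\"font-size:14px\">\nimport torch\n\ndef attention_step(encoder_states, decoder_state):\n    # encoder_states: (T, d) matrix, one feature vector per input position\n    # decoder_state:  (d,) vector summarizing the outputs produced so far\n    scores = encoder_states @ decoder_state  # (T,) similarity of each input position\n    weights = torch.softmax(scores, dim=0)   # where to look; entries sum to 1\n    context = weights @ encoder_states       # (d,) weighted summary of the input\n    return context, weights\n<\/code><\/pre>\n\n<p>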
Here, for example, is a heatmap of where the decoder is looking during a translation task (from <a href=\"https:\/\/arxiv.org\/abs\/1409.0473\">Bahdanau et al.<\/a>):<\/p>\n\n<!-- <show attention heatmap for speech recognition>  -->\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/transducer\/attention-heatmap.png\" style=\"max-width:75%\" \/><\/center>\n<p><br \/>\nAttention models can be applied to any problem, but they are not always the best choice for certain problems, like speech recognition, for a few reasons<sup id=\"fnref:mocha\" role=\"doc-noteref\"><a href=\"#fn:mocha\" class=\"footnote\" rel=\"footnote\">2<\/a><\/sup>:<\/p>\n<ul>\n  <li>The attention operation is expensive for long input sequences. The complexity of attending to the entire input for every output is $O(TU)$\u2014and for audio, $T$ and $U$ are big.<\/li>\n  <li>Attention models cannot be run <em>online<\/em> (in real time), since the entire input sequence needs to be available before the decoder can attend to it.<\/li>\n  <li>Attention models also don\u2019t take advantage of the fact that, for speech recognition, the alignment between inputs and outputs is <strong>monotonic<\/strong>: that is, if word A comes after word B in the transcript, word A must come after word B in the audio signal (see image below, from <a href=\"https:\/\/arxiv.org\/pdf\/1508.01211.pdf\">Chan et al.<\/a>, for an example of a monotonic alignment). The fact that attention models lack this inductive bias seems to make them <a href=\"https:\/\/awni.github.io\/train-sequence-models\/\">harder to train<\/a> for speech recognition; it\u2019s common to add <a href=\"https:\/\/arxiv.org\/abs\/1609.06773\">auxiliary loss terms<\/a> to stabilize training.<\/li>\n<\/ul>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/transducer\/attention-ASR.png\" style=\"max-width:75%\" \/><\/center>\n<p><br \/><\/p>\n\n<p>This leads us to Connectionist Temporal Classification (CTC) models, which are more suitable for some problems than attention models.<\/p>\n\n<h2 id=\"ctc-models\">CTC models<\/h2>\n\n<p>CTC models assume that there is a monotonic input-output alignment<sup id=\"fnref:CTC\" role=\"doc-noteref\"><a href=\"#fn:CTC\" class=\"footnote\" rel=\"footnote\">3<\/a><\/sup>. This ends up making the model a lot simpler.<\/p>\n\n<!-- <show lattice> -->\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/transducer\/ctc-model.png\" style=\"max-width:30%\" \/><\/center>\n<p><br \/>\nSo simple! We only need a single neural net to implement a CTC model, and no expensive global attention mechanism.<\/p>\n\n<p>But CTC models have a couple problems of their own:<\/p>\n<ul>\n  <li><strong>Problem 1:<\/strong> <em>The output sequence length $U$ has to be smaller than the input sequence length $T$.<\/em> This might not seem like a problem for speech recognition, where $T$ is much larger than $U$\u2014but it prevents us from using a model architecture that does a lot of pooling, which can make the model a lot faster.<\/li>\n  <li><strong>Problem 2:<\/strong> <em>The outputs are assumed to be independent of each other.<\/em> The result is that CTC models often produce outputs that are obviously wrong, like \u201cI eight food\u201d instead of \u201cI ate food\u201d. 
Getting good results with CTC usually requires a search algorithm that incorporates a secondary language model.<sup id=\"fnref:Jasper\" role=\"doc-noteref\"><a href=\"#fn:Jasper\" class=\"footnote\" rel=\"footnote\">4<\/a><\/sup><\/li>\n<\/ul>\n\n<p>Can we do better than CTC? Yes: using Transducer models.<\/p>\n\n<h2 id=\"transducer-models\">Transducer models<\/h2>\n\n<p>The Transducer elegantly solves both problems associated with CTC, while retaining some of its advantages over attention models.<\/p>\n\n<ul>\n  <li>It solves <strong>Problem 1<\/strong> by allowing multiple outputs for each input.<\/li>\n  <li>It solves <strong>Problem 2<\/strong> by adding a predictor network and joiner<sup id=\"fnref:joiner\" role=\"doc-noteref\"><a href=\"#fn:joiner\" class=\"footnote\" rel=\"footnote\">5<\/a><\/sup> network.<\/li>\n<\/ul>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/transducer\/transducer-model.png\" style=\"max-width:60%\" \/><\/center>\n<p><br \/>\nThe predictor is autoregressive: it takes as input the previous outputs and produces features that can be used for predicting the next output, like a standard language model.<\/p>\n\n<p>The joiner is a simple feedforward network that combines the encoder vector $f_t$ and predictor vector $g_u$ and outputs a softmax $h_{t,u}$ over all the labels, as well as a \u201cnull\u201d output $\\varnothing$.<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/transducer\/joiner-output.png\" style=\"max-width:33%\" \/><\/center>\n<p><br \/>\nGiven an input sequence $\\mathbf{x}$, generating an output sequence $\\mathbf{y}$ can be done using a simple greedy search algorithm:<\/p>\n\n<ol>\n  <li>\n    <p>Start by setting $t := 1$, $u := 0$, and $\\mathbf{y} :=$ an empty list.<\/p>\n  <\/li>\n  <li>\n    <p>Compute $f_t$ using $\\mathbf{x}$ and $g_u$ using $\\mathbf{y}$.<\/p>\n  <\/li>\n  <li>\n    <p>Compute $h_{t,u}$ using $f_t$ and $g_u$.<\/p>\n  <\/li>\n  <li>\n    <p>If the argmax of $h_{t,u}$ is a <em>label<\/em>, set $u := u + 1$, and output the label (append it to $\\mathbf{y}$ and feed it back into the predictor).<br \/><br \/>If the argmax of $h_{t,u}$ is $\\varnothing$, set $t := t + 1$ (in other words, just move to the next input timestep and output nothing).<\/p>\n  <\/li>\n  <li>\n    <p>If $t=T+1$, we\u2019re done. Else, go back to step 2.<\/p>\n  <\/li>\n<\/ol>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/transducer\/greedy-search.png\" style=\"max-width:90%\" \/><\/center>\n<p><br \/>\nA couple cool things about Transducers to note here:<\/p>\n\n<ul>\n  <li>\n    <p>If the encoder is causal (i.e., we\u2019re not using something like a bidirectional RNN), then the search can run in an online\/<a href=\"https:\/\/twitter.com\/lorenlugosch\/status\/1327330577104695297?s=20\">streaming<\/a> fashion, where we process each $x_t$ as soon as it arrives.<\/p>\n  <\/li>\n  <li>\n    <p>The predictor only has access to $\\mathbf{y}$, and not $\\mathbf{x}$\u2014unlike the decoder in an attention model, which sees both $\\mathbf{x}$ and $\\mathbf{y}$. That means we can easily pre-train the predictor on text-only data, which there\u2019s a lot more of than paired (speech, text) data.<\/p>\n  <\/li>\n<\/ul>\n\n<h2 id=\"alignment\">Alignment<\/h2>\n\n<p>Given an $(\\mathbf{x}, \\mathbf{y})$ pair, the Transducer defines a set of possible monotonic alignments between $\\mathbf{x}$ and $\\mathbf{y}$. 
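<\/p>\n\n<p><em>Concretely, an alignment here is a sequence made up of the labels of $\mathbf{y}$ (in order) interleaved with null symbols: stripping the nulls recovers $\mathbf{y}$, there are exactly $T$ nulls (one for each input timestep), and in the formulation used here the sequence ends with a null. A tiny validity check, written as my own sketch rather than anything from the original paper:<\/em><\/p>\n\n<pre style=\"font-size:13px\">\n# None stands for the null symbol in this sketch\ndef is_valid_alignment(z, y, T):\n    labels = [s for s in z if s is not None]\n    return labels == list(y) and z.count(None) == T and z[-1] is None\n\nz = [None, 'C', 'A', None, 'T', None, None]   # one of the alignments drawn below\nprint(is_valid_alignment(z, 'CAT', T=4))      # True\n<\/pre>\n\n<p>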
For example, consider an input sequence of length $T = 4$ and an output sequence (\u201cCAT\u201d) of length $U = 3$. We can illustrate the set of alignments using a graph<sup id=\"fnref:FST\" role=\"doc-noteref\"><a href=\"#fn:FST\" class=\"footnote\" rel=\"footnote\">6<\/a><\/sup> like this:<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/transducer\/transducer-graph.png\" style=\"max-width:60%\" \/><\/center>\n<p><br \/>\nHere\u2019s one alignment: $\\mathbf{z} = \\varnothing, C, A, \\varnothing, T, \\varnothing, \\varnothing$<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/transducer\/cat-align-1.png\" style=\"max-width:60%\" \/><\/center>\n<p><br \/>\nHere\u2019s another alignment: $\\mathbf{z} = C, \\varnothing, A, \\varnothing, T, \\varnothing, \\varnothing$<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/transducer\/cat-align-2.png\" style=\"max-width:60%\" \/><\/center>\n<p><br \/>\n<!-- *(Why the dangling $\\varnothing$ at the end, you ask? I'm actually not sure!*  -->\n<!-- Training a model without it seems to work just fine. But that's how the original model is specified, so I've included it here.)* --><\/p>\n\n<p>We can calculate the probability of one of these alignments by multiplying together the values of each edge along the path:<\/p>\n\n<p>$\\mathbf{z} = \\varnothing, C, A, \\varnothing, T, \\varnothing, \\varnothing$<br \/>\n\u2193\n$p(\\mathbf{z} | \\mathbf{x}) = h_{1,0}[\\varnothing] \\cdot h_{2,0}[C] \\cdot h_{2,1}[A] \\cdot h_{2,2}[\\varnothing] \\cdot h_{3,2}[T] \\cdot h_{3,3}[\\varnothing] \\cdot h_{4,3}[\\varnothing],$<\/p>\n\n<p>where the value of an edge is the corresponding entry of $h_{t,u}$.<\/p>\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/transducer\/edge-weights.png\" style=\"max-width:50%\" \/><\/center>\n\n<!-- If you're wondering how I got the $(t,u)$ indices, we start from $(t=1, u=0)$, increment $t$ when the edge is $\\varnothing$, and increment $u$ when the edge is a label. -->\n\n<h2 id=\"training\">Training<\/h2>\n\n<p>How do we train the model? If we knew the true alignment<sup id=\"fnref:alignment\" role=\"doc-noteref\"><a href=\"#fn:alignment\" class=\"footnote\" rel=\"footnote\">7<\/a><\/sup> $\\mathbf{z}$, we could minimize the cross-entropy between $\\mathbf{h}$ and $\\mathbf{z}$, like a normal classifier. However, we usually don\u2019t know the true alignment (and for some tasks, a \u201ctrue\u201d alignment might not even exist).<\/p>\n\n<p>Instead, the Transducer defines $p(\\mathbf{y}|\\mathbf{x})$ as the sum of the probabilities of <em>all<\/em> possible alignments between $\\mathbf{x}$ and $\\mathbf{y}$. We train the model by minimizing the loss function $-\\log p(\\mathbf{y}|\\mathbf{x})$.<\/p>\n\n<p>There are usually too many possible alignments to compute the loss function by just adding them all up directly. 
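<\/p>\n\n<p><em>For a single alignment, though, the product above is easy to compute; here is a rough sketch (the notebook linked in the intro includes a full version), working with log-probabilities rather than probabilities for the numerical reasons discussed below, and assuming $h$ is stored as a $(T, U+1, L)$ tensor with index 0 reserved for the null symbol:<\/em><\/p>\n\n<pre style=\"font-size:13px\">\nNULL_INDEX = 0   # assumption for this sketch: index 0 is the null symbol\n\ndef alignment_log_prob(log_h, z):\n    # log_h: (T, U+1, L) array of log-probabilities log h_{t,u}\n    # z: list of label indices, with NULL_INDEX for the null symbol\n    t, u = 0, 0          # zero-indexed version of starting at (t=1, u=0)\n    total = 0.0\n    for label in z:\n        total = total + log_h[t, u, label]\n        if label == NULL_INDEX:\n            t += 1       # null: move to the next input timestep\n        else:\n            u += 1       # real label: move to the next output position\n    return total         # log p(z|x)\n<\/pre>\n\n<p>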
To compute the sum efficiently, we compute the \u201cforward variable\u201d $\\alpha_{t,u}$, for $1 \\leq t \\leq T$ and $0 \\leq u \\leq U$:<\/p>\n\n<center>$$\\begin{eqnarray*} \\alpha_{t,u} = \\alpha_{t-1,u} \\cdot h_{t-1,u}[\\varnothing] \\\\+ \\alpha_{t,u-1} \\cdot h_{t,u-1}[y_{u-1}] \\end{eqnarray*}$$<\/center>\n\n<p>We can visualize this computation as passing values along the edges of the alignment graph:<\/p>\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/transducer\/forward-messages.png\" style=\"max-width:50%\" \/><\/center>\n<p><br \/>\nAfter we\u2019ve computed $\\alpha_{t,u}$ for every node in the alignment graph, we get $p(\\mathbf{y}|\\mathbf{x})$ using the forward variable at the last node of the graph:<\/p>\n\n<center>$$\\begin{eqnarray*} p(\\mathbf{y}|\\mathbf{x}) = \\alpha_{T,U} \\cdot h_{T,U}[\\varnothing]\\end{eqnarray*}$$<\/center>\n\n<p>We need to do everything in the log domain, for the <a href=\"https:\/\/lorenlugosch.github.io\/posts\/2020\/06\/logsumexp\/\">usual reasons<\/a>. In the log domain, the computation becomes:<\/p>\n\n<center>$$\\begin{eqnarray*} \\log \\alpha_{t,u} = \\text{logsumexp}([\\log \\alpha_{t-1,u} + \\log h_{t-1,u}[\\varnothing], \\\\ \\log \\alpha_{t,u-1} + \\log h_{t,u-1}[y_{u-1}] ]) \\end{eqnarray*}$$<\/center>\n\n<center>$$\\begin{eqnarray*} \\log p(\\mathbf{y}|\\mathbf{x}) = \\log \\alpha_{T,U} + \\log h_{T,U}[\\varnothing]\\end{eqnarray*}$$<\/center>\n\n<p>Finally, to compute the gradient of the loss $-\\log p(\\mathbf{y}|\\mathbf{x})$, there is a second algorithm that computes a backward variable $\\beta_{t,u}$, using the same computation as $\\alpha_{t,u}$, but in reverse, starting from the last node.<\/p>\n\n<p><em>In the notebook, I provide a simple PyTorch implementation of the loss function that only writes out the forward computation and uses automatic differentiation to compute the gradient. This is a lot slower than a lower-level implementation, but easier to program and to read.<\/em><\/p>\n\n<h2 id=\"memory-usage\">Memory usage<\/h2>\n\n<p>In general, Transducer models seem like a good idea. But here\u2019s the catch (and possibly the unspoken reason that the Transducer never caught on until recently):<\/p>\n\n<p>Suppose we have $T=1000$, $U=100$, $L=1000$ labels, and batch size $B=32$. Then to store $h_{t,u}$ for all $(t,u)$ to run the forward-backward algorithm, we need a tensor of size $B \\times T \\times U \\times L = $ 3,200,000,000, or 12.8 GB if we\u2019re using single-precision floats. And that\u2019s just the output tensor: there\u2019s also the hidden unit activations of the joiner network, which are of size $B \\times T \\times U \\times d_{\\text{joiner}}$.<\/p>\n\n<p>So unless you are, <em>ahem<\/em>, a certain tech company in possession of TPUs with plentiful RAM (guess who\u2019s been publishing the most Transducer papers!), you may need to find some way to reduce memory consumption during training\u2014e.g., by pooling in the encoder to reduce $T$, or by using a small batch size $B$.<\/p>\n\n<p>Ironically, this is only a problem during training; during inference, we only need a small amount of memory to store the current activations and hypotheses for $\\mathbf{y}$.<\/p>\n\n<h2 id=\"search\">Search<\/h2>\n\n<p>We saw earlier that you can predict $\\mathbf{y}$ using a greedy search, always picking the top output of $h_{t,u}$. 
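<\/p>\n\n<p><em>Here is a rough sketch of that greedy search, following the numbered steps above. The encoder, predictor, and joiner call signatures are placeholders for illustration (not the exact interface used in the notebook), and a practical implementation would typically also cap how many labels can be emitted per timestep.<\/em><\/p>\n\n<pre style=\"font-size:13px\">\nNULL_INDEX = 0   # assumption for this sketch: index 0 is the null symbol\n\ndef greedy_decode(encoder, predictor, joiner, x):\n    f = encoder(x)              # (T, d) encoder features for the whole input\n    T = f.shape[0]\n    y = []                      # predicted labels so far\n    t = 0\n    while t &lt; T:\n        g = predictor(y)        # placeholder API: features computed from the labels so far\n        h = joiner(f[t], g)     # scores over all labels plus null (a PyTorch tensor)\n        label = h.argmax().item()\n        if label == NULL_INDEX:\n            t += 1              # null: move to the next input timestep\n        else:\n            y.append(label)     # emit the label; it gets fed back to the predictor\n    return y\n<\/pre>\n\n<p>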
Better results can be obtained using a beam search instead, maintaining a list of multiple hypotheses for $\\mathbf{y}$ and updating them at each input timestep.<\/p>\n\n<p>The Transducer beam search algorithm can be found in the original paper\u2014though it is somewhat gnarlier than the simple attention model beam search, and I confess I haven\u2019t implemented it myself yet. (Check out the soon-to-be-released <a href=\"https:\/\/speechbrain.github.io\/\">SpeechBrain<\/a> toolkit for my colleagues\u2019 implementation.)<\/p>\n\n<h2 id=\"code\">Code<\/h2>\n\n<p>Finally, the Colab notebook for the Transducer can be found <a href=\"https:\/\/github.com\/lorenlugosch\/transducer-tutorial\/blob\/main\/transducer_tutorial_example.ipynb\">here<\/a>. The notebook implements a Transducer model in PyTorch for a toy sequence transduction problem (filling in missing vowels in a sentence: \u201chll wrld\u201d \u2013&gt; \u201chello world\u201d), including the loss function, the greedy search, and a function for computing the probability of a single alignment. Enjoy!<\/p>\n\n<h2 id=\"citation\">Citation<\/h2>\n<p><em>If you found this tutorial helpful and would like to cite it, you can use the following BibTeX entry:<\/em><\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>@misc{\n\tlugosch_2020, \n\ttitle={Sequence-to-sequence learning with Transducers}, \n\turl={https:\/\/lorenlugosch.github.io\/posts\/2020\/11\/transducer\/}, \n\tauthor={Lugosch, Loren}, \n\tyear={2020}, \n\tmonth={Nov}\n}\n<\/code><\/pre><\/div><\/div>\n<div class=\"footnotes\" role=\"doc-endnotes\">\n  <ol>\n    <li id=\"fn:foot\" role=\"doc-endnote\">\n      <p>It always seems to take a few years between Alex Graves publishing a good idea and the research community fully recognizing it. There was an 8 year gap between CTC (2006) and Baidu\u2019s Deep Speech (2014), and an 8 year gap between the Transducer (2012) and Google\u2019s latest result (2020). This suggests a simple algorithm for achieving state-of-the-art results: select a paper written by Alex Graves from 8 years ago, and reimplement it using whatever advances in deep learning have been made since then. Maybe 2022 will be the year <a href=\"https:\/\/arxiv.org\/abs\/1410.5401\">Neural Turing Machines<\/a> really shine!\u00a0<a href=\"#fnref:foot\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:mocha\" role=\"doc-endnote\">\n      <p>There\u2019s been some interesting work developing attention models that do not have these three issues, like <a href=\"https:\/\/arxiv.org\/abs\/1712.05382\">monotonic chunkwise attention (MoChA)<\/a>.\u00a0<a href=\"#fnref:mocha\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:CTC\" role=\"doc-endnote\">\n      <p>As do their older cousins, <a href=\"https:\/\/lorenlugosch.github.io\/posts\/2020\/01\/hmm\/\">Hidden Markov Models<\/a>, <a href=\"https:\/\/pdfs.semanticscholar.org\/62d7\/9ced441a6c78dfd161fb472c5769791192f6.pdf\">Graph Transformer Networks<\/a>, and the more recent <a href=\"https:\/\/arxiv.org\/pdf\/1609.03193.pdf\">AutoSegCriterion<\/a>. 
See Awni Hannun\u2019s excellent <a href=\"https:\/\/distill.pub\/2017\/ctc\/\">introduction<\/a> if you want to learn more about CTC.\u00a0<a href=\"#fnref:CTC\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:Jasper\" role=\"doc-endnote\">\n      <p>Alternately, you can use a very big and deep network like <a href=\"https:\/\/arxiv.org\/pdf\/1904.03288.pdf\">Jasper<\/a>. Jasper was a CTC model proposed by NVIDIA researchers that, astonishingly, achieved nearly state-of-the-art performance using only a greedy search. If the model is big and deep, it can intelligently coordinate its outputs so as to not produce dumb predictions like \u201cI eight food\u201d instead of \u201cI ate food\u201d. Still, it seems to be more parameter-efficient to use a model that explicitly assumes that outputs are not independent, like attention models and Transducer models.\u00a0<a href=\"#fnref:Jasper\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:joiner\" role=\"doc-endnote\">\n      <p>In the original paper, there was no joiner; the encoder vector and predictor vector were simply added together. Graves and his co-authors added the joiner in a <a href=\"https:\/\/www.cs.toronto.edu\/~fritz\/absps\/RNN13.pdf\">subsequent paper<\/a>, finding that it reduced the number of deletion errors.\u00a0<a href=\"#fnref:joiner\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:FST\" role=\"doc-endnote\">\n      <p>If you\u2019re familiar with the powerful gadgets known as <a href=\"http:\/\/www.opengrm.org\/twiki\/bin\/view\/GRM\/PyniniDocs\">finite state transducers (FSTs)<\/a>, you may recognize that the Transducer graph is a weighted FST, where an alignment forms the input labels, $\\mathbf{y}$ forms the output labels, and the weight for each edge is dynamically generated by the joiner network.\u00a0<a href=\"#fnref:FST\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:alignment\" role=\"doc-endnote\">\n      <p>The true alignment of neural networks is known to be Chaotic Good.\u00a0<a href=\"#fnref:alignment\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n  <\/ol>\n<\/div>","author":{"name":"Loren Lugosch"},"category":{"@attributes":{"term":"sequence modeling"}},"summary":"The Transducer (sometimes called the \u201cRNN Transducer\u201d or \u201cRNN-T\u201d, though it need not use RNNs) is a sequence-to-sequence model proposed by Alex Graves in \u201cSequence Transduction with Recurrent Neural Networks\u201d. The paper was published at the ICML 2012 Workshop on Representation Learning. Graves showed that the Transducer was a sensible model to use for speech recognition, achieving good results on a small dataset (TIMIT)."},{"title":"My research goals","link":{"@attributes":{"href":"https:\/\/lorenlugosch.github.io\/posts\/2020\/08\/goals\/","rel":"alternate","type":"text\/html","title":"My research goals"}},"published":"2020-08-14T00:00:00-07:00","updated":"2020-08-14T00:00:00-07:00","id":"https:\/\/lorenlugosch.github.io\/posts\/2020\/08\/goals","content":"<p>I wanted to clarify to myself and others what some of my research goals are, and why I\u2019m working on certain problems. 
The hope is that putting this online for the world to see will help challenge me to keep focused and working towards those goals\u2014sort of like telling your friends that you\u2019re going to quit smoking, or something like that.<\/p>\n\n<p>My broad long-term goal is to build reliable, competent domestic robots: in other words, robots that can help you around the house with tidying up, folding laundry, and the kind of things you might now ask Alexa\/Siri\/Google Home for, like playing music and setting timers.<\/p>\n\n<p>In addition to the interesting technical challenge of it, domestic robots are just something I personally would love to have. They could also make life a lot better for elderly people and people who need long-term care\u2014if you can\u2019t pay for a caregiver, and if your loved ones aren\u2019t able to take on the role of caregiver, a robot might be an affordable alternative.<\/p>\n\n<p>An important aspect of my goal is: I don\u2019t want these robots to rely on the Internet\u2014I want their AI to live on the robot, offline. There\u2019s a few reasons for this.<\/p>\n\n<p>1) <em>Privacy.<\/em> Yes, yes, I know there\u2019s a lot of people working on things like computation on encrypted data (\u201csome of my best friends work on data privacy!\u201d), but barring some big breakthroughs, I\u2019d prefer to just cut the Gordian Knot and avoid sending my data to the cloud altogether.<\/p>\n\n<p>2) <em>Latency\/reliability.<\/em> Suppose a robot lives with an elderly person, and the person is about to slip and fall. The robot needs to detect this and quickly move to keep the person from getting hurt. We might use a neural network to map the robot\u2019s camera feed to a sequence of physical actions to take. If we store that neural network in the cloud, and the Internet cuts out, the robot won\u2019t be able to act.<\/p>\n\n<p>3) <em>The Internet shouldn\u2019t even be necessary.<\/em> If it\u2019s not something that inherently requires the Internet, like checking the weather forecast or buying groceries, I shouldn\u2019t have to use the Internet to do it. My brain takes up the space of about a hard drive and runs on 20 Watts\u2014I don\u2019t have to store it in the cloud and run it on a supercomputer. It\u2019s annoying that I need the Internet just to tell my phone to set a reminder.<\/p>\n\n<p>And this last one admittedly isn\u2019t a very good reason, but I want to be honest with myself:<\/p>\n\n<p>4) <em>I want to own all my stuff.<\/em> I\u2019m annoyed that less and less of my stuff consists of physical things that I definitely own, and instead is more virtual things in the cloud that I maybe kind of own or maybe am just renting. I want to own my data, for instance; I don\u2019t want to have to worry about where it lives and who can see it. 
(But maybe this is just a bit of nostalgia, or a bit of the tin-foil-hat-wearing libertarian\u2019s instinct to convert his digital money into a pile of gold and hide it under the bed.)<\/p>\n\n<blockquote class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">industry: would you like to move all your code and data to a rich man&#39;s personal computer?<br \/>programmer: what no that sounds like a bad idea<br \/>industry: ok would you like to run everything in &quot;the cloud&quot;?<br \/>programmer: oh that sounds fluffy and nice yes please<\/p>&mdash; Computer Facts (@computerfact) <a href=\"https:\/\/twitter.com\/computerfact\/status\/1192938091201335296?ref_src=twsrc%5Etfw\">November 8, 2019<\/a><\/blockquote>\n<script async=\"\" src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script>\n\n<p>I suspect that, to make really good and reliable domestic robots happen, solving the AI problems\u2014true spoken language understanding, fine-grained motor control, long-term planning\u2014will require <em>extremely large neural networks<\/em>. I\u2019m talking hundreds of trillions of parameters. For reference, the current biggest neural net reported in the literature has only <a href=\"https:\/\/arxiv.org\/abs\/2006.16668\">600 billion parameters<\/a>.<\/p>\n\n<p>Why do I say that? There\u2019s lots of evidence. <a href=\"https:\/\/www.gwern.net\/GPT-3\">GPT-3<\/a> has 175 billion parameters\u2014there\u2019s <a href=\"https:\/\/arxiv.org\/abs\/2001.08361\">theoretical<\/a> and <a href=\"https:\/\/www.youtube.com\/watch?v=9P_VAMyb-7k\">empirical<\/a> reasons to believe that big models really are necessary\u2014and that\u2019s <em>just<\/em> for language understanding\/generation: our robot will also need motor control, audio processing, vision, and haptics. The way we do AI now, each of these modalities is handled separately; I think ultimately we will need the modalities to be <a href=\"https:\/\/arxiv.org\/abs\/1706.05137\">handled together<\/a>, so that your robot doesn\u2019t think <a href=\"https:\/\/openai.com\/blog\/better-language-models\/\">a unicorn might have 4 horns<\/a>\u2014and so our models will need even more capacity, for understanding how the different modalities fit together. And if you like taking inspiration from human brains: ours have 100 trillion synapses\u2014assuming that 1 synapse = 1 weight, and assuming that Nature is efficient, then we\u2019ll need 100 trillion parameters to make AI that can do the range of things human brains can do.<\/p>\n\n<p>Running 100-trillion-parameter networks on a domestic robot is going to be hard. Just running a 600-billion-parameter network on a supercomputer was a significant engineering challenge for <a href=\"https:\/\/arxiv.org\/abs\/2006.16668\">Google<\/a>. Will Moore\u2019s Law alone get us there? Could we just wait until 2040, when a Raspberry Pi has 1 PetaFLOPS of processing and 100 TB of storage? Or do we need to consider fundamentally different hardware designs?<\/p>\n\n<p>No, I\u2019m not talking about matrix multiplication accelerators, which have been optimized to death already. I\u2019m talking about data storage\u2014way less sexy, but ripe for neural-net-specific optimizations. Right now, <a href=\"https:\/\/twitter.com\/hardmaru\/status\/1289498209320972290?s=20\">a 100 TB drive costs $40,000<\/a>. The alternative, connecting up dozens of cheaper, smaller drives, might not be energy-efficient or space-efficient enough for a robot. 
Could we make denser, cheaper storage by taking advantage of the fact that\u2014unlike the hard drives for storing your OS and bank account information\u2014neural nets can withstand a little noise?<\/p>\n\n<p>Whatever the hardware for domestic robot AI looks like, conditional computation\u2014that is, not using all 100 trillion parameters of the neural network every time the clock ticks\u2014is certainly going to be a part of the solution. That\u2019s the focus of my PhD right now. My current hunch is that the <a href=\"https:\/\/arxiv.org\/abs\/1701.06538\">hard mixture-of-experts<\/a> is a good starting point, but not the final word in conditional computation. Just as inductive biases like translational invariance can make computer vision easier to learn, there are simple inductive biases from computer architecture and psychology which I think can make the hard mixture-of-experts easier to train. The fun part of AI is figuring out how little inductive bias we can get away with\u2014and pilfering that bias from human brains and other sources of inspiration when we do need it.<\/p>","author":{"name":"Loren Lugosch"},"summary":"I wanted to clarify to myself and others what some of my research goals are, and why I\u2019m working on certain problems. The hope is that putting this online for the world to see will help challenge me to keep focused and working towards those goals\u2014sort of like telling your friends that you\u2019re going to quit smoking, or something like that."},{"title":"Predictive coding in machines and brains","link":{"@attributes":{"href":"https:\/\/lorenlugosch.github.io\/posts\/2020\/07\/predictive-coding\/","rel":"alternate","type":"text\/html","title":"Predictive coding in machines and brains"}},"published":"2020-07-11T00:00:00-07:00","updated":"2020-07-11T00:00:00-07:00","id":"https:\/\/lorenlugosch.github.io\/posts\/2020\/07\/predictive-coding","content":"<p>The name \u201cpredictive coding\u201d has been applied to a number of engineering techniques and scientific theories. All these techniques and theories involve predicting future observations from past observations, but what exactly is meant by \u201ccoding\u201d differs in each case. Here is a quick tour of some flavors of \u201cpredictive coding\u201d and how they\u2019re related.<\/p>\n\n<h2 id=\"what-is-coding\">What is \u201ccoding\u201d?<\/h2>\n<p>In signal processing and related fields, the term \u201ccoding\u201d generally means putting a signal into some format where the signal will be easier to handle for some task.<\/p>\n\n<p>In any coding scheme, there is an encoder, which puts the input signal into the new format, and a decoder, which puts the encoded signal back into the original format (or as close as possible to the original). The \u201ccode\u201d is the space of possible encoded signals. For example, the <a href=\"https:\/\/en.wikipedia.org\/wiki\/ASCII\">American Standard Code for Information Interchange (ASCII)<\/a> consists of a set of 8-bit representations of commonly used characters. Your computer doesn\u2019t understand characters, but it does understand bits, so we use a bit-based coding scheme to represent characters in a computer.<\/p>\n\n<p>Some other common types of coding are:<\/p>\n<ul>\n  <li><strong>Source coding<\/strong> (also known as \u201c<a href=\"http:\/\/mattmahoney.net\/dc\/dce.html\">data compression<\/a>\u201d). The encoder compresses the input into a small bitstream; the decoder decompresses that bitstream to get back the input. 
<em>Examples: arithmetic coding (used in JPEG), Huffman coding (used to <a href=\"https:\/\/arxiv.org\/abs\/1510.00149\">compress neural networks<\/a>).<\/em><\/li>\n  <li><strong>Channel coding<\/strong> (also known as \u201c<a href=\"https:\/\/lorenlugosch.github.io\/Masters_Thesis.pdf\">error correction<\/a>\u201d). The encoder adds redundancy to protect a message from noise in the channel through which it will be transmitted; the decoder maps the noisy received signal back to the most likely original message, with the help of the added redundancy. <em>Examples: low-density parity-check codes (used in your cell phone and your hard drive), Reed-Solomon codes (used in CDs, for those old enough to remember this technology).<\/em><\/li>\n  <li><strong>Encryption.<\/strong> (Encryption could also be called \u201creceiver coding\u201d because it considers possible <em>receivers<\/em>, but nobody uses this terminology.) The encoder encrypts the message into a format such that only a receiver with the appropriate keys can open the message; the decoder decrypts the encrypted message into the readable original. <em>Examples: RSA (used in HTTPS), SHA-1 (once used, but now broken).<\/em><\/li>\n<\/ul>\n\n<p><em>Aside:<\/em> The \u201csource coding\u201d and \u201cchannel coding\u201d lingo come from Claude Shannon\u2019s wonderful 1948 paper \u201c<a href=\"http:\/\/people.math.harvard.edu\/~ctm\/home\/text\/others\/shannon\/entropy\/entropy.pdf\">A mathematical theory of communication<\/a>\u201d.<\/p>\n\n<h2 id=\"predictive-coding-for-data-compression\">Predictive coding for data compression<\/h2>\n<p>OK, that\u2019s coding in general. So what\u2019s predictive coding?<\/p>\n\n<p>The term \u201cpredictive coding\u201d was coined in 1955 <a href=\"https:\/\/ieeexplore.ieee.org\/stamp\/stamp.jsp?arnumber=1055126&amp;casa_token=4M_d797vjM8AAAAA:bSflmHvRRXAGjcDRj2UldgoGkmYQggI1Up7hOo3a1wUTht-92EnD89CJ8JVz-xUqqnBmXVdd&amp;tag=1\">by Peter Elias<\/a>. Specifically, what Elias proposed was a method called <em>linear<\/em> predictive coding (LPC) for communication systems.<\/p>\n\n<p>In LPC, the next sample of a signal is predicted using a linear function of the previous $n$ samples. Then the error between the predicted sample and the actual sample is transmitted, along with the coefficients of the linear predictor. Predicting a sample from nearby previous samples works because in signals like speech, nearby samples are strongly correlated with each other.<\/p>\n\n<p>The idea behind transmitting the <em>error<\/em> in LPC is that if we have a good predictor, the error will be small; thus it will require less bandwidth to transmit than the original signal. (So here \u201ccoding\u201d specifically refers to \u201csource coding\u201d, or compression.)<\/p>\n\n<p>LPC has had a long history of successes for audio compression. <a href=\"https:\/\/ieeexplore.ieee.org\/stamp\/stamp.jsp?arnumber=6366797&amp;casa_token=QMjcWbI_Ms4AAAAA:d1m6vIdXHIwFnkOtjljmqxjtR5m_S0iywLUj9rqFKPoNlvotD7y5UF9Jg0GQbwKA4jmaAFyj&amp;tag=1\">Speak \u2018n\u2019 Spell used LPC<\/a> to store and synthesize speech sounds. The old <a href=\"https:\/\/en.wikipedia.org\/wiki\/Game_Boy_Sound_System\">Game Boy soundchip<\/a> mostly used simple square wave beeps \u2018n\u2019 boops to make music, but sometimes used a special case of LPC known as <em>DPCM<\/em> to store certain sounds, like Pikachu\u2019s voice in <em>Pok\u00e9mon Yellow<\/em>. 
(See <a href=\"https:\/\/www.youtube.com\/watch?v=q_3d1x2VPxk\">this video<\/a> for a great overview of this and other old-school soundchips. Audio compression was crucial back when game cartridges had limited data storage.) <a href=\"https:\/\/en.wikipedia.org\/wiki\/SILK\">The speech codec used in Skype<\/a> uses LPC, combined with a bunch of other gadgets.<\/p>\n\n<p>Predictive coding is also used in <em>video<\/em> compression, under the name \u201c<a href=\"https:\/\/en.wikipedia.org\/wiki\/Motion_compensation\">motion compensation<\/a>\u201d. Like adjacent audio samples, adjacent video frames are strongly correlated, and so can be predicted from each other. If you haven\u2019t already, it\u2019s good to take a moment to <a href=\"https:\/\/sidbala.com\/h-264-is-magic\/\">have your mind blown by H.264 video compression<\/a>\u2014the digital equivalent of shrinking a 3000-pound car to 0.4 pounds.<\/p>\n\n<p>And of course, <em>linear<\/em> models are not the only way to do predictive coding; nonlinear models like neural networks can be used as well. <a href=\"https:\/\/arxiv.org\/pdf\/1910.06464.pdf\">A speech codec using WaveNet<\/a> has been reported to get lower bitrate for the same quality as some traditional speech codecs.<\/p>\n\n<h2 id=\"predictive-coding-for-representation-learning\">Predictive coding for representation learning<\/h2>\n\n<p>Linear predictive coding and friends can be thought of as special cases of something called an <em>autoregressive model<\/em>. An autoregressive model is a model that cleverly splits a complicated probability distribution over sequences into a number of chunks that are easier to handle.<\/p>\n\n<p>Let $\\mathbf{x} = \\{x_1, x_2, \\dots\\}$ denote our sequence of interest. In an autoregressive model, the joint distribution $p(\\mathbf{x})$ is defined as<\/p>\n\n<center>$$\\begin{eqnarray*} p(\\mathbf{x}) = \\prod_t p(x_t|x_{t-1}, x_{t-2}, \\dots) \\end{eqnarray*}$$<\/center>\n\n<p>The $p(x_t|x_{t-1}, x_{t-2}, \\dots)$ term can be implemented in different ways. In something simple like an $n$-gram model for text, $p(x_t|x_{t-1}, x_{t-2}, \\dots)$ is just a lookup table containing the probability of the next letter being $x_t$, given that the previous letters were $x_{t-1}, x_{t-2}, \\dots$. Another way to implement $p(x_t|x_{t-1}, x_{t-2}, \\dots)$ is to feed $x_{t-1}, x_{t-2}, \\dots$ into a neural network, which outputs a feature vector $h$, and then predict $x_t$ using a linear model on top of $h$, like a softmax classifier (for discrete $x_t$) or a linear regression model (for real-valued $x_t$).<\/p>\n\n<p>Such a model is called \u201cautoregressive\u201d because if you want to <em>sample<\/em> from the distribution it defines, you must feed back the model\u2019s own outputs (= \u201cauto\u201d) to predict the next output (= \u201cregressive\u201d).<\/p>\n\n<p>It turns out that if you train a neural network as an autoregressive model, the internal representations learned by the network will <a href=\"https:\/\/arxiv.org\/abs\/1511.01432\">work really well for supervised downstream tasks<\/a>. 
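<\/p>\n\n<p><em>As a minimal sketch of that recipe (an embedding, a recurrent network producing the feature vector $h$, and a softmax classifier on top), something like the following would do; the sizes and layer choices are arbitrary placeholders:<\/em><\/p>\n\n<pre style=\"font-size:13px\">\nimport torch\nimport torch.nn as nn\n\nclass NextStepPredictor(nn.Module):\n    def __init__(self, num_symbols=128, d=64):\n        super().__init__()\n        self.embed = nn.Embedding(num_symbols, d)\n        self.rnn = nn.GRU(d, d, batch_first=True)\n        self.out = nn.Linear(d, num_symbols)    # softmax classifier over the next symbol\n\n    def forward(self, x):\n        h, _ = self.rnn(self.embed(x))          # h: (batch, length, d) feature vectors\n        return self.out(h), h                   # logits for the next symbol, plus the features\n\nmodel = NextStepPredictor()\nx = torch.randint(0, 128, (2, 10))              # a toy batch of symbol sequences\nlogits, features = model(x)\n\n# next-step prediction: the target at position t is the symbol at position t+1\nloss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 128), x[:, 1:].reshape(-1))\n\n# after training, the features can be reused for supervised downstream tasks\n<\/pre>\n\n<p>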
Work like <a href=\"https:\/\/arxiv.org\/abs\/1801.06146\">ULMFiT<\/a> and <a href=\"https:\/\/worldmodels.github.io\/\">World Models<\/a> showed that this trick can make neural nets a lot more data-efficient.<\/p>\n\n<p>Notice that using the networks\u2019 internal representations is somewhat different from what we did in LPC: whereas in LPC the outputs are the <em>errors<\/em> (because the purpose is compression), in these autoregressive feature extractors the outputs are the <em>features<\/em> (because the purpose is extracting discriminative features). So in either case, we are predicting future observations, but the \u201ccoding scheme\u201d and purpose of predicting the future is different.<\/p>\n\n<hr \/>\n<p><em>A note about audio.<\/em> Autoregressive modeling with big neural nets operating on the raw audio signal works\u2014<a href=\"https:\/\/arxiv.org\/abs\/1609.03499\">WaveNet<\/a> did this for generative modeling, and the representations could be used on a downstream task (speech recognition)\u2014but it is expensive to do because audio signals are high-dimensional.<\/p>\n\n<p>Two methods have been developed to make it easier to do autoregressive modeling for audio: contrastive predictive coding and autoregressive predictive coding.<\/p>\n\n<p>The first method, <a href=\"https:\/\/arxiv.org\/abs\/1807.03748\">contrastive predictive coding (CPC)<\/a>, works by first encoding the input signal into a much lower-dimensional sequence of feature vectors using a convolutional neural network, and then training an autoregressive model on top of this sequence. Since the encoder could simply learn to output all 0s to make the objective function as easy as possible to optimize, the autoregressive component is instead trained using a <em>contrastive<\/em> loss: it must guess whether a sample is actually the next sample in the sequence, or a fake, thus preventing the encoder from collapsing to a trivial representation. The technique also works well for other high-dimensional signals, like images (represented as a sequence of pixels). Note that strictly speaking CPC is not a <em>true<\/em> autoregressive model because we can\u2019t draw samples from it.<\/p>\n\n<p><a href=\"https:\/\/arxiv.org\/abs\/1904.03240\">Autoregressive predictive coding (APC)<\/a> takes a slightly different approach. Instead of modeling the raw audio, it extracts low-dimensional frequency domain features, and then does plain old autoregressive modeling on top of those features. The disadvantage is that 1) maybe some low-level information in the original signal gets thrown out, and 2) now you need to hand-craft some feature extraction, since what works for audio will not necessarily work for other modalities. (Incidentally, I think \u201cautoregressive predictive coding\u201d is not a very good name, because WaveNet is already an \u201cautoregressive\u201d \u201cpredictive coding\u201d model. Engineers are not that great at naming things. Oh well.)<\/p>\n\n<p>A nice paper comparing CPC and APC was recently published at the ICML 2020 Workshop on Self-supervision in Audio and Speech\u2014check it out <a href=\"https:\/\/openreview.net\/forum?id=cnLz5ckGs1y\">here<\/a>, if you\u2019re interested in speech models.<\/p>\n\n<hr \/>\n\n<p>The drawback of using these predictive coding models for representation learning is that the representations we get are \u201cunidirectional\u201d: that is, they extract information only from the past, and not the future, to represent the current input. 
That\u2019s a problem because the future is often very informative for interpreting the present. If you think of a phrase like \u201cmilk the cow\u201d, we know that \u201cmilk\u201d is a verb, and not a noun, from the words that follow it.<\/p>\n\n<p>One way to overcome this problem is to do autoregression in both directions, and concatenate the representations from the forward and backward models, as is done in <a href=\"https:\/\/arxiv.org\/abs\/1802.05365\">ELMo<\/a>. Alternately, models like <a href=\"https:\/\/arxiv.org\/abs\/1810.04805\">BERT<\/a> use a bidirectional context and minimize a contrastive or denoising loss instead. But for tasks in which observations need to be processed in \u201creal-time\u201d, as is usually the case in control problems, a unidirectional context makes more sense.<\/p>\n\n<h2 id=\"predictive-coding-for-computational-efficiency\">Predictive coding for computational efficiency<\/h2>\n\n<p>Another really neat thing forward predictive models can do is tell you roughly whether an observation is \u201cdifficult\u201d or not. The idea is this: if an input is <em>surprising<\/em>\u2014if, for example, your autoregressive model assigns low probability to it\u2014it contains more <em>information<\/em>, and it is therefore probably worth more attention.<\/p>\n\n<p>This observation suggests yet another use for predictive coding: we can allocate less computation to more predictable inputs, a type of \u201c<a href=\"https:\/\/nervanasystems.github.io\/distiller\/conditional_computation.html\">conditional computation<\/a>\u201d. J\u00fcrgen Schmidhuber\u2019s \u201c<a href=\"https:\/\/citeseerx.ist.psu.edu\/viewdoc\/download?doi=10.1.1.34.1205&amp;rep=rep1&amp;type=pdf\">Neural Sequence Chunker<\/a>\u201d is an early instance of this idea, in which predictable inputs are ignored and not sent to a subsequent neural network, and (shameless plug!) recently I wrote <a href=\"https:\/\/arxiv.org\/abs\/2006.01659\">a paper<\/a> describing a slightly more general version of the idea.<\/p>\n\n<p>While unsurprising inputs merit less computation, the inverse is not necessarily true: surprising inputs do not always merit <em>more<\/em> computation. Alex Graves in his <a href=\"https:\/\/arxiv.org\/pdf\/1603.08983v4.pdf\">Adaptive Computation Time paper<\/a> gives an excellent example of unpredictable inputs\u2014specifically, random ID numbers in Wikipedia metadata\u2014which a neural network model does not bother to allocate more computation to, because the model realizes that throwing more processing time at the problem just won\u2019t help. This implies that we can do better than just using predictability by <em>learning<\/em> when to use more or less computation. Still, predictability is a good inductive bias for conditional computation in neural nets\u2014and it appears that human brains may use this inductive bias for similar purposes.<\/p>\n\n<h2 id=\"predictive-coding-in-the-brain\">Predictive coding in the brain<\/h2>\n\n<p>Artificial neural networks in machine learning were inspired by scientific theories about the structure of the human brain. 
For predictive coding, it was the other way around: engineers (starting with Elias) developed predictive coding to solve certain problems in signal processing, and only afterwards did scientists realize that the brain might be doing something like what engineers had developed.<\/p>\n\n<p>One such early inkling was described in Jeffrey Elman\u2019s classic \u201c<a href=\"https:\/\/crl.ucsd.edu\/~elman\/Papers\/fsit.pdf\">Finding Structure in Time<\/a>\u201d. Elman trained a recurrent neural network to predict the next letter in a stream of letters. The streams were formed by taking sentences and removing the spaces between words. This is sort of analogous to the scenario of children learning language. Children are not told where the boundaries between words are; presumably they only hear a relatively unbroken stream of phonemes when adults speak.<\/p>\n\n<p>What Elman found was that letters with high surprisal corresponded closely to the location of word boundaries. Predictive coding might therefore be one of the ingredients for language learning in humans: children could in theory infer word boundaries by noting which phonemes have high surprisal. (This method isn\u2019t foolproof: my uncle reports that as a kid he thought \u201ctractorworking\u201d was a word because he so often heard \u201ctractor\u201d and \u201cworking\u201d together, e.g. \u201cthere\u2019s a tractor working in the field\u201d.)<\/p>\n\n<p>Another piece of evidence for predictive coding in humans comes from studies of reading time. <a href=\"https:\/\/www.aclweb.org\/anthology\/W19-0101.pdf\">van Schijndel and Linzen<\/a> put it nicely: \u201cOne of the most robust findings in the reading literature is that more predictable words are read faster than less predictable words\u2026 Word predictability effects fit into a picture of human cognition in which humans constantly make predictions about upcoming events and test those predictions against their perceptual input.\u201d<\/p>\n\n<p>There is an even more general \u201c<a href=\"https:\/\/royalsocietypublishing.org\/doi\/pdf\/10.1098\/rstb.2005.1622\">predictive coding hypothesis<\/a>\u201d which claims that the brain does predictive coding at every level: sensory signals are predicted by the neurons that receive them, and the prediction errors become the input to other neurons, which also make predictions and errors, and so on. This hypothesis is somewhat controversial, though, as can be seen from the many responses to Andy Clark\u2019s \u201c<a href=\"https:\/\/www.fil.ion.ucl.ac.uk\/~karl\/Whatever%20next.pdf\">Whatever next? Predictive brains, situated agents, and the future of cognitive science<\/a>\u201d (the responses go from pg. 24 onward\u2014hear also <a href=\"http:\/\/unsupervisedthinkingpodcast.blogspot.com\/2018\/05\/episode-33-predictive-coding.html\">Grace Lindsay\u2019s podcast episode on predictive coding<\/a>, which discusses this essay).<\/p>\n\n<h2 id=\"the-future-of-predictive-coding\">The future of predictive coding<\/h2>\n<p>Regardless of the extent to which it actually happens in human brains, predictive coding is a very powerful idea. <a href=\"https:\/\/openai.com\/blog\/image-gpt\/\">As researchers from OpenAI put it recently<\/a>, it is \u201ca universal unsupervised learning algorithm\u201d. 
Indeed, OpenAI\u2019s GPT-3\u2014which is trained solely using next-step-prediction\u2014can do all kinds of things it was never explicitly taught to do, like <a href=\"https:\/\/www.gwern.net\/GPT-3\">word arithmetic and Tom Swifty puns<\/a>.<\/p>\n\n<p>I suspect that more and more AI systems will have something like a predictive coding component built in. Forward predictive models are already needed for things like <a href=\"http:\/\/rail.eecs.berkeley.edu\/deeprlcourse-fa17\/f17docs\/lecture_9_model_based_rl.pdf\">model-based reinforcement learning<\/a>, where a model of the environment is used to plan by simulating and optimizing over possible trajectories; so, why not take advantage of the rich representations learned by those predictive models, and use them as inputs to subsequent processing?<\/p>\n\n<hr \/>","author":{"name":"Loren Lugosch"},"category":{"@attributes":{"term":"sequence modeling"}},"summary":"The name \u201cpredictive coding\u201d has been applied to a number of engineering techniques and scientific theories. All these techniques and theories involve predicting future observations from past observations, but what exactly is meant by \u201ccoding\u201d differs in each case. Here is a quick tour of some flavors of \u201cpredictive coding\u201d and how they\u2019re related."},{"title":"A contemplation of $\\text{logsumexp}$","link":{"@attributes":{"href":"https:\/\/lorenlugosch.github.io\/posts\/2020\/06\/logsumexp\/","rel":"alternate","type":"text\/html","title":"A contemplation of $\\text{logsumexp}$"}},"published":"2020-06-30T00:00:00-07:00","updated":"2020-06-30T00:00:00-07:00","id":"https:\/\/lorenlugosch.github.io\/posts\/2020\/06\/logsumexp","content":"<p>$\\text{logsumexp}$ is an interesting little function that shows up surprisingly often in machine learning. Join me in this post to shed some light on $\\text{logsumexp}$: where it lives, how it behaves, and how to interpret it.<\/p>\n\n<h2 id=\"what-is-textlogsumexp\">What is $\\text{logsumexp}$?<\/h2>\n<p>Let $\\mathbf{x} \\in \\mathbb{R}^n$. $\\text{logsumexp}(\\mathbf{x})$ is defined as:<\/p>\n<center>$$\\begin{eqnarray*} \\text{logsumexp}(\\mathbf{x}) = \\text{log} \\left( \\sum_{i} \\text{exp}(x_i) \\right). \\end{eqnarray*}$$<\/center>\n\n<p>Numerically, $\\text{logsumexp}$ is similar to $\\text{max}$: in fact, it\u2019s sometimes called the \u201csmooth maximum\u201d function. 
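<\/p>\n\n<p><em>A quick numerical way to see this smooth-maximum behavior (the temperature $\beta$ below is just my own illustration, not something defined in this post):<\/em><\/p>\n\n<pre style=\"font-size:13px\">\nimport numpy as np\nfrom scipy.special import logsumexp\n\nx = np.array([1.0, 2.0, 3.0])\nfor beta in [1.0, 2.0, 10.0, 100.0]:\n    print(beta, logsumexp(beta * x) \/ beta)\n# beta = 1 is plain logsumexp; as beta grows, the value approaches max(x) = 3\n<\/pre>\n\n<p>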
For example:<\/p>\n\n<center>$$\\begin{eqnarray*} \\text{max}([1,2,3]) = 3 \\end{eqnarray*}$$<\/center>\n\n<center>$$\\begin{eqnarray*} \\text{logsumexp}([1,2,3]) = 3.4076 \\end{eqnarray*}$$<\/center>\n\n<h2 id=\"examples-of-textlogsumexp\">Examples of $\\text{logsumexp}$<\/h2>\n\n<p>Here are some places in machine learning where $\\text{logsumexp}$ is used.<\/p>\n\n<h3 id=\"softmax-classifiers\">Softmax classifiers<\/h3>\n\n<p>In a softmax classifier, the likelihood of label $i$ is defined as:<\/p>\n\n<center>$$\\begin{eqnarray*} p_{\\theta}(i | \\mathbf{l}) = \\text{softmax}(\\mathbf{l})_i, \\end{eqnarray*}$$<\/center>\n\n<p>where $\\mathbf{l}$ is the vector of logits (unnormalized scores for each label).<\/p>\n\n<p>Softmax classifiers are trained by minimizing the negative log-likelihood loss:\n\n    <div>\n\t$$\\begin{align*}\n    -\\text{log } p_{\\theta}(i | \\mathbf{l}) &= -\\text{log} \\left( \\text{softmax}(\\mathbf{l})_i \\right) \\\\\n    &= -\\text{log} \\left( \\text{exp}(l_i) \/ \\sum_j \\text{exp}(l_j) \\right) \\\\\n    &= -\\text{log} \\left( \\text{exp}(l_i)\\right) +\\text{log} \\left(\\sum_j \\text{exp}(l_j) \\right) \\\\\n    &= -l_i + \\text{logsumexp}(\\mathbf{l}).\n\\end{align*}$$\n    <\/div>\n<\/p>\n\n<p>Our friend $\\text{logsumexp}$ appears in the last line.<\/p>\n\n<h3 id=\"global-pooling\">Global pooling<\/h3>\n\n<p>For sequence classification tasks, it is usually necessary to map a variable length sequence of feature vectors to a single feature vector to be able to use something like a softmax classifier. To obtain a single vector, global pooling operations like max pooling or mean pooling can be used. Another aggregation method, which is less commonly used but has some of the advantages of both mean and max pooling, is $\\text{logsumexp}$ pooling. (See <a href=\"https:\/\/ronan.collobert.com\/pub\/matos\/2016_wordaggr_interspeech.pdf\">this paper<\/a> for an example.)<\/p>\n\n<h3 id=\"latent-alignment-models\">Latent alignment models<\/h3>\n\n<p>In latent alignment models, like <a href=\"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1143844.1143891\">connectionist temporal classification (CTC)<\/a>, dynamic programming is used to add up the probability of all possible alignments of an input sequence $\\mathbf{x}$ to an output sequence $\\mathbf{y}$ to train the model.<\/p>\n\n<p>The dynamic programming algorithm in CTC uses the following recursion:<\/p>\n\n<center>$$\\begin{eqnarray*} \\alpha_{s,t} = (\\alpha_{s,t-1} + \\alpha_{s-1,t-1} + \\alpha_{s-2,t-1}) \\cdot p_{\\theta}(y_s | x_t)\\end{eqnarray*}$$<\/center>\n\n<p>This algorithm multiplies a long chain of probabilities, and so will underflow when aligning long sequences (like speech). 
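<\/p>\n\n<p><em>A two-line illustration of that underflow (my own sanity check, not from the original post):<\/em><\/p>\n\n<pre style=\"font-size:13px\">\nimport numpy as np\n\np = np.full(1000, 0.1)      # a long chain of probabilities\nprint(np.prod(p))           # 0.0, because 1e-1000 underflows double precision\nprint(np.sum(np.log(p)))    # about -2302.6, perfectly fine in the log domain\n<\/pre>\n\n<p>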
Instead, we can run the algorithm in the log domain, in which case the recursion becomes:<\/p>\n\n    <div>\n\t$$\\begin{align*}\n    \\text{log}(\\alpha_{s,t})  &=  \\text{log}(\\alpha_{s,t-1} + \\alpha_{s-1,t-1} + \\alpha_{s-2,t-1}) + \\text{log } p_{\\theta}(y_s | x_t)\\\\\n    &=  \\text{log}(\\text{exp}(\\text{log }\\alpha_{s,t-1}) + \\text{exp}(\\text{log }\\alpha_{s-1,t-1}) + \\text{exp}(\\text{log }\\alpha_{s-2,t-1})) + \\text{log } p_{\\theta}(y_s | x_t)\\\\\n    &=  \\text{logsumexp}([\\text{log }\\alpha_{s,t-1},\\text{log }\\alpha_{s-1,t-1}, \\text{log }\\alpha_{s-2,t-1}]) + \\text{log } p_{\\theta}(y_s | x_t)\\\\\n\\end{align*}$$\n    <\/div>\n\n<p>Similar recursions using $\\text{logsumexp}$ can be derived for the forward-backward algorithm used in <a href=\"https:\/\/lorenlugosch.github.io\/posts\/2020\/01\/hmm\/\">Hidden Markov Models<\/a> and <a href=\"https:\/\/arxiv.org\/pdf\/1211.3711.pdf\">Transducer<\/a> models.<\/p>\n\n<p>Fun fact: if we replace the $\\text{logsumexp}$ with $\\text{max}$, we get the Viterbi algorithm, which gives us the score of the single most likely alignment (<a href=\"https:\/\/arxiv.org\/pdf\/2002.00876.pdf\">and if we backpropagate, the alignment itself<\/a>).<\/p>\n\n<h2 id=\"some-properties-of-textlogsumexp\">Some properties of $\\text{logsumexp}$<\/h2>\n\n<p>$\\text{logsumexp}$ has some useful properties. It is:<\/p>\n<ul>\n  <li><em>Convex.<\/em> If you can pose your machine learning problem as a <a href=\"https:\/\/web.stanford.edu\/~boyd\/cvxbook\/\">convex optimization<\/a> problem, you can solve it quickly and reliably.<\/li>\n  <li><em>Differentiable everywhere.<\/em> This is nice to have if your optimization algorithm is picky and doesn\u2019t like functions with non-differentiable points, like $\\text{max}$. For those picky algorithms, we can approximate $\\text{max}$ using $\\text{logsumexp}$.<\/li>\n  <li><em>Associative.<\/em> So:<\/li>\n<\/ul>\n\n    <div>\n\t$$\\begin{align*}\n    \\text{logsumexp}([a,b,c,d]) &= \\text{logsumexp}([ \\\\\n\n    & \\text{logsumexp}([a,b]), \\\\\n    & \\text{logsumexp}([c,d]) \\\\\n    ]). \\\\ \n\\end{align*}$$\n    <\/div>\n\n<ul>\n  <li>(Hence, it can be computed in just $\\text{log}_2(n)$ timesteps using a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Reduction_Operator\">parallel reduction<\/a>, where $n$ is the length of the vector we\u2019re $\\text{logsumexp}$ing.)<\/li>\n  <li><em>Close to<\/em> $\\text{max}$. In what sense \u201cclose\u201d? It is bounded as follows:<\/li>\n<\/ul>\n<center>$$\\begin{eqnarray*} \\text{max}(\\mathbf{x}) \\leq \\text{logsumexp}(\\mathbf{x})  \\leq \\text{max}(\\mathbf{x}) + \\text{log}(n) \\end{eqnarray*}$$<\/center>\n\n<h2 id=\"the-textlogsumexp-trick\">The $\\text{logsumexp}$ trick<\/h2>\n\n<p>Computing $\\text{log} \\left( \\sum_{i} \\text{exp}(x_i) \\right)$ directly is numerically unstable because of the $\\text{exp}$. 
Try the following in numpy:<\/p>\n\n<pre style=\"font-size:13px\">\nx = np.array([7000,8000,9000])\nnp.log(np.sum(np.exp(x)))\n<\/pre>\n\n<p>From the approximation $\\text{max}(\\mathbf{x}) \\approx \\text{logsumexp}(\\mathbf{x})$, we know that the result should be a little <a href=\"https:\/\/www.youtube.com\/watch?v=SiMHTK15Pik\">over 9000<\/a>\u2014but if you run this code, the result will be infinity because of overflow.<\/p>\n\n<p>Instead, to compute $\\text{logsumexp}$, use the following trick:<\/p>\n\n<center>$$\\begin{eqnarray*}\n\\text{logsumexp}(\\mathbf{x}) = \\text{log} \\left( \\sum_{i} \\text{exp}(x_i - \\text{max}(\\mathbf{x})) \\right) + \\text{max}(\\mathbf{x}).\n\\end{eqnarray*}$$<\/center>\n\n<p>(See <a href=\"https:\/\/www.xarg.org\/2016\/06\/the-log-sum-exp-trick-in-machine-learning\/\">this post<\/a> for the proof that the trick works.)<\/p>\n\n<p>Now we won\u2019t get an overflow because we\u2019re taking the $\\text{exp}$ of $[-2000,-1000,0]$ instead of $[7000,8000,9000]$. If we now run this instead:<\/p>\n\n<pre style=\"font-size:13px\">\nx = np.array([7000,8000,9000])\nnp.log(np.sum(np.exp(x - x.max()))) + x.max()\n<\/pre>\n\n<p>we\u2019ll get what we expect.<\/p>\n\n<h2 id=\"takeaways\">Takeaways<\/h2>\n<ul>\n  <li>$\\text{logsumexp}$ is everywhere!<\/li>\n  <li>You can make certain dense equations easier to digest by identifying instances of $\\text{logsumexp}$ and mentally replacing them with $\\text{max}$.\n    <blockquote>\n      <p>For example, from the discussion of softmax classifiers above, you now know that the loss for a classifier is just the difference between the maximum of the logits (roughly) and the logit for the right answer!<\/p>\n    <\/blockquote>\n  <\/li>\n  <li>When computing $\\text{logsumexp}$, use the \u201clogsumexp trick\u201d. (Or just check if your library already has a numerically stable $\\text{logsumexp}$ function, <a href=\"https:\/\/pytorch.org\/docs\/stable\/torch.html#torch.logsumexp\">as PyTorch does<\/a>.)<\/li>\n<\/ul>\n\n<hr \/>","author":{"name":"Loren Lugosch"},"category":{"@attributes":{"term":"machine learning"}},"summary":"$\\text{logsumexp}$ is an interesting little function that shows up surprisingly often in machine learning. 
Join me in this post to shed some light on $\\text{logsumexp}$: where it lives, how it behaves, and how to interpret it."},{"title":"Notebook: Fun with Hidden Markov Models","link":{"@attributes":{"href":"https:\/\/lorenlugosch.github.io\/posts\/2020\/01\/hmm\/","rel":"alternate","type":"text\/html","title":"Notebook: Fun with Hidden Markov Models"}},"published":"2020-01-28T00:00:00-08:00","updated":"2020-01-28T00:00:00-08:00","id":"https:\/\/lorenlugosch.github.io\/posts\/2020\/01\/hmm","content":"<p>I\u2019ve written a notebook introducing Hidden Markov Models (HMMs) with a PyTorch implementation of the forward algorithm, the Viterbi algorithm, and training a model on a text dataset\u2014check it out <a href=\"https:\/\/colab.research.google.com\/drive\/1IUe9lfoIiQsL49atSOgxnCmMR_zJazKI\">here<\/a>!<\/p>","author":{"name":"Loren Lugosch"},"category":{"@attributes":{"term":"sequence modeling"}},"summary":"I\u2019ve written a notebook introducing Hidden Markov Models (HMMs) with a PyTorch implementation of the forward algorithm, the Viterbi algorithm, and training a model on a text dataset\u2014check it out here!"},{"title":"An introduction to sequence-to-sequence learning","link":{"@attributes":{"href":"https:\/\/lorenlugosch.github.io\/posts\/2019\/02\/seq2seq\/","rel":"alternate","type":"text\/html","title":"An introduction to sequence-to-sequence learning"}},"published":"2019-02-19T00:00:00-08:00","updated":"2019-02-19T00:00:00-08:00","id":"https:\/\/lorenlugosch.github.io\/posts\/2019\/02\/seq2seq","content":"<p>Many interesting problems in artificial intelligence can be described in the following way:<\/p>\n<blockquote>\n  <p>Map a sequence of inputs $\\mathbf{x}$ to the correct sequence of outputs $\\mathbf{y}$.<\/p>\n<\/blockquote>\n\n<p>Speech recognition is one example: the goal is to map an audio signal $\\mathbf{x}$ (a sequence of real-valued audio samples) to the correct text transcript $\\mathbf{y}$ (a sequence of letters). Other examples are machine translation, image captioning, and speech synthesis.<\/p>\n\n<p>This post is a tutorial introduction to sequence-to-sequence learning, a method for using neural networks to solve these \u201csequence transduction\u201d problems. 
In this post, I will:<\/p>\n<ul>\n  <li>show you why these problems are interesting and challenging<\/li>\n  <li>give a detailed description of sequence-to-sequence learning\u2014or \u201cseq2seq\u201d, as the cool kids call it<\/li>\n  <li>walk through an example seq2seq application (with PyTorch code)<\/li>\n<\/ul>\n\n<p>The only background you need to read this is basic knowledge of how to train and use neural networks for classification problems.<\/p>\n\n<h2 id=\"the-problem\">The problem<\/h2>\n<p>Let\u2019s start by defining the general problem of sequence transduction a little more carefully.<\/p>\n\n<p>Let $\\mathbf{x} = \\{x_1, x_2, \\dots x_T\\}$ be the input sequence and $\\mathbf{y} = \\{y_1, y_2, \\dots, y_U\\}$ be the output sequence, where $x_t \\in \\mathcal{S}_x$, $y_u \\in \\mathcal{S}_y$, and $\\mathcal{S}_x$ and $\\mathcal{S}_y$ are the sets of possible things that each $x_t$ and $y_u$ can be, respectively.<\/p>\n\n<p>In speech recognition, $\\mathcal{S}_x$ would be the set of real numbers, $\\mathcal{S}_y$ would be the set of letters, $T$ might be on the order of thousands, and $U$ might be on the order of tens.<\/p>\n\n<p>We\u2019ll assume that the inputs and outputs are random variables, and the value of $T$ and $U$ may vary between examples of $\\mathbf{x}$ and $\\mathbf{y}$\u2014for example, not all sentences have the same length. The fact that $U$ may vary is part of what makes these problems challenging: we need to guess how long the output should be, and often this can\u2019t simply be inferred from the input length.<\/p>\n\n<p>The goal is to find the best function $f(\\mathbf{x})$ for mapping $\\mathbf{x}$ to $\\mathbf{y}$. But what does \u201cbest\u201d mean here?<\/p>\n\n<p>For a classification problem, the \u201cbest\u201d $f(\\mathbf{x})$ is the one with the highest accuracy, i.e. the lowest probability of guessing an output $\\mathbf{\\hat{y}}$ which is not equal to the true $\\mathbf{y}$.<\/p>\n\n<p>Similarly, we can use \u201caccuracy\u201d as a performance measure if the output is a sequence: if the guess $\\mathbf{\\hat{y}}$ is at all different from the true output $\\mathbf{y}$, the output is incorrect. That is, if $\\mathbf{y} = \\text{\u201chello\u201d}$ and $\\mathbf{\\hat{y}} = \\text{\u201chelo\u201d}$, then $\\mathbf{\\hat{y}}$ is incorrect. This is also called the \u201c$0$-$1$ loss\u201d: $0$ if $\\mathbf{\\hat{y}} = \\mathbf{y}$, $1$ otherwise.<\/p>\n\n<p>The simple $0$-$1$ loss is not a realistic performance measure: an output like $\\text{\u201clskdjfl\u201d}$ should really be regarded as less accurate than $\\text{\u201chelo\u201d}$ if the true output is $\\text{\u201chello\u201d}$.<\/p>\n\n<p>In practice, we probably care more about some other performance measure, such as the word error rate (speech recognition) or the BLEU score (machine translation, image captioning). But the $0$-$1$ loss is often a good approximation or surrogate for other performance measures, and it doesn\u2019t require any domain-specific knowledge to apply, so let\u2019s assume that the $0$-$1$ loss is what we\u2019re trying to optimize.<\/p>\n\n<h2 id=\"the-ideal-solution\">The ideal solution<\/h2>\n<p>Suppose that we have access to a magical genie who can tell us $p(\\mathbf{y}|\\mathbf{x})$, the probability that the correct output sequence is $\\mathbf{y}$ given that the input is $\\mathbf{x}$, for any $\\mathbf{x}$ and $\\mathbf{y}$. 
In that case, what would be the best function $f(\\mathbf{x})$ to use?<\/p>\n\n<p>Since we\u2019re trying to minimize the probability of making an error (guessing a $\\mathbf{\\hat{y}}$ which is not equal to the correct $\\mathbf{y}$), the following function is the best choice:<\/p>\n\n<center>$$\\begin{eqnarray*}\\mathbf{\\hat{y}} = f(\\mathbf{x}) = \\underset{\\mathbf{y}}{\\text{argmax }} p(\\mathbf{y}|\\mathbf{x})\\end{eqnarray*}$$<\/center>\n\n<p>Alas! We usually don\u2019t have a magical genie who can tell us $p(\\mathbf{y}|\\mathbf{x})$. What we can do instead is fit a model $p_{\\theta}(\\mathbf{y}|\\mathbf{x})$ to some training data, and if we\u2019re lucky, that model will be close to the true distribution $p(\\mathbf{y}|\\mathbf{x})$.<\/p>\n\n<h2 id=\"modeling-pmathbfy\">Modeling $p(\\mathbf{y})$<\/h2>\n\n<p>Before we consider implementing the model $p_{\\theta}(\\mathbf{y}|\\mathbf{x})$, let\u2019s start with a slightly easier but closely related problem: implementing a model $p_{\\theta}(\\mathbf{y})$ that is close to $p(\\mathbf{y})$.<\/p>\n\n<p>What $p(\\mathbf{y})$ represents is the probability of observing a particular output sequence $\\mathbf{y}$, independent of whatever the input sequence might be.<\/p>\n\n<p>Intuitively, what does this mean? Well, consider the two following sequences of words:<\/p>\n\n<center>$$\\begin{eqnarray*}\\mathbf{y}_1 = \\text{I ate food}\\end{eqnarray*}$$<\/center>\n\n<center>$$\\begin{eqnarray*}\\mathbf{y}_2 = \\text{I eight food}\\end{eqnarray*}$$<\/center>\n\n<p>The probability of observing $\\mathbf{y}_1$ should be higher than the probability of observing $\\mathbf{y}_2$, since $\\mathbf{y}_1$ is a meaningful English sentence, and $\\mathbf{y}_2$ is nonsensical. This fact could be useful in speech recognition to figure out which of the two phrases the person actually said, since both of these phrases sound the same if you say them out loud.<\/p>\n\n<p>Likewise, consider the two following sequences of letters:<\/p>\n\n<center>$$\\begin{eqnarray*}\\mathbf{y}_1 = \\text{florpy}\\end{eqnarray*}$$<\/center>\n\n<center>$$\\begin{eqnarray*}\\mathbf{y}_2 = \\text{fhqhwgads}\\end{eqnarray*}$$<\/center>\n\n<p>Although these two sequences are both not real English words, the first one $\\mathbf{y}_1$ really could be an English word, whereas the second one $\\mathbf{y}_2$ just looks like someone banging on a keyboard. If we were to train a model $p_{\\theta}(\\mathbf{y})$ on English text, we would probably find in this case that $p_{\\theta}(\\mathbf{y}_1) &gt; p_{\\theta}(\\mathbf{y}_2)$.<\/p>\n\n<p>To assign a probability to any sequence, we could just use a gigantic lookup table, with one entry for every possible $\\mathbf{y}$. 
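<\/p>\n\n<p>(A minimal sketch of what that lookup table might look like, with made-up numbers, just to make the idea concrete:)<\/p>\n\n<pre style=\"font-size:13px\">\n# a hypothetical lookup-table model: one made-up probability per possible sequence\np_table = {\n    'florpy': 1e-9,\n    'fhqhwgads': 1e-14,\n    # ... and an entry for every other possible sequence\n}\n\ndef p(y):\n    return p_table[y]\n<\/pre>\n\n<p>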
The problem is there are usually too many possible sequences for this to be feasible\u2014in fact, there may be infinitely many possible sequences.<\/p>\n\n<p>Instead, let\u2019s define a model that computes the probability of a sequence by dividing it into a number of simpler probabilities using the chain rule of probability.<\/p>\n\n<p>The chain rule of probability says that, for two random variables A and B, the probability of A <em>and<\/em> B is equal to the probability of A <em>given<\/em> B, multiplied by the probability of B:<\/p>\n\n<center>$$\\begin{eqnarray*} p(A,B) = p(A|B) \\cdot p(B) \\end{eqnarray*}$$<\/center>\n\n<p>That\u2019s one way to factorize $p(A,B)$\u2014we could also factorize it like this:<\/p>\n\n<center>$$\\begin{eqnarray*} p(A,B) = p(B|A) \\cdot p(A) \\end{eqnarray*}$$<\/center>\n\n<p>Moreover, you can apply the chain rule for as many random variables as you want:<\/p>\n\n    <div>\n\t$$\\begin{align*} \n\tp(A,B,C) &= p(A|B,C) \\cdot p(B,C) \\\\ \n\t&= p(A|B,C) \\cdot p(B|C) \\cdot p(C) \n\t\\end{align*}$$\n    <\/div>\n\n<p>Remember that our sequence $\\mathbf{y}$ is a collection of random variables $y_1, y_2, y_3, \\dots, y_U$. So let\u2019s use the chain rule to write out $p_{\\theta}(\\mathbf{y})$ as the product of the probability of each of these variables, given the previous ones:<\/p>\n\n    <div>\n$$\\begin{eqnarray*} p_{\\theta}(\\mathbf{y}) &=& p_{\\theta}(y_1, y_2, y_3, \\dots, y_U)\\\\ &=& \\overset{U}{\\underset{u=1}{\\prod}} p_{\\theta}(y_u|y_{u-1},y_{u-2},\\dots,y_{1}) \\end{eqnarray*}$$\n    <\/div>\n\n<p>This $p_{\\theta}(y_u|y_{u-1},y_{u-2},\\dots,y_{1})$ term means \u201cthe probability of the next element, given what came before\u201d. A model that implements $p_{\\theta}(y_u|y_{u-1},y_{u-2},\\dots,y_{1})$ is sometimes called a \u201cnext step predictor\u201d or \u201clanguage model\u201d. When you are texting someone, and your phone suggests the next word for you as you are typing, it uses a model like this.<\/p>\n\n<p>We can implement this \u201cnext step\u201d probability $p_{\\theta}(y_u|y_{u-1},y_{u-2},\\dots,y_{1})$ using a neural network<sup id=\"fnref:NPLM\" role=\"doc-noteref\"><a href=\"#fn:NPLM\" class=\"footnote\" rel=\"footnote\">1<\/a><\/sup>. The neural network takes as input the previous $y_u$\u2019s and predicts the next $y_u$. This is shown in the figure below:<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/seq2seq\/RNN_model.png\" style=\"max-width:50%\" \/><\/center>\n\n<p>For example, if $\\mathcal{S}_y$ is the 26 letters of the alphabet, then $y_{u-1},y_{u-2},\\dots,y_{1}$ would be the previous letters in the sequence, and the neural network would have a softmax output of size 26 representing the probability of the next letter given the previous letters.<\/p>\n\n<p>A common choice for the neural network is a recurrent neural network (RNN), which is what is shown in the diagram here.<\/p>\n\n<blockquote>\n  <p>An RNN, if you are not familiar, is a neural network with memory. At each timestep, the RNN takes in an input vector $i$ and its current state vector $h$, and outputs an updated state vector $h := f(i, h)$. Through the state $h$, the RNN can remember what inputs it has seen so far ($i_1, i_2, \\dots$). 
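<\/p>\n\n  <p>(In case it helps, here is a sketch of one step of a simple RNN cell in PyTorch; the sizes are made up for illustration, and real code would normally use a built-in module such as torch.nn.RNN or torch.nn.GRU.)<\/p>\n\n<pre style=\"font-size:13px\">\nimport torch\n\nhidden_size, input_size = 64, 32              # made-up sizes\nW_ih = torch.randn(hidden_size, input_size)   # input-to-hidden weights\nW_hh = torch.randn(hidden_size, hidden_size)  # hidden-to-hidden weights\nb = torch.zeros(hidden_size)\n\ndef rnn_step(i, h):\n    # updated state h := f(i, h)\n    return torch.tanh(W_ih @ i + W_hh @ h + b)\n<\/pre>\n\n  <p>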
The RNN can assign probabilities to different classes based on its state using a softmax classifier: $p(\\text{class }c) = \\text{softmax}(Wh + b)_c$<\/p>\n<\/blockquote>\n\n<p>To get the probability of the entire sequence, the chain rule tells us to multiply each $p_{\\theta}(y_u|y_{u-1},y_{u-2},\\dots,y_{1})$ term together. In other words, just multiply the softmax outputs together.<\/p>\n\n<p>For example, let\u2019s say that $\\mathcal{S}_y$ is the three letters $\\{a,b,c\\}$, and we want to calculate the probability of the sequence $ba$. In this case, the neural network would have a softmax output of size 3. If we have:<\/p>\n<ul>\n  <li>$p_{\\theta}(y_1) = [0.3, 0.4, 0.3]$, and<\/li>\n  <li>$p_{\\theta}(y_2|y_1=b) = [0.5, 0.3, 0.2]$,<\/li>\n<\/ul>\n\n<p>then $p_{\\theta}(ba) = p(a|b) p(b) = 0.5 \\cdot 0.4 = 0.2$.<\/p>\n\n<p>One final ingredient we need for a complete model $p_{\\theta}(\\mathbf{y})$ is a special element called the \u201cend-of-sequence\u201d element. Each sequence needs to end with \u201cend-of-sequence\u201d. Predicting \u201cend-of-sequence\u201d along with all the other elements of $\\mathcal{S}_y$ allows the model to implicitly predict the <em>length<\/em> of the sequence.<\/p>\n\n<h2 id=\"modeling-pmathbfymathbfx\">Modeling $p(\\mathbf{y}|\\mathbf{x})$<\/h2>\n\n<p>We\u2019re not just interested in $p(\\mathbf{y})$: what we really want to model is $p(\\mathbf{y}|\\mathbf{x})$, the probability of an output sequence <em>given<\/em> a particular input sequence.<\/p>\n\n<p>The simplest way to condition the output on the input is to split the model into an encoder RNN and a decoder RNN, where the encoder RNN converts the input sequence into a single vector that is used to \u201cprogram\u201d the decoder RNN.<sup id=\"fnref:EncDec\" role=\"doc-noteref\"><a href=\"#fn:EncDec\" class=\"footnote\" rel=\"footnote\">2<\/a><\/sup><\/p>\n\n<p>The encoder RNN reads the input sequence element-by-element. As the encoder reads each input element, it updates its state. The final state of the encoder after it has read the entire input sequence represents a fixed-length encoding of the input sequence. This encoding then becomes the initial state of the decoder RNN:<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/seq2seq\/encoder_decoder.png\" style=\"max-width:75%\" \/><\/center>\n\n<p>Hopefully, the encoding (a single vector) will contain all the information in the input needed for the decoder to accurately model the correct output. Then we just proceed to calculate $p_{\\theta}(\\mathbf{y}|\\mathbf{x})$ as we would calculate $p_{\\theta}(\\mathbf{y})$, by multiplying the probabilities of all the $y_u$\u2019s together:<\/p>\n\n    <div>\n$$\\begin{eqnarray*} p_{\\theta}(\\mathbf{y}|\\mathbf{x}) &=& p_{\\theta}(y_1, y_2, y_3, \\dots, y_U|\\mathbf{x})\\\\ &=& \\overset{U}{\\underset{u=1}{\\prod}} p_{\\theta}(y_u|y_{u-1},y_{u-2},\\dots,y_{1},\\mathbf{x}) \\end{eqnarray*}$$\n    <\/div>\n\n<p>Here\u2019s a question you might ask, looking at the diagram of the model: Given that an RNN has state, and can remember what it has already outputted, why do we need to feed $y_1, y_2, \\dots$ into the decoder to compute the probability of the next output $y_u$?<\/p>\n\n<p>There are two good reasons.<\/p>\n<ul>\n  <li>First, if we don\u2019t make the output explicitly conditioned on the previous outputs, we are implicitly saying that the outputs are independent, which may not be a good assumption. Consider transcribing a recording of someone saying \u201ctriple A\u201d. 
There are two valid transcriptions: $\\text{AAA}$ and $\\text{triple A}$. If the first output $y_1$ is $\\text{A}$, then we can be certain that the second output $y_2$ will be $\\text{A}$.<sup id=\"fnref:AAA\" role=\"doc-noteref\"><a href=\"#fn:AAA\" class=\"footnote\" rel=\"footnote\">3<\/a><\/sup><\/li>\n<\/ul>\n\n<ul>\n  <li>Second, feeding in previous outputs allows us to use a feedforward model (which does not have state) for the decoder. A feedforward model can be much faster to train, since each $p_{\\theta}(y_u|y_{u-1},y_{u-2},\\dots,y_{1},\\mathbf{x})$ term can be computed in parallel.<sup id=\"fnref:convS2S\" role=\"doc-noteref\"><a href=\"#fn:convS2S\" class=\"footnote\" rel=\"footnote\">4<\/a><\/sup><\/li>\n<\/ul>\n\n<p>Although we\u2019ve been writing $p_{\\theta}(\\mathbf{y}|\\mathbf{x})$, it\u2019s usually better to work with the log probability $\\text{log } p_{\\theta}(\\mathbf{y}|\\mathbf{x})$ instead, for a few reasons. First, it is often easier to work with sums than it is to work with products. Recall that $\\text{log }(a \\cdot b) = \\text{log }a + \\text{log }b$. Thus, if you take the log of $p_{\\theta}(\\mathbf{y}|\\mathbf{x})$, it becomes a sum instead of a product:<\/p>\n\n    <div>\n$$\\begin{eqnarray*} \\text{log } p_{\\theta}(\\mathbf{y}|\\mathbf{x}) &=& \\text{log } p_{\\theta}(y_1, y_2, y_3, \\dots, y_U|\\mathbf{x})\\\\ &=& \\overset{U}{\\underset{u=1}{\\sum}} \\text{log } p_{\\theta}(y_u|y_{u-1},y_{u-2},\\dots,y_{1},\\mathbf{x}) \\end{eqnarray*}$$\n    <\/div>\n\n<p>This prevents multiplying together a bunch of numbers smaller than 1, which could cause an underflow. It also makes training the model using gradient descent easier, since minimizing a sum of terms is easier than minimizing a product of terms.<sup id=\"fnref:product\" role=\"doc-noteref\"><a href=\"#fn:product\" class=\"footnote\" rel=\"footnote\">5<\/a><\/sup> And finally, since many useful probability distributions have an $\\text{exp}(\\cdot)$ term (including softmax, Gaussian, and Poisson), and $\\text{log }\\text{exp}(a) = a$, taking the log may transform $p_{\\theta}$ into a simpler form.<\/p>\n\n<h2 id=\"learning\">Learning<\/h2>\n<p>Now that we have a way of computing $p_{\\theta}(\\mathbf{y}|\\mathbf{x})$ using a neural network, we need to train the model, i.e. find $\\theta$ such that the model distribution $p_{\\theta}(\\mathbf{y}|\\mathbf{x})$ matches the true distribution $p(\\mathbf{y}|\\mathbf{x})$. As usual with neural networks, we can do that by minimizing a loss function.<\/p>\n\n<p>Many seq2seq papers don\u2019t explicitly write out a loss function, though. Instead, they will just say something like  \u201cwe use maximum likelihood\u201d or \u201cwe minimize the negative log likelihood\u201d. Here, we will see how this translates to a particular loss function that you can implement.<\/p>\n\n<p>Suppose that you have a training set $\\mathcal{T}$ composed of ($\\mathbf{x}^i, \\mathbf{y}^i$) pairs. If the training examples are considered to be fixed, and you think of $p_{\\theta}(\\mathbf{y}|\\mathbf{x})$ as a function of the parameters $\\theta$, then we call $p_{\\theta}(\\mathbf{y}^i|\\mathbf{x}^i)$ the \u201clikelihood\u201d of the training example ($\\mathbf{x}^i, \\mathbf{y}^i$). 
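<\/p>\n\n<p>(As a sketch of how this per-example likelihood might be computed in PyTorch, in the log domain and with teacher forcing, assuming the decoder has already produced a matrix of scores with one row per output step; this is just an illustration, not the exact code used for the experiments later in this post:)<\/p>\n\n<pre style=\"font-size:13px\">\nimport torch\nimport torch.nn.functional as F\n\ndef sequence_log_likelihood(decoder_logits, y):\n    # decoder_logits: (U, |S_y|) scores from the decoder, computed with teacher forcing\n    #                 (the correct previous outputs and the input x were fed in)\n    # y:              (U,) tensor of the indices of the correct outputs\n    log_probs = F.log_softmax(decoder_logits, dim=-1)   # log p(y_u | previous outputs, x)\n    return log_probs.gather(1, y.unsqueeze(1)).sum()    # add up the correct outputs' log probabilities\n<\/pre>\n\n<p>(Negating this quantity and summing it over the training set gives the loss described next.)<\/p>\n\n<p>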
In maximum likelihood estimation, the model parameters $\\theta$ are learned by maximizing the likelihood of the entire training set, $L_{\\theta}(\\mathcal{T})$, which is the product of the likelihoods of all the training examples:<\/p>\n\n<center>$$\\begin{eqnarray*}L_{\\theta}(\\mathcal{T}) = \\overset{|\\mathcal{T}|}{\\underset{i=1}{\\prod}} p_{\\theta}(\\mathbf{y}^i|\\mathbf{x}^i)\\end{eqnarray*}$$<\/center>\n\n<p>If these ($\\mathbf{x}^i, \\mathbf{y}^i$) samples are independent and identically distributed (i.i.d.), then as the number of samples increases, maximum likelihood estimation gives you a model that is closer and closer to the true distribution.<\/p>\n\n<p>As mentioned earlier, it\u2019s easier to maximize a sum than it is to maximize a product, so we take the log, and maximize that instead. (The log function is monotonic\u2014that is, $\\text{log }a &gt; \\text{log }b$ implies that $a &gt; b$\u2014so maximizing the log likelihood is equivalent to maximizing the likelihood.) Also, in machine learning it\u2019s often more natural to think of <em>minimizing<\/em> a loss function, so we minimize the <em>negative<\/em> log likelihood:<\/p>\n\n<center>$$\\begin{eqnarray*}-\\text{log } L_{\\theta}(\\mathcal{T}) = -\\overset{|\\mathcal{T}|}{\\underset{i=1}{\\sum}} \\text{log } p_{\\theta}(\\mathbf{y}^i|\\mathbf{x}^i)\\end{eqnarray*}$$<\/center>\n\n<p>If we expand the summed term, we get:<\/p>\n\n<center>$$\\begin{eqnarray*}-\\text{log } L_{\\theta}(\\mathcal{T}) = -\\overset{|\\mathcal{T}|}{\\underset{i=1}{\\sum}} \\overset{U^i}{\\underset{u=1}{\\sum}} \\text{log } p_{\\theta}(y_{u}^i|y_{u-1}^i,y_{u-2}^i,\\dots,y_{1}^i,\\mathbf{x}^i)\\end{eqnarray*}$$<\/center>\n\n<p>In other words, for every example in the dataset, we sum up the negative log probability of the correct output at each timestep, given the previous correct outputs.<\/p>\n\n<blockquote>\n  <p>Feeding the previous <strong>correct<\/strong> outputs into the model during training, as opposed to the model\u2019s own predictions, is called \u201cteacher forcing\u201d. A long time ago (in deep learning years, which are like dog years), it was thought that teacher forcing is bad, and you should sometimes sample previous outputs from the model\u2019s output distribution.<sup id=\"fnref:Scheduled\" role=\"doc-noteref\"><a href=\"#fn:Scheduled\" class=\"footnote\" rel=\"footnote\">6<\/a><\/sup> Nowadays, this is less common, and big parallelizable seq2seq models like the Transformer<sup id=\"fnref:Transformer\" role=\"doc-noteref\"><a href=\"#fn:Transformer\" class=\"footnote\" rel=\"footnote\">7<\/a><\/sup> rely on teacher forcing to go fast. Also, the name \u201cteacher forcing\u201d makes it sound like a hack, but really it\u2019s the right way to apply maximum likelihood!<\/p>\n<\/blockquote>\n\n<p>To minimize the negative log likelihood loss, you can use stochastic gradient descent (SGD), just like in a regular classification problem.<\/p>\n\n<h2 id=\"inference\">Inference<\/h2>\n<p>So far, we\u2019ve described how you can use a neural network to compute $p_{\\theta}(\\mathbf{y}|\\mathbf{x})$, and how to train the model using maximum likelihood.<\/p>\n\n<p>The question remains: How do we generate an output? 
That is, how do we find $\\underset{\\mathbf{y}}{\\text{argmax }} p_{\\theta}(\\mathbf{y}|\\mathbf{x})$, given a new $\\mathbf{x}$?<\/p>\n\n<p>The brute force solution is an exhaustive search: just compute $p_{\\theta}(\\mathbf{y}|\\mathbf{x})$ for every possible $\\mathbf{y}$ and pick the $\\mathbf{y}$ with the highest probability. That\u2019s exactly what we do for classification problems.<\/p>\n\n<p>However, unlike a typical classification problem, where you might have a thousand classes, in any practical sequence prediction problem there will be astronomically many possible output sequences, so an exhaustive search is infeasible.<\/p>\n\n<p>That means we need a new ingredient: an efficient search algorithm. The goal of the search is to approximately find $\\mathbf{y}^* \\approx \\underset{\\mathbf{y}}{\\text{argmax }} p_{\\theta}(\\mathbf{y}|\\mathbf{x})$.<\/p>\n\n<p>(Note: for historical reasons<sup id=\"fnref:Jelinek\" role=\"doc-noteref\"><a href=\"#fn:Jelinek\" class=\"footnote\" rel=\"footnote\">8<\/a><\/sup>, the search process is also often called \u201cdecoding\u201d. I\u2019m not a big fan of this terminology because \u201cdecoding\u201d already means several other things in machine learning.)<\/p>\n\n<p>We will consider two search algorithms:<\/p>\n<ul>\n  <li>greedy search<\/li>\n  <li>beam search<\/li>\n<\/ul>\n\n<p><strong>1) Greedy search.<\/strong> A greedy search works as follows: at each step, pick the top output of the network, and feed this output back into the network. In other words, for each timestep $u$, pick $y_u^* = \\underset{y_u}{\\text{argmax }}  p_{\\theta}(y_u | y_{u-1}^*, y_{u-2}^*, \\dots, y_{1}^*, \\mathbf{x})$.<\/p>\n\n<p>The search can continue until an \u201cend-of-sequence\u201d is predicted, or until a maximum number of steps has been reached. An example of a greedy search is shown in the diagram below:<\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/seq2seq\/greedy_search.png\" style=\"max-width:75%\" \/><\/center>\n\n<p>Because inference requires making a prediction, and feeding it back in to make the next prediction, this type of model is called \u201cautoregressive\u201d (\u201cauto\u201d=\u201cself\u201d, \u201cregress\u201d=\u201cpredict\u201d).<\/p>\n\n<p><strong>2) Beam search.<\/strong> Greedy searching is fast, but you can show that it will not always find the most likely output sequence. As an example, suppose that:<\/p>\n<ul>\n  <li>$p_{\\theta}(a|\\mathbf{x}) = 0.4$<\/li>\n  <li>$p_{\\theta}(b|\\mathbf{x}) = 0.6$<\/li>\n<\/ul>\n\n<p>and<\/p>\n\n<ul>\n  <li>$p_{\\theta}(aa|\\mathbf{x}) = 0.4$<\/li>\n  <li>$p_{\\theta}(ab|\\mathbf{x}) = 0.0$<\/li>\n  <li>$p_{\\theta}(ba|\\mathbf{x}) = 0.35$<\/li>\n  <li>$p_{\\theta}(bb|\\mathbf{x}) = 0.25$<\/li>\n<\/ul>\n\n<p>Here, a greedy search would pick $b$, then $a$, and thus return $ba$, which has probability $0.35$. But the most likely sequence is actually $aa$, which has a probability of $0.4$.<\/p>\n\n<p>We can get better results if we delay decisions about keeping a particular output until we have considered some future outputs. Beam search is one way of doing this.<\/p>\n\n<p>In a beam search, we maintain a list (\u201cbeam\u201d) of $B$ likely sequences (\u201chypotheses\u201d). At each step, for each hypothesis, we compute the top $B$ outputs, and append them to the hypothesis. Now we have $B^2$ hypotheses. 
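<\/p>\n\n<p>(Here is a minimal sketch, in PyTorch, of what one step of this procedure could look like, including the pruning described next. The function next_step_log_probs is hypothetical: it stands for running the decoder for one step on a hypothesis and returning the log probabilities over $\\mathcal{S}_y$.)<\/p>\n\n<pre style=\"font-size:13px\">\nimport torch\n\ndef beam_search_step(beam, B, x):\n    # beam: a list of (hypothesis, log probability) pairs, e.g. [([2, 5], -1.3), ...]\n    candidates = []\n    for hypothesis, score in beam:\n        log_probs = next_step_log_probs(hypothesis, x)         # log p(y_u | previous outputs, x)\n        top_log_probs, top_outputs = torch.topk(log_probs, B)  # top B outputs for this hypothesis\n        for lp, y_u in zip(top_log_probs.tolist(), top_outputs.tolist()):\n            candidates.append((hypothesis + [y_u], score + lp))\n    # we now have B*B candidate hypotheses; prune back down to the top B\n    candidates.sort(key=lambda c: c[1], reverse=True)\n    return candidates[:B]\n<\/pre>\n\n<p>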
Of these, we prune the beam down to the top $B$, and then we continue to the next step.<\/p>\n\n<p>An example of a beam search with $B=3$ is shown below.<sup id=\"fnref:CTC\" role=\"doc-noteref\"><a href=\"#fn:CTC\" class=\"footnote\" rel=\"footnote\">9<\/a><\/sup><\/p>\n\n<center><img src=\"https:\/\/lorenlugosch.github.io\/images\/seq2seq\/beam_search.png\" style=\"max-width:75%\" \/><\/center>\n\n<p>The algorithm returns the top $B$ hypotheses found. If your goal is just to estimate $\\mathbf{y}$, you would just keep the top hypothesis, but for some applications it may also be useful to keep the rest of the hypotheses.<\/p>\n\n<p>Notice that if $B = 1$, the beam search is equivalent to a greedy search. Also, if $B = \\infty$, the beam search becomes an exhaustive search.<\/p>\n\n<p>Beam search, too, is not guaranteed to find the most likely output sequence, but the wider you make the beam, the smaller the chance of a search error.<sup id=\"fnref:searcherror\" role=\"doc-noteref\"><a href=\"#fn:searcherror\" class=\"footnote\" rel=\"footnote\">10<\/a><\/sup> The tradeoff is that you must re-run the neural network $B$ times, since you need to feed outputs back in.<\/p>\n\n<h2 id=\"attention\">Attention<\/h2>\n<p>We now have a complete method for doing sequence-to-sequence learning.<\/p>\n\n<p>Unfortunately, if you apply the method exactly as described above\u2014using an encoder RNN to map the input to a fixed-length vector consumed by the decoder RNN\u2014it will not work for long sequences. The problem is that it is difficult to compress the entire input sequence into a single fixed-length vector.<sup id=\"fnref:Cho\" role=\"doc-noteref\"><a href=\"#fn:Cho\" class=\"footnote\" rel=\"footnote\">11<\/a><\/sup><\/p>\n\n<p>Attention<sup id=\"fnref:Bahdanau\" role=\"doc-noteref\"><a href=\"#fn:Bahdanau\" class=\"footnote\" rel=\"footnote\">12<\/a><\/sup> is a mechanism that removes this fixed-length bottleneck. With attention, the decoder does not rely on a single vector to represent the input; instead, at every decoding step, it \u201clooks\u201d at a different part of the input using a weighted sum.<\/p>\n\n<p>A more detailed introduction to the various forms of attention can be found <a href=\"https:\/\/lilianweng.github.io\/lil-log\/2018\/06\/24\/attention-attention.html\">here<\/a>.<\/p>\n\n<h2 id=\"a-toy-task\">A toy task<\/h2>\n<p>Let\u2019s use a toy task to test out our method. Consider the following string:<\/p>\n\n<blockquote>\n  <p>\u201cMst ppl hv lttl dffclty rdng ths sntnc\u201d<\/p>\n<\/blockquote>\n\n<p>This string is an example<sup id=\"fnref:Shannon\" role=\"doc-noteref\"><a href=\"#fn:Shannon\" class=\"footnote\" rel=\"footnote\">13<\/a><\/sup> of how natural language is highly redundant or predictable: it\u2019s generated by removing all the vowels from a normal sentence, and you can still understand the meaning.<\/p>\n\n<p>If human intelligence can infer the missing vowels, then maybe artificial intelligence can as well! We will train a sequence-to-sequence model to take as input a vowel-less sentence (\u201cMst ppl\u201d) and output the sentence with the correct vowels re-inserted (\u201cMost people\u201d).<\/p>\n\n<blockquote>\n  <p>This toy task is a bit easier to work with than a task like speech recognition or translation, in which you need lots of labelled data and lots of tricks to get something to work. 
A tutorial which covers the actual useful task of translating from French to English using PyTorch can be found <a href=\"https:\/\/pytorch.org\/tutorials\/intermediate\/seq2seq_translation_tutorial.html\">here<\/a>.<\/p>\n<\/blockquote>\n\n<h2 id=\"running-the-experiment\">Running the experiment<\/h2>\n<p>We can easily generate a dataset for the task of inferring missing vowels using existing text. I used the text of \u201cWar and Peace\u201d<sup id=\"fnref:Karpathy\" role=\"doc-noteref\"><a href=\"#fn:Karpathy\" class=\"footnote\" rel=\"footnote\">14<\/a><\/sup>: the input sequences are lines from the text with all the vowels removed, and the target output sequences are just the original lines. I also added the Penn Treebank (PTB) dataset, a commonly used dataset of news articles for language modeling experiments, to give the training data a little variety.<\/p>\n\n<p>To run the code for yourself, or to train a model for filling in missing vowels on a new dataset, the code and a pre-trained model for this experiment can be found <a href=\"https:\/\/github.com\/lorenlugosch\/infer_missing_vowels\">here<\/a>.<\/p>\n\n<p>We will first try training the simple encoder-decoder described above, without an attention mechanism. Here\u2019s the result when we run a new sentence through the model:<\/p>\n\n<pre style=\"font-size:13px\">\n<b>input:<\/b> Mst ppl hv lttl dffclty rdng ths sntnc.\n<b>truth:<\/b> Most people have little difficulty reading this sentence.\n<b>guess:<\/b> Mostov played with Prince Andrew and strengthers and\n<\/pre>\n\n<p>Oh dear! The search starts off strong\u2014it correctly outputs \u201cMost\u201d\u2014but then it gets distracted and tries to fill in the name \u201cRostov\u201d (the name of a character in \u201cWar and Peace\u201d). The next word, \u201cplayed\u201d, at least starts with the right letter, the \u201cp\u201d in \u201cpeople\u201d, but after that, the output really goes off the rails.<\/p>\n\n<p>If we let the simple encoder-decoder model train a lot longer, the results get a bit better, but we still find bizarre mistakes like this:<\/p>\n\n<pre style=\"font-size:13px\">\n<b>input:<\/b> th dy bfr, nmly, tht th cmmndr-n-chf\n<b>truth:<\/b> the day before, namely, that the commander-in-chief\n<b>guess:<\/b> the day before, manling that the mimicinacying Freemason\n<\/pre>\n\n<p>So let\u2019s add in that fancy attention mechanism I mentioned and see if that helps:<\/p>\n\n<pre style=\"font-size:13px\">\n<b>input:<\/b> Mst ppl hv lttl dffclty rdng ths sntnc.\n<b>truth:<\/b> Most people have little difficulty reading this sentence.\n<b>guess:<\/b> Most people have little difficulty riding this sentence.\n<\/pre>\n\n<p>Much better! But still not completely correct.<\/p>\n\n<p>Let\u2019s look at the beam (the $B$ hypotheses found by the beam search) and the beam scores (the hypotheses\u2019 log probabilities):<\/p>\n<pre style=\"font-size:13px\">\nMost people have little difficulty riding this sentence.   | -2.98\n<b>Most people have little difficulty reading this sentence.  | -3.28<\/b>\nMost people have little difficulty roading this sentence.  | -3.79\nMost people have little difficulty riding those sentence.  | -3.81\nMost people have little difficulty reading those sentence. | -4.11\nMost people have little difficulty riding these sentence.  | -4.16\nMost people have little difficulty reading these sentence. | -4.45\nMost people have little difficulty roading those sentence. 
| -4.60\n<\/pre>\n\n<p>The correct answer does in fact appear in the beam (the 2nd hypothesis), but the model incorrectly assigns a higher probability to the hypothesis with \u201criding\u201d instead of \u201creading\u201d. Maybe with more\/better training data this error would not occur, since \u201creading this sentence\u201d ought to be a lot more probable in the training data than \u201criding this sentence\u201d.<\/p>\n\n<p>Notice another peculiar aspect of the beam: it is ordered (roughly) from shortest to longest. Autoregressive models are biased towards shorter output sequences!<\/p>\n\n<blockquote>\n  <p>Why? At each timestep, to compute the probability of a sequence, we multiply it (or add it, in the log domain) by the probability of the next output, which is always less than 1 (less than 0, in the log domain), so the probability of the complete sequence keeps getting smaller. In fact, the only thing keeping the search from producing outputs of length 0 is the fact that the model needs to predict the \u201cend-of-sequence\u201d token, and from the training data the model learns to assign low probability to \u201cend-of-sequence\u201d until it makes sense. It\u2019s easy for the model to learn that a sentence like \u201cThe.\u201d is very unlikely, but comparing two plausible sequences of roughly the same length seems to be harder.<\/p>\n<\/blockquote>\n\n<p>How do people deal with the short-sequence bias in practice? Google Translate uses a variation<sup id=\"fnref:GNMT\" role=\"doc-noteref\"><a href=\"#fn:GNMT\" class=\"footnote\" rel=\"footnote\">15<\/a><\/sup> of beam search in which the log probability is divided by a \u201clength penalty\u201d (\u201clp\u201d), with a hyperparameter $\\alpha$, that is computed as follows:<\/p>\n\n<center>$$\\begin{eqnarray*} \\text{lp}(\\mathbf{y}) = \\frac{(5 + |\\mathbf{y}|)^{\\alpha}}{(5 + 1)^{\\alpha}}\\end{eqnarray*}$$<\/center>\n\n<p>Yikes. How many TPU hours did they burn finding that formula? I hope we find a better way to mitigate the short-sequence bias!<\/p>\n\n<h2 id=\"the-end\">The End<\/h2>\n\n<p>If you have any questions or if you find something wrong with this tutorial, please let me know.<\/p>\n\n<p>Check out the code and try it out! It\u2019s fun to feed the model random inputs, like your name, and see what stuff it comes up with trying to fill in the gaps. The code is also written in such a way that it should not be too hard to adapt it to a new task.<\/p>\n\n<hr \/>\n\n<p><em>Thanks to Christoph Conrads and Mirco Ravanelli for their feedback on the draft of this post.<\/em><\/p>\n\n<div class=\"footnotes\" role=\"doc-endnotes\">\n  <ol>\n    <li id=\"fn:NPLM\" role=\"doc-endnote\">\n      <p>See: Yoshua Bengio, R\u00e9jean Ducharme, Pascal Vincent, Christian Jauvin, \u201c<a href=\"http:\/\/www.jmlr.org\/papers\/volume3\/bengio03a\/bengio03a.pdf\">A Neural Probabilistic Language Model<\/a>\u201d, Journal of Machine Learning Research 3 (2003), 1137\u20131155.\u00a0<a href=\"#fnref:NPLM\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:EncDec\" role=\"doc-endnote\">\n      <p>Like many great ideas, the encoder-decoder model was invented independently and simultaneously by multiple groups. The paper that is usually cited, which invented the name \u201csequence-to-sequence learning\u201d, is this one: Ilya Sutskever, Oriol Vinyals, and Quoc V. 
Le, <a href=\"https:\/\/papers.nips.cc\/paper\/5346-sequence-to-sequence-learning-with-neural-networks.pdf\">\u201cSequence to sequence learning with neural networks\u201d<\/a>, NeurIPS 2014.\u00a0<a href=\"#fnref:EncDec\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:AAA\" role=\"doc-endnote\">\n      <p>The \u201ctriple A\u201d example comes from: William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. \u201c<a href=\"https:\/\/arxiv.org\/abs\/1508.01211\">Listen, attend and spell: A neural network for large vocabulary conversational speech recognition<\/a>\u201d, ICASSP 2016.\u00a0<a href=\"#fnref:AAA\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:convS2S\" role=\"doc-endnote\">\n      <p>For an example of this, see: Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. \u201c<a href=\"https:\/\/arxiv.org\/abs\/1705.03122\">Convolutional sequence to sequence learning<\/a>\u201d, ICML 2017.\u00a0<a href=\"#fnref:convS2S\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:product\" role=\"doc-endnote\">\n      <p>Why? Consider minimizing $y_{sum} = x_1 + x_2$ and $y_{prod} = x_1 \\cdot x_2$. The derivative $\\frac{dy_{sum}}{dx_1}$ is just $1$ (independent of $x_2$), whereas the derivative $\\frac{dy_{prod}}{dx_1}$ is equal to $x_2$. If $x_2$ is very small, $x_1$ will be \u201cheld back\u201d from changing easily if you try to minimize $y_{prod}$\u2014imagine running a race if you are tied to a slow person by a rope.\u00a0<a href=\"#fnref:product\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:Scheduled\" role=\"doc-endnote\">\n      <p>See: Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer, \u201c<a href=\"https:\/\/arxiv.org\/abs\/1506.03099\">Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks<\/a>\u201d, NeurIPS 2015.\u00a0<a href=\"#fnref:Scheduled\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:Transformer\" role=\"doc-endnote\">\n      <p>See: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin, \u201c<a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Attention is all you need<\/a>\u201d, NeurIPS 2017.\u00a0<a href=\"#fnref:Transformer\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:Jelinek\" role=\"doc-endnote\">\n      <p>Some of the early pioneers in speech recognition and machine translation, like Fred Jelinek, originally worked on digital communications and error-correcting codes, where it really does make sense to refer to searching for the best output as \u201cdecoding\u201d. They didn\u2019t bother to find a better word and kept saying \u201cdecoding\u201d when they started working on speech recognition. 
It may not be entirely surprising that Jelinek once said \u201cevery time I fire a linguist, the performance of my speech recognizer goes up.\u201d\u00a0<a href=\"#fnref:Jelinek\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:CTC\" role=\"doc-endnote\">\n      <p>Taken from: Awni Hannun, \u201c<a href=\"https:\/\/distill.pub\/2017\/ctc\/\">Sequence Modeling with CTC<\/a>\u201d, Distill, 2017.\u00a0<a href=\"#fnref:CTC\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:searcherror\" role=\"doc-endnote\">\n      <p>A search error happens when $p_{\\theta}(\\mathbf{y}|\\mathbf{x}) &gt; p_{\\theta}(\\mathbf{\\hat{y}}|\\mathbf{x})$, but $\\mathbf{y}$ wasn\u2019t found during the search. In other words, the model correctly assigns more probability to the correct output sequence than an incorrect output sequence, but the search just didn\u2019t get a chance to evaluate the correct sequence. Another type of error is when the model assigns more probability to an incorrect sequence and picks that sequence as a result.\u00a0<a href=\"#fnref:searcherror\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:Cho\" role=\"doc-endnote\">\n      <p>This paper discovered and diagnosed the problem: Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, Yoshua Bengio, \u201c<a href=\"https:\/\/arxiv.org\/abs\/1409.1259\">On the Properties of Neural Machine Translation: Encoder\u2013Decoder Approaches<\/a>\u201d, SSST-8, 2014.\u00a0<a href=\"#fnref:Cho\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:Bahdanau\" role=\"doc-endnote\">\n      <p>See: Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio, \u201c<a href=\"http:\/\/arxiv.org\/abs\/1409.0473\">Neural Machine Translation by Jointly Learning to Align and Translate<\/a>\u201d, ICLR 2015.\u00a0<a href=\"#fnref:Bahdanau\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:Shannon\" role=\"doc-endnote\">\n      <p>Supposedly given by Claude Shannon, although I haven\u2019t been able to find the original reference where he wrote it.\u00a0<a href=\"#fnref:Shannon\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:Karpathy\" role=\"doc-endnote\">\n      <p>Provided by Andrej Karpathy in the code for <a href=\"https:\/\/karpathy.github.io\/2015\/05\/21\/rnn-effectiveness\/\">The Unreasonable Effectiveness of Recurrent Neural Networks<\/a>.\u00a0<a href=\"#fnref:Karpathy\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n    <li id=\"fn:GNMT\" role=\"doc-endnote\">\n      <p>See: Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. 
Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, \u0141ukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, Jeffrey Dean, \u201c<a href=\"https:\/\/arxiv.org\/abs\/1609.08144\">Google\u2019s neural machine translation system: Bridging the gap between human and machine translation<\/a>\u201d, arXiv preprint arXiv:1609.08144, 2016.\u00a0<a href=\"#fnref:GNMT\" class=\"reversefootnote\" role=\"doc-backlink\">&#8617;<\/a><\/p>\n    <\/li>\n  <\/ol>\n<\/div>","author":{"name":"Loren Lugosch"},"category":{"@attributes":{"term":"sequence modeling"}},"summary":"Many interesting problems in artificial intelligence can be described in the following way: Map a sequence of inputs $\\mathbf{x}$ to the correct sequence of outputs $\\mathbf{y}$."}]}