[{"content":"Most generative AI models nowadays are autoregressive. That means they&rsquo;re following the concept of next token prediction, and the transformer architecture is the current implementation that has been used for years now thanks to its computational efficiency. This is a rather simple concept that\u2019s easy to understand - as long as you aren&rsquo;t interested in the details - everything can be tokenized and fed into an autoregressive (AR) model. And by everything, I mean everything: text as you&rsquo;d expect, but also images, videos, 3D models and whatnot. There is no limit to what can be represented and generated by an autoregressive model, and while pre-training is far from solved, I think it&rsquo;s fair to say everyone more or less knows what to do. That&rsquo;s why today&rsquo;s autoregressive models, &ldquo;multimodal reasoning general&rdquo; large language models (LLMs), are statistical models so powerful that we may see traits of generalization.\nUpdate: since the article caught a bit of attention on HN, I modified some bits that weren&rsquo;t clear or reflective of the message I was trying to convey. This might be yet another critique against LLMs, but this is precisely because I use LLMs all the time that I was able to write how I felt about them. That said, I&rsquo;m very grateful for the criticism and skepticism. This is what this rant (of sorts) is all about: triggering a discussion that I find interesting!\nThe purpose of AI research But what is the original purpose of AI research? I will speak for myself here, but I know many other AI researchers will say the same: the ultimate goal is to understand how humans think. I think one of the best (or the funniest) ways to understand how humans think is to try to recreate it. I come from a medical science background, where I studied the brain from a &ldquo;traditional&rdquo; neuroscience perspective (biology, pathology, anatomy, psychology and whatnot). 
The idea that a good way to understand human-level intelligence is to try to recreate it is honestly how I feel whenever I read about AI advancements where the clear goal is to achieve\/surpass human intelligence, something we don&rsquo;t fully understand yet.\n\u201cWhat I cannot create, I do not understand.\u201d - Richard Feynman\nBut today, when you see AI being mentioned, it&rsquo;s mostly about autoregressive models like LLMs. Major players in the field think they may achieve artificial general intelligence (AGI) by continuing to scale the models and applying all sorts of tricks that happen to work (multimodality, pure reinforcement learning, test-time compute and search, agentic systems). It&rsquo;s really too early to tell if there&rsquo;s a ceiling to this approach and I\u2019m not one to pretend to know the absolute truth.\nHowever, I keep asking myself the following question:\nAre autoregressive models the best way to approximate human thinking?\nYou can say LLMs are fundamentally dumb because of their inherent linearity. Are they? Isn&rsquo;t language by itself linear (more precisely, the presentation of it)? Autoregressive models may well be a simple yet powerful approach after all, since they are remarkably effective at modeling human language use. But there are many limitations in practice.\nClarification on terminology In the field of statistics, an autoregressive model means that future outputs (or predictions) depend directly on all previous inputs (or tokens). 
Transformers also follow this principle, but unlike other traditional linear autoregressive models, they condition outputs on previous tokens using highly non-linear mechanisms (self-attention).\nUltimately, transformers remain next-token predictors; so when I mention &ldquo;linearity&rdquo;, I&rsquo;m specifically referring to the sequential nature of next-token generation itself, rather than implying transformers lack non-linear capabilities altogether.\nLimitations of AR models By design, AR models lack planning and reasoning capabilities. If you generate one word at a time, you don&rsquo;t really have a general idea of where you&rsquo;re heading. You just hope that you will reach a nice conclusion by following a chain of thoughts. Large reasoning models work the same. They were trained using RL on many formal proofs, not too easy but not too hard. AR models being stochastic, they won&rsquo;t always yield good results when formal logic is involved. They don&rsquo;t really master abstract principles the way humans do.\nTechnically speaking, neural networks, as they are usually used, are function approximators, and Large Language Models (LLMs) are basically approximating the function of how humans use language. And they&rsquo;re extremely good at that. But approximating a function is not the same thing as learning a function. \u2013 Gary Marcus (2025)\nThe current architecture of AR models lacks long-term memory, and has limited working memory. Everything has to be contained within a context window. Long-term memory may be vectorized information learned from previous interactions, but ultimately, it has to fit in the same context window. While LLMs with larger context windows are getting released, they still suffer from major coherence issues under heavy context workloads, mainly due to limitations in the attention mechanism itself. 
Transformers are computationally efficient during training, but their self-attention scales quadratically with input length during inference, which is also one of the practical limitations for having a &ldquo;long-memory&rdquo; model.\nThere is room for optimization here, but ultimately, LLMs have no recollection capabilities like humans have. Once trained, they will not learn from their mistakes. The context window can be compared to working memory in humans: it&rsquo;s fast, efficient but gets rapidly overloaded. Humans manage this limitation by offloading previously learned information into other memory forms, whereas LLMs can only mimic this process superficially at best.\nWhen explaining LLMs to people unfamiliar with the concept, I often use the ZIP file analogy to illustrate that these models aren\u2019t exactly \u201csmart knowledge databases\u201d. Pre-training essentially compresses human knowledge\u2014like the entire internet\u2014in a very lossy way. Alternatively, if you had infinite time, it would be like a person reading an enormous library. While there are ways to mitigate this lossiness, the resulting AR model will always produce non-deterministic output due to its inherently stochastic nature.\nSo, AR models hallucinate. Humans hallucinate too. But I fear the word &ldquo;hallucination&rdquo; here is being misused as it gives AR models human traits they don\u2019t really have. The nature of hallucination is very different between AR models and humans, as one has a world model and the other is a very elaborated pattern matching machine. While humans make (a lot of) mistakes, they do have this common-sense understanding that AR models lack. I would even say that I generally trust a SOTA LLM more than the average human, however, I may not as easily detect hallucination in LLMs, which can be problematic.\nOf course, there are ways to limit risks of LLM hallucination. 
Retrieval-augmented generation (RAG) is a common one: we fit as much relevant data as possible in the LLM context window during inference and we hope it&rsquo;s better at certain specific tasks. We can also tweak inference parameters, making token prediction a lot more rigid at the cost of creativity (temperature and others). Ultimately, stochastic models will always make plausible-sounding mistakes.\nThe exposure bias is also an inherent issue in an autoregressive paradigm. If they make a small mistake early on, this will eventually lead to more errors. The model can easily derail and produce irrelevant and repetitive output. Humans notice when they&rsquo;re going in circles and have the ability to &ldquo;course-correct&rdquo;, something LLMs lack. We may see traces of this capability emerging in reasoning models, but this is still somewhat limited. Think of it like driving: a human driver will quickly notice (hopefully) when they take a wrong turn and rethink the route to correct this mistake. While AR models might sometimes give the illusion of self-awareness, they tend to never check the route again once they&rsquo;ve started: they continue driving forward, they may turn randomly or take roads they&rsquo;ve already taken, hoping to eventually arrive at the right destination (if you&rsquo;ve seen Claude playing Pok\u00e9mon, you&rsquo;ll know what I mean).\nExploring other paradigms Human thinking involves more than just linking words together, and I don&rsquo;t think AGI can ever be achieved if an AI model doesn&rsquo;t show solid planning and memory capabilities. That isn\u2019t to say AR models should be ditched altogether, and they&rsquo;re still very useful tools. They may even be used as part of more complex architectures that will tackle these limitations.\nYann LeCun is a famous AI researcher that has also been a vocal critic of AR models. He suggests pursuing research in other paradigms to achieve human-like cognition. 
He&rsquo;s working on an architecture called JEPA (Joint Embedding Predictive Architecture) that generates stuff via iterative refinement instead of generating every detail step-by-step like traditional AR models. This means that JEPA&rsquo;s goal is to learn the world by not focusing on the details to generate, but by focusing on a state, an aspect.\nRather than raw sequence prediction, the idea is to use a kind of self-supervised learning focused on abstract prediction. Which makes sense after all, because for instance, humans don&rsquo;t really perceive the world pixel by pixel. A true mark of intelligence would be to focus on the core concept, on the essential information in an abstract way, and that might be how we can achieve goal-driven AI models.\nHaving studied diffusion models a fair bit, I&rsquo;ve also wondered how they can be used for text generation. Diffusion models are very different compared to AR models: they&rsquo;re also inherently stochastic, but they don&rsquo;t generate in a defined direction (like left-to-right text generation). Instead, they start from noise, and the model knows how to iteratively denoise at every step to achieve a result that makes sense, in other words, aligned with the training data distribution. You could say they work as the inverse of transformer-based models: their internal inference process is iterative, but they perform parallel prediction at every step. Unlike AR models which have exposure bias, this shows global coherence: if something seemingly doesn&rsquo;t fit, it can be corrected later on, because the model has a global idea that undergoes a refinement process.\nThis looks a lot more like the process of human-like drafting, because we don&rsquo;t necessarily think with words first. 
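To make that contrast concrete, here&rsquo;s a deliberately naive toy sketch in Python — my own illustration, not how a real diffusion model works — where every position is re-predicted in parallel at each step, so an early &ldquo;mistake&rdquo; can still be overwritten by a later refinement step:

```python
import random

def toy_denoise(target, steps=10, seed=0):
    """Toy illustration of parallel iterative refinement (NOT a real
    diffusion model): start from pure noise, then re-predict every
    position at once at each step, fixing more of them as confidence
    grows. Early wrong guesses can be overwritten by later steps."""
    rng = random.Random(seed)
    vocab = sorted(set(target))
    x = [rng.choice(vocab) for _ in target]  # start from noise
    history = []
    for step in range(1, steps + 1):
        for i in range(len(x)):
            # a real model would emit a distribution per position;
            # this toy just nudges positions toward the target
            if rng.random() < step / steps:
                x[i] = target[i]
        history.append("".join(x))
    return history

drafts = toy_denoise("we think in drafts")
# every draft is a full-length "global" guess; the last one is coherent
```

The point is not the (fake) denoising rule, but the shape of the process: each intermediate draft is a complete guess about the whole output, refined globally, rather than a prefix committed to one token at a time.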
An example of a large diffusion text model might be the very recent LLaDA model; it&rsquo;s very interesting if you want to take a look.\nWe&rsquo;re more than just a prediction machine Modern neuroscience states that the brain is a prediction machine. And I feel this makes sense: we predict constantly. When we have an idea of doing something, before acting, we may evaluate the outcome first, thus essentially predicting. I think this is not something only humans have; animals do it too in a broader sense. Language processing is no exception, and we know from imagery research that the brain actively anticipates upcoming patterns or words, much like an AR model does. If I write something like:\nThe cat is chasing a ____\nIt is evident you will strongly predict that the word is &ldquo;mouse&rdquo; before you even finish reading the sentence. Training AR models is essentially that: we hide the next chunk of text and train the model, via backpropagation, to predict it. So they get very good at it, much like humans. My point is, the brain is also doing next-word prediction (although, LLMs don&rsquo;t really do next-word prediction, because tokens may be just chunks of text, not necessarily words that make sense).\nLanguage, while presented linearly, has an inherently hierarchical structure organized into nested layers of meaning, intention, and context. Can patterns emerging from pre-training alone sufficiently capture this hierarchical mechanism in LLMs?\nHuman thought, however, is a more complicated story. We do have inner speech, and we use language internally, which is something AR models can achieve too. Well, that&rsquo;s not necessarily the same as the language we use. But beyond inner speech, there is also non-sequential thought and planning, and we can&rsquo;t really represent them using simple Markov chains. Before speaking a sentence, we have a general idea of what we&rsquo;re going to say; we don&rsquo;t really choose what to say next based on the last words. 
That kind of planning isn&rsquo;t something that can be represented sequentially.\nThe human mind is not, like ChatGPT and its ilk, a lumbering statistical engine for pattern matching, gorging on hundreds of terabytes of data and extrapolating the most likely conversational response or most probable answer to a scientific question. On the contrary, the human mind is a surprisingly efficient and even elegant system that operates with small amounts of information; it seeks not to infer brute correlations among data points but to create explanations. \u2013 Noam Chomsky\nSo, while the brain is a prediction machine, there is strong evidence that not all thinking is linguistic or sequential. Not everything we think or represent has to follow an inner narrative. That &ldquo;gut feeling&rdquo; we sometimes have is an example that we don\u2019t even fully comprehend on a scientific level, let alone AR models. An idea is often first represented, then gets linearized for communication or refinement. Current large reasoning models still lack that kind of non-sequential planning, and I&rsquo;d argue post-training alone won&rsquo;t change their nature (but can yield great results in some specific tasks).\nThe human approach to semi-complex tasks, like trivial algebra operations, can show autoregressive characteristics, but this highlights a key limitation for LLMs, as their lack of an internal model beyond pattern matching makes them ineffective for such tasks. The open question remains whether post-training aimed at reasoning truly adds more than just a higher likelihood of correctness. Anecdotally, I believe this is precisely why reasoning-enhanced LLMs are powerful: reasoning abilities scale efficiently to handle tasks that can quickly overwhelm humans.\nLanguage and thought are not purely autoregressive in humans, and prediction can only go so far. 
That is exactly why AI research is headed towards incorporating planning, memory and world models in new architectures, and they will hopefully capture non-autoregressive aspects of thinking.\n","permalink":"https:\/\/wonderfall.dev\/autoregressive\/","summary":"Most generative AI models nowadays are autoregressive. That means they&rsquo;re following the concept of next token prediction, and the transformer architecture is the current implementation that has been used for years now thanks to its computational efficiency. This is a rather simple concept that\u2019s easy to understand - as long as you aren&rsquo;t interested in the details - everything can be tokenized and fed into an autoregressive (AR) model. And by everything, I mean everything: text as you&rsquo;d expect, but also images, videos, 3D models and whatnot.","title":"Some thoughts on autoregressive models"},{"content":"To this date, Proton Mail doesn&rsquo;t support MTA-STS for custom domains. While DANE for SMTP is a much better solution to the same problem, MTA-STS exists for a reason: many providers are slow at adopting DNSSEC. DNSSEC is essential to enabling standards such as DANE or SSHFP. Notably, Gmail still does not support DANE but has supported MTA-STS for years.\nTherefore, MTA-STS and DANE can complement each other, and you should ideally deploy both.\nWhy bother? A tale of DANE Mail security is challenging and complex, and humanity might never get it right. Unfortunately, we still rely on this ancient technology for nearly everything. Attempts to make SMTP more resilient and secure have been made in the past, including opportunistic transport encryption (STARTTLS). One tricky issue is that implementing TLS alone in this case does not prevent man-in-the-middle or various downgrade attacks (such as STRIPTLS) from happening.\nDANE for SMTP (RFC 7672) is an elegant solution to address these issues. DANE relies on DNSSEC to protect TLSA records of the same DNS zone as the SMTP server. 
TLSA records indicate that TLS should be enforced for inbound mail, and basically contain information about the public keys that are allowed to be used. Let&rsquo;s briefly dig (no pun intended) into how Proton implements this:\n$ dig +noall +answer mx protonmail.com protonmail.com. 1138 IN MX 10 mailsec.protonmail.ch. protonmail.com. 1138 IN MX 5 mail.protonmail.ch $ dig +noall +answer tlsa _25._tcp.mail.protonmail.ch _25._tcp.mail.protonmail.ch. 915 IN TLSA 3 1 1 6111A5698D23C89E09C36FF833C1487EDC1B0C841F87C49DAE8F7A09 E11E979E _25._tcp.mail.protonmail.ch. 915 IN TLSA 3 1 1 76BB66711DA416433CA890A5B2E5A0533C6006478F7D10A4469A947A CC8399E1 I will skip the details since this post is not about how DANE works, but as you can see the TLSA records can be easily retrieved and we can verify those. Tools like Hardenize may help you in investigating which providers have DANE.\nMTA-STS to the rescue Despite being a years-old standard, DANE for SMTP is not widely adopted. MTA-STS (RFC 8461) is an alternative solution which aims to prevent the same security issues, whilst not relying on DNSSEC. When MTA-STS is enabled, it effectively indicates to SMTP servers that TLS (1.2 or higher) should be used with valid certificates.\nIn order to do that, MTA-STS relies on a HTTPS web server (and thus the WebPKI) to publish the policy at a specific subdomain (mta-sts) and address (\/.well-known\/mta-sts.txt). A DNS TXT record will also be needed to signal that a MTA-STS policy is available for the domain. Let&rsquo;s see how Proton implements that.\nIf we navigate to https:\/\/mta-sts.protonmail.com\/.well-known\/mta-sts.txt, we will indeed find the MTA-STS policy published by Proton:\nversion: STSv1 mode: enforce mx: mail.protonmail.ch mx: mailsec.protonmail.ch max_age: 604800 The DNS &ldquo;discovery&rdquo; record should be defined at the _mta-sts subdomain:\n$ dig +noall +answer txt _mta-sts.protonmail.com _mta-sts.protonmail.com. 1114 IN CNAME _mta-sts.protonmail.ch. 
_mta-sts.protonmail.ch. 1114 IN TXT &#34;v=STSv1; id=190906205100Z;&#34; Everything is there as expected.\nEnable MTA-STS for custom domains The challenge is that Proton does not offer an easy way to host a policy for custom domains. They could eventually offer this in the future, and I expect them to. In the meantime, we can enable MTA-STS ourselves, as it should be somewhat straightforward.\nThe main challenge is really finding a way to host the policy. It&rsquo;s just a text file after all, so there are many ways to do that without going through the hassle of self-hosting a web server. GitHub Pages and Netlify have free offerings to host static websites, and they should be enough to meet our needs here. I will use Netlify since you can host multiple websites with the same account; the only drawback is that you have a 100GB bandwidth limit per month, but it should be more than enough for a simple text file.\nFirst, you will need to create a GitHub repository - public or private, it doesn&rsquo;t matter. Then, you should push a directory named .well-known which contains an mta-sts.txt file. The latter will be our policy, and you may simply copy the policy from Proton:\nversion: STSv1 mode: enforce mx: mail.protonmail.ch mx: mailsec.protonmail.ch max_age: 604800 If you&rsquo;re really not sure about what you&rsquo;re doing, setting mode: testing might be a conservative approach to avoid breaking things (a report will be sent if you&rsquo;ve configured TLS-RPT, more on that later). You may also consider decreasing or increasing max_age, which corresponds to a time in seconds; 604800 for instance means that the policy will be cached for one week.\nThen, head to your Netlify account, add a new site, register your GitHub repository and voil\u00e0. Once it&rsquo;s done and you can access your policy through your netlify.app address, it&rsquo;s a matter of publishing a few DNS records:\nAn A record for your Netlify-hosted policy. 
If the domain name from which you intend to send and receive mail is domain.tld, then you should create this A record for mta-sts.domain.tld. Netlify will tell you to use a CNAME record, but I personally caution against using CNAME to third parties in general. They have a universal load balancer IPv4 you can use, so you should use that.\nA TXT record for MTA-STS discovery. The content should resemble the following: &quot;v=STSv1; id=2023071200&quot;. v=STSv1 declares the policy version, and id= is really just an opaque identifier you should change (for instance by incrementing it) whenever your MTA-STS policy changes. If you&rsquo;re out of ideas you can use the Unix epoch time, or the YMD format followed by two numbers reserved for iterations (just like I do).\nAn optional TXT record for TLS-RPT so that the sending mail server will receive reports about successful and failed attempts at applying the MTA-STS policy. Since these reports can be very useful, I highly recommend you configure TLS-RPT. To do that, add the following TXT record to the _smtp._tls subdomain: &quot;v=TLSRPTv1; rua=mailto:reports@domain.tld&quot; where rua= should point to the mail address where you want to receive reports.\nSecurity tip: since mta-sts is a subdomain with an A record, I strongly recommend defining a &ldquo;reject all&rdquo; SPF policy and a null MX record (RFC 7505) for that subdomain. That is because even when an MX record does not exist, A records can be used as a fallback.\nWait a bit for the DNS propagation to take place, then verify with Hardenize that MTA-STS is enabled.\nConclusion MTA-STS is far from perfect and suffers from multiple flaws in my opinion: it relies on certificate authorities (CA), and is inherently a trust-on-first-use security policy akin to HSTS for HTTPS (the DNSSEC infrastructure is already trusted in the case of DANE). 
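As a practical aside, the policy file itself is easy to sanity-check before you publish it. Here&rsquo;s a minimal Python sketch (the field names and allowed values come from RFC 8461; the checks themselves are my own, not an official validator):

```python
def check_mta_sts_policy(text):
    """Minimal sanity checks for an MTA-STS policy file (RFC 8461).
    Returns a list of problems; an empty list means it looks sane."""
    problems = []
    fields = {}
    mx = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "mx":
            mx.append(value)
        else:
            fields[key] = value
    if fields.get("version") != "STSv1":
        problems.append("version must be STSv1")
    if fields.get("mode") not in ("enforce", "testing", "none"):
        problems.append("mode must be enforce, testing or none")
    if not mx and fields.get("mode") != "none":
        problems.append("at least one mx entry is required")
    if not fields.get("max_age", "").isdigit():
        problems.append("max_age must be a number of seconds")
    return problems

policy = """version: STSv1
mode: enforce
mx: mail.protonmail.ch
mx: mailsec.protonmail.ch
max_age: 604800
"""
problems = check_mta_sts_policy(policy)  # → []
```

Running it against the Proton policy shown earlier returns an empty list; a typo in mode or a missing max_age would be flagged before it ever reaches a sending server.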
I also believe that MTA-STS is more tedious to deploy compared to the simplicity and robustness of DANE, DNSSEC deployment quirks aside.\nNonetheless, I hope this article will prove somewhat useful to Proton Mail users who wish to use MTA-STS with their custom domains.\n","permalink":"https:\/\/wonderfall.dev\/mta-sts\/","summary":"To this date, Proton Mail doesn&rsquo;t support MTA-STS for custom domains. While DANE for SMTP is a much better solution to the same problem, MTA-STS exists for a reason: many providers are slow at adopting DNSSEC. DNSSEC is essential to enabling standards such as DANE or SSHFP. Notably, Gmail still does not support DANE but has supported MTA-STS for years.\nTherefore, MTA-STS and DANE can complement each other, and you should ideally deploy both.","title":"Setting up MTA-STS with a custom domain on Proton Mail"},{"content":"Passwordless authentication with OpenSSH keys has been the de facto security standard for years. SSH keys are more robust since they&rsquo;re cryptographically sane by default, and are therefore resilient to most brute-force attacks. They&rsquo;re also easier to manage while enabling a form of decentralized authentication (it&rsquo;s easy and painless to revoke them). So, what&rsquo;s the next step? And more precisely, why would one need something even better?\nWhy? The main problem with SSH keys is that they&rsquo;re not magic: they consist of a key pair, of which the private key is stored on your disk. You should be wary of various exfiltration attempts, depending on your threat model:\nIf your disk is not encrypted, any physical access could compromise your keys. If your private key isn&rsquo;t encrypted, malicious applications could compromise it. Even with both encrypted, malicious applications could register your keystrokes. All these attacks are particularly a concern on desktop platforms, because they don&rsquo;t have a proper sandboxing model. 
On Windows, non-UWP apps could likely have full access to your .ssh directory. On desktop Linux distributions, sandboxing is also lacking, and the situation is even worse if you&rsquo;re using X.org since it allows apps to spy on each other (and on your keyboard) by design. A first good step would be to only use SSH from a trusted &amp; decently secure system.\nAnother layer of defense would obviously be multi-factor authentication, or the fact that you&rsquo;re relying on a shared secret instead. We can use FIDO2 security keys for that. That way, even if your private key is compromised, the attacker needs physical access to your security key. TOTP is another common 2FA technique, but it&rsquo;s vulnerable to various attacks, and relies on the quality of the implementation on the server.\nHow? Fortunately for us, OpenSSH 8.2 (released in February 2020) introduced native support for FIDO2\/U2F. Most OpenSSH distributions should have the middleware set to use the libfido2 library, including portable versions such as the one for Win32.\nBasically, ssh-keygen -t ${key_type}-sk will generate for us a token-backed key pair. The key types that are supported depend on your security key. Newer models should support both ECDSA-P256 (ecdsa-sk) and Ed25519 (ed25519-sk). If the latter is available, you should prefer it.\nClient configuration To get started:\nssh-keygen -t ed25519-sk This will generate a id_ed25519_sk private key and a id_ed25519_sk.pub public key in .ssh. These are defaults, but you can change them if you want. We will call this key pair a &ldquo;handle&rdquo;, because they&rsquo;re not sufficient by themselves to derive the real secret (as you guessed it, the FIDO2 token is needed). 
ssh-keygen should ask you to touch the key, and enter the PIN prior to that if you did set one (you probably should).\nYou can also generate a resident key (referred to as a discoverable credential in the WebAuthn specification):\nssh-keygen -t ed25519-sk -O resident -O application=ssh:user1 As you can see, a few options must be specified:\n-O resident will tell ssh-keygen to generate a resident key, meaning that the private &ldquo;handle&rdquo; key will also be stored on the security key itself. This has security implications, but you may want that in order to move seamlessly between different computers. In that case, you should absolutely protect your key with a PIN beforehand. -O application=ssh: is necessary to instruct that the resident key will use a particular slot, because the security key will have to index the resident keys (by default, they use ssh: with an empty user ID). If this is not specified, the next key generation might overwrite the previous one. -O verify-required is optional but instructs that a PIN is required to generate\/access the key. Resident keys can be retrieved using ssh-keygen -K (which writes them to disk) or ssh-add -K (which loads them into the SSH agent without writing them to disk).\nServer configuration Next, transfer your public key over to the server (granted you already have access to it with a regular key pair):\nssh-copy-id -i ~\/.ssh\/id_ed25519_sk.pub user@server.domain.tld Ta-da! But one last thing: we need to make sure the server supports this public key format in sshd_config:\nPubkeyAcceptedKeyTypes ssh-ed25519,sk-ssh-ed25519@openssh.com Adding sk-ssh-ed25519@openssh.com to PubkeyAcceptedKeyTypes should suffice. It&rsquo;s best practice to only use the cryptographic primitives that you need, and hopefully ones that are also modern. 
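On the client side, you can also point OpenSSH at the handle in ~\/.ssh\/config so you don&rsquo;t have to pass -i on every connection. A sketch, where the host alias, hostname and username are of course placeholders:

```
Host myserver
    HostName server.domain.tld
    User user1
    IdentityFile ~\/.ssh\/id_ed25519_sk
    IdentitiesOnly yes
```

IdentitiesOnly makes ssh offer only the listed key for that host, rather than every identity the agent happens to hold.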
This isn&rsquo;t a full-on SSH hardening guide, but you should take a look at the configuration file GrapheneOS uses for their servers to give you an idea of a few good practices.\nRestart the sshd service and try to connect to your server using your key handle (by passing -i ~\/.ssh\/id_ed25519_sk to ssh for instance). If that works for you (your FIDO2 security key should be needed to derive the real secret), feel free to remove your previous keys from .ssh\/authorized_keys on your server.\nThat&rsquo;s cool, right? If you don&rsquo;t have a security key, you can buy one from Yubico (I&rsquo;m very happy with my YubiKey 5C NFC by the way), Nitrokey, SoloKeys or OnlyKey (to name a few). If you have an Android device with a hardware security module (HSM), such as the Google Pixels equipped with Titan M (Pixel 3+), you could even use them as Bluetooth security keys.\nNo reason to miss out on the party if you can afford it!\n","permalink":"https:\/\/wonderfall.dev\/openssh-fido2\/","summary":"Passwordless authentication with OpenSSH keys has been the de facto security standard for years. SSH keys are more robust since they&rsquo;re cryptographically sane by default, and are therefore resilient to most brute-force attacks. They&rsquo;re also easier to manage while enabling a form of decentralized authentication (it&rsquo;s easy and painless to revoke them). So, what&rsquo;s the next step? And more precisely, why would one need something even better?\nWhy? The main problem with SSH keys is that they&rsquo;re not magic: they consist of a key pair, of which the private key is stored on your disk.","title":"Securing OpenSSH keys with hardware-based authentication (FIDO2)"},{"content":"Containers aren&rsquo;t that new fancy thing anymore, but they were a big deal. And they still are. They are a concrete solution to the following problem:\n- Hey, your software doesn&rsquo;t work&hellip;\n- Sorry, it works on my computer! 
Can&rsquo;t help you.\nWhether we like them or not, containers are here to stay. Their expressiveness and semantics allow for an abstraction of the OS dependencies that a software has, the latter being often dynamically linked against certain libraries. The developer can therefore provide a known-good environment where it is expected that their software &ldquo;just works&rdquo;. That is particularly useful for development to eliminate environment-related issues, and that is often used in production as well.\nContainers are often perceived as a great tool for isolation, that is, they can provide an isolated workspace that won&rsquo;t pollute your host OS - all that without the overhead of virtual machines. Security-wise: containers, as we know them on Linux, are glorified namespaces at their core. Containers usually share the same kernel with the host, and namespaces is the kernel feature for separating kernel resources across containers (IDs, networks, filesystems, IPC, etc.). Containers also leverage the features of cgroups to separate system resources (CPU, memory, etc.), and security features such as seccomp to restrict syscalls, or MACs (AppArmor, SELinux).\nAt first, it seems that containers may not provide the same isolation boundary as virtual machines. That&rsquo;s fine, they were not designed to. But they can&rsquo;t be simplified to a simple chroot either. We&rsquo;ll see that a &ldquo;container&rdquo; can mean a lot of things, and their definition may vary a lot depending on the implementation: as such, containers are mostly defined by their semantics.\nDocker is dead, long live Docker&hellip; and OCI When people think of containers, a large group of them may think of Docker. While Docker played a big role in the popularity of containers a few years ago, it didn&rsquo;t introduce the technology: on Linux, LXC did (Linux Containers). 
In fact, Docker in its early days was a high-level wrapper for LXC, which already combined the power of namespaces and cgroups. Docker then replaced LXC with libcontainer, which does more or less the same, plus extra features.\nThen, what happened? Open Container Initiative (OCI). That is the current standard that defines the container ecosystem. That means that whether you&rsquo;re using Docker, Podman, or Kubernetes, you&rsquo;re in fact running OCI-compliant tools. That is a good thing, as it saves a lot of interoperability headaches.\nDocker is no longer the monolithic platform it once was. libcontainer was absorbed by runc, the reference OCI runtime. The high-level components of Docker split into different parts related to the upstream Moby project (Docker is the &ldquo;assembled product&rdquo; of the &ldquo;Moby components&rdquo;). When we refer to Docker, we are in fact referring to this powerful high-level API that manages OCI containers. By design, Docker is a daemon that communicates with containerd, a lower-level layer, which in turn communicates with the OCI runtime. That also means that you could very well skip Docker altogether and use containerd or even runc directly.\nDocker client &lt;=&gt; Docker daemon &lt;=&gt; containerd &lt;=&gt; containerd-shim &lt;=&gt; runc Podman is an alternative to Docker developed by Red Hat, which also intends to be a drop-in replacement for Docker. It doesn&rsquo;t work with a daemon, and can work rootless by design (Docker has support for rootless too, but that is not without caveats). I would largely recommend Podman over Docker for someone who wants a simple tool to run containers and test code on their machine.\nKubernetes (also known as K8s) is the container platform originally developed by Google. It is designed with scaling in mind, and is about running containers across a cluster whereas Docker focuses on packaging containers on a single node. 
Docker Swarm is the direct alternative to that, but it has never really taken off due to the popularity of K8s.\nFor the rest of this article, we will use Docker as the reference for our examples, along with the Compose specification format. Most of these examples can be adapted to other platforms without issues.\nThe nightmare of dependencies Containers are made from images, and images are typically built from a Dockerfile. Images can be built and distributed through OCI registries: Docker Hub, Google Container Registry, GitHub Container Registry, and so on. You can set up your own private registry as well, but the reality is that people often pull images from these public registries.\nImages, immutability and versioning Images are what make containers, well, containers. Containers made from the same image should behave similarly on different machines. Images can have tags, which are useful for software versioning. The usage of generic tags such as latest is often discouraged because it defeats the purpose of the expected behavior of the container. Tags are not necessarily immutable by design, and they shouldn&rsquo;t be (more on that below). A digest, however, is the attribute of an immutable image, generated with the SHA-256 algorithm.\ndocker.io\/library\/golang:1.17.1@sha256:232a180dbcbcfa7250917507f3827d88a9ae89bb1cdd8fe3ac4db7b764ebb25 - that is, registry\/image:tag@digest, with only the digest being immutable. Now onto why tags shouldn&rsquo;t be immutable: as written above, containers bring us an abstraction over the OS dependencies used by the packaged software. That is nice indeed, but it shouldn&rsquo;t lure us into believing that we can forget about security updates. The fact is, there is still a whole OS to care about, and we can&rsquo;t just think of the container as a simple packaging tool for software.\nFor these reasons, good practices were established:\nAn image should be as minimal as possible (Alpine Linux, or scratch\/distroless). 
An image, with a given tag, should be regularly rebuilt, without cache, to ensure all layers are fresh. An image should be rebuilt when the images it&rsquo;s based on are updated. A minimal base system Alpine Linux is often the choice for official images for that first reason. It is not a typical Linux distribution as it uses musl as its C library, but it works quite well. Actually, I&rsquo;m quite fond of Alpine Linux and apk (its package manager). If a supervision suite is needed, I&rsquo;d look into s6. If you need a glibc distribution, Debian provides slim variants for lightweight base images. We can do even better than Alpine by using distroless images, allowing us to have state-of-the-art application containers.\n&ldquo;Distroless&rdquo; is a fancy name referring to an image with a minimal set of dependencies, from none (for fully static binaries) to some common libraries (typically the C library). Google maintains distroless images you can use as a base for your own images. If you were wondering, the difference with scratch (an empty starting point) is that distroless images contain common dependencies that &ldquo;almost-statically compiled&rdquo; binaries may need, such as ca-certificates.\nHowever, distroless images are not suited for every application. In my experience though, distroless is an excellent option with pure Go binaries. Going with minimal images drastically reduces the available attack surface in the container. For example, here&rsquo;s a multi-stage Dockerfile resulting in a minimal non-root image for a simple Go project:\nFROM golang:alpine AS build WORKDIR \/app COPY . . RUN CGO_ENABLED=0 go build -o \/my_app .\/cmd\/my_app FROM gcr.io\/distroless\/static COPY --from=build \/my_app \/ USER nobody ENTRYPOINT [&#34;\/my_app&#34;] The main drawback of using minimal images is the lack of tools that help with debugging, which also constitute the very attack surface we&rsquo;re trying to get rid of. 
The trade-off is probably not worth the hassle for development-focused containers, and if you&rsquo;re running such images in production, you have to be confident enough to operate with them. Note that the gcr.io\/distroless images have a :debug tag to help in that regard.\nKeeping images up-to-date The two other points are highly problematic, because most software vendors just publish an image on release, and forget about it. You should take it up with them if you&rsquo;re running images that are versioned but not regularly updated. I&rsquo;d say running scheduled builds once a week is the bare minimum to make sure dependencies stay up-to-date. Alpine Linux is a better choice than most other &ldquo;stable&rdquo; distributions because it usually has more recent packages.\nStable distributions often rely on backporting security fixes from CVEs, which is known to be a flawed approach to security since CVEs aren&rsquo;t always assigned or even taken care of. With its recent and versioned packages, Alpine is once again a particularly good choice as long as musl doesn&rsquo;t cause issues.\nIs it really a security nightmare? When people say Docker is a security nightmare because of that, that&rsquo;s a fair point. On a traditional system, you could upgrade your whole system with a single command or two. With Docker, you&rsquo;ll have to recreate several containers&hellip; if the images were kept up-to-date in the first place. Recreating itself is not a big deal actually: even on a traditional system, upgrading binaries and libraries in place often requires the services that use them to restart, otherwise an old (and vulnerable) version could still be loaded in memory. But yeah, the fact is most people are running outdated containers, and more often than not, they don&rsquo;t have the choice if they rely on third-party images.\nTrivy is an excellent tool for scanning images for known vulnerabilities (a subset of them, at least). 
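For instance, scanning a public image is a one-liner (the image name here is just an example, and Trivy has to be installed first):

```shell
# Scan an image for known CVEs in its OS packages and dependencies
trivy image python:3.4-alpine
```
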
You should play with it and see for yourself how outdated many publicly available images are.\nSupply-chain attacks As with any code downloaded from a software vendor, OCI images are not exempt from supply-chain attacks. The good practice is quite simple: rely on official images, and ideally build and maintain your own images. One should definitely not automatically trust random third-party images they can find on Docker Hub. Half of these images, if not more, contain vulnerabilities, and I bet a good portion of them contain malware such as miners, or worse.\nAs an image maintainer, you can sign your images to provide authenticity assurances. Most official images make use of Docker Content Trust, which works with an OCI registry attached to a Notary server. With the Docker toolset, setting the environment variable DOCKER_CONTENT_TRUST=1 enforces signature verification (a signature is only good if it&rsquo;s checked in the first place). The Sigstore initiative is developing cosign, an alternative that doesn&rsquo;t require a Notary server because it works with features already provided by the registry, such as tags. Kubernetes users may be interested in Connaisseur to ensure all signatures have been validated.\nLeave my root alone Attack surface Traditionally, Docker runs as a daemon owned by root. That also means that root in the container is actually root on the host and may be a few commands away from compromising the host. More generally, the attacker has to exploit the available attack surface to escape the container. There is a huge attack surface, actually: the Linux kernel. Someone wise once said:\nThe kernel can effectively be thought of as the largest, most vulnerable setuid root binary on the system.\nThat applies particularly to traditional containers, which weren&rsquo;t designed to provide a robust level of isolation. A recent example was CVE-2022-0492: the attacker could abuse root in the container to exploit cgroups v1, and compromise the host. 
Of course defense-in-depth measures would have prevented that, and we&rsquo;ll mention them. But fundamentally, container escapes are possible by design.\nBreaking out via the OCI runtime runc is also possible: CVE-2019-5736 was a particularly nasty bug. The attacker first had to gain root in the container in order to access \/proc\/[runc-pid]\/exe, which told them where to find and overwrite the runc binary on the host.\nGood practices have therefore been established:\nAvoid using root in the container, plain and simple. Keep the host kernel, Docker and the OCI runtime updated. Consider the usage of user namespaces. By the way, it goes without saying that any user who has access to the Docker daemon should be considered as privileged as root. Mounting the Docker socket (\/var\/run\/docker.sock) in a container makes it highly privileged, and so it should be avoided. The socket should only be owned by root, and if that doesn&rsquo;t work with your environment, use Docker rootless or Podman.\nAvoiding root root can be avoided in different ways in the final container:\nImage creation time: setting the USER instruction in the Dockerfile. Container creation time: via the tools available (user: in the Compose file). Container runtime: dropping privileges with entrypoint scripts (gosu UID:GID). Well-made images with security in mind will have a USER instruction. In my experience, most people will run images blindly, so it&rsquo;s good harm reduction. Setting the user manually works with some images that weren&rsquo;t designed to run without root, and it&rsquo;s also great for mitigating some scenarios where the image is controlled by an attacker. 
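As an illustration of the second option, pinning an unprivileged user in a Compose file is a one-liner (the service name, image and UID\/GID below are arbitrary):

```yaml
services:
  my_app:
    image: my_app:1.0
    user: "1000:1000"   # UID:GID the container process will run as
```
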
You also won&rsquo;t have surprises when mounting volumes, so I highly recommend setting the user explicitly and making sure volume permissions are correct once.\nSome images allow users to define their own user with UID\/GID environment variables, with an entrypoint script that runs as root and takes care of the volume permissions before dropping privileges. While technically fine, it is still attack surface, and it requires the SETUID\/SETGID capabilities to be available in the container.\nUser namespaces: sandbox or paradox? As mentioned just above, user namespaces are a solution to ensure root in the container is not root on the host. Docker supports user namespaces; for instance, you could set the default mapping in \/etc\/docker\/daemon.json:\n&#34;userns-remap&#34;: &#34;default&#34; whoami &amp;&amp; sleep 60 in the container will return root, but ps -fC sleep on the host will show the process running as another user. That is nice, but it has limitations and therefore shouldn&rsquo;t be considered a real sandbox. In fact, the paradox is that user namespaces are attack surface (and vulnerabilities are still being found years later), and it&rsquo;s common wisdom to restrict them to privileged users (kernel.unprivileged_userns_clone=0). That is fine for Docker with its traditional root daemon, but Podman expects you to let unprivileged users interact with user namespaces (so essentially privileged code).\nEnabling userns-remap in Docker shouldn&rsquo;t be a substitute for running unprivileged application containers (where applicable). 
User namespaces are mostly useful if you intend to run full-fledged OS containers which need root in order to function, but that is out of scope for the container technologies mentioned in this article; for them, I&rsquo;d argue exposing such a vulnerable attack surface from the host kernel for dubious sandboxing benefits isn&rsquo;t an interesting trade-off to make.\nThe no_new_privs bit After ensuring root isn&rsquo;t used in your containers, you should look into setting the no_new_privs bit. This Linux feature prevents syscalls such as execve() from granting new privileges, which is exactly what you want in order to restrict in-container privilege escalation. This flag can be set for a given container in a Compose file:\nsecurity_opt: - no-new-privileges:true Gaining privileges in the container will be much harder that way.\nCapabilities Furthermore, we should mention capabilities: root powers are divided by the Linux kernel into distinct units called capabilities. Each granted capability also grants privilege and therefore access to a significant amount of attack surface. Security researcher Brad Spengler enumerates 19 important capabilities. Docker restricts certain capabilities by default, but some of the most important ones are still available to a container by default.\nYou should consider the following rule of thumb:\nDrop all capabilities by default. Allow only the ones you really need to. If you already run your containers unprivileged without root, your container will very likely work fine with all capabilities dropped. 
That can be done in a Compose file:\ncap_drop: - ALL #cap_add: # - CHOWN # - DAC_READ_SEARCH # - SETUID # - SETGID Never use the --privileged option unless you really need to: a privileged container is given access to almost all capabilities, kernel features and devices.\nOther security features MACs and seccomp are robust tools that may vastly improve container security.\nMandatory Access Control MAC stands for Mandatory Access Control: traditionally a Linux Security Module that enforces a policy to restrict userspace. Examples are AppArmor and SELinux: the former being easier to use, the latter more fine-grained. Both are strong tools that can help&hellip; Yet, their sole presence does not mean they&rsquo;re really effective. A robust policy starts from a deny-all baseline, and only allows the necessary resources to be accessed.\nseccomp seccomp (short for secure computing mode) on the other hand is a much simpler and complementary tool, and there is no reason not to use it. What it does is restrict a process to a set of system calls, thus drastically reducing the available attack surface.\nDocker provides default profiles for AppArmor and seccomp, and they&rsquo;re enabled by default for newly created containers unless the unconfined option is explicitly passed. Note: Kubernetes doesn&rsquo;t enable the default seccomp profile by default, so you should probably enable it.\nThese profiles are a great start, but you should do much more if you take security seriously, because they were made not to break compatibility with a large range of images. The default seccomp profile only disables around 44 syscalls, which are mostly uncommon and\/or obsolete. Of course, the best profile you can get is one written for a given program. 
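Supplying such a custom profile per container can be done in a Compose file (the profile path below is hypothetical, and you would have to write the profile yourself):

```yaml
security_opt:
  - seccomp:./seccomp-profile.json   # custom profile tailored to this program
```
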
That said, it doesn&rsquo;t make sense to dwell on the permissiveness of the default profiles either, as a lot of work has gone into hardening containers.\ncgroups Use cgroups to restrict access to hardware and system resources. You likely don&rsquo;t want a guest container to monopolize the host&rsquo;s resources. You also don&rsquo;t want to be vulnerable to stupid fork bomb attacks. In a Compose file, consider setting these limits:\nmem_limit: 4g cpus: 4 pids_limit: 256 More runtime options can be found in the official documentation. All of them should have a Compose spec equivalent.\nThe --cgroup-parent option should be avoided as it uses the host&rsquo;s cgroup hierarchy instead of the one configured by Docker (or another engine), which is the default.\nRead-only filesystem It is good practice to treat the image as what some refer to as the &ldquo;golden image&rdquo;.\nIn other words, you&rsquo;ll run containers in read-only mode, with an immutable filesystem inherited from the image. Only the mounted volumes will be read\/write accessible, and those should ideally be mounted with the noexec, nosuid and nodev options for extra security. If read\/write access isn&rsquo;t needed, mount these volumes as read-only too.\nHowever, the image may not be perfect and still require read\/write access to some parts of the filesystem, likely directories such as \/tmp, \/run or \/var. You can make a tmpfs for those (a temporary filesystem in memory attributed to the container), because they&rsquo;re not persistent data anyway.\nIn a Compose file, that would look like the following settings:\nread_only: true tmpfs: - \/tmp:size=10M,mode=0770,uid=1000,gid=1000,noexec,nosuid,nodev That is quite verbose indeed, but that&rsquo;s to show you the different options for a tmpfs mount. Ideally, you want to restrict them in size and permissions.\nNetwork isolation By default, all Docker containers will use the default network bridge. They will see and be able to communicate with each other. 
Each container should have its own user-defined bridge network, and each connection between containers should use an internal network. If you intend to run a reverse proxy in front of several containers, you should make a dedicated network for each container you want to expose to the reverse proxy.\nThe --network host option also shouldn&rsquo;t be used for obvious reasons, since the container would share the same network stack as the host, providing no isolation at all.\nAlternative runtimes (gVisor) runc is the reference OCI runtime, but that means other runtimes can exist as well, as long as they&rsquo;re compliant with the OCI standard. These runtimes can be interchanged quite seamlessly. There are a few alternatives, such as crun or youki, respectively implemented in C and Rust (runc is a Go implementation). However, there is one particular runtime that does a lot more for security: runsc, provided by the gVisor project by the folks at Google.\nContainers are not a sandbox, and while we can improve their security, they will fundamentally share a common attack surface with the host. Virtual machines are a solution to that problem, but you might prefer container semantics and ecosystem. gVisor can be perceived as an attempt to get the &ldquo;best of both worlds&rdquo;: containers that are easy to manage while providing a native isolation boundary. gVisor did just that by implementing two things:\nSentry: an application kernel in Go, a language known to be memory-safe. It implements Linux logic, such as various system calls, in userspace. Gofer: a host process which communicates with Sentry and the host filesystem, since Sentry is restricted in that aspect. A platform like ptrace or KVM is used to intercept system calls and redirect them from the application to Sentry, which is running in userspace. This has some costs: there is a higher per-syscall overhead, and compatibility is reduced since not all syscalls are implemented. 
On top of that, gVisor employs security mechanisms we&rsquo;ve covered above, such as a very restrictive seccomp profile between Sentry and the host kernel, the no_new_privs bit, and namespaces isolated from the host.\nThe security model of gVisor is comparable to what you would expect from a virtual machine. It is also very easy to install and use. The path to runsc along with its different configuration flags (runsc flags) should be added to \/etc\/docker\/daemon.json:\n&#34;runtimes&#34;: { &#34;runsc-ptrace&#34;: { &#34;path&#34;: &#34;\/usr\/local\/bin\/runsc&#34;, &#34;runtimeArgs&#34;: [ &#34;--platform=ptrace&#34; ] }, &#34;runsc-kvm&#34;: { &#34;path&#34;: &#34;\/usr\/local\/bin\/runsc&#34;, &#34;runtimeArgs&#34;: [ &#34;--platform=kvm&#34; ] } } runsc needs to start as root to set up some mitigations, including the use of its own network stack separated from the host&rsquo;s. The sandbox itself drops privileges to nobody as soon as possible. You can still use runsc rootless if you want (which may be needed for Podman):\n.\/runsc --rootless do uname -a *** Warning: sandbox network isn&#39;t supported with --rootless, switching to host *** Linux 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 GNU\/Linux Linux 4.4.0 is shown because that is the version of the Linux API that Sentry tries to mimic. As you&rsquo;ve probably guessed, you&rsquo;re not really using Linux 4.4.0, but the application kernel that behaves like it. By the way, gVisor is of course compatible with cgroups.\nConclusion: what&rsquo;s a container after all? Like I wrote above, a container is mostly defined by its semantics and ecosystem. Containers shouldn&rsquo;t be solely defined by the OCI reference runtime implementation, as we&rsquo;ve seen with gVisor, which provides an entirely different security model.\nStill not convinced? What if I told you a container can leverage the same technologies as a virtual machine? 
That is exactly what Kata Containers does by using a VMM like QEMU-lite to provide containers that are in fact lightweight virtual machines, with their traditional resources and security model, compatibility with container semantics and toolset, and a reduced overhead. While not in the OCI ecosystem, Amazon achieves much the same with Firecracker.\nIf you&rsquo;re running untrusted workloads, I highly suggest you consider gVisor instead of a traditional container runtime. Your definition of &ldquo;untrusted&rdquo; may vary: for me, almost everything should be considered untrusted. That is how modern security works, and how mobile operating systems work. It&rsquo;s quite simple: security should be simple, and gVisor simply offers native security.\nContainers are a popular, yet strange world. They revolutionized the way we make and deploy software, but one should not lose sight of what they really are and aren&rsquo;t. This hardening guide is non-exhaustive, but I hope it can make you aware of some aspects you&rsquo;ve never thought of.\n","permalink":"https:\/\/wonderfall.dev\/docker-hardening\/","summary":"Containers aren&rsquo;t that new fancy thing anymore, but they were a big deal. And they still are. They are a concrete solution to the following problem:\n- Hey, your software doesn&rsquo;t work&hellip;\n- Sorry, it works on my computer! Can&rsquo;t help you.\nWhether we like them or not, containers are here to stay. Their expressiveness and semantics allow for an abstraction of the OS dependencies a piece of software has, the latter often being dynamically linked against certain libraries.","title":"Docker and OCI: a humble hardening guide"},{"content":"You may call me &ldquo;Wonderfall&rdquo;. I was young and it sounded cool.\n$ whoami I&#39;m just a random guy passing by on the Internet who is interested in all kinds of things. And as you can tell, I&#39;m a nerd. 
$ ls -l content\/ technology security privacy rants photography pharmacology medicine science $ git config --get remote.origin.url https:\/\/github.com\/Wonderfall\/wonderfall.github.io ","permalink":"https:\/\/wonderfall.dev\/about\/","summary":"You may call me &ldquo;Wonderfall&rdquo;. I was young and it sounded cool.\n$ whoami I&#39;m just a random guy passing by on the Internet who is interested in all kinds of things. And as you can tell, I&#39;m a nerd. $ ls -l content\/ technology security privacy rants photography pharmacology medicine science $ git config --get remote.origin.url https:\/\/github.com\/Wonderfall\/wonderfall.github.io ","title":"About"}]