<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
    <title>tontinton</title>
    <link href="https://tontinton.com/atom.xml" rel="self" type="application/atom+xml"/>
    <link href="https://tontinton.com"/>
    <generator uri="https://www.getzola.org/">Zola</generator>
    <updated>2026-02-08T00:00:00+00:00</updated>
    <id>https://tontinton.com/atom.xml</id>
    <entry xml:lang="en">
        <title>Lance table format explained simply</title>
        <published>2026-02-08T00:00:00+00:00</published>
        <updated>2026-02-08T00:00:00+00:00</updated>
        <author>
          <name>Unknown</name>
        </author>
        <link rel="alternate" href="https://tontinton.com/posts/lance/" type="text/html"/>
        <id>https://tontinton.com/posts/lance/</id>
        
        <content type="html">&lt;p&gt;&lt;strong&gt;TLDR&lt;&#x2F;strong&gt; (but stay for the animations!): Lance is a successor to Iceberg &#x2F; Delta Lake, more optimized for random reads, and supports adding ad-hoc columns without needing to copy all the data.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Some big things happened in the &lt;em&gt;big data over object storage&lt;&#x2F;em&gt; world in 2025:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Iceberg V3 spec got released and added cool stuff like &lt;a href=&quot;https:&#x2F;&#x2F;iceberg.apache.org&#x2F;spec&#x2F;#semi-structured-types&quot;&gt;VARIANT&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;turbopuffer.com&#x2F;&quot;&gt;turbopuffer&lt;&#x2F;a&gt; announced vector search over object storage (similar to &lt;a href=&quot;&#x2F;posts&#x2F;new-age-data-intensive-apps&#x2F;#quickwit&quot;&gt;Quickwit&lt;&#x2F;a&gt;).&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;jack-vanlightly.com&#x2F;blog&#x2F;2025&#x2F;9&#x2F;2&#x2F;understanding-apache-fluss&quot;&gt;Apache Fluss&lt;&#x2F;a&gt; lets Flink manage real-time streams with tiering to object storage.&lt;&#x2F;li&gt;
&lt;li&gt;Datadog bought Quickwit.&lt;&#x2F;li&gt;
&lt;li&gt;Databricks bought &lt;a href=&quot;&#x2F;posts&#x2F;new-age-data-intensive-apps&#x2F;#neon&quot;&gt;Neon&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;But something way bigger flew completely under my radar, most likely as I was pretty busy building at &lt;a href=&quot;https:&#x2F;&#x2F;vega.io&#x2F;&quot;&gt;$DAY_JOB&lt;&#x2F;a&gt; (some &lt;a href=&quot;https:&#x2F;&#x2F;blog.vega.io&#x2F;posts&#x2F;partial_stream&#x2F;&quot;&gt;pretty cool stuff&lt;&#x2F;a&gt;, I must say).&lt;&#x2F;p&gt;
&lt;p&gt;This thing is called &lt;a href=&quot;https:&#x2F;&#x2F;lance.org&#x2F;&quot;&gt;Lance&lt;&#x2F;a&gt;. It&#x27;s a file format (like Apache Parquet), a table format (like Apache Iceberg), and a catalog spec (like Iceberg&#x27;s REST catalog spec).&lt;&#x2F;p&gt;
&lt;h1 id=&quot;lance-file-format&quot;&gt;Lance file format&lt;&#x2F;h1&gt;
&lt;p&gt;Lance file format is similar to Parquet, but more optimized for random reads (&lt;code&gt;WHERE id = 123&lt;&#x2F;code&gt;), while still preserving Parquet&#x27;s performance when doing sequential reads over all values of a specific column.&lt;&#x2F;p&gt;
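&lt;p&gt;To build intuition for why random reads can be cheap: if the reader knows each value&#x27;s encoded width, &lt;code&gt;WHERE id = 123&lt;&#x2F;code&gt; translates into one small ranged read, instead of downloading and decompressing a whole page. A minimal sketch (a hypothetical fixed-width layout, not Lance&#x27;s actual encoding):&lt;&#x2F;p&gt;

```python
def value_byte_range(row_index, value_width, column_offset):
    """Byte range of a single value in a fixed-width column.

    Hypothetical layout: the column is stored as one contiguous
    array of fixed-width values starting at column_offset.
    """
    start = column_offset + row_index * value_width
    return (start, start + value_width)

# Row 123 of an 8-byte-wide column starting at byte 4096:
start, end = value_byte_range(123, 8, 4096)
# A single small ranged GET of bytes [start, end) fetches the value.
```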
&lt;p&gt;Official docs &lt;a href=&quot;https:&#x2F;&#x2F;lance.org&#x2F;format&#x2F;file&#x2F;&quot;&gt;here&lt;&#x2F;a&gt;. &lt;&#x2F;p&gt;
&lt;iframe src=&quot;&amp;#x2F;animations&amp;#x2F;lance&amp;#x2F;io.min.html&quot; scrolling=&quot;no&quot; onload=&quot;this.style.height=this.contentDocument.body.scrollHeight+&#x27;px&#x27;&quot; style=&quot;width:90vw;margin-left:calc(50% - 45vw);border:1px solid var(--border-color);border-radius:10px;overflow:hidden;&quot;&gt;&lt;&#x2F;iframe&gt;
&lt;blockquote&gt;
&lt;p&gt;The numbers are not exact, and should only serve as an order-of-magnitude estimate to build intuition.&lt;&#x2F;p&gt;
&lt;p&gt;Something interesting to test is how Parquet would behave if we configured it to store each page as 64 KB instead of the default 1 MB 🤔.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h1 id=&quot;lance-table-format&quot;&gt;Lance table format&lt;&#x2F;h1&gt;
&lt;p&gt;Lance table format is similar to Iceberg, but allows adding columns ad hoc without rewriting all the data just to give every row a value for the new column, while still preserving Iceberg&#x27;s MVCC.&lt;&#x2F;p&gt;
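&lt;p&gt;One way to picture how this works (a simplified sketch, not the actual Lance manifest schema): each fragment references multiple data files, and each data file covers a subset of the columns. Adding a column appends a new data file reference to each fragment, leaving the existing files untouched:&lt;&#x2F;p&gt;

```python
# Simplified sketch of a table manifest: each fragment references
# several data files, and each data file covers a subset of columns.
fragment = {
    "id": 0,
    "files": [
        {"path": "data/0.lance", "columns": ["id", "text"]},
    ],
}

def add_column(fragment, path, column):
    """Add a column by referencing a new data file per fragment.

    The existing files are never rewritten; a row's values are
    stitched together from all the fragment's files at read time.
    """
    fragment["files"].append({"path": path, "columns": [column]})

add_column(fragment, "data/0-embedding.lance", "embedding")
```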
&lt;p&gt;Lance tables also support &lt;a href=&quot;https:&#x2F;&#x2F;lance.org&#x2F;format&#x2F;table&#x2F;index&#x2F;&quot;&gt;indexes&lt;&#x2F;a&gt;, such as &lt;a href=&quot;https:&#x2F;&#x2F;lance.org&#x2F;format&#x2F;table&#x2F;index&#x2F;&quot;&gt;BTree&lt;&#x2F;a&gt;, an &lt;a href=&quot;https:&#x2F;&#x2F;lance.org&#x2F;format&#x2F;table&#x2F;index&#x2F;scalar&#x2F;fts&#x2F;&quot;&gt;inverted index (FTS)&lt;&#x2F;a&gt;, and &lt;a href=&quot;https:&#x2F;&#x2F;lance.org&#x2F;format&#x2F;table&#x2F;index&#x2F;vector&#x2F;&quot;&gt;vector indexes (e.g. HNSW)&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Official docs &lt;a href=&quot;https:&#x2F;&#x2F;lance.org&#x2F;format&#x2F;table&#x2F;&quot;&gt;here&lt;&#x2F;a&gt;. &lt;&#x2F;p&gt;
&lt;iframe src=&quot;&amp;#x2F;animations&amp;#x2F;lance&amp;#x2F;table.min.html&quot; scrolling=&quot;no&quot; onload=&quot;this.style.height=this.contentDocument.body.scrollHeight+&#x27;px&#x27;&quot; style=&quot;width:90vw;margin-left:calc(50% - 45vw);border:1px solid var(--border-color);border-radius:10px;overflow:hidden;&quot;&gt;&lt;&#x2F;iframe&gt;
&lt;h1 id=&quot;thanks-to-ai&quot;&gt;Thanks to AI?&lt;&#x2F;h1&gt;
&lt;p&gt;Apparently there&#x27;s another open-source file format competing with Parquet, called &lt;a href=&quot;https:&#x2F;&#x2F;vortex.dev&#x2F;&quot;&gt;vortex&lt;&#x2F;a&gt;, created by &lt;a href=&quot;https:&#x2F;&#x2F;spiraldb.com&#x2F;&quot;&gt;SpiralDB&lt;&#x2F;a&gt;, which seems like a direct competitor to &lt;a href=&quot;https:&#x2F;&#x2F;lancedb.com&#x2F;&quot;&gt;LanceDB&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;These technologies only came about because of a need for multi-modal data lakes now that AI is so prevalent.&lt;&#x2F;p&gt;
&lt;p&gt;I wonder what other technologies will come from this AI software era.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>The New Age of Data-Intensive Applications</title>
        <published>2024-07-21T00:00:00+00:00</published>
        <updated>2024-07-21T00:00:00+00:00</updated>
        <author>
          <name>Unknown</name>
        </author>
        <link rel="alternate" href="https://tontinton.com/posts/new-age-data-intensive-apps/" type="text/html"/>
        <id>https://tontinton.com/posts/new-age-data-intensive-apps/</id>
        
        <content type="html">&lt;p&gt;In his book &lt;code&gt;Designing Data-Intensive Applications&lt;&#x2F;code&gt;, Martin Kleppmann suggests that all data applications follow a similar pattern. Their goal is to read data, run some transformation on it, and store the result somewhere, all to have a faster way to read that data later.&lt;&#x2F;p&gt;
&lt;p&gt;We see this pattern everywhere:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;An RDBMS (e.g. Postgres, MySQL) receives rows and computes a B-Tree.&lt;&#x2F;li&gt;
&lt;li&gt;A log search engine (e.g. Elasticsearch, Splunk) receives documents and computes an inverted index.&lt;&#x2F;li&gt;
&lt;li&gt;A streaming data pipeline using Spark &#x2F; Flink receives records from Kafka and computes a pre-aggregated Iceberg table.
&lt;ul&gt;
&lt;li&gt;If you squint hard enough, Kafka looks like a transaction log (just distributed), and the data pipeline looks like a materialized view (just distributed and fault-tolerant). Not that far off from a database huh?&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Lately, I&#x27;ve been seeing more and more data applications using object storages (e.g. AWS S3, Azure Blob Store, Google Cloud Storage) instead of the traditional file system to store data, claiming to be a much cheaper solution than the old alternatives.&lt;&#x2F;p&gt;
&lt;p&gt;In this post, we&#x27;ll explore the benefits and drawbacks of this architecture, with 3 real-world examples:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;quickwit.io&#x2F;&quot;&gt;Quickwit&lt;&#x2F;a&gt; - A cheap log search engine as an alternative to Elasticsearch.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;warpstream.com&#x2F;&quot;&gt;WarpStream&lt;&#x2F;a&gt; - A cheap distributed log as an alternative to Kafka.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;neon.tech&#x2F;&quot;&gt;Neon&lt;&#x2F;a&gt; - Serverless Postgres, a sort of alternative to &lt;a href=&quot;https:&#x2F;&#x2F;aws.amazon.com&#x2F;rds&#x2F;aurora&#x2F;&quot;&gt;AWS Aurora&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Choosing between the file system and object storage is &lt;strong&gt;critical&lt;&#x2F;strong&gt; to do before you write even a single line of code, as they have different APIs, performance characteristics, costs, and deployment operations. The resulting architectures will turn out &lt;strong&gt;vastly&lt;&#x2F;strong&gt; different.&lt;&#x2F;p&gt;
&lt;p&gt;I hope this can serve as a guide to deciding whether object storage is the correct approach for your next data application.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;what-is-an-object-storage&quot;&gt;What is an object storage?&lt;&#x2F;h1&gt;
&lt;p&gt;Before we begin, here&#x27;s a quick intro to what makes an object storage (also sometimes called blob storage).&lt;&#x2F;p&gt;
&lt;p&gt;Object storage is a service that acts like a kind of key-value database, built to store huge amounts of unstructured data, called blobs, very cheaply. &amp;quot;Blob&amp;quot; stands for &amp;quot;Binary large object&amp;quot;.&lt;&#x2F;p&gt;
&lt;p&gt;They usually store data on the cheapest hardware, using HDDs instead of SSDs (at least until SSDs become &lt;a href=&quot;https:&#x2F;&#x2F;thecuberesearch.com&#x2F;qlc-flash-hamrs-hdd&#x2F;&quot;&gt;cheaper&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;The API looks like:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;py&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-py &quot;&gt;&lt;code class=&quot;language-py&quot; data-lang=&quot;py&quot;&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;PutObject&lt;&#x2F;span&gt;&lt;span&gt;(bucket, path)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;GetObject&lt;&#x2F;span&gt;&lt;span&gt;(bucket, path)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But you can also do more things, like listing files under a specific prefix:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;py&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-py &quot;&gt;&lt;code class=&quot;language-py&quot; data-lang=&quot;py&quot;&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;ListObjects&lt;&#x2F;span&gt;&lt;span&gt;(bucket, prefix)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;One limitation of the API, compared to file systems, is that there&#x27;s no way to partially overwrite an object; you can only replace it in its entirety.&lt;&#x2F;p&gt;
&lt;p&gt;AWS S3, the most popular object storage, doesn&#x27;t even have a &lt;code&gt;MoveObject&lt;&#x2F;code&gt; request; you must &lt;code&gt;CopyObject(new)&lt;&#x2F;code&gt; -&amp;gt; &lt;code&gt;DeleteObject(old)&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
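&lt;p&gt;With boto3-style calls, emulating a move looks roughly like this (a sketch; error handling omitted):&lt;&#x2F;p&gt;

```python
def move_object(s3, bucket, old_key, new_key):
    """Emulate a move on S3: copy to the new key, then delete the old.

    The two requests are not atomic: a crash in between leaves both
    objects in place, so callers must tolerate duplicates or retry.
    """
    s3.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": old_key},
    )
    s3.delete_object(Bucket=bucket, Key=old_key)
```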
&lt;p&gt;Also, S3 provides no transaction guarantees other than read-after-write consistency (a &lt;code&gt;GetObject&lt;&#x2F;code&gt; after a successful &lt;code&gt;PutObject&lt;&#x2F;code&gt; will always return the new data). We&#x27;ll soon learn how you can achieve ACID transactions over the different object storages.&lt;&#x2F;p&gt;
&lt;p&gt;Now that we get the gist, let&#x27;s start diving deeper.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;separation-of-storage-and-compute&quot;&gt;Separation of storage and compute&lt;&#x2F;h2&gt;
&lt;p&gt;The biggest advantage of object storages is that they are &lt;strong&gt;extremely cheap&lt;&#x2F;strong&gt; at scale, but why is that?&lt;&#x2F;p&gt;
&lt;p&gt;Look at S3&#x27;s &lt;a href=&quot;https:&#x2F;&#x2F;aws.amazon.com&#x2F;s3&#x2F;pricing&#x2F;&quot;&gt;pricing&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$21 - $23 per TB (monthly).&lt;&#x2F;li&gt;
&lt;li&gt;PUT, COPY, POST, and LIST requests cost $5 per million requests.&lt;&#x2F;li&gt;
&lt;li&gt;GET, SELECT, and all other requests cost $0.40 per million requests.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
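&lt;p&gt;To make the pricing concrete, here&#x27;s a quick back-of-the-envelope calculation using the list prices above:&lt;&#x2F;p&gt;

```python
def monthly_s3_cost(tb_stored, put_requests, get_requests):
    """Rough monthly S3 bill in USD, from the list prices above:
    ~$22 per TB stored, $5 per million PUT-class requests,
    $0.40 per million GET-class requests.
    """
    storage = tb_stored * 22
    puts = put_requests / 1_000_000 * 5
    gets = get_requests / 1_000_000 * 0.40
    return storage + puts + gets

# 10 TB stored, 1M writes and 50M reads a month: storage dominates.
cost = monthly_s3_cost(10, 1_000_000, 50_000_000)  # 245.0
```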
&lt;p&gt;Notice how you don&#x27;t pay much for the compute it takes AWS to keep S3 running. That&#x27;s the magic of these object storage services. They allow for a pattern known as the separation of storage and compute.&lt;&#x2F;p&gt;
&lt;p&gt;In a traditional storage solution (for example Elasticsearch), when the cluster runs out of disk space, it scales out by running another node. Thus, you pay for the accumulated CPU time of all the nodes running to hold your data.&lt;&#x2F;p&gt;
&lt;p&gt;What if most of the time, the data just sits there, accumulating dust, almost never to be queried? You pay for wasted CPU time.&lt;&#x2F;p&gt;
&lt;p&gt;This is why in the big data analytics world, we see products like &lt;a href=&quot;https:&#x2F;&#x2F;www.snowflake.com&quot;&gt;Snowflake&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;delta.io&#x2F;&quot;&gt;Delta Lake&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;iceberg.apache.org&#x2F;&quot;&gt;Apache Iceberg&lt;&#x2F;a&gt; (we&#x27;ll expand on these later) being so popular lately. It&#x27;s mainly because of costs.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Be wary that you also pay for network egress (data going outside the data center). On AWS specifically, it will cost ~$53 per TB, which, depending on your workload, can be a deal breaker. As long as you run in the same AZ (availability zone) though, you &lt;a href=&quot;https:&#x2F;&#x2F;docs.aws.amazon.com&#x2F;cur&#x2F;latest&#x2F;userguide&#x2F;cur-data-transfers-charges.html&quot;&gt;shouldn&#x27;t pay for egress&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;There&#x27;s a pretty new service by Cloudflare called &lt;a href=&quot;https:&#x2F;&#x2F;www.cloudflare.com&#x2F;developer-platform&#x2F;r2&#x2F;&quot;&gt;R2&lt;&#x2F;a&gt; which provides the same API as S3, without the egress costs.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;em&gt;&amp;quot;Why not use a mounted file system like EBS?&amp;quot;&lt;&#x2F;em&gt; The reason is, again, cost. S3 is much cheaper in comparison (EBS is ~8x more expensive per replica, so 3 replicas will cost ~24x more). There&#x27;s also the simplicity of not needing to deal with resizing the volume. For a more thorough explanation, I&#x27;ll link WarpStream&#x27;s &lt;a href=&quot;https:&#x2F;&#x2F;www.warpstream.com&#x2F;blog&#x2F;cloud-disks-are-expensive&quot;&gt;Cloud Disks are (Really!) Expensive&lt;&#x2F;a&gt; blog post.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;stateless&quot;&gt;Stateless&lt;&#x2F;h2&gt;
&lt;p&gt;The separation of storage and compute also means your service is stateless, allowing for simple scalability and operation:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;You can add &#x2F; remove nodes by monitoring CPU &#x2F; RAM &#x2F; network usage.
&lt;ul&gt;
&lt;li&gt;Pay only for what you use.&lt;&#x2F;li&gt;
&lt;li&gt;Do so quickly. No need to synchronize the state with the cluster.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;Scale to zero with &lt;a href=&quot;https:&#x2F;&#x2F;aws.amazon.com&#x2F;lambda&#x2F;&quot;&gt;AWS Lambda&lt;&#x2F;a&gt; and pay nothing if there&#x27;s usually no workload.
&lt;ul&gt;
&lt;li&gt;Or better yet, run on a cheap serverless edge solution like &lt;a href=&quot;https:&#x2F;&#x2F;workers.cloudflare.com&#x2F;&quot;&gt;Cloudflare Workers&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;Ability to break the monolith into different services.
&lt;ul&gt;
&lt;li&gt;For example a service for the write path, and a service for the read path.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;You can restart pods in case of a bug, and know that they will start with a clean state.&lt;&#x2F;li&gt;
&lt;li&gt;Throw away that &lt;code&gt;StatefulSet&lt;&#x2F;code&gt; in k8s, and deploy a simple &lt;code&gt;Deployment&lt;&#x2F;code&gt;, just like your regular stateless web server.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;reliability&quot;&gt;Reliability&lt;&#x2F;h2&gt;
&lt;p&gt;S3 is designed to provide 99.999999999% durability and 99.99% availability of objects over a given year, and other object storages have similar guarantees.&lt;&#x2F;p&gt;
&lt;p&gt;They are designed to survive bit-flips from cosmic rays, and even a random earthquake destroying a whole data center.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;But can they protect against human error? Probably not 🙃&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;They remove the need to manage replicas of data in your system, which is a very complex problem.&lt;&#x2F;p&gt;
&lt;p&gt;One of the ways they achieve this is by using something called &lt;a href=&quot;https:&#x2F;&#x2F;brooker.co.za&#x2F;blog&#x2F;2023&#x2F;01&#x2F;06&#x2F;erasure.html&quot;&gt;erasure coding&lt;&#x2F;a&gt;. It&#x27;s an algorithm that breaks an object into &lt;code&gt;X&lt;&#x2F;code&gt; chunks and distributes these chunks across different data centers. The beauty is that you only need &lt;code&gt;Y&lt;&#x2F;code&gt; of the chunks to reconstruct the object, where &lt;code&gt;Y &amp;lt; X&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
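&lt;p&gt;The simplest instance of this idea is XOR parity (real object storages use Reed-Solomon codes with larger parameters, but the principle is the same). Here, &lt;code&gt;X = 3&lt;&#x2F;code&gt; chunks are stored and any &lt;code&gt;Y = 2&lt;&#x2F;code&gt; suffice to reconstruct:&lt;&#x2F;p&gt;

```python
def encode(a, b):
    """Encode two equal-length data chunks into three stored chunks.

    The third chunk is the XOR parity, so any 2 of the 3 chunks are
    enough to reconstruct the object (X = 3, Y = 2 in the terms
    above). Real systems use Reed-Solomon for bigger X and Y.
    """
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]

def reconstruct(chunks):
    """Recover (a, b) when at most one chunk was lost (set to None)."""
    a, b, parity = chunks
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, parity))
    if b is None:
        b = bytes(x ^ y for x, y in zip(a, parity))
    return a, b

a, b, parity = encode(b"hell", b"o!!!")
# Lose chunk b (say its data center burned down); recover it:
assert reconstruct([a, None, parity]) == (b"hell", b"o!!!")
```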
&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;&#x2F;h2&gt;
&lt;p&gt;As object storages are designed to be cheap and durable, this comes at the cost of performance, specifically latency.&lt;&#x2F;p&gt;
&lt;p&gt;When running a &lt;code&gt;GetObject&lt;&#x2F;code&gt; request to download a blob file, you can expect the median latency to be ~15ms, with P90 at ~60ms. Although these numbers got better with time and will continue to slowly improve, the latency of an NVMe SSD is 20–100 μs, which is 1000x faster.&lt;&#x2F;p&gt;
&lt;p&gt;The throughput is also not amazing by default, being somewhere around 50MB&#x2F;s (while NVMe 5.0 can get to 12GB&#x2F;s), but there&#x27;s a trick to reach the throughput of even SSDs, and that is running multiple &lt;code&gt;GetObject&lt;&#x2F;code&gt; requests in parallel. For example, getting 20 blob files in parallel will give you 1GB&#x2F;s of throughput.&lt;&#x2F;p&gt;
&lt;p&gt;This trick works even when you want to download 1 big file. For example in S3, there&#x27;s a &lt;code&gt;Range&lt;&#x2F;code&gt; header you can provide to &lt;code&gt;GetObject&lt;&#x2F;code&gt;, where you specify the byte offset and size to download. Split the download into chunks, and fire multiple &lt;code&gt;GetObject&lt;&#x2F;code&gt; requests concurrently. Adding a bit of complexity for the benefit of better throughput.&lt;&#x2F;p&gt;
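&lt;p&gt;A sketch of the chunked parallel download (&lt;code&gt;get_range&lt;&#x2F;code&gt; is a stand-in for a &lt;code&gt;GetObject&lt;&#x2F;code&gt; request with a &lt;code&gt;Range&lt;&#x2F;code&gt; header; the chunk size and concurrency are arbitrary):&lt;&#x2F;p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_download(get_range, total_size, chunk_size=8 * 1024 * 1024):
    """Download one big object as many concurrent ranged reads.

    get_range(offset, size) stands in for a GetObject request with a
    Range header. map() preserves order, so the chunks concatenate
    back into the original object.
    """
    offsets = range(0, total_size, chunk_size)
    with ThreadPoolExecutor(max_workers=20) as pool:
        chunks = pool.map(
            lambda off: get_range(off, min(chunk_size, total_size - off)),
            offsets,
        )
        return b"".join(chunks)
```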
&lt;p&gt;Usually, the cloud providers also provide a more expensive but lower latency solution. For example, AWS has &lt;a href=&quot;https:&#x2F;&#x2F;aws.amazon.com&#x2F;s3&#x2F;storage-classes&#x2F;express-one-zone&#x2F;&quot;&gt;S3 Express&lt;&#x2F;a&gt;, which is somewhere in the middle of the pricing range between regular S3 and EBS, but it allows for tiering strategies without changing much of the architecture and code.&lt;&#x2F;p&gt;
&lt;p&gt;For example, if most reads from users are on new data and you write to an immutable log, like a LSM Tree, you can first write into the more expensive solution, and then on compaction write to the cheaper one. Access to new data will be fast, without paying that much more, as most of the time the data is in the cold storage.&lt;&#x2F;p&gt;
&lt;p&gt;Be wary of rate limits though. S3, for example, states it supports 3500 write requests per second and 5500 read requests per second. Just remember that the rate limits are applied per prefix, so storing data under different prefixes will give you greater total rate limits.&lt;&#x2F;p&gt;
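&lt;p&gt;Since the limits apply per prefix, a common workaround (sketched below; the shard count is arbitrary) is to spread keys across hashed prefixes, which multiplies the effective rate limit by the number of shards:&lt;&#x2F;p&gt;

```python
import hashlib

def sharded_key(key, shards=16):
    """Prepend a stable, hash-derived shard prefix to the key.

    Each prefix gets its own rate-limit budget, so the effective
    limit grows roughly linearly with the shard count. The cost is
    that listing all objects now requires one listing per shard.
    """
    digest = hashlib.sha256(key.encode()).hexdigest()
    shard = int(digest, 16) % shards
    return f"shard-{shard:02d}/{key}"
```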
&lt;p&gt;Finally, &lt;code&gt;ListObjects&lt;&#x2F;code&gt; requests are notoriously slow, mostly because object storages are flat, not hierarchical. Prefixes are called prefixes and not directories because that&#x27;s exactly what they are: a prefix to the key (remember how I said object storages are similar to K&#x2F;V DBs?). To accommodate this, you should store a few big blobs rather than a bunch of small ones. I can&#x27;t say what the best absolute size is; experiment and benchmark for your use case.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;mostly-cloud-based&quot;&gt;Mostly cloud-based&lt;&#x2F;h2&gt;
&lt;p&gt;Almost all object storages are services provided as part of the cloud. If you want your data to sit inside the internal company servers (On-Premise), it gets a bit more complex but definitely doable.&lt;&#x2F;p&gt;
&lt;p&gt;A popular solution for having an On-Prem object storage is to deploy &lt;a href=&quot;https:&#x2F;&#x2F;min.io&#x2F;&quot;&gt;MinIO&lt;&#x2F;a&gt; using k8s or OpenShift.&lt;&#x2F;p&gt;
&lt;p&gt;MinIO strives to provide an API compatible with S3, but it has some differences. For example, in S3, a file and a directory can have the same name, while it is not supported in MinIO.&lt;&#x2F;p&gt;
&lt;p&gt;That&#x27;s why when writing automated tests for your service, you should consider using &lt;a href=&quot;https:&#x2F;&#x2F;docs.localstack.cloud&#x2F;user-guide&#x2F;aws&#x2F;s3&#x2F;&quot;&gt;LocalStack&#x27;s S3&lt;&#x2F;a&gt; instead of MinIO.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Both MinIO and LocalStack have &lt;a href=&quot;https:&#x2F;&#x2F;testcontainers.com&#x2F;&quot;&gt;testcontainers&lt;&#x2F;a&gt; modules, greatly simplifying the setup of your tests.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h1 id=&quot;acid-transactions&quot;&gt;ACID transactions?&lt;&#x2F;h1&gt;
&lt;p&gt;If you have no idea what ACID is, you can go read my &lt;a href=&quot;&#x2F;posts&#x2F;database-fundementals&quot;&gt;Database Fundamentals&lt;&#x2F;a&gt; post.&lt;&#x2F;p&gt;
&lt;p&gt;Storing data in object storages and guaranteeing ACID transactions is possible, but has to be carefully designed.&lt;&#x2F;p&gt;
&lt;p&gt;This is not a novel problem anymore, let&#x27;s look at how open-source solutions have solved this.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.vldb.org&#x2F;pvldb&#x2F;vol13&#x2F;p3411-armbrust.pdf&quot;&gt;Delta Lake&lt;&#x2F;a&gt; (&lt;strong&gt;highly recommended&lt;&#x2F;strong&gt; white paper linked) is an open-source ACID table storage layer over cloud object storages, developed at &lt;a href=&quot;https:&#x2F;&#x2F;www.databricks.com&#x2F;&quot;&gt;Databricks&lt;&#x2F;a&gt;. Think of it as adding the ability to run SQL over data stored in object storages.&lt;&#x2F;p&gt;
&lt;p&gt;In chapter 3.2 of the white paper, they state that both Google Cloud Storage and Azure Blob Store support an atomic put-if-absent operation, so they simply use it as the atomicity primitive. S3 is trickier, as it doesn&#x27;t support any atomic put-if-absent &#x2F; atomic rename operations, so you need to roll your own coordination service that uses some concurrency primitive like locks, through which all S3 write requests go.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;A very clever business move by Databricks. If you write to S3 with Spark running in Databricks, the writes automatically go through a coordination service implemented by them.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;In Delta Lake version 1.2, they&#x27;ve included a way to use &lt;a href=&quot;https:&#x2F;&#x2F;delta.io&#x2F;blog&#x2F;2022-05-18-multi-cluster-writes-to-delta-lake-storage-in-s3&#x2F;&quot;&gt;DynamoDB as the coordination service&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;This method of using a database that already implements ACID transactions is common, as it also improves the performance when listing files.&lt;&#x2F;p&gt;
&lt;p&gt;The biggest disadvantage with this approach is that the availability and durability guarantees of your application are only as good as the worst guarantees your different services provide. If you run Postgres self-hosted, and the node crashes for any reason, it can mean you don&#x27;t have access to the data anymore, or at least transactional and efficient access to the data, depending on your architecture.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Iceberg, a competitor to Delta Lake developed by Netflix that is quickly becoming the industry standard, has an open-source coordination service called &lt;a href=&quot;https:&#x2F;&#x2F;projectnessie.org&#x2F;&quot;&gt;Nessie&lt;&#x2F;a&gt;, which also supports git-like branching on your data (very cool 😎).&lt;&#x2F;p&gt;
&lt;p&gt;Snowflake uses &lt;a href=&quot;https:&#x2F;&#x2F;www.snowflake.com&#x2F;blog&#x2F;how-foundationdb-powers-snowflake-metadata-forward&#x2F;&quot;&gt;FoundationDB&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;I would really like it if one day AWS added an &lt;code&gt;IfMatch&lt;&#x2F;code&gt; header that checks, right before the end of a &lt;code&gt;PutObject&lt;&#x2F;code&gt; request, whether the ETag has changed, and fails the request if it has. I mean, there&#x27;s already one in &lt;code&gt;GetObject&lt;&#x2F;code&gt;...&lt;&#x2F;p&gt;
&lt;p&gt;It would allow you to implement optimistic concurrency control right over the object storage by:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Read the current &amp;quot;metadata&amp;quot; file with &lt;code&gt;GetObject&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Treat the ETag as the version and increment it by 1.&lt;&#x2F;li&gt;
&lt;li&gt;Upload a new file with the header &lt;code&gt;IfMatch: &amp;lt;just-read-version&amp;gt;&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;If the request fails on the &lt;code&gt;IfMatch&lt;&#x2F;code&gt;, repeat from the beginning.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
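&lt;p&gt;Put together, the loop looks something like this (a sketch over a hypothetical client exposing an ETag-conditional put; S3 had no such request at the time of writing):&lt;&#x2F;p&gt;

```python
class PreconditionFailed(Exception):
    """The object changed underneath us (think HTTP 412)."""

def update_metadata(store, key, transform, max_retries=10):
    """Optimistic concurrency control over an object store.

    store.get(key) returns (data, etag); store.put_if_match(key,
    data, etag) raises PreconditionFailed if the object's ETag no
    longer matches, in which case we re-read and retry.
    """
    for _ in range(max_retries):
        data, etag = store.get(key)
        try:
            store.put_if_match(key, transform(data), etag)
            return True
        except PreconditionFailed:
            continue  # lost the race; re-read and retry
    return False
```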
&lt;p&gt;This will be less efficient in most cases than using Postgres, as you would need to upload a whole metadata file for each change, but it&#x27;s much simpler when you don&#x27;t need speed.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Tony from the future here: AWS has just announced &lt;a href=&quot;https:&#x2F;&#x2F;aws.amazon.com&#x2F;about-aws&#x2F;whats-new&#x2F;2024&#x2F;08&#x2F;amazon-s3-conditional-writes&#x2F;&quot;&gt;conditional writes&lt;&#x2F;a&gt;, really exciting. Do you think this post had an influence? Probably not 🙃&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h1 id=&quot;implementation-tips&quot;&gt;Implementation tips&lt;&#x2F;h1&gt;
&lt;p&gt;It used to be that you would need to roll your own abstraction over object storages.&lt;&#x2F;p&gt;
&lt;p&gt;Apache&#x27;s &lt;a href=&quot;https:&#x2F;&#x2F;opendal.apache.org&#x2F;&quot;&gt;OpenDAL&lt;&#x2F;a&gt; has made working with all the different object storages much simpler, by providing a single unified API.&lt;&#x2F;p&gt;
&lt;p&gt;Here&#x27;s what it looks like in Rust:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;rust&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-rust &quot;&gt;&lt;code class=&quot;language-rust&quot; data-lang=&quot;rust&quot;&gt;&lt;span&gt;#[tokio::main]
&lt;&#x2F;span&gt;&lt;span&gt;async &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;main&lt;&#x2F;span&gt;&lt;span&gt;() &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#66d9ef;&quot;&gt;opendal&lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#ff79c6;&quot;&gt;::&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#66d9ef;&quot;&gt;Result&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;lt;()&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;mut&lt;&#x2F;span&gt;&lt;span&gt; builder &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#66d9ef;&quot;&gt;opendal&lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#ff79c6;&quot;&gt;::&lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#66d9ef;&quot;&gt;services&lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#ff79c6;&quot;&gt;::&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;S3&lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#ff79c6;&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;default();
&lt;&#x2F;span&gt;&lt;span&gt;    builder&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;bucket&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;test&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; op &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#66d9ef;&quot;&gt;opendal&lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#ff79c6;&quot;&gt;::&lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#66d9ef;&quot;&gt;Operator&lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#ff79c6;&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;new(builder)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;?
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;layer&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#66d9ef;&quot;&gt;opendal&lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#ff79c6;&quot;&gt;::&lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#66d9ef;&quot;&gt;layers&lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#ff79c6;&quot;&gt;::&lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#66d9ef;&quot;&gt;LoggingLayer&lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#ff79c6;&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;default())
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;finish&lt;&#x2F;span&gt;&lt;span&gt;();
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; Get the file length.
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; meta &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; op&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;stat&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;hello.txt&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span&gt;await&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;?&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; length &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; meta&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;content_length&lt;&#x2F;span&gt;&lt;span&gt;();
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; Read first 1024 bytes.
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; data &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; op&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;read_with&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;hello.txt&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;range&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;..&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1024&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span&gt;await&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;?&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#66d9ef;&quot;&gt;Ok&lt;&#x2F;span&gt;&lt;span&gt;(())
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;OpenDAL also supports the file system with &lt;code&gt;opendal::services::Fs&lt;&#x2F;code&gt;, allowing you to run your object-storage-native app without relying on an actual object storage. This can be great for testing, for example. However, don&#x27;t expect it to be as optimized as an app designed to run on the file system from the start.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, because object storages don&#x27;t allow partial writes, you should use immutable data structures like the LSM Tree, where files, once written, are only ever read or deleted.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;real-world-examples&quot;&gt;Real-world examples&lt;&#x2F;h1&gt;
&lt;p&gt;Ok, we&#x27;re done with the theory. Let&#x27;s look at some real-world data applications that have explicitly decided to use an object storage.&lt;&#x2F;p&gt;
&lt;p&gt;We&#x27;ll look at what they gained, and what they lost in the process.&lt;&#x2F;p&gt;
&lt;p&gt;Get ready for some opinions 🤠&lt;&#x2F;p&gt;
&lt;h2 id=&quot;quickwit&quot;&gt;Quickwit&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;quickwit.io&#x2F;&quot;&gt;Quickwit&lt;&#x2F;a&gt; is a highly scalable, distributed and cheap log search engine. Or in simpler words: &lt;em&gt;&amp;quot;Elasticsearch but on an object storage&amp;quot;&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;It&#x27;s open source (AGPL license) and written in Rust using &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;quickwit-oss&#x2F;tantivy&quot;&gt;tantivy&lt;&#x2F;a&gt; (MIT license), a fast text search engine similar to Apache&#x27;s &lt;a href=&quot;https:&#x2F;&#x2F;lucene.apache.org&#x2F;&quot;&gt;Lucene&lt;&#x2F;a&gt; (Elasticsearch&#x27;s search engine).&lt;&#x2F;p&gt;
&lt;p&gt;Tantivy and Lucene are libraries that receive text, tokenize it, and write to a data structure called an inverted index.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s say you provide them the following two strings: &amp;quot;My dog ate my food!&amp;quot; and &amp;quot;My cat likes my dog&amp;quot;. Here&#x27;s the resulting inverted index:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Word&lt;&#x2F;th&gt;&lt;th&gt;Documents&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;my&lt;&#x2F;td&gt;&lt;td&gt;0, 1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;dog&lt;&#x2F;td&gt;&lt;td&gt;0, 1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;ate&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;food&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;cat&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;likes&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The tokenizer may also stem words, converting &amp;quot;changing&amp;quot;, &amp;quot;changed&amp;quot; and &amp;quot;change&amp;quot; into &amp;quot;chang&amp;quot;, so searching for &amp;quot;change&amp;quot; will find &amp;quot;My dog is changing&amp;quot;. The inverted index may also store how many times a word comes up in each document, to rank more relevant results higher in a search (the algorithm used is &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Okapi_BM25&quot;&gt;BM25&lt;&#x2F;a&gt;). There&#x27;s more to it, but I think you get the idea.&lt;&#x2F;p&gt;
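&lt;p&gt;A minimal sketch of building such an inverted index (without the stemming or BM25 scoring that tantivy adds on top; the names here are my own, not tantivy&#x27;s API):&lt;&#x2F;p&gt;

```rust
use std::collections::BTreeMap;

// Build a word -> sorted list of document ids mapping.
fn build_inverted_index(docs: &[&str]) -> BTreeMap<String, Vec<usize>> {
    let mut index: BTreeMap<String, Vec<usize>> = BTreeMap::new();
    for (doc_id, doc) in docs.iter().enumerate() {
        // A naive tokenizer: split on non-alphanumeric characters, lowercase.
        for word in doc
            .split(|c: char| !c.is_alphanumeric())
            .filter(|w| !w.is_empty())
        {
            let postings = index.entry(word.to_lowercase()).or_default();
            // Avoid duplicate entries when a word repeats in the same document.
            if postings.last() != Some(&doc_id) {
                postings.push(doc_id);
            }
        }
    }
    index
}

fn main() {
    let index = build_inverted_index(&["My dog ate my food!", "My cat likes my dog"]);
    assert_eq!(index["my"], vec![0, 1]);
    assert_eq!(index["dog"], vec![0, 1]);
    assert_eq!(index["ate"], vec![0]);
    assert_eq!(index["cat"], vec![1]);
    println!("{:?}", index);
}
```

&lt;p&gt;Searching for a word is then just a lookup in this map, returning the list of matching document ids.&lt;&#x2F;p&gt;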
&lt;p&gt;So what Quickwit does is:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Read documents from a stream, for example, Kafka.&lt;&#x2F;li&gt;
&lt;li&gt;Use tantivy to create an inverted index every configurable number of seconds.&lt;&#x2F;li&gt;
&lt;li&gt;Store the newly created inverted index as a file in an object storage.&lt;&#x2F;li&gt;
&lt;li&gt;Record the new file in the metadata store (Postgres).
&lt;ul&gt;
&lt;li&gt;Metadata can also be managed with a &lt;code&gt;metadata.json&lt;&#x2F;code&gt; file that&#x27;s uploaded to the object storage. Less recommended on S3, which, as we&#x27;ve already discussed, can&#x27;t guarantee ACID.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;This is what they call the indexing pipeline.&lt;&#x2F;p&gt;
&lt;p&gt;Then on a search query:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Get relevant file paths on object storage by querying the metadata.&lt;&#x2F;li&gt;
&lt;li&gt;Download the files.
&lt;ul&gt;
&lt;li&gt;Note that it downloads the files in parallel, achieving higher throughput.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;Use tantivy to search the relevant documents in each file&#x27;s inverted index.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;img class=&quot;svg&quot; src=&quot;&#x2F;quickwit_architecture.svg&quot;&#x2F;&gt;
&lt;blockquote&gt;
&lt;p&gt;Image inspired by &lt;a href=&quot;https:&#x2F;&#x2F;quickwit.io&#x2F;blog&#x2F;quickwit-101&quot;&gt;Quickwit 101 - Architecture of a distributed search engine on object storage&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Quickwit is much cheaper than Elasticsearch, roughly 10x (depending on the workload, of course), and you can control which nodes, and how many, are in the indexing and searching clusters, tuning them to match your read &#x2F; write workload.&lt;&#x2F;p&gt;
&lt;p&gt;Sounds amazing, what&#x27;s the catch? Latency.&lt;&#x2F;p&gt;
&lt;p&gt;As we&#x27;ve already discussed, each round trip takes ~1000x longer than a read from a modern SSD. Quickwit has built a few measures to lower the latency, for example:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Designing a protocol with a maximum of 3 round trips per file.&lt;&#x2F;li&gt;
&lt;li&gt;Caching the important sections of the inverted indexes in the searcher pods.
&lt;ul&gt;
&lt;li&gt;Cache hits lower the number of round trips.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;Load balancing with rendezvous hashing (similar to consistent hashing) for a better cache hit rate.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
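&lt;p&gt;Rendezvous hashing is simple enough to sketch (this is the general idea, not Quickwit&#x27;s actual implementation): every client scores each node against the key with a deterministic hash and routes to the highest score, so all searchers agree on an owner without coordination, and removing a node only remaps the keys that node owned:&lt;&#x2F;p&gt;

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Score a (node, key) pair; every client computes the same scores.
fn score(node: &str, key: &str) -> u64 {
    let mut h = DefaultHasher::new();
    node.hash(&mut h);
    key.hash(&mut h);
    h.finish()
}

// Route `key` to the node with the highest score.
fn owner<'a>(nodes: &[&'a str], key: &str) -> &'a str {
    nodes
        .iter()
        .copied()
        .max_by_key(|node| score(node, key))
        .expect("at least one node")
}

fn main() {
    let nodes = ["searcher-1", "searcher-2", "searcher-3"];
    // The same key always lands on the same node -> better cache hit rate.
    assert_eq!(owner(&nodes, "split-42"), owner(&nodes, "split-42"));

    // Removing a node only remaps the keys that node owned: for every other
    // key, the previous maximum score is still the maximum.
    let removed = "searcher-2";
    let remaining: Vec<&str> = nodes.iter().copied().filter(|n| *n != removed).collect();
    for key in ["a", "b", "c", "d", "e"] {
        let before = owner(&nodes, key);
        if before != removed {
            assert_eq!(owner(&remaining, key), before);
        }
    }
}
```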
&lt;p&gt;There&#x27;s also another, more minor, issue I found: no built-in monitoring and alerting system. Minor because it can be implemented in the future.&lt;&#x2F;p&gt;
&lt;p&gt;The bottom line is: if you don&#x27;t need consistent sub-200ms search times, and you don&#x27;t need an alerting system, then Quickwit is probably a good fit for you.&lt;&#x2F;p&gt;
&lt;p&gt;For most use cases, the drawbacks are so minor compared to the advantages, I truly think this is the future of log search engines.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;After learning about Quickwit, I got hyped and started implementing something like it myself, using tantivy and OpenDAL: &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tontinton&#x2F;toshokan&#x2F;&quot;&gt;toshokan&lt;&#x2F;a&gt; 😛&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;warpstream&quot;&gt;WarpStream&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;warpstream.com&#x2F;&quot;&gt;WarpStream&lt;&#x2F;a&gt; is a cheap distributed log and streaming platform with an API compatible with Kafka. Or in simpler words: &lt;em&gt;&amp;quot;Kafka but on an object storage&amp;quot;&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;It&#x27;s not open-source, which means I can&#x27;t recommend it.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you&#x27;re from WarpStream (now Confluent?), please understand that I don&#x27;t want support, I want to read code when stuff doesn&#x27;t work.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Main differences with Kafka are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;No leader &#x2F; followers.&lt;&#x2F;li&gt;
&lt;li&gt;Max latency starts at 250ms, as the WarpStream agents (the stateless service) buffer records in memory and flush once 250ms have passed. This is only the default and can be modified, but lowering the flush interval makes it less cost-efficient (more PUT &#x2F; GET requests to S3).&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
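&lt;p&gt;The buffering trade-off can be sketched roughly like this (hypothetical names, not WarpStream&#x27;s actual code): records accumulate in memory and are flushed as one batch once the linger interval has passed, so fewer, larger PUT requests trade latency for cost:&lt;&#x2F;p&gt;

```rust
use std::time::{Duration, Instant};

// Buffer records and flush them as one object once `linger` has passed.
// Fewer, larger PUT requests = cheaper, but each record waits up to `linger`.
struct Batcher {
    linger: Duration,
    buffer: Vec<String>,
    started: Option<Instant>,
}

impl Batcher {
    fn new(linger: Duration) -> Self {
        Self { linger, buffer: Vec::new(), started: None }
    }

    // Returns a batch to upload once the linger interval has elapsed.
    fn push(&mut self, record: String, now: Instant) -> Option<Vec<String>> {
        if self.buffer.is_empty() {
            self.started = Some(now);
        }
        self.buffer.push(record);
        if now.duration_since(self.started.unwrap()) >= self.linger {
            self.started = None;
            Some(std::mem::take(&mut self.buffer))
        } else {
            None
        }
    }
}

fn main() {
    let mut batcher = Batcher::new(Duration::from_millis(250));
    let t0 = Instant::now();
    assert!(batcher.push("r1".into(), t0).is_none());
    assert!(batcher.push("r2".into(), t0 + Duration::from_millis(100)).is_none());
    // 250ms after the first record, the batch is flushed as a single object.
    let batch = batcher.push("r3".into(), t0 + Duration::from_millis(250)).unwrap();
    assert_eq!(batch, vec!["r1", "r2", "r3"]);
}
```

&lt;p&gt;A shorter linger means lower latency but more (and smaller) PUT requests, which is exactly the cost knob described above.&lt;&#x2F;p&gt;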
&lt;p&gt;The WarpStream devs understand S3&#x27;s drawbacks well, and have implemented multiple nice tricks to design around them:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Getting good throughput on S3 by distributing written records to multiple agents, and letting them write to S3 in parallel.&lt;&#x2F;li&gt;
&lt;li&gt;Data locality for reads. Each agent is elected as the owner of specific split files. When an agent receives a request for a split file it doesn&#x27;t own, it redirects the request to the owner agent, which caches these files in memory. This is especially useful as the most common pattern in a stream is to read from the end, meaning most read requests will want the latest file, which is the most likely to be cached in memory.&lt;&#x2F;li&gt;
&lt;li&gt;Data locality for historical reads. Split files are combined, sorted and compacted to allow for better efficiency when reading old historical records serially one after another.&lt;&#x2F;li&gt;
&lt;li&gt;Can be configured to write new data to S3 Express, which is the most likely data to be read in a stream, and write old data (after compaction) to standard S3.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;As you can probably already guess, WarpStream is ~5-10x cheaper than Kafka, and much simpler to operate as it&#x27;s stateless.&lt;&#x2F;p&gt;
&lt;p&gt;Other than being new and mostly unproven &lt;em&gt;yet&lt;&#x2F;em&gt;, it has a pretty big problem. Try to guess what it is 😊&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#282a36;color:#f8f8f2;&quot;&gt;&lt;code&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;Some space for you to think :)
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Latency.&lt;&#x2F;p&gt;
&lt;p&gt;The producer-to-consumer latency is (at the time of writing):&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;P50 - half a second.&lt;&#x2F;li&gt;
&lt;li&gt;P95 - almost a second.&lt;&#x2F;li&gt;
&lt;li&gt;P99 - a second.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Can they improve it? Maybe. But probably not near the latency of Kafka.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;So, where do I see this product winning over Kafka?&lt;&#x2F;p&gt;
&lt;p&gt;Mostly in high throughput workloads, where you don&#x27;t care about a second of latency, and you have enough throughput to start worrying about costs. For example, streaming security logs (e.g. AWS CloudTrail) into Quickwit to be searched by security analysts.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;neon&quot;&gt;Neon&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;neon.tech&#x2F;&quot;&gt;Neon&lt;&#x2F;a&gt; is an open-source (Apache license) serverless Postgres.&lt;&#x2F;p&gt;
&lt;p&gt;They took Postgres and made it work with an architecture that stores the actual data in an object storage instead of local disk.&lt;&#x2F;p&gt;
&lt;img class=&quot;svg&quot; src=&quot;&#x2F;neon_architecture.svg&quot;&#x2F;&gt;
&lt;p&gt;Postgres stores transaction logs in a data structure called the WAL (Write-Ahead Log). Neon streams log entries from this WAL to a service they call Safekeeper, using the native Postgres replication protocol. Safekeeper nodes provide durability and fault tolerance using a &lt;a href=&quot;https:&#x2F;&#x2F;neon.tech&#x2F;blog&#x2F;paxos&quot;&gt;custom-made Paxos&lt;&#x2F;a&gt;, where the Postgres nodes are the proposers and the safekeepers are the acceptors (verified by this &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;neondatabase&#x2F;neon&#x2F;blob&#x2F;main&#x2F;safekeeper&#x2F;spec&#x2F;ProposerAcceptorConsensus.tla&quot;&gt;TLA+&lt;&#x2F;a&gt; spec).&lt;&#x2F;p&gt;
&lt;p&gt;Once logs are accepted by the safekeepers, they are streamed to the next service, called the page server. The page server behaves like an LSM Tree: it buffers logs until they reach 1GB in size, then flushes them as a new immutable file to the object storage. Of course, just like the usual LSM Tree, you can query these logs even while they are buffered.&lt;&#x2F;p&gt;
&lt;p&gt;All read requests go directly to the page server, with the page id and an LSN (Log Sequence Number). The LSN is a monotonically increasing number that identifies a specific log entry in the WAL. So you know what that means, right?&lt;&#x2F;p&gt;
&lt;p&gt;Neon is an event source of Postgres&#x27;s WAL! It has &lt;strong&gt;history&lt;&#x2F;strong&gt;, meaning you can have time-traveling queries and copy-on-write for your data. Or in other words: &amp;quot;git branching for your data&amp;quot;.&lt;&#x2F;p&gt;
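&lt;p&gt;Conceptually, the page server keeps every version of a page keyed by LSN, and a read at a given LSN returns the newest version at or before it. Here&#x27;s a simplified sketch of that idea (not Neon&#x27;s actual storage format):&lt;&#x2F;p&gt;

```rust
use std::collections::BTreeMap;

// Versions of a single page, keyed by the LSN that produced them.
struct PageHistory {
    versions: BTreeMap<u64, String>,
}

impl PageHistory {
    fn new() -> Self {
        Self { versions: BTreeMap::new() }
    }

    fn write(&mut self, lsn: u64, contents: &str) {
        self.versions.insert(lsn, contents.to_string());
    }

    // Read the page as it was at `lsn`: the newest version at or before it.
    fn read_at(&self, lsn: u64) -> Option<&str> {
        self.versions
            .range(..=lsn)
            .next_back()
            .map(|(_, contents)| contents.as_str())
    }
}

fn main() {
    let mut page = PageHistory::new();
    page.write(10, "v1");
    page.write(20, "v2");
    page.write(30, "v3");

    // Latest read.
    assert_eq!(page.read_at(100), Some("v3"));
    // Time-traveling read: the page as it was at LSN 25.
    assert_eq!(page.read_at(25), Some("v2"));
    // Before the page existed.
    assert_eq!(page.read_at(5), None);
}
```

&lt;p&gt;A branch is then just a new sequence of writes starting from some parent LSN, never touching the parent&#x27;s history.&lt;&#x2F;p&gt;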
&lt;p&gt;Here are some use cases for git branching to data:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Create a branch at the start of automation tests in CI.
&lt;ul&gt;
&lt;li&gt;This way you can test schema migrations in an isolated way.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;Simpler CD with zero downtime. Each deployment has a version and a branch, and services communicate with your DB on a specific branch.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Wow, this is so grea... Wait, don&#x27;t tell me, latency?&lt;&#x2F;p&gt;
&lt;p&gt;Yep, cache misses go to the slow object storage.&lt;&#x2F;p&gt;
&lt;p&gt;Plus, you have to be careful not to treat it as a general-purpose distributed database. For example, JOIN queries are not distributed; they run on one of the stateless Postgres services. Neon is closer to a single-writer, multiple-read-replicas kind of architecture.&lt;&#x2F;p&gt;
&lt;p&gt;I don&#x27;t know whether I can recommend this one as a replacement for your usual OLTP workloads, as these must be super quick. It looks &lt;a href=&quot;https:&#x2F;&#x2F;neon-latency-benchmarks.vercel.app&#x2F;&quot;&gt;promising&lt;&#x2F;a&gt;, but I&#x27;d have to play around with it more.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h1&gt;
&lt;p&gt;Ok, hopefully you&#x27;ve learned about object storages, when they might be good and when they might be bad, by examining how they work at a high level, and by learning about 3 real solutions already running in the wild.&lt;&#x2F;p&gt;
&lt;p&gt;Think a bit, which of the 3 did you like the most? Why?&lt;&#x2F;p&gt;
&lt;p&gt;Object storage solutions can definitely be market-disrupting when applied to the right problem.&lt;&#x2F;p&gt;
&lt;p&gt;Don&#x27;t sleep on this: for your next open-source database startup, think about whether using them could be the right fit!&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Scheduling Internals</title>
        <published>2024-02-25T00:00:00+00:00</published>
        <updated>2024-02-25T00:00:00+00:00</updated>
        <author>
          <name>Unknown</name>
        </author>
        <link rel="alternate" href="https://tontinton.com/posts/scheduling-internals/" type="text/html"/>
        <id>https://tontinton.com/posts/scheduling-internals/</id>
        
        <content type="html">&lt;div class=&quot;block&quot; style=&quot;margin-top: 30px; height: 600px&quot;&gt;
  &lt;div style=&quot;margin-left: 20px&quot;&gt;
    &lt;div class=&quot;input-box&quot;&gt;
      &lt;input id=&quot;number-cpus&quot; type=&quot;range&quot; min=&quot;1&quot; max=&quot;3&quot; value=&quot;1&quot;&gt;
      &lt;p class=&quot;input-text&quot;&gt;CPUs&lt;&#x2F;p&gt;
    &lt;&#x2F;div&gt;
    &lt;div class=&quot;input-box&quot;&gt;
      &lt;input id=&quot;speed&quot; type=&quot;range&quot; min=&quot;1&quot; max=&quot;5&quot; value=&quot;2&quot; style=&quot;direction: rtl&quot;&gt;
      &lt;p class=&quot;input-text&quot;&gt;Speed&lt;&#x2F;p&gt;
    &lt;&#x2F;div&gt;
    &lt;label class=&quot;cr-wrapper&quot; style=&quot;margin-top: 12px&quot;&gt;
      &lt;input id=&quot;timer-interrupt&quot; type=&quot;checkbox&quot;&#x2F;&gt;
      &lt;div class=&quot;cr-input&quot;&gt;&lt;&#x2F;div&gt;
      &lt;span&gt;Limit Runtime&lt;&#x2F;span&gt;
    &lt;&#x2F;label&gt;
    &lt;label class=&quot;cr-wrapper&quot; style=&quot;margin-top: 4px&quot;&gt;
      &lt;input id=&quot;deadline&quot; type=&quot;checkbox&quot;&#x2F;&gt;
      &lt;div class=&quot;cr-input&quot;&gt;&lt;&#x2F;div&gt;
      &lt;span&gt;Deadline&lt;&#x2F;span&gt;
    &lt;&#x2F;label&gt;
  &lt;&#x2F;div&gt;
  &lt;div id=&quot;app&quot; style=&quot;height: calc(100% - 130px)&quot;&gt;&lt;&#x2F;div&gt;
&lt;&#x2F;div&gt;
&lt;script&gt;
const loadScript=r=&gt;new Promise((e,t)=&gt;{var n=document.createElement(&quot;script&quot;);n.src=r,n.onload=e,n.onerror=t,document.head.appendChild(n)}),waitForDocumentLoad=()=&gt;new Promise(e=&gt;{document.addEventListener(&quot;DOMContentLoaded&quot;,e)});function getSize(){return window.innerWidth&lt;400?&quot;small&quot;:window.innerWidth&lt;600?&quot;medium&quot;:&quot;big&quot;}function applySize(e){for(const t of document.querySelectorAll(&quot;.block&quot;))switch(e){case&quot;small&quot;:t.style.height=t.classList.contains(&quot;small-block&quot;)?&quot;320px&quot;:&quot;420px&quot;;break;case&quot;medium&quot;:t.style.height=t.classList.contains(&quot;small-block&quot;)?&quot;400px&quot;:&quot;500px&quot;;break;case&quot;big&quot;:t.style.height=t.classList.contains(&quot;small-block&quot;)?&quot;500px&quot;:&quot;600px&quot;}}let size=null,mainApp=null,apps=[];function reCreateMainApp(){mainApp&amp;&amp;mainApp.stop(),mainApp=window.createApp(size,document.querySelector(&quot;#app&quot;),document.querySelector(&quot;#speed&quot;).value,document.querySelector(&quot;#timer-interrupt&quot;).checked?32:null,document.querySelector(&quot;#deadline&quot;).checked,document.querySelector(&quot;#number-cpus&quot;).value)}function reCreateApps(){for(const e of apps)e.stop();apps=[],reCreateMainApp(),apps.push(window.createApp(size,document.querySelector(&quot;#app_sched_fifo&quot;),2)),apps.push(window.createApp(size,document.querySelector(&quot;#app_sched_rr&quot;),2,32)),apps.push(window.createApp(size,document.querySelector(&quot;#app_sched_deadline&quot;),3,32,!0)),apps.push(window.createApp(size,document.querySelector(&quot;#app_sched_multicore&quot;),4,32,!1,2))}function run(e){var 
t=getSize();null===size||t!==size?(applySize(size=t),reCreateApps()):e&amp;&amp;reCreateMainApp()}Promise.all([loadScript(&quot;&#x2F;pixi.min.js&quot;),loadScript(&quot;&#x2F;scheduling.min.js&quot;),waitForDocumentLoad()]).then(()=&gt;{run(),window.addEventListener(&quot;resize&quot;,()=&gt;{run()}),document.querySelector(&quot;#number-cpus&quot;).addEventListener(&quot;input&quot;,run),document.querySelector(&quot;#speed&quot;).addEventListener(&quot;input&quot;,run),document.querySelector(&quot;#timer-interrupt&quot;).addEventListener(&quot;input&quot;,run),document.querySelector(&quot;#deadline&quot;).addEventListener(&quot;input&quot;,run)});
&lt;&#x2F;script&gt;
&lt;blockquote&gt;
&lt;p&gt;A sneak peek to what&#x27;s coming!&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;I remember when I first learned that you can write a server handling millions of clients on just a single thread; my mind was simply blown away 🤯&lt;&#x2F;p&gt;
&lt;p&gt;I used Node.js while knowing it is single threaded, I used &lt;code&gt;async&lt;&#x2F;code&gt; &#x2F; &lt;code&gt;await&lt;&#x2F;code&gt; in Python, and I used threads, but never asked myself &lt;em&gt;&amp;quot;How is any of this possible?&amp;quot;&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;This post is written to spread the genius of concurrency, and hopefully get you excited about it too.&lt;&#x2F;p&gt;
&lt;p&gt;My goal is for you to want to send a link to this post to an engineer in your team asking out loud &lt;em&gt;&amp;quot;Wait, but how does async even work?&amp;quot;&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Questions I&#x27;m going to answer:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Why not create a thread per client?&lt;&#x2F;li&gt;
&lt;li&gt;How to sleep when waiting on I&#x2F;O?&lt;&#x2F;li&gt;
&lt;li&gt;How does Node.js achieve concurrency?&lt;&#x2F;li&gt;
&lt;li&gt;What&#x27;s concurrency? What&#x27;s parallelism?&lt;&#x2F;li&gt;
&lt;li&gt;What are coroutines?
&lt;ul&gt;
&lt;li&gt;With an implementation we&#x27;ll build piece by piece.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;What are preemptive and non-preemptive schedulers?&lt;&#x2F;li&gt;
&lt;li&gt;How do Go and Rust implement concurrency in the language (stackful vs stackless)?&lt;&#x2F;li&gt;
&lt;li&gt;What scheduling algorithms are used by Linux, Go and Rust&#x27;s tokio?
&lt;ul&gt;
&lt;li&gt;With animations 🙃&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;I assume proficiency in reading code and OS internals at an intermediate level, but don&#x27;t stress over details you don&#x27;t understand, try to get the bigger picture!&lt;&#x2F;p&gt;
&lt;p&gt;With all of that out of the way, let us begin.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;just-create-a-thread-bro&quot;&gt;Just create a thread, bro&lt;&#x2F;h1&gt;
&lt;p&gt;Let&#x27;s try to write a simple echo server (whatever we receive, we send back) in C code, we&#x27;ll call it &lt;code&gt;echod&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;c&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-c &quot;&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; Assume server_fd is already initialized to start accepting clients.
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;serve&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;server_fd&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span&gt; client_fd &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;accept&lt;&#x2F;span&gt;&lt;span&gt;(server_fd, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;NULL&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;NULL&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span&gt; buffer[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;2048&lt;&#x2F;span&gt;&lt;span&gt;];
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#66d9ef;&quot;&gt;ssize_t&lt;&#x2F;span&gt;&lt;span&gt; len &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;read&lt;&#x2F;span&gt;&lt;span&gt;(client_fd, buffer, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;sizeof&lt;&#x2F;span&gt;&lt;span&gt;(buffer));
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(len &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;== -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;break&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;write&lt;&#x2F;span&gt;&lt;span&gt;(client_fd, buffer, len);
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;close&lt;&#x2F;span&gt;&lt;span&gt;(client_fd);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Cool, now what should we do if we want to handle multiple clients concurrently? While one client is being handled, another might try to &lt;code&gt;connect&lt;&#x2F;code&gt; to our server without ever succeeding, as our server reaches the &lt;code&gt;accept&lt;&#x2F;code&gt; call only once it is done handling the current client.&lt;&#x2F;p&gt;
&lt;p&gt;How can we fix that?&lt;&#x2F;p&gt;
&lt;p&gt;The first thing most people will think is &lt;em&gt;&amp;quot;Can&#x27;t you just create a thread for each client?&amp;quot;&lt;&#x2F;em&gt;, something that looks like this:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;c&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-c &quot;&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;serve_client&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;client_fd&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span&gt; buffer[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;2048&lt;&#x2F;span&gt;&lt;span&gt;];
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#66d9ef;&quot;&gt;ssize_t&lt;&#x2F;span&gt;&lt;span&gt; len &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;read&lt;&#x2F;span&gt;&lt;span&gt;(client_fd, buffer, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;sizeof&lt;&#x2F;span&gt;&lt;span&gt;(buffer));
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(len &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;== -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;break&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;write&lt;&#x2F;span&gt;&lt;span&gt;(client_fd, buffer, len);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;close&lt;&#x2F;span&gt;&lt;span&gt;(client_fd);
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;serve&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;server_fd&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span&gt; client_fd &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;accept&lt;&#x2F;span&gt;&lt;span&gt;(server_fd, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;NULL&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;NULL&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;run_thread&lt;&#x2F;span&gt;&lt;span&gt;(serve_client, client_fd);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The first problem with threads is how the OS allocates a stack for a new thread. The stack is allocated virtual memory (10MB on Linux by default), and physical pages are only committed once the pages are actually written to. This is really nice, as it means you don&#x27;t really reserve 10MB of RAM for each thread right out of the gate, &lt;strong&gt;but&lt;&#x2F;strong&gt; it does mean the granularity of allocation is at least that of a page (run &lt;code&gt;getconf PAGESIZE&lt;&#x2F;code&gt;; my machine is 4KB). Using &lt;code&gt;pthread_attr_setstacksize&lt;&#x2F;code&gt; won&#x27;t fix the problem, as you still must provide a value that is a multiple of a page size. A page might be a lot more than what you actually use, depending on the application.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;I also think that relying on memory overcommitment is pretty annoying: we get killed by the OOM killer instead of having an opportunity to clean up resources when an allocation fails, indicating we are out of memory.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The second problem with creating a bunch of OS threads is that we need to raise all the relevant limits:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sh&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-sh &quot;&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;$&lt;&#x2F;span&gt;&lt;span&gt; cat &#x2F;proc&#x2F;sys&#x2F;kernel&#x2F;threads-max
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;63704
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# Threads in linux are light weight processes (LWP).
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;$&lt;&#x2F;span&gt;&lt;span&gt; cat &#x2F;proc&#x2F;sys&#x2F;kernel&#x2F;pid_max
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;131072
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# Maximum number of VMAs a process can own.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;$&lt;&#x2F;span&gt;&lt;span&gt; cat &#x2F;proc&#x2F;sys&#x2F;vm&#x2F;max_map_count
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;65530
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# Each of these can be set by simply writing to the file:
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;$&lt;&#x2F;span&gt;&lt;span&gt; echo 2000000 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt; &#x2F;proc&#x2F;sys&#x2F;kernel&#x2F;threads-max
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;$&lt;&#x2F;span&gt;&lt;span&gt; echo 2000000 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt; &#x2F;proc&#x2F;sys&#x2F;kernel&#x2F;pid_max
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;$&lt;&#x2F;span&gt;&lt;span&gt; echo 2000000 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt; &#x2F;proc&#x2F;sys&#x2F;vm&#x2F;max_map_count
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The files shown above might not be enough on your system; &lt;code&gt;systemd&lt;&#x2F;code&gt;, for example, also sets its own maximums.&lt;&#x2F;p&gt;
&lt;p&gt;The third problem is performance. Context switching between kernel and user mode is expensive in terms of CPU cycles. A single context switch isn&#x27;t that expensive on its own, but doing a lot of them adds up.&lt;&#x2F;p&gt;
&lt;p&gt;The fourth problem is that the stack allocation is static: we can&#x27;t modify the stack size (grow) or free up committed physical pages in the stack once they are unused (shrink).&lt;&#x2F;p&gt;
&lt;p&gt;Because of all these problems, threads should not be your go-to solution for running a lot of tasks concurrently (especially for I&#x2F;O bound tasks like in &lt;code&gt;echod&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;How else can we make &lt;code&gt;echod&lt;&#x2F;code&gt; serve millions of clients concurrently?&lt;&#x2F;p&gt;
&lt;h1 id=&quot;async-i-o&quot;&gt;Async I&#x2F;O&lt;&#x2F;h1&gt;
&lt;p&gt;Why block an entire thread from running when calling &lt;code&gt;read&lt;&#x2F;code&gt; &#x2F; &lt;code&gt;write&lt;&#x2F;code&gt; &#x2F; &lt;code&gt;accept&lt;&#x2F;code&gt;? If you think about it, we prevent a precious resource (the CPU) from doing anything useful while the application waits for I&#x2F;O.&lt;&#x2F;p&gt;
&lt;p&gt;When calling &lt;code&gt;read&lt;&#x2F;code&gt;, for example, the kernel waits for a network packet to arrive from the network interface card (&lt;code&gt;NIC&lt;&#x2F;code&gt;). The CPU is free to run something else in the meantime.&lt;&#x2F;p&gt;
&lt;p&gt;In linux, you can mark a socket as non-blocking using either &lt;code&gt;ioctl(fd, FIONBIO)&lt;&#x2F;code&gt; or &lt;code&gt;fcntl&lt;&#x2F;code&gt; with &lt;code&gt;O_NONBLOCK&lt;&#x2F;code&gt; (posix). A &lt;code&gt;read&lt;&#x2F;code&gt; call on that socket will then return immediately: if there&#x27;s a packet written by the &lt;code&gt;NIC&lt;&#x2F;code&gt; that we haven&#x27;t read yet, &lt;code&gt;read&lt;&#x2F;code&gt; will copy it into the buffer like usual; otherwise it will return an error, with &lt;code&gt;errno&lt;&#x2F;code&gt; equal to &lt;code&gt;EWOULDBLOCK&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s patch &lt;code&gt;echod&lt;&#x2F;code&gt; to be single-threaded again, but this time supporting multiple concurrent clients using non-blocking sockets:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;c&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-c &quot;&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;struct &lt;&#x2F;span&gt;&lt;span&gt;client &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span&gt; fd;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span&gt; buffer[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;2048&lt;&#x2F;span&gt;&lt;span&gt;]; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; Not allocated dynamically for brevity.
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#66d9ef;&quot;&gt;ssize_t&lt;&#x2F;span&gt;&lt;span&gt; offset; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; -1 when in reading state, otherwise the offset in buffer to write.
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#66d9ef;&quot;&gt;ssize_t&lt;&#x2F;span&gt;&lt;span&gt; length; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; The length left to write (only useful when in writing state).
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;set_nonblock&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;fd&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; Ignore errors for brevity.
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;fcntl&lt;&#x2F;span&gt;&lt;span&gt;(fd, F_SETFL, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;fcntl&lt;&#x2F;span&gt;&lt;span&gt;(fd, F_GETFL) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;|&lt;&#x2F;span&gt;&lt;span&gt; O_NONBLOCK);
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;serve&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;server_fd&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;set_nonblock&lt;&#x2F;span&gt;&lt;span&gt;(server_fd);
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span&gt; client clients[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;64&lt;&#x2F;span&gt;&lt;span&gt;]; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; Maximum 64 concurrent clients.
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span&gt; num_clients &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(num_clients &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;ARRAY_SIZE&lt;&#x2F;span&gt;&lt;span&gt;(clients)) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span&gt; client_fd &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;accept&lt;&#x2F;span&gt;&lt;span&gt;(server_fd, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;NULL&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;NULL&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(client_fd &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;!= -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;set_nonblock&lt;&#x2F;span&gt;&lt;span&gt;(client_fd);
&lt;&#x2F;span&gt;&lt;span&gt;                clients[num_clients]&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;fd &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; client_fd;
&lt;&#x2F;span&gt;&lt;span&gt;                clients[num_clients]&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;offset &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;                num_clients&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;; i &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span&gt; num_clients; i&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span&gt; client&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span&gt; client &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &amp;amp;&lt;&#x2F;span&gt;&lt;span&gt;clients[i];
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;bool&lt;&#x2F;span&gt;&lt;span&gt; is_reading &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; client&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;offset &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;== -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#66d9ef;&quot;&gt;ssize_t&lt;&#x2F;span&gt;&lt;span&gt; result;
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(is_reading) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;                result &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;read&lt;&#x2F;span&gt;&lt;span&gt;(client&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;fd, client&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;buffer, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;sizeof&lt;&#x2F;span&gt;&lt;span&gt;(client&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;buffer));
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;} &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;                result &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;write&lt;&#x2F;span&gt;&lt;span&gt;(client&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;fd, client&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;buffer &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span&gt; client&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;offset, client&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;length);
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(result &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;!= -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(is_reading) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;                    client&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;offset &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; Move to writing state.
&lt;&#x2F;span&gt;&lt;span&gt;                    client&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;length &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; result;
&lt;&#x2F;span&gt;&lt;span&gt;                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;} &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;                    client&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;length &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-=&lt;&#x2F;span&gt;&lt;span&gt; result;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;                    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(client&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;length &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;                        client&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;offset &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; Move to reading state.
&lt;&#x2F;span&gt;&lt;span&gt;                    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;} &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;                        client&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;offset &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;+=&lt;&#x2F;span&gt;&lt;span&gt; result;
&lt;&#x2F;span&gt;&lt;span&gt;                    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;} &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;else if &lt;&#x2F;span&gt;&lt;span&gt;(errno &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;!=&lt;&#x2F;span&gt;&lt;span&gt; EWOULDBLOCK) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;close&lt;&#x2F;span&gt;&lt;span&gt;(client&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;fd);
&lt;&#x2F;span&gt;&lt;span&gt;                num_clients&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;--&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;memcpy&lt;&#x2F;span&gt;&lt;span&gt;(client, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span&gt;clients[num_clients], &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;sizeof&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span&gt;client));
&lt;&#x2F;span&gt;&lt;span&gt;                i&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;--&lt;&#x2F;span&gt;&lt;span&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; Don&amp;#39;t skip the moved client.
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;blockquote&gt;
&lt;p&gt;A bit lengthy; don&#x27;t try to understand everything, just note that we are dealing with a lot of different tasks &amp;quot;at once&amp;quot;. For a compilable version, click &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tontinton&#x2F;echod-hog&#x2F;blob&#x2F;master&#x2F;main.c&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The main problem with this solution is that we are now always busy doing something: the CPU runs at 100% even though most loop iterations result in &lt;code&gt;EWOULDBLOCK&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Another problem is code complexity: we are now prohibited from running any code that blocks, so as not to stall the entire server.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;What we really want is to sleep when we have nothing &lt;em&gt;useful&lt;&#x2F;em&gt; to do, i.e. when no client is waiting to connect, no client has sent a packet, and we can&#x27;t yet send a packet to any client (maybe the client is busy doing something of its own).&lt;&#x2F;p&gt;
&lt;p&gt;The reasons we want this are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;To be good neighbours to other applications running on the same machine, and not take CPU cycles they might want to utilize.&lt;&#x2F;li&gt;
&lt;li&gt;The more the CPU runs, the more energy it takes:
&lt;ul&gt;
&lt;li&gt;Worse battery life.&lt;&#x2F;li&gt;
&lt;li&gt;More expensive.&lt;&#x2F;li&gt;
&lt;li&gt;Less environmentally friendly 🌲🌳🌿&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Good news though: most operating systems provide an API to do just that. Maybe even too many APIs:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;select(2)&lt;&#x2F;strong&gt; - A posix API. The man page is excellent, so let&#x27;s copy the important bits:&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;pre style=&quot;background-color:#282a36;color:#f8f8f2;&quot;&gt;&lt;code&gt;&lt;span&gt;$ man 2 select
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;...
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;select() allows a program to monitor multiple file descriptors, waiting until one or more of the
&lt;&#x2F;span&gt;&lt;span&gt;file descriptors become &amp;quot;ready&amp;quot; for some class of I&#x2F;O operation (e.g., input possible). A file
&lt;&#x2F;span&gt;&lt;span&gt;descriptor is considered ready if it is possible to perform a corresponding I&#x2F;O operation (e.g.,
&lt;&#x2F;span&gt;&lt;span&gt;read(2), or a sufficiently small write(2)) without blocking.
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;select() can monitor only file descriptors numbers that are less than FD_SETSIZE; poll(2) and
&lt;&#x2F;span&gt;&lt;span&gt;epoll(7) do not have this limitation. See BUGS.
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;...
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;blockquote&gt;
&lt;p&gt;The main takeaway is that &lt;code&gt;select&lt;&#x2F;code&gt; is limited to monitoring &lt;code&gt;FD_SETSIZE&lt;&#x2F;code&gt; fds, which is usually 1024 (glibc). Another thing to note is that each call scans all registered fds (&lt;code&gt;O(n)&lt;&#x2F;code&gt;), so when you have a lot of fds, you can spend a lot of time just on this.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;poll(2)&lt;&#x2F;strong&gt; - A posix API. Not limited to &lt;code&gt;FD_SETSIZE&lt;&#x2F;code&gt; but still &lt;code&gt;O(n)&lt;&#x2F;code&gt; like &lt;code&gt;select&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;epoll(7)&lt;&#x2F;strong&gt; - A linux API. A non-portable &lt;code&gt;poll&lt;&#x2F;code&gt;, but at least it scales better, as it&#x27;s &lt;code&gt;O(1)&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;aio(7)&lt;&#x2F;strong&gt; - A linux API. Unlike previous APIs, it supports both sockets and files, but with some major disadvantages:
&lt;ul&gt;
&lt;li&gt;Supports only files opened with &lt;code&gt;O_DIRECT&lt;&#x2F;code&gt;, which are complex to work with.&lt;&#x2F;li&gt;
&lt;li&gt;Blocks until file metadata becomes available, if it isn&#x27;t already.&lt;&#x2F;li&gt;
&lt;li&gt;Blocks when the storage device is out of request slots (each storage device has a fixed number of slots).&lt;&#x2F;li&gt;
&lt;li&gt;Each I&#x2F;O submission copies 72 bytes and each completion copies 32 bytes, so 104 bytes are copied for each I&#x2F;O operation, across 2 syscalls.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;io_uring(7)&lt;&#x2F;strong&gt; - A linux API (since 5.1). Like &lt;code&gt;aio&lt;&#x2F;code&gt;, it unifies disk and network operations under a single API, but without &lt;code&gt;aio&lt;&#x2F;code&gt;&#x27;s shortcomings. It is designed to be fast, creating 2 queues that live in memory shared between user and kernel space: one for submitting I&#x2F;O operations, the other populated with the results of those operations once they are ready. For more info head over to &lt;a href=&quot;https:&#x2F;&#x2F;unixism.net&#x2F;loti&#x2F;what_is_io_uring.html&quot;&gt;&amp;quot;What is io_uring?&amp;quot;&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;ScyllaDB&lt;&#x2F;code&gt; has successfully implemented its database using &lt;code&gt;aio&lt;&#x2F;code&gt; in &lt;a href=&quot;https:&#x2F;&#x2F;seastar.io&#x2F;&quot;&gt;seastar&lt;&#x2F;a&gt;; you can read more about async disk I&#x2F;O on &lt;a href=&quot;https:&#x2F;&#x2F;scylladb.com&#x2F;2017&#x2F;10&#x2F;05&#x2F;io-access-methods-scylla&#x2F;&quot;&gt;their blog&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;If you&#x27;re interested in platforms other than linux, windows has &lt;a href=&quot;https:&#x2F;&#x2F;learn.microsoft.com&#x2F;en-us&#x2F;windows&#x2F;win32&#x2F;fileio&#x2F;i-o-completion-ports&quot;&gt;I&#x2F;O Completion Ports&lt;&#x2F;a&gt;, while FreeBSD and macOS both use &lt;a href=&quot;https:&#x2F;&#x2F;man.freebsd.org&#x2F;cgi&#x2F;man.cgi?kqueue&quot;&gt;kqueue&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Prior to &lt;code&gt;io_uring&lt;&#x2F;code&gt;, async I&#x2F;O abstraction libraries used a thread pool to run disk I&#x2F;O in a non-blocking manner.&lt;&#x2F;p&gt;
&lt;p&gt;There are libraries like &lt;a href=&quot;https:&#x2F;&#x2F;libuv.org&#x2F;&quot;&gt;libuv&lt;&#x2F;a&gt; (which powers Node.js) that you can use to run highly concurrent servers on just a single thread (they finally switched to &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;libuv&#x2F;libuv&#x2F;pull&#x2F;3952&quot;&gt;use io_uring&lt;&#x2F;a&gt;). These kinds of libraries are often called &lt;code&gt;Event Loops&lt;&#x2F;code&gt;; let&#x27;s talk about them a bit.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;event-loop&quot;&gt;Event Loop&lt;&#x2F;h2&gt;
&lt;p&gt;At its essence an event loop (also sometimes called &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Reactor_pattern&quot;&gt;reactor pattern&lt;&#x2F;a&gt;) is basically this:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;should_continue_running&lt;&#x2F;span&gt;&lt;span&gt;():
&lt;&#x2F;span&gt;&lt;span&gt;    ready_fds &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;poll_fds&lt;&#x2F;span&gt;&lt;span&gt;()  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# Implemented using any of the async APIs we talked about.
&lt;&#x2F;span&gt;&lt;span&gt;    events &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;fds_to_events&lt;&#x2F;span&gt;&lt;span&gt;(ready_fds)
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;event &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;:= &lt;&#x2F;span&gt;&lt;span&gt;events&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;pop&lt;&#x2F;span&gt;&lt;span&gt;():
&lt;&#x2F;span&gt;&lt;span&gt;        event&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;callback&lt;&#x2F;span&gt;&lt;span&gt;()
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Yeah, that&#x27;s it. Just look at &lt;code&gt;libuv&lt;&#x2F;code&gt;&#x27;s &lt;code&gt;uv_run&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;c&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-c &quot;&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;uv_run&lt;&#x2F;span&gt;&lt;span&gt;(uv_loop_t&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;loop&lt;&#x2F;span&gt;&lt;span&gt;, uv_run_mode &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;mode&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span&gt; r;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; ...
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;(r &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;!= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;amp;&amp;amp;&lt;&#x2F;span&gt;&lt;span&gt; loop&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;stop_flag &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; ...
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;uv__run_pending&lt;&#x2F;span&gt;&lt;span&gt;(loop);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;uv__run_idle&lt;&#x2F;span&gt;&lt;span&gt;(loop);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;uv__run_prepare&lt;&#x2F;span&gt;&lt;span&gt;(loop);
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; ...
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;uv__io_poll&lt;&#x2F;span&gt;&lt;span&gt;(loop, timeout);
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; ...
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    r &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;uv__loop_alive&lt;&#x2F;span&gt;&lt;span&gt;(loop);
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; ...
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt; r;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And here&#x27;s &lt;code&gt;uv__run_pending&lt;&#x2F;code&gt; without omitting any details:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;c&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-c &quot;&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;static &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;uv__run_pending&lt;&#x2F;span&gt;&lt;span&gt;(uv_loop_t&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;loop&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span&gt; uv__queue&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span&gt; q;
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span&gt; uv__queue pq;
&lt;&#x2F;span&gt;&lt;span&gt;  uv__io_t&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span&gt; w;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;uv__queue_move&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span&gt;loop&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;pending_queue, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span&gt;pq);
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;!&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;uv__queue_empty&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span&gt;pq)) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    q &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;uv__queue_head&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span&gt;pq);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;uv__queue_remove&lt;&#x2F;span&gt;&lt;span&gt;(q);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;uv__queue_init&lt;&#x2F;span&gt;&lt;span&gt;(q);
&lt;&#x2F;span&gt;&lt;span&gt;    w &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;uv__queue_data&lt;&#x2F;span&gt;&lt;span&gt;(q, uv__io_t, pending_queue);
&lt;&#x2F;span&gt;&lt;span&gt;    w&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;cb&lt;&#x2F;span&gt;&lt;span&gt;(loop, w, POLLOUT);
&lt;&#x2F;span&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Pretty awesome, huh? This is what every Node.js application runs on.&lt;&#x2F;p&gt;
&lt;p&gt;What if you need to call a blocking 3rd party library function? For that, most event loop libraries provide a thread pool you can run arbitrary code on; in &lt;code&gt;libuv&lt;&#x2F;code&gt;, for example, you can use &lt;a href=&quot;https:&#x2F;&#x2F;docs.libuv.org&#x2F;en&#x2F;v1.x&#x2F;threadpool.html&quot;&gt;uv_queue_work()&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
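&lt;p&gt;The same escape hatch exists in Python&#x27;s asyncio: hand the blocking call to a thread pool so the loop itself never stalls. A minimal sketch (the &lt;code&gt;blocking_work&lt;&#x2F;code&gt; function here is made up for illustration):&lt;&#x2F;p&gt;

```python
import asyncio
import hashlib

def blocking_work(data):
    # Pretend this is a slow 3rd party call that would stall the loop.
    return hashlib.sha256(data).hexdigest()

async def main():
    loop = asyncio.get_running_loop()
    # Offload to the default thread pool, similar to uv_queue_work() in libuv.
    digest = await loop.run_in_executor(None, blocking_work, b"hello")
    return digest

result = asyncio.run(main())
print(result)
```

&lt;p&gt;&lt;code&gt;run_in_executor&lt;&#x2F;code&gt; plays the same role as &lt;code&gt;uv_queue_work()&lt;&#x2F;code&gt;: the event loop thread keeps spinning while a pool thread does the slow part.&lt;&#x2F;p&gt;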
&lt;blockquote&gt;
&lt;p&gt;Pop quiz: how would something like &lt;code&gt;setTimeout&lt;&#x2F;code&gt; be implemented in an event loop? If nothing comes to mind, try cloning &lt;code&gt;libuv&lt;&#x2F;code&gt; and reading the implementation of &lt;code&gt;uv__run_timers&lt;&#x2F;code&gt; in &lt;code&gt;src&#x2F;timer.c&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
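&lt;p&gt;If you want the rough idea without cloning anything: a common approach (and roughly what &lt;code&gt;libuv&lt;&#x2F;code&gt; does) is to keep timers in a min-heap keyed by deadline, fire every timer whose deadline has passed on each loop iteration, and use the nearest deadline as the poll timeout. A toy sketch, not &lt;code&gt;libuv&lt;&#x2F;code&gt;&#x27;s actual code:&lt;&#x2F;p&gt;

```python
import heapq
import itertools
import time

class TimerQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal deadlines

    def set_timeout(self, callback, delay):
        deadline = time.monotonic() + delay
        heapq.heappush(self._heap, (deadline, next(self._counter), callback))

    def next_timeout(self):
        # How long the loop may block in poll before the nearest timer is due.
        if not self._heap:
            return None
        return max(0.0, self._heap[0][0] - time.monotonic())

    def run_due(self):
        # Fire every timer whose deadline has already passed, in order.
        now = time.monotonic()
        while self._heap and self._heap[0][0] <= now:
            _, _, callback = heapq.heappop(self._heap)
            callback()

fired = []
timers = TimerQueue()
timers.set_timeout(lambda: fired.append("a"), 0.01)
timers.set_timeout(lambda: fired.append("b"), 0.02)
time.sleep(0.03)
timers.run_due()
print(fired)
```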
&lt;h3 id=&quot;event-driven-development&quot;&gt;Event Driven Development&lt;&#x2F;h3&gt;
&lt;p&gt;The programming model when using an event loop is inherently event driven, with each registered event having a callback to execute once it is ready.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s see what &lt;code&gt;echod&lt;&#x2F;code&gt; would look like using an imaginary event loop library instead (start with &lt;code&gt;serve&lt;&#x2F;code&gt; at the bottom):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;c&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-c &quot;&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;struct &lt;&#x2F;span&gt;&lt;span&gt;context &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span&gt; buffer[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;2048&lt;&#x2F;span&gt;&lt;span&gt;];
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#66d9ef;&quot;&gt;size_t&lt;&#x2F;span&gt;&lt;span&gt; offset;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#66d9ef;&quot;&gt;size_t&lt;&#x2F;span&gt;&lt;span&gt; length;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;on_write&lt;&#x2F;span&gt;&lt;span&gt;(loop_t&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;loop&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span&gt; context&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;ctx&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;client_fd&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#66d9ef;&quot;&gt;ssize_t &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;written&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(written &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;== -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; Error, stop.
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;free&lt;&#x2F;span&gt;&lt;span&gt;(ctx);
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    ctx&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;offset &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;+=&lt;&#x2F;span&gt;&lt;span&gt; written;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(ctx&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;offset &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span&gt; ctx&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;length) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span&gt; buf &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; ctx&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;buffer &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span&gt; ctx&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;offset;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#66d9ef;&quot;&gt;size_t&lt;&#x2F;span&gt;&lt;span&gt; len &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; ctx&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;length &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span&gt; ctx&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;offset;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;register_write&lt;&#x2F;span&gt;&lt;span&gt;(loop, ctx, client_fd, buf, len, on_write);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;} &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;register_read&lt;&#x2F;span&gt;&lt;span&gt;(loop, ctx, client_fd, on_read);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;on_read&lt;&#x2F;span&gt;&lt;span&gt;(loop_t&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;loop&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span&gt; context&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;ctx&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;client_fd&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#66d9ef;&quot;&gt;ssize_t &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;read&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(read &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;== -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; Error, stop.
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;free&lt;&#x2F;span&gt;&lt;span&gt;(ctx);
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    ctx&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;offset &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    ctx&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;length &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; read;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;register_write&lt;&#x2F;span&gt;&lt;span&gt;(loop, ctx, client_fd, ctx&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;buffer, read, on_write);
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;on_accept&lt;&#x2F;span&gt;&lt;span&gt;(loop_t&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;loop&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;void&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;unused&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;server_fd&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;client_fd&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span&gt; context&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span&gt; ctx &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;malloc&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;sizeof&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span&gt;ctx));
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;register_read&lt;&#x2F;span&gt;&lt;span&gt;(loop, ctx, client_fd, ctx&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;buffer, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;sizeof&lt;&#x2F;span&gt;&lt;span&gt;(ctx&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;buffer), on_read);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;register_accept&lt;&#x2F;span&gt;&lt;span&gt;(loop, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;NULL&lt;&#x2F;span&gt;&lt;span&gt;, server_fd, on_accept);
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;serve&lt;&#x2F;span&gt;&lt;span&gt;(loop_t&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;loop&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;server_fd&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;register_accept&lt;&#x2F;span&gt;&lt;span&gt;(loop, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;NULL&lt;&#x2F;span&gt;&lt;span&gt;, server_fd, on_accept);
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;run_event_loop&lt;&#x2F;span&gt;&lt;span&gt;(loop);
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If you&#x27;ve worked with JavaScript before, this should look familiar.&lt;&#x2F;p&gt;
&lt;p&gt;This code is lightweight in terms of performance and highly concurrent, but the style is often criticized for being:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Not intuitive&lt;&#x2F;strong&gt; - Complex for most programmers who are used to reading and writing synchronous code, see &lt;a href=&quot;http:&#x2F;&#x2F;callbackhell.com&quot;&gt;&amp;quot;Callback Hell&amp;quot;&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Hard to debug&lt;&#x2F;strong&gt; - The call stack is very short and will not show you the flow of how you got to a specific breakpoint. The caller of each callback will always be &lt;code&gt;uv_run&lt;&#x2F;code&gt; in &lt;code&gt;libuv&lt;&#x2F;code&gt;, or &lt;code&gt;run_event_loop&lt;&#x2F;code&gt; in our example.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;A lot of modern programming languages and runtimes try to solve these problems by letting you write code that looks synchronous while being fully asynchronous. In the next chapter, we&#x27;re gonna learn how.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;preemption&quot;&gt;Preemption&lt;&#x2F;h1&gt;
&lt;p&gt;Have you ever wondered how you can run more than one thread on a computer with just a single CPU core?&lt;&#x2F;p&gt;
&lt;p&gt;In this section, we&#x27;ll go over the secret technique that enables this magic, called &lt;strong&gt;preemption&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s say we have the following two tasks we would like to execute concurrently:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;font-style:italic;color:#ff79c6;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;task_0&lt;&#x2F;span&gt;&lt;span&gt;():
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;True&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;print&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;#39;0&amp;#39;&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ff79c6;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;task_1&lt;&#x2F;span&gt;&lt;span&gt;():
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;True&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;print&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;#39;1&amp;#39;&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;run_all&lt;&#x2F;span&gt;&lt;span&gt;([task_0, task_1])
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;What should &lt;code&gt;run_all&lt;&#x2F;code&gt; do to make sure that both tasks run &lt;strong&gt;concurrently&lt;&#x2F;strong&gt;?&lt;&#x2F;p&gt;
&lt;p&gt;If we had 2 CPU cores, we could simply run each task on its own core, which would mean the two tasks run in &lt;strong&gt;parallel&lt;&#x2F;strong&gt;, or in other words at the same time in the &lt;em&gt;real&lt;&#x2F;em&gt; world (at least the one we base physics on).&lt;&#x2F;p&gt;
&lt;p&gt;A &lt;strong&gt;concurrent&lt;&#x2F;strong&gt; program deals with multiple things at once without necessarily running them at the same time, so its tasks may seem &lt;strong&gt;parallel&lt;&#x2F;strong&gt; even when they&#x27;re really not.&lt;&#x2F;p&gt;
&lt;p&gt;One way &lt;code&gt;run_all&lt;&#x2F;code&gt; can achieve &lt;strong&gt;concurrency&lt;&#x2F;strong&gt; is to run one task for some amount of time, pause it, then resume the next task, looping like this until all tasks exit. To magically make it appear &lt;strong&gt;parallel&lt;&#x2F;strong&gt;, you simply need to configure the amount of time before pausing to be really small relative to the human experience (e.g. 100μs).&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;from &lt;&#x2F;span&gt;&lt;span&gt;itertools &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;import &lt;&#x2F;span&gt;&lt;span&gt;cycle  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# cycle([1, 2, 3]) -&amp;gt; (1, 2, 3, 1, 2, 3, ...)
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ff79c6;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;run_all&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;tasks&lt;&#x2F;span&gt;&lt;span&gt;):
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;task &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;cycle&lt;&#x2F;span&gt;&lt;span&gt;(tasks):
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;run_task_for_a_little&lt;&#x2F;span&gt;&lt;span&gt;(task)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But how do you pause and resume execution of code?&lt;&#x2F;p&gt;
&lt;p&gt;The answer lies in programs being deterministic state machines: as long as you give a program&#x27;s executor (e.g. the CPU for native code) the same inputs (registers, memory, etc...), it doesn&#x27;t matter if it executes today or in a few years, the output will be the same.&lt;&#x2F;p&gt;
&lt;p&gt;Basically, pausing a task can be implemented by copying the current state of the program, and resuming it by loading that saved state.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;It doesn&#x27;t matter if the program runs on a real CPU or a virtual one like in &lt;code&gt;python&lt;&#x2F;code&gt;&#x27;s bytecode or on the &lt;code&gt;JVM&lt;&#x2F;code&gt; for example, they are all deterministic state machines. As long as you copy all the necessary state, the task will resume as if it was never even paused.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
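&lt;p&gt;You can get a feel for this in Python without touching the CPU: a generator&#x27;s frame is exactly such saved state. &lt;code&gt;yield&lt;&#x2F;code&gt; pauses the function and &lt;code&gt;next()&lt;&#x2F;code&gt; resumes it with all locals intact. This is cooperative rather than preemptive (the task pauses itself), but the save&#x2F;load idea is the same:&lt;&#x2F;p&gt;

```python
def counter_task(name):
    # Local state lives in the generator's frame; it survives each pause.
    for i in range(3):
        yield f"{name}:{i}"  # pause here, resume on the next next()

def run_all(tasks):
    # Round-robin: resume each task for one step until all are done.
    output = []
    tasks = list(tasks)
    while tasks:
        task = tasks.pop(0)
        try:
            output.append(next(task))
            tasks.append(task)  # not done yet, schedule it again
        except StopIteration:
            pass  # task finished, drop it
    return output

result = run_all([counter_task("t0"), counter_task("t1")])
print(result)
```

&lt;p&gt;Swap the generator frames for saved register state and the explicit &lt;code&gt;yield&lt;&#x2F;code&gt; points for a timer interrupt, and you have preemption.&lt;&#x2F;p&gt;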
&lt;p&gt;To make it easier to understand how you might implement preemption (saving and loading of program state), let&#x27;s look at &lt;code&gt;setjmp.h&lt;&#x2F;code&gt;, which is part of &lt;code&gt;libc&lt;&#x2F;code&gt; and implements saving and loading program state across many different CPU architectures.&lt;&#x2F;p&gt;
&lt;p&gt;See the following example (copied directly from &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Setjmp.h&quot;&gt;wikipedia&lt;&#x2F;a&gt;):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;c&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-c &quot;&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;#include &lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;lt;stdio.h&amp;gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;#include &lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;lt;setjmp.h&amp;gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;static&lt;&#x2F;span&gt;&lt;span&gt; jmp_buf buf;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;second&lt;&#x2F;span&gt;&lt;span&gt;() &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;printf&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;second&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;);         &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; prints
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;longjmp&lt;&#x2F;span&gt;&lt;span&gt;(buf,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;);             &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; jumps back to where setjmp was called - making setjmp now return 1
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;first&lt;&#x2F;span&gt;&lt;span&gt;() &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;second&lt;&#x2F;span&gt;&lt;span&gt;();
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;printf&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;first&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;);          &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; does not print
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;main&lt;&#x2F;span&gt;&lt;span&gt;() &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;!&lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;setjmp&lt;&#x2F;span&gt;&lt;span&gt;(buf))
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;first&lt;&#x2F;span&gt;&lt;span&gt;();                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; when executed, setjmp returned 0
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;else                        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; when longjmp jumps back, setjmp returns 1
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;printf&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;main&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;);       &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; prints
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Running it will output:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sh&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-sh &quot;&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;$&lt;&#x2F;span&gt;&lt;span&gt; gcc example.c&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt; -o&lt;&#x2F;span&gt;&lt;span&gt; example &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;amp;&amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;.&#x2F;example
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;second
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;main
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;setjmp&lt;&#x2F;code&gt; saves the program state (in &lt;code&gt;buf&lt;&#x2F;code&gt;), and &lt;code&gt;longjmp&lt;&#x2F;code&gt; loads whatever is in &lt;code&gt;buf&lt;&#x2F;code&gt; to the CPU.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s look behind the curtain. The following is the &lt;code&gt;x86_64&lt;&#x2F;code&gt; assembly for &lt;code&gt;setjmp&lt;&#x2F;code&gt; and &lt;code&gt;longjmp&lt;&#x2F;code&gt; in &lt;a href=&quot;https:&#x2F;&#x2F;musl.libc.org&#x2F;&quot;&gt;musl&lt;&#x2F;a&gt; (a popular &lt;code&gt;libc&lt;&#x2F;code&gt; implementation):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;asm&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-asm &quot;&gt;&lt;code class=&quot;language-asm&quot; data-lang=&quot;asm&quot;&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;setjmp:
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;mov &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rbx&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;(%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdi&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;         ; rdi is jmp_buf, move registers onto it
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;mov &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rbp&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;(%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdi&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;mov &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;r12&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;(%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdi&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;mov &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;r13&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;24&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;(%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdi&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;mov &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;r14&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;(%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdi&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;mov &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;r15&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;40&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;(%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdi&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;lea &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;(%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rsp&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdx&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;        ; this is our rsp WITHOUT current ret addr
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;mov &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdx&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;48&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;(%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdi&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;mov &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;(%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rsp&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdx&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;         ; save return addr ptr for new rip
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;mov &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdx&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;56&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;(%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdi&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;xor &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;eax&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;eax&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;           ; always return 0
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;ret
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;longjmp:
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;xor &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;eax&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;eax
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;cmp &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;$&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;esi&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;             ; CF = val ? 0 : 1
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;adc &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;esi&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;eax&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;           ; eax = val + !val
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;mov &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;(%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdi&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rbx&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;         ; rdi is the jmp_buf, restore regs from it
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;mov &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;(%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdi&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rbp
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;mov &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;16&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;(%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdi&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;r12
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;mov &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;24&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;(%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdi&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;r13
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;mov &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;(%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdi&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;r14
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;mov &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;40&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;(%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdi&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;r15
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;mov &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;48&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;(%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdi&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rsp
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;jmp &lt;&#x2F;span&gt;&lt;span&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;56&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;(%&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;rdi&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;           ; goto saved address without altering rsp
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Don&#x27;t stress if you don&#x27;t understand assembly. The point is that saving and loading program state is short and simple.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;code&gt;setjmp&lt;&#x2F;code&gt; saves all callee-saved registers into &lt;code&gt;jmp_buf&lt;&#x2F;code&gt;. Callee-saved registers are registers used to hold long-lived values that should be preserved across function calls. &lt;code&gt;longjmp&lt;&#x2F;code&gt; restores the callee-saved registers stored inside a &lt;code&gt;jmp_buf&lt;&#x2F;code&gt; directly to the CPU registers.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;For the curious: caller-saved registers (like &lt;code&gt;rcx&lt;&#x2F;code&gt;) are not saved because, to the compiler, &lt;code&gt;setjmp&lt;&#x2F;code&gt; is just another function call. As with any call, the compiler assumes these registers may be clobbered, so it never relies on them to hold live state across the call.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;non-preemptive-schedulers&quot;&gt;Non-Preemptive Schedulers&lt;&#x2F;h2&gt;
&lt;p&gt;Already, we have a solid foundation to start running multiple tasks concurrently.&lt;&#x2F;p&gt;
&lt;p&gt;Instead of relying on a timer to pause a running task, we can have the programmer manually insert yield points built on &lt;code&gt;setjmp&lt;&#x2F;code&gt; and &lt;code&gt;longjmp&lt;&#x2F;code&gt;. Here&#x27;s an example (this time in C, using &lt;code&gt;setjmp.h&lt;&#x2F;code&gt;):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;c&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-c &quot;&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;#include &lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;lt;stdbool.h&amp;gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;#include &lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;lt;stdio.h&amp;gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;#include &lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;lt;setjmp.h&amp;gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;jmp_buf&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span&gt; current_buffer;
&lt;&#x2F;span&gt;&lt;span&gt;jmp_buf main_buffer;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;#define &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;ARRAY_SIZE&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;arr&lt;&#x2F;span&gt;&lt;span&gt;) (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;sizeof&lt;&#x2F;span&gt;&lt;span&gt;(arr) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&#x2F; sizeof&lt;&#x2F;span&gt;&lt;span&gt;(arr[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;]))
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;#define &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;YIELD&lt;&#x2F;span&gt;&lt;span&gt;() &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;!&lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;setjmp&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span&gt;current_buffer)) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;longjmp&lt;&#x2F;span&gt;&lt;span&gt;(main_buffer, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;); &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;task_0&lt;&#x2F;span&gt;&lt;span&gt;() &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;printf&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;YIELD&lt;&#x2F;span&gt;&lt;span&gt;();
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;task_1&lt;&#x2F;span&gt;&lt;span&gt;() &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;printf&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;YIELD&lt;&#x2F;span&gt;&lt;span&gt;();
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;main&lt;&#x2F;span&gt;&lt;span&gt;() &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;void&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span&gt;tasks[])(&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;void&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span&gt;task_0, task_1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    jmp_buf buffers[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;ARRAY_SIZE&lt;&#x2F;span&gt;&lt;span&gt;(tasks)];
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;bool&lt;&#x2F;span&gt;&lt;span&gt; started &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;false&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;; i &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;ARRAY_SIZE&lt;&#x2F;span&gt;&lt;span&gt;(tasks); i&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;setjmp&lt;&#x2F;span&gt;&lt;span&gt;(main_buffer)) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;continue&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;            current_buffer &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &amp;amp;&lt;&#x2F;span&gt;&lt;span&gt;buffers[i];
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;!&lt;&#x2F;span&gt;&lt;span&gt;started) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;                tasks[i]();
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;} &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;                &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;longjmp&lt;&#x2F;span&gt;&lt;span&gt;(buffers[i], &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;        started &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let&#x27;s run it:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sh&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-sh &quot;&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;$&lt;&#x2F;span&gt;&lt;span&gt; gcc example.c&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt; -o&lt;&#x2F;span&gt;&lt;span&gt; example &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;amp;&amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;.&#x2F;example
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;0
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;0
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;...
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;^C
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Cool, but... There&#x27;s actually a hidden bug (can you find it?).&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s change &lt;code&gt;task_0&lt;&#x2F;code&gt; to hold some state on the stack:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;c&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-c &quot;&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;task_0&lt;&#x2F;span&gt;&lt;span&gt;() &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;printf&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;0: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;%d&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, i&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;YIELD&lt;&#x2F;span&gt;&lt;span&gt;();
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Run it again:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sh&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-sh &quot;&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;$&lt;&#x2F;span&gt;&lt;span&gt; gcc example.c&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt; -o&lt;&#x2F;span&gt;&lt;span&gt; example &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;amp;&amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;.&#x2F;example
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;0:&lt;&#x2F;span&gt;&lt;span&gt; 0
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;0:&lt;&#x2F;span&gt;&lt;span&gt; 32765
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;0:&lt;&#x2F;span&gt;&lt;span&gt; 32765
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;...
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;^C
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Whoops... Because all our tasks share the same stack, each task (including our &lt;code&gt;main&lt;&#x2F;code&gt; function) can overwrite data the others have pushed. See the following illustration:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#282a36;color:#f8f8f2;&quot;&gt;&lt;code&gt;&lt;span&gt;----------------
&lt;&#x2F;span&gt;&lt;span&gt;| main&amp;#39;s stack |
&lt;&#x2F;span&gt;&lt;span&gt;----------------
&lt;&#x2F;span&gt;&lt;span&gt;               ^
&lt;&#x2F;span&gt;&lt;span&gt;First, a call to setjmp in main saves the stack address here.
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;pre style=&quot;background-color:#282a36;color:#f8f8f2;&quot;&gt;&lt;code&gt;&lt;span&gt;---------------------------------
&lt;&#x2F;span&gt;&lt;span&gt;| main&amp;#39;s stack | task_0&amp;#39;s stack |
&lt;&#x2F;span&gt;&lt;span&gt;---------------------------------
&lt;&#x2F;span&gt;&lt;span&gt;                                ^
&lt;&#x2F;span&gt;&lt;span&gt;Then, a call to setjmp in task_0 saves the stack address here.
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;pre style=&quot;background-color:#282a36;color:#f8f8f2;&quot;&gt;&lt;code&gt;&lt;span&gt;---------------------------------
&lt;&#x2F;span&gt;&lt;span&gt;| main&amp;#39;s stack | task_0&amp;#39;s stack |
&lt;&#x2F;span&gt;&lt;span&gt;---------------------------------
&lt;&#x2F;span&gt;&lt;span&gt;               ^
&lt;&#x2F;span&gt;&lt;span&gt;Once we longjmp back to main, we reset the stack register (rsp in x86_64) back here.
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;pre style=&quot;background-color:#282a36;color:#f8f8f2;&quot;&gt;&lt;code&gt;&lt;span&gt;Then any push to the stack will overwrite data saved in task_0&amp;#39;s stack.
&lt;&#x2F;span&gt;&lt;span&gt;So once we longjmp back to task_0, this is what our stack will look like:
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;---------------------------------
&lt;&#x2F;span&gt;&lt;span&gt;| main&amp;#39;s stack | &amp;quot;random&amp;quot; stuff |
&lt;&#x2F;span&gt;&lt;span&gt;---------------------------------
&lt;&#x2F;span&gt;&lt;span&gt;                                ^
&lt;&#x2F;span&gt;&lt;span&gt;No wonder we get a random looking value when we print something inside the stack.
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The fix is to create a stack for each task, and switch to it right before calling the task function:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;diff&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-diff &quot;&gt;&lt;code class=&quot;language-diff&quot; data-lang=&quot;diff&quot;&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;@@ -25,6 +26,7 @@ void task_1() {
&lt;&#x2F;span&gt;&lt;span&gt; int main() {
&lt;&#x2F;span&gt;&lt;span&gt;     void(*tasks[])(void) = {task_0, task_1};
&lt;&#x2F;span&gt;&lt;span&gt;     jmp_buf buffers[ARRAY_SIZE(tasks)];
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;+    char stacks[ARRAY_SIZE(tasks)][1024];  &#x2F;&#x2F; Stack size of 1kb.
&lt;&#x2F;span&gt;&lt;span&gt;     bool started = false;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;     while (true) {
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;@@ -35,7 +37,12 @@ int main() {
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;             current_buffer = &amp;amp;buffers[i];
&lt;&#x2F;span&gt;&lt;span&gt;             if (!started) {
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-                tasks[i]();
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;+                &#x2F;&#x2F; Stack goes down on push, up on pop.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;+                char* stack = stacks[i] + sizeof(stacks[i]);
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;+                asm(&amp;quot;movq %0, %%rax;&amp;quot;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;+                    &amp;quot;movq %1, %%rsp;&amp;quot;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;+                    &amp;quot;call *%%rax&amp;quot;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;+                    :: &amp;quot;rm&amp;quot; (tasks[i]), &amp;quot;rm&amp;quot; (stack) : &amp;quot;rax&amp;quot;);
&lt;&#x2F;span&gt;&lt;span&gt;             } else {
&lt;&#x2F;span&gt;&lt;span&gt;                 longjmp(buffers[i], 1);
&lt;&#x2F;span&gt;&lt;span&gt;             }
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;blockquote&gt;
&lt;p&gt;The reason for saving the task function in the register &lt;code&gt;rax&lt;&#x2F;code&gt; is to avoid looking up &lt;code&gt;tasks[i]&lt;&#x2F;code&gt; through the stack, as we just switched the stack to some other memory location. The &lt;code&gt;asm&lt;&#x2F;code&gt; syntax is fully documented &lt;a href=&quot;https:&#x2F;&#x2F;gcc.gnu.org&#x2F;onlinedocs&#x2F;gcc&#x2F;extensions-to-the-c-language-family&#x2F;how-to-use-inline-assembly-language-in-c-code.html&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Run it one last time:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sh&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-sh &quot;&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;$&lt;&#x2F;span&gt;&lt;span&gt; gcc example.c&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt; -o&lt;&#x2F;span&gt;&lt;span&gt; example &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;amp;&amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;.&#x2F;example
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;0:&lt;&#x2F;span&gt;&lt;span&gt; 0
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;0:&lt;&#x2F;span&gt;&lt;span&gt; 1
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;0:&lt;&#x2F;span&gt;&lt;span&gt; 2
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;...
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;^C
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We have successfully implemented a user mode non-preemptive scheduler!&lt;&#x2F;p&gt;
&lt;p&gt;In real non-preemptive (also called cooperative) systems, the runtime yields when it knows the CPU has nothing useful to do in the current task, for example while waiting on I&#x2F;O. It does this by registering for the I&#x2F;O and moving the task to a separate queue that holds blocked tasks (which the scheduler skips). Once the I&#x2F;O completes, the task is moved from the blocked queue back to the regular queue for execution. This can be done, for example, by integrating with an event loop.&lt;&#x2F;p&gt;
&lt;p&gt;Here are some examples of non-preemptive schedulers in popular mainstream runtimes:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Rust&#x27;s &lt;a href=&quot;https:&#x2F;&#x2F;tokio.rs&#x2F;&quot;&gt;tokio&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt; - To yield, you either call &lt;code&gt;tokio::task::yield_now()&lt;&#x2F;code&gt;, or run until blocking (e.g. waiting on I&#x2F;O or &lt;code&gt;tokio::time::sleep()&lt;&#x2F;code&gt;). In version 0.3.1 they introduced an &lt;a href=&quot;https:&#x2F;&#x2F;tokio.rs&#x2F;blog&#x2F;2020-04-preemption&quot;&gt;automatic yield&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Go (prior to 1.14)&lt;&#x2F;strong&gt; - At release (version 1.0), to yield, you would either call &lt;code&gt;runtime.Gosched()&lt;&#x2F;code&gt;, or run until blocking. Since version 1.2, the scheduler is also invoked occasionally upon entry to a function.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Erlang&lt;&#x2F;strong&gt; - In &lt;a href=&quot;https:&#x2F;&#x2F;blog.stenmans.org&#x2F;theBeamBook&#x2F;&quot;&gt;BEAM&lt;&#x2F;a&gt; (Erlang&#x27;s awesome runtime), the scheduler is invoked at function calls. Since there are no loop constructs other than recursion and list comprehensions, there is no way to loop forever without making a function call. You can cheat, though, by running native C code using a &lt;code&gt;NIF&lt;&#x2F;code&gt; (natively implemented function).&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Non-preemptive schedulers are risky, as they assume developers remember to insert &lt;code&gt;yield&lt;&#x2F;code&gt; calls when doing long computations:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; In Go (prior to 1.14), this code would not yield until done.
&lt;&#x2F;span&gt;&lt;span&gt;sum &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;:= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;i &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;:= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;i &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1e8&lt;&#x2F;span&gt;&lt;span&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;i&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;++ &lt;&#x2F;span&gt;&lt;span&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;sum&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;++
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h2 id=&quot;preemptive-schedulers&quot;&gt;Preemptive Schedulers&lt;&#x2F;h2&gt;
&lt;p&gt;A preemptive scheduler context switches (yields) once in a while, even without a developer inserting yield calls.&lt;&#x2F;p&gt;
&lt;p&gt;Most modern operating systems utilize timer interrupts: the CPU receives an interrupt once every fixed interval of time. The interrupt stops the execution of whatever is currently running, and the interrupt handler calls the scheduler, which decides whether to context switch.&lt;&#x2F;p&gt;
&lt;p&gt;That&#x27;s cool and all, but user mode applications can&#x27;t register interrupt handlers, so what can we do if we want to implement a preemptive scheduler in user mode?&lt;&#x2F;p&gt;
&lt;p&gt;One simple solution would be to utilize the kernel&#x27;s preemptive scheduler: create a thread that periodically sends a signal to the threads running our scheduler.&lt;&#x2F;p&gt;
&lt;p&gt;This is exactly how Go made their scheduler preemptive in version 1.14: by periodically sending signals from their monitoring thread (&lt;a href=&quot;https:&#x2F;&#x2F;sobyte.net&#x2F;post&#x2F;2021-12&#x2F;golang-sysmon&#x2F;&quot;&gt;runtime.sysmon&lt;&#x2F;a&gt;) to the scheduler threads running goroutines.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;For more info on their solution, I recommend you watch &lt;a href=&quot;https:&#x2F;&#x2F;youtube.com&#x2F;watch?v=1I1WmeSjRSw&quot;&gt;&amp;quot;Pardon the Interruption: Loop Preemption in Go 1.14&amp;quot;&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;stackful-vs-stackless&quot;&gt;Stackful vs Stackless&lt;&#x2F;h2&gt;
&lt;p&gt;Up until now, I have been calling them tasks to avoid confusion, but they go by many different names: fibers, greenlets, user mode threads, green threads, virtual threads, coroutines and goroutines.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;When people say threads, they usually mean OS threads (managed by the kernel scheduler).&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;A coroutine is simply a program that can be paused and resumed. There are mainly two ways to implement them: either you allocate a stack for each coroutine (stackful), or you make each function marked as &lt;code&gt;async&lt;&#x2F;code&gt; return an object that can hold all the state needed to pause and resume that function (stackless).&lt;&#x2F;p&gt;
&lt;p&gt;Stackful and stackless impact the API greatly, each with its own advantages and disadvantages. Here&#x27;s an overview:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Stackful&lt;&#x2F;strong&gt; - Coroutines have the exact same API and semantics as OS threads, which makes sense, as they both allocate a stack at runtime. Our example scheduler using &lt;code&gt;setjmp&lt;&#x2F;code&gt; is stackful. Go is another example of a stackful implementation. Just like Go needs to periodically context switch, it also needs to periodically check whether there is enough free stack space to continue running. If not, it reallocates the stack with more memory, copies over what it had before, and fixes all pointers that pointed into the old stack to point into the new one. Just like the stack can grow dynamically, it can also shrink if needed. The real beauty is that you can choose to run any function either synchronously or asynchronously in the background, without affecting the code around it:&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;pre data-lang=&quot;go&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-go &quot;&gt;&lt;code class=&quot;language-go&quot; data-lang=&quot;go&quot;&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;go &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;fmt&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;Println&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;world&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;fmt&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;Println&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;hello&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Stackless&lt;&#x2F;strong&gt; - If you have ever used a language with &lt;code&gt;async&lt;&#x2F;code&gt; &amp;amp; &lt;code&gt;await&lt;&#x2F;code&gt;, you&#x27;ve used a stackless implementation. Examples include Rust and Python&#x27;s &lt;code&gt;asyncio&lt;&#x2F;code&gt;. Rust&#x27;s &lt;code&gt;async&lt;&#x2F;code&gt; transforms a block of code into a state machine that is not run until you &lt;code&gt;await&lt;&#x2F;code&gt; it. The biggest advantage of this approach is how &lt;a href=&quot;https:&#x2F;&#x2F;pkolaczk.github.io&#x2F;memory-consumption-of-async&#x2F;&quot;&gt;lightweight it is at runtime&lt;&#x2F;a&gt;: memory is allocated exactly as needed, which also serves Rust&#x27;s embedded use case well. The main problem with this approach is &amp;quot;function coloring&amp;quot;: an &lt;code&gt;async&lt;&#x2F;code&gt; function can only be called inside another &lt;code&gt;async&lt;&#x2F;code&gt; function:&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;pre data-lang=&quot;rs&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-rs &quot;&gt;&lt;code class=&quot;language-rs&quot; data-lang=&quot;rs&quot;&gt;&lt;span&gt;async &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;say&lt;&#x2F;span&gt;&lt;span&gt;() &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    println!(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;hello world&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;fn &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;main&lt;&#x2F;span&gt;&lt;span&gt;() &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; Can&amp;#39;t call say(), as main() is not async.
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;mut&lt;&#x2F;span&gt;&lt;span&gt; rt &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#66d9ef;&quot;&gt;tokio&lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#ff79c6;&quot;&gt;::&lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#66d9ef;&quot;&gt;runtime&lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#ff79c6;&quot;&gt;::&lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#66d9ef;&quot;&gt;Runtime&lt;&#x2F;span&gt;&lt;span style=&quot;text-decoration:underline;color:#ff79c6;&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;new()&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;unwrap&lt;&#x2F;span&gt;&lt;span&gt;();
&lt;&#x2F;span&gt;&lt;span&gt;    rt&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;block_on&lt;&#x2F;span&gt;&lt;span&gt;(async &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;{
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#8be9fd;&quot;&gt;let&lt;&#x2F;span&gt;&lt;span&gt; future &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;say&lt;&#x2F;span&gt;&lt;span&gt;(); &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; Calling does not execute.
&lt;&#x2F;span&gt;&lt;span&gt;        future&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span&gt;await; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;&#x2F;&#x2F; Starts executing.
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Rust started with stackful prior to release, but ultimately ended up switching to stackless: &lt;a href=&quot;https:&#x2F;&#x2F;without.boats&#x2F;blog&#x2F;why-async-rust&#x2F;&quot;&gt;&amp;quot;Why async rust?&amp;quot;&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h1 id=&quot;scheduler-algorithms&quot;&gt;Scheduler Algorithms&lt;&#x2F;h1&gt;
&lt;p&gt;A scheduler is also responsible for deciding which task to run next once the current one finishes or yields.&lt;&#x2F;p&gt;
&lt;p&gt;One of the simplest methods, which we have already seen in the event loop section, is to run tasks in the order they are added to the task queue.&lt;&#x2F;p&gt;
&lt;p&gt;Linux&#x27;s &lt;code&gt;SCHED_FIFO&lt;&#x2F;code&gt; scheduler does exactly this:&lt;&#x2F;p&gt;
&lt;div class=&quot;block small-block&quot; style=&quot;height: 500px&quot;&gt;
  &lt;div id=&quot;app_sched_fifo&quot; style=&quot;height: 100%&quot;&gt;&lt;&#x2F;div&gt;
&lt;&#x2F;div&gt;
&lt;blockquote&gt;
&lt;p&gt;Each circle is a task. The white progress circle around tasks is the time left to run until the task is blocked.&lt;br&gt;
&lt;span style=&quot;color: #BD93F9&quot;&gt;&lt;b&gt;Purple box&lt;&#x2F;b&gt;&lt;&#x2F;span&gt; - The queue holding tasks ready to run.&lt;br&gt;
&lt;span style=&quot;color: #50FA7B&quot;&gt;&lt;b&gt;Green box&lt;&#x2F;b&gt;&lt;&#x2F;span&gt; - The CPU.&lt;br&gt;
&lt;span style=&quot;color: #7282c4&quot;&gt;&lt;b&gt;Gray box&lt;&#x2F;b&gt;&lt;&#x2F;span&gt; - Tasks blocked on something (e.g. I&#x2F;O).&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Taking &lt;code&gt;SCHED_FIFO&lt;&#x2F;code&gt; and adding a task runtime limit is what &lt;code&gt;SCHED_RR&lt;&#x2F;code&gt; does, allowing the CPU to be shared in a more uniform manner:&lt;&#x2F;p&gt;
&lt;div class=&quot;block small-block&quot; style=&quot;height: 500px&quot;&gt;
  &lt;div id=&quot;app_sched_rr&quot; style=&quot;height: 100%&quot;&gt;&lt;&#x2F;div&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;What if you have a task that &lt;em&gt;must&lt;&#x2F;em&gt; run once every 5ms, even if for a really short amount of time? For example, in audio programming you have a buffer to fill with a signal (e.g. &lt;code&gt;sin(x)&lt;&#x2F;code&gt;) in time, which the audio device reads from at some interval. Missing out on filling this buffer will result in a random signal, which sounds like crackling noise, potentially ruining a recording of an entire orchestra.&lt;&#x2F;p&gt;
&lt;p&gt;These kinds of programs are usually called soft real time programs. Hard real time means missing a deadline will result in the whole system failing, for example autopilots and spacecraft.&lt;&#x2F;p&gt;
&lt;p&gt;Linux has a nice answer for soft real time systems called &lt;code&gt;SCHED_DEADLINE&lt;&#x2F;code&gt;, where each thread sets the amount of time until its deadline, and the scheduler always runs the task that is closest to reaching its deadline:&lt;&#x2F;p&gt;
&lt;div class=&quot;block small-block&quot; style=&quot;height: 500px&quot;&gt;
  &lt;div id=&quot;app_sched_deadline&quot; style=&quot;height: 100%&quot;&gt;&lt;&#x2F;div&gt;
&lt;&#x2F;div&gt;
&lt;blockquote&gt;
&lt;p&gt;The &lt;span style=&quot;color: #a7fb78&quot;&gt;&lt;b&gt;green&lt;&#x2F;b&gt;&lt;&#x2F;span&gt; progress circle is how much time is left until the deadline.&lt;br&gt;
Follow the &lt;span style=&quot;color: #FF79C6&quot;&gt;&lt;b&gt;pink&lt;&#x2F;b&gt;&lt;&#x2F;span&gt; circle, it has a short deadline, making it run a lot more than others.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;SCHED_FIFO&lt;&#x2F;code&gt; and &lt;code&gt;SCHED_RR&lt;&#x2F;code&gt; can also be used in soft real time systems because of their deterministic nature, depending on the problem you need to solve.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;To guarantee all tasks are able to run according to their configured deadline, &lt;code&gt;SCHED_DEADLINE&lt;&#x2F;code&gt; performs admission control: it calculates and rejects thread configurations that would steal too much run time. You can learn more about it in lwn&#x27;s &lt;a href=&quot;https:&#x2F;&#x2F;lwn.net&#x2F;Articles&#x2F;743740&#x2F;&quot;&gt;&amp;quot;Deadline scheduling&amp;quot;&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;For general purpose workloads, like a laptop running arbitrary processes, you usually want fairness. Fairness can be achieved by continuously tracking how much CPU time each process has received, and always running the task with the lowest tracked runtime. Linux&#x27;s default scheduler &lt;code&gt;SCHED_OTHER&lt;&#x2F;code&gt;, also known as &lt;code&gt;CFS&lt;&#x2F;code&gt; (Completely Fair Scheduler), does exactly this. You can also assign priorities to processes by setting a &lt;code&gt;nice&lt;&#x2F;code&gt; value, where processes with a lower &lt;code&gt;nice&lt;&#x2F;code&gt; value are scheduled more.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;CFS&lt;&#x2F;code&gt; has served well for the last 16 years, but as of v6.6, the new default scheduling algorithm is &lt;a href=&quot;https:&#x2F;&#x2F;lwn.net&#x2F;Articles&#x2F;925371&#x2F;&quot;&gt;EEVDF&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;multi-core&quot;&gt;Multi-Core&lt;&#x2F;h2&gt;
&lt;p&gt;So far, I have pretty much ignored the fact that modern machines have more than 1 CPU core.&lt;&#x2F;p&gt;
&lt;p&gt;The simplest way to achieve multi-core scheduling is to do exactly as before: keep a global queue of tasks that are ready to run, and run them once a core is free:&lt;&#x2F;p&gt;
&lt;div class=&quot;block small-block&quot; style=&quot;height: 500px&quot;&gt;
  &lt;div id=&quot;app_sched_multicore&quot; style=&quot;height: 100%&quot;&gt;&lt;&#x2F;div&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;You just need to ensure that the task queue is thread-safe for &lt;code&gt;MPMC&lt;&#x2F;code&gt; operations (multi-producer multi-consumer), by using atomics or locks.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;MPMC&lt;&#x2F;code&gt; queues are a lot slower than the more restrictive &lt;code&gt;SPMC&lt;&#x2F;code&gt; (single-producer multi-consumer) queues, which is why Go decided to have a fixed size &lt;code&gt;SPMC&lt;&#x2F;code&gt; queue for each scheduler (Go runs a scheduler per core configured by &lt;code&gt;GOMAXPROCS&lt;&#x2F;code&gt;), with a global &lt;code&gt;MPMC&lt;&#x2F;code&gt; queue to push to when the &lt;code&gt;SPMC&lt;&#x2F;code&gt; queue is full.&lt;&#x2F;p&gt;
&lt;p&gt;To ensure all cores are fully utilized, when a core is free to run but has nothing in its local queue and there are no tasks in the global queue, it &lt;strong&gt;steals&lt;&#x2F;strong&gt; tasks from other local queues (which is why they are multi-consumer).&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Go&#x27;s solution is so good that tokio borrowed a lot from it. I highly recommend reading about it on their blog: &lt;a href=&quot;https:&#x2F;&#x2F;tokio.rs&#x2F;blog&#x2F;2019-10-scheduler&quot;&gt;&amp;quot;Making the Tokio scheduler 10x faster&amp;quot;&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h1&gt;
&lt;p&gt;Congratulations 🥳! You are a real hero for reaching the end; hopefully you have learned a thing or two.&lt;&#x2F;p&gt;
&lt;p&gt;There is a lot more to cover on this topic; the links left throughout this post are a great place to start exploring the endless rabbit hole of concurrency and parallelism.&lt;&#x2F;p&gt;
&lt;p&gt;If you want to play around with the animations yourself, here&#x27;s a link to the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tontinton&#x2F;sched_animation&quot;&gt;code&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a onclick=&quot;window.scrollTo(0,0);&quot; style=&quot;cursor: pointer;&quot;&gt;Click here&lt;&#x2F;a&gt; to scroll back to the animation at the top.&lt;&#x2F;p&gt;
&lt;script&gt;applySize(getSize())&lt;&#x2F;script&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Database Fundamentals</title>
        <published>2023-12-15T00:00:00+00:00</published>
        <updated>2023-12-15T00:00:00+00:00</updated>
        <author>
          <name>Unknown</name>
        </author>
        <link rel="alternate" href="https://tontinton.com/posts/database-fundementals/" type="text/html"/>
        <id>https://tontinton.com/posts/database-fundementals/</id>
        
        <content type="html">&lt;p&gt;About a year ago, I tried thinking which database I should choose for my next project, and came to the realization that I don&#x27;t really know the differences of databases enough. I went to different database websites and saw mostly marketing and words I don&#x27;t understand.&lt;&#x2F;p&gt;
&lt;p&gt;This is when I decided to read the excellent books &lt;code&gt;Database Internals&lt;&#x2F;code&gt; by Alex Petrov and &lt;code&gt;Designing Data-Intensive Applications&lt;&#x2F;code&gt; by Martin Kleppmann.&lt;&#x2F;p&gt;
&lt;p&gt;The books piqued my curiosity enough to write my own little database I called &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tontinton&#x2F;dbeel&quot;&gt;dbeel&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;This post is basically a short summary of these books, with a focus on the fundamental problems a database engineer thinks about in the shower.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;bashdb&quot;&gt;bashdb&lt;&#x2F;h1&gt;
&lt;p&gt;Let&#x27;s start with the simplest database program ever written, just 2 bash functions (we&#x27;ll call it &lt;code&gt;bashdb&lt;&#x2F;code&gt;):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;bash&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-bash &quot;&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;#!&#x2F;bin&#x2F;bash
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;db_set&lt;&#x2F;span&gt;&lt;span&gt;() {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;echo &lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;$&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;,$&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt; database
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;db_get&lt;&#x2F;span&gt;&lt;span&gt;() {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;grep &lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;^$&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;,&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt; database &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;sed&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt; -e &lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;s&#x2F;^$&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;,&#x2F;&#x2F;&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;tail&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt; -n&lt;&#x2F;span&gt;&lt;span&gt; 1
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Try it out:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;sh&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-sh &quot;&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;$&lt;&#x2F;span&gt;&lt;span&gt; db_set 500 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;#39;{&amp;quot;movie&amp;quot;: &amp;quot;Airplane!&amp;quot;, &amp;quot;rating&amp;quot;: 9}&amp;#39;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;$&lt;&#x2F;span&gt;&lt;span&gt; db_set 111 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;#39;{&amp;quot;movie&amp;quot;: &amp;quot;Tokio Drift&amp;quot;, &amp;quot;rating&amp;quot;: 6}&amp;#39;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;$&lt;&#x2F;span&gt;&lt;span&gt; db_get 500
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;movie&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;Airplane!&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;rating&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;: 9}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Before you continue reading, I want you to pause and think about why you wouldn&#x27;t use &lt;code&gt;bashdb&lt;&#x2F;code&gt; in production.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#282a36;color:#f8f8f2;&quot;&gt;&lt;code&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;Some space for you to think :)
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You probably came up with at least a dozen issues in &lt;code&gt;bashdb&lt;&#x2F;code&gt;. I won&#x27;t go over &lt;em&gt;all&lt;&#x2F;em&gt; of the possible issues; in this post I will focus on the following ones:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Durability&lt;&#x2F;strong&gt; - If the machine crashes after a successful &lt;code&gt;db_set&lt;&#x2F;code&gt;, the data might be lost, as it was not flushed to disk.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Atomicity&lt;&#x2F;strong&gt; - If the machine crashes while you call &lt;code&gt;db_set&lt;&#x2F;code&gt;, data might be written partially, corrupting our data.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Isolation&lt;&#x2F;strong&gt; - If one process calls &lt;code&gt;db_get&lt;&#x2F;code&gt;, while another calls &lt;code&gt;db_set&lt;&#x2F;code&gt; concurrently on the same item, the first process might read only part of the data, leading to a corrupt result.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Performance&lt;&#x2F;strong&gt; - &lt;code&gt;db_get&lt;&#x2F;code&gt; uses &lt;code&gt;grep&lt;&#x2F;code&gt;, so search goes line by line and is &lt;code&gt;O(n)&lt;&#x2F;code&gt;, where &lt;code&gt;n&lt;&#x2F;code&gt; is the number of items saved.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Could you figure out these problems yourself? If you could, well done, you don&#x27;t need me, you already understand databases 😀&lt;&#x2F;p&gt;
&lt;p&gt;In the next section, we&#x27;ll try to get rid of these problems, to make &lt;code&gt;bashdb&lt;&#x2F;code&gt; a &lt;em&gt;real&lt;&#x2F;em&gt; database we might use in production (not really, please don&#x27;t, just use &lt;code&gt;PostgreSQL&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;improving-bashdb-to-be-acid&quot;&gt;Improving bashdb to be ACID&lt;&#x2F;h2&gt;
&lt;p&gt;Before we begin, know that I did not come up with most of these problems on my own, they are part of an acronym named &lt;code&gt;ACID&lt;&#x2F;code&gt;, which almost all databases strive to guarantee:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Atomicity&lt;&#x2F;strong&gt; - Not to be confused with multi-threading&#x27;s definition of atomicity (which is closer to isolation): a transaction is atomic if, when a fault happens in the middle of a write, the database undoes or aborts it completely, as if the write never started, leaving no partially written data.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Consistency&lt;&#x2F;strong&gt; - Illegal transactions should not corrupt the database. To be honest, consistency in ACID is a bit convoluted and overloaded, and it&#x27;s less interesting.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Isolation&lt;&#x2F;strong&gt; - No race conditions in concurrent accesses to the same data. There are multiple isolation levels, and we will discuss some of them later.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Durability&lt;&#x2F;strong&gt; - The first thing that comes to mind when talking about a database. It should store data you wrote to it, forever, even in the event of monkeys pulling the power plug out.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Not all database transactions need to guarantee ACID, for some use cases, it is fine to drop guarantees for performance reasons.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;But &lt;em&gt;how&lt;&#x2F;em&gt; can we make &lt;code&gt;bashdb&lt;&#x2F;code&gt; ACID?&lt;&#x2F;p&gt;
&lt;p&gt;We can start with durability, as it&#x27;s pretty easy to make &lt;code&gt;bashdb&lt;&#x2F;code&gt; durable by running &lt;code&gt;sync&lt;&#x2F;code&gt; right after writing in &lt;code&gt;db_set&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;bash&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-bash &quot;&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;db_set&lt;&#x2F;span&gt;&lt;span&gt;() {
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;echo &lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;$&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;,$&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt; database &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;amp;&amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;sync&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt; -d&lt;&#x2F;span&gt;&lt;span&gt; database
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But wait a minute, what is going on, what is &lt;code&gt;sync&lt;&#x2F;code&gt; really doing? And what is that &lt;code&gt;-d&lt;&#x2F;code&gt;?&lt;&#x2F;p&gt;
&lt;h3 id=&quot;durability&quot;&gt;Durability&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;code&gt;write&lt;&#x2F;code&gt; syscall writes a buffer to a file, but who said it writes to disk?&lt;&#x2F;p&gt;
&lt;p&gt;The buffer you write could end up in any cache along the way to the non volatile memory. For example, the kernel stores the buffer in the page cache with each page marked as dirty, meaning it will flush it to disk sometime in the future.&lt;&#x2F;p&gt;
&lt;p&gt;To make matters worse, the disk device, or something managing your disks (for example a RAID system), might have a write cache as well.&lt;&#x2F;p&gt;
&lt;p&gt;So how do you tell all the systems in the middle to flush all dirty pages to the disk? For that we have &lt;code&gt;fsync&lt;&#x2F;code&gt; &#x2F; &lt;code&gt;fdatasync&lt;&#x2F;code&gt;, let&#x27;s see what &lt;code&gt;man&lt;&#x2F;code&gt; has to say:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#282a36;color:#f8f8f2;&quot;&gt;&lt;code&gt;&lt;span&gt;$ man 2 fsync
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;...
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;fsync() transfers (&amp;quot;flushes&amp;quot;) all modified in-core data of (i.e., modified buffer cache pages for)
&lt;&#x2F;span&gt;&lt;span&gt;the file referred to by the file descriptor fd to the disk device (or other permanent storage
&lt;&#x2F;span&gt;&lt;span&gt;device) so that all changed information can be retrieved even if the system crashes or is rebooted.
&lt;&#x2F;span&gt;&lt;span&gt;This includes writing through or flushing a disk cache if present.
&lt;&#x2F;span&gt;&lt;span&gt;The call blocks until the device reports that the transfer has completed.
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;...
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed
&lt;&#x2F;span&gt;&lt;span&gt;in order to allow a subsequent data retrieval to be correctly handled.
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;...
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In short, &lt;code&gt;fdatasync&lt;&#x2F;code&gt; flushes the dirty raw buffers we gave &lt;code&gt;write&lt;&#x2F;code&gt;. &lt;code&gt;fsync&lt;&#x2F;code&gt; also flushes the file&#x27;s metadata like &lt;code&gt;mtime&lt;&#x2F;code&gt;, which we don&#x27;t really care about.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;sync&lt;&#x2F;code&gt; program is basically like running &lt;code&gt;fsync&lt;&#x2F;code&gt; on all dirty pages, unless a specific file is specified as one of the arguments. It has the &lt;code&gt;-d&lt;&#x2F;code&gt; flag for us to call &lt;code&gt;fdatasync&lt;&#x2F;code&gt; instead of &lt;code&gt;fsync&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The biggest drawback of adding &lt;code&gt;sync&lt;&#x2F;code&gt; is worse performance; the sync is usually slower than even the write itself. But hey, at least we are now &lt;em&gt;durable&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
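&lt;p&gt;For reference, here&#x27;s what the same durable append looks like outside of bash; a minimal Python sketch (the function name is mine, not part of &lt;code&gt;bashdb&lt;&#x2F;code&gt;):&lt;&#x2F;p&gt;

```python
import os

def db_set_durable(path, key, value):
    # A minimal sketch of a durable append: the write only "counts"
    # once fsync() has returned successfully.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, f"{key},{value}\n".encode())
        os.fsync(fd)  # or os.fdatasync(fd) to skip metadata like mtime
    finally:
        os.close(fd)
```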
&lt;blockquote&gt;
&lt;p&gt;A short but important note about fsync: when fsync() returns success, it means &amp;quot;all writes since the last fsync have hit disk&amp;quot;, when you might have assumed it means &amp;quot;all writes since the last SUCCESSFUL fsync have hit disk&amp;quot;. PostgreSQL learned about this only recently (2018), which led to them changing the sync behavior from retrying fsync until a success is returned to simply panicking on fsync failure. This incident got famous and was named fsyncgate. You can learn a lot more about fsync failures &lt;a href=&quot;https:&#x2F;&#x2F;www.usenix.org&#x2F;system&#x2F;files&#x2F;atc20-rebello.pdf&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Dear &lt;code&gt;MongoDB&lt;&#x2F;code&gt; users, know that by default writes are &lt;a href=&quot;https:&#x2F;&#x2F;www.mongodb.com&#x2F;docs&#x2F;manual&#x2F;core&#x2F;journaling&#x2F;&quot;&gt;synced every 100ms&lt;&#x2F;a&gt;, meaning it is not 100% durable.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;isolation&quot;&gt;Isolation&lt;&#x2F;h3&gt;
&lt;p&gt;The simplest way to have multiprocess isolation in &lt;code&gt;bashdb&lt;&#x2F;code&gt; is to add a lock before we read &#x2F; write to the storage file.&lt;&#x2F;p&gt;
&lt;p&gt;There&#x27;s a program in Linux called &lt;code&gt;flock&lt;&#x2F;code&gt;, which locks a file. You can even provide it with the &lt;code&gt;-s&lt;&#x2F;code&gt; flag to specify that you will not modify the file, meaning all callers who specify &lt;code&gt;-s&lt;&#x2F;code&gt; are allowed to read the file concurrently. &lt;code&gt;flock&lt;&#x2F;code&gt; blocks until it has taken the lock.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;flock&lt;&#x2F;code&gt; simply calls the &lt;code&gt;flock&lt;&#x2F;code&gt; syscall.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;With such an awesome program, &lt;code&gt;bashdb&lt;&#x2F;code&gt; can guarantee &lt;em&gt;isolation&lt;&#x2F;em&gt;, here&#x27;s the code:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;bash&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-bash &quot;&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;db_set&lt;&#x2F;span&gt;&lt;span&gt;() {
&lt;&#x2F;span&gt;&lt;span&gt;    (
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;flock&lt;&#x2F;span&gt;&lt;span&gt; 9 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;amp;&amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;echo &lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;$&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;,$&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt; database &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;amp;&amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;sync&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt; -d&lt;&#x2F;span&gt;&lt;span&gt; database
&lt;&#x2F;span&gt;&lt;span&gt;    ) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;database.lock
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;db_get&lt;&#x2F;span&gt;&lt;span&gt;() {
&lt;&#x2F;span&gt;&lt;span&gt;    (
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;flock&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt; -s&lt;&#x2F;span&gt;&lt;span&gt; 9 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;amp;&amp;amp; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;grep &lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;^$&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;,&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt; database &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;sed&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt; -e &lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;quot;s&#x2F;^$&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ffffff;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;,&#x2F;&#x2F;&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;tail&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt; -n&lt;&#x2F;span&gt;&lt;span&gt; 1
&lt;&#x2F;span&gt;&lt;span&gt;    ) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span&gt;database.lock
&lt;&#x2F;span&gt;&lt;span&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
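&lt;p&gt;The same pattern can be sketched in Python using &lt;code&gt;fcntl.flock&lt;&#x2F;code&gt;, which wraps the same &lt;code&gt;flock&lt;&#x2F;code&gt; syscall (a sketch, not the &lt;code&gt;bashdb&lt;&#x2F;code&gt; code itself; a reader would pass &lt;code&gt;LOCK_SH&lt;&#x2F;code&gt; instead of &lt;code&gt;LOCK_EX&lt;&#x2F;code&gt;):&lt;&#x2F;p&gt;

```python
import fcntl

def locked_append(path, line):
    # Take an exclusive lock while writing; flock() blocks until the
    # lock is taken, exactly like the flock program does.
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.write(line + "\n")
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```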
&lt;p&gt;The biggest drawback is that we are now locking the entire database whenever we write to it.&lt;&#x2F;p&gt;
&lt;p&gt;The only things left are atomicity and improving the algorithm to not be &lt;code&gt;O(n)&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;bad-news&quot;&gt;Bad News&lt;&#x2F;h2&gt;
&lt;p&gt;I&#x27;m sorry, this is as far as I could get with &lt;code&gt;bashdb&lt;&#x2F;code&gt;, I could not find a simple way to ensure atomicity in bash ☹️&lt;&#x2F;p&gt;
&lt;p&gt;You could probably use &lt;code&gt;mv -T&lt;&#x2F;code&gt; &#x2F; &lt;code&gt;rename&lt;&#x2F;code&gt; for this somehow; I&#x27;ll leave it as an exercise for you.&lt;&#x2F;p&gt;
&lt;p&gt;And even if it was possible, we still need to fix the &lt;code&gt;O(n)&lt;&#x2F;code&gt; situation.&lt;&#x2F;p&gt;
&lt;p&gt;Before beginning the &lt;code&gt;bashdb&lt;&#x2F;code&gt; adventure, I knew that we wouldn&#x27;t be able to easily solve all these problems in less than 10 lines of bash, but by trying to, you&#x27;ve hopefully started to get a feel for the problems database engineers face.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;storage-engine&quot;&gt;Storage Engine&lt;&#x2F;h1&gt;
&lt;p&gt;Let&#x27;s start with the first big component of a database, the &lt;code&gt;Storage Engine&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The purpose of the storage engine is to provide an abstraction over reading and writing data to persistent storage, with the main goal to be &lt;strong&gt;fast&lt;&#x2F;strong&gt;, i.e. have &lt;strong&gt;high throughput&lt;&#x2F;strong&gt; and &lt;strong&gt;low latency&lt;&#x2F;strong&gt; on requests.&lt;&#x2F;p&gt;
&lt;p&gt;But what makes software slow?&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#282a36;color:#f8f8f2;&quot;&gt;&lt;code&gt;&lt;span&gt;Latency Comparison Numbers (~2012)
&lt;&#x2F;span&gt;&lt;span&gt;----------------------------------
&lt;&#x2F;span&gt;&lt;span&gt;L1 cache reference                           0.5 ns
&lt;&#x2F;span&gt;&lt;span&gt;Branch mispredict                            5   ns
&lt;&#x2F;span&gt;&lt;span&gt;L2 cache reference                           7   ns                      14x L1 cache
&lt;&#x2F;span&gt;&lt;span&gt;Mutex lock&#x2F;unlock                           25   ns
&lt;&#x2F;span&gt;&lt;span&gt;Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
&lt;&#x2F;span&gt;&lt;span&gt;Compress 1K bytes with Zippy             3,000   ns        3 us
&lt;&#x2F;span&gt;&lt;span&gt;Send 1K bytes over 1 Gbps network       10,000   ns       10 us
&lt;&#x2F;span&gt;&lt;span&gt;Read 4K randomly from SSD              150,000   ns      150 us          ~1GB&#x2F;sec SSD
&lt;&#x2F;span&gt;&lt;span&gt;Read 1 MB sequentially from memory     250,000   ns      250 us
&lt;&#x2F;span&gt;&lt;span&gt;Round trip within same datacenter      500,000   ns      500 us
&lt;&#x2F;span&gt;&lt;span&gt;Read 1 MB sequentially from SSD      1,000,000   ns    1,000 us    1 ms  ~1GB&#x2F;sec SSD, 4X memory
&lt;&#x2F;span&gt;&lt;span&gt;Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
&lt;&#x2F;span&gt;&lt;span&gt;Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
&lt;&#x2F;span&gt;&lt;span&gt;Send packet CA-&amp;gt;Netherlands-&amp;gt;CA    150,000,000   ns  150,000 us  150 ms
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If an L1 cache reference took as long as a heartbeat (around half a second), reading 1 MB sequentially from SSD would take ~12 days and reading 1 MB sequentially from disk would take ~8 months.&lt;&#x2F;p&gt;
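&lt;p&gt;You can sanity check that scaling yourself, using the numbers from the table above:&lt;&#x2F;p&gt;

```python
# If an L1 cache reference (0.5 ns) were a heartbeat (0.5 s),
# everything else gets slower by the same factor:
SCALE = 0.5 / 0.5e-9                      # = 1e9

ssd_days = 1e-3 * SCALE / 86400           # 1 ms (1 MB from SSD) in scaled days
disk_months = 20e-3 * SCALE / 86400 / 30  # 20 ms (1 MB from disk) in scaled months

print(round(ssd_days, 1), round(disk_months, 1))  # roughly 11.6 and 7.7
```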
&lt;p&gt;This is why the main limitation of storage engines is the disk itself, and thus all designs try to minimize disk I&#x2F;O and disk seeks as much as possible. Some designs even get rid of disks in favor of SSDs (although they are much more expensive).&lt;&#x2F;p&gt;
&lt;p&gt;A storage engine design usually consists of:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The underlying data structure to store items on disk.&lt;&#x2F;li&gt;
&lt;li&gt;ACID transactions.
&lt;ul&gt;
&lt;li&gt;Some may skip this to achieve better performance for specific use cases where ACID is not important.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;Some cache - to not read from disk &lt;em&gt;every&lt;&#x2F;em&gt; time.
&lt;ul&gt;
&lt;li&gt;Most use buffered I&#x2F;O to let the OS cache for us.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;API layer - SQL &#x2F; document &#x2F; graph &#x2F; ...&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Storage engine data structures come in all shapes and sizes; I&#x27;m going to focus on the 2 categories you will most likely find in the wild - mutable and immutable data structures.&lt;&#x2F;p&gt;
&lt;p&gt;Mutable means that after writing data to a file, the data can be overwritten later in the future, while immutable means that after writing data to a file, it can only be read again.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;mutable-b-trees&quot;&gt;Mutable B-Trees&lt;&#x2F;h2&gt;
&lt;p&gt;To achieve the goal of maintaining good performance as the amount of data scales up, the data structure we use should be able to search an item in at most logarithmic time, and not linear time like in &lt;code&gt;bashdb&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;A simple data structure you are probably familiar with is the BST (binary search tree), where lookups are made in &lt;code&gt;O(log n)&lt;&#x2F;code&gt; time.&lt;&#x2F;p&gt;
&lt;p&gt;The problem with BSTs is that nodes are placed randomly apart from each other, which means that after reading a node while traversing the tree, the next node is most likely going to be somewhere far away on disk. To minimize disk I&#x2F;O &amp;amp; seeks, each page read from disk should be reused from memory as much as possible, without reaching out to disk again.&lt;&#x2F;p&gt;
&lt;p&gt;The property we&#x27;re looking for is called &amp;quot;spatial locality&amp;quot;, and one of the most famous &amp;quot;spatially local&amp;quot; variations of BSTs are B-trees.&lt;&#x2F;p&gt;
&lt;p&gt;A B-tree generalizes the BST, allowing for nodes with more than two children. Here&#x27;s what one looks like:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#282a36;color:#f8f8f2;&quot;&gt;&lt;code&gt;&lt;span&gt;                  ------------------------------------
&lt;&#x2F;span&gt;&lt;span&gt;                  |     7     |     16     |    |    |
&lt;&#x2F;span&gt;&lt;span&gt;                  ------------------------------------
&lt;&#x2F;span&gt;&lt;span&gt;                 &#x2F;            |             \
&lt;&#x2F;span&gt;&lt;span&gt;-----------------     ----------------       -----------------
&lt;&#x2F;span&gt;&lt;span&gt;| 1 | 2 | 5 | 6 |     | 9 | 12 |  |  |       | 18 | 21 |  |  |
&lt;&#x2F;span&gt;&lt;span&gt;-----------------     ----------------       -----------------
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With the search algorithm in pseudo python code:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;font-style:italic;color:#ff79c6;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;get&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;node&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;key&lt;&#x2F;span&gt;&lt;span&gt;):
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;i, child &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;enumerate&lt;&#x2F;span&gt;&lt;span&gt;(node&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span&gt;children):
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if not &lt;&#x2F;span&gt;&lt;span&gt;child:
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;None
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;child&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span&gt;key &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span&gt;key:
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# Found it!
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;child&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span&gt;value
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;child&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span&gt;key &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span&gt;key:
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;get&lt;&#x2F;span&gt;&lt;span&gt;(node&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span&gt;nodes[i], key)
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;get&lt;&#x2F;span&gt;&lt;span&gt;(node&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span&gt;nodes[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;], key)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;On each read of a page from disk (usually 4KB or 8KB), we iterate over multiple nodes sequentially from memory and the various CPU caches, letting as few of the bytes we read as possible go to waste.&lt;&#x2F;p&gt;
&lt;p&gt;Remember, reading from memory and the CPU caches is a few orders of magnitude faster than disk; so much faster, in fact, that it can be considered basically free in comparison.&lt;&#x2F;p&gt;
&lt;p&gt;I know some of you reading this right now are thinking to yourselves &lt;em&gt;&amp;quot;Why not binary search instead of doing it linearly?&amp;quot;&lt;&#x2F;em&gt;; to you I say, please look at the L1 &#x2F; L2 cache reference times in the latency comparison numbers table again. Also, modern CPUs execute multiple operations in parallel when they operate on sequential memory, thanks to &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Single_instruction,_multiple_data&quot;&gt;SIMD&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Instruction_pipelining&quot;&gt;instruction pipelining&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Cache_prefetching&quot;&gt;prefetching&lt;&#x2F;a&gt;. You would be surprised just how far reading sequential memory can take you in terms of performance.&lt;&#x2F;p&gt;
&lt;p&gt;There&#x27;s a variation of the B-tree that takes this model even further, called the B+ tree, where only the leaf nodes hold values and all other nodes hold only keys, so fetching a page from disk yields a lot more keys to compare.&lt;&#x2F;p&gt;
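&lt;p&gt;To get a feel for why this fan-out matters, here&#x27;s a back-of-the-envelope sketch (the key and pointer sizes are assumptions for illustration, not from any specific database):&lt;&#x2F;p&gt;

```python
PAGE_SIZE = 4096    # bytes fetched per disk read
KEY_SIZE = 16       # assumed key size in bytes
CHILD_PTR_SIZE = 8  # assumed on-disk child pointer size

# Rough branching factor of an internal B+ tree node:
branching = PAGE_SIZE // (KEY_SIZE + CHILD_PTR_SIZE)  # 170

# Leaf pages reachable with only 3 page reads, i.e. 3 disk accesses:
reachable_leaves = branching ** 3  # about 4.9 million
```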
&lt;p&gt;B-trees, to be space optimized, need to sometimes reclaim space as a consequence of data fragmentation created by operations on the tree like:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Big value updates - updating a value into a larger value might overwrite data of the next node, so the tree relocates the item to a different location, leaving a &amp;quot;hole&amp;quot; in the original page.&lt;&#x2F;li&gt;
&lt;li&gt;Small value updates - updating a value to a smaller value leaves a &amp;quot;hole&amp;quot; at the end.&lt;&#x2F;li&gt;
&lt;li&gt;Deletes - deletion causes a &amp;quot;hole&amp;quot; right where the deleted value used to reside.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The process that takes care of space reclamation and page rewrites is variously called vacuum, compaction, page defragmentation, or maintenance. It is usually done in the background, so as not to interfere with user requests and cause latency spikes.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;See for example how in &lt;code&gt;PostgreSQL&lt;&#x2F;code&gt; you can configure an &lt;a href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;routine-vacuuming.html&quot;&gt;auto vacuum daemon&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;B-trees are most commonly used as the underlying data structure of an index (&lt;code&gt;PostgreSQL&lt;&#x2F;code&gt; creates B-tree indexes by default), or all data (I&#x27;ve seen &lt;code&gt;DynamoDB&lt;&#x2F;code&gt; once jokingly called &lt;em&gt;&amp;quot;a distributed B-tree&amp;quot;&lt;&#x2F;em&gt;).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;immutable-lsm-tree&quot;&gt;Immutable LSM Tree&lt;&#x2F;h2&gt;
&lt;p&gt;As we have already seen in the latency comparison numbers table, disk seeks are really expensive, which is why the idea of sequentially written immutable data structures got so popular.&lt;&#x2F;p&gt;
&lt;p&gt;The idea is that if you only append data to a file, the disk head doesn&#x27;t need to move as much to reach the next position where data will be written. On write heavy workloads, this has proven very beneficial.&lt;&#x2F;p&gt;
&lt;p&gt;One such append only data structure is called the &lt;code&gt;Log Structured Merge tree&lt;&#x2F;code&gt;, or &lt;code&gt;LSM tree&lt;&#x2F;code&gt; for short, and it is what powers &lt;em&gt;a lot&lt;&#x2F;em&gt; of modern database storage engines, such as &lt;code&gt;RocksDB&lt;&#x2F;code&gt;, &lt;code&gt;Cassandra&lt;&#x2F;code&gt; and my personal favorite &lt;code&gt;ScyllaDB&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;LSM trees&#x27; general concept is to buffer writes to a data structure in memory, preferably one that is easy to iterate in a sorted fashion (for example &lt;code&gt;AVL tree&lt;&#x2F;code&gt; &#x2F; &lt;code&gt;Red Black tree&lt;&#x2F;code&gt; &#x2F; &lt;code&gt;Skip List&lt;&#x2F;code&gt;), and once it reaches some capacity, flush it sorted to a new file called a &lt;code&gt;Sorted String Table&lt;&#x2F;code&gt; or &lt;code&gt;SSTable&lt;&#x2F;code&gt;. An SSTable stores sorted data, letting us leverage binary search and sparse indexes to lower the amount of disk I&#x2F;O.&lt;&#x2F;p&gt;
&lt;img class=&quot;svg&quot; src=&quot;&#x2F;lsm_tree_write.svg&quot;&#x2F;&gt;
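&lt;p&gt;The write path above can be sketched in a few lines of Python (a toy, assuming a plain dict instead of a balanced tree, and the same &amp;quot;key,value&amp;quot; line format &lt;code&gt;bashdb&lt;&#x2F;code&gt; used):&lt;&#x2F;p&gt;

```python
class MemTable:
    # A toy memtable: buffer writes in memory, flush them sorted to an
    # SSTable file. Real engines use an AVL tree / red-black tree / skip
    # list; a dict sorted at flush time is enough to show the idea.
    def __init__(self, capacity=1024):
        self.items = {}
        self.capacity = capacity

    def set(self, key, value):
        self.items[key] = value
        return len(self.items) >= self.capacity  # True means "flush me"

    def flush(self, path):
        with open(path, "w") as f:
            for key in sorted(self.items):
                f.write(f"{key},{self.items[key]}\n")
        self.items.clear()
```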
&lt;p&gt;To maintain durability, when data is written to memory, the action is also recorded in a &lt;code&gt;Write-Ahead Log&lt;&#x2F;code&gt; or &lt;code&gt;WAL&lt;&#x2F;code&gt;, which is read on the program&#x27;s startup to restore state to what it was before shutting down &#x2F; crashing.&lt;&#x2F;p&gt;
&lt;p&gt;Deletions are appended the same way a write would be; the entry simply holds a tombstone instead of a value. Tombstones get deleted in the compaction process detailed later.&lt;&#x2F;p&gt;
&lt;p&gt;The read path is where it gets a bit wonky. Reading from an LSM tree is done by first searching for the item of the provided key in the in-memory data structure; if not found, it then searches for the item by iterating over all SSTables on disk, from the newest one to the oldest.&lt;&#x2F;p&gt;
&lt;img class=&quot;svg&quot; src=&quot;&#x2F;lsm_tree_read.svg&quot;&#x2F;&gt;
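&lt;p&gt;A sketch of that read path, assuming a plain dict as the in-memory buffer and SSTable files of sorted &amp;quot;key,value&amp;quot; lines (a toy file layout for illustration):&lt;&#x2F;p&gt;

```python
def lsm_get(memtable, sstable_paths, key):
    # memtable is a plain dict; sstable_paths are ordered newest first.
    if key in memtable:
        return memtable[key]
    for path in sstable_paths:  # newest to oldest
        with open(path) as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    return v  # the newest file wins
    return None
```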
&lt;p&gt;You can probably already tell that as more and more data is written, there will be more SSTables to go through to find an item of a specific key, and even though each file is sorted, going over a lot of small files is slower than going over one big file with all items (lookup time complexity: &lt;code&gt;log(num_files * table_size) &amp;lt; num_files * log(table_size)&lt;&#x2F;code&gt;). This is another reason why LSM trees require compaction, in addition to removing tombstones.&lt;&#x2F;p&gt;
&lt;p&gt;In other words: compaction combines a few small SSTables into one big SSTable, removing all tombstones in the process, and is usually run as a background process.&lt;&#x2F;p&gt;
&lt;img class=&quot;svg&quot; src=&quot;&#x2F;lsm_tree_compact.svg&quot;&#x2F;&gt;
&lt;p&gt;Compaction can be implemented using a binary heap &#x2F; priority queue, something like:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;import heapq
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;def compact(sstables, output_sstable):
&lt;&#x2F;span&gt;&lt;span&gt;    # Ordered by ascending key. heappop() yields the item with the smallest key.
&lt;&#x2F;span&gt;&lt;span&gt;    heap = [(sstable.next(), sstable) for sstable in sstables]
&lt;&#x2F;span&gt;&lt;span&gt;    heapq.heapify(heap)
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    while heap:
&lt;&#x2F;span&gt;&lt;span&gt;        item, sstable = heapq.heappop(heap)
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;        if not item.is_tombstone():
&lt;&#x2F;span&gt;&lt;span&gt;            output_sstable.write(item)
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;        if (item := sstable.next()) is not None:
&lt;&#x2F;span&gt;&lt;span&gt;            # For code brevity, imagine pushing an item with a key that exists
&lt;&#x2F;span&gt;&lt;span&gt;            # in the heap removes the item with the smaller timestamp,
&lt;&#x2F;span&gt;&lt;span&gt;            # resulting in last write wins.
&lt;&#x2F;span&gt;&lt;span&gt;            heapq.heappush(heap, (item, sstable))
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;blockquote&gt;
&lt;p&gt;For a real working example in rust 🦀, &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tontinton&#x2F;dbeel&#x2F;blob&#x2F;ee3de152a5&#x2F;src&#x2F;storage_engine&#x2F;lsm_tree.rs#L1038&quot;&gt;click here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
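&lt;p&gt;The pseudocode above glosses over the heap API (for one, &lt;code&gt;heapq.heapify&lt;&#x2F;code&gt; actually works in place and returns &lt;code&gt;None&lt;&#x2F;code&gt;). Here is a runnable sketch of the same merge with Python&#x27;s &lt;code&gt;heapq&lt;&#x2F;code&gt;, where each &amp;quot;sstable&amp;quot; is assumed, for brevity, to be a plain iterable of &lt;code&gt;(key, timestamp, value)&lt;&#x2F;code&gt; tuples sorted by key:&lt;&#x2F;p&gt;

```python
import heapq

def merge_sstables(sstables):
    # Each "sstable" here is just an iterable of (key, timestamp, value)
    # tuples sorted by key, with value=None marking a tombstone; real
    # SSTables are on-disk files, this only sketches the merge logic.
    heap = []
    for i, it in enumerate(map(iter, sstables)):
        entry = next(it, None)
        if entry is not None:
            key, ts, value = entry
            # Negate the timestamp so the newest write of a key pops first.
            heapq.heappush(heap, ((key, -ts, i), value, it))

    merged, last_key = [], None
    while heap:
        (key, neg_ts, i), value, it = heapq.heappop(heap)
        if key != last_key:  # First pop per key is the newest write: last write wins.
            last_key = key
            if value is not None:  # Drop tombstones (safe only on the last level).
                merged.append((key, value))
        entry = next(it, None)
        if entry is not None:
            k, ts, v = entry
            heapq.heappush(heap, ((k, -ts, i), v, it))
    return merged
```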
&lt;p&gt;To optimize an LSM tree, you must decide &lt;em&gt;when&lt;&#x2F;em&gt; to compact and &lt;em&gt;which&lt;&#x2F;em&gt; sstable files to compact. &lt;code&gt;RocksDB&lt;&#x2F;code&gt;, for example, implements &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;facebook&#x2F;rocksdb&#x2F;wiki&#x2F;Leveled-Compaction&quot;&gt;Leveled Compaction&lt;&#x2F;a&gt;, where newly flushed sstables reside in level 0, and once a configured number N of files accumulates in a level, they are compacted and the resulting file is promoted to the next level.&lt;&#x2F;p&gt;
&lt;p&gt;It&#x27;s important to handle removal of tombstones with care, to avoid data resurrection. A deleted item might come back when a later compaction merges with another file that still holds an older write of that item; once the tombstone was dropped in a previous compaction, there is no way to tell that the write is stale. &lt;code&gt;RocksDB&lt;&#x2F;code&gt; keeps tombstones around until a compaction promotes the resulting file to the last level.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;bloom-filters&quot;&gt;Bloom Filters&lt;&#x2F;h3&gt;
&lt;p&gt;LSM trees can be further optimized by something called a bloom filter.&lt;&#x2F;p&gt;
&lt;p&gt;A bloom filter is a probabilistic set data structure that lets you efficiently check whether an item doesn&#x27;t exist in a set. Checking whether an item exists in the set results in either &lt;code&gt;false&lt;&#x2F;code&gt;, which means the item is definitely not in the set, or in &lt;code&gt;true&lt;&#x2F;code&gt;, which means the item is &lt;strong&gt;maybe&lt;&#x2F;strong&gt; in the set, and that&#x27;s why it&#x27;s called a &lt;em&gt;probabilistic&lt;&#x2F;em&gt; data structure.&lt;&#x2F;p&gt;
&lt;p&gt;The beauty is that a bloom filter needs only a handful of bits per item (a small constant that depends on the desired false-positive rate), independent of the items&#x27; size, while a regular set has to store the &lt;code&gt;n&lt;&#x2F;code&gt; items themselves.&lt;&#x2F;p&gt;
&lt;p&gt;How do they work? The answer is hash functions! On insertion, the filter runs multiple different hash functions on the inserted key, and for each result sets the corresponding bit (&lt;code&gt;result % number_of_bits&lt;&#x2F;code&gt;) to 1.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# A bloom filter&amp;#39;s bitmap of size 8 (bits).
&lt;&#x2F;span&gt;&lt;span&gt;bloom &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# Inserting key - first run 2 hash functions.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;Hash1&lt;&#x2F;span&gt;&lt;span&gt;(key1) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;100
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;Hash2&lt;&#x2F;span&gt;&lt;span&gt;(key1) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;55
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# Then calculate corresponding bits.
&lt;&#x2F;span&gt;&lt;span&gt;bits &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;100 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;% &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;55 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;% &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;] &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# Set 1 to corresponding bits.
&lt;&#x2F;span&gt;&lt;span&gt;bloom[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span&gt;] &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span&gt;bloom[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span&gt;] &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# After insertion it should look like:
&lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now comes the exciting part - checking!&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span&gt;bloom &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# To check a key, simply run the 2 hash functions and find the corresponding
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# bits, exactly like you would on insertion:
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;Hash1&lt;&#x2F;span&gt;&lt;span&gt;(key2) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;34
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;Hash2&lt;&#x2F;span&gt;&lt;span&gt;(key2) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;35
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;bits &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;34 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;% &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;35 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;% &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span&gt;] &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# And then check whether all the corresponding bits hold 1, if true, the item
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# maybe exists in the set, otherwise it definitely isn&amp;#39;t.
&lt;&#x2F;span&gt;&lt;span&gt;result &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span&gt;[bloom[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span&gt;], bloom[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;]] &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;] &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span&gt;false
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# false. key2 was never inserted in the set, otherwise those exact same bits
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# would have all been set to 1.
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Think about why it is that even when all checked bits are 1, it doesn&#x27;t guarantee that the same exact key was inserted before.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;A nice benefit of bloom filters is that you can control the chance of being certain that the item doesn&#x27;t exist in the set, by allocating more memory for the bitmap and by adding more hash functions. There are even &lt;a href=&quot;https:&#x2F;&#x2F;hur.st&#x2F;bloomfilter&#x2F;&quot;&gt;calculators&lt;&#x2F;a&gt; for it.&lt;&#x2F;p&gt;
&lt;p&gt;LSM trees can store a bloom filter for each SSTable, and skip searching an SSTable whenever its bloom filter says the item definitely doesn&#x27;t exist in it. Otherwise, we search the SSTable normally, even though the item doesn&#x27;t necessarily exist in it.&lt;&#x2F;p&gt;
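&lt;p&gt;Here is a minimal sketch of a bloom filter in Python, just to illustrate the mechanics described above; real implementations (like the one in &lt;code&gt;RocksDB&lt;&#x2F;code&gt;) use much faster hash functions and carefully tuned sizes, and the class and parameter names here are made up for the example:&lt;&#x2F;p&gt;

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, key):
        # Derive num_hashes "different" hash functions by salting one hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def maybe_contains(self, key):
        # False means "definitely not present"; True means "maybe present".
        return all(self.bits[pos] for pos in self._positions(key))
```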
&lt;h2 id=&quot;write-ahead-log&quot;&gt;Write Ahead Log&lt;&#x2F;h2&gt;
&lt;p&gt;Remember ACID? Let&#x27;s talk briefly about how storage engines achieve ACID transactions.&lt;&#x2F;p&gt;
&lt;p&gt;Atomicity and durability are about keeping the data correct at all times, even when the machine shuts down unexpectedly, for example due to a power failure.&lt;&#x2F;p&gt;
&lt;p&gt;The most popular method to survive sudden crashes is to log all transaction actions into a special file called a &lt;code&gt;Write-Ahead Log&lt;&#x2F;code&gt; &#x2F; &lt;code&gt;WAL&lt;&#x2F;code&gt; (we touched on this briefly in the &lt;code&gt;LSM tree&lt;&#x2F;code&gt; section).&lt;&#x2F;p&gt;
&lt;p&gt;When the database process starts, it reads the &lt;code&gt;WAL&lt;&#x2F;code&gt; file and reconstructs the state of the data, skipping all transactions that don&#x27;t have a commit record, thus achieving atomicity.&lt;&#x2F;p&gt;
&lt;p&gt;Also, as long as a write request&#x27;s data is written and flushed to the &lt;code&gt;WAL&lt;&#x2F;code&gt; file before the user receives the response, the data is guaranteed to be read back at startup, meaning you also achieve durability.&lt;&#x2F;p&gt;
&lt;p&gt;WALs are basically a sort of &lt;a href=&quot;https:&#x2F;&#x2F;martinfowler.com&#x2F;eaaDev&#x2F;EventSourcing.html&quot;&gt;event sourcing&lt;&#x2F;a&gt; of the transactional events.&lt;&#x2F;p&gt;
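&lt;p&gt;A toy sketch of the idea, assuming one JSON record per line and a &lt;code&gt;commit&lt;&#x2F;code&gt; record per transaction (real WALs use binary formats, checksums and log rotation):&lt;&#x2F;p&gt;

```python
import json
import os

def wal_append(path, record):
    # Append one record and fsync before acknowledging the client,
    # so an acknowledged write survives a crash.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())

def wal_replay(path):
    # Rebuild state at startup, applying only transactions that
    # reached their "commit" record (atomicity).
    with open(path) as f:
        records = [json.loads(line) for line in f]
    committed = {r["txn"] for r in records if r["op"] == "commit"}
    state = {}
    for r in records:
        if r["op"] == "set" and r["txn"] in committed:
            state[r["key"]] = r["value"]
    return state
```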
&lt;h2 id=&quot;isolation-1&quot;&gt;Isolation&lt;&#x2F;h2&gt;
&lt;p&gt;To achieve isolation, you can either:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Use pessimistic locks - Block access to data that is currently being written to.&lt;&#x2F;li&gt;
&lt;li&gt;Use optimistic locks - Update a copy of the data and then commit it only if the data was not modified during the transaction; if it was, retry on the new data. Also known as optimistic concurrency control.&lt;&#x2F;li&gt;
&lt;li&gt;Read a copy of the data - MVCC (Multiversion concurrency control) is a common method used to avoid blocking user requests. In MVCC when data is mutated, instead of locking + overwriting it, you create a new version of the data that new requests read from. Once no readers remain that are reading the old data it can be safely removed. With MVCC, each user sees a &lt;em&gt;snapshot&lt;&#x2F;em&gt; of the database at a specific instant in time.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
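&lt;p&gt;The MVCC idea can be sketched in a few lines: a toy versioned store where every write appends a new version, and every reader picks the latest version committed at or before its snapshot timestamp (version garbage collection and conflict detection are left out, and all names here are made up for the example):&lt;&#x2F;p&gt;

```python
import bisect
import itertools

class MVCCStore:
    def __init__(self):
        self.clock = itertools.count(1)  # Monotonic commit timestamps.
        self.versions = {}  # key -> ([timestamps, ascending], [values])

    def begin(self):
        # A transaction's snapshot: it sees only versions committed
        # at or before this timestamp.
        return next(self.clock)

    def write(self, key, value):
        # Append a new version instead of overwriting in place.
        ts = next(self.clock)
        stamps, values = self.versions.setdefault(key, ([], []))
        stamps.append(ts)
        values.append(value)

    def read(self, key, snapshot_ts):
        # Latest version committed at or before the snapshot.
        stamps, values = self.versions.get(key, ([], []))
        i = bisect.bisect_right(stamps, snapshot_ts)
        return values[i - 1] if i else None
```

A reader that called `begin()` before a write keeps seeing the old version, which is exactly the *snapshot* behavior described above.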
&lt;p&gt;Some applications don&#x27;t require perfect isolation (or &lt;code&gt;Serializable Isolation&lt;&#x2F;code&gt;), and can relax their read isolation levels.&lt;&#x2F;p&gt;
&lt;p&gt;The ANSI&#x2F;ISO SQL 92 standard defines 3 read phenomena that can occur when a transaction reads data that another transaction might be updating:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dirty reads&lt;&#x2F;strong&gt; - A dirty read occurs when a transaction retrieves a row that has been updated by another transaction that is not yet committed.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;pre data-lang=&quot;sql&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-sql &quot;&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;BEGIN&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;SELECT&lt;&#x2F;span&gt;&lt;span&gt; age &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;FROM&lt;&#x2F;span&gt;&lt;span&gt; users &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;WHERE&lt;&#x2F;span&gt;&lt;span&gt; id &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;-- retrieves 20
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;                                        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;BEGIN&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;                                        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;UPDATE&lt;&#x2F;span&gt;&lt;span&gt; users &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;SET&lt;&#x2F;span&gt;&lt;span&gt; age &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;21 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;WHERE&lt;&#x2F;span&gt;&lt;span&gt; id &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;                                        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;-- no commit here
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;SELECT&lt;&#x2F;span&gt;&lt;span&gt; age &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;FROM&lt;&#x2F;span&gt;&lt;span&gt; users &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;WHERE&lt;&#x2F;span&gt;&lt;span&gt; id &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;-- retrieves 21 (a dirty read)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;COMMIT&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Non-repeatable reads&lt;&#x2F;strong&gt; - A non-repeatable read occurs when a transaction retrieves a row twice and that row is updated by another transaction that is committed in between.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;pre data-lang=&quot;sql&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-sql &quot;&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;BEGIN&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;SELECT&lt;&#x2F;span&gt;&lt;span&gt; age &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;FROM&lt;&#x2F;span&gt;&lt;span&gt; users &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;WHERE&lt;&#x2F;span&gt;&lt;span&gt; id &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;-- retrieves 20
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;                                        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;BEGIN&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;                                        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;UPDATE&lt;&#x2F;span&gt;&lt;span&gt; users &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;SET&lt;&#x2F;span&gt;&lt;span&gt; age &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;21 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;WHERE&lt;&#x2F;span&gt;&lt;span&gt; id &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;                                        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;COMMIT&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;SELECT&lt;&#x2F;span&gt;&lt;span&gt; age &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;FROM&lt;&#x2F;span&gt;&lt;span&gt; users &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;WHERE&lt;&#x2F;span&gt;&lt;span&gt; id &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;-- retrieves 21
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;COMMIT&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Phantom reads&lt;&#x2F;strong&gt; - A phantom read occurs when a transaction retrieves a set of rows twice and new rows are inserted into or removed from that set by another transaction that is committed in between.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;pre data-lang=&quot;sql&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-sql &quot;&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;BEGIN&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;SELECT&lt;&#x2F;span&gt;&lt;span&gt; name &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;FROM&lt;&#x2F;span&gt;&lt;span&gt; users &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;WHERE&lt;&#x2F;span&gt;&lt;span&gt; age &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;17&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;-- retrieves Alice and Bob
&lt;&#x2F;span&gt;&lt;span&gt;	
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;                                        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;BEGIN&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;                                        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;INSERT INTO&lt;&#x2F;span&gt;&lt;span&gt; users &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;VALUES&lt;&#x2F;span&gt;&lt;span&gt; (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#f1fa8c;&quot;&gt;&amp;#39;Carol&amp;#39;&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;26&lt;&#x2F;span&gt;&lt;span&gt;);
&lt;&#x2F;span&gt;&lt;span&gt;                                        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;COMMIT&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;SELECT&lt;&#x2F;span&gt;&lt;span&gt; name &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;FROM&lt;&#x2F;span&gt;&lt;span&gt; users &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;WHERE&lt;&#x2F;span&gt;&lt;span&gt; age &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;17&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;-- retrieves Alice, Bob and Carol
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;COMMIT&lt;&#x2F;span&gt;&lt;span&gt;;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Your application might not need a guarantee of no dirty reads in a specific transaction, for example, so it can choose a lower isolation level for greater performance; achieving higher isolation levels usually means sacrificing performance.&lt;&#x2F;p&gt;
&lt;p&gt;Here are the isolation levels defined by the ANSI&#x2F;SQL 92 standard, from highest to lowest (each level guarantees at least everything the levels below it guarantee):&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Serializable&lt;&#x2F;strong&gt; - The highest isolation level. Reads always return data that is committed, including range-based reads on multiple rows (avoiding phantom reads).&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Repeatable reads&lt;&#x2F;strong&gt; - Phantom reads are acceptable.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Read committed&lt;&#x2F;strong&gt; - Non-repeatable reads are acceptable.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Read uncommitted&lt;&#x2F;strong&gt; - The lowest isolation level. Dirty reads are acceptable.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;blockquote&gt;
&lt;p&gt;The ANSI&#x2F;SQL 92 standard isolation levels are often criticized for not being complete. For example, many MVCC implementations offer &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Snapshot_isolation&quot;&gt;snapshot isolation&lt;&#x2F;a&gt; and not serializable isolation (for the differences, read the provided wikipedia link). If you want to learn more about MVCC, I recommend reading about &lt;a href=&quot;https:&#x2F;&#x2F;db.in.tum.de&#x2F;~muehlbau&#x2F;papers&#x2F;mvcc.pdf&quot;&gt;HyPer&lt;&#x2F;a&gt;, a fast serializable MVCC algorithm.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;So to conclude the storage engine part of this post, the fundamental problem you solve when writing a storage engine is: how to store and retrieve data, while guaranteeing the ACID properties you care about, in the most performant way.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;One topic I left out is the API to choose when writing a database &#x2F; storage engine, but I&#x27;ll leave a post called &lt;a href=&quot;https:&#x2F;&#x2F;www.scattered-thoughts.net&#x2F;writing&#x2F;against-sql&#x2F;&quot;&gt;&amp;quot;Against SQL&amp;quot;&lt;&#x2F;a&gt; for you to start exploring the topic yourself.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h1 id=&quot;distributed-systems&quot;&gt;Distributed Systems&lt;&#x2F;h1&gt;
&lt;p&gt;Going distributed should be a last resort; introducing it to a system adds a &lt;strong&gt;ton&lt;&#x2F;strong&gt; of complexity, as we will soon learn. Please avoid using distributed systems when non-distributed solutions suffice.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable. ~Leslie Lamport&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The common use cases of needing to distribute data across multiple machines are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Availability&lt;&#x2F;strong&gt; - If for some reason the machine running the database crashes &#x2F; disconnects from our users, we might still want to let users use the application. By distributing data, when one machine fails, you can simply point requests to another machine holding the &amp;quot;redundant&amp;quot; data.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Horizontal Scaling&lt;&#x2F;strong&gt; - Conventionally, when an application needed to serve more user requests than it could handle, we would upgrade the machine&#x27;s resources (faster &#x2F; more disk, RAM, CPUs). This is called &lt;code&gt;Vertical Scaling&lt;&#x2F;code&gt;. It can get very expensive, and for some workloads there simply isn&#x27;t hardware that matches the amount of resources needed. Also, most of the time you don&#x27;t need all those resources, except in peaks of traffic (imagine Shopify on Black Friday). Another strategy, called &lt;code&gt;Horizontal Scaling&lt;&#x2F;code&gt;, is to operate on multiple separate machines connected over a network, seemingly working as a single machine.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Sounds like a dream, right? What can go wrong with going distributed?&lt;&#x2F;p&gt;
&lt;p&gt;Well, you have now introduced operational complexity (deployments &#x2F; etc...) and, more importantly, network partitions, infamous for being the P in something called the CAP theorem.&lt;&#x2F;p&gt;
&lt;p&gt;The CAP theorem states that a system can guarantee only 2 of the following 3:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Consistency&lt;&#x2F;strong&gt; - Reads receive the most recent write.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Availability&lt;&#x2F;strong&gt; - All requests succeed, no matter the failures.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Partition Tolerance&lt;&#x2F;strong&gt; - The system continues to operate despite dropped &#x2F; delayed messages between nodes.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;To understand why this is, imagine a database operating on a single machine. It is definitely &lt;em&gt;partition tolerant&lt;&#x2F;em&gt;, as messages in the system are not sent through something like a network, but through function calls operating on the same hardware (CPU &#x2F; memory). It is also &lt;em&gt;consistent&lt;&#x2F;em&gt;, as the state of the data is saved on the same hardware (memory &#x2F; disk) that all other read &#x2F; write requests operate on. Once the machine fails (be it software failures like SIGSEGV or hardware failures like the disk overheating) all new requests to it fail, violating &lt;em&gt;availability&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Now imagine a database operating on 2 machines with separate CPUs, memory and disks, connected through some cable. When a request to one of the machines fails, for whatever reason, the system can choose to do one of the following:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Cancel the request, thus sacrificing &lt;em&gt;availability&lt;&#x2F;em&gt; for &lt;em&gt;consistency&lt;&#x2F;em&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Allow the request to continue only on the working machine, meaning our other machine will now have inconsistent data (reads from it will not return the most recent write), thus sacrificing &lt;em&gt;consistency&lt;&#x2F;em&gt; for &lt;em&gt;availability&lt;&#x2F;em&gt;. When a system does this, it is called eventually consistent. &lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Network partitioning also means that you lose the ability to efficiently &lt;code&gt;JOIN&lt;&#x2F;code&gt; data, as you now need to pull together data scattered throughout the cluster. To mitigate this, the &lt;code&gt;NoSQL&lt;&#x2F;code&gt; movement of databases tells you to &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Denormalization&quot;&gt;denormalize&lt;&#x2F;a&gt; your data.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The original &lt;a href=&quot;https:&#x2F;&#x2F;www.allthingsdistributed.com&#x2F;files&#x2F;amazon-dynamo-sosp2007.pdf&quot;&gt;dynamo paper&lt;&#x2F;a&gt; is famous for many things, one of them being Amazon stating that amazon.com&#x27;s shopping cart should be highly available, and that this matters more to them than consistency. In the unlikely scenario that a user sees 2 of the same item in the shopping cart, they will simply remove one of them, which is a better situation than not being able to purchase and pay money!&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;I really enjoy the out-of-the-box thinking of sacrificing something that adds software complexity (like consistency in Amazon&#x27;s shopping cart) for a simpler human solution, like giving the user a refund. Software complexity can get more expensive to operate than a refund budget, for example.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;To achieve &lt;em&gt;availability&lt;&#x2F;em&gt; it&#x27;s not enough to have multiple nodes together combining all the data, there must also be data redundancy, or in other words, for each item a node stores there must be at least 1 other node to store a copy of that item. These nodes are usually called &lt;strong&gt;replicas&lt;&#x2F;strong&gt;, and the process of copying the data is called &lt;strong&gt;replication&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Assigning more replica nodes means that the system will be more &lt;em&gt;available&lt;&#x2F;em&gt;, with the obvious drawback of needing more resources to store all these copies.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Copies of data don&#x27;t need to be stored &amp;quot;whole&amp;quot;, they can be split and scattered across multiple nodes using a technique called erasure coding, which also has some interesting &lt;a href=&quot;https:&#x2F;&#x2F;brooker.co.za&#x2F;blog&#x2F;2023&#x2F;01&#x2F;06&#x2F;erasure.html&quot;&gt;latency characteristics&lt;&#x2F;a&gt; (by the way brooker&#x27;s blog is simply amazing for learning distributed systems).&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;consistent-hashing&quot;&gt;Consistent Hashing&lt;&#x2F;h2&gt;
&lt;p&gt;Now that you have multiple nodes, you need some kind of load balancing &#x2F; data partitioning method. When a request to store some data comes in, how do you determine which node receives the request?&lt;&#x2F;p&gt;
&lt;p&gt;You could go for the simplest solution, which is to always take a primary key (some id) in addition to the data, hash the key, and take the result modulo the number of available nodes, something like:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;font-style:italic;color:#ff79c6;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;get_owning_node&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;nodes&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;key&lt;&#x2F;span&gt;&lt;span&gt;):
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;nodes[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;hash&lt;&#x2F;span&gt;&lt;span&gt;(key) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;% &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;len&lt;&#x2F;span&gt;&lt;span&gt;(nodes)] 
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This modulo method works fine, until a node is either added or removed from the cluster. Once that happens, the calculation returns a different result because the number of available nodes changed, meaning a different node will be selected for the same key. To accommodate, the nodes can migrate the keys that should now live elsewhere, but then almost all items end up migrating, which is really expensive.&lt;&#x2F;p&gt;
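&lt;p&gt;To get a feel for just how expensive, here&#x27;s a small experiment (a sketch; &lt;code&gt;stable_hash&lt;&#x2F;code&gt; is a helper I made up, since Python&#x27;s built-in &lt;code&gt;hash()&lt;&#x2F;code&gt; is randomized per process) counting how many keys change owner when a fifth node joins a four node cluster:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import hashlib

def stable_hash(key):
    # Deterministic hash, same result across runs and machines.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], &quot;big&quot;)

def get_owning_node(nodes, key):
    return nodes[stable_hash(key) % len(nodes)]

keys = [f&quot;key-{i}&quot; for i in range(10_000)]
before = {k: get_owning_node([&quot;a&quot;, &quot;b&quot;, &quot;c&quot;, &quot;d&quot;], k) for k in keys}
after = {k: get_owning_node([&quot;a&quot;, &quot;b&quot;, &quot;c&quot;, &quot;d&quot;, &quot;e&quot;], k) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f&quot;{moved &#x2F; len(keys):.0%} of keys changed owner&quot;)
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A key stays put only when its hash gives the same result modulo 4 and modulo 5, so roughly 80% of all keys have to migrate.&lt;&#x2F;p&gt;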
&lt;p&gt;One method to lower the amount of items to be migrated on node addition &#x2F; removal that is used by some databases (e.g. &lt;code&gt;Dynamo&lt;&#x2F;code&gt; and &lt;code&gt;Cassandra&lt;&#x2F;code&gt;) is &lt;code&gt;Consistent Hashing&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Consistent hashing creates a ring of nodes instead of an array, placing each node&#x27;s name hash on the ring. Then each request&#x27;s key is hashed just like before, but instead of doing the modulo operation, we pick the node with the largest name hash that is smaller than (or equal to) the request key&#x27;s hash, wrapping around to the highest-hash node when no such node exists:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# Assume nodes are sorted, with the first node having the smallest hash value.
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ff79c6;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span style=&quot;color:#50fa7b;&quot;&gt;get_owning_node&lt;&#x2F;span&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;nodes&lt;&#x2F;span&gt;&lt;span&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#ffb86c;&quot;&gt;key&lt;&#x2F;span&gt;&lt;span&gt;):
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;len&lt;&#x2F;span&gt;&lt;span&gt;(nodes) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span&gt;:
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;None
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    key_hash &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;hash&lt;&#x2F;span&gt;&lt;span&gt;(key)
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# Walk the ring from the largest hash down to the smallest.
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span&gt;node &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#8be9fd;&quot;&gt;reversed&lt;&#x2F;span&gt;&lt;span&gt;(nodes):
&lt;&#x2F;span&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span&gt;node&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span&gt;hash &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;&amp;lt;= &lt;&#x2F;span&gt;&lt;span&gt;key_hash:
&lt;&#x2F;span&gt;&lt;span&gt;            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;node
&lt;&#x2F;span&gt;&lt;span&gt;
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#6272a4;&quot;&gt;# Wrap around: the key&#x27;s hash is smaller than every node&#x27;s hash.
&lt;&#x2F;span&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ff79c6;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span&gt;nodes[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#bd93f9;&quot;&gt;-1&lt;&#x2F;span&gt;&lt;span&gt;]
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For a visual explanation, imagine a ring that goes from 0 -&amp;gt; 99, holding nodes with the names &amp;quot;half&amp;quot;, &amp;quot;quarter&amp;quot; and &amp;quot;zero&amp;quot; whose hashes are 50, 25 and 0 respectively:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#282a36;color:#f8f8f2;&quot;&gt;&lt;code&gt;&lt;span&gt;   zero
&lt;&#x2F;span&gt;&lt;span&gt; &#x2F;      \
&lt;&#x2F;span&gt;&lt;span&gt;|     quarter 
&lt;&#x2F;span&gt;&lt;span&gt; \      &#x2F;
&lt;&#x2F;span&gt;&lt;span&gt;   half
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let&#x27;s say a user now wants to set an item with the key &quot;four-fifths&quot;, with a hash value of 80. The node with the largest name hash that is smaller than 80 is &quot;half&quot; (with a hash value of 50), so that&#x27;s the node to receive the request!&lt;&#x2F;p&gt;
&lt;p&gt;Choosing replicas is very simple: when an item is set to be stored on a specific node, go around the ring clockwise, and the next node will store a copy of that item. In our example, &quot;zero&quot; is the replica node for all items &quot;half&quot; owns, so when &quot;half&quot; dies and requests are routed to &quot;zero&quot; instead, it can serve them, keeping our system &lt;em&gt;available&lt;&#x2F;em&gt;. This method is sometimes called &lt;code&gt;Leaderless Replication&lt;&#x2F;code&gt; and is used by &quot;Dynamo&quot; style databases like &lt;code&gt;Cassandra&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Another method to choose a leader node and replica nodes is &lt;code&gt;Leader Election&lt;&#x2F;code&gt;, which is a huge topic on its own that I won&#x27;t get into in this post.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Now, what happens when a node is added to the cluster? Let&#x27;s add a node named &quot;three-quarters&quot; with a hash value of 75. The item &quot;four-fifths&quot; should be migrated to the new &quot;three-quarters&quot; node, as new requests for that key will now be routed to it.&lt;&#x2F;p&gt;
&lt;p&gt;This migration process is a lot less expensive than what we previously had in the modulo solution. The number of keys that need to be migrated is equal to &lt;code&gt;num_keys &#x2F; num_nodes&lt;&#x2F;code&gt; on average.&lt;&#x2F;p&gt;
&lt;p&gt;A cool trick is to introduce the concept of virtual nodes, where you add multiple instances of a node to the ring, to lower the chances of some nodes owning many more items than others (in our example &quot;half&quot; will store twice as many items on average as the other nodes). You can generate virtual node names by, for example, adding an index as a suffix to the node name (&quot;half-0&quot;, &quot;half-1&quot;, etc...), and then each hash will land on a completely different location on the ring.&lt;&#x2F;p&gt;
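&lt;p&gt;Here&#x27;s a sketch of a ring with virtual nodes (the &lt;code&gt;vnodes&lt;&#x2F;code&gt; count and &lt;code&gt;stable_hash&lt;&#x2F;code&gt; helper are illustrative choices), using &lt;code&gt;bisect&lt;&#x2F;code&gt; to find the owner with the same &quot;largest hash below the key&quot; rule in &lt;code&gt;O(log n)&lt;&#x2F;code&gt; instead of a linear scan:&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import bisect
import hashlib

def stable_hash(s):
    # Deterministic hash, same result across runs and machines.
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], &quot;big&quot;)

def build_ring(nodes, vnodes=8):
    # Place several virtual instances of each node on the ring,
    # e.g. &quot;half-0&quot;, &quot;half-1&quot;, ... each landing somewhere else.
    return sorted((stable_hash(f&quot;{node}-{i}&quot;), node)
                  for node in nodes for i in range(vnodes))

def get_owning_node(ring, key):
    hashes = [h for h, _ in ring]
    # Largest hash &amp;lt;= the key&#x27;s hash; index -1 wraps around the ring.
    i = bisect.bisect_right(hashes, stable_hash(key)) - 1
    return ring[i][1]

ring = build_ring([&quot;zero&quot;, &quot;quarter&quot;, &quot;half&quot;])
print(get_owning_node(ring, &quot;four-fifths&quot;))
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With 8 virtual instances per node the ranges already even out noticeably; real systems typically use many more per physical node.&lt;&#x2F;p&gt;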
&lt;p&gt;Here&#x27;s a more detailed example of a migration in a cluster with a replication factor of 3:&lt;&#x2F;p&gt;
&lt;img class=&quot;svg&quot; src=&quot;&#x2F;migration.svg&quot;&#x2F;&gt;
&lt;blockquote&gt;
&lt;p&gt;Same colored nodes are virtual nodes of the same node, green arrows show which node an item is being migrated to, red arrows show item deletions from nodes and the brown diamonds are items.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h2 id=&quot;leaderless-replication&quot;&gt;Leaderless Replication&lt;&#x2F;h2&gt;
&lt;p&gt;In a leaderless setup, you get amazing &lt;em&gt;availability&lt;&#x2F;em&gt;, while sacrificing &lt;em&gt;consistency&lt;&#x2F;em&gt;. If the owning node is down during a write request, the value will be written to a replica, and once the owning node is up and running again, a read request that reaches it will return stale data.&lt;&#x2F;p&gt;
&lt;p&gt;When &lt;em&gt;consistency&lt;&#x2F;em&gt; is needed for a specific request, read requests can be sent in parallel to several replica nodes as well as to the owning node, and the client will pick the most up to date data. Write requests are usually sent in parallel to all replica nodes, but wait for an acknowledgement from only some of them. By choosing how many read responses and how many write acknowledgements to wait for, you can tune the &lt;em&gt;consistency&lt;&#x2F;em&gt; level per request.&lt;&#x2F;p&gt;
&lt;p&gt;To know whether a request is &lt;em&gt;consistent&lt;&#x2F;em&gt;, you just need to validate that &lt;code&gt;R + W &amp;gt; N&lt;&#x2F;code&gt; (which guarantees that the set of nodes you read from overlaps the set of nodes that acknowledged the latest write), where:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;N&lt;&#x2F;strong&gt; - Number of nodes holding a copy of the data.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;W&lt;&#x2F;strong&gt; - Number of nodes that will acknowledge a write for it to succeed.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;R&lt;&#x2F;strong&gt; - Number of nodes that have to respond to a read operation for it to succeed.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Sending a request to a majority of nodes (where &lt;code&gt;W&lt;&#x2F;code&gt; or &lt;code&gt;R&lt;&#x2F;code&gt; is equal to &lt;code&gt;floor(N&#x2F;2) + 1&lt;&#x2F;code&gt;) is called a quorum.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
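&lt;p&gt;The check itself is a one-liner; the interesting part is the trade-off it exposes, e.g. a higher &lt;code&gt;W&lt;&#x2F;code&gt; makes writes slower but allows cheaper reads. A minimal sketch (the function name is mine):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;def is_strongly_consistent(n, w, r):
    # The read set must overlap the set of nodes that acknowledged
    # the latest write, which happens exactly when r + w &amp;gt; n.
    return r + w &amp;gt; n

print(is_strongly_consistent(n=3, w=2, r=2))  # True: quorum writes + quorum reads
print(is_strongly_consistent(n=3, w=1, r=1))  # False: a read may miss the write
print(is_strongly_consistent(n=3, w=3, r=1))  # True: write to all, read from one
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;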
&lt;p&gt;Picking the correct read as the latest written one is called &lt;code&gt;Conflict Resolution&lt;&#x2F;code&gt;, and it is not a simple task. You might think that simply comparing timestamps and choosing the biggest one is enough, but relying on time in a distributed system is unreliable.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;This didn&#x27;t stop &lt;a href=&quot;https:&#x2F;&#x2F;cassandra.apache.org&#x2F;doc&#x2F;latest&#x2F;cassandra&#x2F;architecture&#x2F;dynamo.html#data-versioning&quot;&gt;Cassandra from using timestamps&lt;&#x2F;a&gt; though.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Each machine has its own hardware clock, and the clocks &lt;em&gt;drift&lt;&#x2F;em&gt; apart as they are not perfectly accurate (usually a quartz crystal oscillator). Synchronizing clocks using NTP (Network Time Protocol), where a server returns the time from a more accurate time source such as a GPS receiver, is not enough to provide accurate results, as the NTP request is over the network (another distributed system) and we can&#x27;t know exactly how much time will pass before receiving a response.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Google&#x27;s &lt;code&gt;Spanner&lt;&#x2F;code&gt; successfully provides consistency guarantees with clocks, by using special high precision time hardware, and its API exposes the time range uncertainty of each timestamp. You can read more about it &lt;a href=&quot;https:&#x2F;&#x2F;research.google&#x2F;pubs&#x2F;pub39966.pdf&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;But if clocks are so unreliable, how else are we supposed to know which value is correct?&lt;&#x2F;p&gt;
&lt;p&gt;Some systems (for example &lt;code&gt;Dynamo&lt;&#x2F;code&gt;) try to solve this partially using &lt;code&gt;Version Vectors&lt;&#x2F;code&gt;, where you attach a (node, counter) pair to each version of an item, which gives you the ability to find causality between the different versions. A version whose counters are all smaller than or equal to another version&#x27;s is definitely older, so it can be safely removed, which makes the problem easier.&lt;&#x2F;p&gt;
&lt;img class=&quot;svg&quot; src=&quot;&#x2F;version_vector.svg&quot;&#x2F;&gt;
&lt;blockquote&gt;
&lt;p&gt;An example showing how easily conflicts arise. At the end we are left with {v2, v3} as the conflicting values for the same key. The reason I removed v1 is to show that by using something like &lt;code&gt;Version Vectors&lt;&#x2F;code&gt;, versions of values can be safely removed to minimize the amount of conflicts. To learn more on &lt;code&gt;Version Vectors&lt;&#x2F;code&gt; and their implementations, I recommend reading &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ricardobcl&#x2F;Dotted-Version-Vectors&quot;&gt;Dotted Version Vectors&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
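&lt;p&gt;Here&#x27;s a minimal sketch of that pruning logic, representing each version vector as a dict of node-to-counter pairs (the &lt;code&gt;v1&lt;&#x2F;code&gt;, &lt;code&gt;v2&lt;&#x2F;code&gt;, &lt;code&gt;v3&lt;&#x2F;code&gt; values are made up to mirror the example above):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;def dominates(a, b):
    # a dominates b if a has seen everything b has seen.
    return all(a.get(node, 0) &amp;gt;= counter for node, counter in b.items())

def resolve(versions):
    # Keep only versions no other version dominates; the rest are
    # causally older and can be safely discarded.
    return [v for i, v in enumerate(versions)
            if not any(i != j and v != w and dominates(w, v)
                       for j, w in enumerate(versions))]

v1 = {&quot;A&quot;: 1}
v2 = {&quot;A&quot;: 2}          # definitely newer than v1
v3 = {&quot;A&quot;: 1, &quot;B&quot;: 1}  # concurrent with v2: a real conflict
print(resolve([v1, v2, v3]))  # [{&#x27;A&#x27;: 2}, {&#x27;A&#x27;: 1, &#x27;B&#x27;: 1}]
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;v1&lt;&#x2F;code&gt; is dropped because &lt;code&gt;v2&lt;&#x2F;code&gt; dominates it, while &lt;code&gt;v2&lt;&#x2F;code&gt; and &lt;code&gt;v3&lt;&#x2F;code&gt; remain as conflicting values that still need resolving.&lt;&#x2F;p&gt;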
&lt;p&gt;We could also decide to simply let the application decide how to deal with conflicts, by returning all conflicting values for the requested item. The application might know a lot more about the data than the database, so why not let it resolve conflicts? This is what &lt;code&gt;Riak KV&lt;&#x2F;code&gt; does, for example.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;An idea I think about often is that you could even allow users to compile conflict resolution logic as a WASM module, and upload it to the database, so that when conflicts occur, the database resolves them, never relying on the application.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;There are lots of different ideas for reducing conflicts in an eventually consistent system; they usually fall under the umbrella term &lt;code&gt;Anti Entropy&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;anti-entropy&quot;&gt;Anti Entropy&lt;&#x2F;h2&gt;
&lt;p&gt;Here are examples of some of the most popular &lt;code&gt;Anti Entropy&lt;&#x2F;code&gt; mechanisms:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Read Repair&lt;&#x2F;strong&gt; - After a client chooses the &amp;quot;latest&amp;quot; value from a read request that went to multiple nodes (by conflict resolution), it sends that value back to all the nodes that don&#x27;t currently store that value, thus &lt;em&gt;repairing&lt;&#x2F;em&gt; them.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Hinted Handoff&lt;&#x2F;strong&gt; - When a write request can&#x27;t reach one of the target nodes, send it instead as a &amp;quot;hint&amp;quot; to some other node. As soon as that target node is available again, send it the saved &amp;quot;hint&amp;quot;. On a quorum write, this mechanism is also called &lt;code&gt;Sloppy Quorum&lt;&#x2F;code&gt;, which provides even better &lt;em&gt;availability&lt;&#x2F;em&gt; for quorum requests.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Merkle Trees&lt;&#x2F;strong&gt; - Because read repair only fixes queried data, a lot of data can still stay inconsistent for a long time. Nodes can choose to start a synchronization process, talking to each other to find the differences in their data. This is really expensive when there is a lot of data (&lt;code&gt;O(n)&lt;&#x2F;code&gt;). To make the sync algorithm faster (&lt;code&gt;O(log n)&lt;&#x2F;code&gt;) we can introduce &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Merkle_tree&quot;&gt;merkle trees&lt;&#x2F;a&gt;. A merkle tree stores the hash of a range of the data in each leaf node, with each parent node storing the combined hash of its two children, thus creating a hierarchy of hashes up to the root of the tree. The sync process starts with one node comparing the root of its merkle tree to another node&#x27;s root; if the hashes are the same, the nodes hold exactly the same data. If the hashes differ, the children&#x27;s hashes are compared the same way, recursively, until the inconsistent ranges are found.&lt;&#x2F;p&gt;
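&lt;p&gt;Here&#x27;s a toy version of this sync (a sketch: real implementations hash key ranges and handle trees of different shapes, here I just hash 4 fixed chunks):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def build_tree(chunks):
    # Leaves hash ranges of the data; each parent hashes its two children.
    level = [h(c) for c in chunks]
    tree = [level]
    while len(level) &amp;gt; 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree  # tree[-1][0] is the root hash

def diff(a, b, level=None, i=0):
    # Descend only into subtrees whose hashes differ.
    if level is None:
        level = len(a) - 1
    if a[level][i] == b[level][i]:
        return []
    if level == 0:
        return [i]  # index of an inconsistent range
    return diff(a, b, level - 1, 2 * i) + diff(a, b, level - 1, 2 * i + 1)

node1 = build_tree([b&quot;a&quot;, b&quot;b&quot;, b&quot;c&quot;, b&quot;d&quot;])
node2 = build_tree([b&quot;a&quot;, b&quot;X&quot;, b&quot;c&quot;, b&quot;d&quot;])
print(diff(node1, node2))  # [1]: only the second range needs to be synced
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;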
&lt;p&gt;&lt;strong&gt;Gossip Dissemination&lt;&#x2F;strong&gt; - Send broadcast events to all nodes in the cluster in a simple and reliable way, by imitating how humans spread rumors or a disease. You send the event message to a configured number of randomly chosen nodes (called the &quot;fanout&quot;), and when they receive the message they repeat the process, sending the message to their own set of randomly chosen nodes. To not repeat the message forever in the cluster, a node stops broadcasting a gossip message once it has seen it a configured number of times. To get a feel for how data converges using gossip, head over to the &lt;a href=&quot;https:&#x2F;&#x2F;www.serf.io&#x2F;docs&#x2F;internals&#x2F;simulator.html&quot;&gt;simulator&lt;&#x2F;a&gt;! As an optimization, gossip messages are usually sent over UDP, since the mechanism itself is reliable enough to tolerate dropped packets.&lt;&#x2F;p&gt;
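&lt;p&gt;If you want an intuition without leaving the page, here is a tiny simulation of the spreading itself (a sketch; the stop-after-seeing-it-N-times rule is left out for brevity):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;python&quot; style=&quot;background-color:#282a36;color:#f8f8f2;&quot; class=&quot;language-python &quot;&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;import random

def rounds_to_converge(num_nodes, fanout, seed=0):
    # Each round, every node that knows the rumor forwards it to
    # `fanout` randomly chosen peers; count rounds until everyone knows.
    rng = random.Random(seed)
    knows = {0}
    rounds = 0
    while len(knows) &amp;lt; num_nodes:
        rounds += 1
        for _ in range(len(knows)):  # every informed node gossips once
            knows.update(rng.sample(range(num_nodes), fanout))
    return rounds

print(rounds_to_converge(1000, fanout=3))
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The rumor reaches all 1000 nodes in a handful of rounds, since the number of informed nodes grows roughly geometrically each round.&lt;&#x2F;p&gt;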
&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h1&gt;
&lt;p&gt;There is a lot more to talk about databases, be it the use of &lt;a href=&quot;https:&#x2F;&#x2F;yarchive.net&#x2F;comp&#x2F;linux&#x2F;o_direct.html&quot;&gt;O_DIRECT&lt;&#x2F;a&gt; in linux and implementing your own page cache, failure detection in distributed systems, consensus algorithms like &lt;a href=&quot;https:&#x2F;&#x2F;raft.github.io&#x2F;&quot;&gt;raft&lt;&#x2F;a&gt;, distributed transactions, leader election, and an almost infinite amount more.&lt;&#x2F;p&gt;
&lt;p&gt;I hope I have piqued your curiosity enough to explore the world of databases further, or provided the tools for you to better understand which database to pick in your next project 😀&lt;&#x2F;p&gt;
</content>
        
    </entry>
</feed>
