{"@attributes":{"version":"2.0"},"channel":{"title":"Jin Cong Ho | ML & Analytics with HPC","link":"https:\/\/jincongho.com\/","description":"Recent content on Jin Cong Ho | ML & Analytics with HPC","generator":"Hugo -- 0.144.2","language":"en-us","lastBuildDate":"Mon, 23 Feb 2026 00:00:00 +0000","item":[{"title":"Designing Binary Encodings for JSON and VARIANT","link":"https:\/\/jincongho.com\/posts\/designing-binary-encodings-for-json-and-variant\/","pubDate":"Mon, 23 Feb 2026 00:00:00 +0000","guid":"https:\/\/jincongho.com\/posts\/designing-binary-encodings-for-json-and-variant\/","description":"In this post, we&rsquo;ll dig into the <strong>internal designs of binary encoding for JSON<\/strong>. There&rsquo;s been little discussion of the trade-offs these encodings make. We&rsquo;ll see how even a straightforward binary encoding can materially improve retrieval performance."},{"title":"Paper Notes: Yellowbrick: An Elastic Data Warehouse on Kubernetes","link":"https:\/\/jincongho.com\/dbinternals\/yellowbrick\/","pubDate":"Thu, 31 Jul 2025 22:08:00 +0100","guid":"https:\/\/jincongho.com\/dbinternals\/yellowbrick\/","description":"<hr>\n<ul>\n<li><a href=\"https:\/\/15721.courses.cs.cmu.edu\/spring2024\/papers\/21-yellowbrick\/p2-cusack.pdf\">Yellowbrick: An Elastic Data Warehouse on Kubernetes, 2024 VLDB<\/a><\/li>\n<\/ul>\n<hr>\n<h1 id=\"1-key-design\">1 Key Design<\/h1>\n<p>Yellowbrick Data Warehouse delivers efficient, scalable and resilient data warehousing in public clouds and in private data centers.<\/p>\n<h1 id=\"2-architecture\">2 Architecture<\/h1>\n<p>Storage is separated from compute and data is persisted in object storage as column-oriented, compressed files known as shards.<\/p>\n<p>Microservices Architecture<\/p>\n<p>Deployment Approach<\/p>\n<h1 id=\"3-software-optimizations\">3 Software Optimizations<\/h1>\n<h2 id=\"31-database-optimizations\">3.1 Database Optimizations<\/h2>\n<p>parallel query plans, cost-based optimization, workload management 
and parallel query execution<\/p>"},{"title":"Paper Notes: Amazon Redshift and the Case for Simpler Data Warehouses","link":"https:\/\/jincongho.com\/dbinternals\/redshift\/","pubDate":"Sun, 27 Apr 2025 18:38:10 +0100","guid":"https:\/\/jincongho.com\/dbinternals\/redshift\/","description":"<hr>\n<ul>\n<li><a href=\"https:\/\/www.cs.cmu.edu\/~15721-f24\/papers\/Redshift.pdf\">Amazon Redshift and the Case for Simpler Data Warehouses, 2015 SIGMOD<\/a><\/li>\n<\/ul>\n<hr>\n<h1 id=\"1-key-design\">1 Key Design<\/h1>\n<p>Redshift is a fast, fully managed, petabyte-scale data warehouse solution that makes it simple and cost-effective to efficiently analyze large volumes of data. It uses familiar data warehousing techniques, including columnar layout, per-column compression, co-locating compute and data, co-locating joins, compilation to machine code and scale-out MPP processing. It also had a number of additional design goals:<\/p>"},{"title":"Paper Notes: ClickHouse - Lightning Fast Analytics for Everyone","link":"https:\/\/jincongho.com\/dbinternals\/clickhouse\/","pubDate":"Sun, 27 Apr 2025 16:21:04 +0100","guid":"https:\/\/jincongho.com\/dbinternals\/clickhouse\/","description":"<hr>\n<ul>\n<li><a href=\"https:\/\/www.vldb.org\/pvldb\/vol17\/p3731-schulze.pdf\">ClickHouse - Lightning Fast Analytics for Everyone, 2024 PVLDB<\/a><\/li>\n<\/ul>\n<hr>\n<p>ClickHouse is an OLAP database designed for high-performance analytics over petabyte-scale data sets with high ingestion rates.<\/p>\n<h1 id=\"1-key-design\">1 Key Design<\/h1>\n<p>ClickHouse is designed to address 5 key challenges of modern analytical data management:<\/p>\n<ol>\n<li>\n<p>Huge data sets with <strong>high ingestion rates<\/strong><\/p>\n<\/li>\n<li>\n<p>Many <strong>simultaneous queries<\/strong> with an expectation of low latencies: ad-hoc and recurring queries, pruning techniques allow optimizing frequent queries. 
Managing shared system resources.<\/p>"},{"title":"Paper Notes: The Snowflake Elastic Data Warehouse","link":"https:\/\/jincongho.com\/dbinternals\/snowflake\/","pubDate":"Thu, 24 Apr 2025 21:54:23 +0100","guid":"https:\/\/jincongho.com\/dbinternals\/snowflake\/","description":"<hr>\n<ul>\n<li><a href=\"https:\/\/www.cs.cmu.edu\/~15721-f24\/papers\/Snowflake.pdf\">The Snowflake Elastic Data Warehouse, 2016 ACM<\/a><\/li>\n<\/ul>\n<hr>\n<h1 id=\"1-key-design\">1 Key Design<\/h1>\n<p>Snowflake is an <strong>enterprise-ready data warehousing solution for the cloud<\/strong>.<\/p>\n<p>The cloud promises increased economies of scale, extreme scalability and availability, and a pay-as-you-go cost model, but these benefits can only be captured if the software itself is able to scale elastically over the pool of commodity resources in the cloud.<\/p>\n<p>Meanwhile, SaaS brings enterprise-class systems to users who previously could not afford them. Snowflake's key features include: relational model, semi-structured data, elastic compute and storage, high availability, durability, cost efficiency and security.<\/p>"},{"title":"Database Internals","link":"https:\/\/jincongho.com\/dbinternals\/","pubDate":"Mon, 01 Jan 0001 00:00:00 +0000","guid":"https:\/\/jincongho.com\/dbinternals\/","description":"<div class=\"tabs-nav\">\n<button class=\"tabs-nav-button\" onclick=\"showTab(this, 0)\">Systems<\/button>\n<button class=\"tabs-nav-button\" onclick=\"showTab(this, 1)\">Components<\/button>\n<\/div>\n<div id=\"tab0\">\n<h3 id=\"data-lakehouse\">Data Lakehouse<\/h3>\n<ul>\n<li>Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics<\/li>\n<li>Petabyte-Scale Row-Level Operations in Data Lakehouses<\/li>\n<\/ul>\n<h3 id=\"data-warehouse\">Data Warehouse<\/h3>\n<ul>\n<li><a href=\"https:\/\/jincongho.com\/dbinternals\/snowflake\/\">The Snowflake Elastic Data Warehouse<\/a><\/li>\n<li><a href=\"https:\/\/jincongho.com\/dbinternals\/clickhouse\/\">ClickHouse - 
Lightning Fast Analytics for Everyone<\/a><\/li>\n<li><a href=\"https:\/\/jincongho.com\/dbinternals\/redshift\/\">Amazon Redshift and the Case for Simpler Data Warehouses<\/a><\/li>\n<li><a href=\"https:\/\/jincongho.com\/dbinternals\/yellowbrick\/\">Yellowbrick: An Elastic Data Warehouse on Kubernetes<\/a><\/li>\n<\/ul>\n<h3 id=\"relational-database\">Relational Database<\/h3>\n<ul>\n<li>Postgres<\/li>\n<\/ul>\n<h3 id=\"distributed-processing\">Distributed Processing<\/h3>\n<ul>\n<li>Spark<\/li>\n<li>Ray<\/li>\n<\/ul>\n<h3 id=\"stream-processing\">Stream Processing<\/h3>\n<ul>\n<li>Kafka<\/li>\n<li>Flink<\/li>\n<\/ul>\n<\/div>\n<div id=\"tab1\" style=\"display:none;\">\n<h3 id=\"query-planner\">Query Planner<\/h3>\n<h3 id=\"execution-engine\">Execution Engine<\/h3>\n<h3 id=\"storage-engine\">Storage Engine<\/h3>\n<\/div>\n<style>\n.tabs-nav {\nborder-bottom: 1px solid #ddd;\n}\n.tabs-nav-button {\npadding: 5px 10px;\nborder: 1px solid #ddd;\nborder-bottom: 0;\nmargin: 0;\n}\n.tabs-nav-button:first-child {\nbackground-color: #eee;\n}\n<\/style>\n<script>\nfunction showTab(button, i) {\nconst elements = document.querySelectorAll('.tabs-nav-button');\nelements.forEach(el => {\nel.style.backgroundColor = '#fff';\n});\nbutton.style.backgroundColor = '#eee';\ndocument.getElementById('tab0').style.display = (i === 0) ? 'block' : 'none';\ndocument.getElementById('tab1').style.display = (i === 1) ? 'block' : 'none';\n}\n<\/script>"}]}}