{"@attributes":{"version":"2.0"},"channel":{"title":"Grab Tech","description":"Grab's Engineering team solves critical transportation challenges and makes transport freedom a reality for 620 million people in Southeast Asia.\n","link":"https:\/\/engineering.grab.com\/","pubDate":"Thu, 19 Mar 2026 09:52:20 +0000","lastBuildDate":"Thu, 19 Mar 2026 09:52:20 +0000","generator":"Jekyll v4.4.1","item":[{"title":"From firefighting to building: How AI agents restored our team\u2019s core productivity","description":"<h2 id=\"abstract\">Abstract<\/h2>\n\n<p>Grab\u2019s Analytics Data Warehouse (ADW) team supports over 1,000 users each month. These users support an extensive repository of more than 15,000 tables, which powers approximately 50% of all queries within our data lake.<br \/>\nHowever, the manual process of addressing \u201cquick questions\u201d is time-consuming and labor-intensive, thus creating a bottleneck in our operations.<\/p>\n\n<p>The team was drowning in repetitive requests, spending approximately 40% of their time or an equivalent of roughly 2 days every week, on tasks like:<\/p>\n\n<ul>\n  <li>Answering the same questions about data definitions<\/li>\n  <li>Tracing data sources and troubleshooting<\/li>\n  <li>Running quality checks to verify data integrity<\/li>\n  <li>Basic enhancement requests<\/li>\n<\/ul>\n\n<p>We deployed a <strong>multi-agent AI system<\/strong> that autonomously answers simpler questions and collaboratively addresses more complex requests. This led us to reclaim significant engineering bandwidth and unlock hundreds of hours of productivity monthly.<\/p>\n\n<h2 id=\"introduction\">Introduction<\/h2>\n\n<p>It\u2019s 5:00 PM on a Friday. You\u2019re wrapping up for the week when you receive a Slack message: \u201cThe vehicle_id in our production table looks gibberish. Is the pipeline broken?\u201d<\/p>\n\n<p>We tracked the anatomy of these \u2018simple\u2019 questions. It involved a fragmented journey through data catalogs, manual lineage tracing, SQL validation, and log diving. By the time a stakeholder received an answer, hours of high-value engineering time had been diverted into investigative overhead.<\/p>\n\n<p>This process consumed nearly half of our team\u2019s bandwidth. We recognized that while problems differed, the process of solving them was consistent. This led us to build a multi-agent AI system to automate the context hunt, allowing engineers to focus on complex challenges.<\/p>\n\n<h3 id=\"solution\">Solution<\/h3>\n\n<h3 id=\"tech-stack\">Tech stack<\/h3>\n\n<ul>\n  <li><strong>FastAPI &amp; LangGraph<\/strong>: We use FastAPI to handle requests and LangGraph to manage the complex state and cyclical logic required for multi-agent collaboration. Unlike simple Large Language Model (LLM) calls, LangGraph allows our agents to loop back, ask for more information, or hand off tasks to one another.<\/li>\n  <li><strong>Redis &amp; PostgreSQL<\/strong>: Redis handles our caching and real-time session needs, while PostgreSQL serves as the persistent memory, storing conversation history and agent metadata.<\/li>\n<\/ul>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/firefighting\/figure-1.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 1. Architecture tech stack.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<ul>\n  <li><strong>Hubble:<\/strong> A centralized metadata management platform and data catalog.<\/li>\n  <li><strong>Genchi:<\/strong> A data quality observability platform that enforces data contracts.<\/li>\n  <li><strong>Lighthouse:<\/strong> A platform that tracks execution status and monitors pipeline health.<\/li>\n<\/ul>\n\n<h3 id=\"from-request-to-resolution\">From request to resolution<\/h3>\n\n<p>The journey begins in Slack. When a user submits a request, it is categorized into one of two streams:<\/p>\n\n<ul>\n  <li>Enhancement requests: These are routed to the Enhancement Agent, which interacts directly with our core engineering tools like GitLab, Apache Spark, and Airflow to propose and test code changes.<\/li>\n  <li>General questions: These are funneled through our investigation pathway. The system orchestrates a \u201chuddle\u201d between the Data Agent (querying Trino, Hive, or Delta Lake), the Code Search Agent (analyzing GitLab), and the On-call Agent (checking Confluence and Slack for ongoing incidents).<\/li>\n<\/ul>\n\n<p>By decoupling the \u201cbrain\u201d (the LLM) from the \u201chands\u201d (the specialized agents and tools), we created a system that is both capable and easy to debug.<\/p>\n\n<h3 id=\"why-specialized-agents-beat-a-single-super-ai\">Why specialized agents beat a single \u201cSuper AI\u201d<\/h3>\n\n<p>We could have built one massive AI trained to handle every question, but specialized agents are easier to build, maintain, and improve than a monolithic system.<\/p>\n\n<p>The table below illustrates the comparison between a single AI system and a multi-agent system:<\/p>\n\n<table>\n  <thead>\n    <tr>\n      <th style=\"text-align: left\"><strong>Approach<\/strong><\/th>\n      <th style=\"text-align: left\"><strong>Advantages<\/strong><\/th>\n      <th style=\"text-align: left\"><strong>Challenges<\/strong><\/th>\n    <\/tr>\n  <\/thead>\n  <tbody>\n    <tr>\n      <td style=\"text-align: left\">Single AI (Monolithic)<\/td>\n      <td style=\"text-align: left\">One model to maintain, single inference call<\/td>\n      <td style=\"text-align: left\">Hard to debug, changes affect everything, generalist performance<\/td>\n    <\/tr>\n    <tr>\n      <td style=\"text-align: left\">Multi-Agent System<\/td>\n      <td style=\"text-align: left\">Focused expertise, modular updates, specialist accuracy<\/td>\n      <td style=\"text-align: left\">Sequential execution adds latency, coordination complexity<\/td>\n    <\/tr>\n  <\/tbody>\n<\/table>\n\n<p>We chose the multi-agent approach because maintainability and accuracy mattered more than shaving off a few seconds of latency. When you\u2019re replacing a multi-hour manual investigation, taking a few minutes for a precise answer is a massive leap in operational throughput.<\/p>\n\n<h2 id=\"the-architecture-two-pathways-five-specialized-agents\">The architecture: Two pathways, five specialized agents<\/h2>\n\n<p>When a question arrives through Slack, the system first determines which pathway to take:<\/p>\n\n<ul>\n  <li><strong>Pathway 1<\/strong>: Enhancement Requests \u2192 Enhancement Agent (handles code changes)<\/li>\n  <li><strong>Pathway 2<\/strong>: Investigation Questions \u2192 Classifier \u2192 Specialized agents \u2192 Summarizer<\/li>\n<\/ul>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/firefighting\/figure-2.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 2. Agent workflows, using a Supervisor that controls communication flow and task delegation.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h3 id=\"enhancement-pathway-semi-automated-code-changes\">Enhancement pathway: Semi-Automated code changes<\/h3>\n\n<p>For requests like <em>\u201cCan you add a new column for customer_segment?\u201d<\/em> or <em>\u201cWe need to change the aggregation logic for revenue\u201d<\/em>, the Enhancement Agent handles the heavy lifting.<\/p>\n\n<p><strong>Enhancement Agent<\/strong> receives user requirements and proposes code changes:<\/p>\n\n<ul>\n  <li>Gathers context: schema, lineage, dependencies, existing codebase.<\/li>\n  <li>Generates code changes and creates a merge request (MR).<\/li>\n  <li>Runs changes in a test environment.<\/li>\n  <li>Flags governance concerns (Personally Identifiable Information (PII) classification, Service Level Agreements (SLAs), backward compatibility).<\/li>\n<\/ul>\n\n<p><strong>The workflow<\/strong>:<\/p>\n\n<ol>\n  <li>User creates a JIRA request.<\/li>\n  <li>Agent analyzes requirements and gathers context through interactive dialogue with the engineer.<\/li>\n  <li>Agent creates an MR with suggested code.<\/li>\n  <li>Engineer reviews the MR.<\/li>\n  <li>If valid, agent runs changes in test environment.<\/li>\n  <li>Engineer reviews results against test cases.<\/li>\n  <li>If tests pass, engineer merges the MR.<\/li>\n<\/ol>\n\n<p>Why is the workflow semi-automated by design? Code changes to production pipelines require human judgment. The agent accelerates the process by doing the research, writing the code, and running tests, but humans make the final approval.<\/p>\n\n<h3 id=\"investigation-pathway-four-agents-working-together\">Investigation pathway: Four agents working together<\/h3>\n\n<p>For questions like <em>\u201cWhy does this data look wrong?\u201d<\/em> or <em>\u201cWhere does this metric come from?\u201d<\/em>, the system uses a coordinated team of specialists.<\/p>\n\n<p><strong>The Classifier<\/strong> is the first responder for investigation questions. It:<\/p>\n\n<ul>\n  <li>Parses the question to extract key information (tables, scripts, specific data requests).<\/li>\n  <li>Detects guardrail violations (PII requests, out-of-scope queries).<\/li>\n  <li>Determines which specialist agents are needed and in what sequence.<\/li>\n  <li>Provides reasoning and task descriptions for each recommended agent.<\/li>\n<\/ul>\n\n<p>Example: For the question \u201cWhy does this ID look wrong?\u201d, the Classifier routes the question to: Data Agent \u2192 Code Search Agent \u2192 On-call Agent (if needed).<\/p>\n\n<p><strong>Data Agent<\/strong> performs the data investigation:<\/p>\n\n<ul>\n  <li>Enhances the prompt\u2019s context with the table and column metadata.<\/li>\n  <li>Executes queries with guardrails (PII detection, command validation).<\/li>\n  <li>Validates schemas to avoid unnecessary scans and hallucinations.<\/li>\n  <li>Retrieves sample data with LLM exploratory comments.<\/li>\n<\/ul>\n\n<p>Example: It queries vehicle_id from the table to validate the user\u2019s observation against the actual data.<\/p>\n\n<p><strong>Code Search Agent<\/strong> analyzes the code:<\/p>\n\n<ul>\n  <li>Traces column transformations through the codebase.<\/li>\n  <li>Follows table lineage through multiple transformation steps.<\/li>\n  <li>Generates plain-language explanations of transformation logic.<\/li>\n  <li>Highlights divergences from documentation or stakeholder expectations.<\/li>\n<\/ul>\n\n<p>Example: It can trace a vehicle_id column from the final table back through 5 transformation steps to the original source, explaining each change along the way.<\/p>\n\n<p><strong>On-call Agent<\/strong> monitors production systems and assists with urgent issues:<\/p>\n\n<ul>\n  <li>Searches Slack channels for announcements about outages, source table failures, and delays.<\/li>\n  <li>Checks observability platforms for pipeline health, logs, and retry policies.<\/li>\n  <li>Validates data quality metrics (null counts, duplicates, range validation).<\/li>\n  <li>Produces incident notes and initial Root Cause Analysis (RCA) when issues are identified.<\/li>\n<\/ul>\n\n<p>Example: If the Data Agent detects SLA breaches or missing partitions, it may consult the On-call Agent for production context.<\/p>\n\n<p><strong>Summarizer Agent<\/strong> refines responses from the previous agents:<\/p>\n\n<ul>\n  <li>Handles conflicting information.<\/li>\n  <li>Combines responses into a coherent narrative.<\/li>\n  <li>Makes the answer concise and structured.<\/li>\n  <li>Ensures consistency across agent findings.<\/li>\n<\/ul>\n\n<p>Generating the summary is the final step before human review.<\/p>\n\n<h2 id=\"seeing-the-system-in-action\">Seeing the system in action<\/h2>\n\n<p>The best way to understand how this multi-agent system works is to see it handle real scenarios. Let\u2019s walk through two common situations our team faces daily.<\/p>\n\n<h3 id=\"scenario-1-adding-a-new-column\">Scenario 1: Adding a new column<\/h3>\n\n<p>The request: A stakeholder raises a JIRA ticket requesting, \u201cPlease add a <em>customer_segment<\/em> column to the <em>rides<\/em> table. Source data is available in the <em>user_profiles<\/em> table.\u201d<\/p>\n\n<p>In the traditional workflow, a data engineer would spend a significant portion of their afternoon clarifying requirements, developing and testing code, similar to the workflow steps in <em>\u201cFigure 2: Agent workflows\u201d<\/em>.<\/p>\n\n<p>With the Enhancement Agent, the entire process is completed autonomously in minutes. The agent performs these tasks in sequence:<\/p>\n\n<ol>\n  <li>Read the JIRA ticket: Agent fetches the ticket details to understand the exact requirements: what column needs to be added, which table is involved, and where the source data comes from.<\/li>\n  <li>Discover the relevant code: Using intelligent search capabilities, it locates the specific pipeline files in our codebase that need modification. It navigates through the repository structure to find the right transformation scripts.<\/li>\n  <li>Run validation checks: Before making any changes, it validates:\n    <ul>\n      <li>The requested column exists in the upstream source table.<\/li>\n      <li>The column doesn\u2019t already exist in the target table.<\/li>\n      <li>Schema compatibility and data quality requirements are met.<\/li>\n    <\/ul>\n  <\/li>\n  <li>Generate database schema changes: The agent references existing Data Definition Language (DDL) scripts to understand the standard format, then automatically generates the necessary schema modification scripts. These scripts are added to the MR alongside the code changes.<\/li>\n  <li>Create the MR: All changes, including code modifications and schema scripts, are packaged into an MR with proper documentation, making it ready for review.<\/li>\n  <li>Enable pipeline execution: Once the MR is validated, users can interact with the bot to trigger the data pipeline and start testing their changes on Airflow. They can optionally specify date ranges or other parameters to control the test runs.<\/li>\n<\/ol>\n\n<p>The entire process, from ticket to deployable MR, completes autonomously in minutes, with full traceability at every step.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/firefighting\/figure-3.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 3. Enhancement Agent workflow.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h3 id=\"scenario-2-investigating-faulty-looking-data\">Scenario 2: Investigating faulty-looking data<\/h3>\n\n<p>The question: \u201cWhy is the ID in the vehicles table unreadable?\u201d<\/p>\n\n<p>Traditionally, the data engineer typically performs these steps:<\/p>\n\n<ol>\n  <li>Search through various data catalogs to locate relevant information.<\/li>\n  <li>Manually track the data\u2019s origin and transformation path.<\/li>\n  <li>Validate SQL queries.<\/li>\n  <li>Examine logs.<\/li>\n<\/ol>\n\n<p>This is how it looks with agents:<\/p>\n\n<p><strong>Step 1: Classifier analyzes the question<\/strong><\/p>\n\n<ul>\n  <li>Parses the question: determines all three specialist agents are needed.<\/li>\n  <li>Plans the sequence: Data Agent \u2192 Code Search Agent \u2192 On-call Agent<\/li>\n  <li>Provides reasoning: \u201cNeed to verify data format, trace transformation logic, and check for production incidents\u201d.<\/li>\n<\/ul>\n\n<p><strong>Step 2: Data Agent investigates<\/strong><\/p>\n\n<ul>\n  <li>Retrieves metadata, which helps in building a SQL query for exploring samples.<\/li>\n  <li>Queries actual data. The result confirms the user\u2019s observation with the actual sample and identifies that IDs appear in Universally Unique Identifier (UUID) format, and they\u2019re \u201cunreadable\u201d.<\/li>\n  <li>Searches Grab\u2019s data catalog to find dimension tables that can help decipher UUID in a more human-readable format.<\/li>\n  <li>Finds an appropriate dimension table and builds a join query to test readability.<\/li>\n<\/ul>\n\n<p>Conclusion from Data Agent: <em>\u201cThe ID column contains UUID format values. These can be joined with dim_vehicles table to get human-readable vehicle names. The format is consistent and valid\u2014not corrupted data.\u201d<\/em><\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/firefighting\/figure-4.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 4. Data Agent response.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<p><strong>Step 3: Code Search Agent traces lineage<\/strong><\/p>\n\n<ul>\n  <li>Scans the transformation and lineage logic in the codebase to see exactly how the ID is extracted. It discovers that the ID is a raw UUID from a JSON payload directly from the source system.<\/li>\n  <li>Queries the source table for samples directly. The \u201cunreadable\u201d text pattern matches the data in the vehicles table, confirming that it is not a bug introduced by Spark transformations.<\/li>\n<\/ul>\n\n<p>Conclusion from Code Search Agent: \u201cThe \u2018unreadable\u2019 UUID format comes directly from the source system. No transformation is applied. This is not a bug introduced by our Spark pipelines\u2014it\u2019s the native format from the upstream system\u201d.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/firefighting\/figure-5.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 5. Code Search Agent response.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<p><strong>Step 4: On-call Agent checks production health<\/strong><\/p>\n\n<ul>\n  <li>Checks Airflow pipeline status.<\/li>\n  <li>Searches Slack channels for incidents.<\/li>\n  <li>Checks data quality metrics.<\/li>\n<\/ul>\n\n<p>Conclusion from On-call Agent: <em>\u201cNo production incidents detected. Pipeline running successfully. Data quality metrics are within normal ranges. No recent complaints or issues reported in communication channels.\u201d<\/em><\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/firefighting\/figure-6.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 6. On-call Agent response.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<p><strong>Step 5: Summarizer Agent synthesizes the answer<\/strong><\/p>\n\n<ul>\n  <li>User concern: ID values appear \u201cunreadable\u201d.<\/li>\n  <li>Data Agent finding: IDs are valid UUIDs, can be joined with dim_vehicles for readable names.<\/li>\n  <li>Code Search finding: UUID format comes directly from source system, not a transformation bug.<\/li>\n  <li>On-call finding: No production issues, pipeline healthy, data quality normal.<\/li>\n<\/ul>\n\n<p>Provides a structured answer to: <em>\u201cWhy is the ID in the vehicles table unreadable?\u201d<\/em><\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/firefighting\/figure-7.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 7. Summarizer Agent response.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<p><strong>Step 6: Human review and delivery<\/strong><br \/>\nWe publish answers to Slack immediately and mark them as unreviewed. This provides users with quick responses while they await an engineer\u2019s review.<\/p>\n\n<p>The initial response time has been reduced to just a few minutes, in contrast to the previous hours-long manual search.<\/p>\n\n<p><strong>Step 7: Continue conversation<\/strong><br \/>\nAfter an answer is posted, anyone can engage in a continued conversation with the agents, restarting the loop.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/firefighting\/figure-8.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 8. Continuing the conversation.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h2 id=\"optimizing-the-architecture\">Optimizing the architecture<\/h2>\n\n<p>Building the system was one challenge. Making it production-ready was another.<\/p>\n\n<p>Our initial prototype worked in controlled demos, but real-world usage revealed critical gaps. Users asked complex questions, conversations grew long, and edge cases exposed vulnerabilities. Here\u2019s how we optimized the system to handle production demands while maintaining accuracy and safety.<\/p>\n\n<h3 id=\"challenge-1-excessive-context\">Challenge 1: Excessive context<\/h3>\n\n<p>In multi-agent systems, context accumulates fast. Information is continuously passed from one agent to the next. Without careful management, excessive context and tokens cause performance degradation.<\/p>\n\n<p><strong>Our solution<\/strong>:<br \/>\nThe orchestrator maintains a rich state throughout execution, tracking three critical elements:<\/p>\n\n<ul>\n  <li>Conversation and tooling history: Full message context for each agent.<\/li>\n  <li>Execution tracking: Which agents have run, current progress, and execution steps.<\/li>\n  <li>Agent responses: Structured responses from each agent, passed to subsequent agents.<\/li>\n<\/ul>\n\n<p>This state is carefully managed to ensure each agent has the right context without overwhelming token limits.<\/p>\n\n<ul>\n  <li>Token tracking: Every message is counted using <a href=\"https:\/\/github.com\/openai\/tiktoken\">tiktoken<\/a>, giving us real-time visibility into our token budget.<\/li>\n  <li>Intelligent summarization: When token limits are exceeded, earlier messages are automatically summarized while retaining information relevant to the original question. Recent messages and critical context remain unsummarized to preserve accuracy.<\/li>\n  <li>Retrieval-Augmented Generation (RAG) context pruning: We reduce context from tool outputs when enhancing prompts:\n    <ul>\n      <li>Instead of passing full code files to the Code Search Agent, we use smaller LLM models to extract the most relevant code snippets and a short description.<\/li>\n      <li>For database queries, we apply filters to retrieve only the top relevant results.<\/li>\n    <\/ul>\n  <\/li>\n  <li>Handoffs Pattern: The previous agent returns its response to a central orchestrator. The orchestrator cleans the context, prunes unnecessary tokens, and invokes the next agent.<\/li>\n<\/ul>\n\n<p><strong>The result<\/strong>:<br \/>\nAgents can handle extended investigations without drowning in excessive context, maintaining performance even in complex, multi-turn conversations.<\/p>\n\n<h3 id=\"challenge-2-excessive-tool-usage\">Challenge 2: Excessive tool usage<\/h3>\n\n<p>Our initial design presented a significant performance bottleneck due to excessive context. Early models were equipped with a large and unwieldy set of over 30 distinct tools, each structured similarly to a generic API. Since tool calling is part of an agent\u2019s prompt, agents had to process verbose tool descriptions and outputs, which degraded efficiency.<\/p>\n\n<p><strong>Our solution<\/strong>:<br \/>\nWe focused on tool design based on real-world usage scenarios:<\/p>\n\n<ul>\n  <li>Included only the relevant portions required for decision-making.<\/li>\n  <li>Aggressively truncated verbose information from tool outputs.<\/li>\n  <li>Streamlined tool descriptions to be concise and actionable.<\/li>\n<\/ul>\n\n<p><strong>The result<\/strong>:<br \/>\nBy significantly reducing the data load agents needed to process during inference, we achieved a substantial leap in system responsiveness and throughput.<\/p>\n\n<h3 id=\"challenge-3-risky-code-executions\">Challenge 3: Risky code executions<\/h3>\n\n<p>AI agents with database access and code generation capabilities pose significant risks. Without proper safeguards, they could access sensitive PII data, execute dangerous SQL operations, run expensive queries, or generate breaking code changes. We needed to make the system safe.<\/p>\n\n<p><strong>Our solution<\/strong>:\nWe implemented multiple layers of safety to protect against misuse from both agents and users:<\/p>\n\n<p><em>Layer 1: Input classification<\/em><br \/>\nBefore any agent executes, the Classifier detects:<\/p>\n\n<ul>\n  <li>PII requests: Questions asking for personally identifiable information<\/li>\n  <li>Out-of-scope queries: Requests beyond the agent\u2019s capabilities<\/li>\n<\/ul>\n\n<p><em>Layer 2: SQL validation before execution<\/em><br \/>\nThe Data Agent validates every query for:<\/p>\n\n<ul>\n  <li>PII column access: Checks against column metadata to ensure it doesn\u2019t access confidential information.<\/li>\n  <li>Data definition and manipulation language (DDL\/DML) operations: The agent doesn\u2019t have access to DELETE, DROP, TRUNCATE, or UPDATE operations, but this check acts as an additional safeguard.<\/li>\n  <li>Slow queries: Detects missing partition filters or excessive date ranges that could cause expensive full-table scans.<\/li>\n  <li>Schema validation: Confirms tables and columns exist before execution.<\/li>\n<\/ul>\n\n<p><em>Layer 3: Timeout protection<\/em><br \/>\nAll database queries have strict execution limits to prevent runaway queries from impacting system performance.<\/p>\n\n<p><em>Layer 4: Enhancement agent controls<\/em><br \/>\nFor the Enhancement Agent, which generates code changes:<\/p>\n\n<ul>\n  <li>Cannot commit to master\/main directly: All changes go through MRs.<\/li>\n  <li>Mandatory human review: A human reviewer must validate all inputs before execution.<\/li>\n  <li>Test environment first: Changes run in staging before production deployment.<\/li>\n<\/ul>\n\n<p><strong>The result<\/strong>:\nA safe environment where AI agents can operate in production without compromising security or stability. Users and engineers trust the system because they know it has robust guardrails protecting critical data and systems.<\/p>\n\n<h3 id=\"challenge-4-ensuring-user-trust\">Challenge 4: Ensuring user trust<\/h3>\n\n<p>Even with RAG and guardrails, AI agents aren\u2019t perfect. Hallucinations, misinterpretations, and edge cases could erode user trust.<\/p>\n\n<p><strong>Our solution<\/strong>: \nAfter generating a summarized response, the multi-agent system routes to human reviewers who can take five actions:<\/p>\n\n<ul>\n  <li>Approve: Post the response as-is and add a footnote that the response has been deemed accurate by a human reviewer.<\/li>\n  <li>Reject: Mark the response as incorrect and log it for improvement. The response will not be posted, protecting users from bad information.<\/li>\n  <li>Refine: Add a prompt to improve the summarized response from the sub-agents. The system regenerates the answer with additional guidance.<\/li>\n  <li>Re-route to Sub-Agents: Send the question to a specific agent with additional context. For example: \u201cData Agent, can you check the last 30 days instead of 7 days?\u201d<\/li>\n  <li>Annotate: Provide structured feedback to the response, where it gets saved to a database for continuous improvement.<\/li>\n<\/ul>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/firefighting\/figure-9.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 9. Human review.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/firefighting\/figure-10.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 10. Annotations.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<p><strong>The result<\/strong>:<br \/>\nThis human-in-the-loop model ensures answers are accurate and reliable, increasing user trust in the responses. The annotations help us iteratively improve the model\u2019s future responses.<\/p>\n\n<h3 id=\"challenge-5-balancing-speed-and-quality\">Challenge 5: Balancing speed and quality<\/h3>\n\n<p>Our initial design withheld AI-generated responses until authorized by an engineering team member. This introduced a bottleneck in the response process, potentially leaving inquiries unresolved for extended periods, particularly during peak workload times.<\/p>\n\n<p><strong>Our solution<\/strong>: \nWe redesigned the process to allow responses to be posted without immediate human review, provided they are clearly and prominently marked as unreviewed. All posts can still be reviewed and modified by the on-call engineer as needed, but users get answers immediately rather than waiting.<\/p>\n\n<p><strong>The result<\/strong>: \nThis approach maintains a crucial balance between response speed and quality:<\/p>\n\n<ul>\n  <li>Users get fast answers when they need them.<\/li>\n  <li>Transparency (unreviewed label) sets appropriate expectations.<\/li>\n  <li>Engineers still review all responses to catch errors and improve the system.<\/li>\n  <li>Feedback loop remains intact for continuous learning.<\/li>\n<\/ul>\n\n<h3 id=\"challenge-6-closing-the-feedback-loop\">Challenge 6: Closing the feedback loop<\/h3>\n\n<p>Collecting feedback through annotations was just the first step. Without systematic analysis, we had a goldmine of information about what worked and what didn\u2019t, but we weren\u2019t learning from it. Every rejected response was a lesson unlearned, every annotation a pattern unrecognized. We needed to close the loop.<\/p>\n\n<p><strong>Our solution<\/strong>:<br \/>\nWe transformed annotations from passive records into an active improvement engine through five mechanisms:<\/p>\n\n<ol>\n  <li>Automated evaluation: Random annotations are pulled to create test cases for offline evaluation. This ensures the system is tested against real-world failure scenarios, not just synthetic test cases we invented.<\/li>\n  <li>Pattern analysis: We analyze annotations to identify systemic issues:\n    <ul>\n      <li>Is the Classifier consistently routing to the wrong agents?<\/li>\n      <li>Does a specific agent have quality issues?<\/li>\n      <li>Are certain types of queries prone to hallucinations?<\/li>\n      <li>Do particular table schemas cause confusion?<\/li>\n    <\/ul>\n  <\/li>\n  <li>Quality metrics: Tracking annotation rates over time measures system reliability and identifies regression. If the rejection rate suddenly increases, we know something has changed that needs investigation.<\/li>\n  <li>Targeted improvements: Annotations guide where to focus development effort:\n    <ul>\n      <li>Improving prompts: Refining agent system prompts with better examples.<\/li>\n      <li>Adding guardrails: Enhancing input classification to catch problematic queries earlier.<\/li>\n      <li>Enhancing specific agents: Adding examples or tools to handle struggling query types.<\/li>\n    <\/ul>\n  <\/li>\n  <li>Training data: Annotated failures can be used to:\n    <ul>\n      <li>Fine-tune models on domain-specific patterns.<\/li>\n      <li>Improve few-shot examples in prompts.<\/li>\n      <li>Build regression test suites from actual failures.<\/li>\n    <\/ul>\n  <\/li>\n<\/ol>\n\n<p><strong>The result<\/strong>:\nThe system transformed from static to continuous learning. Every mistake became an opportunity for improvement, and the system got smarter with each interaction. We had data-driven insights guiding our optimization priorities, ensuring we focused on the highest-impact improvements.<\/p>\n\n<h2 id=\"impact\">Impact<\/h2>\n\n<p>The deployment of this multi-agent system yielded transformative results across key performance indicators, shifting the team\u2019s entire operational dynamic.<\/p>\n\n<ul>\n  <li><strong>Automated resolution<\/strong>:The bots now autonomously handle the majority of standard user inquiries and a significant portion of common enhancement requests.<\/li>\n  <li><strong>Velocity gains<\/strong>: The time required to resolve issues has seen an order-of-magnitude reduction, effectively eliminating the support backlog. Simple inquiries are autonomously answered and brought to a resolution within minutes.<\/li>\n  <li><strong>Productivity gains<\/strong>: The team has successfully reclaimed several full-time equivalents (FTE) worth of engineering bandwidth, shifting hundreds of hours from reactive support to proactive roadmap delivery.<\/li>\n<\/ul>\n\n<p>With this newfound capacity unlocked, the data engineering team pivots from reactive support to proactive, high-value work, ultimately leading to \u201chappier downstream users.\u201d<\/p>\n\n<h2 id=\"conclusions\">Conclusions<\/h2>\n\n<p>Our journey from overwhelmed data engineers to a team empowered by AI agents revealed three core principles that made this transformation possible:<\/p>\n\n<p><strong>Multi-Agent architecture: Specialists over generalists<\/strong><br \/>\nSpecialized AI agents outperform a single generalist by mastering specific domains (e.g., data quality, code analysis). This modularity allows for independent improvement, easy additions, and clear responsibilities, boosting maintainability and flexibility.<\/p>\n\n<p><strong>Strategic human oversight: Building trust through transparency<\/strong><br \/>\nRouting AI responses through human reviewers achieved rapid adoption via trust and continuous system improvement by generating annotated training feedback.<\/p>\n\n<p><strong>Focus on augmentation: Automating repetitive tasks<\/strong><br \/>\nAI agents operate autonomously in repetitive tasks (context gathering, running queries, checking logs) with human oversight if needed, and collaborate with us in augmenting higher-value work: architectural decisions and building new capabilities.<\/p>\n\n<h2 id=\"join-us\">Join us<\/h2>\n\n<p>Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility, and digital financial services sectors. Serving over 900 cities in eight Southeast Asian countries: Cambodia, Indonesia, Malaysia, Myanmar, the Philippines, Singapore, Thailand, and Vietnam. Grab enables millions of people every day to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. We operate supermarkets in Malaysia under Jaya Grocer and Everrise, which enables us to bring the convenience of on-demand grocery delivery to more consumers in the country. As part of our financial services offerings, we also provide digital banking services through GXS Bank in Singapore and GXBank in Malaysia. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line. We aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.<\/p>\n\n<p>Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, <a href=\"https:\/\/grab.careers\/\">join our team today<\/a>!<\/p>\n","pubDate":"Thu, 19 Mar 2026 00:23:00 +0000","link":"https:\/\/engineering.grab.com\/from-firefighting-to-building","guid":"https:\/\/engineering.grab.com\/from-firefighting-to-building","category":["AI","Analytics","Database","Automation","Engineering"]},{"title":"Enabling R8 optimization at scale with AI-assisted debugging","description":"<p>Grab is Southeast Asia\u2019s leading superapp, providing a suite of services that bring essential needs to users throughout the region. Its offerings include ride-hailing, food delivery, parcel delivery, mobile payments, and more. With safety, efficiency, and user-centered design at heart, Grab remains dedicated to solving everyday issues and improving the lives of millions. As our app continued to expand, we identified platform-level performance challenges that were affecting user experience across the board. In this article, we share how we successfully enabled R8 optimization for the <a href=\"https:\/\/play.google.com\/store\/apps\/details?id=com.grabtaxi.passenger&amp;hl=en\">Grab Android app<\/a>, achieving significant improvements in app size, startup time, and stability through innovative AI-assisted debugging techniques.<\/p>\n\n<h2 id=\"introduction\">Introduction<\/h2>\n\n<p>Since 2024, our team observed a concerning trend: Application Not Responding (ANR) rates were spiking across the Grab app. Unlike typical isolated issues, the data revealed that ANRs were happening everywhere, not confined to specific features or modules. This pattern pointed to platform-level causes, with our analysis showing strong correlations between ANRs and several factors: memory pressure (particularly when garbage collection was triggered), ad-heavy user flows, complex layouts involving Jetpack Compose embedded within XML layouts, and XML views embedded within Compose code.<\/p>\n\n<p>The Android community had long proven that R8 optimization (beyond basic code shrinking) could deliver substantial performance gains and app size reductions. As Grab has been adopting Jetpack Compose over the last two years, <a href=\"https:\/\/developer.android.com\/develop\/ui\/compose\/performance#config\">Google\u2019s Jetpack Compose performance documentation<\/a> specifically recommends R8 optimization for Compose-heavy applications. It became particularly relevant, making it a natural solution for our systemic performance issues.<\/p>\n\n<p>In fact, enabling R8 optimization was not a new idea for our team. It had been identified and flagged as a high-impact solution multiple times over the years, yet each attempt fell short. Here\u2019s why.<\/p>\n\n<h2 id=\"the-challenge-at-scale\">The challenge at scale<\/h2>\n\n<p>Our app operates at scale, with over 9 million lines of code and 100+ engineers working on it daily. While we had basic R8 shrinking enabled, advanced optimization had proven challenging despite multiple attempts over six years (with different tools and approaches over the years).<\/p>\n\n<p>In 2022, we almost made it - successfully rolling out R8 optimization to GEA (our early access build), but unfortunately, we faced <a href=\"https:\/\/issuetracker.google.com\/u\/0\/issues\/240077160\">critical roadblocks<\/a> that compelled us to put the project on hold. After analyzing our previous attempts and the current project situation, we identified three fundamental challenges that had to be solved simultaneously.<\/p>\n\n<p>This article details how we overcame each challenge through targeted innovations: <strong>AI-Assisted Debugging<\/strong> for slow investigation cycles, <strong>Pragmatic Testing Strategy<\/strong> for validation at scale, and <strong>Optimized Feedback Loop<\/strong> for rapid iteration.<\/p>\n\n<h2 id=\"understanding-r8-optimization\">Understanding R8 optimization<\/h2>\n\n<p>Before diving into our solution, it\u2019s important to understand what R8 optimization actually provides beyond basic R8 shrinking.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/r8-optimization\/figure-1.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 1. The R8 processing pipeline involves multiple interconnected phases that transform, analyze, and optimize code. Understanding this complexity helps explain both the benefits of optimization and why debugging issues become challenging.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h3 id=\"what-we-had-in-place\">What we had in place<\/h3>\n\n<p>With <code class=\"language-plaintext highlighter-rouge\">minifyEnabled=true<\/code> and <code class=\"language-plaintext highlighter-rouge\">shrinkResources=true<\/code> using <code class=\"language-plaintext highlighter-rouge\">proguard-android.txt<\/code>, we already had:<\/p>\n\n<ul>\n  <li><strong>Tree shaking (shrink phase)<\/strong>: Removes unused\/unreachable code.<\/li>\n  <li><strong>Code minification (obfuscation)<\/strong>: Renames classes\/methods to short names.<\/li>\n  <li><strong>Resource shrinking<\/strong>: Removes unused XML files and drawables.<\/li>\n  <li><strong>Desugaring<\/strong>: Java 8+ compatibility.<\/li>\n<\/ul>\n\n<h3 id=\"whats-new-with-optimization\">What\u2019s new with optimization<\/h3>\n\n<p>By switching to <code class=\"language-plaintext highlighter-rouge\">proguard-android-optimize.txt<\/code>, we gained access to:<\/p>\n\n<ul>\n  <li><strong>Method inlining<\/strong>: Replaces method calls with actual code, reducing call overhead.<\/li>\n  <li><strong>Class merging<\/strong>: Makes code more compact by combining similar classes.<\/li>\n  <li><strong>Constant folding<\/strong>: Pre-computes constant expressions at compile time.<\/li>\n  <li><strong>Dead code elimination<\/strong>: More aggressive than tree shaking, removes unreachable branches.<\/li>\n  <li><strong>Devirtualization<\/strong>: Converts virtual calls to direct calls when possible.<\/li>\n<\/ul>\n\n<p>These optimizations work together to improve runtime performance while significantly reducing app size.<\/p>\n\n<h2 id=\"three-core-challenges\">Three core challenges<\/h2>\n\n<p>With context on R8 optimization benefits, why is enabling it so difficult at Grab\u2019s scale? After analyzing our previous failed attempts and the current project situation, we identified three fundamental challenges that had to be solved simultaneously.<\/p>\n\n<h3 id=\"challenge-1-slow-debugging\">Challenge 1: Slow debugging<\/h3>\n\n<p>R8 optimization issues are notoriously difficult to debug:<\/p>\n\n<ul>\n  <li>Code is <strong>obfuscated<\/strong>, class names become <code class=\"language-plaintext highlighter-rouge\">a<\/code>, <code class=\"language-plaintext highlighter-rouge\">b<\/code>, <code class=\"language-plaintext highlighter-rouge\">c<\/code>.<\/li>\n  <li>Code is <strong>modified<\/strong>, inlined, merged, and optimized beyond recognition.<\/li>\n  <li>Stack traces are <strong>unreadable<\/strong> without proper mapping files when crashes occur.<\/li>\n  <li><strong>Pinpointing the root cause<\/strong> requires manual reverse engineering.<\/li>\n<\/ul>\n\n<p>Our limited resources compound the challenge: with only one engineer leading the project, most issues had to be either addressed directly or have solutions provided for other teams to fix. Manual decompilation, deobfuscation, and context gathering for each issue are inherently time-consuming, making the investigation cycle slow.<\/p>\n\n<h3 id=\"challenge-2-testing-at-scale\">Challenge 2: Testing at scale<\/h3>\n\n<p>R8 optimization affects every corner of the app. Unlike feature-specific changes, enabling optimization transforms how the entire codebase is compiled, inlined, and optimized. A single misconfiguration or missing keep rule can break seemingly unrelated features across different modules and libraries.<\/p>\n\n<p>When we first enabled R8 optimization, the impact was immediate and widespread: most of the app\u2019s features simply stopped working correctly. This presented us with a deeper problem, not just how to test, but what kind of testing strategy would actually give us confidence to roll out to production.<\/p>\n\n<p>In theory, R8 optimization works reliably with standard codebases that follow Google\u2019s and the community\u2019s best practices. However, the Grab app is a ~10-year-old project at a massive scale. Legacy code patterns, reflection usage, and SDK integrations accumulated over a decade create numerous edge cases.<\/p>\n\n<p>This combination makes comprehensive testing necessary, but at our scale, it\u2019s nearly impossible to execute due to:<\/p>\n\n<ul>\n  <li><strong>Full regression testing<\/strong> would require significant effort from all teams across the organization.<\/li>\n  <li><strong>Quality Assurance (QA) resource constraints<\/strong> make exhaustive testing impractical.<\/li>\n  <li><strong>High-quality bar<\/strong>: At Grab, app stability and zero runtime errors are non-negotiable standards.<\/li>\n<\/ul>\n\n<p>This creates a paradox: we need comprehensive testing precisely because we can\u2019t guarantee standards everywhere, yet the scale makes such testing infeasible.<\/p>\n\n<h3 id=\"challenge-3-slow-feedback\">Challenge 3: Slow feedback<\/h3>\n\n<p>Due to the large scale of the project, compiling a build with R8 optimization enabled on a standard engineering laptop is physically impossible. This created a significant bottleneck: a slow feedback loop where every experimental change required a remote CI build to verify, with each R8-optimized build taking up to 2 hours to complete.<\/p>\n\n<p>Additionally, R8 treats debug and release build types differently. At Grab, we have a QA build for QA testing. This is a debug build type with R8 enabled, pointed to our staging environment. We had to ensure this QA build\u2019s R8 configuration matched our production build exactly. This alignment was critical for catching R8-specific issues during QA testing that would actually reflect production behavior<\/p>\n\n<h2 id=\"our-three-innovation-solution\">Our three-innovation solution<\/h2>\n\n<p>To overcome these three fundamental challenges, we developed a comprehensive strategy centered on targeted innovations that addressed each bottleneck.<\/p>\n\n<h3 id=\"innovation-1-ai-assisted-debugging\">Innovation 1: AI-assisted debugging<\/h3>\n\n<p><strong>Solving challenge 1<\/strong>:<\/p>\n\n<p>How do we speed up the investigation of R8 issues in obfuscated, optimized code at scale? The answer is in emerging AI technology that wasn\u2019t available during our previous attempts.<\/p>\n\n<p><strong>The AI context at Grab<\/strong>:<\/p>\n\n<p>Unlike 2022 and earlier attempts, the landscape had changed dramatically. After the LLM explosion, Grab proactively promoted AI (LLM) usage to boost engineering productivity. Over the past two years, Grab has dedicated 1-2 months annually for engineers to learn how to use AI efficiently. This investment in AI literacy became crucial for this project.<\/p>\n\n<p>This year (2025), my team gained experience building MCP (Model Context Protocol) servers and identified an opportunity: applying this technology to solve the R8 debugging challenge.<\/p>\n\n<p><strong>Our solution<\/strong>:<\/p>\n\n<p>At Grab, we use <strong>GitLab for Continuous Integration and Continuous Delivery (CI\/CD)<\/strong>. To tackle R8 debugging bottlenecks, we built a comprehensive solution combining:<\/p>\n<ul>\n  <li><a href=\"#build-mcp-tools-eliminate-manual-reverse-engineering\">Custom MCP tools<\/a>.<\/li>\n  <li><a href=\"#ai-and-ci-pipeline-workflow\">AI-assisted GitLab CI integration<\/a>.<\/li>\n<\/ul>\n\n<h4 id=\"build-mcp-tools-eliminate-manual-reverse-engineering\">Build MCP tools: Eliminate manual reverse engineering<\/h4>\n\n<ul>\n  <li><strong>Automatic Android Application Package (APK) decompilation<\/strong>: Parse and decompile APKs.<\/li>\n  <li><strong>Stack trace deobfuscation<\/strong>: Automatically map obfuscated traces to source code.<\/li>\n  <li><strong>Class\/method context fetching<\/strong>: Pull relevant decompiled code sections for analysis.<\/li>\n<\/ul>\n\n<h4 id=\"ai-and-ci-pipeline-workflow\">AI and CI pipeline workflow:<\/h4>\n\n<p>We developed a systematic two-phase approach for investigating and fixing each runtime issue, combining AI assistance with parallel testing:<\/p>\n\n<p><strong>Phase 1: MCP server tools for debugging<\/strong><\/p>\n\n<ol>\n  <li><strong>Detect runtime issue<\/strong>: From End-to-End (E2E) tests, QA testing, or crash reports.<\/li>\n  <li><strong>MCP tool orchestrates APK analysis<\/strong>: Coordinates decompilation tools for reverse engineering.<\/li>\n  <li><strong>MCP tool provides decompiled code context<\/strong>: Pulls and decompiles problematic code sections.<\/li>\n  <li><strong>Engineer and AI analysis<\/strong>: The engineer uses AI assistance to analyze the decompiled code context and note down multiple solution approaches.<\/li>\n<\/ol>\n\n<p><strong>Phase 2: GitLab CI integration<\/strong><\/p>\n\n<p>We leveraged the GitLab CLI tool (<a href=\"https:\/\/docs.gitlab.com\/cli\/\"><code class=\"language-plaintext highlighter-rouge\">glab<\/code><\/a>) and instructed AI to use it for interacting with our CI pipeline:<\/p>\n\n<ol>\n  <li><strong>AI creates multiple Merge Requests (MRs)<\/strong>: Using <code class=\"language-plaintext highlighter-rouge\">glab<\/code> CLI, AI creates merge requests for different solution approaches from Phase 1, each triggering CI compilation.<\/li>\n  <li><strong>Track progress<\/strong>: Maintain an MD file as the source of truth for the investigation, containing all notes about the issue (root cause analysis, test cases, test branches, CI build status).<\/li>\n  <li><strong>AI fetches APK from CI<\/strong>: Using <code class=\"language-plaintext highlighter-rouge\">glab<\/code> CLI to retrieve built APKs from completed CI pipelines.<\/li>\n  <li><strong>Verify<\/strong>: Ask AI to use ADB install APK, then manually test the fix.<\/li>\n  <li><strong>Iterate<\/strong>: If issues remain, loop back to step 2 for further analysis.<\/li>\n<\/ol>\n\n<p><strong>Why this worked<\/strong>:<\/p>\n\n<p>Our approach functions as an AI assistant that:<\/p>\n\n<ul>\n  <li><strong>Decodes the obfuscated code<\/strong> automatically.<\/li>\n  <li><strong>Finds the relevant code sections<\/strong> without manual searching.<\/li>\n  <li><strong>Suggests multiple solutions<\/strong> based on the context provided by the MCP tools.<\/li>\n  <li><strong>Creates multiple test branches simultaneously<\/strong> and runs parallel CI builds to test different approaches.<\/li>\n  <li><strong>Tracks everything<\/strong> to ensure no progress is lost on complex investigations.<\/li>\n<\/ul>\n\n<p>Instead of testing solutions one-by-one (waiting 2 hours per build), AI creates multiple MRs in parallel, dramatically accelerating the verification process. Engineers focus on making decisions about which solutions to pursue while the AI handles both the mechanical work and the parallel experimentation.<\/p>\n\n<p><strong>The impact: Accelerating investigation<\/strong><\/p>\n\n<p>While investigating a single R8 issue might still take time, our MCP tools dramatically accelerated critical investigation tasks. Manual tasks that previously took hours (decompilation, deobfuscation, context gathering) were reduced to minutes. Additionally, AI assistance significantly sped up the analysis phase, helping engineers quickly identify patterns, suggest solutions, and explore multiple approaches in parallel, both analytically and through simultaneous CI builds, further accelerating the overall investigation process.<\/p>\n\n<h3 id=\"innovation-2-pragmatic-testing-strategy\">Innovation 2: Pragmatic testing strategy<\/h3>\n\n<p><strong>Solving challenge 2<\/strong>:<\/p>\n\n<p>How do we do testing at scale? How do we validate R8 optimization across a mature codebase containing more than nine million lines of code when comprehensive testing is necessary but impossible? Our solution came from a critical insight about R8 issues at scale.<\/p>\n\n<p><strong>Key insight<\/strong>:<\/p>\n\n<p>From our experience, R8 issues tend to share similar root causes across the codebase. Legacy patterns like reflection usage, parser implementations, and dynamic class loading follow consistent patterns within a large codebase. This insight led to two key advantages:<\/p>\n\n<ul>\n  <li><strong>Fix one, help many<\/strong>: Fixing one place often resolves issues in others.<\/li>\n  <li><strong>Pattern recognition<\/strong>: Once we identify a pattern, we can search the codebase to find similar issues instead of waiting for QA to discover them.<\/li>\n<\/ul>\n\n<p>If we could identify and fix these pattern-based issues, we could address many problems without testing every corner of the app. We decided to start with critical paths and expand from there. This \u201cripple effect\u201d strategy began at the center with the most important flows, then expanded by identifying common root causes and similar patterns across the codebase.<\/p>\n\n<p>With this foundation, we designed a validation pipeline that progressively increased confidence:<\/p>\n\n<p><strong>Progressive, Risk-Based validation strategy<\/strong><\/p>\n\n<ul>\n  <li>\n    <p><strong>Stage 1: E2E tests - pattern discovery phase<\/strong>: Fortunately, we had existing E2E tests covering most critical paths in the project, and they could be executed with R8 optimization enabled. Initially, all E2E tests failed after enabling optimization. This became our opportunity for pattern discovery: we systematically fixed issues and applied our pattern-based approach to resolve similar problems across the project.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Stage 2: QA smoke tests - coverage expansion<\/strong>: After E2E tests stabilized, we requested our QA team to run smoke tests on critical flows, especially those not covered by E2E automation. This caught additional edge cases and validated that the pattern-based fixes we applied were effective across different user journeys. We fixed any issues that appeared during this phase.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Stage 3: Daily QA build enablement - real-world integration<\/strong>: After confirming stability in controlled testing, we made a significant decision: enable R8 optimization in our daily QA build (the build our QA team uses for daily feature testing). This integrated R8 optimization into the normal development workflow without requiring additional testing effort.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Stage 4: Regression testing and Grab Early Access (GEA) - parallel production-scale validation<\/strong>: After confirming stability in daily QA builds, we moved to production-scale validation with two parallel tracks. Every release at Grab includes <strong>regression testing<\/strong> covering all critical paths and new features. With R8 optimization now enabled in the QA build, we ran regression tests using this build for a few weeks, providing sustained validation across multiple release cycles. One week after regression testing, we rolled out to <strong>GEA<\/strong>, Grab\u2019s internal production release channel for Grab employees and partners. While GEA users typically receive features one week before general production rollout, for this R8 optimization project, we extended the GEA phase to 2 weeks, given the significance of the change. With hundreds of daily active users using the app in real-world production conditions during this extended period, we encountered only one remaining R8 issue during the GEA phase. This combination of regression testing and real-world GEA production usage gave us the confidence needed before full production rollout.<\/p>\n  <\/li>\n<\/ul>\n\n<p><strong>Testing Approaches That Don\u2019t Work with R8<\/strong>:<\/p>\n\n<ul>\n  <li><strong>Unit tests<\/strong>: Run on Java JVM, while R8 optimizations affect Android Runtime behavior - fundamentally different environments<\/li>\n  <li><strong>UI tests with R8<\/strong>: Community solutions exist as Gradle plugins, but <a href=\"https:\/\/engineering.grab.com\/how-grab-is-blazing-through-the-super-app-bazel-migration\">our tests run on Bazel<\/a> - complex setup and reliability concerns<\/li>\n<\/ul>\n\n<p><strong>Pattern-based issue resolution<\/strong>:<\/p>\n\n<p>Throughout these validation phases, when we identified R8 issues, we followed a systematic pattern-based resolution process.<\/p>\n\n<ol>\n  <li><strong>Identify the issue<\/strong>: Catch the failure through E2E, QA, or monitoring.<\/li>\n  <li><strong>Find the pattern<\/strong>: Analyze the root cause to identify if it\u2019s a common pattern across the codebase.<\/li>\n  <li><strong>Detect similar instances<\/strong>: Search the entire codebase to find the same pattern across different modules and the internal SDKs.<\/li>\n  <li><strong>Coordinate fixes<\/strong>: Create tickets requesting teams to modify their code to prevent the same issue in their modules.<\/li>\n<\/ol>\n\n<p>This approach required cross team coordination for fixing, but critically, not for testing. The difference is significant: asking teams to fix identified issues in their modules is much more scalable than requiring all teams to perform comprehensive testing upfront.<\/p>\n\n<p><strong>Production rollout results<\/strong>:<\/p>\n\n<p>When we made it to production, only one issue escaped to production. Notably, we had actually detected this issue through our pattern-based approach during testing and created a ticket for the responsible team to fix it. However, with ongoing daily development, the team missed one instance when implementing the fix, which caused the production issue.<\/p>\n\n<p>This demonstrates that while our testing strategy worked effectively, human coordination challenges can still occur at scale. With a project of this scale, having only one small production issue is considered a highly successful rollout.<\/p>\n\n<p>This approach transformed an \u201cimpossible\u201d comprehensive testing problem into a manageable, systematic validation process, reducing what would have been months of coordinated testing effort to days, proving that a smart strategy can overcome resource constraints.<\/p>\n\n<h3 id=\"innovation-3-optimized-feedback-loop\">Innovation 3: Optimized feedback loop<\/h3>\n\n<p><strong>Solving challenge 3<\/strong>:<\/p>\n\n<p>The 2-hour CI builds, and the QA configuration misalignment created a bottleneck for R8 debugging. We addressed this through a comprehensive infrastructure strategy targeting three critical areas:<\/p>\n\n<p><strong>Remote compilation to enable local build and fast feedback loop<\/strong>:<\/p>\n\n<p>At Grab, we used to use <a href=\"https:\/\/github.com\/buildfoundation\/mainframer\">Mainframer<\/a> for remote execution to handle slow performance on local Gradle builds. However, since <a href=\"https:\/\/engineering.grab.com\/how-grab-is-blazing-through-the-super-app-bazel-migration\">migrating to Bazel<\/a> (only for the debug build without R8 enabled), we removed the large-scale Mainframer setup for every engineer. From that experience, to tackle the local compilation blocker for R8 builds, we decided to deploy a new Mainframer setup, a much smaller one with one powerful EC2 instance, serving as a solution for local compilation in a short time.<\/p>\n\n<p>This targeted deployment transformed physically impossible local R8 builds into a manageable remote process, enabling engineers to test R8 changes without requiring powerful local hardware.<\/p>\n\n<p>The performance improvement was substantial: <strong>from up to 2 hours in CI to around 1 hour with Mainframer<\/strong> - a ~50% reduction that enabled rapid iteration cycles essential for R8 debugging.<\/p>\n\n<p><strong>QA build configuration alignment<\/strong>:<\/p>\n\n<p>We eliminated the critical gap between QA and production R8 behavior by aligning build configurations exactly. The key change was setting <code class=\"language-plaintext highlighter-rouge\">debuggable = false<\/code> for QA builds while maintaining the environment configuration:<\/p>\n\n<div class=\"language-groovy highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"n\">buildTypes<\/span> <span class=\"o\">{<\/span>\n    <span class=\"n\">debug<\/span> <span class=\"o\">{<\/span>\n        <span class=\"k\">if<\/span> <span class=\"o\">(<\/span><span class=\"n\">isQaBuild<\/span><span class=\"o\">())<\/span> <span class=\"o\">{<\/span>\n            <span class=\"n\">minifyEnabled<\/span> <span class=\"kc\">true<\/span>\n            <span class=\"n\">shrinkResources<\/span> <span class=\"kc\">true<\/span>\n            <span class=\"n\">debuggable<\/span> <span class=\"kc\">false<\/span>\n            <span class=\"n\">buildConfigField<\/span> <span class=\"s1\">'boolean'<\/span><span class=\"o\">,<\/span> <span class=\"s1\">'DEBUG'<\/span><span class=\"o\">,<\/span> <span class=\"s1\">'true'<\/span>\n            <span class=\"o\">...<\/span>\n        <span class=\"o\">}<\/span>\n    <span class=\"o\">}<\/span>\n<span class=\"o\">}<\/span>\n<\/code><\/pre><\/div><\/div>\n\n<p>Since, from our understanding, R8 applies different optimization levels based on the <code class=\"language-plaintext highlighter-rouge\">debuggable<\/code> flag, with more aggressive optimizations when debuggable=false, this ensured our QA testing reflected actual production R8 processing. We preserved <code class=\"language-plaintext highlighter-rouge\">DEBUG = true<\/code> to maintain staging environment routing while achieving R8 parity.<\/p>\n\n<p>This infrastructure foundation was essential, providing faster feedback loops that accelerated verification and investigation, while the QA build configuration matching production exactly was critical for catching real production issues during testing.<\/p>\n\n<h2 id=\"a-lucky-break\">A lucky break<\/h2>\n\n<p>Perhaps most surprising: the R8 flakiness issue that blocked us in 2022 (<a href=\"https:\/\/issuetracker.google.com\/u\/0\/issues\/240077160\">Issue #240077160<\/a>) appears to have been resolved by the R8 team. We encountered no build determinism issues during this attempt, which significantly smoothed our path to production.<\/p>\n\n<h2 id=\"results\">Results<\/h2>\n\n<p>After ~10 weeks of systematic implementation <strong>led by one engineer<\/strong> collaborating with multiple teams across the organization, we achieved substantial improvements using <strong>Android Gradle Plugin 8.6.X<\/strong>:<\/p>\n\n<ul>\n  <li><strong>Stability<\/strong>: Around 25% reduction in ANR rates.<\/li>\n  <li><strong>App size:<\/strong> A 16% decrease in download size on our reference device (zipped APK).<\/li>\n  <li><strong>Performance:<\/strong> Nearly 27% improvement in startup time. <em>An interesting discovery: After enabling R8 optimization, we saw ~12% app startup improvement. However, during our analysis, we discovered that our existing Baseline and Startup Profiles implementation was incorrect. We reimplemented it properly, and the combination of R8 optimization plus the corrected profiles delivered the full 27% improvement.<\/em><\/li>\n<\/ul>\n\n<p>These results exceeded our initial targets and validated the significant effort required to enable R8 optimization at scale.<\/p>\n\n<h2 id=\"whats-next\">What\u2019s next<\/h2>\n\n<p>Our journey doesn\u2019t end here. We\u2019re exploring several areas for continued optimization:<\/p>\n\n<ul>\n  <li><strong>R8 full mode<\/strong>: More extreme\/aggressive optimization than the current mode for additional performance benefits.<\/li>\n  <li><strong>Revisit R8 keep rules<\/strong>: Clean up unnecessary rules that prevent optimization, and implement a governance solution to guardrail R8 rules in our pre-merge CI pipeline.<\/li>\n<\/ul>\n\n<h2 id=\"conclusion\">Conclusion<\/h2>\n\n<p>Enabling R8 optimization for the Grab Android app at scale required innovation beyond traditional debugging approaches. By combining AI-assisted debugging, pragmatic testing strategies, and infrastructure investment, we overcame challenges that had blocked previous attempts for many years.<\/p>\n\n<p>For other teams considering R8 optimization at scale: the journey is challenging, but the results speak for themselves. With the right tools, strategy, and team collaboration, it\u2019s achievable even for the largest codebases.<\/p>\n\n<h2 id=\"join-us\">Join us<\/h2>\n\n<p>Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility, and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries: Cambodia, Indonesia, Malaysia, Myanmar, the Philippines, Singapore, Thailand, and Vietnam. Grab enables millions of people every day to order food or groceries, send packages, hail a ride or taxi, pay for online purchases, or access services such as lending and insurance, all through a single app. We operate supermarkets in Malaysia under Jaya Grocer and Everrise, which enables us to bring the convenience of on-demand grocery delivery to more consumers in the country. As part of our financial services offerings, we also provide digital banking services through GXS Bank in Singapore and GXBank in Malaysia. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line. We aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.<\/p>\n\n<p>Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, <a href=\"https:\/\/grab.careers\/\">join our team today<\/a>!<\/p>\n","pubDate":"Thu, 12 Mar 2026 00:23:00 +0000","link":"https:\/\/engineering.grab.com\/r8-optimization-at-scale-with-ai-assisted-debugging","guid":"https:\/\/engineering.grab.com\/r8-optimization-at-scale-with-ai-assisted-debugging","category":["AI","Engineering"]},{"title":"Reclaiming Terabytes: Optimizing Android image caching with TLRU","description":"<h2 id=\"introduction\">Introduction<\/h2>\n\n<p>In a previous post, we discussed <a href=\"https:\/\/engineering.grab.com\/project-bonsai\">Project Bonsai<\/a>, our initiative to reduce the Grab app\u2019s download size. We successfully reduced the Android Application Package (APK) download size by 26%. This reduction offers a substantial advantage: it minimizes download friction, allowing users to download the app, even on slower networks. However, the battle for storage doesn\u2019t end after installation.<\/p>\n\n<p>The Grab app includes a wide range of features and workflows that heavily depend on image content, particularly in services like transportation and e-commerce. Although some images are packaged within the app binary, a large majority are downloaded from Grab\u2019s server at runtime. To optimize the app\u2019s performance and minimize server expenses, the downloaded images are cached in the app\u2019s storage. This reduces both load times and traffic to Grab\u2019s image server, resulting in better user experience and lower costs. Although we use Least Recently Used (LRU) cache to manage storage, many images can remain in the app storage for extended periods, even if they are no longer relevant.<\/p>\n\n<p>This blog details how we addressed this challenge in our Grab Android app by evolving our standard LRU cache into a <strong>Time-Aware Least Recently Used (TLRU)<\/strong> cache. This evolution allows us to reclaim storage space without compromising user experience or increasing server costs.<\/p>\n\n<h2 id=\"understanding-lru-cache-limitations\">Understanding LRU cache limitations<\/h2>\n\n<p><em>Note: In this article, when \u201ccache\u201d or \u201cimage cache\u201d is mentioned, it specifically refers to <strong>disk cache<\/strong>, which is the persistent storage on the device\u2019s file system, rather than in-memory cache.<\/em><\/p>\n\n<p>The Grab Android app uses the <a href=\"https:\/\/github.com\/bumptech\/glide\">Glide library<\/a> as its primary image loading framework. Glide provides excellent features for efficient image loading, caching, and display. At its core, by default, Glide uses a cloned version of <a href=\"https:\/\/github.com\/bumptech\/glide\/blob\/master\/third_party\/disklrucache\/src\/main\/java\/com\/bumptech\/glide\/disklrucache\/DiskLruCache.java\">Jake Wharton\u2019s DiskLruCache<\/a> for disk-based caching.<\/p>\n\n<p>To prevent unlimited cache growth, we configured the LRU cache with a maximum size limit of 100 MB. However, our analytics revealed that the 90th percentile (P90) of users were consistently reaching this 100 MB limit, meaning the cache was constantly at capacity. Conversely, for users whose cache hadn\u2019t yet reached the 100 MB threshold, images were never removed, even if they were outdated by several months and no longer relevant.<\/p>\n\n<p>Our analysis revealed that image caching was a major contributor to the app\u2019s disk footprint, and without proactive management, this would only worsen as we continued adding features and content to Grab\u2019s superapp.<\/p>\n\n<h3 id=\"how-disklrucache-works\">How DiskLruCache works<\/h3>\n\n<p>The LRU cache algorithm manages storage by maintaining entries in access order and automatically evicting the oldest unused entries when space is needed.<\/p>\n\n<p><strong>Figure 1 and 2<\/strong> illustrates how LRU cache trimming works. These diagrams present an LRU cache with a maximum size of 100 MB containing three cache entries totaling 95 MB. When a new 25 MB cache entry is added, it exceeds the cache\u2019s maximum size.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/image-caching\/figure-1.png\" alt=\"\" style=\"width:60%\" \/><figcaption align=\"middle\">Figure 1. A new cache entry is added to an LRU cache that's near its 100 MB capacity, exceeding the limit.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/image-caching\/figure-2.png\" alt=\"\" style=\"width:60%\" \/><figcaption align=\"middle\">Figure 2. The LRU cache automatically trims the least recently used entry to bring the total size back within the 100 MB limit.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h2 id=\"the-challenge\">The challenge<\/h2>\n\n<p>While DiskLruCache efficiently manages cache size, it has a critical limitation: It does not account for the age of cached content. Due to the lack of time-based eviction rules, the cache does not remove outdated entries until it exceeds the maximum size. This meant that stale promotional images, images from infrequently used features, and outdated content continued occupying disk space indefinitely, as long as the cache remained under the size limit.<\/p>\n\n<p>What we needed was a cache mechanism that could:<\/p>\n\n<ul>\n  <li>\n    <p><strong>Maintain LRU cache benefits<\/strong>: Preserve efficient caching for users who actively use the app features.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Remove stale content based on time:<\/strong> Automatically identify and evict outdated entries, not just rely on storage constraints.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Protect user experience<\/strong>: Ensure images still load quickly without cache misses.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Keep server costs low<\/strong>: Avoid increased server requests from premature cache evictions.<\/p>\n  <\/li>\n<\/ul>\n\n<p>These requirements pointed us toward an enhanced LRU approach. We needed to enhance LRU with time awareness while preserving its proven size-management capabilities.<\/p>\n\n<h2 id=\"tlru-cache-the-solution\">TLRU cache: The solution<\/h2>\n\n<p>To address these limitations, we developed a new LRU cache variant named TLRU that extends traditional LRU by introducing time-based eviction while maintaining size-based cache management.<\/p>\n\n<h3 id=\"core-tlru-attributes\">Core TLRU attributes<\/h3>\n\n<p>TLRU introduces three core attributes to manage cache entries:<\/p>\n\n<ul>\n  <li>\n    <p><strong>Time-To-Live (TTL)<\/strong>: A threshold that determines when a cache entry is considered expired. An entry is expired if <code class=\"language-plaintext highlighter-rouge\">(current_time - last_accessed) &gt; TTL<\/code>. Expired entries are automatically removed during cache operations.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Minimum cache size threshold<\/strong>: A safety net that ensures a baseline set of essential images always remains cached, even when entries expire. This prevents complete cache deletion when users haven\u2019t used the app for more than the TTL period, maintaining app responsiveness for returning users instead of starting with an empty cache.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Maximum cache size<\/strong>: Inherited from LRU cache, this enforces the upper storage limit (100 MB in our case). When exceeded, the least recently used entries are evicted regardless of their age.<\/p>\n  <\/li>\n<\/ul>\n\n<p>Together, these attributes ensure TLRU maintains optimal cache size by managing both storage constraints and temporal relevance, reducing app disk footprint without impacting user experience.<\/p>\n\n<h3 id=\"tlru-cache-trimming-in-action\">TLRU cache trimming in action<\/h3>\n\n<p>To better understand how TLRU works in practice, let\u2019s walk through a comprehensive example. The following diagrams demonstrate how the TLRU cache evaluates and trims entries based on both time and size constraints.<\/p>\n\n<p>Our TLRU cache configuration includes:<\/p>\n\n<ul>\n  <li>\n    <p><strong>Maximum cache size<\/strong>: 100 MB - the storage limit that triggers size-based eviction.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Minimum size threshold<\/strong>: 20 MB - the safety net that protects essential cached content.<\/p>\n  <\/li>\n  <li>\n    <p><strong>TTL<\/strong>: 20 days - entries older than this are considered expired.<\/p>\n  <\/li>\n<\/ul>\n\n<p>Each cache entry includes <code class=\"language-plaintext highlighter-rouge\">last_accessed<\/code> metadata containing the timestamp of its most recent access. When an entry is first created, this timestamp is initialized with the creation time. This timestamp determines whether an entry has expired based on the formula:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>Entry is expired if: (current_time - last_accessed) &gt; TTL\n<\/code><\/pre><\/div><\/div>\n\n<p>For this walkthrough, we\u2019ll use <code class=\"language-plaintext highlighter-rouge\">current_time = Day 100<\/code> as our starting point.<\/p>\n\n<h4 id=\"initial-cache-state-analysis\">Initial cache state analysis<\/h4>\n\n<p>Our example begins with three existing cache entries totaling 95 MB, approaching the 100 MB limit:<\/p>\n\n<ul>\n  <li><strong>Item 1<\/strong> (8 MB, last accessed Day 82): At 18 days old<\/li>\n  <li><strong>Item 2<\/strong> (30 MB, last accessed Day 81): At 19 days old<\/li>\n  <li><strong>Item 3<\/strong> (57 MB, last accessed Day 80): At exactly 20 days old, valid at the TTL threshold<\/li>\n<\/ul>\n\n<p>When a new 10 MB item is added on Day 100, the cache grows to 105 MB, exceeding our 100 MB limit and triggering size-based eviction.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/image-caching\/figure-3.png\" alt=\"\" style=\"width:60%\" \/><figcaption align=\"middle\">Figure 3. Initial TLRU cache state and the impact of adding new entries.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h4 id=\"size-based-eviction-process\">Size-based eviction process<\/h4>\n\n<p>When the cache exceeds its 100 MB limit, TLRU applies traditional LRU eviction logic. <strong>Item 3<\/strong> is selected for eviction because:<\/p>\n\n<ul>\n  <li>\n    <p>It is the least recently used entry (oldest access time).<\/p>\n  <\/li>\n  <li>\n    <p>This demonstrates TLRU maintaining LRU behavior for size enforcement, regardless of expiration status.<\/p>\n  <\/li>\n<\/ul>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/image-caching\/figure-4.png\" alt=\"\" style=\"width:60%\" \/><figcaption align=\"middle\">Figure 4. Size-based eviction removes the least recently used entry to enforce storage limits.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h4 id=\"time-based-eviction-process\">Time-based eviction process<\/h4>\n\n<p>Five days later (Day 105), Item 1 and Item 2 cross the expiration threshold:<\/p>\n\n<p>Despite operating well below the size limit (48 MB &lt; 100 MB), TLRU evaluates expired entries for time-based eviction. Item 2 is removed because it\u2019s expired, and the cache remains above the minimum threshold. Item 1, although also expired, is protected by the minimum threshold rule; removing it would leave only 10 MB, which falls below the 20 MB minimum.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/image-caching\/figure-5.png\" alt=\"\" style=\"width:60%\" \/><figcaption align=\"middle\">Figure 5. Time-based eviction and minimum threshold protection working together.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h4 id=\"tlru-behavior-summary\">TLRU behavior summary<\/h4>\n\n<p>This comprehensive example demonstrates TLRU\u2019s three core mechanisms:<\/p>\n\n<ul>\n  <li>\n    <p><strong>Size-based eviction<\/strong>: Enforces storage limits using traditional LRU ordering (Item 3 removed despite being valid).<\/p>\n  <\/li>\n  <li>\n    <p><strong>Time-based eviction<\/strong>: Proactively removes expired content when safe to do so (Item 2 removed for age).<\/p>\n  <\/li>\n  <li>\n    <p><strong>Minimum threshold protection<\/strong>: Preserves essential cache functionality even with expired content (Item 1 protected despite expiration).<\/p>\n  <\/li>\n<\/ul>\n\n<h2 id=\"technical-implementation\">Technical implementation<\/h2>\n\n<p>Rather than building an image cache from scratch, we recognized that Glide\u2019s bundled <a href=\"https:\/\/github.com\/bumptech\/glide\/blob\/master\/third_party\/disklrucache\/src\/main\/java\/com\/bumptech\/glide\/disklrucache\/DiskLruCache.java\">DiskLruCache<\/a> (originally from <a href=\"https:\/\/github.com\/JakeWharton\/DiskLruCache\">Jake Wharton\u2019s implementation<\/a>) already provided a mature, battle-tested foundation. This implementation is widely adopted across the Android ecosystem and handles complex edge cases like crash recovery, thread safety, and performance optimization that would require substantial effort to replicate.<\/p>\n\n<p>Our approach was pragmatic, we cloned Glide\u2019s DiskLruCache and extended it to support time-based expiration. This strategy allowed us to inherit the existing reliability while adding the temporal awareness we needed for TLRU.<\/p>\n\n<p>To understand our implementation, we\u2019ll first explore how the original DiskLruCache works, then dive into the specific modifications we made to transform it into TLRU.<\/p>\n\n<h3 id=\"understanding-disklrucache\">Understanding DiskLruCache<\/h3>\n\n<p><a href=\"https:\/\/github.com\/bumptech\/glide\/blob\/master\/third_party\/disklrucache\/src\/main\/java\/com\/bumptech\/glide\/disklrucache\/DiskLruCache.java\">DiskLruCache<\/a> provides a simple cache solution that stores key-value pairs on disk, while also keeping track of their usage to evict the least recently used items when the cache reaches its maximum size. Here is an overview of how DiskLruCache is implemented:<\/p>\n\n<ul>\n  <li>\n    <p><strong>Data storage<\/strong>: DiskLruCache stores its data in a specified directory, creating files for each entry.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Key-based access<\/strong>: Each entry has a unique key (typically a hash generated by the image loader) used to create the filename of the cached entry.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Atomic writes<\/strong>: When adding an entry, it creates a temporary file and writes the data to it. If successful, it atomically renames the temporary file to the final filename.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Cache retrieval<\/strong>: When reading from the cache, it looks up the key, opens the corresponding file on disk, and returns an InputStream to read the data.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Size management<\/strong>: It maintains a maximum cache size limit. When exceeded, it removes the least recently used items until it is within the specified limit.<\/p>\n  <\/li>\n<\/ul>\n\n<p>The central component that enables this functionality is the journaling mechanism, detailed in the following section.<\/p>\n\n<h4 id=\"the-journaling-mechanism\">The journaling mechanism<\/h4>\n\n<p>The journaling mechanism in DiskLruCache is designed to maintain consistency and prevent data corruption in the cache. The journal file records all cache operations, such as adding, updating, or removing entries. The journaling mechanism is essential in rebuilding the cache metadata during initialization and performing journal compaction to clean up the journal file.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/image-caching\/figure-6.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 6. Example of the journaling mechanism in DiskLruCache.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<p><strong>Journal file format<\/strong>:<\/p>\n\n<p>The journal file is a plain text file that records cache operations line by line.<\/p>\n\n<ul>\n  <li>\n    <p>DIRTY: Indicates the start of a write operation to a cache entry.<\/p>\n  <\/li>\n  <li>\n    <p>CLEAN: Indicates that a cache entry was successfully written and closed.<\/p>\n  <\/li>\n  <li>\n    <p>REMOVE: Indicates that a cache entry was removed from the cache.<\/p>\n  <\/li>\n  <li>\n    <p>READ: Indicates that a cache entry was read.<\/p>\n  <\/li>\n<\/ul>\n\n<p>To gain a comprehensive understanding of the journal file format, refer to the following detailed explanation.<\/p>\n\n<ul>\n  <li>\n    <p><strong>Key information<\/strong>: Each line includes the key and other relevant information, such as the lengths of the cache entry files.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Cache initialization<\/strong>: Upon initialization, DiskLruCache reads the journal file to reconstruct cache metadata in memory, determining file associations, lengths, and access order. If the journal file is corrupted or missing, the cache will be considered invalid, and DiskLruCache will remove all cache files and start fresh.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Cache operations and journal updates<\/strong>: When performing cache operations like adding, updating, or removing entries, DiskLruCache appends corresponding lines to the journal file, recording the operation details. For example, when starting to write a new cache entry, it writes a <code class=\"language-plaintext highlighter-rouge\">DIRTY<\/code> line with the key, and when the write is successful, it appends a <code class=\"language-plaintext highlighter-rouge\">CLEAN<\/code> line with the key and lengths.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Synchronization and consistency:<\/strong> DiskLruCache uses synchronization to ensure that only one thread can access the cache at a time, preventing race conditions and data corruption. It also uses a journalWriter (<code class=\"language-plaintext highlighter-rouge\">java.io.Writer<\/code>) instance to append operations to the journal file, ensuring that the file is always in a consistent state.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Journal compaction<\/strong>: Over time, the journal file may grow with redundant operations. DiskLruCache periodically compacts the journal by creating a new file that contains only the current cache metadata, then atomically replaces the old file. The compaction process usually happens when the journal file size exceeds a certain threshold.<\/p>\n  <\/li>\n<\/ul>\n\n<p>DiskLruCache ensures consistency and prevents data corruption by using this journaling mechanism, making it a reliable solution for disk-based caching.<\/p>\n\n<h3 id=\"modifying-disklrucache-for-tlru\">Modifying DiskLruCache for TLRU<\/h3>\n\n<p>With a solid understanding of DiskLruCache\u2019s architecture, we can now explore how we extended it to implement the TLRU cache attributes defined earlier.<\/p>\n\n<p>Three primary modifications to DiskLruCache:<\/p>\n\n<ul>\n  <li><a href=\"#tracking-last-access-time\">Tracking last access time<\/a><\/li>\n  <li><a href=\"#time-based-eviction-logic\">Time-based eviction logic<\/a><\/li>\n  <li><a href=\"#backward-compatible-migration\">Backward-compatible migration<\/a><\/li>\n<\/ul>\n\n<h4 id=\"tracking-last-access-time\">Tracking last access time<\/h4>\n\n<p>To support time-based eviction, the cache needs to track when each entry was last accessed. This information m  ust persist across app restarts, so it\u2019s stored in the journal file itself.<\/p>\n\n<p>Modified journal format:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>READ [Cache-Key] [Access-Timestamp]\nCLEAN [Cache-Key] [File-Size]-[Access-Timestamp]\n<\/code><\/pre><\/div><\/div>\n\n<p>The timestamps are added to <code class=\"language-plaintext highlighter-rouge\">READ<\/code> and <code class=\"language-plaintext highlighter-rouge\">CLEAN<\/code> operations:<\/p>\n\n<ul>\n  <li>\n    <p><code class=\"language-plaintext highlighter-rouge\">READ<\/code> entries record when a cache entry is accessed, updating its last-access time.<\/p>\n  <\/li>\n  <li>\n    <p><code class=\"language-plaintext highlighter-rouge\">CLEAN<\/code> entries record the creation time when a new entry is successfully added to the cache.<\/p>\n  <\/li>\n<\/ul>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/image-caching\/figure-7.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 7. Example of a TLRU journal file.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h4 id=\"time-based-eviction-logic\">Time-based eviction logic<\/h4>\n\n<p>The TLRU cache leverages the existing LRU ordering to optimize expiration checking. For each cache operation, it checks if the least recently accessed entry has expired before proceeding with time-based trimming.<\/p>\n\n<p>The diagram below shows how the TLRU cache makes the decision to remove the cache entries.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/image-caching\/figure-8.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 8. TLRU eviction decision flow - evaluating cache entries based on time expiration and size constraints.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<p>The algorithm leverages the sorted nature of the cache: if the least recently accessed entry hasn\u2019t expired, no other entries need checking. If it has expired, the cache trim operation walks through entries from oldest to newest, removing all expired ones.<\/p>\n\n<h4 id=\"backward-compatible-migration\">Backward-compatible migration<\/h4>\n\n<p>With an extensive user base, invalidating existing cached images would cause millions of users to experience poor performance while creating massive server traffic spikes and infrastructure costs.<\/p>\n\n<p>One of the challenges was retrieving last-access timestamps from existing LRU entries, as file system APIs do not offer reliable access time data. Our solution was to set the last-access time of all existing entries to the migration timestamp. This approach preserves all cached content and establishes a consistent baseline, although it necessitates waiting one TTL period to realize the full benefits of eviction.<\/p>\n\n<p>We also ensured bidirectional compatibility - the original LRU implementation can read TLRU journal files by ignoring timestamp suffixes, enabling safe rollbacks if needed.<\/p>\n\n<p>Upon completing our TLRU implementation, we focused on determining optimal values for the three core attributes: <strong>TTL duration<\/strong>, <strong>minimum threshold<\/strong>, and <strong>maximum cache size<\/strong>. These parameters are crucial for balancing storage optimization and cache performance, requiring careful tuning based on real user behavior.<\/p>\n\n<h2 id=\"finding-optimal-configuration-values\">Finding optimal configuration values<\/h2>\n\n<p>Finding optimal configuration values requires systematic experimentation and data-driven decision-making. Controlled experiments to compare the cache hit ratio with baseline LRU performance must be conducted.<\/p>\n\n<p><em>Note: Cache hit ratio, our key success metric, gauges efficiency by the percentage of requests served from cache versus requiring server downloads. Lower ratios lead to higher server costs and increased user data consumption.<\/em><\/p>\n\n<p>Our success criteria is for a cache hit ratio decrease of no more than 3 percentage points (pp) during the transition to TLRU. For instance, a decrease from 59% to 56% hit ratio would result in 7% increase in server requests. This threshold balances storage optimization with acceptable performance impact.<\/p>\n\n<p>To mitigate potential server cost impact from our maximum acceptable 3 pp cache hit ratio drop, we worked with the server team to optimize image delivery infrastructure, enabling a confident TLRU rollout without infrastructure cost concerns.<\/p>\n\n<h2 id=\"impact-and-results\">Impact and results<\/h2>\n\n<p>After fully rolling out TLRU to production, we significantly optimized storage while preserving user experience. Post-implementation stabilization, the P95 total app size reduced by approximately 50 MB. This meant that 95% of our users experienced storage reduction up to 50 MB, with the top 5% seeing even greater savings.<\/p>\n\n<p>With over 100 million downloads of the Grab Android app, even conservative estimates show terabytes of storage reclaimed across all user devices worldwide. This translates to better device performance, especially on low-end devices, and improved user satisfaction.<\/p>\n\n<p>Critically, we maintained our success criteria: cache hit ratio stayed within target thresholds (no more than 3 pp decrease), with no increase in infrastructure costs. The seamless migration preserved all existing cache data without disruption.<\/p>\n\n<h2 id=\"conclusion\">Conclusion<\/h2>\n\n<p>At Grab, we believe that every byte matters. Our users trust us with their device storage, and we take that responsibility seriously. The TLRU implementation exemplifies our commitment to user experience. We don\u2019t just build features, we optimize them to ensure our app respects our users\u2019 devices. The petabytes of storage reclaimed across millions of devices aren\u2019t just a technical achievement; it\u2019s a reflection of our dedication to creating a lighter, faster, more respectful mobile experience.<\/p>\n\n<p>The implementation demonstrates that meaningful improvements can be achieved through thoughtful modifications to existing, well-tested libraries. Our focus on backward compatibility and safe migration ensured zero disruption for Grab\u2019s users, proving that user experience and technical innovation can coexist.<\/p>\n\n<h2 id=\"join-us\">Join Us<\/h2>\n\n<p>Grab is Southeast Asia\u2019s leading superapp, serving over 900 cities across eight countries (Cambodia, Indonesia, Malaysia, Myanmar, the Philippines, Singapore, Thailand, and Vietnam). Through a single platform, millions of users access mobility, delivery, and digital financial services, including ride-hailing, food delivery, payments, lending, and digital banking via GXS Bank and GXBank. Founded in 2012, Grab\u2019s mission is to drive Southeast Asia forward by creating economic empowerment for everyone while delivering sustainable financial performance and positive social impact.<\/p>\n\n<p>Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, <a href=\"https:\/\/grab.careers\/\">join our team today<\/a>!<\/p>\n","pubDate":"Fri, 06 Mar 2026 00:23:00 +0000","link":"https:\/\/engineering.grab.com\/reclaiming-tetabytes-optimizing-android-image-caching-with-tlru","guid":"https:\/\/engineering.grab.com\/reclaiming-tetabytes-optimizing-android-image-caching-with-tlru","category":["App disk","Disk size","Optimization","Scalability","Engineering"]},{"title":"Cursor at Grab: Adoption and impact","description":"<h2 id=\"adoption-overview\">Adoption overview<\/h2>\n\n<p>The illustration below encapsulates how Cursor is scaled across Grab, achieving rapid and widespread adoption that accelerated software development and empowered non-technical teams to build solutions.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/cursor-at-grab\/cursor-figure-1.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 1: Adoption overview of AI tool Cursor in Grab.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h3 id=\"multi-tool-strategy\">Multi-tool strategy<\/h3>\n\n<p>Grab embraces a multi-tool strategy for AI coding assistants. Rather than committing to a single solution, we experiment with multiple tools simultaneously, allowing us to compare outcomes and adopt what works. This approach keeps us flexible in a space that evolves quickly. We covered this philosophy in a <a href=\"https:\/\/www.grab.com\/sg\/inside-grab\/stories\/beyond-one-size-fits-all-why-grab-embraces-multiple-ai-coding-assistants\/\">previous post<\/a>.<\/p>\n\n<h3 id=\"growth\">Growth<\/h3>\n\n<p>We introduced Cursor in late 2024 as one of several tools in our AI engineering toolkit. Adoption grew quickly: 98% of Tech Grabbers became monthly active users, and about 75% use it weekly. For comparison, Google\u2019s <a href=\"https:\/\/services.google.com\/fh\/files\/misc\/2025_state_of_ai_assisted_software_development.pdf\">2025 State of AI-Assisted Software Development<\/a> report highlights that even among high-performing teams, AI coding tool adoption seldom surpasses 70%. Notably, Cursor\u2019s appeal extended beyond engineering, with non-technical teams incorporating it into their workflows.<\/p>\n\n<p>A standout metric is Cursor\u2019s suggestion acceptance rate, which is around 50%, surpassing the industry average of 30%. This indicates two key insights: first, the suggestions are sufficiently relevant for engineers to accept them half of the time; second, engineers maintain a critical review process rather than accepting suggestions indiscriminately. We attribute this relevance to continuous feedback loops and environment-specific tuning, ensuring suggestions remain aligned with Grab\u2019s codebase and conventions.<\/p>\n\n<h2 id=\"extent-of-adoption\">Extent of adoption<\/h2>\n\n<p>Raw adoption figures don\u2019t provide the complete picture. We aimed to determine whether engineers were truly incorporating Cursor into their daily workflows or merely experimenting with it sporadically.<\/p>\n\n<p>The data indicates genuine integration. Approximately half of Cursor users engage with it 10 or more consecutive days each month, with some teams achieving full adoption. Over a third of merge requests now incorporate Cursor in some capacity. Engineers actively share tips and workflows via a dedicated Slack channel, fostering an organic knowledge base.<\/p>\n\n<p>Across various teams, we\u2019ve observed significant transitions from light usage to moderate and power user levels over the past six months.<\/p>\n\n<h2 id=\"engineer-utilization-patterns\">Engineer utilization patterns<\/h2>\n\n<p>The most common patterns we see are unit test generation, code refactoring, cross-repository navigation, bug fixing, and automation of routine tasks like API scaffolding or commit messages.<\/p>\n\n<p>Test generation is particularly popular. Writing tests manually is tedious, and Cursor\u2019s ability to generate and iteratively refine tests has become a standard part of many engineers\u2019 workflows. Cross-repository navigation helps with onboarding and context-switching: engineers can ask Cursor questions about unfamiliar codebases rather than hunting through documentation.<\/p>\n\n<p>Qualitative feedback confirms what the adoption numbers suggest: tasks that took a full day to complete now take hours. Engineers report tackling refactors and test additions they would have otherwise skipped due to time pressure. Cursor doesn\u2019t just speed up existing work; it makes previously impractical work feasible.<\/p>\n\n<h2 id=\"integration-with-grabs-stack\">Integration with Grab\u2019s stack<\/h2>\n\n<p>Integrating Cursor effectively at Grab required custom tooling. We built solutions for monorepo indexing to handle Grab\u2019s scale and to distribute preconfigured rules that align Cursor\u2019s suggestions with Grab-specific coding conventions. This integration ensures that Cursor understands our environment rather than offering generic suggestions.<\/p>\n\n<h2 id=\"whats-next\">What\u2019s next<\/h2>\n\n<p>Cursor is one tool in a broader toolkit. Our multi-tool strategy means we\u2019re also investing in terminal-based workflows and <a href=\"https:\/\/engineering.grab.com\/the-birth-of-grab-gpt\">GrabGPT<\/a> for internal knowledge retrieval. Different tools suit different workflows. The aim is to empower users, not to restrict them.<\/p>\n\n<p>Beyond engineering, we\u2019re expanding AI-assisted development to new personas. Our AI Upskilling workshops have trained several hundred Grabbers across five countries, including executive committee members and senior leaders who built and deployed their own apps. Non-engineers in Financial Planning and Analysis (FP&amp;A), Operations, and regional teams are now building tools to solve their own pain points.<\/p>\n\n<p>Our product design team has launched an initiative empowering designers to directly implement production fixes. Designers have successfully merged hundreds of merge requests, often with same-day turnaround, facilitating quicker iterations on UI fixes without the engineering queue delay. This process requires designers to be trained in Git fundamentals prior to gaining access, with initial reviews conducted by design managers.<\/p>\n\n<p>Cursor has become part of daily work at Grab. But adoption is only half the question \u2014 the other half is impact. We\u2019ve been running a parallel effort to measure productivity effects rigorously, using fixed-effects regression to isolate Cursor\u2019s contribution from other factors. Early findings show a dose-response relationship: productivity gains scale with usage intensity, and the effects hold up to statistical scrutiny.<\/p>\n\n<p>We will address the measurement methodology and present our findings in a subsequent post.<\/p>\n\n<h2 id=\"join-us\">Join us<\/h2>\n\n<p>Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people everyday to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line \u2013 we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.<\/p>\n\n<p>Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, <a href=\"https:\/\/grb.to\/cursoratgrab1\">join our team<\/a> today!<\/p>\n","pubDate":"Thu, 29 Jan 2026 00:23:00 +0000","link":"https:\/\/engineering.grab.com\/cursor-at-grab-adoption-and-impact","guid":"https:\/\/engineering.grab.com\/cursor-at-grab-adoption-and-impact","category":["AI","Engineering"]},{"title":"Docker lazy loading at Grab: Accelerating container startup times","description":"<h2 id=\"introduction\">Introduction<\/h2>\n\n<p>At Grab, we\u2019ve been exploring ways to dramatically reduce container startup times for our data platforms. Large container images for services like Airflow and Spark Connect were taking minutes to download, causing slow cold starts and poor auto-scaling performance. This blog post shares our journey implementing Docker image lazy loading using eStargz and Seekable OCI (SOCI) technologies, the results we achieved, and the lessons learned along the way.<\/p>\n\n<h2 id=\"results-the-numbers-speak-for-themselves\">Results: The numbers speak for themselves<\/h2>\n\n<h3 id=\"benchmark-results\">Benchmark results<\/h3>\n\n<p>Our initial testing on fresh nodes (nodes without cached images) showed dramatic improvements in image pull times as shown in <strong>Figure 1<\/strong>.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/docker-lazy-loading\/figure-1.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 1. Table of results.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<p>The key advantage of lazy loading is the reduction in image pull time, especially on \u201cfresh\u201d nodes that do not have the image cached. By analyzing detailed pod events, we can see the precise impact of using the stargz snapshotter.<\/p>\n\n<p>During our SOCI benchmark testing, we observed an important distinction between SOCI and eStargz: <strong>SOCI maintains the same application startup time as standard images, while eStargz takes longer<\/strong>. For example, with Airflow, both overlayFS and SOCI achieved 5.0 seconds startup time, while eStargz took 25.0 seconds. This demonstrates that lazy loading doesn\u2019t eliminate download time; it redistributes it. SOCI\u2019s approach of maintaining separate indexes allows it to optimize the download-to-startup time trade-off more effectively, keeping application startup performance on par with standard images while still dramatically reducing image pull time.<\/p>\n\n<h2 id=\"production-performance\">Production performance<\/h2>\n\n<p>The production deployment of SOCI lazy loading has delivered significant, measurable improvements across our data platforms. Both Airflow and Spark Connect now experience 30-40% faster startup times, directly improving our ability to handle traffic spikes and scale efficiently. These improvements translate to better auto-scaling responsiveness, reduced resource waste during initialization, and improved user experience for data processing workloads. The sustained performance gains observed over time demonstrate that lazy loading is a stable, production-ready optimization that delivers consistent value.<\/p>\n\n<p><strong>Figure 2 and 3<\/strong> illustrates the P95 startup time improvements for both services:<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/docker-lazy-loading\/figure-2.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 2. Production results: Airflow P95 startup time. <\/figcaption>\n  <\/figure>\n<\/div>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/docker-lazy-loading\/figure-3.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 3. Production results: Spark Connect P95 startup time.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<p>It is important to note that P95 startup time includes both the image download\/pull time and the application startup time itself. This metric captures the entire system performance for both cold and hot starts on fresh and hot nodes, showing the overall system improvement rather than just cold start performance.<\/p>\n\n<p>During the production deployment and monitoring, we gained valuable insights on SOCI configuration tuning. Following AWS\u2019s recommended configuration from their blog on <a href=\"https:\/\/aws.amazon.com\/blogs\/containers\/introducing-seekable-oci-parallel-pull-mode-for-amazon-eks\/\">Introducing Seekable OCI: Parallel Pull Mode for Amazon EKS<\/a>, we optimized our SOCI snapshotter settings:<\/p>\n\n<ul>\n  <li>\n    <p>Increased <em>max_concurrent_downloads_per_image<\/em> from 5 to 10.<\/p>\n  <\/li>\n  <li>\n    <p>Increased <em>max_concurrent_unpacks_per_image<\/em> from 3 to 10.<\/p>\n  <\/li>\n  <li>\n    <p>Increased <em>concurrent_download_chunk_size<\/em> from 8MB to 16MB (aligning with AWS\u2019s recommendation for Elastic Container Registry (ECR)).<\/p>\n  <\/li>\n<\/ul>\n\n<p>This configuration tuning led to a significant performance improvement: <strong>image download time on a fresh node was reduced from 60 seconds to 24 seconds, representing a 60% improvement<\/strong>. The key lesson here is that default SOCI configurations may not be optimal for all environments, and tuning these parameters based on your infrastructure (especially when using ECR) can yield substantial gains.<\/p>\n\n<h2 id=\"technical-background-how-docker-lazy-loading-works\">Technical background: How Docker lazy loading works<\/h2>\n\n<h3 id=\"container-root-filesystem-rootfs-and-file-organization\">Container root filesystem (rootfs) and file organization<\/h3>\n\n<p>A container\u2019s root filesystem, or rootfs, is the directory structure that the container sees as its root <code class=\"language-plaintext highlighter-rouge\">(\/)<\/code>. It contains all the files and directories necessary for an application to run, including the application itself, its dependencies, system libraries, and configuration files. It\u2019s an isolated filesystem, separate from the host machine\u2019s filesystem.<\/p>\n\n<p>The rootfs is built from a series of read-only layers that come from the container image. Each instruction in an image\u2019s Dockerfile creates a new layer, representing a set of filesystem changes. When a container is launched, a new writable layer, often called the \u201ccontainer layer,\u201d is added on top of the stack of read-only image layers. Any changes made to the running container, such as writing new files or modifying existing ones, are written to this writable layer. The underlying image layers remain untouched. This is known as a copy-on-write (CoW) mechanism.<\/p>\n\n<p>In containerd, a snapshotter is a plugin responsible for managing container filesystems. Its primary job is to take the layers of an image and assemble them into a rootfs for a container. The default snapshotter in containerd is <strong>overlayFS<\/strong>, which uses the Linux kernel\u2019s OverlayFS driver to efficiently stack layers. To assemble the rootfs, the overlayFS snapshotter creates a \u201cmerged\u201d view of the read-only image layers:<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/docker-lazy-loading\/figure-4.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 4. How OverlayFS assembles the container filesystem.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<ul>\n  <li>\n    <p><strong>lowerdir<\/strong>: The read-only image layers are used as the lowerdir in OverlayFS. These are the immutable layers from the container image.<\/p>\n  <\/li>\n  <li>\n    <p><strong>upperdir<\/strong>: A new, empty directory is created to be the upperdir. This is the writable layer for the container where any changes are stored.<\/p>\n  <\/li>\n  <li>\n    <p><strong>merged<\/strong>: The merged directory is the unified view of the lowerdir and upperdir. This is what is presented to the container as its rootfs.<\/p>\n  <\/li>\n<\/ul>\n\n<p>When a container reads a file, it\u2019s read from the merged view. When a container writes a file, it\u2019s written to the upperdir using a copy-on-write mechanism. This is an efficient way to manage container filesystems, as it avoids duplicating files and allows for fast container startup.<\/p>\n\n<h3 id=\"the-problem-traditional-container-image-pull\">The problem: Traditional container image pull<\/h3>\n\n<p>To understand the benefits of lazy loading, we first need to understand the traditional container image pull process:<\/p>\n\n<ol>\n  <li>\n    <p><strong>Download layers<\/strong>: The container runtime downloads all layer tarballs that make up the image.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Unpack layers<\/strong>: Each layer is unpacked and extracted onto the host\u2019s disk.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Create snapshot<\/strong>: The snapshotter combines these layers into a single, unified filesystem, known as the container\u2019s rootfs.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Start container<\/strong>: Only after all layers are downloaded and unpacked can the container start.<\/p>\n  <\/li>\n<\/ol>\n\n<p>This process is slow, especially for large images, as the entire image must be present on the host before the container can launch.<\/p>\n\n<h3 id=\"the-solution-remote-snapshotter\">The solution: Remote snapshotter<\/h3>\n\n<p>To address the slow startup issue with large images, we use a <strong>remote snapshotter<\/strong> solution. A remote snapshotter is a special type of snapshotter that doesn\u2019t require all image data to be locally present. Instead of downloading and unpacking all the layers, it creates a \u201csnapshot\u201d that points to the remote location of the data (like a container registry). The actual file content is then fetched on-demand when the container tries to read a file for the first time.<\/p>\n\n<p>While a traditional snapshotter like overlayFS uses directories on the local disk as its lowerdir, a remote snapshotter creates a virtual lowerdir that is backed by the remote registry. This is typically done using FUSE (Filesystem in Userspace). The remote snapshotter creates a FUSE filesystem that presents the contents of the remote layer as if it were a local directory. This FUSE mount is then used as the lowerdir for the overlayFS driver. This allows the remote snapshotter to integrate with the existing overlayFS infrastructure while adding the capability of lazy-loading data from a remote source.<\/p>\n\n<p>There are two main formats that enable remote snapshotters: <strong>eStargz<\/strong> and <strong>SOCI<\/strong>.<\/p>\n\n<h3 id=\"estargz-format\">eStargz format<\/h3>\n\n<p>eStargz is a backward-compatible extension of the standard OCI <code class=\"language-plaintext highlighter-rouge\">tar.gz<\/code> layer format. It has several key features that enable lazy loading:<\/p>\n\n<ul>\n  <li>\n    <p><strong>Individually compressed files<\/strong>: Each file within the layer (and even chunks of large files) is compressed individually. This is the key that allows for random access to file contents.<\/p>\n  <\/li>\n  <li>\n    <p><strong>TOC (table of contents)<\/strong>: A JSON file named <code class=\"language-plaintext highlighter-rouge\">stargz.index.json<\/code> is located at the end of the layer. This TOC contains metadata for every file, including its name, size, and, most importantly, its offset within the layer blob.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Footer<\/strong>: A small footer at the very end of the layer contains the offset of the TOC, allowing it to be easily located by reading only the last few bytes of the layer.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Chunking and verification<\/strong>: Large files can be broken down into smaller chunks, each with its own entry in the TOC. Each chunk also has a chunkDigest in its TOC entry, allowing for independent verification of each downloaded piece of data.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Prefetch landmark<\/strong>: A special file, <code class=\"language-plaintext highlighter-rouge\">.prefetch.landmark<\/code>, can be placed in the layer to mark the end of \u201cprioritized files\u201d. This allows the snapshotter to intelligently prefetch the most important files for the container\u2019s workload.<\/p>\n  <\/li>\n<\/ul>\n\n<p>The stargz snapshotter uses the eStargz format to enable lazy loading. Here\u2019s how it works:<\/p>\n\n<ol>\n  <li>\n    <p><strong>Mount request<\/strong>: When containerd calls the Mount function, it\u2019s the main entry point for creating a new filesystem for a layer.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Resolve and read TOC<\/strong>: The snapshotter fetches the layer\u2019s footer, then fetches the <code class=\"language-plaintext highlighter-rouge\">stargz.index.json<\/code> TOC from the remote registry. This TOC contains all the file metadata needed to create a virtual filesystem.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Mount FUSE filesystem<\/strong>: With the TOC in memory, the snapshotter creates a virtual filesystem using FUSE. The container can now start, as it has a valid rootfs, even though most of the file content has not been downloaded.<\/p>\n  <\/li>\n  <li>\n    <p><strong>On-demand fetching<\/strong>: When the container performs a file operation like <code class=\"language-plaintext highlighter-rouge\">read()<\/code>, the FUSE filesystem intercepts the call. The snapshotter checks a local disk cache for the requested bytes. If the data is not cached, it issues an HTTP Range request to the container registry to download only the required chunk of the layer.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Remote fetching and caching<\/strong>: The downloaded data is returned to the container and also written to the local cache for subsequent reads.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Prefetching for optimization<\/strong>: After the FUSE filesystem is mounted, a background goroutine begins downloading the prioritized files (up to the <em>.prefetch.landmark<\/em>) and can also be configured to download the entire rest of the layer in the background.<\/p>\n  <\/li>\n<\/ol>\n\n<p>For a deeper understanding of the eStargz format and stargz snapshotter, see the <a href=\"https:\/\/github.com\/containerd\/stargz-snapshotter\/blob\/main\/docs\/overview.md\">stargz-snapshotter overview documentation<\/a>.<\/p>\n\n<h3 id=\"soci-format\">SOCI format<\/h3>\n\n<p>SOCI is a technology open sourced by AWS that enables containers to launch faster by lazily loading the container image. SOCI works by creating an index (SOCI Index) of the files within an existing container image. SOCI borrows some of the design principles from stargz-snapshotter but takes a different approach:<\/p>\n\n<ul>\n  <li>\n    <p><strong>Separate index<\/strong>: A SOCI index is generated separately from the container image and is stored in the registry as an OCI Artifact, linked back to the container image by OCI Reference Types.<\/p>\n  <\/li>\n  <li>\n    <p><strong>No image conversion<\/strong>: This means that the container images do not need to be converted, image digests do not change, and image signatures remain valid.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Native Bottlerocket support<\/strong>: SOCI is natively supported on Bottlerocket OS.<\/p>\n  <\/li>\n<\/ul>\n\n<p>For a deeper understanding of the SOCI format, see the <a href=\"https:\/\/github.com\/awslabs\/soci-snapshotter\/blob\/main\/docs\/index.md\">soci-snapshotter documentation<\/a>.<\/p>\n\n<h2 id=\"building-and-deploying-lazy-loaded-images\">Building and deploying lazy-loaded images<\/h2>\n\n<h3 id=\"setting-up-snapshotters-in-eks\">Setting up snapshotters in EKS<\/h3>\n\n<p>When using EKS with containerd as the container runtime, you can configure remote snapshotters to enable lazy loading. Here\u2019s how to set them up:<\/p>\n\n<p><strong>For stargz-snapshotter (eStargz)<\/strong>: You need to install the <code class=\"language-plaintext highlighter-rouge\">containerd-stargz-grpc<\/code> service first, then register it as a proxy plugin in containerd\u2019s configuration:<\/p>\n\n<pre><code class=\"language-textproto\"># \/etc\/containerd\/config.toml\n[proxy_plugins]\n[proxy_plugins.stargz]\ntype = \"snapshot\"\naddress = \"\/run\/containerd-stargz-grpc\/containerd-stargz-grpc.sock\"\n<\/code><\/pre>\n\n<p>For detailed installation instructions, see the <a href=\"https:\/\/github.com\/containerd\/stargz-snapshotter\/blob\/main\/docs\/INSTALL.md\">stargz-snapshotter installation documentation<\/a>. The setup can be baked into an AMI for production use or tested via user data from node bootstrap scripts.<\/p>\n\n<p><strong>For SOCI snapshotter (Bottlerocket)<\/strong>: On Bottlerocket nodes, enable the SOCI snapshotter via user data:<\/p>\n\n<pre><code class=\"language-textproto\"># Enable SOCI snapshotter\n[settings.container-runtime]\nsnapshotter = \"soci\"\n<\/code><\/pre>\n\n<p>SOCI is natively supported on Bottlerocket, so no additional daemon installation is required.<\/p>\n\n<h3 id=\"building-lazy-loaded-images\">Building lazy-loaded images<\/h3>\n\n<p>eStargz images can be built natively using Docker Buildx by setting the output compression to <code class=\"language-plaintext highlighter-rouge\">estargz<\/code>:<\/p>\n\n<div class=\"language-shell highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>docker buildx build \n  <span class=\"nt\">--platform<\/span> linux\/amd64 \n  <span class=\"nt\">--output<\/span> <span class=\"nb\">type<\/span><span class=\"o\">=<\/span>registry,oci-mediatypes<span class=\"o\">=<\/span><span class=\"nb\">true<\/span>,compression<span class=\"o\">=<\/span>estargz,force-compression<span class=\"o\">=<\/span><span class=\"nb\">true<\/span> \n  <span class=\"nt\">--tag<\/span> <span class=\"nv\">$ECR_REGISTRY<\/span>\/airflow:<span class=\"nv\">$TAG<\/span> \n  <span class=\"nb\">.<\/span>\n<\/code><\/pre><\/div><\/div>\n\n<p>SOCI doesn\u2019t require rebuilding images; you only need to generate a SOCI index for existing images. Since Docker doesn\u2019t natively support SOCI index generation yet, workaround solutions include using the <a href=\"https:\/\/awslabs.github.io\/cfn-ecr-aws-soci-index-builder\/#_overview\">AWS SOCI Index Builder Using Lambda Functions<\/a> or integrating SOCI index generation into your CI\/CD pipeline as described in this <a href=\"https:\/\/pabis.eu\/blog\/2025-06-17-Faster-ECS-Startup-SOCI-Index-GitLab-Pipeline.html\">blog post<\/a>.<\/p>\n\n<h2 id=\"key-takeaway-why-we-chose-soci\">Key takeaway: Why we chose SOCI<\/h2>\n\n<p>We started our exploration with eStargz but ultimately chose SOCI for production deployment. The key reason is scalability and alignment with our strategy to use Bottlerocket OS for enhancing Kubernetes pod startup and security. SOCI is natively supported by Bottlerocket, which means service teams don\u2019t need to set up and maintain the more complicated stargz snapshotter across all EKS clusters. This makes the implementation easier to maintain and provides better support from AWS.<\/p>\n\n<p>Additionally, we learned that lazy loading doesn\u2019t eliminate the time required to download image data; it redistributes it from startup time to runtime. While this dramatically improves cold start performance, it\u2019s important to monitor application performance closely and tune configuration parameters based on your workload and infrastructure. We achieved a 60% improvement by optimizing SOCI\u2019s parallel pull mode settings, demonstrating the value of proper configuration tuning.<\/p>\n\n<h2 id=\"conclusion\">Conclusion<\/h2>\n\n<p>Docker image lazy loading with SOCI offers a significant opportunity to improve the performance and efficiency of our services at Grab. Our testing and production deployments have shown:<\/p>\n\n<ul>\n  <li>\n    <p>4x faster image pull times on fresh nodes.<\/p>\n  <\/li>\n  <li>\n    <p>29-34% improvement in P95 startup times for production workloads.<\/p>\n  <\/li>\n  <li>\n    <p>60% improvement in image download times with proper configuration tuning.<\/p>\n  <\/li>\n<\/ul>\n\n<p>The implementation path is clear, low-risk, and builds on proven components. This technology is production-ready, and we\u2019re continuing to scale it across more services.<\/p>\n\n<h3 id=\"references\">References<\/h3>\n\n<ul>\n  <li>\n    <p><strong>Databricks:<\/strong> <a href=\"https:\/\/www.databricks.com\/blog\/2021\/09\/08\/booting-databricks-vms-7x-faster-for-serverless-compute.html\">Booting Databricks VMs 7x Faster for Serverless Compute<\/a> - Industry case study showing how major tech companies achieve fast container startup at scale<\/p>\n  <\/li>\n  <li>\n    <p><strong>BytePlus:<\/strong> <a href=\"https:\/\/docs.byteplus.com\/en\/docs\/vke\/Container-image-lazy-loading-solution\">Container Image Lazy Loading Solution<\/a> - Enterprise implementation guide for lazy loading in production Kubernetes environments<\/p>\n  <\/li>\n  <li>\n    <p><strong>AWS:<\/strong> <a href=\"https:\/\/aws.amazon.com\/blogs\/containers\/introducing-seekable-oci-parallel-pull-mode-for-amazon-eks\/\">Introducing Seekable OCI: Parallel Pull Mode for Amazon EKS<\/a> - AWS\u2019s guide to SOCI configuration and optimization<\/p>\n  <\/li>\n<\/ul>\n\n<h2 id=\"join-us\">Join us<\/h2>\n\n<p>Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people everyday to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line \u2013 we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.<\/p>\n\n<p>Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, <a href=\"https:\/\/grb.to\/gebdockerlazyloading\">join our team<\/a> today!<\/p>\n","pubDate":"Wed, 21 Jan 2026 00:23:00 +0000","link":"https:\/\/engineering.grab.com\/docker-lazy-loading","guid":"https:\/\/engineering.grab.com\/docker-lazy-loading","category":["Database","Engineering"]},{"title":"From deployment slop to production reality: How BriX bridges the gap with enterprise-grade AI infrastructure","description":"<h2 id=\"abstract\">Abstract<\/h2>\n\n<p>You\u2019ve vibe-coded an AI assistant that\u2019s a game-changer for your team. It works perfectly on your laptop. But when you try to deploy it company-wide, everything falls apart.<\/p>\n\n<p>This is what is known as \u201cdeployment slop\u201d\u2014the messy reality when quick AI prototypes hit the enterprise world. Your tool suddenly becomes unreliable, insecure, and impossible to maintain. Different teams run different versions. Security flags it. IT won\u2019t touch it. Your innovation dies.<\/p>\n\n<p><strong>BriX solves this<\/strong>. It\u2019s a platform that takes your working AI prototype and makes it production-ready\u2014without forcing you to become a full-stack developer. BriX handles the hard parts such as security, scaling, and data connections, so you can focus on building great tools. Switch between AI models like Claude or GPT with a click. Connect securely to your company\u2019s data sources. Deploy once, and it just works\u2014for everyone.<\/p>\n\n<p>This article shows how BriX transforms AI deployment from an engineering bottleneck into a configuration task, enabling domain experts to ship enterprise-grade AI tools in days instead of months.<\/p>\n\n<h2 id=\"introduction\">Introduction<\/h2>\n\n<p>Building AI tools has never been easier. With ChatGPT, Claude, and other Large Language Models (LLMs), anyone can prototype a useful AI assistant in an afternoon. Data analysts build metric query tools; product managers create research assistants. This rapid experimentation\u2014\u201dvibe coding\u201d\u2014has sparked innovation across organizations.<\/p>\n\n<p>But then comes the hard part: <em>deployment<\/em>.<\/p>\n\n<p>That brilliant tool you built on your laptop? It works great for you. But when your boss asks you to \u201croll it out to the whole company,\u201d you hit a wall. Suddenly you need:<\/p>\n\n<ul>\n  <li>Security reviews (Is it leaking sensitive data?)<\/li>\n  <li>Reliability guarantees (What happens when 500 people use it at once?)<\/li>\n  <li>Access controls (Who can see what data?)<\/li>\n  <li>Audit trails (Who asked what, and when?)<\/li>\n  <li>Consistent behavior (Why does it give different answers to different people?)<\/li>\n<\/ul>\n\n<p>Most builders aren\u2019t DevOps engineers. They\u2019re domain experts who had a good idea. So these tools either:<\/p>\n\n<ul>\n  <li>Never get deployed (innovation dies in a Jupyter notebook); or<\/li>\n  <li>Get deployed badly (creating \u201cDeployment Slop\u201d\u2014a mess of insecure, unreliable scripts).<\/li>\n<\/ul>\n\n<h3 id=\"the-three-failure-modes-of-deployment-slop\">The three failure modes of deployment slop<\/h3>\n\n<h4 id=\"the-chaos-problem-everyones-running-a-different-version\">The chaos problem: Everyone\u2019s running a different version<\/h4>\n\n<p>Marketing copies your script and tweaks the prompts. Finance changed the model from GPT-4 to Claude because it\u2019s cheaper. Sales adds their own data sources. Within weeks, you have:<\/p>\n\n<ul>\n  <li>Five different versions of \u201cthe same tool\u201d.<\/li>\n  <li>Wildly different answers to the same question.<\/li>\n  <li>No one knows which version is \u201ccorrect\u201d.<\/li>\n  <li>Teams making decisions based on inconsistent data.<\/li>\n<\/ul>\n\n<p><strong>Potential risk<\/strong>: A senior executive receiving conflicting answers from different teams, resulting in a loss of trust.<\/p>\n\n<h4 id=\"the-reliability-problem-it-works-until-it-doesnt\">The reliability problem: It works until it doesn\u2019t<\/h4>\n\n<p>Your laptop script was built for one user (you). Now 50 people are using it simultaneously. The result:<\/p>\n\n<ul>\n  <li>Timeouts and crashes during peak hours.<\/li>\n  <li>No error handling (users see cryptic Python stack traces).<\/li>\n  <li>Rate limits hit on API calls.<\/li>\n  <li>No monitoring or alerts when things break.<\/li>\n  <li>You become the \u201con-call\u201d support person for a side project.<\/li>\n<\/ul>\n\n<p><strong>Potential risk<\/strong>: The tool fails during a critical metric review leaving folks to find the solution manually.<\/p>\n\n<h4 id=\"the-security-problem-accidental-data-leaks\">The security problem: Accidental data leaks<\/h4>\n\n<p>Your prototype connects directly to production databases. It has your personal credentials hardcoded. There\u2019s no:<\/p>\n\n<ul>\n  <li>Access control (everyone sees all data, including sensitive info).<\/li>\n  <li>Audit trail (no record of who queried what).<\/li>\n  <li>Data governance (PII might be exposed).<\/li>\n  <li>Compliance review (legal and security teams don\u2019t even know it exists).<\/li>\n<\/ul>\n\n<p><strong>Potential risk<\/strong>: An employee inadvertently querying PII, resulting in a potential breach.<\/p>\n\n<h3 id=\"who-gets-hit-hardest\">Who gets hit hardest?<\/h3>\n\n<p>This problem is especially painful for semi-technical builders\u2014the domain experts who understand the business problem but aren\u2019t DevOps engineers:<\/p>\n\n<ul>\n  <li>Product Managers who write SQL but not Kubernetes configs.<\/li>\n  <li>Data Analysts who know Python but not cloud security.<\/li>\n  <li>Marketing Ops who build dashboards but not CI\/CD pipelines.<\/li>\n  <li>HR Analytics who understand people data but not infrastructure scaling.<\/li>\n<\/ul>\n\n<p>The traditional solution is to \u201chand it to Engineering,\u201d but they are backlogged for months. By the time they rebuild your tool \u201cproperly,\u201d the business need has changed.<\/p>\n\n<h2 id=\"solution-enter-brix-from-prototype-to-production-in-days-not-months\">Solution: Enter BriX: From prototype to production in days, not months<\/h2>\n\n<p>BriX is a platform that solves the deployment problem by centralizing all the hard infrastructure work. Instead of forcing every builder to become a DevOps expert, BriX provides the production-ready foundation so you can focus on building great AI tools.<\/p>\n\n<p>The core insight: Deployment doesn\u2019t have to be an engineering problem. It can be a configuration problem.<\/p>\n\n<h3 id=\"what-brix-does\">What BriX does<\/h3>\n\n<p>Think of BriX as the \u201cproduction layer\u201d for AI tools. You bring your working prototype. BriX handles security, scaling, data connections, monitoring, audit trails, and consistent behavior across teams.<\/p>\n\n<p>You configure. BriX deploys.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/brix\/brix-infrastructure.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 1. BriX infrastructure<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h3 id=\"the-three-core-capabilities\">The three core capabilities<\/h3>\n\n<h4 id=\"choose-your-ai-model-model-agnosticism\">Choose your AI model (Model agnosticism)<\/h4>\n\n<p>Different tasks need different models. BriX lets you switch between models with a dropdown\u2014Claude, GPT, Gemini, or others. Test which works best. Change models without rewriting code. Optimize for cost vs. performance.<\/p>\n\n<p>Example: Your finance tool uses GPT-4 for complex analysis, but a new better model is available. Change it in BriX with one click\u2014no code changes needed.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/brix\/model-selection-interface.png\" alt=\"\" style=\"width:90%\" \/><figcaption align=\"middle\">Figure 2. Model selection interface<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h4 id=\"connect-to-enterprise-data-securely-model-context-protocols\">Connect to enterprise data securely (Model Context Protocols)<\/h4>\n\n<p>This is where BriX really shines. Your AI tool needs data\u2014metrics, customer info, documentation. But connecting to enterprise systems securely is hard.<\/p>\n\n<p>Model Context Protocols (MCPs) are BriX\u2019s solution. Think of them as secure, pre-built connectors to your company\u2019s data sources.<\/p>\n\n<p>Why MCPs matter:<\/p>\n<ul>\n  <li>Security built-in: No hardcoded credentials, proper access controls.<\/li>\n  <li>Certified data: Connect only to approved, governed data sources.<\/li>\n  <li>No custom integration: Pre-built connectors, not custom API code.<\/li>\n  <li>Audit trails: Every query is logged automatically.<\/li>\n<\/ul>\n\n<p><strong>Example<\/strong>: Your marketing tool can query the metrics system to get conversion rates, search the knowledge base for campaign guidelines, and pull customer data from the data lake \u2014all through secure, governed connections.<\/p>\n\n<p><strong>Technical note<\/strong>: MCPs use a standardized protocol, so adding new data sources doesn\u2019t require rebuilding your tool. BriX handles the complexity.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/brix\/chat-user-interface.png\" alt=\"\" style=\"width:90%\" \/><figcaption align=\"middle\">Figure 3. BriX chat user interface<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h4 id=\"ensure-consistent-behavior-system-prompts-and-context\">Ensure consistent behavior (System prompts and context)<\/h4>\n\n<p>Remember the \u201cchaos problem\u201d where everyone runs different versions? BriX solves this with centralized configurations by allowing you to lock it down for the users:<\/p>\n\n<ul>\n  <li>System prompts: Define your AI\u2019s personality, tone, and guardrails once.<\/li>\n  <li>Context files: Upload reference documents that every instance uses.<\/li>\n  <li>Global enforcement: All users get the same behavior automatically.<\/li>\n<\/ul>\n\n<p><strong>Example<\/strong>: Your customer support tool has a system prompt that says \u201cAlways be empathetic, never make promises about refunds, escalate to humans for complaints.\u201d Every support agent\u2019s AI follows these rules\u2014no exceptions.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/brix\/builder-view-1.png\" alt=\"\" style=\"width:80%\" \/>\n  <\/figure>\n<\/div>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/brix\/builder-view-2.png\" alt=\"\" style=\"width:80%\" \/><figcaption align=\"middle\">Figure 4. The builder\u2019s view<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h4 id=\"additional-feature-flexible-interfaces-and-collaboration\">Additional feature: Flexible interfaces and collaboration<\/h4>\n\n<p>Beyond the core infrastructure, BriX offers flexible ways to consume these tools. BriX goes beyond conversational interfaces\u2014you can host custom UIs built with any frontend framework while BriX handles the AI backend. Users can also generate and share analyses as persistent reports, turning individual queries into institutional knowledge accessible across teams via shareable links\u2014complete with data, visualizations, and AI insights.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/brix\/share-feature-interface.png\" alt=\"\" style=\"width:90%\" \/><figcaption align=\"middle\">Figure 5. Share feature interface<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h3 id=\"the-brix-workflow-a-real-example\">The BriX workflow: A real example<\/h3>\n<p>Let\u2019s see how a product manager would use BriX:<\/p>\n\n<p><strong>Step 1: Upload your prototype<\/strong><\/p>\n\n<ul>\n  <li>You\u2019ve built a Jupyter notebook that queries metrics and generates reports.<\/li>\n  <li>Upload it to BriX (or connect your GitHub repo).<\/li>\n<\/ul>\n\n<p><strong>Step 2: Configure (Not code)<\/strong><\/p>\n\n<ul>\n  <li>Choose your AI model: Claude 4.5 Sonnet<\/li>\n  <li>Connect data sources: Midas (metrics), Hubble (data lake)<\/li>\n  <li>Set system prompt: \u201cYou\u2019re a data analyst. Always cite sources. Format numbers with commas.\u201d<\/li>\n  <li>Upload context: Your company\u2019s metrics definitions guide.<\/li>\n<\/ul>\n\n<p><strong>Step 3: Lock<\/strong><\/p>\n\n<ul>\n  <li>Lock all the configurations of your BriX.<\/li>\n  <li>Share with your team.<\/li>\n<\/ul>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/brix\/brix-landing-page.png\" alt=\"\" style=\"width:90%\" \/><figcaption align=\"middle\">Figure 6. BriX landing page<\/figcaption>\n  <\/figure>\n<\/div>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/brix\/user-view-1.png\" alt=\"\" style=\"width:80%\" \/>\n  <\/figure>\n<\/div>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/brix\/user-view-2.png\" alt=\"\" style=\"width:80%\" \/><figcaption align=\"middle\">Figure 7. The user\u2019s view (Locks and edit not available)<\/figcaption>\n  <\/figure>\n<\/div>\n\n<p><strong>Step 4: It just works<\/strong><\/p>\n\n<ul>\n  <li>Certification by design with Brick Quality residing with the brick admin.<\/li>\n  <li>Focused use cases have specific system prompts, context - minimizing hallucination concerns.<\/li>\n  <li>People can use it simultaneously (BriX handles scaling).<\/li>\n  <li>Everyone gets consistent answers (same model, same prompts).<\/li>\n  <li>All queries are logged (audit trail automatic).<\/li>\n  <li>The security team is happy (proper access controls).<\/li>\n  <li>You\u2019re not on-call (BriX monitors and alerts).<\/li>\n<\/ul>\n\n<p>Time to production: 3 Days, not 3 months.<\/p>\n\n<h3 id=\"under-the-hood-the-brix-architecture\">Under the hood: The BriX architecture<\/h3>\n\n<p>BriX is built on a synchronous streaming architecture\u2014a design that prioritizes real-time responsiveness without sacrificing enterprise security. Think of it like a live sports broadcast: you see the action as it happens, not a delayed replay.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/brix\/brix-architecture.png\" alt=\"\" style=\"width:90%\" \/><figcaption align=\"middle\">Figure 8. BriX architecture<\/figcaption>\n  <\/figure>\n<\/div>\n\n<p>Here\u2019s how a single user request flows through the system, from question to answer.<\/p>\n\n<p><strong>The request journey: Six layers<\/strong><\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>User Question\n      \u2193\n[1] The Frontend \u2014 Real-Time Streaming\n      \u2193\n[2] The Gateway \u2014 FastAPI Backend\n      \u2193\n[3] The Brain \u2014 LangGraph Orchestration\n      \u2193\n[4] Memory \u2014 Hot and Cold Storage\n      \u2193\n[5] Security \u2014 Identity Propagation (\"On-Behalf-Of\" Flow)\n      \u2193\n[6] Data Processing \u2014 Full Context, Not Fragments\n      \u2193\nResponse streams back to user in real-time\n<\/code><\/pre><\/div><\/div>\n\n<p>Let\u2019s break down each layer.<\/p>\n\n<h4 id=\"layer-1-the-frontend--real-time-streaming\">Layer 1: The frontend \u2014 Real-time streaming<\/h4>\n\n<ul>\n  <li><strong>Technology<\/strong>: React (TypeScript)<\/li>\n  <li><strong>User experience<\/strong>: ChatGPT-style interface<\/li>\n<\/ul>\n\n<p>The User types a question: \u201cWhat\u2019s our conversion rate in Singapore last month?\u201d<\/p>\n\n<p>The frontend opens a persistent connection to BriX servers. As the AI processes the question, updates stream back instantly:<\/p>\n\n<ul>\n  <li>\u201c\ud83e\udd14 Thinking\u2026\u201d<\/li>\n  <li>\u201c\ud83d\udcca Querying metrics database\u2026\u201d<\/li>\n  <li>\u201c\u2705 Found 3 relevant data points\u2026\u201d\n[Final answer appears]<\/li>\n<\/ul>\n\n<p>Why streaming matters:<\/p>\n\n<table class=\"table\">\n  <thead>\n    <tr>\n      <th>Traditional approach<\/th>\n      <th>BriX approach<\/th>\n    <\/tr>\n  <\/thead>\n  <tbody>\n    <tr>\n      <td rowspan=\"1\">\u274c User waits 30 seconds, sees nothing, then gets full answer (feels broken).<\/td>\n      <td>\u2705 User sees progress every second (feels responsive and trustworthy).<\/td>\n    <\/tr>\n  <\/tbody>\n<\/table>\n\n<p>Technical implementation: Server-Sent Events (SSE) for real-time updates without WebSocket complexity.<\/p>\n\n<h4 id=\"layer-2-the-gateway--fastapi-backend\">Layer 2: The Gateway \u2014 FastAPI backend<\/h4>\n\n<ul>\n  <li><strong>Technology<\/strong>: FastAPI (Python)<\/li>\n  <li><strong>Role<\/strong>: Central traffic controller<\/li>\n<\/ul>\n\n<p><strong>What it does<\/strong>:<\/p>\n\n<ul>\n  <li>Receives all incoming requests<\/li>\n  <li>Authenticates users (checks SSO tokens)<\/li>\n  <li>Routes requests to the appropriate agent<\/li>\n  <li>Manages rate limiting (prevents abuse)<\/li>\n  <li>Handles errors gracefully<\/li>\n<\/ul>\n\n<p><strong>Why FastAPI?<\/strong><\/p>\n\n<ul>\n  <li>\u26a1 Fast (async\/await for concurrent requests)<\/li>\n  <li>\ud83d\udd12 Secure (built-in authentication)<\/li>\n  <li>\ud83d\udcc8 Scalable (handles thousands of concurrent users)<\/li>\n<\/ul>\n\n<p><strong>Layer 3: The Brain \u2014 LangGraph orchestration<\/strong><\/p>\n\n<ul>\n  <li><strong>Technology<\/strong>: LangGraph (AI workflow framework)<\/li>\n  <li><strong>Role<\/strong>: The \u201cmain agent\u201d that coordinates everything.<\/li>\n<\/ul>\n\n<p>Think of LangGraph as a smart router that understands intent and delegates work.<\/p>\n\n<p><strong>Example flow<\/strong>:<\/p>\n\n<p><strong><em>User asks<\/em><\/strong>: \u201cCompare our Singapore and Malaysia conversion rates, then explain why they differ\u201d.<\/p>\n\n<p><strong><em>LangGraph analyzes the question<\/em><\/strong>:<\/p>\n\n<ul>\n  <li>Task 1: Query metrics (needs Midas MCP)<\/li>\n  <li>Task 2: Compare data (needs calculation)<\/li>\n  <li>Task 3: Explain differences (needs context\/knowledge base)<\/li>\n<\/ul>\n\n<p>LangGraph delegates to specialized \u201cMCPs\u201d:<\/p>\n\n<ul>\n  <li>Midas MCP: Queries Midas for conversion data<\/li>\n  <li>LLM Agent: Calculates the difference<\/li>\n  <li>Glean MCP: Searches knowledge base for regional factors<\/li>\n<\/ul>\n\n<p>LangGraph synthesizes: Combines results into coherent answer\nWhy modular \u201cBricks\u201d?<\/p>\n\n<ul>\n  <li>\u2705 Reliability: Each Brick is specialized (fewer hallucinations)<\/li>\n  <li>\u2705 Maintainability: Update one Brick without breaking others<\/li>\n  <li>\u2705 Extensibility: Add new Bricks for new use cases<\/li>\n<\/ul>\n\n<h4 id=\"layer-4-memory--hot-and-cold-storage\">Layer 4: Memory \u2014 Hot and cold storage<\/h4>\n\n<p>BriX uses a two-tier memory system to balance speed and durability:<\/p>\n\n<p><strong><em>Hot memory (Redis)<\/em><\/strong>:<\/p>\n\n<ul>\n  <li>\u26a1 Ultra-fast: In-memory storage (microsecond access).<\/li>\n  <li>\ud83d\udd04 Session management: Tracks active conversations.<\/li>\n  <li>\ud83d\udd12 Distributed locks: Prevents race conditions when multiple requests happen simultaneously.<\/li>\n  <li>\ud83d\udca8 Temporary: Data expires after session ends.<\/li>\n<\/ul>\n\n<p><strong><em>Cold memory (PostgreSQL)<\/em><\/strong>:<\/p>\n\n<ul>\n  <li>\ud83d\udcbe Persistent: Data stored permanently<\/li>\n  <li>\ud83d\udcdc Audit trail: Every query, response, and action logged<\/li>\n  <li>\ud83d\udd0d Searchable: Users can search past conversations<\/li>\n  <li>\ud83d\udcca Analytics: Track usage patterns and performance<\/li>\n<\/ul>\n\n<p><strong><em>Example scenario<\/em><\/strong>:<\/p>\n\n<ul>\n  <li>You ask BriX a question \u2192 Hot memory tracks your active session<\/li>\n  <li>You close the browser \u2192 Session data moves to cold memory<\/li>\n  <li>You return tomorrow \u2192 BriX loads your history from cold memory<\/li>\n  <li>You continue the conversation \u2192 New session in hot memory<\/li>\n<\/ul>\n\n<p><strong><em>Result<\/em><\/strong>: Fast responses + complete history + full auditability<\/p>\n\n<h4 id=\"layer-5-security--identity-propagation-on-behalf-of-flow\">Layer 5: Security \u2014 Identity propagation (\u201cOn-Behalf-Of\u201d flow)<\/h4>\n\n<p>This is where BriX\u2019s security model shines. Instead of using a single \u201cservice account\u201d to access all data, BriX uses your credentials for every query.<\/p>\n\n<p><strong><em>How it works<\/em><\/strong>:<\/p>\n\n<p><strong>Step 1: Authentication (Login)<\/strong><\/p>\n\n<ul>\n  <li>You log in via SSO (e.g., Okta, Azure AD)<\/li>\n  <li>BriX receives a secure token that represents your identity<\/li>\n  <li>This token includes your permissions (what data you can access)<\/li>\n<\/ul>\n\n<p><strong>Step 2: Identity propagation (Query execution)<\/strong><\/p>\n\n<ul>\n  <li>You ask: \u201cShow me customer revenue data\u201d<\/li>\n  <li>BriX doesn\u2019t use its own credentials to query the database<\/li>\n  <li>Instead, BriX carries your token to the data source<\/li>\n  <li>The data source checks: \u201cDoes this user have permission to see revenue data?\u201d\n    <ul>\n      <li>If yes \u2192 Returns data<\/li>\n      <li>If no \u2192 Access denied<\/li>\n    <\/ul>\n  <\/li>\n<\/ul>\n\n<p><strong>Step 3: Audit trail<\/strong><\/p>\n\n<ul>\n  <li>Every query is logged with:\n    <ul>\n      <li><strong>Who<\/strong> asked (your user ID)<\/li>\n      <li><strong>What<\/strong> they asked (the question)<\/li>\n      <li><strong>What data<\/strong> was accessed (the query)<\/li>\n      <li><strong>When<\/strong> it happened (timestamp)<\/li>\n    <\/ul>\n  <\/li>\n<\/ul>\n\n<p><strong><em>Why this matters<\/em><\/strong>:<\/p>\n\n<table class=\"table\">\n  <thead>\n    <tr>\n      <th>Traditional approach<\/th>\n      <th>BriX approach<\/th>\n    <\/tr>\n  <\/thead>\n  <tbody>\n    <tr>\n      <td rowspan=\"1\">\u274c Service account has access to ALL data.<\/td>\n      <td>\u2705 Each user only sees their authorized data.<\/td>\n    <\/tr>\n    <tr>\n      <td rowspan=\"1\">\u274c Can't tell who accessed what.<\/td>\n      <td>\u2705 Complete audit trail per user.<\/td>\n    <\/tr>\n    <tr>\n      <td rowspan=\"1\">\u274c Security team nervous about AI tools.<\/td>\n      <td>\u2705 Security team approves (same controls as existing tools).<\/td>\n    <\/tr>\n    <tr>\n      <td rowspan=\"1\">\u274c One compromised credential = full breach.<\/td>\n      <td>\u2705 Breach limited to single user's permissions.<\/td>\n    <\/tr>\n  <\/tbody>\n<\/table>\n\n<p><strong><em>Real-world example<\/em><\/strong>:<\/p>\n\n<ul>\n  <li>Finance analyst asks about revenue \u2192 Sees all financial data (authorized)<\/li>\n  <li>Marketing analyst asks same question \u2192 Sees only marketing budget (restricted)<\/li>\n  <li>Same AI tool, different permissions \u2192 Security enforced automatically<\/li>\n<\/ul>\n\n<p><strong><em>Technical term<\/em><\/strong>: This is called \u201cidentity propagation\u201d or \u201con-behalf-of flow\u201d in enterprise security.<\/p>\n\n<h4 id=\"layer-6-data-processing--full-context-not-fragments\">Layer 6: Data processing \u2014 Full context, not fragments<\/h4>\n\n<p><strong><em>The old way (Retrieval Augmented Generation (RAG))<\/em><\/strong>:<\/p>\n\n<ol>\n  <li>User asks a question.<\/li>\n  <li>System searches for relevant document chunks.<\/li>\n  <li>System sends top 5 chunks to AI.<\/li>\n  <li>AI answers based on fragments.<\/li>\n<\/ol>\n\n<p><strong>Problem<\/strong>: AI might miss context from other parts of the document.<\/p>\n\n<p><strong><em>The BriX way (Full context)<\/em><\/strong>:<\/p>\n\n<ol>\n  <li>User uploads a document.<\/li>\n  <li>BriX feeds the entire document into the AI\u2019s context window.<\/li>\n  <li>AI reads and understands the full document.<\/li>\n  <li>AI answers with complete context.<\/li>\n<\/ol>\n\n<p><strong>Why this works now<\/strong>: Modern AI models (Claude, GPT-4) have massive context windows (100K+ tokens). They can process entire documents, not just snippets\u2014resulting in more accurate answers and fewer hallucinations.<\/p>\n\n<p><strong><em>Example<\/em><\/strong>:<\/p>\n\n<p>Question: \u201cWhat\u2019s our refund policy for international orders?\u201d<\/p>\n\n<ul>\n  <li>RAG approach: Finds 3 snippets about refunds \u2192 Might miss international-specific rules<\/li>\n  <li>BriX approach: Reads entire policy document \u2192 Finds exact international refund section<\/li>\n<\/ul>\n\n<h4 id=\"architecture-summary-why-this-design-works\">Architecture summary: Why this design works<\/h4>\n\n<table class=\"table\">\n  <thead>\n    <tr>\n      <th>Design choice<\/th>\n      <th>Benefit<\/th>\n      <th>User impact<\/th>\n    <\/tr>\n  <\/thead>\n  <tbody>\n    <tr>\n      <td rowspan=\"1\">Streaming architecture<\/td>\n      <td>Real-time feedback<\/td>\n      <td>Feels fast and responsive<\/td>\n    <\/tr>\n    <tr>\n      <td rowspan=\"1\">Modular Bricks<\/td>\n      <td>Specialized agents<\/td>\n      <td>Fewer errors, more reliable<\/td>\n    <\/tr>\n    <tr>\n      <td rowspan=\"1\">Hot\/Cold memory<\/td>\n      <td>Speed + durability<\/td>\n      <td>Fast responses + full history<\/td>\n    <\/tr>\n    <tr>\n      <td rowspan=\"1\">Identity propagation<\/td>\n      <td>User-level security<\/td>\n      <td>Only see authorized data<\/td>\n    <\/tr>\n    <tr>\n      <td rowspan=\"1\">Full context processing<\/td>\n      <td>Complete understanding<\/td>\n      <td>More accurate answers<\/td>\n    <\/tr>\n  <\/tbody>\n<\/table>\n\n<p><strong>The result<\/strong>: An AI platform that feels as fast as ChatGPT but with enterprise-grade security and reliability.<\/p>\n\n<h3 id=\"what-using-brix-actually-feels-like\">What using BriX actually feels like<\/h3>\n\n<p>All the technical architecture is invisible to end users. Here\u2019s what they actually see and experience.<\/p>\n\n<h4 id=\"login-one-click-no-new-passwords\">Login: One click, no new passwords<\/h4>\n\n<p><strong><em>What users see<\/em><\/strong>:<\/p>\n\n<ul>\n  <li>Visit BriX URL<\/li>\n  <li>Click \u201cLog in with SSO\u201d (uses your existing company login)<\/li>\n  <li>Redirects to familiar authentication screen<\/li>\n  <li>Logged in automatically<\/li>\n<\/ul>\n\n<p><strong><em>What users DON\u2019T see<\/em><\/strong>:<\/p>\n\n<ul>\n  <li>No new account creation<\/li>\n  <li>No password to remember<\/li>\n  <li>No security questionnaire<\/li>\n  <li>BriX inherits your existing permissions automatically<\/li>\n<\/ul>\n\n<p>Why this matters: Zero onboarding friction. If you can access your email, you can use BriX.<\/p>\n\n<h4 id=\"the-app-library-your-companys-ai-tools\">The app library: Your company\u2019s AI tools<\/h4>\n\n<p><strong><em>What users see<\/em><\/strong>: Company\u2019s internal \u201cApp Store\u201d for AI tools.<\/p>\n\n<ul>\n  <li>Each tool is pre-configured and vetted<\/li>\n  <li>Click to launch (no installation)<\/li>\n  <li>Tools are tailored to company\u2019s data and processes<\/li>\n<\/ul>\n\n<h4 id=\"using-a-tool-chatgpt-style-interface\">Using a Tool: ChatGPT-style interface<\/h4>\n\n<p><strong><em>What users see<\/em><\/strong>:\nSee the AI \u201cthinking\u201d and \u201cquerying\u201d\u2014no black box waiting. Builds trust (\u201cI can see it\u2019s actually checking the data\u201d).<\/p>\n\n<p><strong><em>Source citations<\/em><\/strong>:\nEvery answer includes a data source. Click to view original data. No \u201ctrust me\u201d answers.<\/p>\n\n<p><strong><em>Conversational follow-ups<\/em><\/strong>:\n\u201cWhy did it increase?\u201d | \u201cCompare to Malaysia\u201d | \u201cShow me a chart\u201d<\/p>\n\n<p>BriX remembers the context.<\/p>\n\n<h4 id=\"data-upload-drag-drop-analyze\">Data upload: Drag, drop, analyze<\/h4>\n\n<p><strong><em>What users have<\/em><\/strong>:<\/p>\n\n<ul>\n  <li>Files are processed securely (encrypted).<\/li>\n  <li>AI reads the full content.<\/li>\n  <li>Users can ask questions about the files.<\/li>\n  <li>Files are only visible to the uploader (privacy).<\/li>\n<\/ul>\n\n<h4 id=\"trustworthy-answers-certified-data-not-hallucinations\">Trustworthy answers: Certified data, not hallucinations<\/h4>\n\n<p>The problem BriX solves:<\/p>\n\n<table class=\"table\">\n  <thead>\n    <tr>\n      <th>ChatGPT\/Generic AI<\/th>\n      <th>BriX<\/th>\n    <\/tr>\n  <\/thead>\n  <tbody>\n    <tr>\n      <td rowspan=\"1\">\u274c Makes up data (\"hallucinations\")<\/td>\n      <td>\u2705 Only uses your company's real data<\/td>\n    <\/tr>\n    <tr>\n      <td rowspan=\"1\">\u274c No source citations<\/td>\n      <td>\u2705 Every answer cites the source<\/td>\n    <\/tr>\n    <tr>\n      <td rowspan=\"1\">\u274c Can't access internal data<\/td>\n      <td>\u2705 Connects to your data lakes, metrics, docs<\/td>\n    <\/tr>\n    <tr>\n      <td rowspan=\"1\">\u274c Same answer for everyone<\/td>\n      <td>\u2705 Respects your permissions (you only see your data)<\/td>\n    <\/tr>\n  <\/tbody>\n<\/table>\n\n<p>Why users trust it:<\/p>\n\n<ul>\n  <li>\u2705 Specific number (not vague)<\/li>\n  <li>\u2705 Source cited (can verify)<\/li>\n  <li>\u2705 Certified data (governance approved)<\/li>\n  <li>\u2705 Timestamp (know it\u2019s current)<\/li>\n  <li>\u2705 Can export\/verify (transparency)<\/li>\n<\/ul>\n\n<h3 id=\"the-impact-what-brix-actually-changes\">The impact: What BriX actually changes<\/h3>\n\n<p>BriX shifts how organizations build AI tools. Here\u2019s what that looks like in practice.<\/p>\n\n<h4 id=\"from-months-to-days\">From months to days<\/h4>\n\n<table class=\"table\">\n  <thead>\n    <tr>\n      <th>Traditional path<\/th>\n      <th>BriX path<\/th>\n    <\/tr>\n  <\/thead>\n  <tbody>\n    <tr>\n      <td rowspan=\"1\">1. Domain expert has idea.<\/td>\n      <td>1. Domain expert has idea<\/td>\n    <\/tr>\n    <tr>\n      <td rowspan=\"1\">2. Submits request to engineering.<\/td>\n      <td>2. Configures the idea in BriX.<\/td>\n    <\/tr>\n    <tr>\n      <td rowspan=\"1\">3. Waits in backlog (weeks to months).<\/td>\n      <td>3. Tests with small group.<\/td>\n    <\/tr>\n    <tr>\n      <td rowspan=\"1\">4. Engineering rebuilds it \"properly\".<\/td>\n      <td>4. Deploys to production.<\/td>\n    <\/tr>\n    <tr>\n      <td rowspan=\"1\">5. Tool finally launches.<\/td>\n      <td>5. Shares with team.<\/td>\n    <\/tr>\n  <\/tbody>\n<\/table>\n\n<p><strong><em>What changes<\/em><\/strong>:<\/p>\n\n<ul>\n  <li>\u26a1 Speed (hours instead of months)<\/li>\n  <li>\ud83d\udc64 Ownership (domain experts maintain their tools)<\/li>\n  <li>\ud83d\udd04 Iteration (refine based on feedback immediately)<\/li>\n  <li>\u2705 Success rate (ideas get tested instead of dying in backlog)<\/li>\n<\/ul>\n\n<h4 id=\"true-democratization\">True democratization<\/h4>\n\n<p><strong><em>Who builds tools with BriX<\/em><\/strong>:<\/p>\n\n<p>The shift isn\u2019t just engineers anymore. We\u2019re seeing:<\/p>\n\n<ul>\n  <li>Product managers building feature analysis tools.<\/li>\n  <li>Data analysts creating custom dashboards.<\/li>\n  <li>Marketing ops building campaign trackers.<\/li>\n  <li>Sales ops creating pipeline monitors.<\/li>\n  <li>HR analytics building retention tools.<\/li>\n<\/ul>\n\n<p><strong><em>What this means<\/em><\/strong>:<\/p>\n\n<p>Domain expertise stays with domain experts (no translation loss). Engineering focuses on platforms (not individual tool requests). Innovation happens at business speed (not constrained by engineering capacity).<\/p>\n\n<p><strong><em>The reality check<\/em><\/strong>:<\/p>\n\n<p>Not every domain expert will build tools (and that\u2019s fine). Some tools still need engineering (complex integrations, custom logic). But the bottleneck shifts from \u201cengineering capacity\u201d to \u201cgood ideas.\u201d<\/p>\n\n<h4 id=\"flexibility-without-fragility\">Flexibility without fragility<\/h4>\n\n<p>What you can change without rewriting code:<\/p>\n\n<p><strong><em>Swap AI models<\/em><\/strong>:<\/p>\n\n<ul>\n  <li>Dropdown menu selection (GPT-5, Claude, Gemini)<\/li>\n  <li>Different teams can setup different models for their BriX<\/li>\n  <li>Can test new models without rebuilding tools<\/li>\n<\/ul>\n\n<p><strong><em>Add data sources<\/em><\/strong>:<\/p>\n\n<ul>\n  <li>New MCP connector (one-time setup)<\/li>\n  <li>All existing tools can access the new source<\/li>\n  <li>No need to update individual tools<\/li>\n<\/ul>\n\n<p><strong><em>Update behavior globally<\/em><\/strong>:<\/p>\n\n<ul>\n  <li>Change system prompt in one place<\/li>\n  <li>All instances follow new rules immediately<\/li>\n  <li>Useful for policy updates, compliance changes<\/li>\n<\/ul>\n\n<p><strong>Real example<\/strong>: When a company needs to update data access policies:<\/p>\n\n<ul>\n  <li>Traditional approach: Update each tool individually (days\/weeks)<\/li>\n  <li>BriX approach: Update system prompt once (minutes)<\/li>\n<\/ul>\n\n<h4 id=\"security-that-enables-not-blocks\">Security that enables (Not blocks)<\/h4>\n\n<p><strong><em>The traditional trade-off<\/em><\/strong>:<\/p>\n\n<ul>\n  <li>Secure tools = slow approval, limited functionality<\/li>\n  <li>Fast tools = security nightmares, compliance issues<\/li>\n<\/ul>\n\n<p>BriX\u2019s approach: Security is built into the platform, not added per tool.<\/p>\n\n<p><strong><em>What\u2019s automatic<\/em><\/strong>:<\/p>\n\n<ul>\n  <li>SSO authentication (no passwords to manage)<\/li>\n  <li>Identity propagation (users see only their authorized data)<\/li>\n  <li>Audit logging (every query tracked)<\/li>\n<\/ul>\n\n<p><strong><em>What this changes<\/em><\/strong>:<\/p>\n\n<ul>\n  <li>Security team reviews the platform once (not every tool)<\/li>\n  <li>Builders don\u2019t need to become security experts<\/li>\n  <li>Compliance is automatic (audit trails, access controls)<\/li>\n  <li>Tools can move fast without sacrificing governance<\/li>\n<\/ul>\n\n<p><strong><em>Real impact<\/em><\/strong>: Security teams that previously rejected most AI proposals can pre-approve BriX. Then tools built on BriX inherit those security controls automatically.<\/p>\n\n<p><strong><em>BriX will<\/em><\/strong>:<\/p>\n\n<ul>\n  <li>Provide infrastructure for rapid AI tool deployment.<\/li>\n  <li>Make it easier for domain experts to productionize ideas.<\/li>\n  <li>Centralize security and governance.<\/li>\n  <li>Reduce (not eliminate) the engineering bottleneck.<\/li>\n  <li>Give you a path from prototype to production.<\/li>\n<\/ul>\n\n<h4 id=\"the-real-impact\">The real impact<\/h4>\n\n<p>The biggest change isn\u2019t technical. It\u2019s organizational.<\/p>\n\n<p>BriX changes the conversation from:<\/p>\n\n<p><strong>\u201cCan engineering build this for us?\u201d<\/strong><\/p>\n\n<p>to:<\/p>\n\n<p><strong>\u201cLet me try building this and see if it works\u201d<\/strong><\/p>\n\n<p>That shift\u2014from asking permission to testing ideas\u2014is the real impact.\nSome ideas will fail. That\u2019s fine. The cost of testing is now low enough that failure is acceptable.<\/p>\n\n<p>The ideas that succeed can scale immediately. That\u2019s what matters.<\/p>\n\n<h2 id=\"adoption-from-zero-to-production-reality\">Adoption: From zero to production reality<\/h2>\n\n<p>This isn\u2019t theoretical. Real teams are using BriX right now:<\/p>\n\n<ul>\n  <li><strong>The Universal Playground<\/strong> - Data analysts and product managers drop in to run quick analyses or ask questions\u2014no setup, no credentials to configure. Just connect and go. It\u2019s become the default \u201clet me check something\u201d tool.<\/li>\n  <li><strong>Country Intelligence Assistant<\/strong> - Country Analytics built a specialized assistant that answers country-specific questions\u2014market data, regulations, operational metrics. It\u2019s now the go-to source for regional teams making local decisions.<\/li>\n  <li><strong>Medallion Architecture Validator<\/strong> - A data engineer created a tool that validates table compliance with medallion architecture standards. What used to take manual reviews now happens instantly. Teams query it before deployments to catch issues early.<\/li>\n  <li><strong>Conversion Funnel Analyzer<\/strong> - Product analyst built an assistant that tracks user conversion funnels step-by-step in a custom UI. Marketing and product teams use it daily to understand drop-off points without writing SQL.<\/li>\n<\/ul>\n\n<h2 id=\"learningsconclusion\">Learnings\/conclusion<\/h2>\n\n<p><strong>The promise<\/strong>: Anyone can build AI tools.\n<strong>The reality<\/strong>: Anyone can build prototypes, but production requires engineering expertise most people don\u2019t have.<\/p>\n\n<p><strong>BriX bridges that gap<\/strong>.<\/p>\n\n<h3 id=\"what-brix-does-1\">What BriX does<\/h3>\n\n<p><strong>For domain experts<\/strong>: Build and own tools without becoming DevOps experts. Iterate in hours, not months.\n<strong>For engineering<\/strong>: Stop being the bottleneck. Secure the platform once, not every tool.\n<strong>For the organization<\/strong>: Test more ideas. Scale what works. Automatic security and compliance.<\/p>\n\n<h3 id=\"why-brix-works-three-design-principles\">Why BriX works: Three design principles<\/h3>\n\n<p>Building BriX taught us that successful enterprise AI platforms require:<\/p>\n\n<p><strong>Specialization over generalization<\/strong>\nUsers prefer 5 focused tools over 1 unpredictable tool. That\u2019s why BriX uses modular \u201cBricks\u201d\u2014each specialized for specific tasks (data analysis, trend detection, document search). Narrow scope = better reliability.<\/p>\n\n<p><strong>Enablement over control<\/strong>\nDeployment slop isn\u2019t a problem to eliminate\u2014it\u2019s evidence of demand. Don\u2019t kill experimentation; provide the path to production. BriX lets teams experiment locally, then offers the infrastructure to scale what works.<\/p>\n\n<p><strong>Reliability over features<\/strong>\nUsers forgive missing features. They don\u2019t forgive unreliability. One slow response or wrong answer = they never come back. That\u2019s why BriX prioritizes real-time streaming, certified data sources, and source citations over adding more capabilities.<\/p>\n\n<p><strong>The result<\/strong>: A platform that feels as fast as ChatGPT but with enterprise-grade security and governance.<\/p>\n\n<h3 id=\"configure-once-analyze-everywhere-act-fast\">Configure once. Analyze everywhere. Act fast.<\/h3>\n\n<p>BriX makes AI tool deployment a configuration problem, not an engineering problem.<\/p>\n\n<p>Your domain experts have the ideas. BriX gives them the path to production.<\/p>\n\n<h2 id=\"whats-next\">What\u2019s next<\/h2>\n\n<p>BriX solves deployment, but we\u2019re not stopping there.<\/p>\n\n<h3 id=\"more-data-sources\">More data sources<\/h3>\n\n<p>We\u2019re expanding the MCP library. If our company uses it, BriX should connect to it\u2014securely and without custom engineering work.<\/p>\n\n<h3 id=\"bring-your-own-code\">Bring your own code<\/h3>\n\n<p>For technical builders who want custom logic without DevOps headaches, we\u2019re launching a mono repo setup:<\/p>\n\n<ul>\n  <li>App owners own: Their code and business logic<\/li>\n  <li>BriX owns: Platform, security, scaling, maintenance<\/li>\n<\/ul>\n\n<h3 id=\"more-brix\">More BriX<\/h3>\n\n<p>Onboarding more BriX for different tech and non-tech personas.<\/p>\n\n<h2 id=\"join-us\">Join us<\/h2>\n\n<p>Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people everyday to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line \u2013 we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.<\/p>\n\n<p>Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, <a href=\"https:\/\/grb.to\/gebbrix\">join our team<\/a> today!<\/p>\n","pubDate":"Fri, 16 Jan 2026 00:23:00 +0000","link":"https:\/\/engineering.grab.com\/brix","guid":"https:\/\/engineering.grab.com\/brix","category":["AI","LLM","Deployment","Engineering"]},{"title":"Demystifying user journeys: Revolutionizing troubleshooting with auto tracking","description":"<h2 id=\"introduction\">Introduction<\/h2>\n\n<p>Troubleshooting critical issues by deciphering a user\u2019s journey on the Grab app is an extremely challenging task. With countless user journeys and multiple paths through the User Interface (UI), it\u2019s akin to searching for a needle in a vast haystack. This challenge frequently resonates with us, the dedicated developers at Grab, as we strive to understand user behaviors, views, and interactions.<\/p>\n\n<h2 id=\"the-challenge\">The challenge<\/h2>\n\n<p>The distinction between resolving an issue effectively versus spending hours on a wild goose chase is understanding our user journey in real-time.<\/p>\n\n<p>The development team initially attempted to address the issue of the incomplete user journey tracking by implementing a system where a click stream event would be sent with every user interaction. However, this approach presented significant challenges due to the sheer volume of UI components\u2014often numbering in the hundreds\u2014and the reliance on individual developers to correctly instrument each one.<\/p>\n\n<p>A common pitfall was that developers would occasionally overlook or forget to instrument certain user interactions, leading to breaks in the recorded user journey. This created a highly frustrating situation for both the development and product teams, as the integrity of the user journey data was consistently compromised. Despite continuous efforts to patch these bugs and address the omissions, the team found themselves in a perpetual state of reaction, constantly trying to catch up with newly discovered breaches rather than proactively preventing them. This reactive approach consumed valuable resources and hindered the ability to gain a complete and accurate understanding of user behavior.<\/p>\n\n<p>Diagnosing system failures, application bugs, or poor user experiences in complex applications becomes inefficient without real-time performance metrics and detailed session tracking. When engineering teams rely on outdated or fragmented data, they are forced to piece together issue narratives reactively, long after the issues occur. This significantly delays the Mean Time To Resolution (MTTR). Such a reactive approach leads to increased downtime, higher operational costs, customer dissatisfaction, and a waste of developers\u2019 time, as they spend more time \u201chunting\u201d for clues rather than deploying solutions or new features.<\/p>\n\n<h2 id=\"our-eureka-moment-autotrack-sdk\">Our \u2018Eureka\u2019 moment: AutoTrack SDK<\/h2>\n\n<p>The pivotal breakthrough that provides our unique advantage was the creation of auto tracking user journeys\u2014our \u201cEureka\u201d moment. To deliver this, we developed the new Software Development Kit (SDK) called AutoTrack.<\/p>\n\n<p>AutoTrack is system that comprehensively records application state, UI view state, as well as user interactions - a solution that pieces together a chronicle of the user journey, from launch to interactions, as they navigate through the screens. AutoTrack SDK is built on the three core pillars:<\/p>\n\n<ol>\n  <li>Application state<\/li>\n  <li>User interactions<\/li>\n  <li>UI screens<\/li>\n<\/ol>\n\n<p>Let\u2019s delve deeper into the mechanics of how this operates.<\/p>\n\n<h3 id=\"application-state\">Application state<\/h3>\n\n<p>Understanding the application state is fundamental to comprehending user behavior and, consequently, executing effective troubleshooting. The application state provides crucial insights into how a user interacts with the app, particularly concerning its visibility and how it was initiated. This encompasses tracking when the app moves between the background and foreground, as well as the various launch mechanisms.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/auto-tracking\/figure-1.png\" alt=\"\" style=\"width:50%\" \/><figcaption align=\"middle\">Figure 1. Application state user flow.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<p>Key aspects of application state that are vital to monitor include:<br \/>\n<strong>Application lifecycle transitions:<\/strong><\/p>\n\n<ul>\n  <li><strong>Background state:<\/strong> When the app is running but not actively displayed to the user (e.g., the user switches to another app, or the device is locked). Understanding how frequently and for how long an app resides in the background can inform power consumption analysis and the effectiveness of background tasks.<\/li>\n  <li><strong>Foreground state:<\/strong> When the app is actively in use and displayed to the user. Monitoring transitions into and out of the foreground provides a real-time view of user engagement.<\/li>\n  <li><strong>Inactive state:<\/strong> A temporary state where the app is in the foreground but not receiving events (e.g., an incoming call temporarily interrupts the app).<\/li>\n  <li><strong>Suspended state:<\/strong> An app that is in the background and has been explicitly suspended by the operating system to free up resources.<\/li>\n  <li><strong>Terminated state:<\/strong> When the app has been completely closed or crashed. Differentiating between intentional termination and crashes is critical for identifying stability issues.<\/li>\n<\/ul>\n\n<p><strong>Application launch mechanisms:<\/strong><\/p>\n\n<p>The way an app is launched significantly impacts the initial user experience and can influence subsequent interactions. Tracking these different launch types is essential for understanding user entry points and for debugging issues that might be specific to a particular launch method.<\/p>\n\n<ul>\n  <li><strong>Explicit user launch:<\/strong> This is the most straightforward launch mechanism, where the user directly taps on the app icon from their device\u2019s home screen or app drawer. This indicates a deliberate intent to use the app and often signifies a primary entry point for regular users.<\/li>\n  <li><strong>Deeplinks:<\/strong> Deeplinks are URLs that, when clicked, open a specific page or section within a mobile app rather than a web page. They are powerful tools for enhancing user experience and engagement by providing direct access to relevant content.<\/li>\n  <li><strong>Push notifications:<\/strong> Push notifications are messages sent by an app to a user\u2019s device even when the app is not actively in use. Tapping on a push notification often launches the app and directs the user to a specific context related to the notification\u2019s content.<\/li>\n<\/ul>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/auto-tracking\/figure-2.png\" alt=\"\" style=\"width:80%\" \/><figcaption align=\"middle\">Figure 2. Code sample for tracking application lifecycle transition.<\/figcaption> <\/figure>\n<\/div>\n\n<h3 id=\"user-interactions\">User interactions<\/h3>\n\n<p>Real-time session tracking is a crucial component in understanding user behavior and optimizing app performance. By meticulously tracking a wide array of user interactions, the system provides invaluable insights into how users navigate and engage with the app. This granular data forms the bedrock for constructing comprehensive user journeys, allowing development teams to visualise the path a user takes from their initial entry point to achieving their goals within the app.<\/p>\n\n<p>This deep understanding of user interactions is the most important pillar in creating accurate and insightful user journey maps. These maps, in turn, are instrumental in identifying patterns of user behavior, both positive and negative. For instance, tracking helps to identify pain points, bugs, or areas of confusion that might lead to user frustration or abandonment.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/auto-tracking\/figure-3.png\" alt=\"\" style=\"width:80%\" \/><figcaption align=\"middle\">Figure 3. Sample code for real-time session tracking.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h3 id=\"ui-screen\">UI screen<\/h3>\n\n<p>The system leverages lifecycle events from UIViewController (iOS), Activity (Android), and Fragments (Android) to accurately identify and track which specific screen is currently displayed to the user. This granular level of screen tracking is crucial because it significantly enriches the contextual information available to us. By understanding the precise UI that users are interacting with, we can account for the dynamic nature of our app. Different geographical regions, diverse user segments, and varying operational scenarios can lead to distinct user interfaces being presented. This capability ensures that our analysis and troubleshooting efforts are always based on the actual user experience, allowing for more precise problem identification and more effective solutions.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/auto-tracking\/figure-4-5.png\" alt=\"\" style=\"width:80%\" \/><figcaption align=\"middle\"> <\/figcaption>\n  <\/figure>\n<\/div>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/auto-tracking\/figure-6.png\" alt=\"\" style=\"width:80%\" \/><figcaption align=\"middle\">Figure 6. Sample code of UIViewController configuration.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h3 id=\"ui-screen-data\">UI screen data<\/h3>\n<p>On top of that, whenever the screen appears, we capture the screen metadata where we read the full screen hierarchy. With the Screen hierarchy JSON data at hand, we employ it to train an AI model. This model, consequently, can generate an HTML file, which mirrors the user\u2019s screen and interaction.<\/p>\n\n<p>Disclaimer: information is redacted in compliance with GDPR\/PDPA, personal data protection laws.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/auto-tracking\/figure-7.png\" alt=\"\" style=\"width:80%\" \/><figcaption align=\"middle\">Figure 7. Screen hierarchy.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h2 id=\"applications-of-autotrack\">Applications of AutoTrack<\/h2>\n\n<p><strong>Key applications of AutoTrack data:<\/strong><\/p>\n\n<ul>\n  <li><strong>Reconstructing user journeys and reproducing elusive bugs:<\/strong> One of the most significant benefits of AutoTrack is its ability to meticulously record user interactions within the app. This detailed session data allows our teams to precisely recreate the user journey that led to a reported issue. For bugs that are notoriously difficult to reproduce, this capability is a game-changer, eliminating hours of manual guesswork and dramatically accelerating the identification and resolution of underlying problems.<\/li>\n  <li><strong>Automated issue assignment:<\/strong> When an issue is reported, AutoTrack data can be leveraged to automatically assign it to the most relevant team. By analysing the context of the issue within the recorded session, including the specific features or modules involved, the system can intelligently route the problem to the engineers best equipped to address it. This automation reduces triage time, ensures issues are handled by subject matter experts, and improves overall response efficiency.<\/li>\n  <li><strong>Automating UI test case generation:<\/strong> The rich dataset provided by AutoTrack offers a powerful foundation for automating the creation of UI test cases. By observing how users interact with the interface, we can automatically generate test scripts that mimic real-world usage patterns. This not only speeds up the testing phase but also leads to more comprehensive test coverage, identifying edge cases and user flows that might otherwise be missed by manually written tests.<\/li>\n  <li><strong>Understanding analytics event triggers:<\/strong> AutoTrack data provides a granular view into when and why specific analytics events are triggered within the application. This allows us to validate the accuracy of our analytics instrumentation, ensure that events are firing as expected, and gain deeper insights into user behavior. By understanding the precise context surrounding event triggers, we can refine our data collection strategies and derive more meaningful insights from our analytics.<\/li>\n<\/ul>\n\n<h2 id=\"key-takeaways-and-whats-next\">Key takeaways and what\u2019s next<\/h2>\n\n<p>AutoTrack replaces fragile manual instrumentation with a unified, real-time view of application state, screen context, and user interactions. That end-to-end trace makes elusive bugs reproducible, routes issues to the right owners, and seeds reliable UI tests\u2014turning guesswork into grounded evidence so teams can ship fixes faster and with greater confidence.<\/p>\n\n<p>Looking ahead, we are expanding AutoTrack across surfaces and deepening the context it captures\u2014pairing sessions with network and performance signals, strengthening privacy guardrails, and integrating with automated triage and test generation. Look forward to reading more of our deep dives on auto-generated UI tests and how these journeys will power proactive quality across Grab\u2019s app.<\/p>\n\n<h2 id=\"join-us\">Join us<\/h2>\n\n<p>Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people everyday to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line \u2013 we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.<\/p>\n\n<p>Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, <a href=\"https:\/\/grb.to\/gebautocheck\">join our team<\/a> today!<\/p>\n","pubDate":"Tue, 23 Dec 2025 00:23:00 +0000","link":"https:\/\/engineering.grab.com\/auto-track-sdk","guid":"https:\/\/engineering.grab.com\/auto-track-sdk","category":["Mobile","iOS","Android","Tracking","Engineering","Design","Product"]},{"title":"How Grab is accelerating growth with real-time personalization using Customer Data Platform scenarios","description":"<h2 id=\"introduction\">Introduction<\/h2>\n\n<p>Delivering personalized user experiences in real-time is central to Grab\u2019s strategy, but achieving this at scale poses significant engineering challenges. Grab\u2019s Customer Data Platform (CDP) and Growth team has successfully delivered several real-time campaigns, driving significant business impact through enhanced personalization. These initiatives include high-impact use cases like immediate mall offers, timely traveler recommendations, precise ad retargeting, and proactive interventions during key user journey moments. At the core of these successes is Grab\u2019s CDP, which rapidly deploys advanced real-time personalization via a powerful new capability called \u201cScenarios.\u201d<\/p>\n\n<h2 id=\"about-grabs-cdp\">About Grab\u2019s CDP<\/h2>\n\n<p>Grab\u2019s CDP is a centralized, reliable repository for user attributes, designed for freshness, governance, and reusability. Built on <a href=\"https:\/\/engineering.grab.com\/signals-market-place\">Grab\u2019s Signal Marketplace<\/a> framework, the CDP streamlines data management through automation and integration, supporting seamless interactions with internal services and toolings that power marketing, experimentation, ads, Machine Learning (ML) features, and external platforms, including Facebook, Google Ads, and TikTok.<\/p>\n\n<p>The platform currently manages over 1,000 batch user attributes for Passengers, Drivers, and Merchants, powering diverse use cases from targeted marketing campaigns to operational decision-making across Grab\u2019s entire ecosystem.<\/p>\n\n<h2 id=\"the-need-for-real-time-personalization\">The need for real-time personalization<\/h2>\n\n<p>In our current CDP setup, user segments are primarily created for targeting using batch attributes that update once daily. While these batch updates provide valuable historical insights, they are not suitable for scenarios requiring real-time responsiveness. This delay prevents timely engagement with users, particularly when immediate actions can significantly enhance user experiences and conversion rates.<\/p>\n\n<p>For example, when travelers land at an airport, they immediately benefit from timely suggestions for rides, dining options, or local attractions. Traditional batch processing cannot deliver the agility and responsiveness required for these dynamic scenarios.<\/p>\n\n<p>Historically, real-time personalization at Grab relied heavily on engineering resources, which resulted in limited scalability and agility. Marketers and product teams often found themselves blocked by engineering bandwidth constraints, restricting experimentation and innovation.<\/p>\n\n<h2 id=\"problem-statement\">Problem statement<\/h2>\n\n<p>The limitations of Grab\u2019s existing personalization frameworks include:<\/p>\n\n<ul>\n  <li>\n    <p><strong>Batch attribute delays<\/strong>: Daily updates are insufficient for scenarios requiring immediate user responses.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Limited dynamic enrichment<\/strong>: Difficulties in dynamically integrating real-time events with historical user data, weakens personalization effectiveness.<\/p>\n  <\/li>\n  <li>\n    <p><strong>High engineering overhead<\/strong>: Custom solutions require extensive resources, limiting agility and innovation.<\/p>\n  <\/li>\n<\/ul>\n\n<p>To overcome these challenges and support Grab\u2019s vision for comprehensive personalization \u2013 including proactive recommendations and assistance \u2013 CDP needed robust real-time capabilities.<\/p>\n\n<h2 id=\"cdp-scenarios-real-time-personalization-made-simple\">CDP Scenarios: Real-time personalization made simple<\/h2>\n\n<p>The <strong>Scenario<\/strong> feature revolutionizes real-time targeting within the CDP by utilizing user-initiated events, geo-fencing, historical profile data, and on-the-fly predictions. This empowers the business to deliver easy, quick, and flexible personalization without the need for complex engineering efforts.<\/p>\n\n<p>Scenarios enable innovative use cases such as these:<\/p>\n\n<ul>\n  <li><strong>Mall personalization<\/strong>: Real-time personalized offers upon arrival.<\/li>\n  <li><strong>Traveler assistance<\/strong>: Immediate recommendations at airports or hotels.<\/li>\n  <li><strong>Ad retargeting<\/strong>: Enhanced real-time ad targeting.<\/li>\n  <li><strong>Conversion optimization<\/strong>: Timely intervention during user drop-off points.<\/li>\n<\/ul>\n\n<p>Imagine predicting a user\u2019s intent to drop off at a mall using both real-time and historical context. For instance, when a user books a ride to a mall, factors such as destination, time, cuisine preferences, and past behavior (e.g., affluence level) can help predict whether the user\u2019s purpose is retail therapy, grocery shopping, or dining out. This prediction accounts for elements like time of day, day of the week, and mall location. Grab\u2019s engineering teams can leverage this predicted intent (signal) to offer personalized actions, such as GrabPay discounts for shopping or exclusive dining offers for dinner.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/cdp-scenario\/figure-1.png\" alt=\"\" style=\"width:100%\" \/><figcaption align=\"middle\">Figure 1. Scenario in CDP.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h3 id=\"key-features\">Key features<\/h3>\n\n<ul>\n  <li><strong>Event-driven personalization<\/strong>: Real-time Scenarios triggered by Scribe events (Grab\u2019s comprehensive event collection and tracking platform) combined with geo-fencing.<\/li>\n  <li><strong>Historical context integration<\/strong>: Optionally enrich Scenarios using historical CDP data.<\/li>\n  <li><strong>Predictive modeling<\/strong>: Deploy pre-trained models for instant user behavior predictions.<\/li>\n  <li><strong>Self-serve graphical user interface (GUI)<\/strong>: Enable marketers to create complex event sequences and validate Scenarios with synthetic data processed through Flink pipelines.<\/li>\n  <li><strong>Headless application programming interfaces (APIs)<\/strong>: Allow programmatic access and management of Scenarios.<\/li>\n<\/ul>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/cdp-scenario\/figure-2.jpg\" alt=\"\" style=\"width:100%\" \/><figcaption align=\"middle\">Figure 2. Attributes for a scenario in CDP.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h3 id=\"self-serve-scenario-creation\">Self-serve Scenario creation<\/h3>\n\n<p>We designed an intuitive self-serve UI, embedded within the Grab app, empowering marketers to quickly define and deploy Scenarios. Users can specify event triggers, configure geo-fencing, incorporate historical user attributes, and select predictive models. Marketers can also validate Scenarios using synthetic data before deployment, ensuring accurate and realistic outcomes.<\/p>\n\n<p>How it works:<\/p>\n\n<ol>\n  <li><strong>Select event triggers<\/strong>: Choose predefined events or define custom intra-session sequences via the GUI.<\/li>\n  <li><strong>Configure geo-fencing<\/strong>: Define Scenario activation locations, like airports or malls.<\/li>\n  <li><strong>Include historical attributes (optional)<\/strong>: Utilize batch attributes from the CDP to enrich Scenarios.<\/li>\n  <li><strong>Select predictive models (optional)<\/strong>: Train custom classifiers or pick from pre-trained Catwalk models.<\/li>\n  <li><strong>Define data sink<\/strong>: Choose between Amphawa (DynamoDB), Kafka, or both; potentially extendable to external destinations (e.g., Appsflyer).<\/li>\n  <li>Once configured, metadata synchronizes automatically with our streaming service, and Scenarios become available for real-time consumption within an hour.<\/li>\n<\/ol>\n\n<h2 id=\"proven-impact-real-world-success\">Proven impact: Real-world success<\/h2>\n\n<p>CDP Scenarios are already delivering measurable business results, with over 12 live production implementations. For instance, in a case study addressing Grab Unlimited subscription signup abandonment, we leveraged CDP Scenarios to increase signups by engaging users in real time within 15 minutes of them leaving the signup process.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/cdp-scenario\/figure-4.png\" alt=\"\" style=\"width:90%\" \/><figcaption align=\"middle\">Figure 3. Grab Unlimited sign-up journey.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<p>To enhance conversion rates, personalized real-time nudges were deployed through Scenarios. For example, users who started the signup process but failed to complete it within 15 minutes received a follow-up notification, prompting them to finalize their registration.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/cdp-scenario\/figure-5.png\" alt=\"\" style=\"width:80%\" \/><figcaption align=\"middle\">Figure 4. Scenario flow for Grab Unlimited registration.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<p>This scenario alone achieved more than a 3% uplift in subscriber conversions vs non-real-time acquisition campaigns, demonstrating Scenarios\u2019 potential to significantly boost business outcomes.<\/p>\n\n<h2 id=\"technical-architecture-low-latency-high-reliability\">Technical architecture: Low latency, high reliability<\/h2>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/cdp-scenario\/figure-6.jpg\" alt=\"\" style=\"width:90%\" \/><figcaption align=\"middle\">Figure 5. High-level scenario flow. Scenarios are designed for low latency (under 15 seconds) and high reliability.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<ol>\n  <li><strong>Event registration<\/strong>: Popular UI events from Scribe are whitelisted and immediately available; custom events are onboarded via the CDP web portal.<\/li>\n  <li><strong>Scenario creation<\/strong>: Users configure Scenarios through a user-friendly GUI, defining events, historical contexts, and predictive models.<\/li>\n  <li><strong>Real-time Flink processing<\/strong>: Incoming events trigger Scenarios, evaluating user historical data via StarRocks and performing real-time predictions using pre-trained models.<\/li>\n  <li><strong>Real-time data sync<\/strong>: Outcomes are synced back to Kafka or Amphawa (Grab\u2019s internal feature store built on AWS DynamoDB), enriching data for use by subsequent services.<\/li>\n  <li><strong>Consumption by downstream services<\/strong>: Kafka streams or CDP\u2019s Profile SDK facilitates immediate, personalized user experiences.<\/li>\n<\/ol>\n\n<h2 id=\"advancing-the-future-of-real-time-personalization\">Advancing the future of real-time personalization<\/h2>\n\n<p>As we continue to innovate, we are focused on enhancing the capabilities of CDP Scenarios to support more complex and scalable personalization use cases. Here are some key areas of improvement we are exploring:<\/p>\n\n<ul>\n  <li>\n    <p><strong>Optimized Scenario sharding for scalable processing<\/strong>: To accommodate the growing number of use cases, we plan to scale and orchestrate our Flink pipeline fleet in a headless manner. This approach will improve system stability and enable seamless management of complex Scenarios across the pipeline.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Enhanced signal distribution across multiple destinations<\/strong>: Currently, Scenario outputs are limited to a single topic or sink. To address the increasing diversity of use cases, we aim to expand signal distribution, allowing downstream consumers to access Scenario outcomes through multiple scalable and reliable channels.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Advanced scheduling and delayed triggering<\/strong>: While real-time computation of Scenario signals is effective, certain use cases require delayed activation for maximum impact. We are exploring ways to compute signals instantly but trigger actions at scheduled times, such as sending a push notification for booking a return Grab ride based on the average wait time at the drop-off location.<\/p>\n  <\/li>\n<\/ul>\n\n<h2 id=\"conclusion-revolutionizing-real-time-personalization\">Conclusion: Revolutionizing real-time personalization<\/h2>\n\n<p>The launch of CDP Scenarios represents a significant milestone for Grab, paving the way for scalable, efficient, and user-friendly real-time personalization. Initial successes have demonstrated its immense potential, delivering notable improvements in user engagement and conversion rates. Looking ahead, we are committed to continuously advancing Scenarios by expanding its features, integrations, and applications to further elevate user experiences across the Grab ecosystem.<\/p>\n\n<h2 id=\"join-us\">Join us<\/h2>\n\n<p>Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people everyday to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line \u2013 we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.<\/p>\n\n<p>Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, <a href=\"https:\/\/grb.to\/gebdef1\">join our team<\/a> today!<\/p>\n\n","pubDate":"Thu, 18 Dec 2025 00:23:00 +0000","link":"https:\/\/engineering.grab.com\/cdp-scenarios","guid":"https:\/\/engineering.grab.com\/cdp-scenarios","category":["Database","FlinkSQL","Engineering"]},{"title":"A Decade of Defense: Celebrating Grab's 10th Year Bug Bounty Program","description":"<h2 id=\"introduction\">Introduction<\/h2>\n\n<p>Ten years ago, we launched our bug bounty program in partnership with <a href=\"https:\/\/www.hackerone.com\/blog\">HackerOne<\/a>. Beyond a security initiative, it represented an open invitation to collaborative development.\nAs pioneers in Southeast Asia, we began the program with 23 initial researchers, and it has since evolved into a global community of security researchers.<\/p>\n\n<p>The strategic structure and scope of our Bug Bounty Program, combined with our continuous innovation and experimentation, have successfully captured the attention of the global security research community. Over the past decade, we have partnered with more than 850 active security researchers from HackerOne\u2019s community of over 2 million cybersecurity professionals worldwide. These dedicated researchers work alongside us across borders and time zones, forming a collaborative defense network that helps protect over 187 million users throughout Southeast Asia. Their ongoing participation demonstrates both the maturity of our program and the trust we\u2019ve built within the security research community.<\/p>\n\n<p>This milestone reflects the strength of shared purpose and our sustained partnership with the HackerOne platform. It demonstrates the value of human connection and the collective understanding that security is stronger through collaboration. Here\u2019s to a decade of partnership and to many more years of building a safer future, one collaboration at a time!<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/decade-of-defense\/figure-1.jpg\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 1. Ten years of achievements with our HackerOne partnership.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h3 id=\"evolution-and-growth-adapting-to-a-dynamic-threat-landscape\">Evolution and growth: Adapting to a dynamic threat landscape<\/h3>\n\n<p>Over the past ten years, our program has consistently adapted to the dynamic threat landscape and integrated invaluable feedback from our research community. We have grown from a private initiative to a program that consistently ranks among the top 20 worldwide and among the top 3 in Asia on HackerOne. Key milestones from our journey include:<\/p>\n\n<ul>\n  <li><strong>Expanding our horizons:<\/strong> Our scope significantly broadened in 2023-2024, continuously adding new assets and prominently including financial services in Indonesia and AI systems. This expansion provides researchers with more avenues to contribute to Grab\u2019s security.<\/li>\n  <li><strong>Focused mobile security:<\/strong> We introduced a dedicated bounty table for mobile-specific issues, recognizing the unique challenges of mobile security.<\/li>\n  <li><strong>Incentivizing excellence:<\/strong> We regularly experiment with campaigns of various types and targets, diversifying our reward methods to include both financial rewards and recognition.<\/li>\n  <li><strong>Evolving vulnerability focus:<\/strong> We\u2019ve observed a significant shift in the types of vulnerabilities reported over the decade, moving from foundational issues in early years to more sophisticated and emerging categories recently.<\/li>\n<\/ul>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/decade-of-defense\/figure-2.png\" alt=\"\" style=\"width:70%\" \/><figcaption align=\"middle\">Figure 2. The journey of our bug bounty program.<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h3 id=\"the-global-stage-connecting-with-the-best\">The global stage: Connecting with the best<\/h3>\n\n<p>Our program\u2019s success is deeply rooted in its vibrant global community, which we actively foster through continuous engagement. Our strategy extends beyond the platform to major live hacking events, including the <strong>ThreatCon Live Hacking Event 2023<\/strong> <strong>in Nepal<\/strong> and <strong>DEFCON 32\u2019s Live Recon Village 2024 in Las Vegas.<\/strong> These initiatives have been instrumental in connecting us with a diverse pool of new talent and strengthening relationships with researchers across different continents. By meeting hackers where they are, we\u2019ve not only brought new expertise into our ecosystem but also demonstrated our commitment to being an accessible and collaborative partner on a global scale.<\/p>\n\n<p>The high participation and quality submissions from these events demonstrate the effectiveness of this approach. They\u2019ve expanded our global security testing coverage and strengthened our standing within the worldwide cybersecurity community. Through ongoing interactions and submitted reports, we continue to see that security is a collaborative effort with no borders.<\/p>\n\n<h3 id=\"exclusive-anniversary-celebrations-global-club-campaigns\">Exclusive anniversary celebrations: Global club campaigns<\/h3>\n\n<p>To commemorate our 10th anniversary, we launched three exclusive, invite-only campaigns with HackerOne\u2019s regional clubs in <strong>Germany, Morocco, and India<\/strong>. These campaigns served as cultural exchanges, bringing fresh perspectives from outside our core Southeast Asian consumer markets. By engaging with these clubs, we expanded our researcher community and connected with security experts who understand different threat landscapes and methodologies, bringing outside perspectives to our systems.<\/p>\n\n<p>In August, we also ran a broader anniversary campaign that drew significant participation from the researcher community, resulting in 461 submissions. <a href=\"https:\/\/hackerone.com\/xchopath?type=user\">xchopath<\/a> was awarded the Best Hacker Bonus for their contributions during this campaign.<\/p>\n\n<p>These campaigns expanded our global security testing coverage and strengthened relationships with international researcher communities. Beyond vulnerability reports, they functioned as knowledge-sharing initiatives. We connected directly with researchers to learn from their experience and feedback, creating a continuous loop of improvement. This international collaboration also informed our global expansion security strategy by providing insights into how different regions approach digital payments and authentication.<\/p>\n\n<p>The anniversary campaigns allowed us to validate our security frameworks against diverse regulatory environments and advanced testing methodologies from established security markets, reinforcing our commitment to maintaining robust security standards.<\/p>\n\n<h3 id=\"voices-from-our-community\">Voices from our community<\/h3>\n\n<p>Behind every vulnerability report is a researcher who chose to help make Grab safer. Their perspectives reveal the human side of our security evolution. These individuals are not just cybersecurity experts; they are partners in our mission to protect millions of users and ensure a safe digital environment. Here are a few testimonies from participants in our past campaigns:<\/p>\n\n<ul>\n  <li>\n    <p>\u201cThe triage was very fast despite the time difference, which I really appreciated. The triaging experience was better than other programs. The huge scope and business portal with different user roles made it especially interesting to explore.\u201d \u2013 <a href=\"https:\/\/hackerone.com\/artsec?type=user\"><em>ArtSec<\/em><\/a> <em>[H1 Germany club campaign participant]<\/em><\/p>\n  <\/li>\n  <li>\n    <p>\u201cI liked that different countries have different features\u2014this gives me more attack surface to explore. Response time was great, triage was very fast, and I appreciated Grab\u2019s effort in providing fast responses. The scope was huge with a lot of wildcards for reconnaissance.\u201d \u2013 <a href=\"https:\/\/hackerone.com\/sicksec?type=user\"><em>Sicksec<\/em><\/a> <em>[H1 Morocco club campaign participant]<\/em><\/p>\n  <\/li>\n  <li>\n    <p>\u201cMore than 20 bugs were reported, and was particularly happy that bounties were being paid upon triage. The Germany team spent a lot of time on the educational part, especially for newcomers. Communication overall was very good, and the immediate response even outside working hours was really cool. SSO and authentication is my expertise and I liked that aspect of exploring the platform.\u201d \u2013 <a href=\"https:\/\/hackerone.com\/lauritz?type=user\"><em>Lauritz<\/em><\/a> <em>[H1 Germany club campaign participant]<\/em><\/p>\n  <\/li>\n<\/ul>\n\n<h3 id=\"the-road-ahead-our-commitment-to-a-secure-future\">The road ahead: Our commitment to a secure future<\/h3>\n\n<p>With a strong community of security researchers across countries and a decade of collaboration, we\u2019ve built meaningful partnerships. Every vulnerability report represents trust, and every discovery reflects dedication to our shared mission. The program demonstrates our choice to build together rather than work in isolation, to protect rather than exploit, and to collaborate rather than compete.<\/p>\n\n<p>While we celebrate our external community, the success of our program relies equally on our dedicated internal teams. Our cybersecurity teams form the operational foundation of this initiative. Their consistent responsiveness and researcher-focused approach have enabled vulnerability reporting to evolve into a genuine partnership, maintaining researcher trust and keeping Grab secure.<\/p>\n\n<p>The next ten years will bring challenges we can\u2019t yet imagine, from emerging threats in artificial intelligence to novel cryptographic approaches in a quantum-powered world. We will face them together as a community that spans cultures, time zones, and expertise.<\/p>\n\n<p>Together, we\u2019ll continue securing Southeast Asia\u2019s digital future, one partnership, one discovery, one shared achievement at a time.<\/p>\n\n<h2 id=\"join-us\">Join us<\/h2>\n\n<p>Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility, and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people every day to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line \u2013 we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.<\/p>\n\n<p>Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, <a href=\"https:\/\/grb.to\/gebdef\">join our team<\/a> today!<\/p>\n","pubDate":"Mon, 01 Dec 2025 00:00:10 +0000","link":"https:\/\/engineering.grab.com\/a-decade-of-defense","guid":"https:\/\/engineering.grab.com\/a-decade-of-defense","category":["Engineering","Performance","Engineering","Data"]},{"title":"Real-time data quality monitoring: Kafka stream contracts with syntactic and semantic test","description":"<h2 id=\"introduction\">Introduction<\/h2>\n\n<p>In today\u2019s data-driven landscape, monitoring data quality has become a critical need for ensuring reliable and efficient data usage across domains. High-quality data is the backbone of AI innovation, driving efficiency and unlocking new opportunities. As decentralized data ownership grows, the ability to effectively monitor data quality is essential for maintaining reliability in data systems.<\/p>\n\n<p>Kafka streams, as a vital component of real-time data processing, play a significant role in this ecosystem. However, unreliable data within Kafka streams can lead to errors and inefficiencies for downstream users, and monitoring the quality of data within these streams has always been a challenge. This blog introduces a solution that empowers stream users to define a data contract, specifying the rules that Kafka stream data must adhere to. By leveraging this user-defined data contract, the solution performs automated real-time data quality checks, identifies problematic data as it occurs, and promptly notifies stream owners. This ensures timely action, enabling effective monitoring and management of Kafka stream data quality while supporting the broader goals of data mesh and AI-driven innovation.<\/p>\n\n<h2 id=\"problem-statement\">Problem statement<\/h2>\n\n<p>In the past, monitoring Kafka stream data processing lacked an effective solution for data quality validation. This limitation made it challenging to identify bad data, notify users in a timely manner, and prevent the cascading impact on downstream users from further escalating.<\/p>\n\n<p><strong>Challenges in syntactic and semantic issue identification<\/strong>:<\/p>\n\n<ul>\n  <li><strong>Syntactic issues<\/strong>: Refers to schema mismatches between producers and consumers, which can lead to deserialization errors. While schema backward compatibility can be validated upon schema evolution, there are scenarios where the actual data in the Kafka topic does not align with the defined schema. For example, this can occur when a rogue Kafka producer is not using the expected schema for a given Kafka topic. Identifying the specific fields causing these syntactic issues is a typical challenge.<\/li>\n  <li><strong>Semantic issues<\/strong>: Refers to inconsistencies or misalignments between producers and consumers about the expected pattern or significance of each field. Unlike Kafka stream schemas, which act as a data structure contract between producers and consumers, there is no existing framework for stakeholders to define and enforce field-level semantic rules, for example, the expected length or pattern of an identifier.<\/li>\n<\/ul>\n\n<p><strong>Timeliness challenge in data quality monitoring<\/strong>: There is no real-time mechanism to automatically validate data against predefined rules, timely identify quality issues, and promptly alert stream stakeholders. Without real-time stream validation, data quality issues can sometimes persist for periods of time, impacting various online and offline downstream systems before being discovered.<\/p>\n\n<p><strong>Observability challenge for troubleshooting bad data<\/strong>: Even when problematic data is identified, stream users face difficulties in pinpointing the exact \u201cpoison data\u201d and understanding which fields are incompatible with the schema or violate semantic rules. This lack of visibility complicates Root Cause Analysis and resolution efforts.<\/p>\n\n<h2 id=\"solution\">Solution<\/h2>\n\n<p>Our <a href=\"https:\/\/engineering.grab.com\/an-elegant-platform\">Coban platform<\/a> offers a standardized data quality test and observability solution at the platform level, consisting of the following components:<\/p>\n\n<ul>\n  <li><strong>Data Contract Definition<\/strong>: Enables Kafka stream stakeholders to define contracts that include schema agreements, semantic rules that Kafka topic data must comply with, and Kafka stream ownership details for alerting and notifications.<\/li>\n  <li><strong>Automated Test Execution<\/strong>: Provides a long running Test Runner to automatically execute real-time tests based on the defined contract.<\/li>\n  <li><strong>Real-time Data Quality Issue Identification<\/strong>: Detects data issues at both syntactic and semantic levels in real-time.<\/li>\n  <li><strong>Alerts and Result Observability<\/strong>: Alerts users, simplifying observation of data quality issues via the platform.<\/li>\n<\/ul>\n\n<h3 id=\"architecture-details\">Architecture details<\/h3>\n\n<p>The solution includes three components: <em>Data Contract Definition, Test Execution &amp; Data Quality Issue Identification, and Result Observability as shown in the architecture diagram in figure 1<\/em>. All mentions of \u201cFlow\u201d from here onwards refer to the corresponding processes illustrated in figure 1.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/real-time-data-quality-monitoring\/coban-architecture.jpg\" alt=\"\" style=\"width:100%\" \/><figcaption align=\"middle\">Figure 1. Real-time Kafka Stream Data Quality Monitoring Architecture diagram.\n<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h4 id=\"data-contract-definition\">Data Contract Definition<\/h4>\n\n<p>The Coban Platform streamlines the process of defining Kafka stream data contracts, serving as a formal agreement among Kafka stream stakeholders. This includes the following components:<\/p>\n\n<ul>\n  <li><strong>Kafka Stream Schema<\/strong>: Represents the schema used by the Kafka topic under test and helps the Test Runner to validate schema compatibility across data streams (Flow 1.1).<\/li>\n  <li><strong>Kafka Stream Configuration<\/strong>: Encompasses essential configurations such as the endpoint and topic name, which the platform automatically populates (Flow 1.2).<\/li>\n  <li><strong>Observability Metadata<\/strong>: Provides contact information for notifying Kafka stream stakeholders about data quality issues and includes alert configurations for monitoring (Flow 1.3).<\/li>\n  <li><strong>Kafka Stream Semantic Test Rules<\/strong>: Empowers users to define intuitive semantic test rules at the field level. These rules include checks for string patterns, number ranges, constant values, etc. (Flow 1.5).<\/li>\n  <li><strong>LLM-Based Semantic Test Rules Recommendation<\/strong>: Defining dozens if not hundreds of field-specific test rules can overwhelm users. To simplify this process, the Coban Platform uses LLM-based recommendations to predict semantic test rules using provided Kafka stream schemas and anonymized sample data (Flow 1.4). This feature helps users set up semantic rules efficiently, as demonstrated in the sample UI in figure 2.<\/li>\n<\/ul>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/real-time-data-quality-monitoring\/sample-ui.png\" alt=\"\" style=\"width:100%\" \/><figcaption align=\"middle\">Figure 2. Sample UI showcasing LLM-based Kafka stream schema field-level semantic test rules. Note that the data shown is entirely fictional.\n<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h4 id=\"data-contract-transformation\">Data Contract Transformation<\/h4>\n\n<p>Once defined, the Coban Platform\u2019s transformation engine converts the data contract into configurations that the Test Runner can interpret (Flow 2.1). This transformation process includes:<\/p>\n\n<ul>\n  <li><strong>Kafka Stream Schema<\/strong>: Translates the schema defined in the data contract into a schema reference that the Test Runner can parse.<\/li>\n  <li><strong>Kafka Stream Configuration<\/strong>: Sets up the Kafka stream as a source for the Test Runner.<\/li>\n  <li><strong>Observability metadata<\/strong>: Sets contact information as configurations of the Test Runner.<\/li>\n  <li><strong>Kafka Stream Semantic Test Rules<\/strong>: Transforms human-readable semantic test rules into an inverse SQL query to capture the data that violates the defined rules.<\/li>\n<\/ul>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/real-time-data-quality-monitoring\/semantic-test-rules.jpg\" alt=\"\" style=\"width:100%\" \/><figcaption align=\"middle\">Figure 3. Illustration of semantic test rules being converted from human-readable formats into inverse SQL queries.\n<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h3 id=\"test-execution--data-quality-issue-identification\">Test Execution &amp; Data Quality Issue Identification<\/h3>\n\n<p>Once the Test Configuration Transformation Engine generates the Test Runner configuration (Flow 2.1), the platform automatically deploys the Test Runner.<\/p>\n\n<h4 id=\"test-runner\">Test Runner<\/h4>\n\n<p>The Test Runner utilises FlinkSQL as the compute engine to execute the tests. FlinkSQL was selected for its flexibility in defining test rules as straightforward SQL statements, enabling our platform to efficiently convert data contracts into enforceable rules.<\/p>\n\n<h4 id=\"test-execution-workflow-and-problematic-data-identification\">Test Execution Workflow And Problematic Data Identification<\/h4>\n\n<p>FlinkSQL consumes data from the Kafka topic under test (Flow 2.2) using its own consumer group, ensuring it doesn\u2019t impact other consumers. It runs the inverse SQL query (Flow 2.3) to identify any data that violates the semantic rules or that is syntactically incorrect in the first place. Test Runner captures such data, packages it into a data quality issue event enriched with a test summary, the total count of bad records, and sample bad data, and publishes it to a dedicated Kafka topic (Flow 3.2). Additionally, the platform sinks all such data quality events to an AWS S3 bucket (Flow 3.1) to enable deeper observability and analysis.<\/p>\n\n<h3 id=\"result-observability\">Result Observability<\/h3>\n\n<p>Grab\u2019s in-house data quality observability platform, Genchi, consumes problematic data captured by the Test Runner (Flow 3.3).<\/p>\n\n<h4 id=\"alerting\">Alerting<\/h4>\n<p>Genchi sends Slack notifications (Flow 3.5) to stream owners specified in the data contract observability metadata. These notifications include detailed information about stream issues, such as links to sample data in Coban UI, observed windows, counts of bad records, and other relevant details.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/real-time-data-quality-monitoring\/slack-notification.png\" alt=\"\" style=\"width:80%\" \/><figcaption align=\"middle\">Figure 4. Sample Slack notifications\n<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h4 id=\"observability\">Observability<\/h4>\n\n<p>Users can access the Coban UI (Flow 3.4), displaying Kafka stream test rules and sample bad records, highlighting fields and values that violate rules.<\/p>\n\n<div class=\"post-image-section\"><figure>\n  <img src=\"\/img\/real-time-data-quality-monitoring\/sample-test-result.jpg\" alt=\"\" style=\"width:100%\" \/><figcaption align=\"middle\">Figure 5. In this Sample Test Result, the highlighted fields indicate violations of the semantic test rules.\n<\/figcaption>\n  <\/figure>\n<\/div>\n\n<h3 id=\"impact\">Impact<\/h3>\n\n<p>Since its deployment earlier this year, the solution has enabled Kafka stream users to define contracts with syntactic and semantic rules, automate test execution, and alert users when problematic data is detected, prompting timely action. It has been actively monitoring data quality across 100+ critical Kafka topics. The solution offers the capability to immediately identify and halt the propagation of invalid data across multiple streams.<\/p>\n\n<h2 id=\"conclusion\">Conclusion<\/h2>\n\n<p>We implemented and rolled out a solution to assist Grab engineers in effectively monitoring data quality in their Kafka streams. This solution empowers them to establish syntactic and semantic tests for their data. Our platform\u2019s automatic testing feature enables real-time tracking of data quality, with instant alerts for any discrepancies. Additionally, we provide detailed visibility into test results, facilitating the easy identification of specific data fields that violate the rules. This accelerates the process of diagnosing and resolving issues, allowing users to swiftly address production data challenges.<\/p>\n\n<h2 id=\"whats-next\">What\u2019s next<\/h2>\n\n<p>While our current solution emphasizes monitoring the quality of Kafka streaming data, further exploration will focus on tracing producers to pinpoint the origin of problematic data, as well as enabling more advanced semantic tests such as cross-field validations. Additionally, we aim to expand monitoring capabilities to cover broader aspects like data completeness and freshness, and integrate with <a href=\"https:\/\/www.gable.ai\/\">Gable AI<\/a> to detect Data Transfer Object (DTO) changes and semantic regressions in Go producers upon committing code to the Git repository. These enhancements will pave the way for a more robust, multidimensional data quality testing solution across a wider range.<\/p>\n\n<h2 id=\"references\">References<\/h2>\n\n<p><a href=\"https:\/\/www.oreilly.com\/library\/view\/driving-data-quality\/9781837635009\/\">Driving Data Quality with Data Contracts: A Comprehensive Guide to Building Reliable, Trusted, and Effective Data Platforms<\/a> by <a href=\"https:\/\/www.oreilly.com\/search?q=author:%22Andrew%20Jones%22\">Andrew Jones<\/a><\/p>\n\n<h2 id=\"join-us\">Join us<\/h2>\n\n<p>Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people everyday to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line \u2013 we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.<\/p>\n\n<p>Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, <a href=\"https:\/\/grb.to\/gebdatam\">join our team<\/a> today!<\/p>\n","pubDate":"Wed, 26 Nov 2025 00:00:10 +0000","link":"https:\/\/engineering.grab.com\/real-time-data-quality-monitoring","guid":"https:\/\/engineering.grab.com\/real-time-data-quality-monitoring","category":["Engineering","Kafka","Performance","Data Science","Data processing","Real-time streaming","Engineering","Data"]}]}}