Overview

Purpose and Scope

This document provides an overview of Apache Amoro, a Lakehouse management system built on open data lake formats. It introduces the system architecture, core components, module structure, and key capabilities. For detailed information about specific subsystems, refer to:

Architecture details: Architecture Overview
Core concepts and terminology: Core Concepts
Module organization: Module Structure

What is Apache Amoro?

Apache Amoro is a Lakehouse management system that provides self-optimizing and management capabilities for data lake tables. It works with multiple compute engines (Apache Flink, Apache Spark, Trino) and supports various table formats (Iceberg, Mixed-Iceberg, Mixed-Hive, Paimon, Hudi).

Key Features:

Self-Optimizing: Automatically compacts small files, merges changes, and removes expired data
Multi-Format Support: Manages tables in Iceberg, Mixed-Iceberg, Mixed-Hive, Paimon, and Hudi formats
Unified Catalog Service: Provides catalog integration for all compute engines
Pluggable Architecture: Extensible optimizer and terminal implementations
Management Tools: Web UI dashboard and REST API for system administration

Primary Use Cases:

Managing data lake tables with automatic optimization
Providing streaming and batch data processing on data lakes
Upgrading Hive tables to data lake formats while maintaining compatibility
Building lakehouse architectures with infrastructure decoupling

Sources: README.md:40-98, pom.xml:34-36

System Architecture

The following diagram shows the high-level architecture of Apache Amoro and how its major components interact:

Amoro System Architecture

Sources: README.md:46-61, docs/admin-guides/deployment.md:71-97, amoro-ams/pom.xml:1-641

Architecture Components:

AMS (Amoro Management Service): Central management service providing three main interfaces:
- HTTP Server on port 1630 for dashboard and REST API
- Thrift Table Service on port 1260 for compute engine integration
- Thrift Optimizing Service on port 1261 for optimizer coordination
Compute Engines: External processing engines that read and write data through the AMS table service
Optimizer Containers: Execution environments for self-optimizing tasks (Local JVM, Flink, Spark, Kubernetes)
Storage Layer: Underlying storage systems for data files, metadata, and message queues
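
As a rough illustration of the three AMS interfaces listed above, the sketch below probes each port with a plain TCP connect. The port numbers come from the list; the host `localhost`, the `AMS_PORTS` mapping, and the helpers `port_is_open` and `probe_ams` are assumptions made for this example. A successful connect only shows that something is listening on the port, not that it is AMS.

```python
import socket

# Default AMS ports as listed above; the labels are descriptive only.
AMS_PORTS = {
    "http (dashboard and REST API)": 1630,
    "thrift table service": 1260,
    "thrift optimizing service": 1261,
}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe_ams(host: str = "localhost") -> dict:
    """Map each interface label to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in AMS_PORTS.items()}


if __name__ == "__main__":
    for name, reachable in probe_ams().items():
        print(f"{name}: {'reachable' if reachable else 'unreachable'}")
```

A check like this can serve as a first smoke test after starting AMS, before opening the dashboard or pointing a compute engine at the table service.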

Core Components

AMS Service Container

The AmoroServiceContainer is the main entry point that initializes and manages all AMS services.

Sources: CONTRIBUTING.md:134-141, amoro-ams/pom.xml:29-31

Service Interfaces

HTTP Server (Port 1630):

DashboardServer: Javalin-based HTTP server for web UI and REST API
Controllers: LoginController, CatalogController, TableController, OptimizerController, TerminalController
Authentication: Token, Basic, JWT, API Token
REST API: Catalog management, table operations, optimizer control

Thrift Services:

TableManagementServer (Port 1260): Table lifecycle operations for compute engines
OptimizingServiceServer (Port 1261): Task polling and optimizer coordination
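
Compute engines usually address the table service with a Thrift URL. The sketch below assembles and parses such a URL from a host, the table service port above, and a catalog name. The `thrift://host:port/catalog` layout, the function names, and the example names `ams-host` and `my_catalog` are assumptions for illustration; the engine-specific documentation is the authority on which property takes this value.

```python
from urllib.parse import urlsplit

TABLE_SERVICE_PORT = 1260  # Thrift table service port listed above


def table_service_url(host: str, catalog: str, port: int = TABLE_SERVICE_PORT) -> str:
    """Build a thrift:// URL for a catalog (the URL layout is an assumption)."""
    if not host or not catalog:
        raise ValueError("host and catalog must be non-empty")
    return f"thrift://{host}:{port}/{catalog}"


def parse_table_service_url(url: str) -> tuple:
    """Split a URL of the form built above into (host, port, catalog)."""
    parts = urlsplit(url)
    if parts.scheme != "thrift" or parts.hostname is None or parts.port is None:
        raise ValueError(f"not a thrift host:port URL: {url!r}")
    return parts.hostname, parts.port, parts.path.lstrip("/")


if __name__ == "__main__":
    url = table_service_url("ams-host", "my_catalog")
    print(url)
    print(parse_table_service_url(url))
```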

Sources: docs/admin-guides/deployment.md:71-97

Core Management Services

Sources: README.md:103-114, docs/admin-guides/deployment.md:141-175

Module Structure

Apache Amoro is organized into the following Maven modules:

Module Organization

Sources: pom.xml:47-58, README.md:99-114

Module Descriptions:

Module	Purpose	Key Components
`amoro-common`	Core abstractions and common utilities	Thrift API definitions, configuration, shared utilities
`amoro-ams`	Management service implementation	Service container, REST API, Thrift servers
`amoro-web`	Web dashboard frontend	Vue.js application for AMS management
`amoro-metrics`	Metrics reporting	Prometheus metric reporter
`amoro-format-iceberg`	Iceberg format integration	Iceberg table support, catalog implementations
`amoro-format-mixed`	Mixed format implementation	Mixed-Iceberg and Mixed-Hive table support
`amoro-format-paimon`	Paimon format integration	Paimon table metadata display
`amoro-format-hudi`	Hudi format integration	Hudi table metadata display
`amoro-optimizer`	Optimizer implementations	Local, Flink, Spark, and Kubernetes optimizers
`dist`	Binary distribution packaging	Maven assembly for release artifacts

Sources: README.md:99-114, pom.xml:47-58

Supported Table Formats and Engines

Table Formats

Apache Amoro supports multiple table formats, each serving different use cases:

Format	Description	Use Case
Iceberg	Native Apache Iceberg tables	Standard Iceberg functionality with AMS management
Mixed-Iceberg	Enhanced Iceberg with LogStore and ChangeStore	Streaming updates with high performance
Mixed-Hive	Hive-compatible mixed format	Upgrade Hive tables while maintaining compatibility
Paimon	Apache Paimon integration	Display Paimon table metadata and statistics
Hudi	Apache Hudi integration	Display Hudi table metadata and statistics

Sources: README.md:62-70

Engine Support Matrix

Engine	Version	Batch Read	Batch Write	Streaming Read	Streaming Write	Create Table	Alter Table
Flink	1.16.x, 1.17.x, 1.18.x	✓	✓	✓	✓	✓	✗
Spark	3.3, 3.5	✓	✓	✗	✗	✓	✓
Hive	2.x, 3.x	✓	✗	✗	✗	✗	✓
Trino	406	✓	✗	✗	✗	✗	✓
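
To make the matrix above easier to query, the sketch below encodes it as a small lookup structure and answers questions such as "which engines support streaming writes?". The values are copied from the table; the names `CAPABILITIES`, `ENGINE_MATRIX`, and `engines_supporting` are hypothetical and exist only for this example.

```python
# Column order matches the capability columns of the table above.
CAPABILITIES = (
    "Batch Read", "Batch Write", "Streaming Read",
    "Streaming Write", "Create Table", "Alter Table",
)

# One row per engine, in the same column order as CAPABILITIES.
ENGINE_MATRIX = {
    "Flink": (True, True, True, True, True, False),
    "Spark": (True, True, False, False, True, True),
    "Hive": (True, False, False, False, False, True),
    "Trino": (True, False, False, False, False, True),
}


def engines_supporting(*capabilities: str) -> list:
    """Return engines, in table order, that support every named capability."""
    # CAPABILITIES.index raises ValueError for an unknown capability name.
    indexes = [CAPABILITIES.index(name) for name in capabilities]
    return [
        engine
        for engine, row in ENGINE_MATRIX.items()
        if all(row[i] for i in indexes)
    ]


if __name__ == "__main__":
    print(engines_supporting("Streaming Write"))
    print(engines_supporting("Batch Write", "Alter Table"))
```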

Sources: README.md:79-88

Deployment Options

Apache Amoro can be deployed in multiple ways to suit different infrastructure requirements:

Deployment Architecture

Sources: docs/admin-guides/deployment.md:27-339, docker/build.sh:1-243, .github/workflows/docker-images.yml:1-267

Deployment Methods:

Standalone Binary: Unpack tarball, configure via YAML files, start with shell scripts
- Scripts: bin/ams.sh, bin/optimizer.sh
- Configuration: conf/config.yaml, conf/jvm.properties
Docker Containers: Pre-built images for AMS and optimizers
- apache/amoro: Main AMS service
- apache/amoro-flink-optimizer: Flink-based optimizer (versions 1.14.6, 1.20.0)
- apache/amoro-spark-optimizer: Spark-based optimizer (version 3.5.7)
Kubernetes with Helm: Production-grade deployment with scaling and HA
- Helm charts with ConfigMaps and StatefulSets
- Dynamic optimizer scaling via Kubernetes containers

Sources: docs/admin-guides/deployment.md:36-176, .github/workflows/docker-images.yml:37-265, docker/README.md:1-48

Build System

Apache Amoro uses Maven for building and dependency management:

Build Configuration

JDK Requirements: Java 11 (JDK 17 required for Trino module)
Maven Version: 3.9.11
Build Tool: Maven Wrapper (./mvnw)

Key Build Properties:

Property	Default Value	Description
`iceberg.version`	1.6.1	Apache Iceberg version
`spark.version`	3.5.7	Apache Spark version
`flink.version`	1.20.3	Apache Flink version
`hadoop.version`	3.4.0	Hadoop version (use `-Phadoop2` for 2.x)
`paimon.version`	1.2.0	Apache Paimon version
`hudi.version`	0.14.1	Apache Hudi version

Common Build Commands:
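
A minimal sketch of how typical invocations might be assembled, assuming the Maven Wrapper (`./mvnw`) named above. The `clean package` goals and the `-DskipTests` property are standard Maven usage rather than something confirmed by this page, `-Phadoop2` is the profile mentioned in the properties table, and the helper `build_command` is hypothetical.

```python
import shlex


def build_command(skip_tests: bool = False, profiles: tuple = ()) -> list:
    """Assemble a Maven Wrapper invocation as an argument list."""
    cmd = ["./mvnw", "clean", "package"]
    if skip_tests:
        cmd.append("-DskipTests")  # standard Maven property: do not run tests
    # Each profile name becomes a -P<name> flag, e.g. -Phadoop2.
    cmd.extend(f"-P{name}" for name in profiles)
    return cmd


if __name__ == "__main__":
    # Print the commands as they would be typed in a shell.
    print(shlex.join(build_command()))
    print(shlex.join(build_command(skip_tests=True)))
    print(shlex.join(build_command(skip_tests=True, profiles=("hadoop2",))))
```

Returning an argument list rather than a string lets the result be passed directly to `subprocess.run` without shell quoting concerns.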

Sources: README.md:117-134, pom.xml:72-171

System Requirements

Minimum Requirements:

Java 11 or higher
4GB RAM (8GB recommended for production)
Database: Derby (embedded), MySQL 5.5+, or PostgreSQL 14.x+
Optional: ZooKeeper 3.4.x+ for HA deployments
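
The Java requirement above can be checked from a version string such as the one reported by `java -version`. The sketch below only parses the string; how the string is obtained is left out, and the function names `java_major_version` and `meets_minimum` are hypothetical. It assumes the usual convention that legacy versions are written `1.x` (so `1.8.0_392` is Java 8) while newer ones start with the major number.

```python
import re


def java_major_version(version: str) -> int:
    """Extract the major version from a Java version string."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized Java version: {version!r}")
    first = int(match.group(1))
    # Legacy scheme: "1.8.0_392" means Java 8, so use the second component.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def meets_minimum(version: str, minimum: int = 11) -> bool:
    """Return True if the version satisfies the minimum major version."""
    return java_major_version(version) >= minimum


if __name__ == "__main__":
    for candidate in ("1.8.0_392", "11.0.21", "17.0.9"):
        print(candidate, meets_minimum(candidate))
```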

Supported Operating Systems:

Linux (recommended for production)
macOS (development)
Windows (with WSL)

Sources: docs/admin-guides/deployment.md:31-35

Next Steps

For detailed information about specific aspects of Apache Amoro:

Architecture Details: See Architecture Overview for component interactions
Key Concepts: See Core Concepts for terminology and data structures
Module Details: See Module Structure for package organization
Deployment Guide: See Deployment for installation and configuration
Development Setup: See Development for building and contributing

Sources: README.md:1-186, CONTRIBUTING.md:1-249
