Working with ORC files in Python
Optimized Row Columnar (ORC) is a high-performance columnar file format widely used in big data ecosystems such as Apache Hive, Spark, and Flink. This guide walks through working with the ORC file format in Python, with complete examples.
1. Introduction to the ORC File Format
ORC (Optimized Row Columnar) is a highly efficient columnar storage format originally developed for Apache Hive and now widely used across the Hadoop ecosystem. It is designed to store and process very large datasets efficiently in distributed systems. Instead of storing data row by row, ORC organizes data column by column and further groups rows into large logical units called stripes. Each stripe contains index data, row data, and a footer with metadata. This layout allows query engines to read only the columns and row ranges required for a given query, significantly reducing I/O.
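You can inspect this physical layout directly from Python. The following is a minimal sketch using PyArrow's ORCFile reader; it assumes the application_logs.orc file produced by the example in Section 2 already exists on disk.

import pyarrow.orc as orc

# Open the file and examine its structure without reading any row data.
orc_file = orc.ORCFile("application_logs.orc")
print("Rows:", orc_file.nrows)                # total row count
print("Stripes:", orc_file.nstripes)          # number of stripes
print("Compression:", orc_file.compression)   # codec recorded in the file
print("Schema:", orc_file.schema)             # column names and types

# Stripes can also be read individually; they are the unit most engines
# parallelize over when scanning ORC files.
first_stripe = orc_file.read_stripe(0)
print("Rows in first stripe:", first_stripe.num_rows)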
1.1 Key Advantages
- High compression ratios – ORC supports multiple compression algorithms (such as ZLIB, Snappy, and LZO) and applies compression at the column level, often achieving much better compression than row-based formats.
- Faster query performance – Columnar storage minimizes disk reads and improves CPU efficiency by enabling vectorized execution in query engines like Hive, Presto, and Spark.
- Efficient predicate pushdown – ORC stores per-column statistics (min, max, null count, etc.) that allow query engines to skip entire stripes or row groups that do not satisfy filter conditions.
- Built-in indexing and statistics – Lightweight indexes and rich metadata are embedded directly in the file, eliminating the need for external index structures.
Because of these characteristics, ORC is especially well-suited for read-heavy analytical workloads such as log analysis, business intelligence reporting, and data warehousing, where queries frequently scan large datasets but touch only a subset of columns.
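Column pruning is easy to demonstrate from Python. The sketch below, which again assumes the application_logs.orc file created in Section 2, reads just two columns; the remaining columns are never deserialized, which is where the I/O savings come from.

import pyarrow.orc as orc

# Request only the columns the analysis needs; all other columns
# in the file are skipped entirely.
table = orc.read_table(
    "application_logs.orc",
    columns=["service", "response_time_ms"],
)
print(table.to_pandas())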
1.2 When Should You Use ORC?
You should consider using ORC when:
- You are working with very large datasets that are primarily read-intensive rather than write-intensive.
- Your data is queried using analytical engines such as Apache Hive, Spark SQL, Presto, or Trino.
- Queries frequently access only a subset of columns, benefiting from ORC’s columnar storage model.
- You want to take advantage of predicate pushdown and built-in column statistics to reduce I/O (see the filtering sketch after this list).
- Efficient compression and reduced storage costs are important for your data pipeline.
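To give a feel for filter-driven reading, here is a minimal sketch using PyArrow's dataset API, which supports ORC as a file format (assuming the application_logs.orc file from Section 2). Note that full stripe-level skipping based on ORC statistics is an engine feature: Hive, Presto, and Trino use it aggressively, while PyArrow evaluates the filter during the scan.

import pyarrow.dataset as ds

# Expose the ORC file as a dataset and filter on a column value.
dataset = ds.dataset("application_logs.orc", format="orc")
errors = dataset.to_table(filter=ds.field("level") == "ERROR")
print(errors.to_pandas())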
2. End-to-End Python ORC Example
2.1 Required Dependencies
To work with ORC files in Python, you need to install the required third-party libraries. The pyarrow library provides native support for reading and writing ORC files, while pandas is used for data manipulation and analysis. Make sure you are using Python 3.8 or later before installing the dependencies.
pip install pyarrow pandas
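To confirm the installation, you can check that both libraries import and print their versions:

python -c "import pyarrow, pandas; print(pyarrow.__version__, pandas.__version__)"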
2.2 Writing, Reading, and Processing ORC Data in Python
import pyarrow as pa
import pyarrow.orc as orc
import pandas as pd


def _create_sample_log_data() -> pd.DataFrame:
    """Create sample log data including complex types."""
    log_data = {
        "timestamp": [
            "2026-01-01 10:00:00",
            "2026-01-01 10:01:00",
            "2026-01-01 10:02:00",
            "2026-01-01 10:03:00",
        ],
        "service": ["auth", "auth", "payment", "payment"],
        "level": ["INFO", "ERROR", "ERROR", "INFO"],
        "message": [
            "User login successful",
            "Invalid credentials",
            "Payment failed",
            "Payment completed",
        ],
        "response_time_ms": [120, 300, 850, 200],
        "tags": [
            ["security", "login"],
            ["security", "auth"],
            ["payment", "failure"],
            ["payment", "success"],
        ],
        "metadata": [
            {"ip": "10.0.0.1", "browser": "chrome"},
            {"ip": "10.0.0.2", "browser": "firefox"},
            {"ip": "10.0.0.3", "browser": "safari"},
            {"ip": "10.0.0.4", "browser": "edge"},
        ],
    }
    return pd.DataFrame(log_data)


def _get_orc_schema() -> pa.Schema:
    """Define the ORC schema, including complex data types."""
    return pa.schema([
        ("timestamp", pa.string()),
        ("service", pa.string()),
        ("level", pa.string()),
        ("message", pa.string()),
        ("response_time_ms", pa.int32()),
        ("tags", pa.list_(pa.string())),
        ("metadata", pa.struct([
            ("ip", pa.string()),
            ("browser", pa.string()),
        ])),
    ])


def _write_orc_file(table: pa.Table, file_path: str) -> None:
    """Write a PyArrow table to ORC with ZLIB compression."""
    with open(file_path, "wb") as f:
        orc.write_table(table, f, compression="zlib")


def _read_orc_file(file_path: str) -> pd.DataFrame:
    """Read an ORC file and return it as a pandas DataFrame."""
    with open(file_path, "rb") as f:
        orc_file = orc.ORCFile(f)
        table = orc_file.read()
    return table.to_pandas()


def _process_logs(df: pd.DataFrame) -> None:
    """Perform basic log analysis."""
    # Filter ERROR-level entries and count them per service.
    error_logs = df[df["level"] == "ERROR"]
    error_count_by_service = error_logs.groupby("service").size()
    print("\nError count per service:")
    print(error_count_by_service)

    # Average response time per service.
    avg_response_time = df.groupby("service")["response_time_ms"].mean()
    print("\nAverage response time (ms) per service:")
    print(avg_response_time)

    # Nested columns round-trip intact: lists and structs come back
    # as Python lists and dicts.
    print("\nSample complex fields:")
    print(df[["tags", "metadata"]])


def main() -> None:
    orc_file_path = "application_logs.orc"

    # Build the sample data and convert it to a columnar PyArrow table
    # using the explicit schema.
    df = _create_sample_log_data()
    schema = _get_orc_schema()
    table = pa.Table.from_pandas(df, schema=schema)

    _write_orc_file(table, orc_file_path)
    print("ORC file written successfully.")

    read_df = _read_orc_file(orc_file_path)
    print("\nData read from ORC file:")
    print(read_df)

    _process_logs(read_df)


if __name__ == "__main__":
    main()
2.2.1 Code Explanation
This example demonstrates a complete workflow for working with the ORC file format using PyArrow and pandas, organized into small helper functions for clarity. The program imports PyArrow for columnar data processing and native ORC support, and pandas for in-memory data manipulation.
The _create_sample_log_data function constructs a realistic application log dataset and returns it as a pandas DataFrame. It includes primitive fields (timestamps, service names, log levels, messages, and response times) as well as complex types: a list of tags and a metadata struct containing an IP address and browser name.
The _get_orc_schema function explicitly defines the schema using PyArrow, specifying column types and nested structures so that the complex fields are stored correctly and efficiently in the ORC file. In the main flow, the DataFrame is converted into an immutable, columnar PyArrow Table using this schema, which prepares the data for serialization and analytical workloads.
The _write_orc_file function writes the table to disk in ORC format with ZLIB compression, reducing storage size while preserving fast reads through column-level compression. The _read_orc_file function then reopens the file in binary mode, reads it with the ORCFile reader, and converts the resulting PyArrow Table back into a pandas DataFrame for analysis.
Finally, the _process_logs function analyzes the data: it filters ERROR-level entries, counts errors per service, computes average response times per service, and prints the nested columns to show that ORC's complex types survive the round trip into pandas.
The main function orchestrates the end-to-end flow, and the if __name__ == "__main__" guard allows the script to be imported as a module without side effects, resulting in a clean, maintainable example of writing, reading, and analyzing ORC data in Python.
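As an aside, for simple cases PyArrow also offers a one-call reader, so _read_orc_file could be collapsed to a single expression:

import pyarrow.orc as orc

# read_table accepts a path or file-like object and returns a pyarrow.Table.
df = orc.read_table("application_logs.orc").to_pandas()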
2.3 Program Output
ORC file written successfully.
Data read from ORC file:
timestamp service level message response_time_ms tags metadata
0 2026-01-01 10:00:00 auth INFO User login successful 120 [security, login] {'ip': '10.0.0.1', 'browser': 'chrome'}
1 2026-01-01 10:01:00 auth ERROR Invalid credentials 300 [security, auth] {'ip': '10.0.0.2', 'browser': 'firefox'}
2 2026-01-01 10:02:00 payment ERROR Payment failed 850 [payment, failure] {'ip': '10.0.0.3', 'browser': 'safari'}
3 2026-01-01 10:03:00 payment INFO Payment completed 200 [payment, success] {'ip': '10.0.0.4', 'browser': 'edge'}
Error count per service:
service
auth 1
payment 1
dtype: int64
Average response time (ms) per service:
service
auth 210.0
payment 525.0
Name: response_time_ms, dtype: float64
Sample complex fields:
tags metadata
0 [security, login] {'ip': '10.0.0.1', 'browser': 'chrome'}
1 [security, auth] {'ip': '10.0.0.2', 'browser': 'firefox'}
2 [payment, failure] {'ip': '10.0.0.3', 'browser': 'safari'}
3 [payment, success] {'ip': '10.0.0.4', 'browser': 'edge'}
3. Conclusion
ORC is a powerful columnar file format designed for high-performance analytics. With Python libraries like PyArrow, working with ORC files is simple and efficient, and because the same files are readable by JVM-based engines such as Hive and Spark, ORC fits naturally into cross-language big data pipelines. If your workload involves large datasets, complex schemas, and analytical queries, ORC is an excellent choice.