Data Frame
A Data Frame is the core component of Flow PHP's ETL framework. It represents a structured collection of tabular data that can be processed, transformed, and loaded efficiently. Think of it as a programmable spreadsheet that can handle large datasets with a minimal memory footprint.
Key Features
- Memory Efficient: Processes data in chunks using generators, avoiding memory exhaustion
- Lazy Evaluation: Operations are only executed when needed
- Immutable: Each transformation returns a new DataFrame instance
- Type Safe: Strict typing throughout with comprehensive schema support
- Chainable API: Fluent interface for building complex data pipelines
Understanding DataFrame Operations
DataFrame methods fall into two categories based on when they execute:
Lazy Operations (@lazy)
These methods build the processing pipeline without executing it immediately:
- Transformations: filter(), map(), withEntry(), select(), drop(), rename()
- Memory-intensive: collect(), sortBy(), groupBy(), join(), cache()
- Processing control: batchSize(), limit(), offset(), partitionBy()
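To make the distinction concrete, here is a minimal sketch (column names and values are illustrative) of a pipeline built only from lazy operations, using the ref()/lit() expression DSL; nothing is read or transformed until a trigger method is called:

```php
<?php

use function Flow\ETL\DSL\{data_frame, from_array, lit, ref};

// Building the pipeline - no rows are extracted or transformed yet.
$pipeline = data_frame()
    ->read(from_array([
        ['sku' => 'A-1', 'price' => 100],
        ['sku' => 'B-2', 'price' => 250],
    ]))
    ->filter(ref('price')->greaterThan(lit(150)))  // lazy
    ->withEntry('price_label', lit('premium'))     // lazy
    ->select('sku', 'price_label');                // lazy

// $pipeline only describes the work to be done; it executes when a
// trigger operation such as fetch(), forEach() or run() is called.
```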
Trigger Operations (@trigger)
These methods execute the entire pipeline and return results:
- Data retrieval: get(), getEach(), fetch(), count()
- Output operations: run(), forEach(), printRows(), printSchema()
- Schema inspection: schema(), display()
Important: Build your complete pipeline with lazy operations, then execute once with a trigger operation for optimal performance.
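As a hedged illustration of that advice (the data and column names are made up), compare triggering the same definition twice with building the pipeline once and triggering it once:

```php
<?php

use function Flow\ETL\DSL\{data_frame, from_array, lit, ref};

$users = from_array([
    ['id' => 1, 'age' => 30],
    ['id' => 2, 'age' => 25],
]);

// Discouraged: two trigger calls on the same definition - the extraction
// and filtering work is likely to be repeated for every trigger.
$adults = data_frame()->read($users)->filter(ref('age')->greaterThan(lit(18)));
$count = $adults->count(); // trigger #1
$rows = $adults->fetch();  // trigger #2

// Preferred: trigger once, then derive what you need from the result.
$rows = data_frame()->read($users)->filter(ref('age')->greaterThan(lit(18)))->fetch();
$count = $rows->count();   // counting the already-fetched rows in memory
```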
Creating DataFrames
DataFrames are created using the data_frame() DSL function and populated with data through extractors. The framework supports various data sources through adapter-specific extractors.
```php
<?php

use function Flow\ETL\DSL\{data_frame, from_array, lit, ref, to_output};

data_frame()
    ->read(from_array([
        ['id' => 1, 'name' => 'John', 'age' => 30],
        ['id' => 2, 'name' => 'Jane', 'age' => 25],
        ['id' => 3, 'name' => 'Bob', 'age' => 35],
    ]))
    ->filter(ref('age')->greaterThan(lit(25)))
    ->select('id', 'name')
    ->write(to_output())
    ->run();
```
Note: Flow PHP supports many data sources through specialized adapters. See individual adapter documentation for specific extractor usage (CSV, JSON, Parquet, databases, APIs, etc.).
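As a hedged sketch of swapping the in-memory extractor for an adapter-specific one, the example below reads a CSV file and writes the filtered result back to CSV. The from_csv()/to_csv() functions come from the CSV adapter package (their import namespace may differ between Flow versions), and the file paths are placeholders:

```php
<?php

use function Flow\ETL\Adapter\CSV\{from_csv, to_csv};
use function Flow\ETL\DSL\{data_frame, lit, ref};

data_frame()
    ->read(from_csv(__DIR__ . '/input/users.csv'))   // adapter-specific extractor
    ->filter(ref('age')->greaterThan(lit(25)))
    ->select('id', 'name')
    ->write(to_csv(__DIR__ . '/output/adults.csv'))  // adapter-specific loader
    ->run();
```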
Memory Management Best Practices
- Prefer Generator Methods: Use get(), getEach(), or getEachAsArray() over fetch() for large datasets, as shown in the sketch after this list
- Avoid Memory-Intensive Operations: Be cautious with collect(), sortBy(), groupBy(), and join() on large datasets
- Use Appropriate Batch Sizes: Start with 1000-5000 rows and adjust based on your memory constraints
- Monitor Memory Usage: Use run(analyze: true) to track memory consumption during development
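A minimal sketch of the practices above, using an illustrative in-memory dataset: generator-based retrieval with getEach(), and run(analyze: true) for development-time inspection:

```php
<?php

use function Flow\ETL\DSL\{data_frame, from_array};

$rows = \array_map(static fn (int $i) => ['id' => $i], \range(1, 100_000));

// Generator-based retrieval: getEach() yields one row at a time, so the
// full result set never has to be materialized in memory (unlike fetch()).
foreach (data_frame()->read(from_array($rows))->getEach() as $row) {
    // process a single row
}

// Development-time check: analyze: true makes the run collect execution
// statistics, including memory consumption.
data_frame()->read(from_array($rows))->run(analyze: true);
```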
Performance Optimization
- Push Operations to Data Source: When possible, perform filtering, sorting, and joins at the database/file level
- Minimize Data Movement: Apply filters early in the pipeline to reduce data volume
- Cache Strategically: Only cache expensive operations that will be reused multiple times
- Avoid Large Offsets: Use data source pagination instead of DataFrame offset() for large skips
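A hedged sketch that combines several of these hints on an illustrative in-memory source (it assumes an equals() comparison in the expression DSL): filter as early as possible, set an explicit batch size, and cache only a result that is reused:

```php
<?php

use function Flow\ETL\DSL\{data_frame, from_array, lit, ref};

$orders = from_array([
    ['id' => 1, 'status' => 'paid', 'total' => 120],
    ['id' => 2, 'status' => 'pending', 'total' => 80],
]);

$paid = data_frame()
    ->read($orders)                               // when possible, push this filter down to the data source instead
    ->filter(ref('status')->equals(lit('paid')))  // filter early to cut data volume for later steps
    ->batchSize(1000)                             // explicit batch size, tuned to memory constraints
    ->cache();                                    // cache only because this result feeds more than one trigger

// Later triggers on $paid can reuse the cached rows instead of
// re-reading and re-filtering the source.
$paid->printRows();
```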
Component Documentation
For detailed information about specific DataFrame operations, see the following component documentation:
Core Operations
- Building Blocks - Understanding Rows, Entries, and basic data structures
- Transformations - Reusable DataFrame transformations and the Transformation interface
- Select/Drop - Column selection and removal
- Rename - Column renaming strategies
- Map - Row transformations and data mapping
- Filter - Row filtering and conditions
- Execution Mode - Configure how strict the DataFrame is during execution
- Save Mode - Configure how Flow saves files
Data Processing
- Join - DataFrame joining operations
- Group By - Grouping and aggregation operations
- Pivot - Transform data from long to wide format
- Sort - Data sorting
- Limit - Result limiting and pagination
- Offset - Skipping rows and pagination
- Until - Conditional processing termination
- Window Functions - Advanced analytical functions
Memory & Performance
- Batch Processing - Controlling batch sizes and memory collection
- Partitioning - Data partitioning for efficient processing
- Caching - Performance optimization through caching
- Data Retrieval - Methods for getting processed data
Data Quality & Validation
- Schema - Schema management and validation
- Constraints - Data integrity constraints and business rules
- Error Handling - Error management strategies
Reliability & Recovery
- Retry Mechanisms - Automatic retry for transient failures
Output & Display
- Display - Data visualization and output