Upgrade Guide

Table of Contents

Upgrading from 0.28.x to 0.29.x
- 1) JsonType now uses Json value object instead of string
Upgrading from 0.26.x to 0.27.x
- 1) Force EntryFactory $entryFactory to be required on array_to_row & array_to_row(s)
Upgrading from 0.16.x to 0.17.x
Upgrading from 0.15.x to 0.16.x
- 1) Deprecated Flow\ETL\DataFrame::renameAll* methods
- 2) Deprecated RenameAllCaseTransformer & RenameStrReplaceAllEntriesTransformer
Upgrading from 0.14.x to 0.15.x
- 1) Removed Flow\ETL\Row\Schema\Matcher and implementations
- 2) Renamed Flow\ETL\Row\Schema namespace into Flow\ETL\Schema.
Upgrading from 0.11.x to 0.14.x
- 1) Replaced Flow\ETL\DataFrame::validate() with Flow\ETL\DataFrame::match()
- 2) Replaced Flow\ETL\Function\ScalarFunction\TypedScalarFunction with Flow\ETL\Function\ScalarFunction\ScalarResult
Upgrading from 0.10.x to 0.11.x
- 1) Removed StructureElement/struct_element/structure_element from StructureType Definition
- 2) Doctrine DBAL Adapter
Upgrading from 0.8.x to 0.10.x
- 1) Providing multiple paths to a single extractor
- 2) Passing optional arguments to extractors/loaders
Upgrading from 0.7.x to 0.8.x
- 1) Joins
- 2) GroupBy
Upgrading from 0.6.x to 0.7.x
- 1) DataFrame::appendSafe() method was removed
Upgrading from 0.5.x to 0.6.x
- 1) Rows::merge() accepts single instance of Rows
Upgrading from 0.4.x to 0.5.x
Upgrading from 0.3.x to 0.4.x

This document provides guidelines for upgrading between versions of Flow PHP. Please follow the instructions for your specific version to ensure a smooth upgrade process.

Upgrading from 0.28.x to 0.29.x

1) JsonType now uses Json value object instead of string

The JsonType has been refactored to use a dedicated Json value object (similar to Uuid/UuidType pattern). This allows static analysis tools to distinguish between regular strings and JSON strings.

Breaking Changes:

JsonType::assert() now returns Json instance instead of string
JsonType::cast() now returns Json instance instead of string
JsonType::isValid() now checks for Json instance (plain strings are no longer valid)
Cast::cast('json', $value) function now returns Json object instead of string
type_json() return type annotation changed from Type<string> to Type<Json>
JsonEntry::value() now returns ?Json instead of ?array (consistent with UuidEntry::value() returning ?Uuid)
JsonEntry::json() method removed (use value() instead)

Migration:

If you were using type_json()->cast($value) and expected a string, use ->toString():

Before:

$jsonString = type_json()->cast($array); // was string

After:

$json = type_json()->cast($array); // now Json object
$jsonString = $json->toString(); // get the string
$jsonArray = $json->toArray(); // get as array

If you were using JsonEntry::value() and expected an array:

Before:

$entry = json_entry('data', ['key' => 'value']);
$array = $entry->value(); // was array

After:

$entry = json_entry('data', ['key' => 'value']);
$json = $entry->value(); // now Json object
$array = $json?->toArray(); // get as array
$string = $json?->toString(); // get as string

If you were using JsonEntry::json():

Before:

$json = $entry->json();

After:

$json = $entry->value(); // json() method removed, use value() instead

New Json value object features:

use Flow\Types\Value\Json;

// Create from string
$json = new Json('{"key": "value"}');

// Create from array
$json = Json::fromArray(['key' => 'value']);

// Check if valid JSON
Json::isValid('{"key": "value"}'); // true

// Convert to string/array
$json->toString(); // '{"key":"value"}'
$json->toArray(); // ['key' => 'value']

// Json implements Stringable
(string) $json; // '{"key":"value"}'

// Json implements JsonSerializable
json_encode($json); // '{"key":"value"}'

Note: JsonEntry::value() now returns ?Json for consistency with UuidEntry::value() returning ?Uuid. Use ->toArray() or ->toString() on the Json object to get the underlying data.

Row methods behavior:

// Row::toArray() converts Json to array automatically (for convenient serialization)
$row->toArray();           // Returns ['data' => ['key' => 'value']] not ['data' => Json(...)]

// Row::valueOf() returns the raw value (Json object for json entries)
$row->valueOf('data');     // Returns Json object (use ->toArray() if you need array)

// Entry value() returns the typed value
$row->get('data')->value();  // Returns Json object (use ->toArray() if you need array)

Upgrading from 0.26.x to 0.27.x

1) Force `EntryFactory $entryFactory` to be required on `array_to_row` & `array_to_row(s)`

Before:

to_entry('name', 'data');
array_to_row([]);
array_to_rows([]);

After:

to_entry('name', 'data', flow_context(config())->entryFactory());
array_to_row([], flow_context(config())->entryFactory());
array_to_rows([], flow_context(config())->entryFactory());

Upgrading from 0.16.x to 0.17.x

1) Removed $nullable property from all types

Before:

type_string(nullable:true)->toString() // ?string

After:

type_optional(string())->toString() // ?string

2) Removed precision from `float_type()`

Previously, float_type() had a default precision of 6. This meant that any operation on floats had to round values to the given precision. The problem with this approach is that all operations now need to receive a dedicated rounding option.

Instead, end users should handle the precision of float columns through the round() scalar function.
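As a rough illustration of rounding explicitly at the point where precision matters, here is a plain PHP sketch. It uses PHP's built-in round() rather than Flow's round() scalar function, and the precision of 6 is simply the old default mentioned above, chosen here as an assumption.

```php
<?php

declare(strict_types=1);

// Floating point arithmetic carries a small representation error.
$sum = 0.1 + 0.2;

// Comparing the raw value against 0.3 fails...
$rawMatches = $sum === 0.3;

// ...while rounding explicitly to the precision you care about succeeds.
$roundedMatches = round($sum, 6) === 0.3;

var_dump($rawMatches, $roundedMatches);
```

Keeping the rounding step explicit means each column can choose its own precision instead of inheriting one from the type.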

3) Moved all Types to `Flow\Types\Type` namespace

Before:

\Flow\ETL\DSL\type_string(); // now deprecated, alias for \Flow\Types\DSL\type_string();

After:

\Flow\Types\DSL\type_string();

Upgrading from 0.15.x to 0.16.x

1) Deprecated `Flow\ETL\DataFrame::renameAll*` methods

Methods:

Flow\ETL\DataFrame::renameAll(),
Flow\ETL\DataFrame::renameAllLowerCase(),
Flow\ETL\DataFrame::renameAllUpperCase(),
Flow\ETL\DataFrame::renameAllUpperCaseFirst(),
Flow\ETL\DataFrame::renameAllUpperCaseWord(),

These methods were deprecated in favor of the new DataFrame::renameEach() method, used with an appropriate RenameEntryStrategy object.

2) Deprecated `RenameAllCaseTransformer` & `RenameStrReplaceAllEntriesTransformer`

Selected transformers were deprecated in favor of using DataFrame::renameEach() with related RenameEntryStrategy:

RenameAllCaseTransformer -> RenameCaseTransformer,
RenameStrReplaceAllEntriesTransformer -> RenameReplaceStrategy,

Upgrading from 0.14.x to 0.15.x

1) Removed `Flow\ETL\Row\Schema\Matcher` and implementations

Schema Matcher was the initial attempt to implement a schema evolution next to schema validation that over time got replaced with a different implementation of Schema Validator.

2) Renamed `Flow\ETL\Row\Schema` namespace into `Flow\ETL\Schema`.

This means all classes related to Schema now live under Flow\ETL\Schema namespace.

Upgrading from 0.11.x to 0.14.x

1) Replaced `Flow\ETL\DataFrame::validate()` with `Flow\ETL\DataFrame::match()`

The old method is now deprecated and will be removed in the next release.

2) Replaced `Flow\ETL\Function\ScalarFunction\TypedScalarFunction` with `Flow\ETL\Function\ScalarFunction\ScalarResult`

The old interface was used to define the return type of ScalarFunctions. It was replaced with the ScalarResult value object, which is much more flexible than the interface because it allows returning any type dynamically without making the scalar function stateful.

Upgrading from 0.10.x to 0.11.x

1) Removed StructureElement/struct_element/structure_element from StructureType Definition

Before:

type_structure([
    struct_element('name', string()),
    struct_element('age', integer()),
]);

After:

type_structure([
    'name' => string(),
    'age' => integer(),
]);

2) Doctrine DBAL Adapter

From now on, options for:

to_dbal_table_insert()
to_dbal_table_update()

are passed as objects (instances of the UpdateOptions|InsertOptions interfaces). They are platform specific, so please use the proper class for the platform you are using.

PostgreSQL
- PostgreSQLInsertOptions
- PostgreSQLUpdateOptions
MySQL
- MySQLInsertOptions
- MySQLUpdateOptions
SQLite
- SQLiteInsertOptions
- SQLiteUpdateOptions

Upgrading from 0.8.x to 0.10.x

1) Providing multiple paths to a single extractor

From now on, to read from multiple locations, use the from_all(Extractor ...$extractors) : Extractor extractor.

Before:

<?php

from_parquet([
    path(__DIR__ . '/data/1.parquet'),
    path(__DIR__ . '/data/2.parquet'),
]);

After:

<?php

from_all(
    from_parquet(path(__DIR__ . '/data/1.parquet')),
    from_parquet(path(__DIR__ . '/data/2.parquet')),
);

2) Passing optional arguments to extractors/loaders

From now on, all extractors/loaders accept only mandatory arguments. All optional arguments should be passed through with* methods using the fluent interface.

Before:

<?php

from_parquet(path(__DIR__ . '/data/1.parquet'), schema: $schema);

After:

<?php

from_parquet(path(__DIR__ . '/data/1.parquet'))->withSchema($schema);

Upgrading from 0.7.x to 0.8.x

1) Joins

To support joining bigger datasets, we had to move from the initial NestedLoop join algorithm to a Hash Join algorithm.

The only supported join expression is = (equals), which can be combined with AND and OR operators.
joinPrefix is now always required, and by default is set to 'joined_'.
A join will always return all columns from both datasets; columns used in the join condition will be prefixed with joinPrefix.

Other than that, the API stays the same.

The above changes were introduced in all 3 types of joins:

DataFrame::join()
DataFrame::joinEach()
DataFrame::crossJoin()
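To show why a hash join is limited to equality conditions, here is a minimal plain PHP sketch of the general technique. It is not Flow's implementation: hash_join_sketch() is a hypothetical helper, it handles a single inner-join condition only, and its prefixing rule is simplified to "prefix right-side columns whose names collide with left-side ones".

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of Flow: inner hash join of two lists of
 * associative arrays on a single equality condition.
 *
 * @param list<array<string, mixed>> $left
 * @param list<array<string, mixed>> $right
 * @return list<array<string, mixed>>
 */
function hash_join_sketch(array $left, array $right, string $leftKey, string $rightKey, string $prefix = 'joined_'): array
{
    // Build phase: index the right side by the serialized join key value.
    $index = [];
    foreach ($right as $row) {
        $index[serialize($row[$rightKey])][] = $row;
    }

    // Probe phase: one lookup per left row instead of scanning the right side.
    $result = [];
    foreach ($left as $row) {
        foreach ($index[serialize($row[$leftKey])] ?? [] as $match) {
            $joined = $row;
            foreach ($match as $name => $value) {
                // Prefix right-side columns that would collide with left-side ones.
                $joined[array_key_exists($name, $row) ? $prefix . $name : $name] = $value;
            }
            $result[] = $joined;
        }
    }

    return $result;
}

$users = [['id' => 1, 'name' => 'Ann'], ['id' => 2, 'name' => 'Bob']];
$orders = [['id' => 1, 'total' => 10], ['id' => 1, 'total' => 5]];

$joined = hash_join_sketch($users, $orders, 'id', 'id');
```

Because matching rows are found through a key lookup, only equality can be answered this way; a condition such as "less than" would require comparing every pair of rows, which is what a nested loop join does.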

2) GroupBy

From now on, DataFrame::groupBy() method will return GroupedDataFrame object, which is nothing more than a GroupBy statement Builder. To get the results, you first need to define the aggregation functions or optionally pivot the data.

Upgrading from 0.6.x to 0.7.x

1) DataFrame::appendSafe() method was removed

DataFrame::appendSafe() aka DataFrame::threadSafe() method was removed as it was introducing additional complexity and was not used in any of the adapters.

Upgrading from 0.5.x to 0.6.x

1) Rows::merge() accepts single instance of Rows

Before:

Rows::merge(Rows ...$rows) : Rows

After:

Rows::merge(Rows $rows) : Rows
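Code that previously passed several instances in one call can chain calls or fold a list instead. The sketch below only demonstrates the call shape: RowsSketch is a hypothetical stand-in written for this example, not the real Rows class, and it assumes merge() returns a new instance.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for Rows, only to show the call shape.
final class RowsSketch
{
    /** @param list<array<string, mixed>> $rows */
    public function __construct(public readonly array $rows = [])
    {
    }

    // Accepts a single instance, mirroring the new signature.
    public function merge(RowsSketch $rows): self
    {
        return new self([...$this->rows, ...$rows->rows]);
    }
}

$a = new RowsSketch([['id' => 1]]);
$b = new RowsSketch([['id' => 2]]);
$c = new RowsSketch([['id' => 3]]);

// Chain calls where the old code passed several arguments at once.
$chained = $a->merge($b)->merge($c);

// Or fold a list of instances into one.
$folded = array_reduce(
    [$b, $c],
    fn (RowsSketch $carry, RowsSketch $next): RowsSketch => $carry->merge($next),
    $a
);
```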

Upgrading from 0.4.x to 0.5.x

1) Entry factory moved from extractors to `FlowContext`

To improve code quality and reduce code coupling EntryFactory was removed from all constructors of extractors, in favor of passing it into FlowContext & re-using same entry factory in a whole pipeline.

2) Invalid schema has no fallback in `NativeEntryFactory`

Previously, passing a Schema into NativeEntryFactory::create() had a fallback for entries that were not found in the passed schema. Now the schema has higher priority and the fallback is no longer available; when a definition is missing from the passed schema, an InvalidArgumentException is thrown.

3) BufferLoader was removed

BufferLoader was removed in favor of the DataFrame::collect(int $batchSize = null) method, which now accepts an additional argument, $batchSize, that keeps collecting Rows from the Extractor until the given batch size is reached. This does exactly the same thing as BufferLoader did, but in a more generic way.

4) Pipeline Closure

Pipeline Closure was reduced to be only Loader Closure and it was moved to \Flow\ETL\Loader namespace. Additionally, \Closure::close method no longer requires Rows to be passed as an argument.

5) Parallelize

The DataFrame::parallelize() method is deprecated and will be removed; use the DataFrame::batchSize(int $size) method instead.

6) Rows in batch - Extractors

From now on, file-based Extractors will always yield one Row at a time. To merge them into bigger groups, use DataFrame::batchSize(int $size) right after the extractor method.

Before:

<?php

(new Flow())
    ->read(CSV::from(__DIR__ . '/1_mln_rows.csv', rows_in_batch: 100))
    ->write(To::output())
    ->count();

After:

(new Flow())
    ->read(CSV::from(__DIR__ . '/1_mln_rows.csv'))
    ->batchSize(100)
    ->write(To::output())
    ->count();

Affected extractors:

CSV
Parquet
JSON
Text
XML
Avro
DoctrineDBAL - rows_in_batch wasn't removed, but results are now yielded row by row instead of as a whole page.
GoogleSheet
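The relationship between "one row at a time" and a later batching step can be sketched in plain PHP with generators. Both functions below are hypothetical helpers written for this illustration; they are not Flow classes and do not reflect how DataFrame::batchSize() is implemented internally.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for a file-based extractor: yields one row at a time.
 *
 * @return \Generator<int, array<string, int>>
 */
function one_row_at_a_time(int $total): \Generator
{
    for ($i = 1; $i <= $total; $i++) {
        yield ['id' => $i];
    }
}

/**
 * Hypothetical helper: groups single rows into batches of at most $size rows.
 *
 * @param iterable<array<string, int>> $rows
 * @return \Generator<int, list<array<string, int>>>
 */
function in_batches(iterable $rows, int $size): \Generator
{
    $batch = [];
    foreach ($rows as $row) {
        $batch[] = $row;
        if (count($batch) === $size) {
            yield $batch;
            $batch = [];
        }
    }

    // Emit the last, possibly smaller, batch.
    if ($batch !== []) {
        yield $batch;
    }
}

$batchSizes = [];
foreach (in_batches(one_row_at_a_time(250), 100) as $batch) {
    $batchSizes[] = count($batch);
}
```

Separating the two concerns this way lets the source stay simple while the batch size is decided once, downstream, for the whole pipeline.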

7) `GoogleSheetExtractor`

Argument $rows_in_batch was renamed to $rows_per_page which no longer determines the size of the batch, but the size of the page that will be fetched from Google API. Rows are yielded one by one.

8) `DataFrame::threadSafe()` method was replaced by `DataFrame::appendSafe()`

DataFrame::appendSafe() does exactly the same thing as the old method; the name is just more descriptive and self-explanatory. It's no longer mandatory to set this flag to true when using SaveMode::APPEND; it's now set automatically.

9) Loaders - chunk size

Loaders no longer accept the chunk_size parameter. From now on, to control the number of rows saved at once, use the DataFrame::batchSize(int $size) method.

10) Removed DSL functions: `datetime_string()`, `json_string()`

Those functions were removed in favor of accepting string values in related DSL functions:

datetime_string() => datetime(),
json_string() => json() & json_object()

11) Removed Asynchronous Processing

More details can be found in this issue.

Removed etl-adapter-amphp
Removed etl-adapter-reactphp
Removed LocalSocketPipeline
Removed DataFrame::pipeline()

12) `CollectionEntry` removal

After adding native & logical types to Flow, we removed the CollectionEntry as obsolete. New types that cover it better are: ListType, MapType & StructureType along with related new entry types.

13) Removed `from*()` methods from scalar entries

Removed BooleanEntry::from(), FloatEntry::from(), IntegerEntry::from(), StringEntry::fromDateTime() methods in favor of using DSL functions.

14) Removed deprecated `Sha1IdFactory`

Class Sha1IdFactory was removed, use HashIdFactory class:

(new HashIdFactory('entry_name'))->withAlgorithm('sha1');

15) Deprecate DSL Static classes

DSL static classes were deprecated in favor of using functions defined in src/core/etl/src/Flow/ETL/DSL/functions.php file.

Deprecated classes:

src/core/etl/src/Flow/ETL/DSL/From.php
src/core/etl/src/Flow/ETL/DSL/Handler.php
src/core/etl/src/Flow/ETL/DSL/To.php
src/core/etl/src/Flow/ETL/DSL/Transform.php
src/core/etl/src/Flow/ETL/DSL/Partitions.php
src/adapter/etl-adapter-avro/src/Flow/ETL/DSL/Avro.php
src/adapter/etl-adapter-chartjs/src/Flow/ETL/DSL/ChartJS.php
src/adapter/etl-adapter-csv/src/Flow/ETL/DSL/CSV.php
src/adapter/etl-adapter-doctrine/src/Flow/ETL/DSL/Dbal.php
src/adapter/etl-adapter-elasticsearch/src/Flow/ETL/DSL/Elasticsearch.php
src/adapter/etl-adapter-google-sheet/src/Flow/ETL/DSL/GoogleSheet.php
src/adapter/etl-adapter-json/src/Flow/ETL/DSL/Json.php
src/adapter/etl-adapter-meilisearch/src/Flow/ETL/DSL/Meilisearch.php
src/adapter/etl-adapter-parquet/src/Flow/ETL/DSL/Parquet.php
src/adapter/etl-adapter-text/src/Flow/ETL/DSL/Text.php
src/adapter/etl-adapter-xml/src/Flow/ETL/DSL/XML.php

Upgrading from 0.3.x to 0.4.x

1) Transformers replaced with scalar functions

Transformers are a really powerful tool that was used in Flow since the beginning, but that tool was too powerful for the simple cases that were needed, and introduced additional complexity and maintenance issues when they were handwritten.

We reworked most of the internal transformers into new scalar functions and entry scalar functions (based on the built-in functions). We still use that powerful tool internally, but we don't expose it to end users; instead, we provide easy-to-use functions that cover all user needs.

All available functions can be found in ETL\Row\Function folder or in ETL\DSL\functions file, and entry scalar functions are defined in EntryScalarFunction.

Before:

<?php

use Flow\ETL\Extractor\MemoryExtractor;
use Flow\ETL\Flow;
use Flow\ETL\DSL\Transform;

(new Flow())
    ->read(new MemoryExtractor())
    ->rows(Transform::string_concat(['name', 'last name'], ' ', 'name'));

After:

<?php

use function Flow\ETL\DSL\concat;
use function Flow\ETL\DSL\lit;
use function Flow\ETL\DSL\ref;
use Flow\ETL\Extractor\MemoryExtractor;
use Flow\ETL\Flow;

(new Flow())
    ->read(new MemoryExtractor())
    ->withEntry('name', concat(ref('name'), lit(' '), ref('last name')));

2) `ref` function nullability

ref("entry_name") is no longer returning null when the entry is not found. Instead, it throws an exception. The same behavior can be achieved through using a newly introduced optional function:

Before:

<?php

use function Flow\ETL\DSL\ref;

ref('non_existing_column')->cast('string');

After:

<?php

use function Flow\ETL\DSL\optional;
use function Flow\ETL\DSL\ref;

optional(ref('non_existing_column'))->cast('string');
// or  
optional(ref('non_existing_column')->cast('string'));

3) Extractors output

Affected extractors:

CSV
JSON
Avro
DBAL
GoogleSheet
Parquet
Text
XML

Extractors no longer return data under an array entry called row, so unpacking row became redundant.

Because of that, DSL functions no longer expect the $entry_row_name parameter. If it was used anywhere, please remove it.

Before:

<?php 

(new Flow())
    ->read(From::array([['id' => 1, 'array' => ['a' => 1, 'b' => 2, 'c' => 3]]]))
    ->withEntry('row', ref('row')->unpack())
    ->renameAll('row.', '')
    ->drop('row')
    ->withEntry('array', ref('array')->arrayMerge(lit(['d' => 4])))
    ->write(To::memory($memory = new ArrayMemory()))
    ->run();

After:

<?php

(new Flow())
    ->read(From::array([['id' => 1, 'array' => ['a' => 1, 'b' => 2, 'c' => 3]]]))
    ->withEntry('array', ref('array')->arrayMerge(lit(['d' => 4])))
    ->write(To::memory($memory = new ArrayMemory()))
    ->run();

4) ConfigBuilder::putInputIntoRows() output is now prefixed with _ (underscore)

To avoid collisions with dataset columns, additional columns created by putInputIntoRows() are now prefixed with the _ (underscore) symbol.

Before:

<?php

$rows = (new Flow(Config::builder()->putInputIntoRows()))
            ->read(Json::from(__DIR__ . '/../Fixtures/timezones.json', 5))
            ->fetch();

foreach ($rows as $row) {
    $this->assertSame(
        [
            ...
            'input_file_uri',
        ],
        \array_keys($row->toArray())
    );
}

After:

<?php

$rows = (new Flow(Config::builder()->putInputIntoRows()))
            ->read(Json::from(__DIR__ . '/../Fixtures/timezones.json', 5))
            ->fetch();

foreach ($rows as $row) {
    $this->assertSame(
        [
            ...
            '_input_file_uri',
        ],
        \array_keys($row->toArray())
    );
}
