Merged
175 changes: 175 additions & 0 deletions specification/ORCv2.md
@@ -261,6 +261,8 @@ message Type {
VARCHAR = 16;
CHAR = 17;
TIMESTAMP_INSTANT = 18;
GEOMETRY = 19;
GEOGRAPHY = 20;
}
// the kind of this type
required Kind kind = 1;
@@ -273,9 +275,84 @@ message Type {
// the precision and scale for decimal
optional uint32 precision = 5;
optional uint32 scale = 6;
repeated StringPair attributes = 7;
// Coordinate Reference System (CRS) for Geometry and Geography types
optional string crs = 8;
// Edge interpolation algorithm for Geography type
enum EdgeInterpolationAlgorithm {
SPHERICAL = 0;
VINCENTY = 1;
THOMAS = 2;
ANDOYER = 3;
KARNEY = 4;
}
optional EdgeInterpolationAlgorithm algorithm = 9;
}
```

#### Geometry & Geography Types

##### Background

The Geometry and Geography class hierarchy and its Well-Known Text (WKT) and
Well-Known Binary (WKB) serializations (ISO variant supporting XY, XYZ, XYM,
XYZM) are defined by [OpenGIS Implementation Specification for Geographic
information - Simple feature access - Part 1: Common architecture][sfa-part1],
from the [Open Geospatial Consortium (OGC)][ogc].

The version of the OGC standard first used here is 1.2.1, but future versions
may also be used if the WKB representation remains wire-compatible.

[sfa-part1]: https://portal.ogc.org/files/?artifact_id=25355
[ogc]: https://www.ogc.org/standard/sfa/

###### Coordinate Reference System

A Coordinate Reference System (CRS) defines how coordinates map to locations
on Earth.

The default CRS `OGC:CRS84` means that the geospatial features must be stored
in the order of longitude/latitude based on the WGS84 datum.

A custom CRS can be specified as a string value. An identifier-based approach
such as a [Spatial reference identifier][srid] is recommended.

For a geographic CRS, longitudes are bounded by [-180, 180] and latitudes are
bounded by [-90, 90].

[srid]: https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier

###### Edge Interpolation Algorithm

The algorithm used to interpolate edges of Geography values. It is one of the
following values:

* `spherical`: edges are interpolated as geodesics on a sphere.
* `vincenty`: [Vincenty's formulae](https://en.wikipedia.org/wiki/Vincenty%27s_formulae)
* `thomas`: Thomas, Paul D. Spheroidal geodesics, reference systems, & local geometry. US Naval Oceanographic Office, 1970.
* `andoyer`: Thomas, Paul D. Mathematical models for navigation systems. US Naval Oceanographic Office, 1965.
* `karney`: [Karney, Charles FF. "Algorithms for geodesics." Journal of Geodesy 87 (2013): 43-55](https://link.springer.com/content/pdf/10.1007/s00190-012-0578-z.pdf), and [GeographicLib](https://geographiclib.sourceforge.io/)

###### CRS Customization

CRS is represented as a string value. Writer and reader implementations are
responsible for serializing and deserializing the CRS, respectively.

As a convention to maximize interoperability, custom CRS values can be
specified by a string of the format `type:identifier`, where `type` is one of
the following values:

* `srid`: [Spatial reference identifier](https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier). Here `identifier` is the SRID itself.
* `projjson`: [PROJJSON](https://proj.org/en/stable/specifications/projjson.html). Here `identifier` is the name of a table property or a file property where the PROJJSON string is stored.
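
The `type:identifier` convention above could be handled by a reader roughly as follows. This is a minimal Python sketch, not part of any ORC API: `parse_crs` and the `"opaque"` label are hypothetical, and falling back to `OGC:CRS84` when no CRS is set is an assumption based on the default CRS named earlier in this section.

```python
from typing import Optional, Tuple

DEFAULT_CRS = "OGC:CRS84"            # default CRS named in the spec text
KNOWN_TYPES = ("srid", "projjson")   # the two conventions listed above


def parse_crs(crs: Optional[str]) -> Tuple[str, str]:
    """Split a CRS string into (type, identifier).

    Hypothetical helper: strings that do not follow the `type:identifier`
    convention are returned unchanged under the made-up label "opaque",
    leaving their interpretation to the caller.
    """
    if not crs:
        # Assumption: an unset CRS means the default CRS.
        return "opaque", DEFAULT_CRS
    kind, sep, identifier = crs.partition(":")
    if sep and identifier and kind in KNOWN_TYPES:
        return kind, identifier
    return "opaque", crs
```

For example, `parse_crs("srid:4326")` yields `("srid", "4326")`, while a string such as `"EPSG:4326"` is passed through unchanged because it does not use one of the two listed types.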

###### Coordinate Axis Order

The axis order of the coordinates in the WKB values and bounding boxes stored here
follows the de facto standard for axis order in WKB and is therefore always
(x, y) where x is easting or longitude and y is northing or latitude. This
ordering explicitly overrides the axis order as specified in the CRS.

### Column Statistics

The goal of the column statistics is that for each column, the writer
@@ -303,6 +380,7 @@ message ColumnStatistics {
optional bool hasNull = 10;
optional uint64 bytes_on_disk = 11;
optional CollectionStatistics collection_statistics = 12;
Review comment (Member): Can we handle these two above lines independently because this is irrelevant to the geometry, @wgtmac ?

Reply (Member Author): I've created #22 to address this.

optional GeospatialStatistics geospatial_statistics = 13;
}
```

@@ -397,6 +475,88 @@ message BinaryStatistics {
}
```

Geometry and Geography columns store an optional bounding box and a list of
the geospatial type codes of all values.

**Bounding Box**

A geospatial instance has at least two coordinate dimensions: X and Y for 2D
coordinates of each point. Please note that X is longitude/easting and Y is
latitude/northing. A geospatial instance can optionally have Z and/or M values
associated with each point.

The Z values introduce the third dimension coordinate. Usually they are used to
indicate the height, or elevation.

M values are an opportunity for a geospatial instance to express a fourth
dimension as a coordinate value. These values can be used as a linear reference
value (e.g., highway milepost value), a timestamp, or some other value as defined
by the CRS.

The bounding box is defined by the protobuf message below as a min/max value
pair of coordinates for each axis. Note that X and Y values are always present.
Z and M are omitted for 2D geospatial instances.

For the X values only, xmin may be greater than xmax. In this case, an object
in this bounding box may match if it contains an X such that `x >= xmin` OR
`x <= xmax`. This wraparound occurs only when the corresponding bounding box
crosses the antimeridian line. In geographic terminology, the concepts of `xmin`,
`xmax`, `ymin`, and `ymax` are also known as `westernmost`, `easternmost`,
`southernmost` and `northernmost`, respectively.

For Geography type, X and Y values are restricted to the canonical ranges of
[-180, 180] for X and [-90, 90] for Y.

**Geospatial Types**

A list of geospatial types from all instances in the Geometry or Geography
column, or an empty list if they are not known.

This is borrowed from [geometry_types of GeoParquet][geometry-types] except that
values in the list are [WKB (ISO-variant) integer codes][wkb-integer-code].
The table below shows the most common geospatial types and their codes:

| Type | XY | XYZ | XYM | XYZM |
| :----------------- | :--- | :--- | :--- | :--- |
| Point | 0001 | 1001 | 2001 | 3001 |
| LineString | 0002 | 1002 | 2002 | 3002 |
| Polygon | 0003 | 1003 | 2003 | 3003 |
| MultiPoint | 0004 | 1004 | 2004 | 3004 |
| MultiLineString | 0005 | 1005 | 2005 | 3005 |
| MultiPolygon | 0006 | 1006 | 2006 | 3006 |
| GeometryCollection | 0007 | 1007 | 2007 | 3007 |

In addition, the following rules are applied:
- A list of multiple values indicates that multiple geospatial types are present (e.g. `[0003, 0006]`).
- An empty list explicitly signals that the geospatial types are not known.
- The geospatial types in the list must be unique (e.g. `[0001, 0001]` is not valid).

[geometry-types]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L159
[wkb-integer-code]: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary
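
The codes in the table follow a simple pattern: the thousands digit selects the coordinate dimensions and the remainder selects the geometry type. The Python sketch below decodes a code on that basis and collects the distinct codes for a column. The function names are hypothetical, only the seven types from the table are covered, and sorting the result is an assumption made for deterministic output rather than something the text requires.

```python
from typing import Iterable, List, Tuple

# Geometry types and dimension prefixes taken from the table above.
GEOMETRY_NAMES = {
    1: "Point", 2: "LineString", 3: "Polygon", 4: "MultiPoint",
    5: "MultiLineString", 6: "MultiPolygon", 7: "GeometryCollection",
}
DIMENSIONS = {0: "XY", 1: "XYZ", 2: "XYM", 3: "XYZM"}


def decode_type_code(code: int) -> Tuple[str, str]:
    """Return (geometry type, dimensions) for a WKB (ISO variant) code.

    The leading zeros shown in the table (e.g. 0003) are display padding;
    the stored value is a plain integer such as 3.
    """
    dim, base = divmod(code, 1000)
    if dim not in DIMENSIONS or base not in GEOMETRY_NAMES:
        raise ValueError(f"unsupported geospatial type code: {code}")
    return GEOMETRY_NAMES[base], DIMENSIONS[dim]


def merge_type_codes(codes: Iterable[int]) -> List[int]:
    """Collect the distinct type codes seen in a column (no duplicates)."""
    return sorted(set(codes))
```

A column holding 2D polygons and multipolygons would then report `merge_type_codes([3, 6, 3])`, which is the list `[3, 6]`.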

```protobuf
// Bounding box for Geometry or Geography type in the representation of min/max
// value pair of coordinates from each axis.
message BoundingBox {
optional double xmin = 1;
optional double xmax = 2;
optional double ymin = 3;
optional double ymax = 4;
optional double zmin = 5;
optional double zmax = 6;
optional double mmin = 7;
optional double mmax = 8;
}

// Statistics specific to Geometry or Geography type
message GeospatialStatistics {
// A bounding box of geospatial instances
optional BoundingBox bbox = 1;
// Geospatial type codes of all instances, or an empty list if not known
repeated int32 geospatial_types = 2;
}
```
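
The X wraparound rule from the Bounding Box description can be expressed as a small predicate. The sketch below is illustrative Python, not ORC reader code: `BBox` is a hypothetical stand-in for the `BoundingBox` message restricted to X and Y, and the functions show only the comparison logic a reader might use when deciding whether a point can fall inside a bounding box.

```python
from dataclasses import dataclass


@dataclass
class BBox:
    """Hypothetical 2D stand-in for the BoundingBox message."""
    xmin: float
    xmax: float
    ymin: float
    ymax: float


def x_in_range(x: float, box: BBox) -> bool:
    """Check X against the box, honoring antimeridian wraparound."""
    if box.xmin <= box.xmax:
        return box.xmin <= x <= box.xmax
    # xmin > xmax: the box crosses the antimeridian, so a match is
    # x >= xmin OR x <= xmax.
    return x >= box.xmin or x <= box.xmax


def may_contain_point(box: BBox, x: float, y: float) -> bool:
    """Return True if the point (x, y) lies within the bounding box."""
    return x_in_range(x, box) and box.ymin <= y <= box.ymax
```

With `BBox(xmin=170, xmax=-170, ymin=-10, ymax=10)`, longitudes 175 and -175 both match, while longitude 0 does not.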

### User Metadata

The user can add arbitrary key/value pairs to an ORC file as it is
@@ -1235,6 +1395,21 @@ Encoding | Stream Kind | Optional | Contents
DIRECT | PRESENT | Yes | Boolean RLE
| DIRECT | No | Byte RLE

## Geometry & Geography Columns

Geometry and Geography data are encoded with a PRESENT stream, a DATA stream that records
the WKB-encoded geometry/geography data as binary, and a LENGTH stream that records
the number of bytes per value.

Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Binary contents
| LENGTH | No | Unsigned Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | Binary contents
| LENGTH | No | Unsigned Integer RLE v2
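
To illustrate how the three streams relate, the following Python sketch splits an already-decoded DATA buffer into per-row WKB values using decoded LENGTH and PRESENT values. It assumes that RLE decoding has already happened and that LENGTH holds one entry per non-null value. `wkb_point` and `split_values` are hypothetical helpers for this example, not ORC APIs.

```python
import struct
from typing import List, Optional, Sequence


def wkb_point(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (byte order 1, type code 1)."""
    return struct.pack("<BIdd", 1, 1, x, y)


def split_values(
    data: bytes,
    lengths: Sequence[int],
    present: Optional[Sequence[bool]] = None,
) -> List[Optional[bytes]]:
    """Rebuild per-row values from decoded DATA, LENGTH and PRESENT.

    Assumption: `lengths` has one entry per non-null row, and a missing
    PRESENT stream means every row is non-null.
    """
    if present is None:
        present = [True] * len(lengths)
    values: List[Optional[bytes]] = []
    offset = 0
    remaining = iter(lengths)
    for is_present in present:
        if not is_present:
            values.append(None)  # null row: nothing stored in DATA or LENGTH
            continue
        size = next(remaining)
        values.append(data[offset:offset + size])
        offset += size
    return values
```

Two points written back to back occupy 21 bytes each in DATA, so `split_values(p1 + p2, [21, 21], [True, False, True])` recovers the first point, a null, and the second point.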

# Indexes

## Row Group Index
36 changes: 36 additions & 0 deletions src/main/proto/orc/proto/orc_proto.proto
@@ -84,6 +84,27 @@ message CollectionStatistics {
optional uint64 total_children = 3;
}

// Bounding box for Geometry or Geography type in the representation of min/max
// value pair of coordinates from each axis.
message BoundingBox {
optional double xmin = 1;
optional double xmax = 2;
optional double ymin = 3;
optional double ymax = 4;
optional double zmin = 5;
optional double zmax = 6;
optional double mmin = 7;
optional double mmax = 8;
}

// Statistics specific to Geometry or Geography type
message GeospatialStatistics {
// A bounding box of geospatial instances
optional BoundingBox bbox = 1;
Review comment (Member): Is `bbox` a well-known name? I'm just wondering if we can use `bounding_box` like the other field names.

Reply (Member Author): Yes, `bbox` is a well-known geospatial acronym.

// Geospatial type codes of all instances, or an empty list if not known
repeated int32 geospatial_types = 2;
}

message ColumnStatistics {
optional uint64 number_of_values = 1;
optional IntegerStatistics int_statistics = 2;
@@ -97,6 +118,7 @@ message ColumnStatistics {
optional bool has_null = 10;
optional uint64 bytes_on_disk = 11;
optional CollectionStatistics collection_statistics = 12;
optional GeospatialStatistics geospatial_statistics = 13;
}

message RowIndexEntry {
@@ -216,6 +238,8 @@ message Type {
VARCHAR = 16;
CHAR = 17;
TIMESTAMP_INSTANT = 18;
GEOMETRY = 19;
GEOGRAPHY = 20;
}
optional Kind kind = 1;
repeated uint32 subtypes = 2 [packed=true];
@@ -224,6 +248,18 @@ message Type {
optional uint32 precision = 5;
optional uint32 scale = 6;
repeated StringPair attributes = 7;

// Coordinate Reference System (CRS) for Geometry and Geography types
optional string crs = 8;
// Edge interpolation algorithm for Geography type
enum EdgeInterpolationAlgorithm {
SPHERICAL = 0;
VINCENTY = 1;
THOMAS = 2;
ANDOYER = 3;
KARNEY = 4;
}
optional EdgeInterpolationAlgorithm algorithm = 9;
}

message StripeInformation {