Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Customizing the behavior of extension types is currently cumbersome for users. For example, consider a specialized pretty-printing implementation for a certain type (e.g., formatting JSON).
DataFusion does not currently implement this. Even though we have started to replace DataType with Field, we would still need to pass some kind of extension type registry (github.com/apache/datafusion/issues/18223) through every code path that needs access to the customized printing implementation. The procedure would be to look up the extension type in the registry and then call its pretty-printing implementation.
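To make the registry-based procedure concrete, here is a minimal sketch using only the standard library. `ExtensionTypeRegistry`, `PrettyPrinter`, `JsonPrinter`, and `format_value` are hypothetical names for illustration and are not existing DataFusion or arrow-rs APIs; the field metadata is modeled as a plain `HashMap`, and the extension name is read from the `ARROW:extension:name` metadata key.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Metadata key under which Arrow stores the extension type name.
pub const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

/// Hypothetical trait for a customized printing implementation.
pub trait PrettyPrinter: Send + Sync {
    fn pretty(&self, raw: &str) -> String;
}

/// Hypothetical registry mapping extension names to printers.
#[derive(Default)]
pub struct ExtensionTypeRegistry {
    printers: HashMap<String, Arc<dyn PrettyPrinter>>,
}

impl ExtensionTypeRegistry {
    pub fn register(&mut self, name: &str, printer: Arc<dyn PrettyPrinter>) {
        self.printers.insert(name.to_string(), printer);
    }

    pub fn lookup(&self, name: &str) -> Option<&Arc<dyn PrettyPrinter>> {
        self.printers.get(name)
    }
}

/// Illustrative printer; a real one would actually format the JSON.
pub struct JsonPrinter;

impl PrettyPrinter for JsonPrinter {
    fn pretty(&self, raw: &str) -> String {
        format!("json: {}", raw.trim())
    }
}

/// Every caller that wants custom printing needs the registry in scope.
pub fn format_value(
    registry: &ExtensionTypeRegistry,
    metadata: &HashMap<String, String>,
    raw: &str,
) -> String {
    metadata
        .get(EXTENSION_NAME_KEY)
        .and_then(|name| registry.lookup(name))
        .map(|printer| printer.pretty(raw))
        // Fall back to the raw value for plain or unknown types.
        .unwrap_or_else(|| raw.to_string())
}
```

The `registry` parameter on `format_value` is the cost in question: it has to be threaded through every code path that may need to print a value.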
While this is possible, I am currently exploring an approach that directly associates a dyn DynExtensionType with the Field, thus making it possible to access the pretty-printing implementation without passing a registry around. I think Field would be a good candidate for that as it is currently used to store the metadata.
Before undertaking any significant implementation effort, I think we should have a discussion on how (and if) we want to support such customization options in arrow-rs.
Describe the solution you'd like
I think there are two approaches to improve the situation from the arrow-rs side:
For the DataType in the Field, use a new FieldType enum:
pub enum FieldType {
    Physical(DataType),
    Extension(DataType, Arc<dyn DynExtensionType>),
}
or we add an additional field extension_type with the type Option<Arc<dyn DynExtensionType>>.
The DynExtensionType would have an as_any method that allows users (e.g., DataFusion) to cast to their specific extension type traits. If someone has a better idea that does not rely on down casting, feel free to propose it.
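As a rough illustration of the downcasting idea, the sketch below shows how a downstream crate might recover a concrete extension type through `as_any`. The shape of `DynExtensionType` is an assumption based on the description above, not the prototype's actual definition; `PrettyPrint`, `JsonExtension`, and `pretty_print` are hypothetical names.

```rust
use std::any::Any;
use std::fmt::Debug;
use std::sync::Arc;

/// Assumed shape of the proposed object-safe extension type trait.
pub trait DynExtensionType: Debug + Send + Sync {
    /// Extension name, e.g. "arrow.json".
    fn name(&self) -> &str;
    /// Exposes the concrete type so downstream crates can downcast.
    fn as_any(&self) -> &dyn Any;
}

/// Hypothetical downstream trait (e.g., one DataFusion might define).
pub trait PrettyPrint {
    fn pretty(&self, raw: &str) -> String;
}

#[derive(Debug)]
pub struct JsonExtension;

impl DynExtensionType for JsonExtension {
    fn name(&self) -> &str {
        "arrow.json"
    }

    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl PrettyPrint for JsonExtension {
    fn pretty(&self, raw: &str) -> String {
        // Placeholder formatting; a real implementation would re-indent the JSON.
        format!("json: {}", raw.trim())
    }
}

/// Downstream code tries the concrete types it knows about and falls
/// back to the raw value when the downcast fails.
pub fn pretty_print(ext: &Arc<dyn DynExtensionType>, raw: &str) -> String {
    match ext.as_any().downcast_ref::<JsonExtension>() {
        Some(json) => json.pretty(raw),
        None => raw.to_string(),
    }
}
```

One limitation of this pattern is that `downcast_ref` only reaches concrete types, so the caller has to know each extension struct it wants to support; it cannot directly ask "does this value implement `PrettyPrint`?".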
I've whipped together a rough prototype of what this could look like (the API is not really changed yet):
https://github.com/apache/arrow-rs/compare/main...tobixdev:arrow-rs:crazy-field-experiment?expand=1
Personally, I'd prefer the first solution, but it's a bigger breaking change. It could be enough to provide a storage_type() method that returns the DataType as it is in the current version of arrow.
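A minimal sketch of how the first solution could keep existing callers working through `storage_type()`. The `DataType` enum here is a two-variant stand-in for arrow's real `DataType`, and the `extension_type()` accessor and `JsonExtension` struct are hypothetical additions for illustration only.

```rust
use std::fmt::Debug;
use std::sync::Arc;

/// Stand-in for arrow's DataType, reduced to two variants.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Utf8,
    Int64,
}

/// Assumed minimal shape of the proposed trait.
pub trait DynExtensionType: Debug + Send + Sync {
    fn name(&self) -> &str;
}

#[derive(Debug, Clone)]
pub enum FieldType {
    Physical(DataType),
    Extension(DataType, Arc<dyn DynExtensionType>),
}

impl FieldType {
    /// Returns the physical DataType, which is what kernels operate on.
    pub fn storage_type(&self) -> &DataType {
        match self {
            FieldType::Physical(dt) | FieldType::Extension(dt, _) => dt,
        }
    }

    /// Hypothetical accessor for the attached extension type, if any.
    pub fn extension_type(&self) -> Option<&Arc<dyn DynExtensionType>> {
        match self {
            FieldType::Physical(_) => None,
            FieldType::Extension(_, ext) => Some(ext),
        }
    }
}

#[derive(Debug)]
pub struct JsonExtension;

impl DynExtensionType for JsonExtension {
    fn name(&self) -> &str {
        "arrow.json"
    }
}
```

Code that only cares about the physical layout would call `storage_type()` and never see the extension, while extension-aware code would use `extension_type()` instead of consulting a registry.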
Of course, a registry will still be needed at some point. The pieces of code that instantiate new Fields (e.g., parser) would require access to the registry.
Describe alternatives you've considered
We can also keep these efforts completely in DataFusion. This would require either i) creating something akin to DataFusionField or DataFusionExtensionInformation or ii) passing a registry around and using it to look up the pretty-printing implementation.
Additional context
There has been discussion on using a DataType::ExtensionType(...) enum variant for the same purpose, but AFAIK we decided against it so that arrow kernels can focus on the physical data layout (which makes sense IMO). Still, not needing a registry everywhere is an attractive aspect of that solution, and the Field approach could provide it as well.
Other links:
- API to register behavior for Extension Types datafusion#18223
- Polars seems to pursue a DataType::Extension variant