Adding custom data loader handler support. #309
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Overview
Currently, there is no support for custom data sources. File types not accounted for in feathr will simply throw a file system exceptions, as it defaults to looking for the file locally.
This pr will add support for user provided data sources through the new case class
DataPathHandlerthrough theFeathrClient. This is important particularly in some private environments, where they use privately owned data sources/file types.For the user this change should be backwards compatible, as we only adding additional builder methods, and all the classes with their required parameters are only accessible in feathr's
offlinedomain and its tests.The
class designwill go over the added classes, and theAPI Flowwill discuss how these hooks are passed around, and where they are invoked. TheTestingsection goes over how tests were modified and which tests were added.Class Design
The key case class that enclose the handler functionality,
DataPathHandler, is defined in DataSourceAccessor.scala, and its child classesDataAccessorHandlerandDataLoaderHandlerare defined in DataSourceAccessor.scala and DataLoaderFactory.scala respectively.Both
DataAccessorHandlerandDataLoaderHandlerhave a sharedvalidatePathfunctionality, that performs the same path checking functionality. Perhaps in future, we can refactor this to be cleaner.DataAccessorHandlerhas a custom hookgetAccessorwhich depending on thevalidatePathcondition will return a user provided data accessor.DataLoaderHandlerhas the following custom hooks, which like DataAccessorHandler will only be used ifvalidatePathis true:It is necessary to provide the
DataLoaderHandlerhook interface for users, as creating/writing data frame depends on the data/path type.The interface was copied from SparkIOUtils, which handles a lot of the data writing/reading outside of data loaders.
More details regarding the main access points for file writing/reading is in the following section.
API Flow
For the current scope, the functionality is exposed primarily though
FeathrClient. As such, there was a key investigation on which modules are dependent on type of file path, and tracking those APIs toFeathrClient.The key access points for writing/reading files, while also high enough on call chain to prevent redundant refactoring are the following modules.
For data loaders and data accessors, the following object/call dependency graph was created to track the overall flow for instantiating data loaders and accessors,
For the SparkIOUtils, it was traced separately.
Default Values
For most of the of apis, default values were not chosen, as we don't want to data handlers to accidentally be dropped during the api flow.
Testing
External Integration Testing
Work was done in a downstream library to verify that data handling works.
Modifications to existing Feathr tests
Work was done to modify many unit tests, as although additional data handler parameters were added to a lot of classes, default values were not added.
The reasoning for this is because we don't want opaque parameter dropping. Only top-level components, such as the feathr client, and local test related components had default values appendend, to reduce the refactoring needed.
[WIP] Feathr Regression Testing
Additional work is being done to add regression testing to Feathr.
FAQ
InputLocation?InputLocationclasses as allInputLocation:FileFormat.loadHdfsDataFrame