Skip to content

Conversation

@teh-cmc
Copy link
Member

@teh-cmc teh-cmc commented Dec 1, 2022

An MVP implementation of an Arrow-based datastore that closely follows the design that we've been mulling over the past few weeks.

This is entirely dynamic: insert anything you want, retrieve anything you want!

Consider this step 1 of the usual "1) make it work 2) make it correct 3) make it fast".

@teh-cmc teh-cmc marked this pull request as ready for review December 4, 2022 18:15
@teh-cmc
Copy link
Member Author

teh-cmc commented Dec 4, 2022

Unsurprisingly, actually thoroughly testing the thing surfaced a million critical bugs 🙄: added tests and fixes for all of those.

Still need to add the benchmark suite before calling it a complete skeleton, but already I feel much more confident undrafting this.

@teh-cmc teh-cmc force-pushed the cmc/arrow_store_continued branch from 434741f to f424684 Compare December 5, 2022 10:44
@teh-cmc
Copy link
Member Author

teh-cmc commented Dec 5, 2022

  • Added a very simple integration benchmark suite, akin to the one we have for the current data store.
    The APIs and overall models are different enough that it's kinda impossible to port this 1-to-1 right now, so it results in a kinda awkward bench, but at least it's something we can start iterating from.

  • Exploded the DataFrames returned by query().
    Literally just called .explode(), so it's actually possible to do useful stuff with the results (e.g. tests & benchmarks 🙄)

Copy link
Member

@emilk emilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work 👏👏👏

I found some room for improvement though 😆

/// entity: this/that
/// buckets: [
/// IndexBucket {
/// time range: from -∞ (inclusive) to +9223372036.855s (exlusive)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't look like an accurate time range, given the data below

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I purposely want it to look wrong though: this PR completely ignores bucketing and as such it doesn't pretend to get anything right about it.

@teh-cmc teh-cmc requested a review from emilk December 5, 2022 16:19
@teh-cmc
Copy link
Member Author

teh-cmc commented Dec 5, 2022

Alright, that should hopefully cover it all @emilk!

Well, apart from the bucketing stuff; but as I said above, that's the subject of the next 2 PRs, on which I will jump ASAP :)

@teh-cmc teh-cmc merged commit bd69add into main Dec 5, 2022
@teh-cmc teh-cmc deleted the cmc/arrow_store_continued branch December 5, 2022 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

🏹 arrow Apache Arrow ⛃ re_datastore affects the datastore itself

Projects

None yet

Development

Successfully merging this pull request may close these issues.

re_datastore: do a first pass of ensuring all around correctness re_datastore: basic benchmark suite Out of order Arrow ingestion

4 participants