ARROW-10240: [Rust] Optionally load data into memory before running benchmark query #8409

Closed

jhorstmann wants to merge 4 commits into apache:master from jhorstmann:ARROW-10240-load-data-into-memory-for-tpch

Conversation

@jhorstmann
Contributor

No description provided.

file_format: String,

/// Load the data into a MemTable before executing the query
#[structopt(short = "l", long = "load")]
load: bool,
Contributor Author

There is probably a better/clearer name for this parameter

Member

Perhaps --mem-table ?
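The renamed flag would just switch the boolean field on the options struct. A std-only sketch of detecting it on the command line (the PR itself derives this via structopt; `has_flag` is an illustrative helper, not DataFusion's API):

```rust
// Returns true if the given flag name appears among the arguments.
// Hypothetical helper; the real benchmark binary uses structopt's
// derive macro to populate a `mem_table: bool` field instead.
fn has_flag(args: &[String], name: &str) -> bool {
    args.iter().any(|a| a == name)
}

fn main() {
    let args: Vec<String> = std::env::args().collect();
    let mem_table = has_flag(&args, "--mem-table");
    println!("mem_table: {}", mem_table);
}
```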

@github-actions

github-actions bot commented Oct 9, 2020

Member

@andygrove left a comment

LGTM

};

if opt.load {
let memtable = MemTable::load(tableprovider.as_ref()).await?;
Member

@andygrove Oct 9, 2020

This is just a nit, but it would be nice to have some printlns here showing that the data is loading, and how long it takes.

Contributor Author

That's a very good idea
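The suggested timing printlns can be sketched like this (plain Rust; `load_table` is a hypothetical stand-in for the PR's async `MemTable::load` call, which would be awaited here):

```rust
use std::time::Instant;

// Hypothetical stand-in for the expensive load step; any work that
// materializes the table into memory plays the same role.
fn load_table() -> u64 {
    (0..1_000_000u64).sum()
}

fn main() {
    println!("Loading data into memory");
    let start = Instant::now();
    let table = load_table();
    println!("Loaded data into memory in {} ms", start.elapsed().as_millis());
    // `table` would then be registered with the execution context.
    let _ = table;
}
```

This produces exactly the kind of log lines that show up in the benchmark output below.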

@andygrove
Member

The results are pretty interesting for me.

Without --mem-table:

Running benchmarks with the following options: TpchOpt { query: 1, debug: false, iterations: 3, concurrency: 24, batch_size: 4096, path: "/mnt/tpch/s1/parquet", file_format: "parquet", mem_table: false }
Query 1 iteration 0 took 241 ms
Query 1 iteration 1 took 164 ms
Query 1 iteration 2 took 167 ms

With --mem-table:

Running benchmarks with the following options: TpchOpt { query: 1, debug: false, iterations: 3, concurrency: 24, batch_size: 4096, path: "/mnt/tpch/s1/parquet", file_format: "parquet", mem_table: true }
Loading data into memory
Loaded data into memory in 11240 ms
Query 1 iteration 0 took 353 ms
Query 1 iteration 1 took 302 ms
Query 1 iteration 2 took 322 ms

I filed https://issues.apache.org/jira/browse/ARROW-10251 to fix the single-threaded loading in MemTable but I'm not sure why the actual query time is slower for mem tables than for Parquet.

@jhorstmann
Contributor Author

That's indeed interesting. Could the issue actually be the batch size? It seems the MemTable::scan method ignores the batch size parameter and instead uses the hardcoded size the data was loaded with.
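The hypothesis is that the scan should re-chunk the in-memory data to the caller's requested batch size rather than the load-time size. A minimal sketch of that re-chunking, with `Vec<u64>` standing in for an in-memory table (illustrative names only, not DataFusion's actual API):

```rust
// Split in-memory rows into batches of the size the query requested,
// regardless of how the data was chunked when it was loaded.
fn scan(rows: &[u64], batch_size: usize) -> Vec<Vec<u64>> {
    rows.chunks(batch_size).map(|c| c.to_vec()).collect()
}

fn main() {
    // Data was "loaded" as one run of 10 rows...
    let table: Vec<u64> = (0..10).collect();
    // ...but the query asked for batches of 4.
    let batches = scan(&table, 4);
    println!("{} batches", batches.len()); // 4 + 4 + 2 rows
}
```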

@andygrove
Member

It's looking much better now 🚀

Running benchmarks with the following options: TpchOpt { query: 1, debug: false, iterations: 3, concurrency: 24, batch_size: 4096, path: "/mnt/tpch/s1/parquet", file_format: "parquet", mem_table: true }
Loading data into memory
Loaded data into memory in 4334 ms
Query 1 iteration 0 took 174 ms
Query 1 iteration 1 took 144 ms
Query 1 iteration 2 took 148 ms
