[R] Segfault when collecting parquet dataset query results #41813

@mrd0ll4r

Hello!
I've been using arrow with R for a while now to great success.
Recently, I re-opened an old project (managed with renv, so I'm pretty confident all the package versions are unchanged).
It is possible I upgraded the OS and/or OS packages in the meantime.
Now, some of my queries on a gzip-compressed dataset of parquet files lead to a segfault:

 *** caught segfault ***
address 0x7f54ce520898, cause 'memory not mapped'

Traceback:
 1: Table__from_ExecPlanReader(self)
 2: x$read_table()
 3: as_arrow_table.RecordBatchReader(reader)
 4: as_arrow_table(reader)
 5: as_arrow_table.arrow_dplyr_query(x)
 6: as_arrow_table(x)
 7: doTryCatch(return(expr), name, parentenv, handler)
 8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 9: tryCatchList(expr, classes, parentenv, handlers)
10: tryCatch(as_arrow_table(x), error = function(e, call = caller_env(n = 4)) {    augment_io_error_msg(e, call, schema = schema())})
11: compute.arrow_dplyr_query(x)
12: collect.arrow_dplyr_query(.)
13: collect(.)
14: d_redacted %>% group_by(year, month, cid) %>% summarize(n = n()) %>%     collect()

I have a core dump from that session, but it's 46GB.
The machine has 256GB RAM and another 256GB swap, so I'm confident that running out of memory is not the problem.

I'm not experienced at analyzing core dumps, but this is what gdb shows:

Core was generated by `/usr/lib/R/bin/exec/R'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f612d4ea3b0 in arrow::compute::KeyCompare::CompareBinaryColumnToRow_avx2(bool, unsigned int, unsigned int, unsigned short const*, unsigned int const*, arrow::compute::LightContext*, arrow::compute::KeyColumnArray const&, arrow::compute::RowTableImpl const&, unsigned char*) () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
[Current thread is 1 (Thread 0x7f6093fff640 (LWP 2273813))]
(gdb) bt
#0  0x00007f612d4ea3b0 in arrow::compute::KeyCompare::CompareBinaryColumnToRow_avx2(bool, unsigned int, unsigned int, unsigned short const*, unsigned int const*, arrow::compute::LightContext*, arrow::compute::KeyColumnArray const&, arrow::compute::RowTableImpl const&, unsigned char*) () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#1  0x00007f612d4d7093 in void arrow::compute::KeyCompare::CompareBinaryColumnToRow<true>(unsigned int, unsigned int, unsigned short const*, unsigned int const*, arrow::compute::LightContext*, arrow::compute::KeyColumnArray const&, arrow::compute::RowTableImpl const&, unsigned char*) () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#2  0x00007f612d4d6278 in arrow::compute::KeyCompare::CompareColumnsToRows(unsigned int, unsigned short const*, unsigned int const*, arrow::compute::LightContext*, unsigned int*, unsigned short*, std::vector<arrow::compute::KeyColumnArray, std::allocator<arrow::compute::KeyColumnArray> > const&, arrow::compute::RowTableImpl const&, bool, unsigned char*) ()
   from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#3  0x00007f612d4d896e in ?? () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#4  0x00007f612d3a98e6 in ?? () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#5  0x00007f612d3ab154 in arrow::compute::SwissTable::find(int, unsigned int const*, unsigned char*, unsigned char const*, unsigned int*, arrow::util::TempVectorStack*, std::function<void (int, unsigned short const*, unsigned int const*, unsigned int*, unsigned short*, void*)> const&, void*) const ()
   from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#6  0x00007f612d4df2d0 in ?? () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#7  0x00007f612d4dfb73 in ?? () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#8  0x00007f612cf8da83 in arrow::acero::aggregate::GroupByNode::Merge() () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#9  0x00007f612cf8f8a3 in arrow::acero::aggregate::GroupByNode::OutputResult(bool) ()
   from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#10 0x00007f612cf941f6 in arrow::acero::aggregate::GroupByNode::InputReceived(arrow::acero::ExecNode*, arrow::compute::ExecBatch) ()
   from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#11 0x00007f612cef3f1b in arrow::acero::MapNode::InputReceived(arrow::acero::ExecNode*, arrow::compute::ExecBatch) ()
   from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#12 0x00007f612cf25dd2 in ?? () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#13 0x00007f612cf05a7e in arrow::internal::FnOnce<void ()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture (arrow::Future<arrow::internal::Empty>, std::function<arrow::Status ()>)> >::invoke() ()
   from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#14 0x00007f612d290a9d in ?? () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
#15 0x00007f6136f87253 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#16 0x00007f61396a9ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#17 0x00007f613973b850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
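
Since the top frame is an AVX2-specific kernel (CompareBinaryColumnToRow_avx2), one experiment that might help narrow this down is to cap the SIMD level Arrow uses and re-run the failing query. The following is a minimal sketch, not a confirmed workaround. It assumes that the Arrow C++ environment variable ARROW_USER_SIMD_LEVEL is honored by the R package build in use and that it is set before arrow is loaded; whether the crash then disappears is untested here.

```r
# Assumption: ARROW_USER_SIMD_LEVEL is read when the Arrow C++ library
# initializes, so it is set before library(arrow) is called.
Sys.setenv(ARROW_USER_SIMD_LEVEL = "SSE4_2")

library(arrow)

# Report the SIMD level Arrow is using and the level detected on the CPU,
# to confirm that the cap took effect.
simd <- arrow_info()$runtime_info
print(simd$simd_level)
print(simd$detected_simd_level)
```

If the reported level is below AVX2, re-running the group_by(year, month, cid) query in the same session would show whether the AVX2 code path is involved.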

I've tried:

  • Updating all the dependencies. I'm now at arrow 15.0.1 from RSPM. The crash above is from this version.
  • Re-writing the dataset. The raw data is a bunch of CSV files, which I read, mutate, and write to parquet.
  • Checking whether simple queries (dataset %>% summarize(n=n())) work, which they do.

Specifically, this query works:

d_redacted %>% group_by(year, month) %>% summarize(n=n()) %>% collect()

and this doesn't:

d_redacted %>% group_by(year, month, cid) %>% summarize(n=n()) %>% collect()

The dataset looks like this:

> d_redacted
FileSystemDataset with 1342 Parquet files
peer: string
address: string
asn: string
geolocation: string
cid: string
entry_type: string
date: date32[day]
monitor: string
year: int32
month: int32

It's 3GB on disk, gzip compressed.

Unfortunately, I cannot share the dataset publicly as it contains sensitive information.
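
As a stand-in for the real data, a synthetic dataset with the same grouping columns could be shared instead. The following is a rough sketch under several assumptions that do not come from the real dataset: the row count, the "cid-" key format, the year range, and hive-style partitioning by year and month are all hypothetical. At this small size the query is expected to succeed; n_rows and the key cardinality would have to be scaled up to see whether the crash can be reproduced.

```r
library(arrow)
library(dplyr)

set.seed(1)
n_rows <- 1e5  # hypothetical size; the real dataset is far larger

# Random stand-ins for the sensitive columns; only the grouping keys are modelled.
synthetic <- data.frame(
  cid = paste0("cid-", sample.int(n_rows %/% 2, n_rows, replace = TRUE)),
  year = sample(2021:2023, n_rows, replace = TRUE),
  month = sample(1:12, n_rows, replace = TRUE),
  stringsAsFactors = FALSE
)

# Write gzip-compressed parquet files, partitioned by year and month (assumed layout).
path <- file.path(tempdir(), "synthetic_dataset")
write_dataset(synthetic, path, format = "parquet",
              partitioning = c("year", "month"), compression = "gzip")

# Same shape as the failing query, run against the synthetic data.
d_synth <- open_dataset(path)
res <- d_synth %>%
  group_by(year, month, cid) %>%
  summarize(n = n()) %>%
  collect()
```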

Overall, I'm pretty lost now.
The system is running Ubuntu 22.04, kernel:

5.15.0-101-generic #111-Ubuntu SMP Tue Mar 5 20:16:58 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Hope that helps somehow...

Component(s)

R
