Ability to get "source" / and error stack from an ArrowError to help debugging #2725

@alamb

Description

(Note: I am filing this in arrow-rs rather than datafusion because the same applies to lower-level arrow errors, because we would love to follow the same model in both projects, and because it just came up in the context of refactoring the arrow crate: #2711 (comment))

I also harbor the perhaps unrealistic dream that we can do something reasonable in arrow, show that it works in a real set of projects, and then write or blog about it, using that as a bully pulpit to move some of the Rust error-handling projects along more speedily.

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When developing IOx (which uses DataFusion, which uses Arrow), I get errors like this when something fails:

thread 'influxrpc::read_group::test_grouped_series_set_plan_first' panicked at 'running plans: NotImplemented("Physical plan does not support logical expression selector_first(#h2o.b AS b, #h2o.time)")', query_tests/src/influxrpc/util.rs:16:10
stack backtrace:

This is challenging for several reasons:

  1. I don't know where this error originated (i.e. what module / source file / line number). The source file and line number reported (util.rs in this case) are where the error was detected, not where it originated.
  2. Even if I can find where the error originated (by grepping the source code of arrow and datafusion), I don't know what the call stack was when that code was invoked. In this case I don't know what type of plan was being converted when the error happened.

What I typically do (please don't laugh) to debug such errors is:

  1. Patch my project to use a local checkout of datafusion and/or arrow source code
  2. Change the error site in datafusion and/or arrow from returning an error to panic!
  3. Rerun my test with RUST_BACKTRACE=1 set so I can get a backtrace

While this works, it is annoying, and I suspect it is more than most users will be willing to bear, as they don't already have local checkouts of arrow and datafusion.

What I really want is for errors in Arrow (and then also DataFusion) to provide a trace like what Python provides. Stylistically:

DataFusionError::Anonymous(), make_plan, /my/datafusion_checkout/physical_planner.rs:100 
  DataFusionError::NotImplemented("Physical plan does not support logical expression selector_first(#h2o.b AS b, #h2o.time)"), make_plan,  /my/arrow/checkout/datatypes/schema.rs:42
    ArrowError::NotImplemented("RecordBatch schema doesn't match"), /my/arrow/checkout/datatypes/schema.rs:42

Where the Anonymous is meant to signify a location where ? was used to propagate the error.
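To make the "walk the chain" idea concrete, here is a minimal sketch that uses only the standard library's `std::error::Error::source()`. The `TracedError` type, its fields, and the `render_chain` and `example_chain` helpers are hypothetical names invented for illustration; they are not part of arrow or datafusion.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error that records where it was created and what caused it.
#[derive(Debug)]
struct TracedError {
    message: String,
    file: &'static str,
    line: u32,
    cause: Option<Box<dyn Error + 'static>>,
}

impl fmt::Display for TracedError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}, {}:{}", self.message, self.file, self.line)
    }
}

impl Error for TracedError {
    // Exposing the cause through `source()` is what makes the chain walkable.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.cause.as_deref()
    }
}

/// Render an error and all of its causes, one per line, indenting each level.
fn render_chain(err: &(dyn Error + 'static)) -> String {
    let mut out = String::new();
    let mut current = Some(err);
    let mut depth = 0;
    while let Some(e) = current {
        out.push_str(&"  ".repeat(depth));
        out.push_str(&e.to_string());
        out.push('\n');
        current = e.source();
        depth += 1;
    }
    out
}

/// Build a two-level chain by hand (file names and line numbers are made up).
fn example_chain() -> TracedError {
    let inner = TracedError {
        message: "schema mismatch".to_string(),
        file: "schema.rs",
        line: 42,
        cause: None,
    };
    TracedError {
        message: "planning failed".to_string(),
        file: "physical_planner.rs",
        line: 100,
        cause: Some(Box::new(inner)),
    }
}
```

Calling `render_chain(&example_chain())` produces one line per error in the chain, with each cause indented one level deeper than the error that wraps it.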

I can flesh out these ideas more if anyone is interested.

Items I do not care about

Note I don't want to get into a discussion about providing runtime backtrace support. I would be very happy to only have function names (ideally with line numbers) from arrow / datafusion and any other projects that add the support explicitly. I would also be fine with using a proper backtrace in the implementation, but I don't want this ticket to get bogged down like other RFCs seem to have.

I also would like to avoid requiring every error site to have a different error enum to get this feature (though whatever we do here shouldn't prevent adding new error variants for structured error reporting).

Describe the solution you'd like
What I would like is some way to annotate Arrow errors with the source location they came from, as well as any causing error, and a way to walk the chain. Perhaps we can start with some macros -- here is a crazy idea to start thinking about:

// in source.rs
if do_the_thing() {
  Ok(())
} else {
  arrow_err!(Schema, "The thing errored: {}", 42)
}

Which on error would result in an error like

enum ArrowError {
  SchemaError {
    message: String,
    source_file: Option<&'static str>,
    source_line: Option<&'static str>,
    cause: Option<Box<dyn Error>>,
  },
  // ... other variants
}
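One way the macro idea could look in practice is sketched below. It assumes a cut-down `ArrowError` with a single variant, and it stores the line as `Option<u32>` (the type that `line!()` yields) rather than the `Option<&'static str>` shown above. The `check` function is a hypothetical call site, and none of this is the actual arrow API.

```rust
use std::error::Error;
use std::fmt;

/// Cut-down stand-in for the proposed enum (only one variant, for illustration).
#[derive(Debug)]
enum ArrowError {
    SchemaError {
        message: String,
        source_file: Option<&'static str>,
        // Assumption: a numeric line, because `line!()` expands to a `u32`.
        source_line: Option<u32>,
        cause: Option<Box<dyn Error>>,
    },
}

impl fmt::Display for ArrowError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ArrowError::SchemaError { message, source_file, source_line, .. } => {
                write!(f, "SchemaError: {}", message)?;
                if let (Some(file), Some(line)) = (source_file, source_line) {
                    write!(f, ", {}:{}", file, line)?;
                }
                Ok(())
            }
        }
    }
}

impl Error for ArrowError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            ArrowError::SchemaError { cause, .. } => cause.as_deref(),
        }
    }
}

/// Hypothetical macro: builds an `Err(..)` and records the invocation site.
/// `file!()` and `line!()` report where `arrow_err!` was invoked, not this definition.
macro_rules! arrow_err {
    (Schema, $($arg:tt)*) => {
        Err(ArrowError::SchemaError {
            message: format!($($arg)*),
            source_file: Some(file!()),
            source_line: Some(line!()),
            cause: None,
        })
    };
}

/// Hypothetical call site mirroring the `do_the_thing()` snippet.
fn check(do_the_thing: bool) -> Result<(), ArrowError> {
    if do_the_thing {
        Ok(())
    } else {
        arrow_err!(Schema, "The thing errored: {}", 42)
    }
}
```

Keeping the location fields optional means error sites that have not been converted to the macro can still construct the variant with `None`.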

We could potentially use a similar macro for annotating the result of ? (somehow)
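For annotating the result of `?`, one possible approach (a sketch, not a worked-out proposal) is an extension method marked `#[track_caller]` that records `std::panic::Location::caller()` just before the `?`. The `TracedError` type, the `Trace` trait, and the three functions below are hypothetical names used only for illustration.

```rust
use std::panic::Location;

/// Hypothetical error carrying a trail of propagation sites, innermost first.
#[derive(Debug)]
struct TracedError {
    message: String,
    trail: Vec<&'static Location<'static>>,
}

/// Hypothetical extension trait: call `.trace()` right before `?`.
trait Trace<T> {
    fn trace(self) -> Result<T, TracedError>;
}

impl<T> Trace<T> for Result<T, TracedError> {
    #[track_caller]
    fn trace(self) -> Result<T, TracedError> {
        // Capture the caller's location here, outside the closure, because
        // the closure itself is not `#[track_caller]`.
        let location = Location::caller();
        self.map_err(|mut e| {
            e.trail.push(location);
            e
        })
    }
}

/// Innermost function: the original error site.
fn parse_schema() -> Result<(), TracedError> {
    Err(TracedError {
        message: "schema mismatch".to_string(),
        trail: Vec::new(),
    })
}

/// Middle function: propagates the error and records this location.
fn make_plan() -> Result<(), TracedError> {
    parse_schema().trace()?;
    Ok(())
}

/// Outermost function: propagates again, adding a second location.
fn run_query() -> Result<(), TracedError> {
    make_plan().trace()?;
    Ok(())
}
```

After `run_query()` fails, `trail` holds one `Location` per annotated `?`, which gives the file and line of each propagation point without a runtime backtrace. The cost is that every propagation site has to opt in explicitly.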

Describe alternatives you've considered
We could wait for any of the various Rust RFCs in error handling to stabilize such as https://rust-lang.github.io/rfcs/0201-error-chaining.html or https://rust-lang.github.io/rfcs/2504-fix-error.html.

However, given how long they have been outstanding I am not going to hold my breath.

Additional context
@yahoNanJing and @mingmwang are discussing similar things on apache/datafusion#3410 (comment), I believe

The most recent time I hit this was in https://github.com/influxdata/influxdb_iox/pull/5606

@tustvold @andygrove and I discussed error handling in Arrow as well in #2711 (comment)
