Repro is with Zed commit d64a090.
We started a debate years ago about a difference between SQL and Zed behavior: what count() returns when its input is empty. Start with this data.csv test data.
id,key,val
1,max,hi
2,zap,my
3,patty,name
4,zap,is
5,max,phil
6,gris,these
7,carrot,are
8,thomas,a
9,max,bunch
10,max,of
11,zap,lines
In DuckDB, we'll search and count to check for the presence or absence of a particular field value.
$ duckdb
v0.8.1 6536a77232
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D create table pets as select * from 'data.csv';
D select count(*) from pets where key='max';
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 4 │
└──────────────┘
D select count(*) from pets where key='foo';
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 0 │
└──────────────┘
Compare this with Zed behavior.
$ zq -version
Version: v1.8.1-63-gd64a0909
$ zq 'grep("max") | count()' data.csv
4(uint64)
$ zq 'grep("foo") | count()' data.csv
[no output]
$ echo $?
0
The last time this was debated, opinions differed on whether Zed should adopt SQL's behavior, where count() always returns a scalar (0 in this case), or keep its current behavior of producing no output and leaving the user to interpret the silence.
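To make the two behaviors concrete, here is a minimal Python sketch. It is a toy model and not how DuckDB or Zed is implemented: `sql_count`, `streaming_count`, and `where_key` are hypothetical names, and matching on the `key` column only approximates what `grep` does.

```python
from typing import Iterable, Iterator

# The "key" column of data.csv, in order.
KEYS = ["max", "zap", "patty", "zap", "max", "gris",
        "carrot", "thomas", "max", "max", "zap"]


def where_key(keys: Iterable[str], wanted: str) -> Iterator[str]:
    """Toy filter: pass through only the keys equal to `wanted`."""
    return (k for k in keys if k == wanted)


def sql_count(values: Iterable[str]) -> int:
    """SQL-like: always returns a scalar, 0 when the input is empty."""
    return sum(1 for _ in values)


def streaming_count(values: Iterable[str]) -> Iterator[int]:
    """Dataflow-like: emits a count only if at least one value arrived."""
    n = 0
    for _ in values:
        n += 1
    if n > 0:
        yield n


print(sql_count(where_key(KEYS, "max")))              # 4
print(sql_count(where_key(KEYS, "foo")))              # 0
print(list(streaming_count(where_key(KEYS, "max"))))  # [4]
print(list(streaming_count(where_key(KEYS, "foo"))))  # []
```

The empty list in the last line corresponds to the "[no output]" case above: the downstream consumer cannot tell "zero matches" from "nothing ran" without extra information.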
Having left the topic alone for some time while the project has evolved, here's my updated thinking. I sense we may not be able to settle definitively on one true behavior. Instead, as we've done in other areas, perhaps we can employ Zed's first-class errors to make it possible to test for this condition.
One thought I had: suppose pipeline elements can detect that their upstream dataflow has shut down without ever producing a result (is that called EOF in the code?). In that case the pipeline element (count() here) could raise error(missing), and the user would have the option to invoke quiet() to restore the prior behavior of pure silence when there are no results.
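A rough Python model of this first idea, under stated assumptions: `ErrorValue` is a hypothetical stand-in for a first-class error value, `count_or_missing` is an invented name for a count that reports the empty-input condition, and `quiet` only mimics dropping the missing error. None of this reflects Zed's actual code.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator


@dataclass(frozen=True)
class ErrorValue:
    """Hypothetical stand-in for a first-class error value."""
    payload: Any


MISSING = ErrorValue("missing")


def count_or_missing(values: Iterable[Any]) -> Iterator[Any]:
    """Emit the count, or the missing error if upstream ended with no values."""
    n = 0
    for _ in values:
        n += 1
    yield n if n > 0 else MISSING


def quiet(values: Iterable[Any]) -> Iterator[Any]:
    """Drop the missing error and pass everything else through."""
    for v in values:
        if v != MISSING:
            yield v


print(list(count_or_missing([])))                 # the error is visible
print(list(quiet(count_or_missing([]))))          # silence restored: []
print(list(quiet(count_or_missing(["a", "b"]))))  # [2]
```

The design point is that the empty-input condition becomes an ordinary value in the stream, so the user decides whether to act on it or suppress it.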
Another approach might be to let pipeline elements behave as they currently do but have an operator with the specific role of detecting this condition & raising such an error. Since we already have a pass operator, perhaps this could be a new option on pass. For illustrative purposes I'll call it -mustflow. So if the user wanted the SQL-like behavior it might look like:
$ zq 'grep("foo") | count() | pass -mustflow | yield is_error(this) ? 0 : this' data.csv
0
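The same idea as a toy Python pipeline, purely for illustration. `must_flow` models the hypothetical -mustflow option, `zero_if_error` models the final yield expression, and `ErrorValue` and `streaming_count` are invented helpers. This is a sketch of the proposal, not existing Zed behavior.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator


@dataclass(frozen=True)
class ErrorValue:
    """Hypothetical stand-in for a first-class error value."""
    payload: Any


def streaming_count(values: Iterable[Any]) -> Iterator[int]:
    """Current behavior in this model: emit nothing for empty input."""
    n = 0
    for _ in values:
        n += 1
    if n > 0:
        yield n


def must_flow(values: Iterable[Any]) -> Iterator[Any]:
    """Pass values through; emit an error if upstream ended with none."""
    saw_any = False
    for v in values:
        saw_any = True
        yield v
    if not saw_any:
        yield ErrorValue("missing")


def zero_if_error(values: Iterable[Any]) -> Iterator[Any]:
    """Map any error value to 0, leaving other values untouched."""
    for v in values:
        yield 0 if isinstance(v, ErrorValue) else v


print(list(zero_if_error(must_flow(streaming_count([])))))               # [0]
print(list(zero_if_error(must_flow(streaming_count(["a", "b", "c"])))))  # [3]
```

Compared with having count() raise the error itself, this keeps existing operators unchanged and makes the check opt-in at the point in the pipeline where the user cares about it.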