Return result type info on GET/POST endpoints #367
Conversation
src/frontend/http.rs
Outdated
let output = schema
    .fields
    .iter()
    .map(|f| format!("{}={}", f.name(), f.data_type()))
    .collect::<Vec<String>>()
    .join("; ");
One alternative would be to use serde serialization of the schema (so someone on the other side can get a proper arrow object), though I suspect that may end up being too convoluted.
src/frontend/http.rs
Outdated
int_array_val=List(Field { name: \"item\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }); \
text_array_val=List(Field { name: \"item\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })",
Is there a name for this format of Type(Params)? Just wondering how a client might parse it. It looks almost like JSON but not quite.
I don't think so, it's really only a serialization of a recursive type enum variant DataType::List(Box<Field>).
This however sort of breaks the abstraction, since we serialize only the name of the (root) Field on the left, but on the right the entire (nested) Field struct gets unwrapped.
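For context, the `Type(Params)` text is just Rust's derived `Debug` formatting of a recursive enum variant. A minimal sketch with simplified stand-in types (hypothetical, heavily trimmed; the real arrow-rs `Field` and `DataType` carry more fields) shows where the shape comes from:

```rust
// Simplified stand-ins for arrow-rs's `DataType` and `Field`
// (hypothetical, trimmed down for illustration only).
#[derive(Debug)]
enum DataType {
    Int64,
    Utf8,
    List(Box<Field>),
}

#[derive(Debug)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

// The `Type(Params)` text in the header is just this `Debug` output.
fn debug_string(dt: &DataType) -> String {
    format!("{dt:?}")
}

fn main() {
    let list = DataType::List(Box::new(Field {
        name: "item".to_string(),
        data_type: DataType::Int64,
        nullable: true,
    }));
    // Prints: List(Field { name: "item", data_type: Int64, nullable: true })
    println!("{}", debug_string(&list));
}
```

So the format has no name; it is whatever `#[derive(Debug)]` happens to emit, which is why it looks JSON-adjacent but isn't parseable as JSON.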
Yeah, I guess the more human-readable version would be Int64[]. Would you lose any information if you forced it to be like that? I guess nullable is important.
Related: What would be the type of a scalar column that is nullable? Is it even possible to produce such a query?
We could also serialize it into SQL datatypes (TEXT instead of Utf8, BIGINT[] instead of List(...Int64...)), though I don't remember if we even support lists at the DDL level. There was some code that serializes column types into the database in the arrow-native JSON format, if that didn't go away with the delta-rs migration (EDIT: here it is, Line 41 in a280251).
Something like this (Splitgraph query):
SELECT
"foo",
"baz",
CASE
WHEN baz % 2 = 0 THEN null
WHEN baz % 2 <> 0 THEN 'ok'
END isbaz
FROM
"miles/big-random-csv:7700078ffc7592e8ad029070bcdccaafb03500b5f32be52f7ba73a5b93b95afd"."file_1"
LIMIT 100;

What would be the type of the isbaz column?
I was able to construct such a query on my Seafowl test DB:
curl -i \
-H "Content-Type: application/json" \
https://splitgraph-testing-nomad.seafowl.cloud/q -d@-<<EOF
{"query": "SELECT name, value, CASE when value > 2 THEN null WHEN value < 2 THEN 'ok' END islessthantwo FROM test_cache;"}
EOF
which returns the rows with nullable results as undefined (this is expected):
HTTP/2 200
content-type: application/octet-stream
vary: Content-Type, Origin, X-Seafowl-Query
content-length: 386
date: Thu, 20 Apr 2023 14:54:11 GMT
server: Fly/ad79467d (2023-04-14)
via: 2 fly.io
fly-request-id: 01GYFJH6RQ0132ZX4000EK6RSY-lhr
{"name":"Alpha","value":2.3}
{"name":"Alpha","value":2.4}
{"name":"Beta","value":9.3}
{"name":"Beta","value":4.0}
{"name":"Gamma","value":6.8}
{"islessthantwo":"ok","name":"Gamma","value":1.2}
{"name":"Alpha","value":2.3}
{"name":"Alpha","value":2.4}
{"name":"Beta","value":9.3}
{"name":"Beta","value":4.0}
{"name":"Gamma","value":6.8}
{"islessthantwo":"ok","name":"Gamma","value":1.2}
So what would we expect the column type of islessthantwo to be in the header?
What would be the type of a scalar column that is nullable? Is it even possible to produce such a query?
Nullability is being tracked at the Field level, not the data type level, so in the case of a scalar column that is nullable there would be no info about it (as we'd record only the field name).
There was some code that serializes column types into the database in the arrow-native JSON format if that didn't go away with the delta-rs migration.
That's still there, though it may be too verbose.
We could also serialize it into SQL datatypes (TEXT instead of Utf8, BIGINT[] instead of List(...Int64...))
I think this is the best approach overall, as it also keeps the output implementation agnostic, i.e. SQL in SQL out. I'll try to find a way to make this work.
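A hypothetical sketch of such a converter, using a simplified stand-in enum (the real arrow-rs `DataType` has many more variants and wraps a `Field` inside `List`, and as noted, nullability would still be lost in this representation):

```rust
// Hypothetical sketch: map a (simplified) Arrow data type to a
// SQL-looking type name. Not the real arrow-rs enum; illustration only.
#[derive(Debug)]
enum DataType {
    Int16,
    Int32,
    Int64,
    Utf8,
    List(Box<DataType>),
}

fn to_sql_type(dt: &DataType) -> String {
    match dt {
        DataType::Int16 => "SMALLINT".to_string(),
        DataType::Int32 => "INTEGER".to_string(),
        DataType::Int64 => "BIGINT".to_string(),
        DataType::Utf8 => "TEXT".to_string(),
        // Nested lists recurse, e.g. List(List(Int64)) -> BIGINT[][]
        DataType::List(inner) => format!("{}[]", to_sql_type(inner)),
    }
}

fn main() {
    println!("{}", to_sql_type(&DataType::Utf8)); // TEXT
    println!("{}", to_sql_type(&DataType::List(Box::new(DataType::Int64)))); // BIGINT[]
}
```

For simple scalar types the mapping is a flat lookup; the awkwardness the thread anticipates shows up in the recursive and parameterized variants.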
We could also (which might be overkill) emit a JSONSchema instead of rolling our own custom format, which supports dates and floating point numbers. Could be useful for clients autogenerating parsers downstream.
It would definitely be nice to include nullability in this response header, even if it's technically part of field.nullable rather than field.type.
The implementation looks good to me, but I don't know how browsers/caches will handle it. AFAICT, this is to spec. According to MDN, when […]

So while we're using it as a response header here, I do worry that some browser/proxy implementations might see […] As far as I can see, the spec for […]
The spec does say this, but (ironically) I'm unsure how to parse the English:

Is this saying that parameters do not allow whitespace around the […]? EDIT: I guess whitespace inside the value is okay, based on this example of a parameter: […]
One thing I'd be particularly curious to test is whether Cloudflare caching behavior changes when including the […]
src/frontend/http.rs
Outdated
.unwrap()
.to_str()
.unwrap(),
"application/json; smallint_val=Int16; integer_val=Int32; bigint_val=Int64; char_val=Utf8; \
This header will be a pain to parse by the client:
- need to split the header value on the semicolon and the equals sign
- the parameterized DF types like List(...) or Timestamp(...) are awkward to parse
- I'm worried about how this encodes column names with spaces or Unicode characters (EDIT: or characters like the equals sign or something that's not allowed in headers)
Maybe an alternative could be having a single value in the header like
application/json; seafowl-schema=[JSON encoding of the schema, uriencoded so that it's header-safe]
?
It's not a pain to parse on the client, because you can just .split() on ; to get each parameter, and then for each parameter, split on = to get the (key,value). The part that is a pain to parse is the types themselves, but that comes down to what we actually want to do with them.
However, I do agree that a JSON encoding of the schema would be easier to parse. It still doesn't solve the issue of what to call (how to interpret) the actual types like List(...) though.
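A minimal, std-only sketch of the naive split-based parsing described above (hypothetical helper name; as the next reply points out, a column name containing ';' or '=' would break this):

```rust
// Naive client-side parse of the proposed header: split on ';' to get
// parameters, then on '=' for key/value pairs. This breaks down if a
// column name itself contains ';' or '='.
fn parse_params(content_type: &str) -> Vec<(String, String)> {
    content_type
        .split(';')
        .skip(1) // skip the media type itself, e.g. "application/json"
        .filter_map(|param| {
            let (key, value) = param.trim().split_once('=')?;
            Some((key.to_string(), value.to_string()))
        })
        .collect()
}

fn main() {
    let header = "application/json; smallint_val=Int16; char_val=Utf8";
    for (key, value) in parse_params(header) {
        println!("{key} -> {value}");
    }
}
```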
you can just .split() on ; to get each parameter, and then for each parameter, split on = to get the (key,value)
what if my column name contains a semicolon? 🙃
It still doesn't solve the issue of what to call (how to interpret) the actual types like List(...) though.
The JSON Arrow serialization I suggested (from Line 41 in a280251; Lines 72 to 85 in a280251).
what if my column name contains a semicolon? 🙃
I'm not sure this would be a valid HTTP parameter name, even if it was escaped. :/
EDIT: Actually, I think it's valid as long as the parameter value is surrounded by quotes.
I didn't find any function/crate for converting Arrow types back to SQL-looking types, so I'd have to write a custom converter. In principle this would be fine for simple types, but for nested and recursive types it would get ugly. Plus, we'd still lose the nullability information regardless.
That's why I went ahead with using the arrow schema JSON serialization format now. Note that this is not an official format, it's just something that is used in integration testing, but is consistent. (Also note that I haven't employed url encoding for now.)
The returned schema is very verbose (check out the test assertions), though in theory analytical queries employ a smaller column cardinality (plus we're not "polluting" the actual result output, as the schema info is hidden in the header). This could still pose a problem in terms of header size limits; as per Cloudflare: "Request headers observe a total limit of 32 KB, but each header is limited to 16 KB." Maybe we could make the schema type info in the headers optional, so that for some very large output column cardinality people can disable it.
EDIT: In the example of that large type conversion test, the entire content-type header text is 1.73 KB, which seems comfortably within the size limits (though other clients/services may have different limits). We could also make this type info opt-in (controlled via a query param, some default request header, or some custom one), so that by default we don't do it?
For the record, I've further modified this approach in order to url-encode the schema when sending via content-type header parameter, as discussed offline.
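For illustration, a std-only stand-in for that encoding step (the actual code uses `utf8_percent_encode(&schema_string, NON_ALPHANUMERIC)` from the percent-encoding crate; this sketch approximates that behavior by escaping every non-alphanumeric byte):

```rust
// Minimal stand-in for percent-encoding with a NON_ALPHANUMERIC set:
// every byte that is not an ASCII alphanumeric becomes %XX, which makes
// the JSON schema safe to carry inside a content-type header parameter.
fn percent_encode_non_alphanumeric(input: &str) -> String {
    input
        .bytes()
        .map(|b| {
            if b.is_ascii_alphanumeric() {
                (b as char).to_string()
            } else {
                format!("%{b:02X}")
            }
        })
        .collect()
}

fn main() {
    let schema = r#"{"fields":[{"name":"value","nullable":true}]}"#;
    println!("{}", percent_encode_non_alphanumeric(schema));
}
```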
src/frontend/http.rs
Outdated
int_array_val=List(Field { name: \"item\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }); \
text_array_val=List(Field { name: \"item\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })",
We could also serialize it into SQL datatypes (TEXT instead of Utf8, BIGINT[] instead of List(...Int64...)), though I don't remember if we even support lists at the DDL level. There was some code that serializes column types into the database in the arrow-native JSON format, if that didn't go away with the delta-rs migration (EDIT: here it is, Line 41 in a280251).
I guess how we serialize it depends on what we want to do with the data. It would be nice to use some standardized format, like a serialization of Arrow types, because then we could use (somewhat) standardized parsers if we want to coerce types (e.g. turning a DATETIME into a […]).

I found arrow-js, which has code for coercing values to JavaScript types: https://github.com/apache/arrow/blob/main/js/src/type.ts

We could implement something like that in our client(s) (it's kind of up to the client what it does with it), or maybe somehow use arrow-js itself, but I think the important thing is that we would want the types to match the keys of this enum used by arrow: […]

Can we get those somehow? Is that what the serde serialization effectively does?
It looks like arrow-js can maybe read a record batch from JSON? https://github.com/apache/arrow/blob/7ea2d9867cb90f9d0ca1db9b3bffa2fe8426e8ed/js/src/ipc/reader.ts#L153-L166 which calls […] but the […]

Interesting how it coerces the type string by trying to convert it to a date first, and then a string. This makes me think that it might be best to just use simple SQL types, since the motivating use case for this feature is displaying types to the user. And there is no avoiding that some coercion code needs to specify how to coerce values to special types (e.g. timestamp to […]).
Not sure -- I wouldn't want us to pull in arrow-js (or copypaste code from it) just to map these magic numbers back to enums. I was thinking about using this JSON format. The options we seem to have are: […]
Yeah ofc. I was wondering if those types come from some arrow standard that we're already using. The arrow-js wouldn't be involved on the server side, but it would be nice if they were somehow compatible with some arrow-js code, so that a client using arrow-js could load the HTTP jsonlines response into arrow from JSON and provide it with the schema from the header.
I think returning SQL-looking data types might be the simplest solution, but I worry it's non-standard (ofc SQL is standard, but it's the "-looking" part that concerns me). When loading into arrow in JS, it seems unavoidable that the user needs to provide coercion functions: https://arrow.apache.org/docs/js/#create-a-table-from-javascript-arrays The only exception would be if we were loading a […] 🤷
And we could always support returning both, defaulting to "SQL-looking types," but the user could specify a preference in some request header (maybe […])
Ok, so I found the arrow-js code that can parse a schema from JSON: […]

Based on the fact it's accessing […], it would be cool to support this, because it means the JS client can use […]

AFAICT, the "verbose serialization format" that @gruuya linked is what we would need for that: https://github.com/splitgraph/seafowl/blob/main/src/schema.rs#L72-L85

But it's of course quite verbose (are there length limits on this header?). Meanwhile, for basic use cases like showing the user the type of the column in a shell, the SQL-looking types seem sufficient.
…nfo in the header
…uild

See: splitgraph/seafowl#367

Parse fields from the content-type header (we still aren't munging the arrow schema to our own shape), and get it working with a live build of the new feature.

Important note: the value of the HTTP parameter is an _un_-escaped JSON string inside double quotes, by our convention, so we can just slice the double quote off the beginning and end of the parameter value and then parse it as JSON.

This commit adds some live tests, but they need a server with the new build to work, and they're integration tests anyway (which wouldn't run in CI); they serve primarily as a place to collect the shape of responses for later accurate mocking. It also updates the existing mocked tests to use the new scheme (in particular, the value is not escaped inside the double quotes).
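A small sketch of the quote-slicing convention described in the commit message (hypothetical helper name; a real client would hand the result to a JSON parser):

```rust
// Per the convention above, the header parameter value is un-escaped
// JSON wrapped in double quotes, so slicing one quote off each end
// yields a string ready for JSON parsing.
fn unwrap_quoted(value: &str) -> Option<&str> {
    value.strip_prefix('"')?.strip_suffix('"')
}

fn main() {
    // The inner quotes are NOT escaped, per the convention.
    let raw = r#""{"fields":[{"name":"value"}]}""#;
    // Prints: {"fields":[{"name":"value"}]}
    println!("{}", unwrap_quoted(raw).unwrap());
}
```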
src/frontend/http.rs
Outdated
match HeaderValue::from_str(
    format!("application/json; arrow-schema-escaped={output}").as_str(),
) {
    Ok(value) => value,
    Err(err) => {
        // Seems silly to error out here if the query itself succeeded.
        warn!("Couldn't generate content type header for output schema {output}: {err:?}");
        HeaderValue::from_static("application/json")
    }
}
minor: unwrap_or_else could work here (no idea if you tried it; I haven't tested this code)
Suggested change:

-match HeaderValue::from_str(
-    format!("application/json; arrow-schema-escaped={output}").as_str(),
-) {
-    Ok(value) => value,
-    Err(err) => {
-        // Seems silly to error out here if the query itself succeeded.
-        warn!("Couldn't generate content type header for output schema {output}: {err:?}");
-        HeaderValue::from_static("application/json")
-    }
-}
+HeaderValue::from_str(
+    format!("application/json; arrow-schema-escaped={output}").as_str(),
+).unwrap_or_else(|e| {
+    // Seems silly to error out here if the query itself succeeded.
+    warn!("Couldn't generate content type header for output schema {output}: {e:?}");
+    HeaderValue::from_static("application/json")
+})
src/frontend/http.rs
Outdated
let output = utf8_percent_encode(&schema_string, NON_ALPHANUMERIC);

match HeaderValue::from_str(
    format!("application/json; arrow-schema-escaped={output}").as_str(),
I think we should still call it arrow-schema, even though it's percent encoded.
That's fine with me. @milesrichardson just an fyi that I'm revising the parameter name back to arrow-schema.
Also revise the upload endpoint to be in line with the new multipart request handler in warp. For more details see seanmonstar/warp#1034.
LGTM
This was finalized in splitgraph/seafowl#367
Provide this info via the content-type response header.
Closes #356.