
Return result type info on GET/POST endpoints#367

Merged
gruuya merged 6 commits into main from result-type-info-response-header
Apr 26, 2023
Conversation

@gruuya
Contributor

@gruuya gruuya commented Apr 20, 2023

Provide this info via the content-type response header.

Closes #356.

Comment on lines 107 to 112
let output = schema
.fields
.iter()
.map(|f| format!("{}={}", f.name(), f.data_type()))
.collect::<Vec<String>>()
.join("; ");
Contributor Author


One alternative would be to use serde serialization of the schema (so someone on the other side can get a proper arrow object), though I suspect that may end up being too convoluted.

Comment on lines 1617 to 1618
int_array_val=List(Field { name: \"item\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }); \
text_array_val=List(Field { name: \"item\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })",


Is there a name for this format of Type(Params)? Just wondering how a client might parse it. It looks almost like JSON but not quite.

Contributor Author

@gruuya gruuya Apr 20, 2023


I don't think so, it's really only a serialization of a recursive type enum variant DataType::List(Box<Field>).

This however sort of breaks the abstraction, since we serialize only the name of the (root) Field on the left, but on the right the entire (nested) Field struct gets unwrapped.


Yeah, I guess the more human-readable version would be Int64[]. Would you lose any information if you forced it to be like that? I guess nullable is important.

Related: What would be the type of a scalar column that is nullable? Is it even possible to produce such a query?

Contributor

@mildbyte mildbyte Apr 20, 2023


We could also serialize it into SQL datatypes (TEXT instead of Utf8, BIGINT[] instead of List(...Int64...)), though I don't remember if we even support lists at the DDL level. There was some code that serializes column types into the database in the arrow-native JSON format if that didn't go away with the delta-rs migration (EDIT: here it is,

.map(|f| (f.name().clone(), field_to_json(f).to_string()))
)


Something like this (Splitgraph query):

SELECT
    "foo",
    "baz",
    CASE
        WHEN baz % 2 = 0 THEN null
        WHEN baz % 2 <> 0 THEN 'ok'
    END isbaz
FROM
    "miles/big-random-csv:7700078ffc7592e8ad029070bcdccaafb03500b5f32be52f7ba73a5b93b95afd"."file_1"
LIMIT 100;

What would be the type of isbaz column?


I was able to construct such a query on my Seafowl test DB:

curl -i \
    -H "Content-Type: application/json" \
https://splitgraph-testing-nomad.seafowl.cloud/q -d@-<<EOF
{"query": "SELECT name, value, CASE when value > 2 THEN null WHEN value < 2 THEN 'ok' END islessthantwo FROM test_cache;"}
EOF

which returns the rows with nullable results as undefined (this is expected):

HTTP/2 200 
content-type: application/octet-stream
vary: Content-Type, Origin, X-Seafowl-Query
content-length: 386
date: Thu, 20 Apr 2023 14:54:11 GMT
server: Fly/ad79467d (2023-04-14)
via: 2 fly.io
fly-request-id: 01GYFJH6RQ0132ZX4000EK6RSY-lhr

{"name":"Alpha","value":2.3}
{"name":"Alpha","value":2.4}
{"name":"Beta","value":9.3}
{"name":"Beta","value":4.0}
{"name":"Gamma","value":6.8}
{"islessthantwo":"ok","name":"Gamma","value":1.2}
{"name":"Alpha","value":2.3}
{"name":"Alpha","value":2.4}
{"name":"Beta","value":9.3}
{"name":"Beta","value":4.0}
{"name":"Gamma","value":6.8}
{"islessthantwo":"ok","name":"Gamma","value":1.2}

So what would we expect the column type of islessthantwo to be in the header?

Contributor Author


What would be the type of a scalar column that is nullable? Is it even possible to produce such a query?

Nullability is being tracked at the Field level, not the data type level, so in the case of a scalar column that is nullable there would be no info about it (as we'd record only the field name).

There was some code that serializes column types into the database in the arrow-native JSON format if that didn't go away with the delta-rs migration.

That's still there, though it may be too verbose.

We could also serialize it into SQL datatypes (TEXT instead of Utf8, BIGINT[] instead of List(...Int64...))

I think this is the best approach overall, as it also keeps the output implementation agnostic, i.e. SQL in SQL out. I'll try to find a way to make this work.

Contributor


We could also (which might be overkill) emit a JSONSchema instead of rolling our own custom format, which supports dates and floating point numbers. Could be useful for clients autogenerating parsers downstream.


@milesrichardson milesrichardson Apr 20, 2023


It would definitely be nice to include nullability in this response header, even if it's technically part of field.nullable rather than field.type.

@milesrichardson

The implementation looks good to me, but I don't know how browsers/caches will handle it. AFAICT, this is to spec.

According to MDN, when Content-Type is used as a Response header, there is no restriction on the character set of the parameters. However, MDN does mention that when it's used as a Request header, there are some characters that will cause it not to be a CORS-safelisted request header when they are included:

yes, with the additional restriction that values can't contain a CORS-unsafe request header byte: 0x00-0x1F (except 0x09 (HT)), "():<>?@[\]{}, and 0x7F (DEL).

So while we're using it as a response header here, I do worry that some browser/proxy implementations might see { and } and (incorrectly) interpret it as an invalid header.

As far as I can see, the spec for Parameters does not specify any illegal characters (other than, presumably, the separators = and ;).
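To make the worry concrete, here is a sketch (in TypeScript; the helper name is mine, and the byte set is taken from the spec text quoted above) of the "CORS-unsafe request-header byte" check:

```typescript
// Characters listed in the quoted spec text: " ( ) : < > ? @ [ \ ] { }
const UNSAFE = new Set([...'"():<>?@[\\]{}'].map((c) => c.charCodeAt(0)));

function hasCorsUnsafeByte(value: string): boolean {
  for (const byte of new TextEncoder().encode(value)) {
    // 0x00-0x1F except 0x09 (HT), the listed punctuation, and 0x7F (DEL)
    if ((byte < 0x20 && byte !== 0x09) || byte === 0x7f || UNSAFE.has(byte)) {
      return true;
    }
  }
  return false;
}

// The raw Arrow debug output contains '{' and '}', so it would trip this check:
hasCorsUnsafeByte("int_array_val=List(Field { name: ... })"); // true
hasCorsUnsafeByte("application/json; q=0.5"); // false
```

Again, this restriction formally applies to request headers, not response headers, but a buggy intermediary might apply it anyway.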

@milesrichardson

milesrichardson commented Apr 20, 2023

The spec does say this, but (ironically) I'm unsure how to parse the English:

Note: Parameters do not allow whitespace (not even "bad" whitespace) around the "=" character.

Is this saying that parameters do not allow whitespace around the = character, or that parameters do not allow whitespace anywhere inside their value?

EDIT: I guess whitespace inside the value is okay, based on this example of a parameter:

Accept: audio/*; q=0.2, audio/basic

@milesrichardson

One thing I'd be particularly curious to test is whether Cloudflare caching behavior changes when including the Content-Type header, and if we can still use the trick of appending .csv to make sure it's cached.

.unwrap()
.to_str()
.unwrap(),
"application/json; smallint_val=Int16; integer_val=Int32; bigint_val=Int64; char_val=Utf8; \
Contributor

@mildbyte mildbyte Apr 20, 2023


This header will be a pain for the client to parse:

  • need to split the header value on the semicolon and the equals sign
  • the parameterized DF types like List(...) or Timestamp(...) are awkward to parse
  • I'm worried about how this encodes column names with spaces or Unicode characters (EDIT: or characters like the equals sign or something that's not allowed in headers)

Maybe an alternative could be having a single value in the header like

application/json; seafowl-schema=[JSON encoding of the schema, uriencoded so that it's header-safe]

?


It's not a pain to parse on the client, because you can just .split() on ; to get each parameter, and then for each parameter, split on = to get the (key,value). The part that is a pain to parse is the types themselves, but that comes down to what we actually want to do with them.

However, I do agree that a JSON encoding of the schema would be easier to parse. It still doesn't solve the issue of what to call (how to interpret) the actual types like List(...) though.
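For illustration, that client-side split amounts to something like this sketch (the function name is mine; as the next comment points out, it breaks if a column name contains ';' or '='):

```typescript
function parseTypeParams(contentType: string): Map<string, string> {
  // Drop the media type, then split the remaining parameters on ';'
  const [, ...params] = contentType.split(";").map((p) => p.trim());
  const types = new Map<string, string>();
  for (const param of params) {
    // Split each parameter on the first '=' into (column, type)
    const eq = param.indexOf("=");
    if (eq > 0) types.set(param.slice(0, eq), param.slice(eq + 1));
  }
  return types;
}

parseTypeParams("application/json; smallint_val=Int16; char_val=Utf8");
// Map { "smallint_val" => "Int16", "char_val" => "Utf8" }
```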

Contributor


you can just .split() on ; to get each parameter, and then for each parameter, split on = to get the (key,value)

what if my column name contains a semicolon? 🙃

Contributor

@mildbyte mildbyte Apr 20, 2023


It still doesn't solve the issue of what to call (how to interpret) the actual types like List(...) though.

The JSON Arrow serialization I suggested (from

.map(|f| (f.name().clone(), field_to_json(f).to_string()))
) solves this problem and has a more consistent format of the field's datatype. Let me dig up an example. EDIT: see the test output:

seafowl/src/schema.rs

Lines 72 to 85 in a280251

let cols = sf_schema.to_column_names_types();
assert_eq!(
    cols,
    vec![
        ("col_1".to_string(),
            r#"{"children":[],"name":"col_1","nullable":false,"type":{"name":"date","unit":"MILLISECOND"}}"#.to_string()),
        ("col_2".to_string(),
            r#"{"children":[],"name":"col_2","nullable":false,"type":{"name":"floatingpoint","precision":"DOUBLE"}}"#.to_string()),
        ("col_3".to_string(),
            r#"{"children":[],"name":"col_3","nullable":true,"type":{"name":"bool"}}"#.to_string()),
        ("col_4".to_string(),
            r#"{"children":[{"children":[],"name":"child_0","nullable":true,"type":{"name":"utf8"}}],"name":"col_4","nullable":false,"type":{"name":"list"}}"#.to_string()),
    ]
);


@milesrichardson milesrichardson Apr 20, 2023


what if my column name contains a semicolon? 🙃

I'm not sure this would be a valid HTTP parameter name, even if it was escaped. :/

EDIT: Actually, I think it's valid as long as the parameter value is surrounded by quotes.

Contributor Author

@gruuya gruuya Apr 21, 2023


I didn't find any function/crate for converting Arrow types back to SQL-looking types, so I'd have to write a custom converter. In principle this would be fine for simple types, but for nested and recursive types it would get ugly. Plus, we'd still lose the nullability information regardless.

That's why I went ahead with using the Arrow schema JSON serialization format for now. Note that this is not an official format, just something used in integration testing, but it is consistent. (Also note that I haven't employed URL encoding for now.)

The returned schema is very verbose (check out the test assertions), though in theory analytical queries employ a smaller column cardinality (plus we're not "polluting" the actual result output, as the schema info is hidden in the header). This could still pose a problem in terms of header size limits; as per Cloudflare: request headers observe a total limit of 32 KB, but each header is limited to 16 KB. Maybe we could make the schema type info in the headers optional, so that people can disable it for queries with very large output column cardinality.

EDIT: In the example of that large type conversion test, the entire content-type header text is 1.73 KB, which seems comfortably within the size limits (though other clients/services may have different limits). We could also make this type info opt-in (controlled via a query param, some default request header, or some custom one), so that by default we don't emit it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, I've further modified this approach in order to url-encode the schema when sending via content-type header parameter, as discussed offline.
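As a sketch of what decoding that looks like on a JS client (the header value here is a hand-made example rather than actual Seafowl output, and the parameter is named arrow-schema as settled later in the thread; decodeURIComponent happily decodes the stricter NON_ALPHANUMERIC encoding used on the Rust side):

```typescript
// Hand-made example value: percent-encoded JSON {"name":"col_1"}
const contentType =
  "application/json; arrow-schema=%7B%22name%22%3A%22col%5F1%22%7D";

// Extract the parameter value, percent-decode it, and parse the JSON schema
const match = contentType.match(/arrow-schema=([^;]+)/);
const schema = match ? JSON.parse(decodeURIComponent(match[1])) : null;
// schema is { name: "col_1" }
```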


@milesrichardson

I guess how we serialize it depends what we want to do with the data. It would be nice to use some standardized format, like a serialization of Arrow types, because then we could use (somewhat) standardized parsers if we want to coerce types (e.g. turning a DATETIME into a Date() object in JavaScript). However, if we're returning responses as JSON, and not a Parquet file, then what is the point of parsing it with Arrow?

I found arrow-js, which has code for coercing values to JavaScript types: https://github.com/apache/arrow/blob/main/js/src/type.ts

We could implement something like that in our client(s) (it's kind of up to the client what it does with it), or maybe somehow use arrow-js itself, but I think the important thing is that we would want the types to match the keys of this enum used by arrow:

https://github.com/apache/arrow/blob/main/js/src/enum.ts#L158-L203

Can we get those somehow? Is that what the serde serialization effectively does?

@milesrichardson

It looks like arrow-js can maybe read a record batch from JSON? https://github.com/apache/arrow/blob/7ea2d9867cb90f9d0ca1db9b3bffa2fe8426e8ed/js/src/ipc/reader.ts#L153-L166

which calls

export const isArrowJSON = (x: any): x is ArrowJSONLike => {
    return isObject(x) && isObject(x['schema']);
};

but the schema type is just any, and I'm having trouble following how this works. Based on the arrow-python docs, it defaults to inferring types from values, but you can also pass it an explicit schema: https://arrow.apache.org/docs/python/json.html#automatic-type-inference

Interesting how it coerces type string by trying to convert it to a date first, and then a string.


This makes me think that it might be best to just use simple SQL types, since the motivating use case for this feature is displaying types to the user. And there is no avoiding that some coercion code needs to specify how to coerce values to special types (e.g. timestamp to new Date()), so it doesn't really matter what serialization format the header uses in that case, since the user needs to write some code to define which fields need to be coerced anyway (or a library can try to guess it based on the SQL type).

@mildbyte
Contributor

https://github.com/apache/arrow/blob/main/js/src/enum.ts#L158-L203

Can we get those somehow? Is that what the serde serialization effectively does?

Not sure -- I wouldn't want us to pull in arrow-js (or copypaste code from it) just to map these magic numbers back to enums. I was thinking about using this JSON format.

The options we seem to have are:

  • SQL-looking data types (e.g. BIGINT, TEXT[]...)
  • Arrow field's JSON serialization (like pasted above)
  • JSONSchema
  • Dumping raw Arrow strings (like what this PR does right now)

@milesrichardson

milesrichardson commented Apr 20, 2023

I wouldn't want us to pull in arrow-js (or copypaste code from it)

Yeah ofc. I was wondering if those types come from some arrow standard that we're already using. The arrow-js wouldn't be involved on the server side, but it would be nice if they were somehow compatible with some arrow-js code so that a client using arrow-js could load the HTTP jsonlines response into arrow from JSON and provide it with the schema from the header.

@milesrichardson

I think returning SQL-looking data types might be the simplest solution, but I worry it's non-standard (ofc SQL is standard, but it's the "-looking" part that concerns me).

When loading into arrow in JS, it seems unavoidable that the user needs to provide coercion functions: https://arrow.apache.org/docs/js/#create-a-table-from-javascript-arrays

The only exception would be if we were loading a .arrow file (Arrow IPC format) from a URL. So maybe in the future Seafowl could support returning data in Arrow IPC format. But as long as we're returning JSON, if we want to go back to arrow, it doesn't matter whether the schema is provided in "SQL-looking types" or arrow types, because we need to provide the coercion functions in either case. And it's easier to parse SQL-looking types, so I'd prefer those, I guess.

🤷

@milesrichardson

And we could always support returning both, defaulting to "SQL looking types," but the user could specify a preference in some request header (maybe Content-Type).

@milesrichardson

Ok, so I found the arrow-js code that can parse schema from JSON:

https://github.com/apache/arrow/blob/7ea2d9867cb90f9d0ca1db9b3bffa2fe8426e8ed/js/src/ipc/metadata/json.ts#L144-L206

Based on the fact it's accessing ["type"]["name"], I assume it expects the same serialization format that we see here in the nested types.

It would be cool to support this, because it means the JS client can use arrow-js and doesn't need to write its own parsers for the schema. But I don't intend to add that soon, it would just be a nice-to-have.

AFAICT, the "verbose serialization format" that @gruuya linked is what we would need for that: https://github.com/splitgraph/seafowl/blob/main/src/schema.rs#L72-L85 But it's of course quite verbose (are there length limits on this header?).

Meanwhile, for basic use cases like showing the user the type of the column in a shell, the SQL-looking types seem sufficient.
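For that shell-display use case, a mapping from the Arrow JSON type names (as seen in the schema.rs test output linked above) to SQL-looking names could be as small as this sketch (the function and the exact SQL name choices are mine, not Seafowl's):

```typescript
// Minimal shape of the "type" object in the Arrow JSON schema serialization
type ArrowJsonType = { name: string; bitWidth?: number; precision?: string };

function toSqlLookingType(t: ArrowJsonType): string {
  switch (t.name) {
    case "utf8":
      return "TEXT";
    case "bool":
      return "BOOLEAN";
    case "date":
      return "DATE";
    case "floatingpoint":
      return t.precision === "DOUBLE" ? "DOUBLE PRECISION" : "REAL";
    case "int":
      return t.bitWidth === 64 ? "BIGINT" : "INTEGER";
    default:
      // Fall back to the raw Arrow name for anything unmapped
      return t.name.toUpperCase();
  }
}

toSqlLookingType({ name: "floatingpoint", precision: "DOUBLE" }); // "DOUBLE PRECISION"
toSqlLookingType({ name: "utf8" }); // "TEXT"
```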

milesrichardson added a commit to splitgraph/madatdata that referenced this pull request Apr 24, 2023
…uild

See: splitgraph/seafowl#367

Parse fields from content-type header (we still aren't munging the arrow
schema to our own shape), and get it working with a live build of the
new feature.

Important note: The value of the HTTP parameter is an _un_-escaped JSON
string inside double quotes, by our convention, so we can just slice the
double quote off the beginning and end of the parameter value and then
parse it as JSON.

This commit adds some live tests, but they need a server with the new
build to work, and they're integration tests anyway (which wouldn't run
in CI), but they serve primarily as a place to collect the shape of
responses for later accurate mocking.

It also updates the existing mocked tests to use the new scheme (in particular,
the value is not escaped inside the double quotes).
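The quoted-value convention described in that commit message boils down to something like this sketch (the helper name is mine; this is the pre-percent-encoding scheme, which the PR later revised):

```typescript
// Per the commit message: the parameter value is an un-escaped JSON string
// wrapped in double quotes, so strip one quote from each end and JSON.parse.
function parseQuotedSchemaParam(paramValue: string): unknown {
  return JSON.parse(paramValue.slice(1, -1));
}

const value = '"' + JSON.stringify({ name: "col_1", nullable: false }) + '"';
parseQuotedSchemaParam(value); // { name: "col_1", nullable: false }
```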
Comment on lines 110 to 119
match HeaderValue::from_str(
    format!("application/json; arrow-schema-escaped={output}").as_str(),
) {
    Ok(value) => value,
    Err(err) => {
        // Seems silly to error out here if the query itself succeeded.
        warn!("Couldn't generate content type header for output schema {output}: {err:?}");
        HeaderValue::from_static("application/json")
    }
}
Contributor


minor: unwrap_or_else could work here (no idea if you tried it; I haven't tested this code)

Suggested change
match HeaderValue::from_str(
    format!("application/json; arrow-schema-escaped={output}").as_str(),
) {
    Ok(value) => value,
    Err(err) => {
        // Seems silly to error out here if the query itself succeeded.
        warn!("Couldn't generate content type header for output schema {output}: {err:?}");
        HeaderValue::from_static("application/json")
    }
}
HeaderValue::from_str(
    format!("application/json; arrow-schema-escaped={output}").as_str(),
)
.unwrap_or_else(|e| {
    // Seems silly to error out here if the query itself succeeded.
    warn!("Couldn't generate content type header for output schema {output}: {e:?}");
    HeaderValue::from_static("application/json")
})

let output = utf8_percent_encode(&schema_string, NON_ALPHANUMERIC);

match HeaderValue::from_str(
    format!("application/json; arrow-schema-escaped={output}").as_str(),
Contributor


I think we should still call it arrow-schema, even though it's percent encoded.

Contributor Author


That's fine with me. @milesrichardson just an fyi that I'm revising the parameter name back to arrow-schema.

gruuya added 2 commits April 26, 2023 09:21
Also revise the upload endpoint to be in line with the new multipart request handler in warp.
For more details see seanmonstar/warp#1034.
@milesrichardson

LGTM

@gruuya gruuya merged commit 8f753c3 into main Apr 26, 2023
@gruuya gruuya deleted the result-type-info-response-header branch April 26, 2023 13:57
milesrichardson added a commit to splitgraph/madatdata that referenced this pull request Apr 26, 2023

Development

Successfully merging this pull request may close these issues.

Return result set type information in the HTTP response

3 participants