
Return result type info on GET/POST endpoints#367

Merged
gruuya merged 6 commits into main from result-type-info-response-header
Apr 26, 2023
Conversation

@gruuya
Contributor

@gruuya gruuya commented Apr 20, 2023

Provide this info via the content-type response header.

Closes #356.

Comment on lines 107 to 112
let output = schema
.fields
.iter()
.map(|f| format!("{}={}", f.name(), f.data_type()))
.collect::<Vec<String>>()
.join("; ");
Contributor Author


One alternative would be to use serde serialization of the schema (so someone on the other side can get a proper arrow object), though I suspect that may end up being too convoluted.

Comment on lines 1617 to 1618
int_array_val=List(Field { name: \"item\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }); \
text_array_val=List(Field { name: \"item\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })",


Is there a name for this format of Type(Params)? Just wondering how a client might parse it. It looks almost like JSON but not quite.

Contributor Author

@gruuya gruuya Apr 20, 2023


I don't think so, it's really only a serialization of a recursive type enum variant DataType::List(Box<Field>).

This however sort of breaks the abstraction, since we serialize only the name of the (root) Field on the left, but on the right the entire (nested) Field struct gets unwrapped.


Yeah, I guess the more human-readable version would be Int64[]. Would you lose any information if you forced it to be like that? I guess nullable is important.

Related: What would be the type of a scalar column that is nullable? Is it even possible to produce such a query?

Contributor

@mildbyte mildbyte Apr 20, 2023


We could also serialize it into SQL datatypes (TEXT instead of Utf8, BIGINT[] instead of List(...Int64...)), though I don't remember if we even support lists at the DDL level. There was some code that serializes column types into the database in the arrow-native JSON format if that didn't go away with the delta-rs migration (EDIT: here it is,

.map(|f| (f.name().clone(), field_to_json(f).to_string()))
)


Something like this (Splitgraph query):

SELECT
    "foo",
    "baz",
    CASE
        WHEN baz % 2 = 0 THEN null
        WHEN baz % 2 <> 0 THEN 'ok'
    END isbaz
FROM
    "miles/big-random-csv:7700078ffc7592e8ad029070bcdccaafb03500b5f32be52f7ba73a5b93b95afd"."file_1"
LIMIT 100;

What would be the type of isbaz column?


I was able to construct such a query on my Seafowl test DB:

curl -i \
    -H "Content-Type: application/json" \
https://splitgraph-testing-nomad.seafowl.cloud/q -d@-<<EOF
{"query": "SELECT name, value, CASE when value > 2 THEN null WHEN value < 2 THEN 'ok' END islessthantwo FROM test_cache;"}
EOF

which returns the rows with nullable results as undefined (this is expected):

HTTP/2 200 
content-type: application/octet-stream
vary: Content-Type, Origin, X-Seafowl-Query
content-length: 386
date: Thu, 20 Apr 2023 14:54:11 GMT
server: Fly/ad79467d (2023-04-14)
via: 2 fly.io
fly-request-id: 01GYFJH6RQ0132ZX4000EK6RSY-lhr

{"name":"Alpha","value":2.3}
{"name":"Alpha","value":2.4}
{"name":"Beta","value":9.3}
{"name":"Beta","value":4.0}
{"name":"Gamma","value":6.8}
{"islessthantwo":"ok","name":"Gamma","value":1.2}
{"name":"Alpha","value":2.3}
{"name":"Alpha","value":2.4}
{"name":"Beta","value":9.3}
{"name":"Beta","value":4.0}
{"name":"Gamma","value":6.8}
{"islessthantwo":"ok","name":"Gamma","value":1.2}

So what would we expect the column type of islessthantwo to be in the header?

Contributor Author


What would be the type of a scalar column that is nullable? Is it even possible to produce such a query?

Nullability is being tracked at the Field level, not the data type level, so in the case of a scalar column that is nullable there would be no info about it (as we'd record only the field name).

There was some code that serializes column types into the database in the arrow-native JSON format if that didn't go away with the delta-rs migration.

That's still there, though it may be too verbose.

We could also serialize it into SQL datatypes (TEXT instead of Utf8, BIGINT[] instead of List(...Int64...))

I think this is the best approach overall, as it also keeps the output implementation agnostic, i.e. SQL in SQL out. I'll try to find a way to make this work.

Contributor


We could also (which might be overkill) emit a JSONSchema instead of rolling our own custom format, which supports dates and floating point numbers. Could be useful for clients autogenerating parsers downstream.


@milesrichardson milesrichardson Apr 20, 2023


It would definitely be nice to include nullability in this response header, even if it's technically part of field.nullable rather than field.type.

@milesrichardson

The implementation looks good to me, but I don't know how browsers/caches will handle it. AFAICT, this is to spec.

According to MDN, when Content-Type is used as a Response header, there is no restriction on the character set of the parameters. However, MDN does mention that when it's used as a Request header, there are some characters that will cause it not to be a CORS-safelisted request header when they are included:

yes, with the additional restriction that values can't contain a CORS-unsafe request header byte: 0x00-0x1F (except 0x09 (HT)), "():<>?@[\]{}, and 0x7F (DEL).

So while we're using it as a response header here, I do worry that some browser/proxy implementations might see { and } and (incorrectly) interpret it as an invalid header.

As far as I can see, the spec for Parameters does not specify any illegal characters (other than, presumably, the separators = and ;).
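To make the worry concrete, here is a sketch (in TypeScript; the helper name is mine, and the byte set is taken from the spec text quoted above) of the "CORS-unsafe request-header byte" check:

```typescript
// Characters listed in the quoted spec text: " ( ) : < > ? @ [ \ ] { }
const UNSAFE = new Set([...'"():<>?@[\\]{}'].map((c) => c.charCodeAt(0)));

function hasCorsUnsafeByte(value: string): boolean {
  for (const byte of new TextEncoder().encode(value)) {
    // 0x00-0x1F except 0x09 (HT), the listed punctuation, and 0x7F (DEL)
    if ((byte < 0x20 && byte !== 0x09) || byte === 0x7f || UNSAFE.has(byte)) {
      return true;
    }
  }
  return false;
}

// The raw Arrow debug output contains '{' and '}', so it would trip this check:
hasCorsUnsafeByte("int_array_val=List(Field { name: ... })"); // true
hasCorsUnsafeByte("application/json; q=0.5"); // false
```

Again, this restriction formally applies to request headers, not response headers, but a buggy intermediary might apply it anyway.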

@milesrichardson

milesrichardson commented Apr 20, 2023

The spec does say this, but (ironically) I'm unsure how to parse the English:

Note: Parameters do not allow whitespace (not even "bad" whitespace) around the "=" character.

Is this saying that parameters do not allow whitespace around the = character, or that parameters do not allow whitespace anywhere inside their value?

EDIT: I guess whitespace inside the value is okay, based on this example of a parameter:

Accept: audio/*; q=0.2, audio/basic

@milesrichardson

One thing I'd be particularly curious to test is whether Cloudflare caching behavior changes when including the Content-Type header, and if we can still use the trick of appending .csv to make sure it's cached.

.unwrap()
.to_str()
.unwrap(),
"application/json; smallint_val=Int16; integer_val=Int32; bigint_val=Int64; char_val=Utf8; \
Contributor

@mildbyte mildbyte Apr 20, 2023


This header will be a pain for the client to parse:

  • need to split the header value on the semicolon and the equals sign
  • the parameterized DF types like List(...) or Timestamp(...) are awkward to parse
  • I'm worried about how this encodes column names with spaces or Unicode characters (EDIT: or characters like the equals sign or something that's not allowed in headers)

Maybe an alternative could be having a single value in the header like

application/json; seafowl-schema=[JSON encoding of the schema, uriencoded so that it's header-safe]

?


It's not a pain to parse on the client, because you can just .split() on ; to get each parameter, and then for each parameter, split on = to get the (key,value). The part that is a pain to parse is the types themselves, but that comes down to what we actually want to do with them.

However, I do agree that a JSON encoding of the schema would be easier to parse. It still doesn't solve the issue of what to call (how to interpret) the actual types like List(...) though.
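For illustration, that client-side split amounts to something like this sketch (the function name is mine; as the next comment points out, it breaks if a column name contains ';' or '='):

```typescript
function parseTypeParams(contentType: string): Map<string, string> {
  // Drop the media type, then split the remaining parameters on ';'
  const [, ...params] = contentType.split(";").map((p) => p.trim());
  const types = new Map<string, string>();
  for (const param of params) {
    // Split each parameter on the first '=' into (column, type)
    const eq = param.indexOf("=");
    if (eq > 0) types.set(param.slice(0, eq), param.slice(eq + 1));
  }
  return types;
}

parseTypeParams("application/json; smallint_val=Int16; char_val=Utf8");
// Map { "smallint_val" => "Int16", "char_val" => "Utf8" }
```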

Contributor


you can just .split() on ; to get each parameter, and then for each parameter, split on = to get the (key,value)

what if my column name contains a semicolon? 🙃

Contributor

@mildbyte mildbyte Apr 20, 2023


It still doesn't solve the issue of what to call (how to interpret) the actual types like List(...) though.

The JSON Arrow serialization I suggested (from

.map(|f| (f.name().clone(), field_to_json(f).to_string()))
) solves this problem and has a more consistent format of the field's datatype. Let me dig up an example. EDIT: see the test output:

seafowl/src/schema.rs

Lines 72 to 85 in a280251

let cols = sf_schema.to_column_names_types();
assert_eq!(
    cols,
    vec![
        ("col_1".to_string(),
            r#"{"children":[],"name":"col_1","nullable":false,"type":{"name":"date","unit":"MILLISECOND"}}"#.to_string()),
        ("col_2".to_string(),
            r#"{"children":[],"name":"col_2","nullable":false,"type":{"name":"floatingpoint","precision":"DOUBLE"}}"#.to_string()),
        ("col_3".to_string(),
            r#"{"children":[],"name":"col_3","nullable":true,"type":{"name":"bool"}}"#.to_string()),
        ("col_4".to_string(),
            r#"{"children":[{"children":[],"name":"child_0","nullable":true,"type":{"name":"utf8"}}],"name":"col_4","nullable":false,"type":{"name":"list"}}"#.to_string()),
    ]
);


@milesrichardson milesrichardson Apr 20, 2023


what if my column name contains a semicolon? 🙃

I'm not sure this would be a valid HTTP parameter name, even if it was escaped. :/

EDIT: Actually, I think it's valid as long as the parameter value is surrounded by quotes.

Contributor Author

@gruuya gruuya Apr 21, 2023


I didn't find any function/crate for converting Arrow types back to SQL-looking types, so I'd have to write a custom converter. In principle this would be fine for simple types, but for nested and recursive types it would get ugly. Plus, we'd still lose the nullability information regardless.

That's why I went ahead with using the Arrow schema JSON serialization format for now. Note that this is not an official format, just something used in integration testing, but it is consistent. (Also note that I haven't employed URL encoding for now.)

The returned schema is very verbose (check out the test assertions), though in theory analytical queries employ a smaller column cardinality (plus we're not "polluting" the actual result output, as the schema info is hidden in the header). This could still pose a problem in terms of header size limits; as per Cloudflare: request headers observe a total limit of 32 KB, but each header is limited to 16 KB. Maybe we could make the schema type info in the headers optional, so that people can disable it for queries with very large output column cardinality.

EDIT: In the example of that large type conversion test, the entire content-type header text is 1.73 KB, which seems comfortably within the size limits (though other clients/services may have different limits). We could also make this type info opt-in (controlled via a query param, some default request header, or some custom one), so that by default we don't emit it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, I've further modified this approach in order to url-encode the schema when sending via content-type header parameter, as discussed offline.
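As a sketch of what decoding that looks like on a JS client (the header value here is a hand-made example rather than actual Seafowl output, and the parameter is named arrow-schema as settled later in the thread; decodeURIComponent happily decodes the stricter NON_ALPHANUMERIC encoding used on the Rust side):

```typescript
// Hand-made example value: percent-encoded JSON {"name":"col_1"}
const contentType =
  "application/json; arrow-schema=%7B%22name%22%3A%22col%5F1%22%7D";

// Extract the parameter value, percent-decode it, and parse the JSON schema
const match = contentType.match(/arrow-schema=([^;]+)/);
const schema = match ? JSON.parse(decodeURIComponent(match[1])) : null;
// schema is { name: "col_1" }
```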


@milesrichardson

I guess how we serialize it depends what we want to do with the data. It would be nice to use some standardized format, like a serialization of Arrow types, because then we could use (somewhat) standardized parsers if we want to coerce types (e.g. turning a DATETIME into a Date() object in JavaScript). However, if we're returning responses as JSON, and not a Parquet file, then what is the point of parsing it with Arrow?

I found arrow-js, which has code for coercing values to JavaScript types: https://github.com/apache/arrow/blob/main/js/src/type.ts

We could implement something like that in our client(s) (it's kind of up to the client what it does with it), or maybe somehow use arrow-js itself, but I think the important thing is that we would want the types to match the keys of this enum used by arrow:

https://github.com/apache/arrow/blob/main/js/src/enum.ts#L158-L203

Can we get those somehow? Is that what the serde serialization effectively does?

@milesrichardson

It looks like arrow-js can maybe read a record batch from JSON? https://github.com/apache/arrow/blob/7ea2d9867cb90f9d0ca1db9b3bffa2fe8426e8ed/js/src/ipc/reader.ts#L153-L166

which calls

export const isArrowJSON = (x: any): x is ArrowJSONLike => {
    return isObject(x) && isObject(x['schema']);
};

but the schema type is just any, and I'm having trouble following how this works. Based on the arrow-python docs, it defaults to inferring types from values, but you can also pass it an explicit schema: https://arrow.apache.org/docs/python/json.html#automatic-type-inference

Interesting how it coerces type string by trying to convert it to a date first, and then a string.


This makes me think that it might be best to just use simple SQL types, since the motivating use case for this feature is displaying types to the user. And there is no avoiding that some coercion code needs to specify how to coerce values to special types (e.g. timestamp to new Date()), so it doesn't really matter what serialization format the header uses in that case, since the user needs to write some code to define which fields need to be coerced anyway (or a library can try to guess it based on the SQL type).

@mildbyte
Contributor

https://github.com/apache/arrow/blob/main/js/src/enum.ts#L158-L203

Can we get those somehow? Is that what the serde serialization effectively does?

Not sure -- I wouldn't want us to pull in arrow-js (or copypaste code from it) just to map these magic numbers back to enums. I was thinking about using this JSON format.

The options we seem to have are:

  • SQL-looking data types (e.g. BIGINT, TEXT[]...)
  • Arrow field's JSON serialization (like pasted above)
  • JSONSchema
  • Dumping raw Arrow strings (like what this PR does right now)

@milesrichardson

milesrichardson commented Apr 20, 2023

I wouldn't want us to pull in arrow-js (or copypaste code from it)

Yeah ofc. I was wondering if those types come from some arrow standard that we're already using. The arrow-js wouldn't be involved on the server side, but it would be nice if they were somehow compatible with some arrow-js code so that a client using arrow-js could load the HTTP jsonlines response into arrow from JSON and provide it with the schema from the header.

@milesrichardson

I think returning SQL-looking data types might be the simplest solution, but I worry it's non-standard (ofc SQL is standard, but it's the "-looking" part that concerns me).

When loading into arrow in JS, it seems unavoidable that the user needs to provide coercion functions: https://arrow.apache.org/docs/js/#create-a-table-from-javascript-arrays

The only exception would be if we were loading a .arrow file (Arrow IPC format) from a URL. So maybe in the future Seafowl could support returning data in Arrow IPC format. But as long as we're returning JSON, if we want to go back to arrow, it doesn't matter whether the schema is provided in "SQL-looking types" or arrow types, because we need to provide the coercion functions in either case. And it's easier to parse SQL-looking types, so I'd prefer those, I guess.

🤷

@milesrichardson

And we could always support returning both, defaulting to "SQL looking types," but the user could specify a preference in some request header (maybe Content-Type).

@milesrichardson

Ok, so I found the arrow-js code that can parse schema from JSON:

https://github.com/apache/arrow/blob/7ea2d9867cb90f9d0ca1db9b3bffa2fe8426e8ed/js/src/ipc/metadata/json.ts#L144-L206

Based on the fact it's accessing ["type"]["name"], I assume it expects the same serialization format that we see here in the nested types.

It would be cool to support this, because it means the JS client can use arrow-js and doesn't need to write its own parsers for the schema. But I don't intend to add that soon, it would just be a nice-to-have.

AFAICT, the "verbose serialization format" that @gruuya linked is what we would need for that: https://github.com/splitgraph/seafowl/blob/main/src/schema.rs#L72-L85 But it's of course quite verbose (are there length limits on this header?).

Meanwhile, for basic use cases like showing the user the type of the column in a shell, the SQL-looking types seem sufficient.
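For that shell-display use case, a mapping from the Arrow JSON type names (as seen in the schema.rs test output linked above) to SQL-looking names could be as small as this sketch (the function and the exact SQL name choices are mine, not Seafowl's):

```typescript
// Minimal shape of the "type" object in the Arrow JSON schema serialization
type ArrowJsonType = { name: string; bitWidth?: number; precision?: string };

function toSqlLookingType(t: ArrowJsonType): string {
  switch (t.name) {
    case "utf8":
      return "TEXT";
    case "bool":
      return "BOOLEAN";
    case "date":
      return "DATE";
    case "floatingpoint":
      return t.precision === "DOUBLE" ? "DOUBLE PRECISION" : "REAL";
    case "int":
      return t.bitWidth === 64 ? "BIGINT" : "INTEGER";
    default:
      // Fall back to the raw Arrow name for anything unmapped
      return t.name.toUpperCase();
  }
}

toSqlLookingType({ name: "floatingpoint", precision: "DOUBLE" }); // "DOUBLE PRECISION"
toSqlLookingType({ name: "utf8" }); // "TEXT"
```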

milesrichardson added a commit to splitgraph/madatdata that referenced this pull request Apr 24, 2023
…uild

See: splitgraph/seafowl#367

Parse fields from content-type header (we still aren't munging the arrow
schema to our own shape), and get it working with a live build of the
new feature.

Important note: The value of the HTTP parameter is an _un_-escaped JSON
string inside double quotes, by our convention, so we can just slice the
double quote off the beginning and end of the parameter value and then
parse it as JSON.

This commit adds some live tests, but they need a server with the new
build to work, and they're integration tests anyway (which wouldn't run
in CI), but they serve primarily as a place to collect the shape of
responses for later accurate mocking.

It also updates the existing mocked tests to use the new scheme (in particular,
the value is not escaped inside the double quotes).
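The quoted-value convention described in that commit message boils down to something like this sketch (the helper name is mine; this is the pre-percent-encoding scheme, which the PR later revised):

```typescript
// Per the commit message: the parameter value is an un-escaped JSON string
// wrapped in double quotes, so strip one quote from each end and JSON.parse.
function parseQuotedSchemaParam(paramValue: string): unknown {
  return JSON.parse(paramValue.slice(1, -1));
}

const value = '"' + JSON.stringify({ name: "col_1", nullable: false }) + '"';
parseQuotedSchemaParam(value); // { name: "col_1", nullable: false }
```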
Comment on lines 110 to 119
match HeaderValue::from_str(
    format!("application/json; arrow-schema-escaped={output}").as_str(),
) {
    Ok(value) => value,
    Err(err) => {
        // Seems silly to error out here if the query itself succeeded.
        warn!("Couldn't generate content type header for output schema {output}: {err:?}");
        HeaderValue::from_static("application/json")
    }
}
Contributor


minor: unwrap_or_else could work here (no idea if you tried it; I haven't tested this code)

Suggested change
match HeaderValue::from_str(
    format!("application/json; arrow-schema-escaped={output}").as_str(),
) {
    Ok(value) => value,
    Err(err) => {
        // Seems silly to error out here if the query itself succeeded.
        warn!("Couldn't generate content type header for output schema {output}: {err:?}");
        HeaderValue::from_static("application/json")
    }
}
HeaderValue::from_str(
    format!("application/json; arrow-schema-escaped={output}").as_str(),
)
.unwrap_or_else(|e| {
    // Seems silly to error out here if the query itself succeeded.
    warn!("Couldn't generate content type header for output schema {output}: {e:?}");
    HeaderValue::from_static("application/json")
})

let output = utf8_percent_encode(&schema_string, NON_ALPHANUMERIC);

match HeaderValue::from_str(
    format!("application/json; arrow-schema-escaped={output}").as_str(),
Contributor


I think we should still call it arrow-schema, even though it's percent encoded.

Contributor Author


That's fine with me. @milesrichardson just an fyi that I'm revising the parameter name back to arrow-schema.

gruuya added 2 commits April 26, 2023 09:21
Also revise the upload endpoint to be in line with the new multipart request handler in warp.
For more details see seanmonstar/warp#1034.
@milesrichardson

LGTM

@gruuya gruuya merged commit 8f753c3 into main Apr 26, 2023
@gruuya gruuya deleted the result-type-info-response-header branch April 26, 2023 13:57
milesrichardson added a commit to splitgraph/madatdata that referenced this pull request Apr 26, 2023

Development

Successfully merging this pull request may close these issues.

Return result set type information in the HTTP response

3 participants