-
Notifications
You must be signed in to change notification settings - Fork 23
Fuzz fix, quote invalid UTF-8 values #4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- for parse/write/parse check, change second parse to parse the write of the parsed value instead of the original input encode: - invalidKeyRune: treat utf8.RuneError as invalid rune - add invalidKey, invalidKeyString funcs - checks len(key) == 0, invalidKeyRune, and utf8.Valid - needsQuotedValueRune: if value contains utf8.RuneError quote the value. a literal k=\ufffd encodes to k=\"\\ufffd\" - writeStringValue, writeBytesValue: quote any invalid utf8 string - the above two changes fix the "reserialized data does not match" error found during fuzz testing decode: - reject invalidKey as parse error jsonstring: - remove "&& size == 1" when checking for rune decode error fuzz testing output: - .quoted: "0=\"\xbd\x00\"" - .output: panic: reserialized data does not match: "0=\"\\ufffd\\u0000\"\n" "0=\"�\\u0000\"\n"
|
Fantastic catch! |
encode_test.go
Outdated
| {key: "k", value: "\ufffd\x00", want: "k=\"\\ufffd\\u0000\""}, | ||
| {key: "k", value: "\ufffd", want: "k=\"\\ufffd\""}, | ||
| {key: "k", value: []byte("\ufffd\x00"), want: "k=\"\\ufffd\\u0000\""}, | ||
| {key: "k", value: []byte("\ufffd"), want: "k=\"\\ufffd\""}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improve readability by using back-tick quotes for all of the want values containing quotes and escape sequences.
encode.go
Outdated
| } | ||
|
|
||
| func invalidKeyString(key string) bool { | ||
| return len(key) == 0 || strings.IndexFunc(key, invalidKeyRune) != -1 || !utf8.ValidString(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think calling utf8.ValidString is overkill here. Note that strings.IndexFunc already calls utf8.DecodeRuneInString internally and passes each rune to invalidKeyRune. It would seem that any invalid UTF-8 is already caught by simply adding the r == utf8.RuneError check to invalidKeyRune. This change improves the performance some, but I still measure it as a 25% drop in throughput from before. The rest of the performance drop is simply due to the new calls to invalidKey in the decoder that weren't there before. But we can avoid most of that penalty for most logs if we only check for invalid UTF-8 in keys that contain bytes >= 0x80.
encode.go
Outdated
| if ok && value == "null" { | ||
| _, err = io.WriteString(w, `"null"`) | ||
| } else if strings.IndexFunc(value, needsQuotedValueRune) != -1 { | ||
| } else if strings.IndexFunc(value, needsQuotedValueRune) != -1 || !utf8.ValidString(value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call to utf8.ValidString not needed.
encode.go
Outdated
| func writeBytesValue(w io.Writer, value []byte) error { | ||
| var err error | ||
| if bytes.IndexFunc(value, needsQuotedValueRune) >= 0 { | ||
| if bytes.IndexFunc(value, needsQuotedValueRune) >= 0 || !utf8.Valid(value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call to utf8.Valid not needed.
encode.go
Outdated
| } | ||
|
|
||
| func invalidKey(key []byte) bool { | ||
| return len(key) == 0 || bytes.IndexFunc(key, invalidKeyRune) != -1 || !utf8.Valid(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call to utf8.Valid not needed.
| if invalidKey(dec.key) { | ||
| dec.syntaxError(invalidKeyError) | ||
| return false | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can avoid doing extra work for ASCII only keys.
- Initialize
multibyte := falseoutside the loop. - Add
case c >= utf8.RuneSelf: multibyte = trueat the bottom of the switch. - Change each
if invalidKey(dec.key)toif multibyte && bytes.IndexRune(dec.key, utf8.RuneError) != -1
|
@judwhite Can you re-target this PR to the develop branch? This project is using Git flow now, so we shouldn't update master until we make a new release. |
|
These benchmarks were taken from a different computer. First PR iteration benchmarks (outdated; for comparison): Current Feedback applied benchmarks: |
Summary:
"\xbd\x00"with "reserialized data does not match"; commit message has more detailsutf8.RuneErrorNew tests results with existing code:
Old benchmarks:
New benchmarks: