Add simdutf8 feature to make simdutf8 optional, consolidate check_valid_utf8#6979
Add simdutf8 feature to make simdutf8 optional, consolidate check_valid_utf8#6979alamb merged 7 commits intoapache:mainfrom
simdutf8 feature to make simdutf8 optional, consolidate check_valid_utf8#6979Conversation
parquet/README.md
Outdated
There was a problem hiding this comment.
this was a drive by cleanup b/c I can't help myself now that @etseidl pointed out the capitalization inconsistency with Parquet and parquet
parquet/README.md
Outdated
There was a problem hiding this comment.
this is the new feature
parquet/src/util/utf8.rs
Outdated
There was a problem hiding this comment.
I implemented @etseidl 's suggestion #6668 (comment)
for encapsulation to make the code / use eaiser to understand
There was a problem hiding this comment.
Actually I was thinking something more like
#[cfg(feature = "simdutf8")]
pub fn from_utf8(val: &[u8]) -> Result<&str, simdutf8::compat::Utf8Error> {
match simdutf8::basic::from_utf8(val) {
Ok(result) => Ok(result),
Err(_) => simdutf8::compat::from_utf8(val),
}
}
#[cfg(not(feature = "simdutf8"))]
pub fn from_utf8(val: &[u8]) -> Result<&str, std::str::Utf8Error> {
std::str::from_utf8(val)
}
pub fn check_valid_utf8(val: &[u8]) -> Result<()> {
match from_utf8(val) {
Ok(_) => Ok(()),
Err(e) => Err(general_err!("encountered non UTF-8 data: {}", e)),
}
}Then we could start replacing other uses of std::str::from_utf8 if they're slowing things down. I could run down this rabbit hole after this merges if you think there's any value to doing this.
There was a problem hiding this comment.
Then we could start replacing other uses of std::str::from_utf8 if they're slowing things down. I could run down this rabbit hole after this merges if you think there's any value to doing this.
It seems reasonable to me -- thank you
I looked around in the rest of the parquet code for uses of std::str::from_utf8 and they seemed somewhat limited (but I could be missing something)
There was a problem hiding this comment.
Yes, its use is pretty limited, so I would first have to do some benchmarks to see if this is even worthwhile.
parquet/src/lib.rs
Outdated
There was a problem hiding this comment.
This publically exports the utf8 module as part of the parquet modules, which is needed for adding a doc test.
I was thinking this might be useful for other users (to use the same utf8 validation library) but I can also be convinced to avoid adding this function to the public API of the
7d24c25 to
f038faf
Compare
|
Thank you for the reviews @Dandandan and @etseidl |
…k_valid_utf8` (apache#6979) * Add `simd8tf8` feature * Consolidate check utf8 * Publically doc and export * fmt * Update parquet/src/util/utf8.rs Co-authored-by: Daniël Heres <[email protected]> * enable by default --------- Co-authored-by: Daniël Heres <[email protected]>
Which issue does this PR close?
simdutf8#6668Rationale for this change
Per @Dandandan's comments on #6668 (comment), let's make this package optional to provide an "escape hatch" if anyone downstream hits issues.
I made it a separate PR per @doki23's suggestion #6668 (comment)
What changes are included in this PR?
simdutf8feature, which controls the use ofsimdutf8for utf8 validationAre there any user-facing changes?