Faster i256 Division (2-100x) (#4663)#4672
Conversation
arrow-buffer/src/bigint/div.rs
Outdated
There was a problem hiding this comment.
I debated using uint directly, but this brought in a lot of code and logic that we didn't need, and wouldn't have easily generalised to support the 512-bit division necessary for precision-loss decimal arithmetic.
eef08b0 to
189f6ee
Compare
| /// An array of N + 1 elements | ||
| /// | ||
| /// This is a hack around lack of support for const arithmetic | ||
| struct ArrayPlusOne<T, const N: usize>([T; N], T); |
There was a problem hiding this comment.
This is a hack, but it is a hack I am quite pleased with 😅
There was a problem hiding this comment.
I'm a bit surprised that miri is pleased with it too 🤔
I would suggest adding #[repr(C)] to the struct as the compiler is allowed to reorder fields otherwise. Adding -Z randomize-layout to RUST_FLAGS might expose this issue.
There was a problem hiding this comment.
Good spot, I did not realise Rust would reorder fields with the same alignment
arrow-buffer/src/bigint/div.rs
Outdated
| } | ||
|
|
||
| /// Divide a u128 by a u64 divisor, returning the quotient and remainder | ||
| fn div_rem_word(hi: u64, lo: u64, y: u64) -> (u64, u64) { |
There was a problem hiding this comment.
On x86_64 this gets converted into a single instruction
There was a problem hiding this comment.
Strange, as I couldn't get the compiler to play nice with that function:
There was a problem hiding this comment.
You are quite right - https://stackoverflow.com/questions/62257103/emit-div-instruction-instead-of-udivti3
There may be further room for improvement in that case 😄
There was a problem hiding this comment.
No worries, I was hopeful I was going to learn a new compiler option or feature to enable 😊
There was a problem hiding this comment.
a37aee3 uses some inline assembly to force the correct compilation 😄
Shaves off a further 7 microseconds
viirya
left a comment
There was a problem hiding this comment.
I'll review this in this week.
|
|
||
| // LLVM fails to use the div instruction as it is not able to prove | ||
| // that hi < divisor, and therefore the result will fit into 64-bits | ||
| #[cfg(target_arch = "x86_64")] |
There was a problem hiding this comment.
This improves performance by ~25%, but would be the first use of inline assembly in this project. I personally think it is constrained and limited in scope enough to not be a concern, but defer to the consensus
There was a problem hiding this comment.
Looks okay to me. Just wonder if we want to have assembly for other arch (e.g. AArch64).
There was a problem hiding this comment.
I don't believe aarch64 has narrowing division support, so I don't think we can do better than udivti3 (the compiler built-in LLVM inserts)
There was a problem hiding this comment.
I couldn't find a narrowing division instruction in aarch64 ISA reference
There was a problem hiding this comment.
Yea, looks like udivti3 is what we could have on aarch64.
| ) -> ([u64; N], [u64; N]) { | ||
| let numerator_bits = bits(numerator); | ||
| let divisor_bits = bits(divisor); | ||
| assert_ne!(divisor_bits, 0, "division by zero"); |
There was a problem hiding this comment.
Would be better if returning a Err for this?
There was a problem hiding this comment.
ArrowError isn't defined in this crate, we could return an Option though, but this seemed overkill
arrow-buffer/src/bigint/div.rs
Outdated
| (q, remainder) | ||
| } | ||
|
|
||
| /// Divide a u128 by a u64 divisor, returning the quotient and remainder |
There was a problem hiding this comment.
Maybe add a comment here that this is not for general u128/u64 division.
| fn sub_assign(a: &mut [u64], b: &[u64]) -> bool { | ||
| binop_slice(a, b, u64::overflowing_sub) | ||
| } |
There was a problem hiding this comment.
Hmm, does this work for cases like a = [1, 0, 0] and b = [0, 1, 1]? I got a overflow (true) and a = [1, 18446744073709551615, 18446744073709551614].
There was a problem hiding this comment.
What are you expecting, you are doing 1 - 2^64 - 2^128?
There was a problem hiding this comment.
I don't look where this function is used, but just from its description a -= b, so playing it with fake inputs. The output seems not a -= b? no?
There was a problem hiding this comment.
Oh, you mean a = [1, 0, 0] is 1? Got it.
There was a problem hiding this comment.
Yeah the digits are little endian
| } | ||
| q_hat | ||
| } else { | ||
| u64::MAX |
There was a problem hiding this comment.
Hmm, I compare this implementation to some resources I can find, e.g. https://skanthak.homepage.t-online.de/division.html. Is this else branch for initial overflow check and return the largest possible quotient? If so, seems it is possibly to simply set rem to an impossible value?
There was a problem hiding this comment.
If u_jn is larger than v_n_1, our guess of q_hat would overflow the 64-bit word, which in turn would cause div_rem_word to trap, so we just use u64::MAX as our guess.
I'll try to add some further docs for what is going on here
Which issue does this PR close?
Closes #4663
Relates to #4664
Rationale for this change
The current algorithm for performing integer division is simple but has time complexity linear w.r.t the length of the quotient. This results in very poor performance when performing division by small divisors.
N-digit division is also a precursor to performing precision-loss decimal arithmetic (#4664)
What changes are included in this PR?
Are there any user-facing changes?