-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Native jacobi symbol algorithm #979
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
5a5f7a6 to
042f91b
Compare
robot-dreams
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tACK 042f91b
./tests passed, including the additional checks I suggested.
I'm still brushing up on the divsteps algorithm, and will review this change again once I've worked through the various details / optimizations.
Benchmark results look good on x86_64 (MacBook Pro): native jacobi version is faster than libgmp. However, on arm64 (Mac M1), native jacobi version was slightly slower!
On M1 I also tried ./configure --with-asm=arm --enable-experimental to see if that would speed up the native jacobi version, but attempting to build resulted in a lot of compile errors in field_10x26_arm.s.
MacBook Pro (x86_64)
native jacobi branch
$ ./bench_internal field
...
field_jacobi_var: min 2.03us / avg 2.25us / max 3.85us
field_sqrt: min 6.33us / avg 6.38us / max 6.46us
$ ./bench_internal jacobi
field_jacobi_var: min 2.00us / avg 2.06us / max 2.15us
libgmp branch f24e122:
$ ./bench_internal field
...
field_sqrt: min 6.05us / avg 6.13us / max 6.38us
$ ./bench_internal num
num_jacobi: min 2.44us / avg 2.47us / max 2.53us
Mac M1 (arm64)
native jacobi branch
% ./bench_internal field
...
field_jacobi_var: min 1.54us / avg 1.55us / max 1.56us
field_sqrt: min 4.04us / avg 4.09us / max 4.29us
% ./bench_internal jacobi
field_jacobi_var: min 1.55us / avg 1.74us / max 3.18us
libgmp branch f24e122:
% ./bench_internal field
...
field_sqrt: min 4.04us / avg 4.04us / max 4.05us
% ./bench_internal jacobi
...
num_jacobi: min 1.41us / avg 1.41us / max 1.42us
|
@robot-dreams Always nice to see new contributors in the repo. :) Just curious, are you simply helping or are you interested in getting this merged? If the latter, I'm further curious about your use case. Feel free just ignore my intrusive questions. |
|
@real-or-random Thanks for the reply! I'm simply helping and try to learn about this repo, I don't have a personal use case. |
aced43a to
6e3fcc4
Compare
|
@robot-dreams Thanks for your comments! The failure with asm=arm is expected, as Apple M1 is 64-bit ARM, while the assembly code used here is 32-bit. |
|
Made another change: in VERIFY mode, the upper bound on the number of posdivsteps iterations is reduced to ~750 (which is close to the median of how many iterations are needed for random input mod the field size), so that the uncomputable path is actually exercised in the tests. |
robot-dreams
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 6e3fcc4
I agree with your modifications to the original safegcd for calculating Jacobi symbol instead of modular inverse. I also replicated the logic for the 64-bit version in Python, and checked that the result matches a "slow" Jacobi function for a few randomly generated inputs (see https://gist.github.com/robot-dreams/ceb00162b80384f2ae1913aaa2b35e75).
One potential concern is that with this change, there are now 4 slightly different implementations of a similar algorithm (32/64-bit, inv/jacobi). Do you think there's a nice way to consolidate the logic, at least among a given word size? (If there's no nice way or if it's too much trouble, I'm fine with leaving it as is and relying on having good tests.)
Other questions:
-
What was the motivation for using different approaches to clear the low-order bits of
gbetween the 32 and 64 bit implementations (8-bit lookup table vs 4 and 6-bit tricks from Hacker's Delight)? -
I agree that replacing
f, g = g, -fwithf, g = g, fensures that all matrix entries are always positive. But aside from actually observing that it converges tof = 1in practice, do you have any other intuitive justification for this? (The best I can come up with is that since you're constantly dividinggby 2 and swapping the roles offandg, you're quickly moving towards a solution even if you replaceg - fwithg + fin the original GCD algorithm.) -
Is there already a writeup describing this change? If not, would it make sense to add a brief section to https://github.com/bitcoin-core/secp256k1/blob/master/doc/safegcd_implementation.md? (If you think so, I'd be happy to give that a shot.)
|
@sipa Thanks for responding to the comments!
|
6e3fcc4 to
46eaa95
Compare
Arguably there are 6 implementations (32/64-bit, and const/var/varpos divsteps), and they do share some of the code (in particular the matrix application), but not everything. I don't think there is an obvious way of merging more code without reducing performance, but I also haven't tried very hard.
Just dumb benchmarking. The formula was faster on x86_64 systems we tried, and the table was faster on ARMv7 systems (it's explained in one of the commit messages that introduced it).
I don't, no. I do have a proof that any f>0,g>0 converges to f=g somewhere in a finite number of steps, but the upper bound is something like The difficulty is that there are both steps that take f and g further apart (g /= 2) and steps that bring them closer (g = (f+g)/2), and it's hard to infer tight bounds on how many times one can happen in between two executions of the other. So in short, no; the actual argument why this works is purely empirical; all out of a 2.5 billion random inputs (with the field size modulus) need between 630 and 921 iterations.
I think that would be valuable, yes. I haven't done so yet, but was considering it. Feel free to take a shot at it. |
55ec89c to
902d57e
Compare
|
Included @robot-dreams's update to the writeup. |
robot-dreams
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 5af3d29 aside from comment nits—I checked the range diff against the old commit and it looks good.
902d57e to
53b7ce2
Compare
|
ACK 53b7ce2 |
53b7ce2 to
7e1bbef
Compare
7e1bbef to
0bce5e0
Compare
|
Rebased, and updated to only support coprime inputs in the |
eba1971 to
1c1c202
Compare
|
reACK mod nit |
This introduces variants of the divsteps-based GCD algorithm used for modular inverses to compute Jacobi symbols. Changes compared to the normal vartime divsteps: * Only positive matrices are used, guaranteeing that f and g remain positive. * An additional jac variable is updated to track sign changes during matrix computation. * There is (so far) no proof that this algorithm terminates within reasonable amount of time for every input, but experimentally it appears to almost always need less than 900 iterations. To account for that, only a bounded number of iterations is performed (1500), after which failure is returned. In VERIFY mode a lower iteration count is used to make sure that callers exercise their fallback. * The algorithm converges to f=g=gcd(f0,g0) rather than g=0. To keep this test simple, the end condition is f=1, which won't be reached if started with non-coprime or g=0 inputs. Because of that we only support coprime non-zero inputs.
The implementation calls the secp256k1_modinvNN_jacobi_var code, falling back to computing a square root in the (extremely rare) case it failed converge.
1c1c202 to
ce3cfc7
Compare
|
ACK ce3cfc7 |
real-or-random
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reACK ce3cfc7 diff and writeup is good and I tested every commit
This introduces variants of the vartime divsteps-based GCD algorithm used for modular inverses to compute Jacobi symbols. Changes compared to the normal vartime divsteps:
This code is currently unused, except for tests. I don't aim for it to be merged until there is a need for it, but this demonstrates its feasibility.
In terms of performance:
while with the older (f24e122) libgmp based Jacobi code on the same system: