Skip to content

Conversation

@sipa
Copy link
Contributor

@sipa sipa commented Sep 23, 2021

This introduces variants of the vartime divsteps-based GCD algorithm used for modular inverses to compute Jacobi symbols. Changes compared to the normal vartime divsteps:

  • Only positive matrices are used, guaranteeing that f and g remain positive.
  • An additional jac variable is updated to track sign changes during matrix computation.
  • There is (so far) no proof that this algorithm terminates within reasonable amount of time for every input, but experimentally it appears to almost always need less than 900 iterations. To account for that, only a bounded number of iterations is performed (1500), after which failure is returned. The field logic then falls back to using square roots to determining the result.
  • The algorithm converges to f=g=gcd(f0,g0) rather than g=0. To keep this test simple, the end condition is f=1, which won't be reached if started with g=0. That case is dealt with specially.

This code is currently unused, except for tests. I don't aim for it to be merged until there is a need for it, but this demonstrates its feasibility.

In terms of performance:

field_inverse: min 1.76us / avg 1.76us / max 1.78us
field_inverse_var: min 0.991us / avg 0.993us / max 0.996us
field_jacobi_var: min 1.31us / avg 1.31us / max 1.31us
field_sqrt: min 4.36us / avg 4.37us / max 4.40us

while with the older (f24e122) libgmp based Jacobi code on the same system:

num_jacobi: min 1.53us / avg 1.54us / max 1.55us

@sipa sipa force-pushed the 202109_native_jacobi branch 2 times, most recently from 5a5f7a6 to 042f91b Compare September 23, 2021 23:01
@sipa sipa mentioned this pull request Sep 28, 2021
Copy link
Contributor

@robot-dreams robot-dreams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tACK 042f91b

./tests passed, including the additional checks I suggested.

I'm still brushing up on the divsteps algorithm, and will review this change again once I've worked through the various details / optimizations.


Benchmark results look good on x86_64 (MacBook Pro): native jacobi version is faster than libgmp. However, on arm64 (Mac M1), native jacobi version was slightly slower!

On M1 I also tried ./configure --with-asm=arm --enable-experimental to see if that would speed up the native jacobi version, but attempting to build resulted in a lot of compile errors in field_10x26_arm.s.

MacBook Pro (x86_64)

native jacobi branch

$ ./bench_internal field
...
field_jacobi_var: min 2.03us / avg 2.25us / max 3.85us
field_sqrt: min 6.33us / avg 6.38us / max 6.46us

$ ./bench_internal jacobi
field_jacobi_var: min 2.00us / avg 2.06us / max 2.15us

libgmp branch f24e122:

$ ./bench_internal field
...
field_sqrt: min 6.05us / avg 6.13us / max 6.38us

$ ./bench_internal num
num_jacobi: min 2.44us / avg 2.47us / max 2.53us

Mac M1 (arm64)

native jacobi branch

% ./bench_internal field
...
field_jacobi_var: min 1.54us / avg 1.55us / max 1.56us
field_sqrt: min 4.04us / avg 4.09us / max 4.29us

% ./bench_internal jacobi
field_jacobi_var: min 1.55us / avg 1.74us / max 3.18us

libgmp branch f24e122:

% ./bench_internal field
...
field_sqrt: min 4.04us / avg 4.04us / max 4.05us

% ./bench_internal jacobi
...
num_jacobi: min 1.41us / avg 1.41us / max 1.42us

@real-or-random
Copy link
Contributor

@robot-dreams Always nice to see new contributors in the repo. :) Just curious, are you simply helping or are you interested in getting this merged? If the latter, I'm further curious about your use case. Feel free just ignore my intrusive questions.

@robot-dreams
Copy link
Contributor

@real-or-random Thanks for the reply! I'm simply helping and try to learn about this repo, I don't have a personal use case.

@sipa sipa force-pushed the 202109_native_jacobi branch 5 times, most recently from aced43a to 6e3fcc4 Compare November 9, 2021 22:22
@sipa
Copy link
Contributor Author

sipa commented Nov 9, 2021

@robot-dreams Thanks for your comments!

The failure with asm=arm is expected, as Apple M1 is 64-bit ARM, while the assembly code used here is 32-bit.

@sipa
Copy link
Contributor Author

sipa commented Nov 9, 2021

Made another change: in VERIFY mode, the upper bound on the number of posdivsteps iterations is reduced to ~750 (which is close to the median of how many iterations are needed for random input mod the field size), so that the uncomputable path is actually exercised in the tests.

Copy link
Contributor

@robot-dreams robot-dreams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 6e3fcc4

I agree with your modifications to the original safegcd for calculating Jacobi symbol instead of modular inverse. I also replicated the logic for the 64-bit version in Python, and checked that the result matches a "slow" Jacobi function for a few randomly generated inputs (see https://gist.github.com/robot-dreams/ceb00162b80384f2ae1913aaa2b35e75).

One potential concern is that with this change, there are now 4 slightly different implementations of a similar algorithm (32/64-bit, inv/jacobi). Do you think there's a nice way to consolidate the logic, at least among a given word size? (If there's no nice way or if it's too much trouble, I'm fine with leaving it as is and relying on having good tests.)

Other questions:

  • What was the motivation for using different approaches to clear the low-order bits of g between the 32 and 64 bit implementations (8-bit lookup table vs 4 and 6-bit tricks from Hacker's Delight)?

  • I agree that replacing f, g = g, -f with f, g = g, f ensures that all matrix entries are always positive. But aside from actually observing that it converges to f = 1 in practice, do you have any other intuitive justification for this? (The best I can come up with is that since you're constantly dividing g by 2 and swapping the roles of f and g, you're quickly moving towards a solution even if you replace g - f with g + f in the original GCD algorithm.)

  • Is there already a writeup describing this change? If not, would it make sense to add a brief section to https://github.com/bitcoin-core/secp256k1/blob/master/doc/safegcd_implementation.md? (If you think so, I'd be happy to give that a shot.)

@robot-dreams
Copy link
Contributor

@sipa Thanks for responding to the comments!

  • Yes, 32-bit makes sense (in retrospect the fact that the file was called field_10x26 instead of field_5x52 should have given that away)
  • Great idea to reduce the number of iterations for validation!

@sipa sipa force-pushed the 202109_native_jacobi branch from 6e3fcc4 to 46eaa95 Compare November 10, 2021 16:15
@sipa
Copy link
Contributor Author

sipa commented Nov 10, 2021

One potential concern is that with this change, there are now 4 slightly different implementations of a similar algorithm (32/64-bit, inv/jacobi). Do you think there's a nice way to consolidate the logic, at least among a given word size? (If there's no nice way or if it's too much trouble, I'm fine with leaving it as is and relying on having good tests.)

Arguably there are 6 implementations (32/64-bit, and const/var/varpos divsteps), and they do share some of the code (in particular the matrix application), but not everything. I don't think there is an obvious way of merging more code without reducing performance, but I also haven't tried very hard.

What was the motivation for using different approaches to clear the low-order bits of g between the 32 and 64 bit implementations (8-bit lookup table vs 4 and 6-bit tricks from Hacker's Delight)?

Just dumb benchmarking. The formula was faster on x86_64 systems we tried, and the table was faster on ARMv7 systems (it's explained in one of the commit messages that introduced it).

I agree that replacing f, g = g, -f with f, g = g, f ensures that all matrix entries are always positive. But aside from actually observing that it converges to f = 1 in practice, do you have any other intuitive justification for this? (The best I can come up with is that since you're constantly dividing g by 2 and swapping the roles of f and g, you're quickly moving towards a solution even if you replace g - f with g + f in the original GCD algorithm.)

I don't, no. I do have a proof that any f>0,g>0 converges to f=g somewhere in a finite number of steps, but the upper bound is something like f*g*log2(f*g), which is obviously useless for practical purposes. The technique explained in https://github.com/sipa/safegcd-bounds completely fails for posdivsteps (doesn't converge at all). With a lot of tricks I got it to converge, but in what appears quadratic (O(log2(f)^2)), and it's only tractable to evaluate for f up to 2^20 or so.

The difficulty is that there are both steps that take f and g further apart (g /= 2) and steps that bring them closer (g = (f+g)/2), and it's hard to infer tight bounds on how many times one can happen in between two executions of the other.

So in short, no; the actual argument why this works is purely empirical; all out of a 2.5 billion random inputs (with the field size modulus) need between 630 and 921 iterations.

Is there already a writeup describing this change? If not, would it make sense to add a brief section to https://github.com/bitcoin-core/secp256k1/blob/master/doc/safegcd_implementation.md? (If you think so, I'd be happy to give that a shot.)

I think that would be valuable, yes. I haven't done so yet, but was considering it. Feel free to take a shot at it.

@sipa sipa force-pushed the 202109_native_jacobi branch from 55ec89c to 902d57e Compare November 12, 2021 16:37
@sipa
Copy link
Contributor Author

sipa commented Nov 12, 2021

Included @robot-dreams's update to the writeup.

Copy link
Contributor

@robot-dreams robot-dreams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 5af3d29 aside from comment nits—I checked the range diff against the old commit and it looks good.

@sipa sipa force-pushed the 202109_native_jacobi branch from 902d57e to 53b7ce2 Compare November 16, 2021 20:28
@robot-dreams
Copy link
Contributor

ACK 53b7ce2

@sipa
Copy link
Contributor Author

sipa commented Feb 27, 2023

Rebased, and updated to only support coprime inputs in the secp256k1_secp256k1_jacobiXX_maybe_var functions. The return value of those functions for "unknown/timeout" has been changed from -2 to 0.

@sipa sipa force-pushed the 202109_native_jacobi branch 2 times, most recently from eba1971 to 1c1c202 Compare February 28, 2023 16:23
@jonasnick
Copy link
Contributor

reACK mod nit

sipa and others added 3 commits February 28, 2023 15:54
This introduces variants of the divsteps-based GCD algorithm used for
modular inverses to compute Jacobi symbols. Changes compared to
the normal vartime divsteps:
* Only positive matrices are used, guaranteeing that f and g remain
  positive.
* An additional jac variable is updated to track sign changes during
  matrix computation.
* There is (so far) no proof that this algorithm terminates within
  reasonable amount of time for every input, but experimentally it
  appears to almost always need less than 900 iterations. To account
  for that, only a bounded number of iterations is performed (1500),
  after which failure is returned. In VERIFY mode a lower iteration
  count is used to make sure that callers exercise their fallback.
* The algorithm converges to f=g=gcd(f0,g0) rather than g=0. To keep
  this test simple, the end condition is f=1, which won't be reached
  if started with non-coprime or g=0 inputs. Because of that we only
  support coprime non-zero inputs.
The implementation calls the secp256k1_modinvNN_jacobi_var code, falling back
to computing a square root in the (extremely rare) case it failed converge.
@sipa sipa force-pushed the 202109_native_jacobi branch from 1c1c202 to ce3cfc7 Compare February 28, 2023 20:57
@jonasnick
Copy link
Contributor

ACK ce3cfc7

Copy link
Contributor

@real-or-random real-or-random left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reACK ce3cfc7 diff and writeup is good and I tested every commit

@real-or-random real-or-random merged commit 09b1d46 into bitcoin-core:master Mar 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants