Native jacobi symbol algorithm #979

sipa · 2021-09-23T22:54:02Z

This introduces variants of the vartime divsteps-based GCD algorithm used for modular inverses to compute Jacobi symbols. Changes compared to the normal vartime divsteps:

Only positive matrices are used, guaranteeing that f and g remain positive.
An additional jac variable is updated to track sign changes during matrix computation.
There is (so far) no proof that this algorithm terminates within reasonable amount of time for every input, but experimentally it appears to almost always need less than 900 iterations. To account for that, only a bounded number of iterations is performed (1500), after which failure is returned. The field logic then falls back to using square roots to determining the result.
The algorithm converges to f=g=gcd(f0,g0) rather than g=0. To keep this test simple, the end condition is f=1, which won't be reached if started with g=0. That case is dealt with specially.

This code is currently unused, except for tests. I don't aim for it to be merged until there is a need for it, but this demonstrates its feasibility.

In terms of performance:

field_inverse: min 1.76us / avg 1.76us / max 1.78us
field_inverse_var: min 0.991us / avg 0.993us / max 0.996us
field_jacobi_var: min 1.31us / avg 1.31us / max 1.31us
field_sqrt: min 4.36us / avg 4.37us / max 4.40us

while with the older (f24e122) libgmp based Jacobi code on the same system:

num_jacobi: min 1.53us / avg 1.54us / max 1.55us

robot-dreams

tACK 042f91b

./tests passed, including the additional checks I suggested.

I'm still brushing up on the divsteps algorithm, and will review this change again once I've worked through the various details / optimizations.

Benchmark results look good on x86_64 (MacBook Pro): native jacobi version is faster than libgmp. However, on arm64 (Mac M1), native jacobi version was slightly slower!

On M1 I also tried ./configure --with-asm=arm --enable-experimental to see if that would speed up the native jacobi version, but attempting to build resulted in a lot of compile errors in field_10x26_arm.s.

MacBook Pro (x86_64)

native jacobi branch

$ ./bench_internal field
...
field_jacobi_var: min 2.03us / avg 2.25us / max 3.85us
field_sqrt: min 6.33us / avg 6.38us / max 6.46us

$ ./bench_internal jacobi
field_jacobi_var: min 2.00us / avg 2.06us / max 2.15us

libgmp branch f24e122:

$ ./bench_internal field
...
field_sqrt: min 6.05us / avg 6.13us / max 6.38us

$ ./bench_internal num
num_jacobi: min 2.44us / avg 2.47us / max 2.53us

Mac M1 (arm64)

native jacobi branch

% ./bench_internal field
...
field_jacobi_var: min 1.54us / avg 1.55us / max 1.56us
field_sqrt: min 4.04us / avg 4.09us / max 4.29us

% ./bench_internal jacobi
field_jacobi_var: min 1.55us / avg 1.74us / max 3.18us

libgmp branch f24e122:

% ./bench_internal field
...
field_sqrt: min 4.04us / avg 4.04us / max 4.05us

% ./bench_internal jacobi
...
num_jacobi: min 1.41us / avg 1.41us / max 1.42us

src/modinv32.h

src/tests.c

src/field_5x52_impl.h

real-or-random · 2021-11-09T08:09:10Z

@robot-dreams Always nice to see new contributors in the repo. :) Just curious, are you simply helping or are you interested in getting this merged? If the latter, I'm further curious about your use case. Feel free just ignore my intrusive questions.

robot-dreams · 2021-11-09T17:25:25Z

@real-or-random Thanks for the reply! I'm simply helping and try to learn about this repo, I don't have a personal use case.

sipa · 2021-11-09T22:24:06Z

@robot-dreams Thanks for your comments!

The failure with asm=arm is expected, as Apple M1 is 64-bit ARM, while the assembly code used here is 32-bit.

sipa · 2021-11-09T22:57:35Z

Made another change: in VERIFY mode, the upper bound on the number of posdivsteps iterations is reduced to ~750 (which is close to the median of how many iterations are needed for random input mod the field size), so that the uncomputable path is actually exercised in the tests.

robot-dreams

ACK 6e3fcc4

I agree with your modifications to the original safegcd for calculating Jacobi symbol instead of modular inverse. I also replicated the logic for the 64-bit version in Python, and checked that the result matches a "slow" Jacobi function for a few randomly generated inputs (see https://gist.github.com/robot-dreams/ceb00162b80384f2ae1913aaa2b35e75).

One potential concern is that with this change, there are now 4 slightly different implementations of a similar algorithm (32/64-bit, inv/jacobi). Do you think there's a nice way to consolidate the logic, at least among a given word size? (If there's no nice way or if it's too much trouble, I'm fine with leaving it as is and relying on having good tests.)

Other questions:

What was the motivation for using different approaches to clear the low-order bits of g between the 32 and 64 bit implementations (8-bit lookup table vs 4 and 6-bit tricks from Hacker's Delight)?
I agree that replacing f, g = g, -f with f, g = g, f ensures that all matrix entries are always positive. But aside from actually observing that it converges to f = 1 in practice, do you have any other intuitive justification for this? (The best I can come up with is that since you're constantly dividing g by 2 and swapping the roles of f and g, you're quickly moving towards a solution even if you replace g - f with g + f in the original GCD algorithm.)
Is there already a writeup describing this change? If not, would it make sense to add a brief section to https://github.com/bitcoin-core/secp256k1/blob/master/doc/safegcd_implementation.md? (If you think so, I'd be happy to give that a shot.)

src/modinv64_impl.h

src/modinv32_impl.h

src/modinv64_impl.h

src/modinv32_impl.h

src/modinv64_impl.h

robot-dreams · 2021-11-10T04:57:53Z

@sipa Thanks for responding to the comments!

Yes, 32-bit makes sense (in retrospect the fact that the file was called field_10x26 instead of field_5x52 should have given that away)
Great idea to reduce the number of iterations for validation!

sipa · 2021-11-10T16:28:27Z

One potential concern is that with this change, there are now 4 slightly different implementations of a similar algorithm (32/64-bit, inv/jacobi). Do you think there's a nice way to consolidate the logic, at least among a given word size? (If there's no nice way or if it's too much trouble, I'm fine with leaving it as is and relying on having good tests.)

Arguably there are 6 implementations (32/64-bit, and const/var/varpos divsteps), and they do share some of the code (in particular the matrix application), but not everything. I don't think there is an obvious way of merging more code without reducing performance, but I also haven't tried very hard.

What was the motivation for using different approaches to clear the low-order bits of g between the 32 and 64 bit implementations (8-bit lookup table vs 4 and 6-bit tricks from Hacker's Delight)?

Just dumb benchmarking. The formula was faster on x86_64 systems we tried, and the table was faster on ARMv7 systems (it's explained in one of the commit messages that introduced it).

I agree that replacing f, g = g, -f with f, g = g, f ensures that all matrix entries are always positive. But aside from actually observing that it converges to f = 1 in practice, do you have any other intuitive justification for this? (The best I can come up with is that since you're constantly dividing g by 2 and swapping the roles of f and g, you're quickly moving towards a solution even if you replace g - f with g + f in the original GCD algorithm.)

I don't, no. I do have a proof that any f>0,g>0 converges to f=g somewhere in a finite number of steps, but the upper bound is something like f*g*log2(f*g), which is obviously useless for practical purposes. The technique explained in https://github.com/sipa/safegcd-bounds completely fails for posdivsteps (doesn't converge at all). With a lot of tricks I got it to converge, but in what appears quadratic (O(log2(f)^2)), and it's only tractable to evaluate for f up to 2^20 or so.

The difficulty is that there are both steps that take f and g further apart (g /= 2) and steps that bring them closer (g = (f+g)/2), and it's hard to infer tight bounds on how many times one can happen in between two executions of the other.

So in short, no; the actual argument why this works is purely empirical; all out of a 2.5 billion random inputs (with the field size modulus) need between 630 and 921 iterations.

Is there already a writeup describing this change? If not, would it make sense to add a brief section to https://github.com/bitcoin-core/secp256k1/blob/master/doc/safegcd_implementation.md? (If you think so, I'd be happy to give that a shot.)

I think that would be valuable, yes. I haven't done so yet, but was considering it. Feel free to take a shot at it.

sipa · 2021-11-12T16:37:24Z

Included @robot-dreams's update to the writeup.

robot-dreams

ACK 5af3d29 aside from comment nits—I checked the range diff against the old commit and it looks good.

src/modinv32_impl.h

robot-dreams · 2021-11-16T21:22:41Z

ACK 53b7ce2

sipa · 2023-02-27T21:25:52Z

Rebased, and updated to only support coprime inputs in the secp256k1_secp256k1_jacobiXX_maybe_var functions. The return value of those functions for "unknown/timeout" has been changed from -2 to 0.

src/modinv32.h

src/tests.c

jonasnick · 2023-02-28T20:51:48Z

reACK mod nit

This introduces variants of the divsteps-based GCD algorithm used for modular inverses to compute Jacobi symbols. Changes compared to the normal vartime divsteps: * Only positive matrices are used, guaranteeing that f and g remain positive. * An additional jac variable is updated to track sign changes during matrix computation. * There is (so far) no proof that this algorithm terminates within reasonable amount of time for every input, but experimentally it appears to almost always need less than 900 iterations. To account for that, only a bounded number of iterations is performed (1500), after which failure is returned. In VERIFY mode a lower iteration count is used to make sure that callers exercise their fallback. * The algorithm converges to f=g=gcd(f0,g0) rather than g=0. To keep this test simple, the end condition is f=1, which won't be reached if started with non-coprime or g=0 inputs. Because of that we only support coprime non-zero inputs.

The implementation calls the secp256k1_modinvNN_jacobi_var code, falling back to computing a square root in the (extremely rare) case it failed converge.

jonasnick · 2023-02-28T21:10:28Z

ACK ce3cfc7

real-or-random

reACK ce3cfc7 diff and writeup is good and I tested every commit

src/int128_struct_impl.h

sipa force-pushed the 202109_native_jacobi branch 2 times, most recently from 5a5f7a6 to 042f91b Compare September 23, 2021 23:01

sipa mentioned this pull request Sep 28, 2021

Add Elligator Square module #982

Closed

dhruv mentioned this pull request Nov 3, 2021

BIP324: CKey encode/decode to elligator-swift bitcoin/bitcoin#23432

Closed

robot-dreams reviewed Nov 8, 2021

View reviewed changes

src/modinv32.h Show resolved Hide resolved

src/tests.c Outdated Show resolved Hide resolved

src/field_5x52_impl.h Show resolved Hide resolved

sipa force-pushed the 202109_native_jacobi branch 5 times, most recently from aced43a to 6e3fcc4 Compare November 9, 2021 22:22

robot-dreams reviewed Nov 10, 2021

View reviewed changes

src/modinv64_impl.h Outdated Show resolved Hide resolved

src/modinv32_impl.h Outdated Show resolved Hide resolved

src/modinv64_impl.h Outdated Show resolved Hide resolved

src/modinv32_impl.h Outdated Show resolved Hide resolved

src/modinv64_impl.h Outdated Show resolved Hide resolved

sipa force-pushed the 202109_native_jacobi branch from 6e3fcc4 to 46eaa95 Compare November 10, 2021 16:15

sipa force-pushed the 202109_native_jacobi branch from 55ec89c to 902d57e Compare November 12, 2021 16:37

robot-dreams reviewed Nov 15, 2021

View reviewed changes

src/modinv32_impl.h Show resolved Hide resolved

sipa force-pushed the 202109_native_jacobi branch from 902d57e to 53b7ce2 Compare November 16, 2021 20:28

sipa closed this Jan 29, 2022

sipa reopened this Jan 29, 2022

sipa force-pushed the 202109_native_jacobi branch from 53b7ce2 to 7e1bbef Compare June 28, 2022 20:19

sipa mentioned this pull request Jul 3, 2022

Add x-only ecmult_const version with x specified as n/d #1118

Merged

sipa mentioned this pull request Jul 21, 2022

ElligatorSwift + integrated x-only DH #1129

Merged

mratsim mentioned this pull request Aug 8, 2022

Perf: isSquare - constant-time Jacobi/Kronecker/Legendre symbol using fast GCD mratsim/constantine#199

Closed

sipa force-pushed the 202109_native_jacobi branch from 7e1bbef to 0bce5e0 Compare November 2, 2022 19:22

jonasnick reviewed Feb 28, 2023

View reviewed changes

src/modinv32.h Outdated Show resolved Hide resolved

sipa force-pushed the 202109_native_jacobi branch 2 times, most recently from eba1971 to 1c1c202 Compare February 28, 2023 16:23

jonasnick reviewed Feb 28, 2023

View reviewed changes

src/tests.c Outdated Show resolved Hide resolved

sipa and others added 3 commits February 28, 2023 15:54

Add secp256k1_fe_is_square_var function

6be0103

The implementation calls the secp256k1_modinvNN_jacobi_var code, falling back to computing a square root in the (extremely rare) case it failed converge.

doc: Describe Jacobi calculation in safegcd_implementation.md

ce3cfc7

sipa force-pushed the 202109_native_jacobi branch from 1c1c202 to ce3cfc7 Compare February 28, 2023 20:57

real-or-random approved these changes Mar 1, 2023

View reviewed changes

real-or-random merged commit 09b1d46 into bitcoin-core:master Mar 1, 2023

real-or-random mentioned this pull request Mar 1, 2023

modinv: Verify invariant of _modinfo struct #1216

Open

roconnor-blockstream reviewed Mar 22, 2023

View reviewed changes

src/int128_struct_impl.h Show resolved Hide resolved

sipa mentioned this pull request May 11, 2023

Abstract out and merge all the magnitude/normalized logic #1066

Merged

jonasnick mentioned this pull request Jul 20, 2023

Upstream PRs 1160, 1193, 1169, 1190, 1192, 1194, 1196, 1195, 1170, 1172, 1200, 1199, 1203, 1201, 1206, 1078, 1209, 979, 1212, 1218, 1217, 1221, 1222 BlockstreamResearch/secp256k1-zkp#249

Merged

sipa mentioned this pull request Sep 8, 2023

BIP324 tracking issue bitcoin/bitcoin#27634

Closed

43 tasks

Native jacobi symbol algorithm #979

Native jacobi symbol algorithm #979

Uh oh!

Conversation

sipa commented Sep 23, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

robot-dreams left a comment

Choose a reason for hiding this comment

MacBook Pro (x86_64)

Mac M1 (arm64)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

real-or-random commented Nov 9, 2021

Uh oh!

robot-dreams commented Nov 9, 2021

Uh oh!

sipa commented Nov 9, 2021

Uh oh!

sipa commented Nov 9, 2021

Uh oh!

robot-dreams left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

robot-dreams commented Nov 10, 2021

Uh oh!

sipa commented Nov 10, 2021

Uh oh!

sipa commented Nov 12, 2021

Uh oh!

robot-dreams left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

robot-dreams commented Nov 16, 2021

Uh oh!

sipa commented Feb 27, 2023

Uh oh!

Uh oh!

Uh oh!

jonasnick commented Feb 28, 2023

Uh oh!

jonasnick commented Feb 28, 2023

Uh oh!

real-or-random left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

sipa commented Sep 23, 2021 •

edited

Loading