Skip to content

Conversation

@eiriktsarpalis
Copy link
Member

@eiriktsarpalis eiriktsarpalis commented Mar 13, 2024

Updating the implementations to use FMA where applicable.

NB tests might require some calibration pending test results from platforms without FMA support.

@ghost ghost added the area-System.Numerics label Mar 13, 2024
@eiriktsarpalis eiriktsarpalis self-assigned this Mar 13, 2024
@eiriktsarpalis eiriktsarpalis added this to the 9.0.0 milestone Mar 13, 2024
@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

@stephentoub
Copy link
Member

Are there corresponding changes to be made to VectorMath in corelib?

@eiriktsarpalis
Copy link
Member Author

Are there corresponding changes to be made to VectorMath in corelib?

Not sure. Do these routines have precision issues as well?

@stephentoub
Copy link
Member

Are there corresponding changes to be made to VectorMath in corelib?

Not sure. Do these routines have precision issues as well?

It's a copy of the same code, e.g.

public static TVectorDouble ExpDouble<TVectorDouble, TVectorInt64, TVectorUInt64>(TVectorDouble x)
. The code being changed here for Exp for example only applies to .NET 8; on .NET 9 that code from Corelib ends up being used.

@eiriktsarpalis
Copy link
Member Author

I was only running the .NET 9 tests when working on this, so clearly something's off here.

Comment on lines +207 to +214
Vector128<double> a0 = MultiplyAddEstimateOperator<double>.Invoke(Vector128.Create(C12), r, Vector128.Create(C11));
Vector128<double> a1 = MultiplyAddEstimateOperator<double>.Invoke(Vector128.Create(C10), r, Vector128.Create(C9));
Vector128<double> a2 = MultiplyAddEstimateOperator<double>.Invoke(Vector128.Create(C8), r, Vector128.Create(C7));
Vector128<double> a3 = MultiplyAddEstimateOperator<double>.Invoke(Vector128.Create(C6), r, Vector128.Create(C5));
Vector128<double> a4 = MultiplyAddEstimateOperator<double>.Invoke(Vector128.Create(C4), r, Vector128.Create(C3));
Vector128<double> a5 = MultiplyAddEstimateOperator<double>.Invoke(a0, r2, a1);
Vector128<double> a6 = MultiplyAddEstimateOperator<double>.Invoke(a2, r2, a3);
Vector128<double> a7 = MultiplyAddEstimateOperator<double>.Invoke(a4, r2, r + Vector128<double>.One);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This many locals isn't really "good" for the JIT and can hinder some optimizations.

It can also make it a little bit harder to follow the intended computation ordering, which can itself limit other optimizations or reorderings the JIT would be allowed to do.

We could reduce the local count significantly if we broke it up like this, and then its also equivalent in codegen to as if there were no locals at all:

Vector128<double> t1 = MultiplyAddEstimate(
    MultiplyAddEstimate(Vector128.Create(C12), r, Vector128.Create(C11)),
    r2,
    MultiplyAddEstimate(Vector128.Create(C10), r, Vector128.Create(C9))
);

Vector128<double> t2 = MultiplyAddEstimate(
   MultiplyAddEstimate(Vector128.Create(C8),  r, Vector128.Create(C7)),
   r2,
   MultiplyAddEstimate(Vector128.Create(C6),  r, Vector128.Create(C5))
);

t1 = MultiplyAddEstimate(t1, r8, t2);
t2 = MultiplyAddEstimate(Vector128.Create(C4),  r, Vector128.Create(C3));
t2 = MultiplyAddEstimate(t2, r2, r + Vector128<double>.One);

MultiplyAddEstimate(t1, r4, t2);

You can see it results in smaller codegen, lets the initial multiplies be parallel dispatched up front, keeps the total register usage lower, etc: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABABgAJiBGAOgCUBXAOwwEt8YaBJFqVp3KzC4A3AFgAUGUq1GLdpx4Y+AobhoANABxJxUgMyUATOQDC5AN6TyN8gAc+AN2wYY5SAIzkAJhAbAAGzdTQwBeclIaAFZSWLj4uKotPVt7Jxc3D1wvX38gsxRycMiqJDLyivKqAHYU2wdWZ1d3CE8fP0DgqKKImlIUUsrylABOI2ra63r05qycjvzTJB7I2K19Dc3NqKix6qM6mwamzNbs9rzg6pW+0ip9LUenkaiqIyMUfsO0xoyWttynTMWhuiRGWgGRiQIxQ+ihE1GUW+xz+cwuQNMIx6RhoKC0dzxSCiRnu1RQeP0AFE4KQkVMjjNTgCFsEqBRwjjqjtqkh3lE4VFYaQjFSabp6T8Tv9zoDFlQqNiaDy4UYRnDwUYYi8UNTSJMJKkUbMzvNLmY3orRjD9P19FQXlootUqLrkhKJcRDAA1GBgDDQN5aAA8spgAD5yABZAAUPr9AaMwdDEagAEpLBLUnH/VBAyGWSmTOEoOQAFTkKDfLO+nN55MVwrFkzlqAHTO2bMJpMFisg4uFFsoKsdmtd/OXCPYdlRhgBNh2AIATwAgt5vJTsuwMrHR7nEzRTLBt6Y3qm0BXz5291oD0fXNGT1RU6nhzYr3We9gFeFI7P50vV3XTd8G3d990PGBjzZM8L3IMCbwg48RmfV84N3D8J3IbAixnOdWAXFc1w3NgQPveDb0g+9TC0GCoEvdDwLvGAH2qFD2zfBju0w7Awlw/9CKAkjQM4ijjyQWj6Pja9RKoqI2INVIRykjDOknRs+PwgCiOA4TlMYyjmNMFAJLQvSEKYh99HkxSlNrRNx1UrDuh/P9NIE4it3vKdz1bc8vxfdjTLsrjHOwZYXLwgjAI80jmOwnyjD8qzUPghygkna4Iv46KdK89AK0SityAAaiCsdkxoAB5JgYAChSbMCgB6Rq7AgJcemwboWxBUqwrLBsSqw/UbI4sy0vDew2sXHpf0irTBM8uKkB8/LsFY1DiGuWbsu0oSvKiHytHPVqlzq1IAF93XqyhvU48aIwAOR3Mb6zTDNrurF6e1bHoSxbFK7te9SfpbNsPts8rvr7AbBwBr7MIwb8NKi3bFujQLUm2tycr25jyMQqjTx8yTgpkwz5WfNAMdsXzqZsLGUYW2LntJgnydIEz8YszFn0CurAtS+sMBwhn5pi7c6dF9zcrxkS2YfGjz1grmDJYym6dp8HFKlnG0ZVsSYOVuXubk1M+b0QLEZm1zGfF+9EcO89hbOkbhetubpdxlmuzJh9jKVuiyuk+WQms12RZtsWZejYWEp8wbBYLKqav5rXKC2yPPbRh2GydowXZsS7rolI03ETzCddR5ny8coIADMMBJyHML4ABzAALRug5U9KsLXGAmG8M3rqsNPWDr8howAMRA7hcAAZQYOxWqgVwh8C0eRsUzbyBn7AaEr7xo3rru287vz+8HgvFKLrfAp34+YAb/qz4wdNesv7xviL86gA=

@github-actions github-actions bot locked and limited conversation to collaborators Apr 13, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants