Const-qulify PyUFuncGenericFunction's arguments by edudev · Pull Request #40728 · tensorflow/tensorflow

edudev · 2020-06-23T21:07:28Z

Due to recent changes in Numpy [1], PyUFuncGenericFunction's argument should be const-qualified.
This change is non-breaking (i.e. it will work for both numpy 1.18 and 1.19).

[1] numpy/numpy#15355

edudev · 2020-06-23T21:34:50Z

I just noticed this change [1]. I believe my PR resolves all the breaking changes in the numpy ABI and the commit can be reversed.

[1] 79518fa

gbaned · 2020-07-09T15:04:34Z

@jaingaurav Can you please review this PR ? Thanks!

Related to tensorflow/tensorflow#40728

Flamefire · 2020-07-28T08:53:51Z

I'd suggest to include a revert of 79518fa in this PR as this resolves the TODO. I hope the TF guys agree.

edudev · 2020-07-30T15:22:16Z

I'd suggest to include a revert of 79518fa in this PR as this resolves the TODO. I hope the TF guys agree.

I believe it should be reverted, too.
However, before adding it to the PR, I'm awaiting for some feedback.
One other thing that I do not know is whether the old version of numpy should be supported.

Flamefire · 2020-07-30T16:20:19Z

One other thing that I do not know is whether the old version of numpy should be supported.

What do you mean by that exactly? -->

This change is non-breaking (i.e. it will work for both numpy 1.18 and 1.19).

edudev · 2020-07-30T16:54:32Z

One other thing that I do not know is whether the old version of numpy should be supported.

What do you mean by that exactly? -->

This change is non-breaking (i.e. it will work for both numpy 1.18 and 1.19).

Ah, I totally forgot that it's a non-breaking change. My mistake.

gbaned · 2020-08-06T16:24:00Z

@edudev, @majnemer Any update on this PR? Please. Thanks!

Remove the limit of numpy's version

edudev · 2020-08-08T22:42:17Z

@edudev, @majnemer Any update on this PR? Please. Thanks!

Well, I'm just waiting for a review.

Something that I did notice though is that the relevant function type (PyUFuncGenericFunction) is also used in the XLA part of tensorflow. For some reason, when I compiled tensorflow with XLA support, it did not hit the 'broken' code (didn't attempt to compile it, at all).
However, when I explicitly compiled it (//tensorflow/compiler/xla/python:bfloat16), it was in fact unsuccessful, hence why I just updated my pull request to fix this, as well.
I also reverted the change which limited the numpy version to 1.18.

And something that I sadly just noticed: this change is in fact breaking. Once merged, it will not be possible to compile tensorflow with numpy below 1.19. Moreover, I believe that using tensorflow with lower versions of numpy will also be impossible, due to differences in the mangled name of the relevant functions.
Which is why I also suggest that the minimum version of numpy also be bumped (from 1.16 to 1.19).

Flamefire · 2020-08-09T19:42:28Z

this change is in fact breaking. Once merged, it will not be possible to compile tensorflow with numpy below 1.19. Moreover, I believe that using tensorflow with lower versions of numpy will also be impossible, due to differences in the mangled name of the relevant functions.

Can you elaborate why? I mean: those functions call numpy functions and those numpy function either take non-const pointers (<1.19) or const-pointers (1.19+), so both work now. Only thing which will be impossible is to compile TF with <1.19 and use with 1.19+. Or is that why you meant?

edudev · 2020-08-09T21:04:02Z

this change is in fact breaking. Once merged, it will not be possible to compile tensorflow with numpy below 1.19. Moreover, I believe that using tensorflow with lower versions of numpy will also be impossible, due to differences in the mangled name of the relevant functions.

Can you elaborate why? I mean: those functions call numpy functions and those numpy function either take non-const pointers (<1.19) or const-pointers (1.19+), so both work now. Only thing which will be impossible is to compile TF with <1.19 and use with 1.19+. Or is that why you meant?

Yes, it will not be possible to compile tensorflow with numpy versions below 1.19 due to the change in the function pointer type (from non-const pointer argument to const pointer argument).
I'm not sure why I did not notice this issue before. I believe it may have been due to some caches. I only noticed the compilation issues after running bazel clean.

About the runtime issues: had the relevant code (bfloat16) been compiled as a C source, this would not have been an issue. However, in C++ the argument types actually influence the function names.
A simple example:

$ c++filt _Z3fooPc
foo(char*)
$ c++filt _Z3fooPKc
foo(char const*)

Notice how changing the argument from a normal pointer to a constant pointer changes the name of the function, as well (there's an extra K).

Now, coming to the numpy case, it's... complicated... way more complicated, because the offending argument is actually a pointer to a function.

edudev · 2020-08-09T21:46:15Z

Hmm, okay, I took a closer look at this once again. As I said, it's quite complicated.

The linking issue is at the boundary of tensorflow and numpy. Including headers and having const/non-const is only a compile-time issue, as already mentioned.
The only boundary where the offending function pointer type change (PyUFuncGenericFunction) is used at a library boundary is when calling PyUFunc_RegisterLoopForType. This is because PyUFunc_RegisterLoopForType takes an argument, which is PyUFuncGenericFunction. These calls are a few lines below my changes in the two bfloat16.cc source files of tensorflow.

Normally, when the tensorflow call to PyUFunc_RegisterLoopForType gets executed, the (dynamic) linker will search for the relevant function and, as already mentioned, that one (or two) const qualifiers make a hell of a lot difference in the function names.
Just for illustration purposes, let's assume that PyUFunc_RegisterLoopForType had the following declaration (I've removed some parameters to make it simpler, but the most important PyUFuncGenericFunction is still there):

int PyUFunc_RegisterLoopForType(PyUFuncGenericFunction);

Here are how the two names of this function would look like (without and with const qualifiers for the parameters of the PyUFuncGenericFunction parameter; told you it was complicated):

_Z27PyUFunc_RegisterLoopForTypePFvPPcPlS1_PvE
_Z27PyUFunc_RegisterLoopForTypePFvPPcPKlS2_PvE

(they are different)

Okay, up until now, this pretty much says that we would not be able to use tensorflow with versions of numpy before 1.19.
However, I went to also check the definition of PyUFunc_RegisterLoopForType, and it becomes even more obfuscated.
The declaration is located in __ufunc_api.h of the numpy source code; and there are actually two declarations.
I'll ignore the #ifdef, which branches between the two, and here they are:

NPY_NO_EXPORT  int PyUFunc_RegisterLoopForType \
       (PyUFuncObject *, int, PyUFuncGenericFunction, const int *, void *);

#define PyUFunc_RegisterLoopForType \
        (*(int (*)(PyUFuncObject *, int, PyUFuncGenericFunction, const int *, void *)) \
         PyUFunc_API[2])

(these declarations differ from the one I provided above, as it was intentionally made simpler, with less arguments)

I have not gone through the numpy source code, so I am not exactly sure when each of these two versions is used. My guess would be that the former is used for internal numpy complication, while the latter is used when compiling against external libraries (such as tensorflow).
Nonetheless, the definition of the former is not in the C source code, while the definition of the latter is quite simple (I'm pasting again):

#define PyUFunc_RegisterLoopForType \
        (*(int (*)(PyUFuncObject *, int, PyUFuncGenericFunction, const int *, void *)) \
         PyUFunc_API[2])

And this is just a simple cast of PyUFunc_API[2], whatever it may be. This cast does not care about whether a parameter of a function pointer parameter is const-qualified or not.

Therefore, in conclusion, I believe it may not be an issue to use older versions of numpy with tensorflow.
However, I'm not sure how to test this, and I have absolutely no idea how to write an automated test for this (the test itself would pretty much require having multiple versions of numpy).

hawkinsp · 2020-08-09T22:50:04Z

I didn't follow the discussion in the most recent comment, but it looks to me like it is overcomplicating this.

I think all you need to do to fix any source compatibility issues with older versions of NumPy is to call reinterpret_cast<PyUFuncGenericFunction>(fn) on the function being passed to PyUFunc_RegisterLoopForType. This will be a no-op if the const-qualifiers match, and perform a cast otherwise.

edudev · 2020-08-11T04:26:54Z

You can ignore my comments on this failing to link due to the C++ name mangling. I just noticed that there's an extern "C" around the numpy code for PyUFunc_RegisterLoopForType.
I see no issues and I believe the pull request is more or less ready (should it be okay that numpy 1.19 be required to compile tensorflow).

About the reinterpret cast: as I was wrong about the C++ function name mangling occurring, it will not be needed.
However, assuming that I was not mistaken about this issue, it would still not work. The compiler decides at compile time which of the overloaded functions to call (the one with the const qualifier or the one without). A reinterpret cast would be able to fix the issue of compiling tensorflow with versions of numpy prior to 1.19.
The issue that I was describing was a potential one at link time, where a reinterpret cast would not change anything.

So... my question is: how bad would it be to require numpy 1.19 for tensorflow compilation (but still be able to work alongside old versions during runtime)?

Flamefire · 2020-08-11T09:04:39Z

It seems to me that you are confusing things. I don't see how this change is not compatible with numpy before and after 1.19. Can you point out why you think this is not the case? Did you try it?

Let me address what we have up until now:

PyUFunc_RegisterLoopForType expects a function of type PyUFuncGenericFunction and is passed your changed functions
PyUFuncGenericFunction changed in 1.19 to have const params, so the changed functions will no longer match the type PyUFuncGenericFunction`` in pre-1.19 numpy
reinterpret_cast<PyUFuncGenericFunction>(fn) can be used to solve this
PyUFunc_RegisterLoopForType seems to be a macro which casts a void* out of an array called PyUFunc_API to the (function) type required. Hence ABI for this doesn't matter
As the changed TF function is passed via a (casted) function pointer to numpy ABI of that doesn't matter either
Hence with the cast everything is API and ABI compatible with all numpy versions

The compiler decides at compile time which of the overloaded functions to call

Which overloaded functions? There are no overloaded functions here are they? You either have ones with const-pointers (this change) or without (prior to this changed)

The issue that I was describing was a potential one at link time,

After reading the above, do you still see an issue? Which one? Have you tried it?

edudev · 2020-08-15T17:04:49Z

Once again, you can ignore my comments on linking errors.

The only questions that remains is whether tensorflow should be compiled with numpy < 1.19.
The pull request, as it is now would allow tensorflow to be compiled only with numpy 1.19.
A reinterpret cast would allow compilation with lower versions of numpy. My personal opinion is that adding the cast is not necessary, unless somebody can point out a good argument for supporting older versions of numpy.

In either case (with the reinterpret cast, or without it), tensorflow will continue to work with the older versions of numpy at runtime.

Flamefire · 2020-08-15T17:52:42Z

A reinterpret cast would allow compilation with lower versions of numpy.

IMO that is worth it. Maybe with a slightly safer version that static_asserts that the signature is what is expected and does the cast only then. So either:

check for numpy < 1.19 (if that is possible) and cast to the explicit, known signature
or, add static_asserts so that it checks that PyUFuncGenericFunction is either the known new or old version and a second check that checks that the function to be cast is either the new or old version

Reason: If they change it again or the function here is changed then it will result in a build error instead of a silent failure

somebody can point out a good argument for supporting older versions of numpy.

One I can think of: User having numpy already installed. That won't be upgraded unless they do the explicitly (or implicitly if TF requires the new version) but maybe users have other software they use which doesn't work with the older version due to similar problems to this

edudev · 2020-08-15T20:01:46Z

Glad we're finally on the same page about the issue of the pull request.
I apologise for all my previous comments which diverted everyone's attention for something irrelevant.

About the cast: I honestly am not sure which is better, which is why I asked for arguments for the cast.
My reasoning to dislike it is that not having a cast would not affect users of tensorflow. The only ones affected are developers. If you're a user and have issues with numpy 1.19, you can, of course downgrade to 1.18 (or not upgrade at all).
If you are a developer of a 3rd party library, then I'd say that there would be issues only if your C/C++ code is not compatible with the new ABI of numpy 1.19 (i.e. you use the numpy ABI). I would argue that it is then the responsibility of this 3rd party to update their code to comply with the breaking changes to numpy 1.19 - the change, as in this pull request is fairly straighforward, and if you really don't want to be compatible with numpy 1.19, you can also delay compatibility with the new tensorflow version (with which this fix would likely get shipped).
If you're a developer of tensorflow, then the worst thing that would happen is that your build would fail until you do a
$ pip install numpy>=1.19
Generally, I do believe that most developers have separate python virtual environments (right?, at least I do), so several versions of numpy can co-exist without issues.

Another reason that I don't like having a cast is that it is to support an old version of numpy. I looked for official information on their release guidelines and end-of-life timeline, but I could find nothing. If a decision is made to support numpy 1.18, I would also add a comment on this (probably also with a link to this issue). Nonetheless, I feel like a cast would be a 'temporary' solution (until numpy 1.18 gets forgotten), which would remain there permanently (I try to avoid such code, as much as possible).

And lastly, the cast is in fact not future-proof, as you've noted. Although extremely unlikely, the ABI may change again to something which is incompatible (unlike the current const qualifiers). I tried to look for some macros to detect for numpy 1.19, but could not find anything (most likely I missed it, though).

Flamefire · 2020-08-15T21:22:07Z

You are missing an important group: people who install packages like system package maintainers and system admins. For them it is not possible to have conflicting versions or "just upgrade" a package. So having wider compatibility is a big bonus.

Numpy 1.18 is not old enough to ignore it. And just adding the cast is less work than all this discussion.

For your last point: imo it is irrelevant as long as we detect that at compile time. See my solutions above. They won't completly change it without at least raising the major version and then a decision needs to be made. But that is all fruitless speculation.

So: best to get something to work for current versions and don't limit them just to save 4 lines of code or so. Rely on reported versions (is there a macro of the Numpy version?) or add a comment. Then nothing is forgotten

ebrevdo · 2020-09-23T15:02:51Z

@alextp does this PR break the C ABI guarantees we have in place?

alextp · 2020-09-23T15:10:27Z

@ebrevdo I don't see any change to our ABI-stable headers in tensorflow/c/{eager,} so I don't think it's ABI-breaking.

I also believe that the numpy abi breaking change is actually not an abi breaking change (i.e. the code links fine) but an api-breaking change (i.e. the const makes C++ compilers not willing to compile old code against the new header file but old code compiled agains the old header file links just fine against existing numpy).

That said I didn't review this change at all so don't trust me to approve.

Flamefire · 2020-09-23T15:12:17Z

FWIW: All the functions changed are internal linkage so not exposed. Hence no ABI break is possible. And yes I agree that this is an API break only from numpy

However before this is accepted I'd like to see support for pre and post numpy 1.19 handling. I.e.

tensorflow/tensorflow/compiler/xla/python/bfloat16.cc

Line 798 in 84cc21a

PyUFuncGenericFunction fn = UFunc::Call;

and

tensorflow/tensorflow/python/lib/core/bfloat16.cc

Line 637 in 84cc21a

auto register_ufunc = [&](const char* name, PyUFuncGenericFunction fn,

need to be adapted so the new and the old signature (PyUFuncGenericFunction) are accepted (and only those 2)

To make the evaluation easier see this godbolt link which demonstrates the effect of compiling against numpy pre 1.19 and post 1.19 via a define: https://godbolt.org/z/5MKMWK

ebrevdo · 2020-09-23T15:58:26Z

Thanks for clarifying, @Flamefire ; looks like we may need an #if/#else that checks the version of numpy.

alextp · 2020-09-23T16:01:02Z

I think always passing a const should work without the need for an #if. And it's also fine if we require a particular version of numpy's header files for TF to build against.

…

On Wed, Sep 23, 2020 at 8:58 AM ebrevdo ***@***.***> wrote: Thanks for clarifying, @Flamefire <https://github.com/Flamefire> ; looks like we may need an #if/#else that checks the version of numpy. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#40728 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAABHRPIDPQ4Z2JT6L6G5YTSHILLPANCNFSM4OGBKTEA> .

-- - Alex

Flamefire · 2020-09-23T16:15:25Z

@alextp Not really. The problem is that the numpy API either expects a function like foo(const int*) OR foo(int*) and you cannot pass the other one as the types of the functions are different.
So yes a solution would be to accept this PR and require numpy >= 1.19 which would limit interop with other packages.

My proposed alternative: https://godbolt.org/z/5s9xMj

The idea is to use a function which accepts the new signature and casts it to what numpy expects. A static_assert makes sure we don't cast something wrong (e.g. when numpy adds a parameter):

static PyUFuncGenericFunction castToPyUFuncGenericFunction(TF_UFuncGenericFunction_New func) {
    static_assert(
      std::is_same<PyUFuncGenericFunction, TF_UFuncGenericFunction_New>::value || 
      std::is_same<PyUFuncGenericFunction, TF_UFuncGenericFunction_Old>::value,
      "Unknown PyUFuncGenericFunction signature"
    );
    return reinterpret_cast<PyUFuncGenericFunction>(func);
}

avdv · 2020-09-23T18:24:27Z

I thought this issue was already resolved by 75ea0b3 ?

Flamefire · 2020-09-24T09:04:46Z

@avdv You are right, that commit fixes the first part of the issue and 8ed58f3 the other. The first is very elegant, the second likely dangerous as it uses an unchecked reinterpret_cast which could be improved by my solution or a similar one to 75ea0b3 by providing an overload

mihaimaruseac · 2020-10-16T17:53:04Z

Question: what do we need to do from this PR, given that it has been partly solved/merged in previous commits (if I understand the previous comments)?

I am currently working on fixing the dependencies to work with the new pip resolver and this issue popped up.

mihaimaruseac · 2020-10-16T18:36:39Z

It seems I was able to build on Python3.8 with latest numpy.

mihaimaruseac · 2020-10-16T20:39:41Z

With aafe25d numpy is now past the API breakage

google-ml-butler bot added the size:XS CL Change Size: Extra Small label Jun 23, 2020

googlebot added the cla: yes label Jun 23, 2020

gbaned self-assigned this Jun 24, 2020

gbaned requested a review from jaingaurav June 24, 2020 03:49

gbaned added the awaiting review Pull request awaiting review label Jun 30, 2020

nim65s mentioned this pull request Jul 6, 2020

Build error with numpy 1.19.0 stack-of-tasks/eigenpy#202

Closed

jcarpent added a commit to jcarpent/eigenpy that referenced this pull request Jul 15, 2020

core: fix compilation issue with NumPy 1.19

b7ca733

Related to tensorflow/tensorflow#40728

gbaned requested a review from majnemer July 29, 2020 17:14

Const-qulify PyUFuncGenericFunction's arguments

84cc21a

Remove the limit of numpy's version

edudev force-pushed the numpy-1.19-const-qualify-pyufunc branch from 8bed450 to 84cc21a Compare August 8, 2020 22:28

majnemer requested a review from hawkinsp August 9, 2020 19:42

tensorflowbutler removed the awaiting review Pull request awaiting review label Aug 11, 2020

Flamefire mentioned this pull request Aug 15, 2020

Run SWIG in the default environment #42362

Merged

gbaned added the awaiting review Pull request awaiting review label Aug 21, 2020

gbaned requested a review from mihaimaruseac August 25, 2020 18:19

gbaned requested a review from ebrevdo September 23, 2020 10:02

jaingaurav removed their request for review September 23, 2020 22:59

tensorflowbutler removed the awaiting review Pull request awaiting review label Sep 26, 2020

mihaimaruseac mentioned this pull request Oct 2, 2020

relaxing numpy requirement #43679

Closed

gbaned added the awaiting review Pull request awaiting review label Oct 9, 2020

mihaimaruseac closed this Oct 16, 2020

rgommers mentioned this pull request Feb 12, 2021

MAINT: Const qualify UFunc inner loops numpy/numpy#15355

Merged

Conversation

edudev commented Jun 23, 2020

Uh oh!

edudev commented Jun 23, 2020

Uh oh!

gbaned commented Jul 9, 2020

Uh oh!

Flamefire commented Jul 28, 2020

Uh oh!

edudev commented Jul 30, 2020

Uh oh!

Flamefire commented Jul 30, 2020

Uh oh!

edudev commented Jul 30, 2020

Uh oh!

gbaned commented Aug 6, 2020

Uh oh!

edudev commented Aug 8, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Flamefire commented Aug 9, 2020

Uh oh!

edudev commented Aug 9, 2020

Uh oh!

edudev commented Aug 9, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hawkinsp commented Aug 9, 2020

Uh oh!

edudev commented Aug 11, 2020

Uh oh!

Flamefire commented Aug 11, 2020

Uh oh!

edudev commented Aug 15, 2020

Uh oh!

Flamefire commented Aug 15, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

edudev commented Aug 15, 2020

Uh oh!

Flamefire commented Aug 15, 2020

Uh oh!

ebrevdo commented Sep 23, 2020

Uh oh!

alextp commented Sep 23, 2020

Uh oh!

Flamefire commented Sep 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ebrevdo commented Sep 23, 2020

Uh oh!

alextp commented Sep 23, 2020 via email

Uh oh!

Flamefire commented Sep 23, 2020

Uh oh!

avdv commented Sep 23, 2020

Uh oh!

Flamefire commented Sep 24, 2020

Uh oh!

mihaimaruseac commented Oct 16, 2020

Uh oh!

mihaimaruseac commented Oct 16, 2020

Uh oh!

mihaimaruseac commented Oct 16, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

10 participants

edudev commented Aug 8, 2020 •

edited

Loading

edudev commented Aug 9, 2020 •

edited

Loading

Flamefire commented Aug 15, 2020 •

edited

Loading

Flamefire commented Sep 23, 2020 •

edited

Loading