Skip to content

Const-qulify PyUFuncGenericFunction's arguments#40728

Closed
edudev wants to merge 1 commit intotensorflow:masterfrom
edudev:numpy-1.19-const-qualify-pyufunc
Closed

Const-qulify PyUFuncGenericFunction's arguments#40728
edudev wants to merge 1 commit intotensorflow:masterfrom
edudev:numpy-1.19-const-qualify-pyufunc

Conversation

@edudev
Copy link
Copy Markdown

@edudev edudev commented Jun 23, 2020

Due to recent changes in Numpy [1], PyUFuncGenericFunction's argument should be const-qualified.
This change is non-breaking (i.e. it will work for both numpy 1.18 and 1.19).

[1] numpy/numpy#15355

@google-ml-butler google-ml-butler bot added the size:XS CL Change Size: Extra Small label Jun 23, 2020
@edudev
Copy link
Copy Markdown
Author

edudev commented Jun 23, 2020

I just noticed this change [1]. I believe my PR resolves all the breaking changes in the numpy ABI and the commit can be reversed.

[1] 79518fa

@gbaned gbaned self-assigned this Jun 24, 2020
@gbaned gbaned requested a review from jaingaurav June 24, 2020 03:49
@gbaned gbaned added the awaiting review Pull request awaiting review label Jun 30, 2020
@gbaned
Copy link
Copy Markdown
Contributor

gbaned commented Jul 9, 2020

@jaingaurav Can you please review this PR ? Thanks!

jcarpent added a commit to jcarpent/eigenpy that referenced this pull request Jul 15, 2020
@Flamefire
Copy link
Copy Markdown
Contributor

I'd suggest to include a revert of 79518fa in this PR as this resolves the TODO. I hope the TF guys agree.

@gbaned gbaned requested a review from majnemer July 29, 2020 17:14
@edudev
Copy link
Copy Markdown
Author

edudev commented Jul 30, 2020

I'd suggest to include a revert of 79518fa in this PR as this resolves the TODO. I hope the TF guys agree.

I believe it should be reverted, too.
However, before adding it to the PR, I'm awaiting for some feedback.
One other thing that I do not know is whether the old version of numpy should be supported.

@Flamefire
Copy link
Copy Markdown
Contributor

One other thing that I do not know is whether the old version of numpy should be supported.

What do you mean by that exactly? -->

This change is non-breaking (i.e. it will work for both numpy 1.18 and 1.19).

@edudev
Copy link
Copy Markdown
Author

edudev commented Jul 30, 2020

One other thing that I do not know is whether the old version of numpy should be supported.

What do you mean by that exactly? -->

This change is non-breaking (i.e. it will work for both numpy 1.18 and 1.19).

Ah, I totally forgot that it's a non-breaking change. My mistake.

@gbaned
Copy link
Copy Markdown
Contributor

gbaned commented Aug 6, 2020

@edudev, @majnemer Any update on this PR? Please. Thanks!

Remove the limit of numpy's version
@edudev edudev force-pushed the numpy-1.19-const-qualify-pyufunc branch from 8bed450 to 84cc21a Compare August 8, 2020 22:28
@edudev
Copy link
Copy Markdown
Author

edudev commented Aug 8, 2020

@edudev, @majnemer Any update on this PR? Please. Thanks!

Well, I'm just waiting for a review.

Something that I did notice though is that the relevant function type (PyUFuncGenericFunction) is also used in the XLA part of tensorflow. For some reason, when I compiled tensorflow with XLA support, it did not hit the 'broken' code (didn't attempt to compile it, at all).
However, when I explicitly compiled it (//tensorflow/compiler/xla/python:bfloat16), it was in fact unsuccessful, hence why I just updated my pull request to fix this, as well.
I also reverted the change which limited the numpy version to 1.18.

And something that I sadly just noticed: this change is in fact breaking. Once merged, it will not be possible to compile tensorflow with numpy below 1.19. Moreover, I believe that using tensorflow with lower versions of numpy will also be impossible, due to differences in the mangled name of the relevant functions.
Which is why I also suggest that the minimum version of numpy also be bumped (from 1.16 to 1.19).

@Flamefire
Copy link
Copy Markdown
Contributor

this change is in fact breaking. Once merged, it will not be possible to compile tensorflow with numpy below 1.19. Moreover, I believe that using tensorflow with lower versions of numpy will also be impossible, due to differences in the mangled name of the relevant functions.

Can you elaborate why? I mean: those functions call numpy functions and those numpy function either take non-const pointers (<1.19) or const-pointers (1.19+), so both work now. Only thing which will be impossible is to compile TF with <1.19 and use with 1.19+. Or is that why you meant?

@majnemer majnemer requested a review from hawkinsp August 9, 2020 19:42
@edudev
Copy link
Copy Markdown
Author

edudev commented Aug 9, 2020

this change is in fact breaking. Once merged, it will not be possible to compile tensorflow with numpy below 1.19. Moreover, I believe that using tensorflow with lower versions of numpy will also be impossible, due to differences in the mangled name of the relevant functions.

Can you elaborate why? I mean: those functions call numpy functions and those numpy function either take non-const pointers (<1.19) or const-pointers (1.19+), so both work now. Only thing which will be impossible is to compile TF with <1.19 and use with 1.19+. Or is that why you meant?

Yes, it will not be possible to compile tensorflow with numpy versions below 1.19 due to the change in the function pointer type (from non-const pointer argument to const pointer argument).
I'm not sure why I did not notice this issue before. I believe it may have been due to some caches. I only noticed the compilation issues after running bazel clean.

About the runtime issues: had the relevant code (bfloat16) been compiled as a C source, this would not have been an issue. However, in C++ the argument types actually influence the function names.
A simple example:

$ c++filt _Z3fooPc
foo(char*)
$ c++filt _Z3fooPKc
foo(char const*)

Notice how changing the argument from a normal pointer to a constant pointer changes the name of the function, as well (there's an extra K).

Now, coming to the numpy case, it's... complicated... way more complicated, because the offending argument is actually a pointer to a function.

@edudev
Copy link
Copy Markdown
Author

edudev commented Aug 9, 2020

Hmm, okay, I took a closer look at this once again. As I said, it's quite complicated.

The linking issue is at the boundary of tensorflow and numpy. Including headers and having const/non-const is only a compile-time issue, as already mentioned.
The only boundary where the offending function pointer type change (PyUFuncGenericFunction) is used at a library boundary is when calling PyUFunc_RegisterLoopForType. This is because PyUFunc_RegisterLoopForType takes an argument, which is PyUFuncGenericFunction. These calls are a few lines below my changes in the two bfloat16.cc source files of tensorflow.

Normally, when the tensorflow call to PyUFunc_RegisterLoopForType gets executed, the (dynamic) linker will search for the relevant function and, as already mentioned, that one (or two) const qualifiers make a hell of a lot difference in the function names.
Just for illustration purposes, let's assume that PyUFunc_RegisterLoopForType had the following declaration (I've removed some parameters to make it simpler, but the most important PyUFuncGenericFunction is still there):

int PyUFunc_RegisterLoopForType(PyUFuncGenericFunction);

Here are how the two names of this function would look like (without and with const qualifiers for the parameters of the PyUFuncGenericFunction parameter; told you it was complicated):

_Z27PyUFunc_RegisterLoopForTypePFvPPcPlS1_PvE
_Z27PyUFunc_RegisterLoopForTypePFvPPcPKlS2_PvE

(they are different)

Okay, up until now, this pretty much says that we would not be able to use tensorflow with versions of numpy before 1.19.
However, I went to also check the definition of PyUFunc_RegisterLoopForType, and it becomes even more obfuscated.
The declaration is located in __ufunc_api.h of the numpy source code; and there are actually two declarations.
I'll ignore the #ifdef, which branches between the two, and here they are:

NPY_NO_EXPORT  int PyUFunc_RegisterLoopForType \
       (PyUFuncObject *, int, PyUFuncGenericFunction, const int *, void *);

#define PyUFunc_RegisterLoopForType \
        (*(int (*)(PyUFuncObject *, int, PyUFuncGenericFunction, const int *, void *)) \
         PyUFunc_API[2])

(these declarations differ from the one I provided above, as it was intentionally made simpler, with less arguments)

I have not gone through the numpy source code, so I am not exactly sure when each of these two versions is used. My guess would be that the former is used for internal numpy complication, while the latter is used when compiling against external libraries (such as tensorflow).
Nonetheless, the definition of the former is not in the C source code, while the definition of the latter is quite simple (I'm pasting again):

#define PyUFunc_RegisterLoopForType \
        (*(int (*)(PyUFuncObject *, int, PyUFuncGenericFunction, const int *, void *)) \
         PyUFunc_API[2])

And this is just a simple cast of PyUFunc_API[2], whatever it may be. This cast does not care about whether a parameter of a function pointer parameter is const-qualified or not.

Therefore, in conclusion, I believe it may not be an issue to use older versions of numpy with tensorflow.
However, I'm not sure how to test this, and I have absolutely no idea how to write an automated test for this (the test itself would pretty much require having multiple versions of numpy).

@hawkinsp
Copy link
Copy Markdown
Contributor

hawkinsp commented Aug 9, 2020

I didn't follow the discussion in the most recent comment, but it looks to me like it is overcomplicating this.

I think all you need to do to fix any source compatibility issues with older versions of NumPy is to call reinterpret_cast<PyUFuncGenericFunction>(fn) on the function being passed to PyUFunc_RegisterLoopForType. This will be a no-op if the const-qualifiers match, and perform a cast otherwise.

@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Aug 11, 2020
@edudev
Copy link
Copy Markdown
Author

edudev commented Aug 11, 2020

You can ignore my comments on this failing to link due to the C++ name mangling. I just noticed that there's an extern "C" around the numpy code for PyUFunc_RegisterLoopForType.
I see no issues and I believe the pull request is more or less ready (should it be okay that numpy 1.19 be required to compile tensorflow).

About the reinterpret cast: as I was wrong about the C++ function name mangling occurring, it will not be needed.
However, assuming that I was not mistaken about this issue, it would still not work. The compiler decides at compile time which of the overloaded functions to call (the one with the const qualifier or the one without). A reinterpret cast would be able to fix the issue of compiling tensorflow with versions of numpy prior to 1.19.
The issue that I was describing was a potential one at link time, where a reinterpret cast would not change anything.

So... my question is: how bad would it be to require numpy 1.19 for tensorflow compilation (but still be able to work alongside old versions during runtime)?

@Flamefire
Copy link
Copy Markdown
Contributor

It seems to me that you are confusing things. I don't see how this change is not compatible with numpy before and after 1.19. Can you point out why you think this is not the case? Did you try it?

Let me address what we have up until now:

  • PyUFunc_RegisterLoopForType expects a function of type PyUFuncGenericFunction and is passed your changed functions
  • PyUFuncGenericFunction changed in 1.19 to have const params, so the changed functions will no longer match the type PyUFuncGenericFunction`` in pre-1.19 numpy
  • reinterpret_cast<PyUFuncGenericFunction>(fn) can be used to solve this
  • PyUFunc_RegisterLoopForType seems to be a macro which casts a void* out of an array called PyUFunc_API to the (function) type required. Hence ABI for this doesn't matter
  • As the changed TF function is passed via a (casted) function pointer to numpy ABI of that doesn't matter either
  • Hence with the cast everything is API and ABI compatible with all numpy versions

The compiler decides at compile time which of the overloaded functions to call

Which overloaded functions? There are no overloaded functions here are they? You either have ones with const-pointers (this change) or without (prior to this changed)

The issue that I was describing was a potential one at link time,

After reading the above, do you still see an issue? Which one? Have you tried it?

@edudev
Copy link
Copy Markdown
Author

edudev commented Aug 15, 2020

Once again, you can ignore my comments on linking errors.

The only questions that remains is whether tensorflow should be compiled with numpy < 1.19.
The pull request, as it is now would allow tensorflow to be compiled only with numpy 1.19.
A reinterpret cast would allow compilation with lower versions of numpy. My personal opinion is that adding the cast is not necessary, unless somebody can point out a good argument for supporting older versions of numpy.

In either case (with the reinterpret cast, or without it), tensorflow will continue to work with the older versions of numpy at runtime.

@Flamefire
Copy link
Copy Markdown
Contributor

Flamefire commented Aug 15, 2020

A reinterpret cast would allow compilation with lower versions of numpy.

IMO that is worth it. Maybe with a slightly safer version that static_asserts that the signature is what is expected and does the cast only then. So either:

  • check for numpy < 1.19 (if that is possible) and cast to the explicit, known signature
  • or, add static_asserts so that it checks that PyUFuncGenericFunction is either the known new or old version and a second check that checks that the function to be cast is either the new or old version

Reason: If they change it again or the function here is changed then it will result in a build error instead of a silent failure

somebody can point out a good argument for supporting older versions of numpy.

One I can think of: User having numpy already installed. That won't be upgraded unless they do the explicitly (or implicitly if TF requires the new version) but maybe users have other software they use which doesn't work with the older version due to similar problems to this

@edudev
Copy link
Copy Markdown
Author

edudev commented Aug 15, 2020

Glad we're finally on the same page about the issue of the pull request.
I apologise for all my previous comments which diverted everyone's attention for something irrelevant.

About the cast: I honestly am not sure which is better, which is why I asked for arguments for the cast.
My reasoning to dislike it is that not having a cast would not affect users of tensorflow. The only ones affected are developers. If you're a user and have issues with numpy 1.19, you can, of course downgrade to 1.18 (or not upgrade at all).
If you are a developer of a 3rd party library, then I'd say that there would be issues only if your C/C++ code is not compatible with the new ABI of numpy 1.19 (i.e. you use the numpy ABI). I would argue that it is then the responsibility of this 3rd party to update their code to comply with the breaking changes to numpy 1.19 - the change, as in this pull request is fairly straighforward, and if you really don't want to be compatible with numpy 1.19, you can also delay compatibility with the new tensorflow version (with which this fix would likely get shipped).
If you're a developer of tensorflow, then the worst thing that would happen is that your build would fail until you do a
$ pip install numpy>=1.19
Generally, I do believe that most developers have separate python virtual environments (right?, at least I do), so several versions of numpy can co-exist without issues.

Another reason that I don't like having a cast is that it is to support an old version of numpy. I looked for official information on their release guidelines and end-of-life timeline, but I could find nothing. If a decision is made to support numpy 1.18, I would also add a comment on this (probably also with a link to this issue). Nonetheless, I feel like a cast would be a 'temporary' solution (until numpy 1.18 gets forgotten), which would remain there permanently (I try to avoid such code, as much as possible).

And lastly, the cast is in fact not future-proof, as you've noted. Although extremely unlikely, the ABI may change again to something which is incompatible (unlike the current const qualifiers). I tried to look for some macros to detect for numpy 1.19, but could not find anything (most likely I missed it, though).

@Flamefire
Copy link
Copy Markdown
Contributor

You are missing an important group: people who install packages like system package maintainers and system admins. For them it is not possible to have conflicting versions or "just upgrade" a package. So having wider compatibility is a big bonus.

Numpy 1.18 is not old enough to ignore it. And just adding the cast is less work than all this discussion.

For your last point: imo it is irrelevant as long as we detect that at compile time. See my solutions above. They won't completly change it without at least raising the major version and then a decision needs to be made. But that is all fruitless speculation.

So: best to get something to work for current versions and don't limit them just to save 4 lines of code or so. Rely on reported versions (is there a macro of the Numpy version?) or add a comment. Then nothing is forgotten

@gbaned gbaned added the awaiting review Pull request awaiting review label Aug 21, 2020
@gbaned gbaned requested a review from mihaimaruseac August 25, 2020 18:19
@gbaned gbaned requested a review from ebrevdo September 23, 2020 10:02
@ebrevdo
Copy link
Copy Markdown
Contributor

ebrevdo commented Sep 23, 2020

@alextp does this PR break the C ABI guarantees we have in place?

@alextp
Copy link
Copy Markdown
Contributor

alextp commented Sep 23, 2020

@ebrevdo I don't see any change to our ABI-stable headers in tensorflow/c/{eager,} so I don't think it's ABI-breaking.

I also believe that the numpy abi breaking change is actually not an abi breaking change (i.e. the code links fine) but an api-breaking change (i.e. the const makes C++ compilers not willing to compile old code against the new header file but old code compiled agains the old header file links just fine against existing numpy).

That said I didn't review this change at all so don't trust me to approve.

@Flamefire
Copy link
Copy Markdown
Contributor

Flamefire commented Sep 23, 2020

FWIW: All the functions changed are internal linkage so not exposed. Hence no ABI break is possible. And yes I agree that this is an API break only from numpy

However before this is accepted I'd like to see support for pre and post numpy 1.19 handling. I.e.

PyUFuncGenericFunction fn = UFunc::Call;
and
auto register_ufunc = [&](const char* name, PyUFuncGenericFunction fn,
need to be adapted so the new and the old signature (PyUFuncGenericFunction) are accepted (and only those 2)

To make the evaluation easier see this godbolt link which demonstrates the effect of compiling against numpy pre 1.19 and post 1.19 via a define: https://godbolt.org/z/5MKMWK

@ebrevdo
Copy link
Copy Markdown
Contributor

ebrevdo commented Sep 23, 2020

Thanks for clarifying, @Flamefire ; looks like we may need an #if/#else that checks the version of numpy.

@alextp
Copy link
Copy Markdown
Contributor

alextp commented Sep 23, 2020 via email

@Flamefire
Copy link
Copy Markdown
Contributor

@alextp Not really. The problem is that the numpy API either expects a function like foo(const int*) OR foo(int*) and you cannot pass the other one as the types of the functions are different.
So yes a solution would be to accept this PR and require numpy >= 1.19 which would limit interop with other packages.

My proposed alternative: https://godbolt.org/z/5s9xMj

The idea is to use a function which accepts the new signature and casts it to what numpy expects. A static_assert makes sure we don't cast something wrong (e.g. when numpy adds a parameter):

static PyUFuncGenericFunction castToPyUFuncGenericFunction(TF_UFuncGenericFunction_New func) {
    static_assert(
      std::is_same<PyUFuncGenericFunction, TF_UFuncGenericFunction_New>::value || 
      std::is_same<PyUFuncGenericFunction, TF_UFuncGenericFunction_Old>::value,
      "Unknown PyUFuncGenericFunction signature"
    );
    return reinterpret_cast<PyUFuncGenericFunction>(func);
}

@avdv
Copy link
Copy Markdown

avdv commented Sep 23, 2020

I thought this issue was already resolved by 75ea0b3 ?

@jaingaurav jaingaurav removed their request for review September 23, 2020 22:59
@Flamefire
Copy link
Copy Markdown
Contributor

@avdv You are right, that commit fixes the first part of the issue and 8ed58f3 the other. The first is very elegant, the second likely dangerous as it uses an unchecked reinterpret_cast which could be improved by my solution or a similar one to 75ea0b3 by providing an overload

@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Sep 26, 2020
@gbaned gbaned added the awaiting review Pull request awaiting review label Oct 9, 2020
@mihaimaruseac
Copy link
Copy Markdown
Contributor

Question: what do we need to do from this PR, given that it has been partly solved/merged in previous commits (if I understand the previous comments)?

I am currently working on fixing the dependencies to work with the new pip resolver and this issue popped up.

@mihaimaruseac
Copy link
Copy Markdown
Contributor

It seems I was able to build on Python3.8 with latest numpy.

@mihaimaruseac
Copy link
Copy Markdown
Contributor

With aafe25d numpy is now past the API breakage

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting review Pull request awaiting review cla: yes size:XS CL Change Size: Extra Small

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants