
Conversation

@colesbury (Member) commented Jul 20, 2018

This is a modification of the strategy from #8919 and #9579.

Previously, the CPU architecture-specific kernels self-registered with
the DispatchStub. When linking as part of a static library, this required
passing the --whole-archive flag to the linker to ensure that the
object files for the kernels were included. Caffe2 and TensorFlow use that
strategy.

We ran into some issues with --whole-archive blowing up the binary size
of some downstream projects in Facebook. This PR avoids --whole-archive
for CPU kernels. The downside is that the generic code needs to be aware
of whether kernels were compiled with AVX and AVX2 support (via
HAVE_AVX_CPU_DEFINITION and HAVE_AVX2_CPU_DEFINITION).
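
As an illustration only (the real DispatchStub code differs, and the helper names here are made up), the dispatch decision roughly amounts to guarding each capability-specific function pointer behind the corresponding build-time macro:

```cpp
// Hypothetical sketch, not the actual ATen implementation.
enum class CPUCapability { DEFAULT = 0, AVX = 1, AVX2 = 2 };

// Placeholder standing in for the runtime cpuid-based detection (assumption).
CPUCapability detect_cpu_capability() { return CPUCapability::DEFAULT; }

template <typename FnPtr>
FnPtr choose_cpu_impl(FnPtr default_impl, FnPtr avx_impl, FnPtr avx2_impl) {
  CPUCapability capability = detect_cpu_capability();
#ifdef HAVE_AVX2_CPU_DEFINITION
  // Only referenced when the AVX2 kernel object file was actually compiled.
  if (capability >= CPUCapability::AVX2) return avx2_impl;
#endif
#ifdef HAVE_AVX_CPU_DEFINITION
  if (capability >= CPUCapability::AVX) return avx_impl;
#endif
  return default_impl;
}
```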

The CUDA kernels still self-register with DispatchStub because the CPU
library is not aware of whether the CUDA library will be available at
runtime.
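
For contrast, a self-registration scheme can be sketched like this (simplified, with made-up names; it only illustrates why the CPU side needs no compile-time knowledge of the CUDA kernel):

```cpp
// Simplified sketch of self-registration; names are illustrative only.
using KernelFn = void (*)();

struct ExampleStub {
  // Stays null unless a CUDA object file is linked in and registers itself.
  static KernelFn cuda_impl;
};
KernelFn ExampleStub::cuda_impl = nullptr;

// A static object whose constructor runs at load time in the CUDA library.
struct RegisterCudaKernel {
  explicit RegisterCudaKernel(KernelFn fn) { ExampleStub::cuda_impl = fn; }
};

// In the CUDA translation unit:
void example_cuda_launcher() { /* launch the kernel */ }
static RegisterCudaKernel register_example(&example_cuda_launcher);
```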

There are a few major changes to DispatchStub:

 - The environment variable ATEN_CPU_CAPABILITY overrides the CPU
   capability detection code (previously ATEN_DISABLE_AVX/AVX2); a sketch
   of this override follows the list.

 - DispatchStub is defined in the generic native code instead of the
   CPU_CAPABILITY_DEFAULT kernel.
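
A rough sketch of the environment-variable override (the accepted values and parsing shown here are assumptions, not the exact ATen behavior):

```cpp
#include <cstdlib>
#include <cstring>

enum class CPUCapability { DEFAULT = 0, AVX = 1, AVX2 = 2 };

CPUCapability compute_cpu_capability() {
  // If ATEN_CPU_CAPABILITY is set, trust it instead of runtime detection.
  const char* env = std::getenv("ATEN_CPU_CAPABILITY");
  if (env != nullptr) {
    if (std::strcmp(env, "avx2") == 0) return CPUCapability::AVX2;
    if (std::strcmp(env, "avx") == 0) return CPUCapability::AVX;
    if (std::strcmp(env, "default") == 0) return CPUCapability::DEFAULT;
  }
  // Otherwise fall back to cpuid-based detection (omitted here).
  return CPUCapability::DEFAULT;
}
```

Under that assumption, running with ATEN_CPU_CAPABILITY=default should force the generic kernels regardless of what the CPU supports.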

@facebook-github-bot (Contributor) left a comment

@colesbury has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@soumith (Contributor) commented Jul 21, 2018

@pytorchbot retest this please

@ezyang (Contributor) commented Jul 23, 2018

Build error looks real:

05:21:39 /private/var/lib/jenkins/workspace/pytorch-builds/pytorch-macos-10.13-py3-test2/aten/src/ATen/native/DispatchStub.h:72:18: error: instantiation of variable 'at::native::DispatchStub<void (*)(at::Tensor &, const at::Tensor &), at::native::softmax_lastdim_kernel>::AVX' required here, but no definition is available [-Werror,-Wundefined-var-template]
05:21:39       AT_ASSERTM(AVX, "DispatchStub: missing AVX kernel");
05:21:39                  ^
05:21:39 /private/var/lib/jenkins/workspace/pytorch-builds/pytorch-macos-10.13-py3-test2/aten/src/ATen/native/DispatchStub.h:50:28: note: in instantiation of member function 'at::native::DispatchStub<void (*)(at::Tensor &, const at::Tensor &), at::native::softmax_lastdim_kernel>::choose_cpu_impl' requested here
05:21:39         cpu_dispatch_ptr = choose_cpu_impl();
05:21:39                            ^
05:21:39 /private/var/lib/jenkins/workspace/pytorch-builds/pytorch-macos-10.13-py3-test2/aten/src/ATen/native/SoftMax.cpp:131:27: note: in instantiation of function template specialization 'at::native::DispatchStub<void (*)(at::Tensor &, const at::Tensor &), at::native::softmax_lastdim_kernel>::operator()<at::Tensor, at::Tensor>' requested here
05:21:39     softmax_lastdim_kernel(kCPU, output, input);
05:21:39                           ^
05:21:39 /private/var/lib/jenkins/workspace/pytorch-builds/pytorch-macos-10.13-py3-test2/aten/src/ATen/native/DispatchStub.h:84:16: note: forward declaration of template entity is here
05:21:39   static FnPtr AVX;
05:21:39                ^
05:21:39 /private/var/lib/jenkins/workspace/pytorch-builds/pytorch-macos-10.13-py3-test2/aten/src/ATen/native/DispatchStub.h:72:18: note: add an explicit instantiation declaration to suppress this warning if 'at::native::DispatchStub<void (*)(at::Tensor &, const at::Tensor &), at::native::softmax_lastdim_kernel>::AVX' is explicitly instantiated in another translation unit
05:21:39       AT_ASSERTM(AVX, "DispatchStub: missing AVX kernel");
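
For reference, clang's -Wundefined-var-template fires when a static data member of a class template is odr-used in a translation unit where no definition of it is visible. A minimal, self-contained illustration of the pattern and two ways to silence it (generic C++, not the actual fix applied in this PR):

```cpp
// The pattern the warning is about (names are illustrative).
template <typename T>
struct Stub {
  static int AVX;                  // declared, but no definition visible yet
  int get() const { return AVX; }  // odr-use requires an instantiation of AVX
};

// Option 1: define the member in the header so every TU sees a definition.
template <typename T>
int Stub<T>::AVX = 0;

// Option 2 (alternative): add an explicit instantiation declaration here,
//   extern template struct Stub<int>;
// and explicitly instantiate it in exactly one .cpp file.

int main() {
  Stub<int> s;
  return s.get();  // 0
}
```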

@facebook-github-bot (Contributor) left a comment

@colesbury has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@gchanan (Contributor) left a comment

lgtm.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 23, 2018
Summary: same as the PR description above.
Pull Request resolved: pytorch/pytorch#9664

Differential Revision: D8943350

Pulled By: colesbury

fbshipit-source-id: 329229b0ee9ff94fc001b960287814bd734096ef
jramseyer pushed a commit to jramseyer/pytorch that referenced this pull request Jul 30, 2018
Summary: same as the PR description above.
Pull Request resolved: pytorch#9664

Differential Revision: D8943350

Pulled By: colesbury

fbshipit-source-id: 329229b0ee9ff94fc001b960287814bd734096ef
goodlux pushed a commit to goodlux/pytorch that referenced this pull request Aug 15, 2018
Summary: same as the PR description above.
Pull Request resolved: pytorch#9664

Differential Revision: D8943350

Pulled By: colesbury

fbshipit-source-id: 329229b0ee9ff94fc001b960287814bd734096ef
@ezyang added the merged label on Jun 26, 2019