51 changes: 51 additions & 0 deletions .github/workflows/run_tests.yaml
@@ -0,0 +1,51 @@
name: Run tests
on:
  pull_request:
  push: { branches: main }

jobs:
  run-test-suite:
    name: Run test suite
    runs-on: ubuntu-latest
    container: python:latest
    # TODO: use a gpu-compatible image, set up runners with a compatible gpu and activate
    # gpu passthrough options

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      # Use a pinned commit from the `feature/engine-api` branch at
      # https://github.com/scikit-learn/scikit-learn.git to enable smooth
      # synchronization with the development of this branch.
      # Development tracker: https://github.com/scikit-learn/scikit-learn/pull/25535/
      # TODO: Remove this step when the plugin API is officially released
      - name: Install pytest, the sklearn branch "feature/engine-api", and sklearn-pytorch-engine
        # Use the official scikit-learn build guide at
        # https://scikit-learn.org/stable/developers/advanced_installation.html#install-bleeding-edge
        run: |
          apt-get update --quiet &&
          # Install prerequisites
          apt-get install -y build-essential python3-dev &&
          pip install cython numpy scipy joblib threadpoolctl &&
          # Build and install
          pip install torch --index-url https://download.pytorch.org/whl/cpu &&
          pip install pytest git+https://github.com/fcharras/scikit-learn.git@2ccfc8c4bdf66db005d7681757b4145842944fb9#egg=scikit-learn -e .

      - name: Run sklearn_pytorch_engine tests
        run: pytest -v sklearn_pytorch_engine/

      # TODO: run those tests in a separate pipeline
      # NB: `sklearn_pytorch_engine` sets the estimators to output arrays of type
      # `torch.Tensor` and to store fitted attributes with this same type.
      # This behavior is not compatible with the sklearn unit tests, which expect numpy
      # arrays, or at least arrays that closely mimic the NumPy Python API. To
      # keep compatibility with the sklearn unit tests, the engine must be switched to a
      # different behavior where its methods are wrapped in data conversion steps so
      # that fitted attributes and outputs are numpy arrays. Currently this behavior is
      # activated when the environment variable SKLEARN_PYTORCH_ENGINE_TESTING_MODE is set
      # to 1.
      - name: Run sklearn test suites with sklearn_pytorch_engine
        run: SKLEARN_RUN_FLOAT32_TESTS=1 SKLEARN_PYTORCH_ENGINE_TESTING_MODE=1 pytest -v --sklearn-engine-provider sklearn_pytorch_engine --pyargs sklearn.cluster.tests.test_k_means
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
@@ -18,10 +18,10 @@ repos:
    rev: v0.961
    hooks:
      - id: mypy
-       files: sklearn_numba_dpex/
+       files: sklearn_pytorch_engine/
        additional_dependencies: [pytest==6.2.4]
  - repo: https://github.com/PyCQA/isort
    rev: 5.12.0
    hooks:
      - id: isort
-       files: sklearn_numba_dpex/
+       files: sklearn_pytorch_engine/
162 changes: 162 additions & 0 deletions README.md
@@ -0,0 +1,162 @@
# sklearn-pytorch-engine

Experimental plugin for scikit-learn that implements a backend for (some) scikit-learn
estimators, written with `pytorch`, so that it benefits from PyTorch's ability to
dispatch data and compute to many devices, provided the appropriate pytorch extensions
are installed.

This package requires working with the following experimental branch of scikit-learn:

- `feature/engine-api` branch on https://github.com/scikit-learn/scikit-learn

## List of Included Engines

- `sklearn.cluster.KMeans` for the standard Lloyd's algorithm on dense data arrays,
  including `kmeans++` support.

## Getting started

### Pre-requisites

#### Step 1: Install PyTorch

Getting started requires a working Python environment for using `pytorch`. Depending on
the device you target, install the appropriate PyTorch distribution or extensions,
including (but not limited to):

- one of the [native distributions](https://pytorch.org/get-started/locally/) for
  cuda (nvidia gpus), rocm (amd gpus) or mps (apple gpus) support

- the [Intel distributions](https://intel.github.io/intel-extension-for-pytorch/xpu/2.0.110+xpu/tutorials/installations/linux.html#install-via-prebuilt-wheel-files)
  for xpu (intel gpus) support, which also have experimental (unofficial) support for
  igpus when compiling from source
  [with appropriate flags](https://intel.github.io/intel-extension-for-pytorch/xpu/2.0.110+xpu/tutorials/installations/linux.html#configure-the-aot-optional)

#### Step 2: Install scikit-learn from source

Using the plugin requires the experimental development branch `feature/engine-api` of
scikit-learn, which implements the compatible plugin system. The `sklearn_pytorch_engine`
plugin is compatible with commit 2ccfc8c4bdf66db005d7681757b4145842944fb9, available
in the fork [fcharras/scikit-learn](https://github.com/fcharras/scikit-learn/).

Please refer to the relevant [scikit-learn documentation page](https://scikit-learn.org/stable/developers/advanced_installation.html#install-bleeding-edge)
for a comprehensive guide to installing from source. For instance, using `pip`
and `apt` (assuming an `apt`-based environment):

```bash
apt-get update --quiet
# Install prerequisites
apt-get install -y build-essential python3-dev git
pip install cython numpy scipy joblib threadpoolctl
# Build and install
pip install git+https://github.com/fcharras/scikit-learn.git@2ccfc8c4bdf66db005d7681757b4145842944fb9#egg=scikit-learn
```

#### Step 3: Install this plugin

From within your PyTorch + scikit-learn environment, run:

```bash
git clone https://github.com/soda-inria/sklearn-pytorch-engine
cd sklearn-pytorch-engine
pip install -e .
```

## Using the plugin

See the `sklearn_pytorch_engine/kmeans/tests` folder for example usage.

🚧 TODO: write some examples here instead.
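
In the meantime, here is a minimal sketch of the intended usage pattern. It assumes the
experimental `feature/engine-api` branch activates engines through
`sklearn.config_context(engine_provider=...)` (the `--sklearn-engine-provider` pytest
option suggests this naming); the exact activation API may differ:

```python
import numpy as np
import sklearn
from sklearn.cluster import KMeans

# Random float32 toy data; the engine also accepts `torch.Tensor` inputs directly.
X = np.random.default_rng(0).random((1000, 10), dtype=np.float32)

# Hypothetical engine activation for the experimental plugin API.
with sklearn.config_context(engine_provider="sklearn_pytorch_engine"):
    kmeans = KMeans(n_clusters=8, n_init=1).fit(X)

print(kmeans.cluster_centers_)
```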

### Running the tests

To run the tests, run the following from the root of the `sklearn-pytorch-engine`
repository:

```bash
pytest sklearn_pytorch_engine
```

To run the `scikit-learn` tests with the `sklearn_pytorch_engine` engine you can run the
following:

```bash
SKLEARN_PYTORCH_ENGINE_TESTING_MODE=1 pytest --sklearn-engine-provider sklearn_pytorch_engine --pyargs sklearn.cluster.tests.test_k_means
```

(change the `--pyargs` option accordingly to select other test suites).

The `--sklearn-engine-provider sklearn_pytorch_engine` option offered by the sklearn
pytest plugin will automatically activate the `sklearn_pytorch_engine` engine for all
tests.

Tests covering unsupported features (that trigger
`sklearn.exceptions.FeatureNotCoveredByPluginError`) will be automatically marked as
_xfailed_.

### Additional environment variables for device selection behavior

By default, the engine follows the _compute follows data_ principle: compute runs on
the device that hosts the data. For instance, `kmeans.fit(X)` will run compute on the
corresponding xpu device if `X` is a `torch.Tensor` such that `X.device.type` is
`"xpu"`, and will run on cpu if `X.device.type` is `"cpu"`, etc.
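
As an illustration, here is a sketch of this principle, assuming the engine has been
activated (see *Using the plugin* above) and an xpu-enabled PyTorch install; the device
hosting the input tensor determines where `fit` runs:

```python
import torch
from sklearn.cluster import KMeans

# Data hosted on the "xpu" device: with the engine active, fit runs on that device.
X_xpu = torch.rand(1000, 10, dtype=torch.float32, device="xpu")
KMeans(n_clusters=8, n_init=1).fit(X_xpu)

# Moving the same data to the host makes the compute run on cpu instead.
X_cpu = X_xpu.to("cpu")
KMeans(n_clusters=8, n_init=1).fit(X_cpu)
```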

It's possible to alter this behavior and force the engine to offload compute to a
specific device, using the environment variable
`SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE`. For instance, on a compatible computer,
`SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE=mps` will force the compute onto the
`mps`-compatible device, even if this requires copying the input data under the hood.

Both the internal and the scikit-learn test suites can run with any value of
`SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE`, as long as the corresponding pytorch extension
is available and the host hardware is compatible. For instance:

```bash
export SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE=xpu
pytest sklearn_pytorch_engine
SKLEARN_PYTORCH_ENGINE_TESTING_MODE=1 pytest --sklearn-engine-provider sklearn_pytorch_engine --pyargs sklearn.cluster.tests.test_k_means
```

will run all compute on the relevant `xpu` device.

At the moment, both test suites create test data that is hosted on the CPU by
default. For the internal tests, this behavior can be changed with the environment
variable `SKLEARN_PYTORCH_ENGINE_TEST_INPUTS_DEVICE`. For instance, the command

```bash
SKLEARN_PYTORCH_ENGINE_TEST_INPUTS_DEVICE=cuda SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE=cpu pytest sklearn_pytorch_engine
```

will run the tests while enforcing that the test data is generated on the cuda device
but the compute is done on cpu (since `SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE` is set
to `cpu`).

All combinations of those two environment variables make for a reasonably exhaustive
test matrix covering the internal data conversions.

### Notes about the preferred floating point precision (float32)

In many machine learning applications, single-precision (float32) floating point data
requires half the memory of double-precision (float64), is generally faster to process,
accurate enough, and better suited to GPU compute. In particular, most GPUs used in
machine learning projects are significantly faster with float32 than with
double-precision (float64) floating point data.

To leverage the full potential of GPU execution, it's strongly advised to use a float32
data type.

By default, unless specified otherwise, NumPy arrays are created with dtype float64, so
pay special attention to the dtype whenever a data loader does not explicitly document
the dtype or expose a dtype option.

NumPy arrays can also be converted from float64 to float32 using
[`numpy.ndarray.astype`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html),
although allocating in float32 from the start is preferable, since the conversion incurs
an avoidable data copy. `numpy.ndarray.astype` can be used as follows:

```python
import numpy as np

X = my_data_loader()  # placeholder for your data loading code
X_float32 = X.astype(np.float32)
my_gpu_compute(X_float32)  # placeholder for the downstream GPU compute
```
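
When the data creation is under your control, the copy can be avoided altogether by
allocating in float32 from the start, for instance (a sketch with arbitrary shapes):

```python
import numpy as np
import torch

# NumPy: request float32 at creation time instead of converting afterwards.
rng = np.random.default_rng(0)
X_np = rng.random((1000, 10), dtype=np.float32)

# PyTorch: pass the dtype explicitly when creating tensors (float32 is also the
# default floating point dtype for new tensors).
X_torch = torch.rand(1000, 10, dtype=torch.float32)
```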
1 change: 0 additions & 1 deletion setup.py
@@ -1,4 +1,3 @@
from setuptools import setup


setup()
8 changes: 8 additions & 0 deletions sklearn_pytorch_engine/__init__.py
@@ -0,0 +1,8 @@
try:
    # ensure xpu backend is loaded if available
    import intel_extension_for_pytorch as ipex  # noqa
    import torch

    torch.zeros(1, 1, device="xpu")
except ModuleNotFoundError:
    pass
40 changes: 40 additions & 0 deletions sklearn_pytorch_engine/_utils.py
@@ -0,0 +1,40 @@
import os
from functools import lru_cache

import array_api_compat
import numpy as np
import torch

_TORCH_ARRAY_API_NAMESPACE = (lambda: array_api_compat.get_namespace(torch.empty(0)))()


def get_torch_array_api_namespace():
    return _TORCH_ARRAY_API_NAMESPACE


@lru_cache
def to_pytorch_dtype(dtype):
    return torch.from_numpy(np.empty(0, dtype=dtype)).dtype


# NB: value is sensitive to `torch.device` context or previous
# `torch.set_default_device` calls
def get_torch_default_device():
    return torch.empty(0).device.type


def get_sklearn_pytorch_engine_default_device():
    device = os.getenv("SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE", None)
    if device is None:
        device = get_torch_default_device()
    return device


@lru_cache
def has_fp64_support(device):
    try:
        torch.zeros(1, dtype=torch.float64, device=device)
        return True
    except RuntimeError as runtime_error:
        if "data type is unsupported" in str(runtime_error):
            return False
        # Re-raise unexpected runtime errors rather than silently returning None
        raise