STVM, NUPHAR, remove tvm from submodules list, checks pointers are not null.#10211

Merged
xadupre merged 24 commits into microsoft:master from xadupre:stvm1
Jan 27, 2022

Conversation

@xadupre (Member) commented Jan 6, 2022

Description:

  • Checks that pointers are not null to avoid crashes.
  • Removes tvm from the list of submodules and uses FetchContent to retrieve the sources instead. This should speed up the build for every other scenario.

Motivation and Context
Helps identify why it fails. Speeds up CI.
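The FetchContent approach described in this PR might look roughly like this (a minimal sketch; the repository URL is real, but the tag and surrounding structure are assumptions, not the PR's actual code):

```cmake
include(FetchContent)

if (onnxruntime_USE_STVM)
  # Clone TVM only when the STVM EP is requested, instead of always
  # checking it out as a git submodule for every build scenario.
  FetchContent_Declare(
    tvm
    GIT_REPOSITORY https://github.com/apache/tvm.git
    GIT_TAG        v0.8.0  # assumed tag for illustration
  )
  FetchContent_MakeAvailable(tvm)
endif()
```

With this shape, builds that do not enable the STVM EP never pay the cost of cloning TVM.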

@xadupre xadupre changed the title STVM, checks pointers are not null. STVM, NUPHAR, remove tvm from submodules list, checks pointers are not null. Jan 11, 2022
@xadupre xadupre changed the title STVM, NUPHAR, remove tvm from submodules list, checks pointers are not null. [WIP] STVM, NUPHAR, remove tvm from submodules list, checks pointers are not null. Jan 12, 2022
@@ -0,0 +1,35 @@
if (onnxruntime_USE_STVM)
Contributor

You still need to add nuphar and all its dependencies that will be used in your build to cgmanifest.json.

@xadupre (Member Author) Jan 27, 2022

TVM dependencies are listed here: https://github.com/apache/tvm/blob/main/LICENSE#L204. Some of them are checked into the repository and their sources are difficult to trace (cma). Another comes from a dependency already listed in this file (compiler-rt / llvm, blockingconcurrentqueue.h). I added "comments": "dependency from tvm" for the newly added dependencies.

docs/STVM_EP.md Outdated

```
cd onnxruntime/cmake/external/tvm_update/
cd onnxruntime/cmake/external/tvm/
```
Contributor

Why do users need to manually build TVM first? If that is required, it seems we don't need to clone TVM at all: we can treat TVM like the CUDA SDK and ask the user to provide the install path of the software.

In whatever case, we still need to track the software with cgmanifest.json.

Member Author

It comes from PR #10019. The first draft required building TVM manually. There is also ongoing work in PR #10260 (renaming).

@vvchernov (Contributor)

Hello @xadupre, I think it is a good idea to leave one tvm submodule instead of two, but you should know the following things:

1. Nuphar and STVM are different EPs, although both are based on Apache TVM. They are developed by different teams and use different TVM hashes. For STVM I do not think that v0.8.0 is a good choice; we use a very fresh commit. This should be agreed separately with the Nuphar team.
2. If you remove the tvm submodule, TVM needs to live elsewhere and be built before ORT. In particular, the so-file from TVM is used for the whl-package build (onnxruntime-stvm). Moreover, the ORT build system is big enough that you likely need to check (update) more places for correct work without the tvm submodule.
3. Building TVM outside of ORT becomes the responsibility of the client. But it should be noted that TVM is a target-dependent project: with which keys do you plan to build it for ORT? There is no universal answer. Our idea was to add cmake keys to build TVM together with ORT, based on the keys from ORT: for GPU we use CUDA or Metal or Vulkan, for CPU LLVM, and so on.
4. Please also take into account PR #10260.

@xadupre (Member Author) commented Jan 14, 2022

Both NUPHAR and STVM use TVM, but not the same versions (nuphar uses a fork that has not been updated in two years). I used whatever was already checked in. Between 0.5 and 0.8, TVM updated its C++ API, so updating TVM for nuphar would require many changes (see #10183). This PR does not remove TVM; it fetches the TVM sources only when they are needed. The build does not change for nuphar or any other scenario. The goal of this PR is to leave the build unchanged but clone tvm only when it is needed.

@tmoreau89

Thank you @xadupre - I think we can go ahead and merge this PR if it's ready, after which we can go ahead and land the following PR: #10260

@jroesch (Contributor) commented Jan 21, 2022

Makes sense to me as well, thanks for the improvement @xadupre

@vvchernov (Contributor)

Hello @xadupre, I recommend you check the build of TVM inside ORT without the USE_MICRO key. That key activates tvmc, which allows working with TVM via shell commands; I do not think ORT needs it. We will also check it tomorrow on our side.

@xadupre (Member Author) commented Jan 24, 2022

I tried, but it was crashing when running "import onnxruntime.providers.stvm", so I wanted to try with the same settings given in the documentation. To be more precise, I get the following:

```
terminate called after throwing an instance of 'tvm::runtime::InternalError'
  what():  [11:12:32] onnxruntime/build/cpu_stvm/Release/_deps/tvm-src/src/runtime/registry.cc:69:
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (can_override) is false: Global PackedFunc arith.CreateAnalyzer is already registered

Aborted (core dumped)
```

@vvchernov
Copy link
Contributor

Hello @xadupre, we are concerned about a possible situation where this PR is merged because CI passes, but in fact it does not work for the ORT build with TVM, a separate TVM build, the TVM whl-package build, and so on. Maybe we can help you, or communicate/test to be sure that everything works well before it is merged?

@xadupre (Member Author) commented Jan 26, 2022

Hello @vvchernov, the only way to make sure of that is to add a CI job with a unit test. It works for me with the following steps:

```
python3 ./tools/ci_build/build.py --config Release --skip_tests --build_wheel --parallel --build_dir ./build/cpu_stvm --use_stvm
export PYTHONPATH=~/github/onnxruntime/build/cpu_stvm/Release/:~/github/onnxruntime/build/cpu_stvm/Release/_deps/tvm-src/python
```

Then a simple script:

```python
import tvm  # needs to be imported first
import onnxruntime
# ... the rest of the script
```

Now I need to check whether it also works after installing both *.whl files built during this build.

@KJlaccHoeUM9l (Contributor) commented Jan 26, 2022

Hello @xadupre, import tvm will be done automatically once PR #10241 is merged.
When building the .whl package for onnxruntime-stvm, this import is automatically inserted into _ld_preload.py. Thus, import tvm is called at the beginning of the entire chain of imports in onnxruntime.
Therefore, import tvm can be removed from your test script.
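The auto-inserted import described above could look roughly like this inside the generated `_ld_preload.py` (hypothetical content for illustration; the real file is produced by the packaging scripts):

```python
# Hypothetical sketch of a generated _ld_preload.py for the
# onnxruntime-stvm wheel: import tvm before onnxruntime loads its native
# library, so TVM's runtime symbols are registered exactly once.
try:
    import tvm  # noqa: F401  -- must come before onnxruntime's native import
    TVM_PRELOADED = True
except ImportError:
    # TVM is not installed; the STVM EP will simply be unavailable.
    TVM_PRELOADED = False
```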

@xadupre (Member Author) commented Jan 26, 2022

Hi @KJlaccHoeUM9l, that PR was merged and I built with it included. It did not work for me. I'll try again.

@xadupre xadupre changed the title [WIP] STVM, NUPHAR, remove tvm from submodules list, checks pointers are not null. STVM, NUPHAR, remove tvm from submodules list, checks pointers are not null. Jan 26, 2022
@KJlaccHoeUM9l (Contributor)

Sorry, it was my mistake in PR #10241.
Due to active work on a branch in which STVM has already been renamed to TVM, I made a mistake on this line.
For it to work correctly, you need to change the line as follows:

```python
if package_name == 'onnxruntime-stvm':
```

Maybe it will be faster to make this change within this PR? Or should I open a new PR?

@xadupre (Member Author) commented Jan 26, 2022

@KJlaccHoeUM9l, I fixed it in this PR. I also added a small example to check that the provider is working. Is it possible to review it? It fails for me: https://github.com/microsoft/onnxruntime/pull/10211/files#diff-c6a93b897de7a145372d94463bb7aa6030f3bd3578c50548a261025b6825ade6R53. The inference works, but CPUExecutionProvider and StvmExecutionProvider do not give the same output.

@xadupre xadupre requested a review from snnn January 26, 2022 15:01
@vvchernov (Contributor)

> @KJlaccHoeUM9l, I fixed it in this PR. I also added a small example to check that the provider is working. Is it possible to review it? It fails for me: https://github.com/microsoft/onnxruntime/pull/10211/files#diff-c6a93b897de7a145372d94463bb7aa6030f3bd3578c50548a261025b6825ade6R53. The inference works, but CPUExecutionProvider and StvmExecutionProvider do not give the same output.

@xadupre, which model did you use for the test?

@xadupre (Member Author) commented Jan 26, 2022

> @xadupre, which model did you use for the test?

The test has to be fast, so I did not use any predefined model. The model is created in the script using onnx functions. It is a linear regression (XA + B).
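The failing comparison can be reproduced with plain numpy: compute the reference Y = X·A + B and compare a provider's output against it with a tolerance. This is a hedged sketch of the comparison only; `provider_output` stands in for the result an InferenceSession on StvmExecutionProvider would return, which is not runnable here:

```python
import numpy as np

def reference_linreg(X, A, B):
    # Reference result for the test model: Y = X @ A + B.
    return X @ A + B

rng = np.random.default_rng(0)
X = rng.standard_normal((1, 2)).astype(np.float32)
A = rng.standard_normal((2, 2)).astype(np.float32)
B = rng.standard_normal((1, 2)).astype(np.float32)

expected = reference_linreg(X, A, B)
# In the PR's unit test this would come from session.run(...) on the
# STVM provider; here we reuse the reference so the sketch is runnable.
provider_output = reference_linreg(X, A, B)

outputs_match = np.allclose(expected, provider_output, rtol=1e-5)
print(outputs_match)  # True when the provider agrees with the reference
```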

@KJlaccHoeUM9l (Contributor)

@xadupre, this error is due to the fact that your model does not have fixed tensor shapes. At the moment, the STVM EP in automatic mode can only work with models that have fixed shapes.
For your test to work correctly, you need to change it as follows:

```python
X = make_tensor_value_info('X', TensorProto.FLOAT, [1, 2])
A = make_tensor_value_info('A', TensorProto.FLOAT, [2, 2])
B = make_tensor_value_info('B', TensorProto.FLOAT, [1, 2])
Y = make_tensor_value_info('Y', TensorProto.FLOAT, [1, 2])
```

@xadupre (Member Author) commented Jan 26, 2022

I suggest raising an exception if the model has dynamic shapes and input_shapes is not specified.
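Such a guard might be sketched as follows (a hypothetical helper; the function and option names are assumptions, not the EP's actual API):

```python
def check_fixed_shapes(input_shapes, user_input_shapes=None):
    """Raise if any input dimension is dynamic and the user did not
    supply an explicit input_shapes option."""
    if user_input_shapes is not None:
        return  # user provided concrete shapes; nothing to check
    for name, shape in input_shapes.items():
        for dim in shape:
            # In ONNX value infos a dynamic axis is a string (e.g.
            # "batch") or None rather than a positive integer.
            if not isinstance(dim, int) or dim <= 0:
                raise ValueError(
                    f"Input '{name}' has dynamic dimension {dim!r}; "
                    "the STVM EP requires fixed shapes or an explicit "
                    "input_shapes option.")

check_fixed_shapes({"X": [1, 2]})  # fixed shapes: passes silently
try:
    check_fixed_shapes({"X": ["batch", 2]})
    dynamic_rejected = False
except ValueError:
    dynamic_rejected = True
print(dynamic_rejected)  # True: the dynamic shape is rejected
```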

set_target_properties(tvm_runtime PROPERTIES FOLDER ${tvm_SOURCE_DIR})

set(TVM_INCLUDES ${tvm_SOURCE_DIR}/include
${tvm_SOURCE_DIR}/3rdparty/dmlc-core/include
Contributor

So, please also add dmlc-core and dlpack to our cgmanifest.json. In the json file you may add a comment saying they are for TVM.

Contributor

I don't know if it has other dependencies. If you know any, please also add.

@snnn (Contributor) left a comment

LGTM. You may merge it and continue to improve the code.

@xadupre xadupre merged commit 481b96d into microsoft:master Jan 27, 2022
@prateek9623 (Contributor)

@xadupre hey, can you help me with the NUPHAR-TVM cmake install in this PR: #8919

@tmoreau89

Great, thank you @xadupre for getting this PR through and @snnn @KJlaccHoeUM9l and @vvchernov for the review.

message(STATUS "TVM BEFORE USE_LLVM=${USE_LLVM} USE_OPENMP=${USE_OPENMP} USE_MICRO=${USE_MICRO} CMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} USE_CUDA=${USE_CUDA} USE_GTEST=${USE_GTEST} CMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE}")
message(STATUS "tvm_SOURCE_DIR=${tvm_SOURCE_DIR}")
message(STATUS "tvm_BINARY_DIR=${tvm_BINARY_DIR}")
add_subdirectory(${tvm_SOURCE_DIR} ${tvm_BINARY_DIR} EXCLUDE_FROM_ALL)
Contributor

@xadupre Do we require EXCLUDE_FROM_ALL here? It prevents "tvm" from being installed when "onnxruntime" is installed.

petersalas pushed a commit to octoml/onnxruntime that referenced this pull request Nov 7, 2022
…t null. (microsoft#10211)

* STVM, checks pointers are not null.
* removes submodules tvm
* add missing include(FetchContent)
* add target tvm
* fix stvm test
* extend cgmanifest with dependencies of tvm

(cherry picked from commit 481b96d)
@xadupre xadupre deleted the stvm1 branch January 13, 2023 11:04