Implement and benchmark ONNX Runtime for Inference #39
Description
Went with onnx-ecosystem, which is a recent release (a couple of weeks old). Found that nvidia-cuda-docker was not initializing, so I ditched Docker for now and ran this notebook from an environment with PyTorch v1.4.0, Transformers v2.5.1, and ONNX Runtime v1.2.1 (CPU & GPU).
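For reference, the non-Docker environment above could be pinned roughly like this (a sketch assuming pip; the CPU and GPU ONNX Runtime wheels normally should not be installed side by side in one environment, and CUDA-enabled PyTorch wheels may need a platform-specific index):

```shell
# Pins matching the versions used in the run above
pip install torch==1.4.0 transformers==2.5.1

# Pick ONE of these, depending on the target device:
pip install onnxruntime==1.2.1        # CPU build
pip install onnxruntime-gpu==1.2.1    # GPU build
```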
With the variables (max_seq_length=128, etc.) as originally specified, here is the result on GPU:
```
ONNX Runtime inference time: 0.00811
PyTorch Inference time = 0.02096
***** Verifying correctness *****
PyTorch and ORT matching numbers: True
PyTorch and ORT matching numbers: True
```
With max_seq_length=384 and everything else the same, here is the result:
```
ONNX Runtime inference time: 0.0193
PyTorch Inference time = 0.0273
***** Verifying correctness *****
PyTorch and ORT matching numbers: True
PyTorch and ORT matching numbers: True
```
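The "matching numbers" check above amounts to an elementwise tolerance comparison between the PyTorch and ONNX Runtime outputs. A minimal sketch of that kind of check (the tolerance values here are assumptions, not necessarily what the notebook uses):

```python
import numpy as np

def outputs_match(torch_out, ort_out, rtol=1e-3, atol=1e-4):
    """True if the two output arrays agree elementwise within tolerance."""
    return np.allclose(torch_out, ort_out, rtol=rtol, atol=atol)

# Toy arrays standing in for model logits from the two backends
pt_logits = np.array([1.0, 2.0, 3.0])
ort_logits = pt_logits + 1e-6  # tiny numerical drift between runtimes

print(outputs_match(pt_logits, ort_logits))  # True
```

Exact equality is too strict here, since the two runtimes can legitimately differ in the last few decimal places due to operator fusion and reduced-precision kernels.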
Should have more time tomorrow to examine these preliminary results and to iterate further and characterize the differences, including the notebook's variables per_gpu_eval_batch_size and eval_batch_size, both originally set to 1.
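For sweeping variables like batch size, the same timing harness can be reused for both backends. A minimal sketch (the warmup and run counts are arbitrary choices, and the notebook's own timing code may differ):

```python
import time

def avg_inference_time(run_fn, warmup=3, runs=100):
    """Average wall-clock latency in seconds of run_fn over `runs` calls.

    A few warmup calls are done first so one-time costs (CUDA context
    creation, lazy initialization, caches) don't skew the average.
    """
    for _ in range(warmup):
        run_fn()
    start = time.perf_counter()
    for _ in range(runs):
        run_fn()
    return (time.perf_counter() - start) / runs

# Usage with a stand-in workload; in practice run_fn would wrap a call to
# the PyTorch model or to an ONNX Runtime InferenceSession.run(...)
latency = avg_inference_time(lambda: sum(range(10_000)), warmup=1, runs=20)
print(f"avg latency: {latency:.6f} s")
```

Note that for GPU measurements the wrapped call must synchronize the device before returning, otherwise the timer only measures kernel launch overhead.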
At this point I am more familiar with ALBERT_xxlarge inference performance, so eventually I may try to implement it in ONNX for an inference comparison on a larger model.
Here's another max_seq_length=384 run:
Inference-PyTorch-Bert-Model-for-High-Performance-in-ONNX-Runtime_WIP - Jupyter Notebook.pdf
Originally posted by @ahotrod in #23 (comment)