This example demonstrates the usage of Microsoft's Phi-3 Vision model.
Phi-3 Vision ONNX is a multimodal model that combines vision and language processing. It uses three interconnected ONNX models:
- Vision model: Processes images to extract visual features
- Text embedding model: Embeds input text into a format compatible with the model
- Text generation model: Produces text outputs based on the combined visual and textual inputs
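How the vision and text pieces come together can be sketched in plain Rust. This is an illustration only, not the actual Phi-3 Vision tokenizer behavior: it assumes a single placeholder token id in the text marks where the vision model's output embeddings get spliced into the text-embedding sequence.

```rust
/// Splice image embeddings into a text-embedding sequence at the position
/// of an image placeholder token. The token id and vector sizes below are
/// illustrative, not the real Phi-3 Vision values.
fn fuse_embeddings(
    text_tokens: &[i64],
    text_embeds: &[Vec<f32>],  // one embedding vector per token
    image_embeds: &[Vec<f32>], // visual features from the vision model
    image_token_id: i64,       // hypothetical placeholder id
) -> Vec<Vec<f32>> {
    let mut fused = Vec::new();
    for (tok, emb) in text_tokens.iter().zip(text_embeds) {
        if *tok == image_token_id {
            // Replace the placeholder with the full run of image embeddings.
            fused.extend(image_embeds.iter().cloned());
        } else {
            fused.push(emb.clone());
        }
    }
    fused
}

fn main() {
    let text_tokens = vec![1, -100, 2]; // -100 marks the image slot here
    let text_embeds = vec![vec![0.1], vec![0.0], vec![0.2]];
    let image_embeds = vec![vec![0.5], vec![0.6]];
    let fused = fuse_embeddings(&text_tokens, &text_embeds, &image_embeds, -100);
    println!("fused sequence length = {}", fused.len()); // 2 text + 2 image vectors
}
```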
This multi-model structure requires a coordinated process:

1. Image processing:
   - Preprocess the input image
   - Pass it through the vision ONNX model to extract visual features
2. Text embedding:
   - Tokenize the input text
   - Process it with the text embedding ONNX model
3. Multimodal fusion:
   - Combine the visual features and text embeddings into a single input
4. Text generation:
   - Feed the combined input into the text generation ONNX model
   - Generate text tokens one by one in an autoregressive manner
   - For each token, use past key/value states to maintain context
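The generation step can be sketched as a plain Rust loop. Here `step`, `KvCache`, and the token ids are stand-ins for the real text-generation ONNX session and its past key/value tensors; the real call feeds input ids plus the cached key/value state and returns logits.

```rust
// Sketch of the autoregressive decode loop. The KV cache is modeled as an
// opaque state that grows by one position per generated token.
const EOS: i64 = 2; // hypothetical end-of-sequence id

struct KvCache {
    len: usize, // placeholder for per-layer key/value tensors
}

// Stand-in for one run of the text-generation model.
fn step(token: i64, cache: &mut KvCache) -> i64 {
    cache.len += 1; // the real model appends K/V entries for this position
    if cache.len >= 5 { EOS } else { token + 1 } // toy "model": emit next id
}

fn generate(first_token: i64, max_new: usize) -> Vec<i64> {
    let mut cache = KvCache { len: 0 };
    let mut out = vec![first_token];
    for _ in 0..max_new {
        let next = step(*out.last().unwrap(), &mut cache);
        out.push(next);
        if next == EOS {
            break; // stop on end-of-sequence
        }
    }
    out
}

fn main() {
    let tokens = generate(10, 16);
    println!("{tokens:?}"); // [10, 11, 12, 13, 14, 2]
}
```

The loop only ever feeds the single most recent token back in; everything earlier is carried by the cache, which is what keeps per-token cost roughly constant.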
The specific configuration for the model can be found in data/genai_config.json.
This example currently only supports single image input.
The performance of ONNX-based LLM inference can be relatively slow, especially on CPU. On an Apple M1 Pro:
- Image+text input (about 300 tokens): ~7 tokens/s
- Text-only input (about 10 tokens): ~5 tokens/s
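Those figures translate directly into wall-clock estimates; for instance, a ~300-token image+text response at ~7 tokens/s takes roughly 43 seconds:

```rust
// Rough wall-clock estimate from the throughput numbers above.
fn estimated_seconds(tokens: u32, tokens_per_sec: f64) -> f64 {
    tokens as f64 / tokens_per_sec
}

fn main() {
    // ~300 tokens at ~7 tokens/s on an Apple M1 Pro (figures from above)
    println!("{:.0} s", estimated_seconds(300, 7.0)); // ≈ 43 s
}
```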
Before running the example, you'll need to download the ONNX model files to the data directory. At present, the SessionBuilder::commit_from_url method doesn't support initialization for models split into .onnx and .onnx.data files, which is the case for the Phi-3 Vision models.
To get started, run the download.sh script from the data directory to download the following files (three split ONNX models plus the tokenizer):
- phi-3-v-128k-instruct-vision.onnx and phi-3-v-128k-instruct-vision.onnx.data
- phi-3-v-128k-instruct-text-embedding.onnx and phi-3-v-128k-instruct-text-embedding.onnx.data
- phi-3-v-128k-instruct-text.onnx and phi-3-v-128k-instruct-text.onnx.data
- tokenizer.json
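It can help to verify the download completed before running the example. A small standard-library check over the file names listed above (the `data` directory name matches this example's layout) might look like:

```rust
use std::path::Path;

// Report which of the required model files are missing from a directory.
fn missing_files(dir: &str, names: &[&str]) -> Vec<String> {
    names
        .iter()
        .filter(|name| !Path::new(dir).join(name).exists())
        .map(|name| name.to_string())
        .collect()
}

fn main() {
    let required = [
        "phi-3-v-128k-instruct-vision.onnx",
        "phi-3-v-128k-instruct-vision.onnx.data",
        "phi-3-v-128k-instruct-text-embedding.onnx",
        "phi-3-v-128k-instruct-text-embedding.onnx.data",
        "phi-3-v-128k-instruct-text.onnx",
        "phi-3-v-128k-instruct-text.onnx.data",
        "tokenizer.json",
    ];
    let missing = missing_files("data", &required);
    if missing.is_empty() {
        println!("all model files present");
    } else {
        println!("missing: {missing:?}");
    }
}
```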
Once the model files are downloaded, you can run the example using Cargo:
cargo run --example phi-3-vision