Getting Started with Tribuo: A Java Machine Learning Library
Tribuo is an open-source machine learning (ML) library for Java, developed by Oracle Labs. Designed for production-grade ML systems, it offers tools for building, training, evaluating, and deploying ML models across a wide range of tasks, including classification, regression, and clustering. This article explores some of Tribuo's features and builds an example application that performs a classification task on a sample dataset.
1. An Overview of the Tribuo Library
Tribuo is a modern machine learning library developed by Oracle Labs and written entirely in Java. It is designed to simplify and clarify the machine learning process for Java developers by offering native integration with the JVM and strong type safety. Unlike many libraries built in Python or C++, Tribuo allows developers to build, train, and deploy machine learning models entirely within Java applications.
This makes it a practical, production-ready solution that eliminates the need to switch to other languages or depend on external tools. At its core, Tribuo provides support for standard machine learning tasks such as:
- Classification
- Regression
- Clustering
Tribuo also includes:
- A consistent API across tasks
- Out-of-the-box support for CSV-based data loading
- Integration with popular ML engines like XGBoost, TensorFlow, and ONNX Runtime
- Tools for model evaluation and inspection (e.g., confusion matrices, accuracy, etc.)
As this Oracle blog points out, “ML libraries for the JVM often lack strong typing, trackability, or production readiness.” Tribuo solves these issues by offering type-safe generics, transparent model metadata, and easy deployment options, all within a standard Java project.
2. Project Setup for Tribuo
To use Tribuo in a Java project, you will need to include the appropriate dependencies in your pom.xml.
<dependency>
    <groupId>org.tribuo</groupId>
    <artifactId>tribuo-all</artifactId>
    <version>4.3.2</version>
    <type>pom</type>
</dependency>
Iris Dataset (CSV File)
Create a file named iris.csv in your src/main/resources folder with this sample content:
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
...
You can download the full Iris dataset from here and format it with the correct headers.
3. Java Code – Classification Example
This section will walk through a Java program that loads a sample dataset (Iris dataset), splits it into training and testing sets, trains a classification model using LibLinear (a fast linear classifier), and evaluates its accuracy. We will use Tribuo’s core APIs for loading data, training models, and evaluating results.
import java.nio.file.Paths;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.tribuo.DataSource;
import org.tribuo.Dataset;
import org.tribuo.Model;
import org.tribuo.MutableDataset;
import org.tribuo.Trainer;
import org.tribuo.classification.Label;
import org.tribuo.classification.LabelFactory;
import org.tribuo.classification.evaluation.LabelEvaluation;
import org.tribuo.classification.evaluation.LabelEvaluator;
import org.tribuo.classification.liblinear.LibLinearClassificationTrainer;
import org.tribuo.data.csv.CSVLoader;
import org.tribuo.evaluation.TrainTestSplitter;
public class IrisClassifier {

    private static final Logger logger = Logger.getLogger(IrisClassifier.class.getName());

    public static void main(String[] args) throws Exception {
        logger.info("=== Starting Iris Classification ===");

        // Load the Iris dataset
        var irisPath = Paths.get("src/main/resources/iris.csv");
        LabelFactory labelFactory = new LabelFactory();
        CSVLoader<Label> csvLoader = new CSVLoader<>(labelFactory);
        DataSource<Label> dataSource = csvLoader.loadDataSource(irisPath, "species");

        // Split the dataset into training (70%) and testing (30%)
        TrainTestSplitter<Label> splitter = new TrainTestSplitter<>(dataSource, 0.7, 1234L);
        Dataset<Label> trainSet = new MutableDataset<>(splitter.getTrain());
        Dataset<Label> testSet = new MutableDataset<>(splitter.getTest());

        // Log dataset information
        logger.info(String.format("Train Size: %d, Features: %d, Classes: %s",
                trainSet.size(), trainSet.getFeatureMap().size(), trainSet.getOutputInfo().getDomain()));
        logger.info(String.format("Test Size: %d, Features: %d, Classes: %s",
                testSet.size(), testSet.getFeatureMap().size(), testSet.getOutputInfo().getDomain()));

        // Train the model using LibLinear
        Trainer<Label> trainer = new LibLinearClassificationTrainer();
        logger.info("Training the model...");
        Model<Label> model = trainer.train(trainSet);

        // Evaluate model on train and test sets
        LabelEvaluator evaluator = new LabelEvaluator();
        logger.info("Evaluating on train set...");
        LabelEvaluation trainEval = evaluator.evaluate(model, trainSet);
        logger.log(Level.INFO, "Train Accuracy: {0}", trainEval.accuracy());
        logger.log(Level.INFO, "Train Confusion Matrix:\n{0}", trainEval.getConfusionMatrix());

        logger.info("Evaluating on test set...");
        var testEval = evaluator.evaluate(model, testSet);
        logger.log(Level.INFO, "Test Accuracy: {0}", testEval.accuracy());
        logger.log(Level.INFO, "Test Confusion Matrix:\n{0}", testEval.getConfusionMatrix());
    }
}
3.1 How the Code Works
Data Loading
Tribuo uses DataSource<Label> to load data. The generic type <Label> indicates a classification task; for regression we would use <Regressor> instead. The CSVLoader reads the Iris dataset from the CSV file, and the last column, species, is used as the label (the value we want to predict).
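To illustrate what this strong typing buys, the toy sketch below (hypothetical types, not Tribuo's actual API) shows how a generic output parameter prevents classification and regression data from being mixed up at compile time:

```java
import java.util.List;

public class TypedDataSketch {
    // A tiny illustration of type-safe outputs: a Dataset<Label> can never be
    // passed where a Dataset<Regressor> is required, so the mistake is caught
    // by the compiler rather than at runtime.
    interface Output {}
    record Label(String name) implements Output {}
    record Regressor(double value) implements Output {}
    record Example<T extends Output>(double[] features, T output) {}
    record Dataset<T extends Output>(List<Example<T>> examples) {}

    public static void main(String[] args) {
        Dataset<Label> classification = new Dataset<>(List.of(
                new Example<>(new double[]{5.1, 3.5, 1.4, 0.2}, new Label("Iris-setosa"))));
        System.out.println(classification.examples().get(0).output().name());
        // Dataset<Regressor> regression = classification; // would not compile
    }
}
```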
Splitting the Dataset
The TrainTestSplitter is used to divide the dataset into two parts: 70% for training and 30% for testing, allowing the model to learn from one portion of the data while reserving the other for evaluating its performance on unseen examples.
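Conceptually, the splitter shuffles the data using the given seed and then cuts it at the train fraction, so the same seed always reproduces the same split. Here is a plain-Java sketch of that idea (an illustration, not Tribuo's actual implementation):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SplitSketch {
    // Shuffle with a fixed seed, then cut at trainFraction; the same seed
    // always produces the same train/test partition.
    public static <T> List<List<T>> split(List<T> data, double trainFraction, long seed) {
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) (shuffled.size() * trainFraction);
        return List.of(shuffled.subList(0, cut), shuffled.subList(cut, shuffled.size()));
    }

    public static void main(String[] args) {
        // 150 rows, like the Iris dataset, split 70/30
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 150; i++) data.add(i);
        List<List<Integer>> parts = split(data, 0.7, 1234L);
        System.out.println("Train: " + parts.get(0).size() + ", Test: " + parts.get(1).size());
        // prints "Train: 105, Test: 45"
    }
}
```

With 150 rows this yields the 105/45 split seen in the sample output below.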
Training
We use LibLinearClassificationTrainer. The trainer learns patterns from the training data and returns a Model<Label> object.
Evaluating
We use LabelEvaluator to measure accuracy and print a confusion matrix, which shows how many labels were predicted correctly or incorrectly.
Sample Output
When you run the program, you will see something like:
INFO: Train Size: 105, Features: 4, Classes: [Iris-versicolor, Iris-virginica, Iris-setosa]
INFO: Test Size: 45, Features: 4, Classes: [Iris-versicolor, Iris-virginica, Iris-setosa]
INFO: Training the model...
INFO: Evaluating on train set...
INFO: Train Accuracy: 0.9809523809523809
INFO: Train Confusion Matrix:
Iris-versicolor Iris-virginica Iris-setosa
Iris-versicolor 37 1 0
Iris-virginica 1 31 0
Iris-setosa 0 0 35
INFO: Evaluating on test set...
INFO: Test Accuracy: 0.8888888888888888
INFO: Test Confusion Matrix:
Iris-versicolor Iris-virginica Iris-setosa
Iris-versicolor 10 2 0
Iris-virginica 3 15 0
Iris-setosa 0 0 15
Each log entry corresponds to a major stage: dataset loading, model training, evaluation, and accuracy reporting. The output includes details about the number of records in the train and test sets, the discovered classes, and a confusion matrix that summarizes how many examples were correctly or incorrectly classified.
Understanding the Training Output
The training accuracy is approximately 98.1%, meaning the model correctly classified nearly all the training examples. The confusion matrix shows that:
- 37 out of 38 Iris-versicolor samples were correctly identified.
- 31 out of 32 Iris-virginica samples were correctly identified.
- All 35 Iris-setosa samples were predicted perfectly.
This strong performance suggests that the model learned patterns in the training data very well. However, high training accuracy alone doesn’t guarantee good performance on new, unseen data—which is why we evaluate the test set.
Understanding the Testing Output
The test accuracy is 88.9%, indicating that about 40 out of 45 test samples were correctly classified. The confusion matrix reveals a few misclassifications:
- 2 Iris-versicolor flowers were incorrectly predicted as Iris-virginica.
- 3 Iris-virginica flowers were incorrectly classified as Iris-versicolor.
- All 15 Iris-setosa flowers were correctly classified, again demonstrating that this class is the easiest to distinguish in the Iris dataset.
While the model performs very well overall, some confusion between the versicolor and virginica classes shows that additional training data might improve accuracy further.
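The reported accuracy can be recomputed directly from the confusion matrix: the diagonal entries are the correct predictions, and dividing their sum by the total number of examples gives the accuracy. A minimal sketch:

```java
public class ConfusionMatrixAccuracy {
    // Accuracy = sum of the diagonal (correct predictions) / sum of all entries.
    public static double accuracy(int[][] matrix) {
        int correct = 0, total = 0;
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                total += matrix[i][j];
                if (i == j) correct += matrix[i][j];
            }
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        // Test confusion matrix from the sample output above
        // (rows/columns: versicolor, virginica, setosa)
        int[][] cm = {{10, 2, 0}, {3, 15, 0}, {0, 0, 15}};
        System.out.println("Test accuracy: " + accuracy(cm)); // 40/45 = 0.888...
    }
}
```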
4. Saving and Loading a Tribuo Model
Tribuo models are fully Serializable, meaning they can be saved and loaded using standard Java I/O streams. This makes it easy to persist trained models to disk and reuse them later, without retraining.
In this section, we demonstrate how to save a trained model using ObjectOutputStream, and how to safely restore it using ObjectInputStream.
Saving the Trained Model to Disk
Once the model has been trained, it can be stored to disk so it can be reused later without needing to retrain. Tribuo models implement Java’s Serializable interface, allowing them to be saved using standard ObjectOutputStream. This is useful for deploying trained models in production environments or sharing them across different systems or workflows.
File modelFile = new File("src/main/resources/irisModel.ser");
try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(modelFile))) {
    oos.writeObject(model);
}
In this snippet, the trained model is serialized and written to a file named irisModel.ser in the resources directory. This file contains not only the model’s learned parameters but also its feature mappings, provenance metadata, and output information—everything needed to restore the model later and make predictions.
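The same ObjectOutputStream/ObjectInputStream pattern works for any Serializable object. The self-contained sketch below uses a stand-in record and an in-memory byte array instead of a Tribuo model and a file, but the round-trip is structurally identical:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationRoundTrip {
    // A stand-in for a trained model; Tribuo's Model implements Serializable
    // and is written and read back with the same streams.
    public record DummyModel(String name, double[] weights) implements Serializable {}

    public static DummyModel roundTrip(DummyModel model)
            throws IOException, ClassNotFoundException {
        // Serialize to bytes (a FileOutputStream would work identically)
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buffer)) {
            oos.writeObject(model);
        }
        // Deserialize back into memory
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            return (DummyModel) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        DummyModel restored = roundTrip(new DummyModel("iris", new double[]{0.1, 0.2}));
        System.out.println("Restored: " + restored.name()); // prints "Restored: iris"
    }
}
```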
Loading a Serialized Model from Disk
To reuse a model without retraining, you can load it from the file where it was previously saved. Since Tribuo models are Serializable, this can be done using a standard ObjectInputStream. The code snippet below shows how to open the serialized file and read the model object back into memory.
File newModelFile = new File("src/main/resources/irisModel.ser");
Model<?> loadedModel;
try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream(newModelFile))) {
    loadedModel = (Model<?>) ois.readObject();
}
This snippet opens the irisModel.ser file and deserializes its contents back into a Model<?> object. The use of ObjectInputStream ensures the model is restored with all of its original configuration, metadata, and learned parameters. At this point, the model is ready to be validated and used for evaluation or prediction.
Verifying That the Loaded Model Is Valid
Once a model is loaded from disk, it’s essential to verify that it’s valid and matches the expected output type. This is particularly relevant in Tribuo, where models are generically typed (e.g. Model<Label> for classification tasks). Since Java’s type information is erased at runtime, we call the model’s validate method to confirm the type.
if (loadedModel.validate(Label.class)) {
    // Safe to cast: validate confirmed the output type at runtime
    @SuppressWarnings("unchecked")
    Model<Label> validModel = (Model<Label>) loadedModel;
    logger.info("Model loaded and validated successfully.");
} else {
    logger.severe("Loaded model does not match expected type: Label.");
}
This code uses model.validate(Label.class) to confirm that the deserialized model is indeed a classification model that outputs Label. If the type is valid, the model is safely cast and ready for prediction or evaluation.
Sample Output
INFO: Model loaded and validated successfully.
If the model’s type doesn’t match (for example, if it was a regression model instead of classification), you would see:
SEVERE: Loaded model does not match expected type: Label.
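The validate-then-cast pattern is needed because Java erases generic type parameters at runtime: a deserialized object is just a Model<?> until its output class is checked. The generic sketch below (plain Java, not Tribuo's API) shows the same idea of checking a runtime class before casting:

```java
import java.util.Optional;

public class TypeSafeCast {
    // Mirrors the validate-then-cast pattern: confirm the runtime class before
    // casting, since generic parameters (e.g. Model<Label>) are erased at runtime.
    public static <T> Optional<T> castIfInstance(Object value, Class<T> type) {
        return type.isInstance(value) ? Optional.of(type.cast(value)) : Optional.empty();
    }

    public static void main(String[] args) {
        Object deserialized = "looks like a string";
        System.out.println(castIfInstance(deserialized, Integer.class).isPresent()); // false
        System.out.println(castIfInstance(deserialized, String.class).isPresent());  // true
    }
}
```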
Making a Prediction with the Loaded Model
After confirming that the loaded model is valid, you can use it to make predictions on new data. To do this, create an Example<Label> with the appropriate feature names and values. The model will return a Prediction<Label> object, which includes the predicted class label and confidence scores.
// Create a new example input with known feature values.
// The Label here is only a placeholder output, since the true class
// is unknown at prediction time.
Example<Label> input = new ArrayExample<>(
        new Label("Unknown"),
        new String[]{"sepal_length", "sepal_width", "petal_length", "petal_width"},
        new double[]{5.1, 3.5, 1.4, 0.2});
// Make prediction using the validated model
Prediction<Label> prediction = validModel.predict(input);
// Output the predicted label and confidence scores
logger.info("Predicted class: " + prediction.getOutput().getLabel());
logger.info("Full prediction: " + prediction.toString());
In this code snippet, an input example is created using the same feature format as the Iris dataset. This example is then passed to the model’s predict() method to generate a prediction, which returns both the predicted class label and the confidence scores for each possible class.
Sample Output
INFO: Predicted class: Iris-setosa
INFO: Full prediction: Prediction(maxLabel=(Iris-setosa,1.355949375551975),outputScores={Iris-versicolor=(Iris-versicolor,-0.8362544145801056),Iris-virginica=(Iris-virginica,-6.238939707445729),Iris-setosa=(Iris-setosa,1.355949375551975)})
The first line shows that the model classified the input as Iris-setosa based on its internal scoring. This class received the highest score, making it the most likely match.
The second line details the raw scores for all classes. Iris-setosa had the highest score (1.36), followed by Iris-versicolor (−0.84) and Iris-virginica (−6.24). These scores reflect the model’s confidence. Higher positive values indicate stronger matches, while lower or negative values suggest weaker ones.
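These raw scores are decision-function values, not probabilities. If you wanted to present them as a normalized distribution, one standard transformation is the softmax; the sketch below is purely illustrative and is not something Tribuo applies to this LibLinear output:

```java
public class SoftmaxSketch {
    // Softmax: exponentiate each score and normalize so the values sum to 1.
    // Subtracting the maximum first avoids overflow for large scores.
    public static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) max = Math.max(max, s);
        double sum = 0.0;
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = Math.exp(scores[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) out[i] /= sum;
        return out;
    }

    public static void main(String[] args) {
        // Raw scores from the sample prediction: versicolor, virginica, setosa
        double[] probs = softmax(new double[]{-0.836254, -6.238940, 1.355949});
        System.out.printf("setosa share of the distribution: %.3f%n", probs[2]);
    }
}
```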
5. Conclusion
This guide demonstrated how Tribuo simplifies machine learning in Java, offering clear APIs for data handling, training, evaluation, and model persistence. We showed how to serialize a trained model, validate its output type after loading, and use it to make predictions on new data. Tribuo bridges the gap between Java development and modern ML workflows without relying on external Python dependencies.
6. Download the Source Code
This article provided a guide to machine learning (ML) in Java using Tribuo.
You can download the full source code of this example here: java ml tribuo guide