@oamsalemd, @idob8 - Winter 2024
- Introduction
- Compression
- LoRA
- Project goal
- Method
- Experiments and results
- Conclusions
- Future work
- How to run
- Ethics Statement
Compressing pre-trained neural networks reduces memory usage, speeds up inference, and enables deployment on resource-constrained devices. It improves model efficiency and real-time performance and lowers energy consumption, making deep learning models more practical for diverse computing environments. We tested multiple model compression methods that can potentially reduce memory and compute requirements, and measured their effect on the pre-trained model.
Data type quantization - in this method we use a more compact data type to store the model weights. This technique can potentially save memory (capacity and bandwidth).
Sparsity - in this method we use "sparse" weight matrices: within any given block, only 1 cell is allowed to have a non-zero value. This technique can potentially save memory (capacity and bandwidth) and also reduce the number of effective multiplication instructions.
On the other hand, both methods can potentially damage the accuracy of the model and might require retraining it.
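The two compression methods described above can be illustrated with a minimal NumPy sketch. This is not the project's implementation: the function names `quantize_int1` and `sparsify_blocks` are hypothetical, and two details are assumptions made for the example, namely that 1-bit quantization keeps the sign of each weight with a single shared scale, and that the one surviving cell in each block is the one with the largest magnitude.

```python
import numpy as np

def quantize_int1(w):
    """1-bit quantization (assumed scheme): keep the sign, share one float scale."""
    scale = np.abs(w).mean()
    signs = np.where(w >= 0, 1.0, -1.0)
    return signs * scale

def sparsify_blocks(w, block=4):
    """Zero all but one cell per block x block tile (assumed: keep the largest magnitude)."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    br, bc = rows // block, cols // block
    # Rearrange so that every tile becomes one flat row of block*block values.
    flat = w.reshape(br, block, bc, block).transpose(0, 2, 1, 3).reshape(br, bc, block * block)
    keep = np.abs(flat).argmax(axis=-1)
    mask = np.zeros(flat.shape, dtype=bool)
    np.put_along_axis(mask, keep[..., None], True, axis=-1)
    sparse = np.where(mask, flat, 0.0)
    # Undo the rearrangement to get back the original matrix layout.
    return sparse.reshape(br, bc, block, block).transpose(0, 2, 1, 3).reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 12))
print(np.count_nonzero(sparsify_blocks(w)), "non-zero cells out of", w.size)
print(np.unique(quantize_int1(w)).size, "distinct values after quantization")
```

Both functions return dense float arrays so the effect on the weights is easy to inspect; an actual memory saving would additionally require a packed storage format.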

LoRA (Low-Rank Adaptation) is a technique for efficiently fine-tuning pre-trained models. The basic idea is to train only a low-rank matrix that is added to the pre-trained weight matrix.[1] Previous work has shown the benefits of LoRA in transfer learning for pre-trained LLMs.[1]
- Given a 'Linear' layer W of in_dim X out_dim, we choose a low rank r s.t. r < in_dim, out_dim.
- We freeze the W matrix, so it remains intact while re-training the model.
- The matrices A (of in_dim X r) and B (of r X out_dim) are initialized.
- We set the new activation to be h = x@(W + a*A@B) for the input x (of 1 X in_dim), and a factor a.
- During training, only the A and B matrices are learned.
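The steps above can be sketched in a few lines of NumPy. The dimensions, the factor value, and the standard deviation used for A are illustrative assumptions, and `lora_forward` is a hypothetical helper name rather than a function from this repository.

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, out_dim, r, a = 16, 10, 4, 0.5   # illustrative sizes and factor (assumed)

W = rng.normal(size=(in_dim, out_dim))         # frozen pre-trained weights
A = rng.normal(scale=0.01, size=(in_dim, r))   # trainable, random init
B = np.zeros((r, out_dim))                     # trainable, zero init

def lora_forward(x, W, A, B, a):
    # Same result as x @ (W + a * A @ B), computed without
    # materializing the full in_dim X out_dim update matrix.
    return x @ W + a * (x @ A) @ B

x = rng.normal(size=(1, in_dim))
h = lora_forward(x, W, A, B, a)

print(h.shape)                          # (1, 10)
print(r * (in_dim + out_dim), "trainable parameters vs", in_dim * out_dim, "frozen")
```

With B initialized to zero, the LoRA branch contributes nothing at the start of training, so the layer initially behaves exactly like the frozen layer.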
Our objective is to combine model compression with LoRA in pre-trained models, in order to reduce model size with minimal damage to model accuracy and minimal retraining. We test the method's efficacy for image classification tasks.
- We used ‘resnet18’ pre-trained on ImageNet1K[2]
- For training we used only a small subset of the original dataset (50,000 images out of 1,281,167)
- The compression methods we tested were:
- Data type quantization to int1.
- Sparsity with block size of 4X4.
- The compression was implemented only on the FC layer of the model.
- Given:
compression ratio was calculated as follows:
Memory and instructions compression ratio for Sparse 4X4 method:
Memory compression ratio for INT1 quantization method:
- We tested the appending of LoRA layer of ranks: [2, 4, 8, 16, 32, 64, 128].
- We tested 2 initialization methods. The first was the initialization suggested in the original LoRA paper[1]: A is drawn from N(0,\sigma^2) and B=0. The second was an SVD decomposition of the difference between the original weight matrix and the compressed one.
- All model parameters except the LoRA parameters were frozen. The LoRA parameters were trained for 10 epochs and the best epoch was chosen (in terms of accuracy on the validation set).
- Hyper-parameters were chosen separately for each rank using Optuna:
- Optimizer, learning rate, batch size, "alpha" factor (LoRA)
- Finally we evaluated the accuracy on a test set for each LoRA rank and for each initialization method.
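The SVD-based initialization mentioned in the list above can be sketched as follows. This is one plausible reading, not necessarily the project's code: the helper name `svd_init` is hypothetical, the stand-in compressed matrix is a simple sign quantization, and splitting the singular values evenly between A and B (via a square root) is an assumption. By the Eckart-Young theorem, the truncated SVD gives the best rank-r approximation of the difference in the Frobenius norm.

```python
import numpy as np

def svd_init(w_orig, w_comp, rank, a=1.0):
    """Return (A, B) so that a * A @ B is the best rank-`rank` fit of w_orig - w_comp."""
    diff = (w_orig - w_comp) / a
    u, s, vt = np.linalg.svd(diff, full_matrices=False)
    root = np.sqrt(s[:rank])              # split singular values between A and B (assumed)
    A = u[:, :rank] * root                # shape: in_dim X rank
    B = root[:, None] * vt[:rank, :]      # shape: rank X out_dim
    return A, B

rng = np.random.default_rng(0)
w_orig = rng.normal(size=(16, 10))
w_comp = np.sign(w_orig) * np.abs(w_orig).mean()   # stand-in for a compressed layer

for rank in (2, 4, 8):
    A, B = svd_init(w_orig, w_comp, rank)
    err = np.linalg.norm(w_orig - (w_comp + A @ B))
    print(f"rank={rank}: reconstruction error {err:.3f}")
```

Under this initialization the compressed layer starts as close to the original weights as the chosen rank allows, instead of starting from the compressed weights alone.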
We expected the graph to be monotonically ascending. One potential explanation for the instability of the results could be that the choice of training hyper-parameters has a big effect on the model’s test accuracy. Even though increasing the LoRA rank increases the number of parameters in the model, we could not always find training hyper-parameters that let the larger model produce better accuracy.
For Sparse 4X4 compression, we can see that increasing the LoRA rank generally improves the model accuracy for the test set. For LoRA rank of 128 with SVD initialization, the experiment showed just a 1.74% accuracy drop, with a ×2.27 compression ratio.
For INT1 quantization, small LoRA ranks have shown significant improvement compared to the quantized-only model’s test accuracy. Unlike Sparse 4X4 compression, we could not see an improvement in the model’s accuracy for larger LoRA ranks. The smallest accuracy drop was for LoRA rank of 128 with paper-suggested initialization. The experiment showed a 5.01% accuracy drop, with a ×2.44 memory compression ratio. The best trade-off was for LoRA rank of 2 with paper-suggested initialization. The experiment showed a 5.98% accuracy drop, with a ×26.91 memory compression ratio.
SVD decomposition initialization showed better and more stable results for Sparse 4X4 compression. For INT1 quantization, this initialization method did not improve the results compared to the paper-suggested initialization.
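The compression ratios quoted above can be reproduced with a short calculation. The exact formulas are not shown in this text, so the sketch below is a reconstruction under stated assumptions: the FC layer of resnet18 is 512 X 1000, the bias and any sparse-index storage are ignored, and both the original weights and the LoRA matrices are stored as 32-bit floats. The function names are hypothetical.

```python
IN_DIM, OUT_DIM = 512, 1000   # resnet18 FC layer shape (bias ignored, assumed)
FP_BITS = 32                  # assumed precision of original and LoRA weights

def lora_params(rank):
    # A is IN_DIM X rank, B is rank X OUT_DIM
    return rank * (IN_DIM + OUT_DIM)

def sparse_ratio(rank, block=4):
    # One non-zero cell per block X block tile; index storage is ignored (assumed).
    dense = IN_DIM * OUT_DIM
    kept = dense // (block * block)
    return dense / (kept + lora_params(rank))

def int1_ratio(rank):
    # Compressed weights cost 1 bit each; LoRA weights stay at FP_BITS bits.
    dense_bits = IN_DIM * OUT_DIM * FP_BITS
    comp_bits = IN_DIM * OUT_DIM * 1 + lora_params(rank) * FP_BITS
    return dense_bits / comp_bits

print(f"Sparse 4X4, r=128: x{sparse_ratio(128):.2f}")   # x2.27
print(f"INT1, r=128:       x{int1_ratio(128):.2f}")     # x2.44
print(f"INT1, r=2:         x{int1_ratio(2):.2f}")       # x26.91
```

These values match the reported ×2.27, ×2.44 and ×26.91, which suggests the assumptions are close to the ones used in the experiments, although that is not confirmed by the text.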
- Increasing LoRA rank generally gives better accuracy, yet not matching the original model’s accuracy.
- Training the LoRA parameters requires little computational effort.
- All tested combinations of LoRA rank and compression method reduce memory usage, and the sparsity method also reduces computation.
- LoRA training is unstable and very sensitive to changes in hyper-parameters.
- Using initialization with SVD decomposition could provide better results.
We believe that our project shows potential for further research into the benefits of combining model compression methods with LoRA. We believe such research could:
- Test the method’s performance for ‘Linear’-rich models (e.g. Transformers, MLP-based, …)
- Explore more compression hyper-parameters (e.g. int8, sparse 3X3, ...)
- Explore more initialization methods for the LoRA matrices
- Apply the method to the DoRA[3] variation and examine the results
- Clone to a new directory:
git clone <URL> <DEST_DIR>
cd /path/to/DEST_DIR
pip install -r requirements.txt
- Download ImageNet subset from: https://www.kaggle.com/datasets/tusonggao/imagenet-validation-dataset/code
- Move the images directory to:
DEST_DIR/../archive/imagenet_validation
python train_evaluate/train_model.py --init {paper_init,svd_init} [ > log.txt]
- --init: determines the LoRA matrices initialization method (default: paper_init)
- Recommended: redirect the output to a log.txt file
- Results will appear in DEST_DIR/results
Description:
- Initiates the resnet18 model, pretrained on ImageNet
- Loads the pre-downloaded ImageNet dataset and splits train/val/test subsets
- Per each compression method (sparse, int1):
- Compresses the model
- Initiates LoRA appended to FC layer(s)
- Sweeps LoRA rank values, and uses Optuna to find the best training hyper-parameters per each rank
- Outputs the results to a dedicated directory
- Results directory contains:
- evaluation.csv: summary of evaluation accuracy for test subset per each LoRA rank
- acc_quant=COMP_TYPE_r=RANK.png: accuracy per epoch (train, validation), for COMP_TYPE (sparse, int1), for RANK (LoRA rank)
- loss_quant=COMP_TYPE=r_RANK.png: loss per epoch (train, validation), for COMP_TYPE (sparse, int1), for RANK (LoRA rank)
- quant=COMP_TYPE_r=RANK_eval_acc=ACCUR.ckpt: model post-training parameters, for COMP_TYPE (sparse, int1), for RANK (LoRA rank), with test accuracy of ACCUR
- quant=COMP_TYPE_r=RANK_optimization_history.html: Optuna trials summary for COMP_TYPE (sparse, int1) and RANK (LoRA rank) hyper-parameter tuning
End-users, deep learning researchers, technology companies, and regulatory bodies.
End-users can benefit from faster and more efficient image classification models, improving user experience. However, there may be concerns about privacy if sensitive information is processed. Deep learning researchers can advance the field with innovative techniques, but they must ensure fairness and transparency in model development and deployment. Technology companies can enhance product performance and reduce resource consumption, yet they need to address potential biases and ensure responsible AI practices. Regulatory bodies play a crucial role in establishing guidelines and standards to protect user rights, promote fairness, and mitigate risks associated with AI technologies.
Prioritizing user privacy and data protection through robust security measures and transparent data handling practices. Mitigating biases in data and algorithms to ensure fairness and equity in classification outcomes. Providing clear explanations and documentation on the use of quantization and LoRA techniques to enhance model transparency and interpretability. Engaging in ongoing dialogue with stakeholders and regulatory bodies to address ethical concerns, promote responsible AI practices, and uphold societal values in AI development and deployment.
[1] Hu, Edward J., et al. "LoRA: Low-Rank Adaptation of Large Language Models." arXiv preprint arXiv:2106.09685 (2021).
[3] Liu, Shih-Yang, et al. "DoRA: Weight-Decomposed Low-Rank Adaptation." arXiv preprint arXiv:2402.09353 (2024).






