Status: this is genuinely an open question; I haven’t looked into it yet
Motivation
Full parameter finetuning (FPFT) is significantly better than LoRA at adapting the weights of a model without catastrophic forgetting. However, a LoRA takes up significantly less memory than a full model, which means that it’s much easier to mix and match LoRAs, to download and share them, etc.
So is there a way to get the best of both worlds?
Method
Take the base model $\Theta_{\mathrm{base}}$ and train it using full parameter finetuning:
\[\Theta_{\mathrm{FPFT}} = \mathtt{train}_{\mathtt{fpft}}(\Theta_{\mathrm{base}})\]
Then, we take the difference between the FPFT model and the base model:
\[\Delta \Theta = \Theta_{\mathrm{FPFT}} - \Theta_{\mathrm{base}}\]
$\Delta \Theta$ now plays the role of a full-rank LoRA (a contradiction in terms, but you get the idea): adding it to the base model gives back the FPFT model. Its only problem is that it is full-rank, and so takes as much memory as the model itself.
But we can decompose each weight matrix in $\Delta \Theta$ using SVD and keep only the $r$ largest singular values together with their singular vectors. This can be cheaply stored and downloaded, and can easily be multiplied back out to a full-size (though still rank-$r$) matrix and added to the base model again.
This is like a LoRA, but trained using full parameter finetuning.
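For a single weight matrix, the idea can be sketched in a few lines of NumPy. This is only a toy illustration, not a tested recipe: the function names `compress_delta` and `apply_delta`, the matrix sizes, and the random "finetuned" weights are all made up for the example, and the update is taken as finetuned minus base so that it can be added back to the base weights.

```python
import numpy as np


def compress_delta(w_base, w_fpft, r):
    """Return factors (a, b) such that a @ b is the best rank-r
    approximation of the update w_fpft - w_base."""
    delta = w_fpft - w_base
    # Thin SVD: delta = u @ diag(s) @ vt, singular values sorted descending.
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    a = u[:, :r] * s[:r]  # (d_out, r), singular values folded into the left factor
    b = vt[:r, :]         # (r, d_in)
    return a, b


def apply_delta(w_base, a, b):
    """Multiply the stored factors back out and add them to the base weights."""
    return w_base + a @ b


# Hypothetical sizes and random stand-ins for real weights.
rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 8
w_base = rng.normal(size=(d_out, d_in))
w_fpft = w_base + 0.01 * rng.normal(size=(d_out, d_in))

a, b = compress_delta(w_base, w_fpft, r)
w_approx = apply_delta(w_base, a, b)

stored = a.size + b.size  # r * (d_out + d_in) values
full = w_base.size        # d_out * d_in values
rel_err = np.linalg.norm(w_fpft - w_approx) / np.linalg.norm(w_fpft - w_base)
print(f"stored {stored} of {full} values, relative error of the update: {rel_err:.3f}")
```

The two factors have the same shapes as the pair of matrices in a LoRA, so storage is $r(d_{\mathrm{out}} + d_{\mathrm{in}})$ values per matrix instead of $d_{\mathrm{out}} d_{\mathrm{in}}$. A random update like the one above has no low-rank structure, so its error is large; whether real FPFT updates compress better is exactly the open question.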
Caveats
- Throwing away $d-r$ singular values (where $d$ is the dimension of the weights) could hurt performance more than just training a low-rank LoRA in the first place; I don’t know
- I don’t know how close DoRA can get to the performance of full parameter finetuning; if it’s very close, then just use DoRA
- FPFT is expensive
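The first caveat can be made slightly more precise for a single weight matrix. Assume $\Delta W$ is the update of one layer, with singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_d$, and $\Delta W_r$ is its truncation to the $r$ largest (these symbols are introduced here for illustration). By the Eckart-Young theorem, the truncated SVD is the best rank-$r$ approximation in Frobenius norm, and the error is given by the discarded singular values:

\[\min_{\operatorname{rank}(X) \le r} \lVert \Delta W - X \rVert_F = \lVert \Delta W - \Delta W_r \rVert_F = \sqrt{\sum_{i > r} \sigma_i^2}\]

This only says how far the weights end up from the FPFT weights, not how much task performance is lost, so it doesn’t settle the question. It does suggest a cheap first check: look at how quickly the singular values of a real FPFT update decay.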
Citation
@misc{muller2024peftsvd,
title={Does PEFT with SVD and full parameter finetuning work?},
author={Sebastian M\"uller},
year={2024},
month={nov},
url={https://github.com/snimu/blog/blob/main/contents/question-does-peft-with-svd-and-full-parameter-finetuning-work/README.md}
}