Question: Does PEFT with SVD and full parameter finetuning work?

Status: this is genuinely a question; I haven’t looked into it yet.

Motivation

Full parameter finetuning (FPFT) is significantly better than LoRA at adapting the weights of a model without catastrophic forgetting. However, a LoRA takes up significantly less memory than a full model, which means that it’s much easier to mix and match LoRAs, to download and share them, etc.

So is there a way to get the best of both worlds?

Method

Take the base model $\Theta_{\mathrm{base}}$ and train it using full parameter finetuning:

\[\Theta_{\mathrm{FPFT}} = \mathrm{\mathtt{train}_\mathtt{fpft}}(\Theta_{\mathrm{base}})\]

Then, we take the difference between the base model and the FPFT model:

\[\Delta \Theta = \Theta_{\mathrm{FPFT}} - \Theta_{\mathrm{base}}\]

$\Delta \Theta$ now plays the role of a full-rank LoRA (which doesn’t really make sense, but you get the idea). Its only problem is that it is full-rank, and so takes a lot of memory.

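But we can decompose $\Delta \Theta$ using SVD (separately for each weight matrix) and keep only the $r$ largest singular values together with their singular vectors. These low-rank factors can be cheaply stored and downloaded, and can also be easily multiplied back out into a full-size update and added to the base model again.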
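
To make the storage saving concrete, here is the same step written out for a single weight matrix. The symbols $\Delta W$, $U$, $\Sigma$, $V$, $m$ and $n$ are my own notation for this sketch, with $\Delta W \in \mathbb{R}^{m \times n}$ standing for one matrix inside $\Delta \Theta$:

\[\Delta W = U \Sigma V^\top \approx U_r \Sigma_r V_r^\top = \Delta W_r\]

Here $U_r \in \mathbb{R}^{m \times r}$ and $V_r \in \mathbb{R}^{n \times r}$ hold the singular vectors belonging to the $r$ largest singular values, and $\Sigma_r$ is the diagonal matrix of those singular values. Storing $U_r \Sigma_r$ and $V_r$ takes $r(m + n)$ numbers instead of the $mn$ needed for the dense matrix. As an illustrative example, for $m = n = 4096$ and $r = 16$ that is 131,072 numbers instead of 16,777,216, a factor of 128.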

This is like a LoRA, but trained using full parameter finetuning.
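
A minimal sketch of the compress and expand steps for a single weight matrix, assuming NumPy. The function names `compress_delta` and `apply_delta` are made up for illustration. The "fine-tuned" weights here are a toy construction (base plus a low-rank update plus small noise), so the small error it reports says nothing about how low-rank real FPFT deltas are.

```python
import numpy as np


def compress_delta(w_base, w_fpft, r):
    """Return rank-r factors (a, b) such that a @ b approximates w_fpft - w_base."""
    delta = w_fpft - w_base
    # Thin SVD: delta = u @ diag(s) @ vt, singular values sorted in descending order.
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    a = u[:, :r] * s[:r]  # shape (m, r); singular values folded into the left factor
    b = vt[:r, :]         # shape (r, n)
    return a, b


def apply_delta(w_base, a, b):
    """Expand the factors to a dense update and add it to the base weights."""
    return w_base + a @ b


rng = np.random.default_rng(0)
m, n, r = 64, 48, 4

w_base = rng.standard_normal((m, n))
# Toy stand-in for a fine-tuned matrix (an assumption, not real training).
low_rank_update = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
w_fpft = w_base + low_rank_update + 1e-3 * rng.standard_normal((m, n))

a, b = compress_delta(w_base, w_fpft, r)
w_restored = apply_delta(w_base, a, b)

# Error of the restored weights, relative to the size of the full update.
rel_err = np.linalg.norm(w_restored - w_fpft) / np.linalg.norm(w_fpft - w_base)
stored = a.size + b.size  # numbers saved on disk: r * (m + n)
full = w_base.size        # numbers in the dense update: m * n

print(f"relative error: {rel_err:.2e}")
print(f"stored {stored} numbers instead of {full}")
```

For a real model, the same two functions would be applied to every weight matrix that FPFT changed, possibly with a different $r$ per matrix.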

Caveats

Throwing away $d-r$ singular values (where $d$ is the number of singular values of a weight matrix, i.e. its smaller dimension) could hurt performance more than just training a low-rank LoRA in the first place. I don’t know.
I don’t know how close DoRA can get to the performance of full parameter finetuning; if it’s very close, then just use DoRA.
FPFT is expensive.
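
On the first caveat, at least the error in weight space is easy to quantify. For a single weight matrix $\Delta W \in \mathbb{R}^{m \times n}$ of $\Delta \Theta$ (my notation, not defined above) with singular values $\sigma_1 \geq \sigma_2 \geq \dots$ and rank-$r$ truncation $\Delta W_r$, the Eckart-Young theorem gives

\[\lVert \Delta W - \Delta W_r \rVert_F = \sqrt{\sum_{i=r+1}^{\min(m,n)} \sigma_i^2}\]

and no other matrix of rank at most $r$ gets closer to $\Delta W$ in the Frobenius norm. So the discarded singular values show directly how much of the update is lost. What this does not show is how much task performance is lost, since a small error in the weights need not mean a small change in the model’s outputs.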

Citation

@misc{muller2024peftsvd,
    title={Does PEFT with SVD and full parameter finetuning work?},
    author={Sebastian M\"uller},
    year={2024},
    month={nov},
    url={https://github.com/snimu/blog/blob/main/contents/question-does-peft-with-svd-and-full-parameter-finetuning-work/README.md}
}