LoRA: Low-Rank Adaptation of Large Language Models
Low-Rank Adaptation (LoRA) freezes the pretrained model weights and injects trainable rank-decomposition matrices into the Transformer architecture, drastically reducing the number of trainable parameters for downstream tasks. Its main advantages: far fewer trainable parameters and much smaller checkpoints than full fine-tuning, no additional inference latency once the low-rank matrices are merged back into the frozen weights, and cheap task switching by swapping small LoRA modules over a single shared base model.
An intuition for rank: if the rows of a matrix are unrelated to one another, they are uncorrelated and carry independent information, and the problem the matrix describes is easier to handle; if they are correlated with each other, much of the matrix is redundant. Mathematics therefore defines the rank of a matrix as the maximum number of linearly independent vectors it contains, which can be read as a measure of how much independent structure it has.

If a matrix encodes structured information, such as an image or a user-item recommendation table, its rows tend to be correlated, and the matrix is generally low-rank.

If the rows of a matrix are strongly correlated, the matrix can be projected onto a lower-dimensional linear subspace; in other words, a few vectors suffice to represent it completely, so it is low-rank.

Formally, let X be an m-by-n numerical matrix with rank rank(X). If rank(X) is much smaller than both m and n, X is called a low-rank matrix. Every row or column of a low-rank matrix can be expressed as a linear combination of the other rows or columns, so the matrix contains a large amount of redundant information. This redundancy can be exploited to recover missing data or to extract features.

The rank thus measures the correlation among the rows and columns of a matrix. If the rows (or columns) are all linearly independent, the matrix is full-rank; in general, the rank equals the number of linearly independent rows (equivalently, columns).

Low-rank versus sparse: low-rank means the matrix has a small rank, while sparse means the matrix has few non-zero entries. If we compute the singular value decomposition of a matrix and arrange all its singular values into a vector, the sparsity of that vector corresponds to the low-rankness of the matrix.

If an image is viewed as a matrix, then the fewer basis vectors it needs, the fewer linearly independent vectors it has, and the smaller its rank. When the rank is far smaller than the matrix dimensions, the image is low-rank. Each row or column of a low-rank matrix can be expressed linearly by the other rows or columns, which means the matrix carries a great deal of redundancy; this redundancy can be used to recover missing image content, remove noise, and repair corrupted image data.
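As a quick numerical illustration (a minimal NumPy sketch; the matrix sizes and the rank are arbitrary), a matrix built as the product of two thin factors has only a handful of non-zero singular values:

```python
import numpy as np

# Build a 100 x 80 matrix as the product of two thin factors,
# so its rank is at most 5 even though it has 8000 entries.
rng = np.random.default_rng(0)
B = rng.standard_normal((100, 5))
A = rng.standard_normal((5, 80))
X = B @ A

print(np.linalg.matrix_rank(X))  # 5

# The vector of singular values is "sparse": only the first 5 are
# significantly larger than zero, which is the low-rank signature.
singular_values = np.linalg.svd(X, compute_uv=False)
print(np.round(singular_values[:8], 3))
```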
Prior work has observed that large pretrained models are over-parameterized: when adapting to a specific task, a PLM may have a low "intrinsic dimension" and can still learn effectively even when the adaptation is projected into a much smaller subspace.
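Concretely, LoRA constrains the task-specific weight update to a low-rank decomposition. For a frozen pretrained weight matrix W_0 of shape d x k, the adapted forward pass is (following the LoRA paper, where alpha is a constant scaling hyperparameter and r is the chosen rank):

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Only A and B are trained: A is initialized with a random Gaussian and B with zeros, so the update BA is zero at the start of training. After training, BA can be merged into W_0, so inference adds no extra latency.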
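To make the mechanism concrete, here is a minimal from-scratch sketch of a LoRA-style linear layer in PyTorch. This is illustrative only, not the loralib implementation; the class name LoRALinear, the initialization details, and the default hyperparameters are assumptions made for this example:

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        # Frozen pretrained weight (in practice, loaded from the base model).
        self.weight = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))

        # Trainable low-rank factors: A starts as a small Gaussian, B as zeros,
        # so the update B @ A is zero before any training step.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.weight.T                       # frozen path
        update = (x @ self.lora_A.T) @ self.lora_B.T   # low-rank path
        return base + self.scaling * update

# Usage: only lora_A and lora_B receive gradients.
layer = LoRALinear(768, 768, r=8)
y = layer(torch.randn(2, 768))
```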
LoRA is packaged in the authors' `loralib` library, which currently provides LoRA counterparts for `nn.Linear`, `nn.Embedding`, and `nn.Conv2d`, plus a `MergedLinear` layer for fused attention projections. Installing `loralib` is simply:

```bash
pip install loralib
# Alternatively
# pip install git+https://github.com/microsoft/LoRA
```
You can choose to adapt some layers by replacing them with counterparts implemented in `loralib`. We only support `nn.Linear`, `nn.Embedding`, and `nn.Conv2d` for now. We also support a `MergedLinear` for cases where a single `nn.Linear` represents more than one layer, such as in some implementations of the attention `qkv` projection (see Additional Notes for more).

```python
# ===== Before =====
# layer = nn.Linear(in_features, out_features)
# ===== After ======
import loralib as lora
# Add a pair of low-rank adaptation matrices with rank r=16
layer = lora.Linear(in_features, out_features, r=16)
```

Before the training loop begins, mark only the LoRA parameters as trainable:

```python
import loralib as lora
model = BigModel()
# This sets requires_grad to False for all parameters without the string "lora_" in their names
lora.mark_only_lora_as_trainable(model)
# Training loop
for batch in dataloader:
    ...
```
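Since `mark_only_lora_as_trainable` leaves `requires_grad=True` only on the LoRA parameters (and any biases you opt into), the optimizer can simply be built over the parameters that still require gradients. A minimal sketch; the optimizer choice and learning rate here are arbitrary, not a loralib recommendation:

```python
import torch

# Only parameters that still require gradients (the LoRA factors, plus any
# biases marked trainable) are handed to the optimizer; frozen weights are skipped.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,
)
```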
When saving a checkpoint, generate a `state_dict` that only contains LoRA parameters:

```python
# ===== Before =====
# torch.save(model.state_dict(), checkpoint_path)
# ===== After =====
torch.save(lora.lora_state_dict(model), checkpoint_path)
```

When loading a checkpoint using `load_state_dict`, be sure to set `strict=False`:

```python
# Load the pretrained checkpoint first
model.load_state_dict(torch.load('ckpt_pretrained.pt'), strict=False)
# Then load the LoRA checkpoint
model.load_state_dict(torch.load('ckpt_lora.pt'), strict=False)
```

Additional notes:

While we focus on a simple yet effective setup in our examples, namely adapting only the `q` and `v` projections in a Transformer, LoRA can be applied to any subset of the pre-trained weights. We encourage you to explore different configurations, such as adapting the embedding layer by replacing `nn.Embedding` with `lora.Embedding` and/or adapting the MLP layers. It is very likely that the optimal configuration varies for different model architectures and tasks.
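For instance, adapting the embedding layer might look like the following sketch. The vocabulary and hidden sizes are placeholders, and the call assumes `lora.Embedding` mirrors the `nn.Embedding` constructor with an extra rank argument:

```python
import loralib as lora

# ===== Before =====
# embedding = nn.Embedding(vocab_size, hidden_dim)
# ===== After =====
vocab_size, hidden_dim = 50257, 768  # placeholder sizes
embedding = lora.Embedding(vocab_size, hidden_dim, r=8)
```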
Some Transformer implementations use a single `nn.Linear` for the projection matrices for query, key, and value. If one wishes to constrain the rank of the updates to the individual matrices, one has to either break it up into three separate matrices or use `lora.MergedLinear`. Make sure to modify the checkpoint accordingly if you choose to break up the layer.

```python
# ===== Before =====
# qkv_proj = nn.Linear(d_model, 3*d_model)
# ===== After =====
# Break it up (remember to modify the pretrained checkpoint accordingly)
q_proj = lora.Linear(d_model, d_model, r=8)
k_proj = nn.Linear(d_model, d_model)
v_proj = lora.Linear(d_model, d_model, r=8)
# Alternatively, use lora.MergedLinear (recommended)
qkv_proj = lora.MergedLinear(d_model, 3*d_model, r=8, enable_lora=[True, False, True])
```

Training bias vectors in tandem with LoRA might be a cost-efficient way to squeeze out extra task performance, and `loralib` makes this easy to try. You can mark some biases as trainable by passing "all" or "lora_only" to `bias=` when calling `mark_only_lora_as_trainable`. Remember to pass the corresponding `bias=` argument to `lora_state_dict` when saving a checkpoint.

```python
# ===== Before =====
# lora.mark_only_lora_as_trainable(model) # Not training any bias vectors
# ===== After =====
# Training all bias vectors associated with modules we apply LoRA to
lora.mark_only_lora_as_trainable(model, bias='lora_only')
# Alternatively, we can train *all* bias vectors in the model, including LayerNorm biases
lora.mark_only_lora_as_trainable(model, bias='all')
# When saving a checkpoint, use the same bias= ('all' or 'lora_only')
torch.save(lora.lora_state_dict(model, bias='all'), checkpoint_path)
```

Calling `model.eval()` will trigger the merging of LoRA parameters with the corresponding pretrained ones, which eliminates additional latency for subsequent forward passes. Calling `model.train()` again will undo the merge. This can be disabled by passing `merge_weights=False` to LoRA layers.
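As a small usage sketch (layer sizes are placeholders; `merge_weights=True` is the merging behavior described above):

```python
import loralib as lora

layer = lora.Linear(1024, 1024, r=16, merge_weights=True)

layer.eval()   # merges B @ A into the frozen weight: no extra latency at inference
layer.train()  # un-merges so the low-rank factors can continue training

# Keep the low-rank factors permanently separate instead
layer_no_merge = lora.Linear(1024, 1024, r=16, merge_weights=False)
```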