Transformers 4.37 Chinese Documentation (Part 6)

ApacheCN_飞龙

Published 2024-06-26 14:45:29

Collected in the column: 信数据得永生

Original: huggingface.co/docs/transformers

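Visual Question Answering

Original: huggingface.co/docs/transformers/v4.37.2/en/tasks/visual_question_answering

Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. The input to models supporting this task is typically a combination of an image and a question, and the output is an answer expressed in natural language.

Some noteworthy use cases for VQA include:

Accessibility applications for visually impaired people.
Education: posing questions about visual materials presented in lectures or textbooks. VQA can also be used in interactive museum exhibits or historical sites.
Customer service and e-commerce: VQA can enhance the user experience by letting users ask questions about products.
Image retrieval: VQA models can be used to retrieve images with specific characteristics. For example, a user can ask "Is there a dog?" to find all images with dogs in a set of images.

In this guide you will learn how to:

Fine-tune a classification VQA model, specifically ViLT, on the Graphcore/vqa dataset.
Use your fine-tuned ViLT for inference.
Run zero-shot VQA inference with a generative model such as BLIP-2.

Fine-tuning ViLT

The ViLT model incorporates text embeddings into a Vision Transformer (ViT), giving it a minimal design for Vision-and-Language Pre-training (VLP). The model can be used for several downstream tasks. For the VQA task, a classifier head (a linear layer on top of the final hidden state of the [CLS] token) is placed on top and randomly initialized. Visual question answering is thus treated as a classification problem.

More recent models, such as BLIP, BLIP-2, and InstructBLIP, treat VQA as a generative task. Later in this guide we illustrate how to use them for zero-shot VQA inference.

Before you begin, make sure you have all the necessary libraries installed.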

pip install -q transformers datasets

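We encourage you to share your model with the community. Log in to your Hugging Face account to upload it to the 🤗 Hub. When prompted, enter your token to log in: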

>>> from huggingface_hub import notebook_login

>>> notebook_login()

Let's define the model checkpoint as a global variable.

>>> model_checkpoint = "dandelin/vilt-b32-mlm"

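Load the data

For illustration purposes, this guide uses a very small sample of the annotated visual question answering Graphcore/vqa dataset. You can find the full dataset on the 🤗 Hub.

As an alternative to the Graphcore/vqa dataset, you can download the same data manually from the official VQA dataset page. If you prefer to follow the tutorial with your custom data, check out the Create an image dataset guide in the 🤗 Datasets documentation.

Let's load the first 200 examples from the validation split and explore the dataset's features: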

>>> from datasets import load_dataset

>>> dataset = load_dataset("Graphcore/vqa", split="validation[:200]")
>>> dataset
Dataset({
    features: ['question', 'question_type', 'question_id', 'image_id', 'answer_type', 'label'],
    num_rows: 200
})

Let's take a look at an example to understand the dataset's features:

>>> dataset[0]
{'question': 'Where is he looking?',
 'question_type': 'none of the above',
 'question_id': 262148000,
 'image_id': '/root/.cache/huggingface/datasets/downloads/extracted/ca733e0e000fb2d7a09fbcc94dbfe7b5a30750681d0e965f8e0a23b1c2f98c75/val2014/COCO_val2014_000000262148.jpg',
 'answer_type': 'other',
 'label': {'ids': ['at table', 'down', 'skateboard', 'table'],
  'weights': [0.30000001192092896,
   1.0,
   0.30000001192092896,
   0.30000001192092896]}}

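The features relevant to the task include:

question: the question to be answered from the image
image_id: the path to the image the question refers to
label: the annotations

We can remove the rest of the features as they won't be necessary: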

>>> dataset = dataset.remove_columns(['question_type', 'question_id', 'answer_type'])

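As you can see, the label feature contains several answers to the same question (called ids here), collected by different human annotators. This is because the answer to a question can be subjective. In this case, the question is "Where is he looking?". Some people annotated this with "down", others with "at table", another one with "skateboard", and so on.

Take a look at the image and consider which answer you would give: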

>>> from PIL import Image

>>> image = Image.open(dataset[0]['image_id'])
>>> image

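Due to the ambiguity of the questions and answers, datasets like this are treated as a multi-label classification problem (as multiple answers may be valid). Moreover, rather than creating a one-hot encoded vector, one creates a soft encoding based on the number of times a certain answer appeared in the annotations.

For instance, in the example above, because the answer "down" is selected far more often than the other answers, it has a score (called weight in the dataset) of 1.0, and the rest of the answers have scores below 1.0.

To later instantiate the model with an appropriate classification head, let's create two dictionaries: one that maps the label name to an integer and one that maps the integer back to the label name: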

>>> import itertools

>>> labels = [item['ids'] for item in dataset['label']]
>>> flattened_labels = list(itertools.chain(*labels))
>>> unique_labels = list(set(flattened_labels))

>>> label2id = {label: idx for idx, label in enumerate(unique_labels)}
>>> id2label = {idx: label for label, idx in label2id.items()}

Now that we have the mappings, we can replace the string answers with their ids, and flatten the dataset for more convenient further preprocessing.

>>> def replace_ids(inputs):
...   inputs["label"]["ids"] = [label2id[x] for x in inputs["label"]["ids"]]
...   return inputs

>>> dataset = dataset.map(replace_ids)
>>> flat_dataset = dataset.flatten()
>>> flat_dataset.features
{'question': Value(dtype='string', id=None),
 'image_id': Value(dtype='string', id=None),
 'label.ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'label.weights': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None)}

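Preprocessing data

The next step is to load a ViLT processor to prepare the image and text data for the model. ViltProcessor wraps a BERT tokenizer and a ViLT image processor into a single convenient processor: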

>>> from transformers import ViltProcessor

>>> processor = ViltProcessor.from_pretrained(model_checkpoint)

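To preprocess the data, we need to encode the images and questions using the ViltProcessor. The processor uses the BertTokenizerFast to tokenize the text and create input_ids, attention_mask, and token_type_ids for the text data. As for the images, the processor leverages ViltImageProcessor to resize and normalize the image, and to create pixel_values and pixel_mask.

All these preprocessing steps are done under the hood; we only need to call the processor. However, we still need to prepare the target labels. In this representation, each element corresponds to a possible answer (label). For correct answers, the element holds their respective score (weight), while the remaining elements are set to zero.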
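The target representation described above can be sketched in plain Python. This is a minimal illustration, not the guide's implementation: the five-entry label2id mapping and the rounded weights are hypothetical stand-ins for the mappings built from the dataset.

```python
# Hypothetical mini label space; the guide derives label2id from the dataset.
label2id = {"at table": 0, "down": 1, "skateboard": 2, "table": 3, "up": 4}


def build_soft_target(answer_ids, weights, num_labels):
    """Return a list of length num_labels: each answer's weight, zeros elsewhere."""
    target = [0.0] * num_labels
    for idx, weight in zip(answer_ids, weights):
        target[idx] = weight
    return target


# Answers and (rounded) weights of the first example shown earlier.
answers = ["at table", "down", "skateboard", "table"]
weights = [0.3, 1.0, 0.3, 0.3]

target = build_soft_target([label2id[a] for a in answers], weights, len(label2id))
print(target)  # [0.3, 1.0, 0.3, 0.3, 0.0]
```

The label "up" never appears among the annotations, so its slot stays at zero, while "down" keeps the full score of 1.0.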

The following function applies the processor to the images and questions and formats the labels as described above:

>>> import torch

>>> def preprocess_data(examples):
...     image_paths = examples['image_id']
...     images = [Image.open(image_path) for image_path in image_paths]
...     texts = examples['question']    

...     encoding = processor(images, texts, padding="max_length", truncation=True, return_tensors="pt")

...     for k, v in encoding.items():
...           encoding[k] = v.squeeze()

...     targets = []

...     for labels, scores in zip(examples['label.ids'], examples['label.weights']):
...         target = torch.zeros(len(id2label))

...         for label, score in zip(labels, scores):
...             target[label] = score

...         targets.append(target)

...     encoding["labels"] = targets

...     return encoding

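To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map function. You can speed up map by setting batched=True to process multiple elements of the dataset at once. At this point, feel free to remove the columns you don't need.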

>>> # 'question_type', 'question_id' and 'answer_type' were already dropped above, so only the remaining columns are listed
>>> processed_dataset = flat_dataset.map(preprocess_data, batched=True, remove_columns=['question', 'image_id', 'label.ids', 'label.weights'])
>>> processed_dataset
Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'pixel_values', 'pixel_mask', 'labels'],
    num_rows: 200
})

As a final step, create a batch of examples using DefaultDataCollator:

>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator()

Train the model

You're ready to start training your model now! Load ViLT with ViltForQuestionAnswering. Specify the number of labels along with the label mappings:

>>> from transformers import ViltForQuestionAnswering

>>> model = ViltForQuestionAnswering.from_pretrained(model_checkpoint, num_labels=len(id2label), id2label=id2label, label2id=label2id)

At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments:

>>> from transformers import TrainingArguments

>>> repo_id = "MariaK/vilt_finetuned_200"

>>> training_args = TrainingArguments(
...     output_dir=repo_id,
...     per_device_train_batch_size=4,
...     num_train_epochs=20,
...     save_steps=200,
...     logging_steps=50,
...     learning_rate=5e-5,
...     save_total_limit=2,
...     remove_unused_columns=False,
...     push_to_hub=True,
... )

Pass the training arguments to Trainer along with the model, dataset, processor, and data collator.

>>> from transformers import Trainer

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     data_collator=data_collator,
...     train_dataset=processed_dataset,
...     tokenizer=processor,
... )

Call train() to fine-tune your model.

>>> trainer.train()

Once training is completed, share your model to the 🤗 Hub with the push_to_hub() method:

>>> trainer.push_to_hub()

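Inference

Now that you have fine-tuned a ViLT model and uploaded it to the 🤗 Hub, you can use it for inference. The simplest way to try out your fine-tuned model for inference is to use it in a Pipeline.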

>>> from transformers import pipeline

>>> pipe = pipeline("visual-question-answering", model="MariaK/vilt_finetuned_200")

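The model in this guide has only been trained on 200 examples, so don't expect a lot from it. Let's see if it at least learned something from the data, and take the first example from the dataset to illustrate inference: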

>>> example = dataset[0]
>>> image = Image.open(example['image_id'])
>>> question = example['question']
>>> print(question)
Where is he looking?
>>> pipe(image, question, top_k=1)
[{'score': 0.5498199462890625, 'answer': 'down'}]

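Even though not very confident, the model has indeed learned something. With more examples and longer training, you'll get far better results!

You can also manually replicate the results of the pipeline if you'd like:

Take an image and a question, and prepare them for the model using the processor from your model.
Forward the result of the preprocessing through the model.
From the logits, get the id of the most likely answer, and find the actual answer in id2label.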

>>> processor = ViltProcessor.from_pretrained("MariaK/vilt_finetuned_200")

>>> image = Image.open(example['image_id'])
>>> question = example['question']

>>> # prepare inputs
>>> inputs = processor(image, question, return_tensors="pt")

>>> model = ViltForQuestionAnswering.from_pretrained("MariaK/vilt_finetuned_200")

>>> # forward pass
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> logits = outputs.logits
>>> idx = logits.argmax(-1).item()
>>> print("Predicted answer:", model.config.id2label[idx])
Predicted answer: down

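Zero-shot VQA

The previous model treated VQA as a classification task. Some recent models, such as BLIP, BLIP-2, and InstructBLIP, approach VQA as a generative task. Let's take BLIP-2 as an example. It introduced a new vision-language pre-training paradigm in which any combination of a pre-trained vision encoder and LLM can be used (learn more in the BLIP-2 blog post). This enables state-of-the-art results on multiple vision-language tasks, including visual question answering.

Let's illustrate how you can use this model for VQA. First, let's load the model. Here we explicitly send the model to a GPU, if available, which we didn't need to do earlier when training, as Trainer handles this automatically: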

>>> from transformers import AutoProcessor, Blip2ForConditionalGeneration
>>> import torch

>>> processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
>>> model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16)
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> model.to(device)

The model takes an image and text as input, so let's use the exact same image/question pair from the first example in the VQA dataset:

>>> example = dataset[0]
>>> image = Image.open(example['image_id'])
>>> question = example['question']

To use BLIP-2 for the visual question answering task, the textual prompt has to follow a specific format: Question: {} Answer:.

>>> prompt = f"Question: {question} Answer:"

Now we need to preprocess the image/prompt with the model's processor, pass the processed input through the model, and decode the output:

>>> inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)

>>> generated_ids = model.generate(**inputs, max_new_tokens=10)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
>>> print(generated_text)
"He is looking at the crowd"

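As you can see, the model recognized the crowd and the direction of the face (looking down); however, it seems to miss the fact that the crowd is behind the skater. Still, in cases where acquiring human-annotated datasets is not feasible, this approach can quickly produce useful results.

Text to speech

Original: huggingface.co/docs/transformers/v4.37.2/en/tasks/text-to-speech

Text-to-speech (TTS) is the task of creating natural-sounding speech from text, where the speech can be generated in multiple languages and for multiple speakers. Several text-to-speech models are currently available in 🤗 Transformers, such as Bark, MMS, VITS, and SpeechT5.

You can easily generate audio using the "text-to-audio" pipeline (or its alias, "text-to-speech"). Some models, like Bark, can also be conditioned to generate non-verbal communications such as laughing, sighing, and crying, or even add music. Here's an example of how you would use the "text-to-speech" pipeline with Bark: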

>>> from transformers import pipeline

>>> pipe = pipeline("text-to-speech", model="suno/bark-small")
>>> text = "[clears throat] This is a test ... and I just took a long pause."
>>> output = pipe(text)

Here's a code snippet you can use to listen to the resulting audio in a notebook:

>>> from IPython.display import Audio
>>> Audio(output["audio"], rate=output["sampling_rate"])

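For more examples of what Bark and other pretrained TTS models can do, refer to our Audio course.

If you are looking to fine-tune a TTS model, the only text-to-speech models currently available in 🤗 Transformers are SpeechT5 and FastSpeech2Conformer, though more will be added in the future. SpeechT5 is pre-trained on a combination of speech-to-text and text-to-speech data, allowing it to learn a unified space of hidden representations shared by both text and speech. This means that the same pre-trained model can be fine-tuned for different tasks. Furthermore, SpeechT5 supports multiple speakers through x-vector speaker embeddings.

The remainder of this guide illustrates how to:

Fine-tune SpeechT5, which was originally trained on English speech, on the Dutch (nl) language subset of the VoxPopuli dataset.
Use your fine-tuned model for inference in one of two ways: using a pipeline or directly.

Before you begin, make sure you have all the necessary libraries installed: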

pip install datasets soundfile speechbrain accelerate

Install 🤗 Transformers from source, as not all the SpeechT5 features have been merged into an official release yet:

pip install git+https://github.com/huggingface/transformers.git

To follow this guide you will need a GPU. If you're working in a notebook, run the following line to check if a GPU is available:

!nvidia-smi

or alternatively for AMD GPUs:

!rocm-smi

We encourage you to log in to your Hugging Face account to upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

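Load the dataset

VoxPopuli is a large-scale multilingual speech corpus consisting of data sourced from 2009-2020 European Parliament event recordings. It contains labelled audio-transcription data for 15 European languages. In this guide, we use the Dutch language subset; feel free to pick another subset.

Note that VoxPopuli or any other automated speech recognition (ASR) dataset may not be the most suitable option for training TTS models. The features that make it beneficial for ASR, such as excessive background noise, are typically undesirable in TTS. However, finding top-quality, multilingual, and multi-speaker TTS datasets can be quite challenging.

Let's load the data: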

>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("facebook/voxpopuli", "nl", split="train")
>>> len(dataset)
20968

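20968 examples should be sufficient for fine-tuning. SpeechT5 expects audio data to have a sampling rate of 16 kHz, so make sure the examples in the dataset meet this requirement:

>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

Preprocess the data

Let's begin by defining the model checkpoint to use and loading the appropriate processor: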

>>> from transformers import SpeechT5Processor

>>> checkpoint = "microsoft/speecht5_tts"
>>> processor = SpeechT5Processor.from_pretrained(checkpoint)

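Text cleanup for SpeechT5 tokenization

Start by cleaning up the text data. You'll need the tokenizer part of the processor to process the text: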

>>> tokenizer = processor.tokenizer

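The dataset examples contain raw_text and normalized_text features. When deciding which feature to use as the text input, consider that the SpeechT5 tokenizer doesn't have any tokens for numbers. In normalized_text the numbers are written out as text. Thus, it is a better fit, and we recommend using normalized_text as input text.

Because SpeechT5 was trained on the English language, it may not recognize certain characters in the Dutch dataset. If left as is, these characters will be converted to <unk> tokens. However, in Dutch, certain characters like à are used to stress syllables. In order to preserve the meaning of the text, we can replace this character with a regular a.

To identify unsupported tokens, extract all unique characters in the dataset using the SpeechT5Tokenizer, which works with characters as tokens. To do this, write the extract_all_chars mapping function that concatenates the transcriptions from all examples into one string and converts it to a set of characters. Make sure to set batched=True and batch_size=-1 in dataset.map() so that all transcriptions are available at once for the mapping function.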

>>> def extract_all_chars(batch):
...     all_text = " ".join(batch["normalized_text"])
...     vocab = list(set(all_text))
...     return {"vocab": [vocab], "all_text": [all_text]}

>>> vocabs = dataset.map(
...     extract_all_chars,
...     batched=True,
...     batch_size=-1,
...     keep_in_memory=True,
...     remove_columns=dataset.column_names,
... )

>>> dataset_vocab = set(vocabs["vocab"][0])
>>> tokenizer_vocab = {k for k, _ in tokenizer.get_vocab().items()}

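Now you have two sets of characters: one with the vocabulary from the dataset and one with the vocabulary from the tokenizer. To identify any unsupported characters in the dataset, you can take the difference between these two sets. The resulting set will contain the characters that are in the dataset but not in the tokenizer.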

>>> dataset_vocab - tokenizer_vocab
{' ', 'à', 'ç', 'è', 'ë', 'í', 'ï', 'ö', 'ü'}
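The set difference used above can be tried on toy data without downloading anything. A minimal sketch: the two transcripts and the lowercase-letter vocabulary below are hypothetical stand-ins for the dataset and the tokenizer vocabulary.

```python
# Hypothetical stand-ins for the dataset transcripts and the tokenizer vocabulary.
transcripts = ["één café", "dat is zo"]
tokenizer_vocab = set("abcdefghijklmnopqrstuvwxyz") | {"▁"}

# Join all transcripts into one string and reduce it to its set of characters.
dataset_vocab = set(" ".join(transcripts))

# Characters that occur in the data but that the tokenizer does not know.
unsupported = dataset_vocab - tokenizer_vocab
print(sorted(unsupported))  # [' ', 'é']
```

As in the real output, the plain space shows up as unsupported only because the tokenizer stores it as ▁.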

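To handle the unsupported characters identified in the previous step, define a function that maps these characters to valid tokens. Note that spaces are already replaced by ▁ in the tokenizer and don't need to be handled separately.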

>>> replacements = [
...     ("à", "a"),
...     ("ç", "c"),
...     ("è", "e"),
...     ("ë", "e"),
...     ("í", "i"),
...     ("ï", "i"),
...     ("ö", "o"),
...     ("ü", "u"),
... ]

>>> def cleanup_text(inputs):
...     for src, dst in replacements:
...         inputs["normalized_text"] = inputs["normalized_text"].replace(src, dst)
...     return inputs

>>> dataset = dataset.map(cleanup_text)

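Now that you have dealt with special characters in the text, it's time to shift the focus to the audio data.

Speakers

The VoxPopuli dataset includes speech from multiple speakers, but how many speakers are represented in the dataset? To determine this, we can count the number of unique speakers and the number of examples each speaker contributes to the dataset. With a total of 20,968 examples in the dataset, this information will give us a better understanding of the distribution of speakers and examples in the data.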

>>> from collections import defaultdict

>>> speaker_counts = defaultdict(int)

>>> for speaker_id in dataset["speaker_id"]:
...     speaker_counts[speaker_id] += 1

By plotting a histogram you can get a sense of how much data there is for each speaker.

>>> import matplotlib.pyplot as plt

>>> plt.figure()
>>> plt.hist(speaker_counts.values(), bins=20)
>>> plt.ylabel("Speakers")
>>> plt.xlabel("Examples")
>>> plt.show()

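The histogram reveals that approximately one-third of the speakers in the dataset have fewer than 100 examples, while around ten speakers have more than 500 examples. To improve training efficiency and balance the dataset, we can limit the data to speakers with between 100 and 400 examples.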

>>> def select_speaker(speaker_id):
...     return 100 <= speaker_counts[speaker_id] <= 400

>>> dataset = dataset.filter(select_speaker, input_columns=["speaker_id"])
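To see the effect of this filter in isolation, here is a small standard-library sketch. The speaker ids and their counts are made up for illustration; only the 100 to 400 range comes from the guide.

```python
from collections import Counter

# Hypothetical speaker ids; each entry stands for one example in the dataset.
speaker_ids = ["a"] * 50 + ["b"] * 150 + ["c"] * 400 + ["d"] * 900

speaker_counts = Counter(speaker_ids)


def select_speaker(speaker_id, low=100, high=400):
    """Keep speakers whose example count lies in the closed range [low, high]."""
    return low <= speaker_counts[speaker_id] <= high


kept = [s for s in speaker_ids if select_speaker(s)]
print(sorted(set(kept)), len(kept))  # ['b', 'c'] 550
```

Speaker "a" is dropped for having too little data and speaker "d" for dominating the dataset, which is the balancing effect the filter is after.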

Let's check how many speakers remain:

>>> len(set(dataset["speaker_id"]))
42

Let's see how many examples are left:

>>> len(dataset)
9973

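You are left with just under 10,000 examples from approximately 40 unique speakers, which should be sufficient.

Note that some speakers with few examples may actually have more audio available if the examples are long. However, determining the total amount of audio for each speaker requires scanning through the entire dataset, which is a time-consuming process that involves loading and decoding each audio file. As such, we have chosen to skip this step here.

Speaker embeddings

To enable the TTS model to differentiate between multiple speakers, you'll need to create a speaker embedding for each example. The speaker embedding is an additional input into the model that captures a particular speaker's voice characteristics. To generate these speaker embeddings, use the pre-trained spkrec-xvect-voxceleb model from SpeechBrain.

Create a function create_speaker_embedding() that takes an input audio waveform and outputs a 512-element vector containing the corresponding speaker embedding.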

>>> import os
>>> import torch
>>> from speechbrain.pretrained import EncoderClassifier

>>> spk_model_name = "speechbrain/spkrec-xvect-voxceleb"

>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> speaker_model = EncoderClassifier.from_hparams(
...     source=spk_model_name,
...     run_opts={"device": device},
...     savedir=os.path.join("/tmp", spk_model_name),
... )

>>> def create_speaker_embedding(waveform):
...     with torch.no_grad():
...         speaker_embeddings = speaker_model.encode_batch(torch.tensor(waveform))
...         speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2)
...         speaker_embeddings = speaker_embeddings.squeeze().cpu().numpy()
...     return speaker_embeddings

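It's important to note that the speechbrain/spkrec-xvect-voxceleb model was trained on English speech from the VoxCeleb dataset, whereas the training examples in this guide are in Dutch. While we believe that this model will still generate reasonable speaker embeddings for our Dutch dataset, this assumption may not hold true in all cases.

For optimal results, we recommend training an X-vector model on the target speech first. This will ensure that the model is better able to capture the unique voice characteristics present in the Dutch language.

Processing the dataset

Finally, let's process the data into the format the model expects. Create a prepare_dataset function that takes in a single example and uses the SpeechT5Processor object to tokenize the input text and load the target audio into a log-mel spectrogram. It should also add the speaker embeddings as an additional input.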

>>> def prepare_dataset(example):
...     audio = example["audio"]

...     example = processor(
...         text=example["normalized_text"],
...         audio_target=audio["array"],
...         sampling_rate=audio["sampling_rate"],
...         return_attention_mask=False,
...     )

...     # strip off the batch dimension
...     example["labels"] = example["labels"][0]

...     # use SpeechBrain to obtain x-vector
...     example["speaker_embeddings"] = create_speaker_embedding(audio["array"])

...     return example

Verify the processing is correct by looking at a single example:

>>> processed_example = prepare_dataset(dataset[0])
>>> list(processed_example.keys())
['input_ids', 'labels', 'stop_labels', 'speaker_embeddings']

Speaker embeddings should be a 512-element vector:

>>> processed_example["speaker_embeddings"].shape
(512,)

The labels should be a log-mel spectrogram with 80 mel bins.

>>> import matplotlib.pyplot as plt

>>> plt.figure()
>>> plt.imshow(processed_example["labels"].T)
>>> plt.show()

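Side note: If you find this spectrogram confusing, it may be due to your familiarity with the convention of placing low frequencies at the bottom and high frequencies at the top of a plot. However, when plotting spectrograms as an image using the matplotlib library, the y-axis is flipped and the spectrograms appear upside down.

Now apply the processing function to the entire dataset. This will take between 5 and 10 minutes.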

>>> dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)

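You'll see a warning saying that some examples in the dataset are longer than the maximum input length the model can handle (600 tokens). Remove those examples from the dataset. Here we go even further and, to allow for larger batch sizes, remove anything over 200 tokens.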

>>> def is_not_too_long(input_ids):
...     input_length = len(input_ids)
...     return input_length < 200

>>> dataset = dataset.filter(is_not_too_long, input_columns=["input_ids"])
>>> len(dataset)
8259

Next, create a basic train/test split:

>>> dataset = dataset.train_test_split(test_size=0.1)

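Data collator

In order to combine multiple examples into a batch, you need to define a custom data collator. This collator will pad shorter sequences with padding tokens, ensuring that all examples have the same length. For the spectrogram labels, the padded portions are replaced with the special value -100. This special value instructs the model to ignore that part of the spectrogram when calculating the spectrogram loss.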

>>> from dataclasses import dataclass
>>> from typing import Any, Dict, List, Union

>>> @dataclass
... class TTSDataCollatorWithPadding:
...     processor: Any

...     def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
...         input_ids = [{"input_ids": feature["input_ids"]} for feature in features]
...         label_features = [{"input_values": feature["labels"]} for feature in features]
...         speaker_features = [feature["speaker_embeddings"] for feature in features]

...         # collate the inputs and targets into a batch
...         batch = self.processor.pad(input_ids=input_ids, labels=label_features, return_tensors="pt")

...         # replace padding with -100 to ignore loss correctly
...         batch["labels"] = batch["labels"].masked_fill(batch.decoder_attention_mask.unsqueeze(-1).ne(1), -100)

...         # not used during fine-tuning
...         del batch["decoder_attention_mask"]

...         # round down target lengths to multiple of reduction factor
...         if model.config.reduction_factor > 1:
...             target_lengths = torch.tensor([len(feature["input_values"]) for feature in label_features])
...             target_lengths = target_lengths.new(
...                 [length - length % model.config.reduction_factor for length in target_lengths]
...             )
...             max_length = max(target_lengths)
...             batch["labels"] = batch["labels"][:, :max_length]

...         # also add in the speaker embeddings
...         batch["speaker_embeddings"] = torch.tensor(speaker_features)

...         return batch

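In SpeechT5, the input to the decoder part of the model is reduced by a factor of 2. In other words, it throws away every other timestep from the target sequence. The decoder then predicts a sequence that is twice as long. Since the original target sequence length may be odd, the data collator makes sure to round the maximum length of the batch down to be a multiple of 2.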

>>> data_collator = TTSDataCollatorWithPadding(processor=processor)
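The rounding step inside the collator is easy to check by hand. The sketch below reproduces only that arithmetic in plain Python; the three target lengths are hypothetical, and the helper name round_down_lengths is ours rather than part of any library.

```python
def round_down_lengths(lengths, reduction_factor=2):
    """Round each target length down to the nearest multiple of reduction_factor."""
    return [length - length % reduction_factor for length in lengths]


# Hypothetical spectrogram lengths (in frames) for a batch of three examples.
target_lengths = [101, 98, 77]

rounded = round_down_lengths(target_lengths)
max_length = max(rounded)
print(rounded, max_length)  # [100, 98, 76] 100
```

The batch's label tensor is then cut to max_length frames, so the longest example loses its final odd frame.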

Train the model

Load the pre-trained model from the same checkpoint as you used for loading the processor:

>>> from transformers import SpeechT5ForTextToSpeech

>>> model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint)

The use_cache=True option is incompatible with gradient checkpointing. Disable it for training.

>>> model.config.use_cache = False

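Define the training arguments. Here we are not computing any evaluation metrics during the training process. Instead, we'll only look at the loss: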

>>> from transformers import Seq2SeqTrainingArguments

>>> training_args = Seq2SeqTrainingArguments(
...     output_dir="speecht5_finetuned_voxpopuli_nl",  # change to a repo name of your choice
...     per_device_train_batch_size=4,
...     gradient_accumulation_steps=8,
...     learning_rate=1e-5,
...     warmup_steps=500,
...     max_steps=4000,
...     gradient_checkpointing=True,
...     fp16=True,
...     evaluation_strategy="steps",
...     per_device_eval_batch_size=2,
...     save_steps=1000,
...     eval_steps=1000,
...     logging_steps=25,
...     report_to=["tensorboard"],
...     load_best_model_at_end=True,
...     greater_is_better=False,
...     label_names=["labels"],
...     push_to_hub=True,
... )

Instantiate the Trainer object and pass the model, dataset, and data collator to it.

>>> from transformers import Seq2SeqTrainer

>>> trainer = Seq2SeqTrainer(
...     args=training_args,
...     model=model,
...     train_dataset=dataset["train"],
...     eval_dataset=dataset["test"],
...     data_collator=data_collator,
...     tokenizer=processor,
... )

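And with that, you're ready to start training! Training will take several hours. Depending on your GPU, it is possible that you will encounter a CUDA "out-of-memory" error when you start training. In this case, you can reduce the per_device_train_batch_size incrementally by factors of 2 and increase gradient_accumulation_steps by 2x to compensate.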

>>> trainer.train()
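The out-of-memory advice above keeps the effective batch size constant. A minimal sketch of that arithmetic for a single device, starting from the values used in the training arguments (4 and 8); the helper halve_batch is a hypothetical name introduced only for this illustration.

```python
def halve_batch(per_device_batch_size, gradient_accumulation_steps):
    """Halve the per-device batch and double accumulation; the product stays fixed."""
    if per_device_batch_size % 2 != 0:
        raise ValueError("per-device batch size must be even to halve it")
    return per_device_batch_size // 2, gradient_accumulation_steps * 2


# Start from the guide's settings and keep halving until the batch size is 1.
settings = [(4, 8)]
while settings[-1][0] > 1:
    settings.append(halve_batch(*settings[-1]))

for batch, accum in settings:
    # Columns: per-device batch size, accumulation steps, effective batch size.
    print(batch, accum, batch * accum)
```

Every row has the same effective batch size of 32, so the optimizer sees the same number of examples per update while each forward pass needs less memory.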

To be able to use your checkpoint with a pipeline, make sure to save the processor with the checkpoint:

>>> processor.save_pretrained("YOUR_ACCOUNT_NAME/speecht5_finetuned_voxpopuli_nl")

Push the final model to the 🤗 Hub:

>>> trainer.push_to_hub()

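Inference

Inference with a pipeline

Great, now that you've fine-tuned a model, you can use it for inference! First, let's see how you can use it with a corresponding pipeline. Let's create a "text-to-speech" pipeline with your checkpoint: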

>>> from transformers import pipeline

>>> pipe = pipeline("text-to-speech", model="YOUR_ACCOUNT_NAME/speecht5_finetuned_voxpopuli_nl")

Pick a piece of text in Dutch you'd like narrated, e.g.:

>>> text = "hallo allemaal, ik praat nederlands. groetjes aan iedereen!"

To use SpeechT5 with the pipeline, you'll need a speaker embedding. Let's get it from an example in the test dataset:

>>> example = dataset["test"][304]
>>> speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0)

Now you can pass the text and speaker embeddings to the pipeline, and it will take care of the rest:

>>> forward_params = {"speaker_embeddings": speaker_embeddings}
>>> output = pipe(text, forward_params=forward_params)
>>> output
{'audio': array([-6.82714235e-05, -4.26525949e-04,  1.06134125e-04, ...,
        -1.22392643e-03, -7.76011671e-04,  3.29112721e-04], dtype=float32),
 'sampling_rate': 16000}

You can then listen to the result:

>>> from IPython.display import Audio
>>> Audio(output['audio'], rate=output['sampling_rate'])

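Run inference manually

You can achieve the same inference results without using the pipeline, although more steps are required.

Load the model from the 🤗 Hub: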

>>> model = SpeechT5ForTextToSpeech.from_pretrained("YOUR_ACCOUNT_NAME/speecht5_finetuned_voxpopuli_nl")

Pick an example from the test dataset to obtain a speaker embedding.

>>> example = dataset["test"][304]
>>> speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0)

Define the input text and tokenize it.

>>> text = "hallo allemaal, ik praat nederlands. groetjes aan iedereen!"
>>> inputs = processor(text=text, return_tensors="pt")

Create a spectrogram with your model:

>>> spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings)

Visualize the spectrogram, if you'd like to:

>>> plt.figure()
>>> plt.imshow(spectrogram.T)
>>> plt.show()

[Figure: generated log-mel spectrogram]

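Finally, use the vocoder to turn the spectrogram into sound.

>>> from transformers import SpeechT5HifiGan

>>> # the original snippet never defines `vocoder`; the HiFi-GAN checkpoint below is an assumption
>>> vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

>>> with torch.no_grad():
...     speech = vocoder(spectrogram)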

>>> from IPython.display import Audio

>>> Audio(speech.numpy(), rate=16000)

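In our experience, obtaining satisfactory results from this model can be challenging. The quality of the speaker embeddings appears to be a significant factor. Since SpeechT5 was pre-trained with English x-vectors, it performs best when using English speaker embeddings. If the synthesized speech sounds poor, try using a different speaker embedding.

Increasing the training duration is also likely to enhance the quality of the results. Even so, the speech clearly is Dutch instead of English, and it does capture the voice characteristics of the speaker (compare to the original audio in the example). Another thing to experiment with is the model's configuration. For example, try using config.reduction_factor = 1 to see if this improves the results.

Finally, it is essential to consider the ethical implications. Although TTS technology has numerous useful applications, it may also be used for malicious purposes, such as impersonating someone's voice without their knowledge or consent. Please use TTS judiciously and responsibly.

Generation

Text generation strategies

Original: huggingface.co/docs/transformers/v4.37.2/en/generation_strategies

Text generation is essential to many NLP tasks, such as open-ended text generation, summarization, translation, and more. It also plays a role in a variety of mixed-modality applications that have text as an output, like speech-to-text and vision-to-text. Some of the models that can generate text include GPT2, XLNet, OpenAI GPT, CTRL, TransformerXL, XLM, Bart, T5, GIT, and Whisper.

Check out a few examples that use the generate() method to produce text outputs for different tasks:

Text summarization
Image captioning
Audio transcription

Note that the inputs to the generate method depend on the model's modality. They are returned by the model's preprocessor class, such as AutoTokenizer or AutoProcessor. If a model's preprocessor creates more than one kind of input, pass all the inputs to generate(). You can learn more about the individual model's preprocessor in the corresponding model's documentation.

The process of selecting output tokens to generate text is known as decoding, and you can customize the decoding strategy that the generate() method will use. Modifying a decoding strategy does not change the values of any trainable parameters. However, it can have a noticeable impact on the quality of the generated output. It can help reduce repetition in the text and make it more coherent.

This guide describes:

the default generation configuration
common decoding strategies and their main parameters
saving and sharing custom generation configurations with your fine-tuned model on the 🤗 Hub

Default text generation configuration

A decoding strategy for a model is defined in its generation configuration. When using pre-trained models for inference within a pipeline, the models call the PreTrainedModel.generate() method, which applies a default generation configuration under the hood. The default configuration is also used when no custom configuration has been saved with the model.

When you load a model explicitly, you can inspect the generation configuration that comes with it through model.generation_config: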

>>> from transformers import AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained("distilgpt2")
>>> model.generation_config
GenerationConfig {
    "bos_token_id": 50256,
    "eos_token_id": 50256,
}

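Printing out model.generation_config reveals only the values that differ from the default generation configuration, and does not list any of the default values.

The default generation configuration limits the size of the output combined with the input prompt to a maximum of 20 tokens to avoid running into resource limitations. The default decoding strategy is greedy search, which is the simplest decoding strategy: it picks the token with the highest probability as the next token. For many tasks and small output sizes this works well. However, when used to generate longer outputs, greedy search can start producing highly repetitive results.

Customize text generation

You can override any generation_config by passing the parameters and their values directly to the generate method: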

>>> my_model.generate(**inputs, num_beams=4, do_sample=True)

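Even if the default decoding strategy mostly works for your task, you can still tweak a few things. Some of the commonly adjusted parameters include:

max_new_tokens: the maximum number of tokens to generate. In other words, the size of the output sequence, not including the tokens in the prompt. As an alternative to using the output's length as a stopping criterion, you can choose to stop generation whenever the full generation exceeds some amount of time. To learn more, check StoppingCriteria.
num_beams: by specifying a number of beams higher than 1, you are effectively switching from greedy search to beam search. This strategy evaluates several hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability for the entire sequence. This has the advantage of identifying high-probability sequences that start with a lower-probability initial token and would have been ignored by greedy search.
do_sample: if set to True, this parameter enables decoding strategies such as multinomial sampling, beam-search multinomial sampling, Top-K sampling, and Top-p sampling. All these strategies select the next token from the probability distribution over the entire vocabulary, with various strategy-specific adjustments.
num_return_sequences: the number of sequence candidates to return for each input. This option is only available for decoding strategies that support multiple sequence candidates, e.g. variations of beam search and sampling. Decoding strategies like greedy search and contrastive search return a single output sequence.

Save a custom decoding strategy with your model

If you would like to share your fine-tuned model with a specific generation configuration, you can:

Create a GenerationConfig class instance
Specify the decoding strategy parameters
Save your generation configuration with GenerationConfig.save_pretrained(), making sure to leave its config_file_name argument empty
Set push_to_hub to True to upload your config to the model's repo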

>>> from transformers import AutoModelForCausalLM, GenerationConfig

>>> model = AutoModelForCausalLM.from_pretrained("my_account/my_model")
>>> generation_config = GenerationConfig(
...     max_new_tokens=50, do_sample=True, top_k=50, eos_token_id=model.config.eos_token_id
... )
>>> generation_config.save_pretrained("my_account/my_model", push_to_hub=True)

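You can also store several generation configurations in a single directory, making use of the config_file_name argument in GenerationConfig.save_pretrained(). You can later instantiate them with GenerationConfig.from_pretrained(). This is useful if you want to store several generation configurations for a single model (e.g. one for creative text generation with sampling, and one for summarization with beam search). You must have the right Hub permissions to add configuration files to a model.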

>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig

>>> tokenizer = AutoTokenizer.from_pretrained("t5-small")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

>>> translation_generation_config = GenerationConfig(
...     num_beams=4,
...     early_stopping=True,
...     decoder_start_token_id=0,
...     eos_token_id=model.config.eos_token_id,
...     pad_token_id=model.config.pad_token_id,
... )

>>> # Tip: add `push_to_hub=True` to push to the Hub
>>> translation_generation_config.save_pretrained("/tmp", "translation_generation_config.json")

>>> # You could then use the named generation config file to parameterize generation
>>> generation_config = GenerationConfig.from_pretrained("/tmp", "translation_generation_config.json")
>>> inputs = tokenizer("translate English to French: Configuration files are easy to use!", return_tensors="pt")
>>> outputs = model.generate(**inputs, generation_config=generation_config)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
['Les fichiers de configuration sont faciles à utiliser!']

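Streaming

generate() supports streaming through its streamer input. The streamer input is compatible with any instance of a class that has the following methods: put() and end(). Internally, put() is used to push new tokens and end() is used to flag the end of text generation.

The API for the streamer classes is still under development and may change in the future.

In practice, you can craft your own streaming class for all sorts of purposes! We also have basic streaming classes ready for you to use. For example, you can use the TextStreamer class to stream the output of generate() to your screen, one word at a time: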

>>> from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

>>> tok = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")
>>> inputs = tok(["An increasing sequence: one,"], return_tensors="pt")
>>> streamer = TextStreamer(tok)

>>> # Despite returning the usual output, the streamer will also print the generated text to stdout.
>>> _ = model.generate(**inputs, streamer=streamer, max_new_tokens=20)
An increasing sequence: one, two, three, four, five, six, seven, eight, nine, ten, eleven,
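To make the put()/end() contract described above concrete, here is a minimal sketch of a custom streamer that only collects what it receives. The class name CollectingStreamer and the token ids are hypothetical, and the methods are called by hand with plain lists; in real use an instance would be passed as streamer= and generate() would call both methods itself, presumably with tensors rather than lists.

```python
class CollectingStreamer:
    """Minimal streamer: stores every chunk passed to put() and notes when end() is called."""

    def __init__(self):
        self.chunks = []
        self.finished = False

    def put(self, value):
        # Called with newly available token ids; copy them into a plain list.
        self.chunks.append(list(value))

    def end(self):
        # Called once after the last token has been pushed.
        self.finished = True


streamer = CollectingStreamer()
# Hypothetical token ids, pushed manually to imitate a generation loop.
for step in ([464, 3290], [468], [257]):
    streamer.put(step)
streamer.end()

print(streamer.chunks, streamer.finished)  # [[464, 3290], [468], [257]] True
```

Because the interface is only these two methods, the same pattern could forward tokens to a queue, a websocket, or a progress bar instead of a list.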

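Decoding strategies

Certain combinations of the generate() parameters, and ultimately generation_config, can be used to enable specific decoding strategies. If you are new to this concept, we recommend reading this blog post that illustrates how common decoding strategies work.

Here, we'll show some of the parameters that control the decoding strategies and illustrate how you can use them.

Greedy search

generate uses greedy search decoding by default, so you don't have to pass any parameters to enable it. This means the parameter num_beams is set to 1 and do_sample=False.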

>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> prompt = "I look forward to"
>>> checkpoint = "distilgpt2"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> outputs = model.generate(**inputs)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['I look forward to seeing you all again!\n\n\n\n\n\n\n\n\n\n\n']
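What greedy search does at each step can be shown without a model. The sketch below uses a hand-written next-token table in place of a language model; the table, its probabilities, and the helper greedy_decode are all hypothetical.

```python
# Hypothetical next-token table: maps a context (tuple of words) to {token: probability}.
NEXT = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
}


def greedy_decode(prompt, steps):
    """At every step, append the single most probable next token."""
    seq, prob = tuple(prompt), 1.0
    for _ in range(steps):
        dist = NEXT.get(seq)
        if dist is None:  # no continuation known for this context
            break
        token = max(dist, key=dist.get)
        seq += (token,)
        prob *= dist[token]
    return seq, prob


seq, prob = greedy_decode(["The"], steps=2)
print(" ".join(seq), round(prob, 2))  # The nice woman 0.2
```

Greedy search commits to "nice" because it is the best first step, even though "The dog has" would have been a more probable sequence overall (0.4 × 0.9 = 0.36).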

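Contrastive search

The contrastive search decoding strategy was proposed in the 2022 paper A Contrastive Framework for Neural Text Generation. It demonstrates superior results for generating non-repetitive yet coherent long outputs. To learn how contrastive search works, check out this blog post. The two main parameters that enable and control the behavior of contrastive search are penalty_alpha and top_k: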

>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> checkpoint = "gpt2-large"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)

>>> prompt = "Hugging Face Company is"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> outputs = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=100)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Hugging Face Company is a family owned and operated business. We pride ourselves on being the best
in the business and our customer service is second to none.\n\nIf you have any questions about our
products or services, feel free to contact us at any time. We look forward to hearing from you!']
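To give a feel for how penalty_alpha trades model confidence against repetition, here is a toy sketch of a contrastive-style selection rule. It assumes the score takes the form (1 - alpha) × probability minus alpha × the maximum cosine similarity to earlier hidden states, which is our reading of the paper; the two-dimensional hidden states and the candidate words are invented for the example, and this is not the library's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_pick(candidates, context_states, penalty_alpha):
    """candidates maps token -> (probability, hidden state); return the best-scoring token."""

    def score(item):
        prob, hidden = item[1]
        max_sim = max(cosine(hidden, h) for h in context_states)
        return (1 - penalty_alpha) * prob - penalty_alpha * max_sim

    return max(candidates.items(), key=score)[0]


# Hypothetical hidden states of the tokens generated so far.
context_states = [[1.0, 0.0], [0.9, 0.1]]
# Hypothetical top-k candidates (k = 2 here).
candidates = {
    "again": (0.6, [1.0, 0.0]),  # likely, but nearly identical to the context
    "today": (0.4, [0.0, 1.0]),  # less likely, but adds something new
}

print(contrastive_pick(candidates, context_states, penalty_alpha=0.0))  # again
print(contrastive_pick(candidates, context_states, penalty_alpha=0.6))  # today
```

With penalty_alpha=0.0 the rule reduces to picking the most probable candidate; raising it makes similarity to what was already generated count against a token.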

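Multinomial sampling

As opposed to greedy search, which always chooses the token with the highest probability as the next token, multinomial sampling (also called ancestral sampling) randomly selects the next token based on the probability distribution over the entire vocabulary given by the model. Every token with a non-zero probability has a chance of being selected, thus reducing the risk of repetition.

To enable multinomial sampling, set do_sample=True and num_beams=1.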

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
>>> set_seed(0)  # For reproducibility

>>> checkpoint = "gpt2-large"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)

>>> prompt = "Today was an amazing day because"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> outputs = model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Today was an amazing day because when you go to the World Cup and you don\'t, or when you don\'t get invited,
that\'s a terrible feeling."']
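The sampling step itself can be sketched with the standard library alone. The distribution below is hypothetical, and the optional top_k argument is added only to show how a Top-K restriction narrows the pool before sampling.

```python
import random


def sample_next_token(probs, rng, top_k=None):
    """Sample a token from {token: probability}; optionally keep only the top_k most likely."""
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]
    tokens = [token for token, _ in items]
    weights = [p for _, p in items]  # random.choices treats these as relative weights
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed for reproducibility
probs = {"sunny": 0.6, "rainy": 0.3, "purple": 0.1}  # hypothetical next-token distribution

draws = [sample_next_token(probs, rng) for _ in range(1000)]
top2_draws = [sample_next_token(probs, rng, top_k=2) for _ in range(1000)]

print({token: draws.count(token) for token in probs})
print("purple" in top2_draws)  # False: excluded by top_k=2
```

Unlike greedy search, the unrestricted draws include the unlikely token now and then, which is exactly what reduces repetition in longer outputs.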

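Beam-search decoding

Unlike greedy search, beam-search decoding keeps several hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability for the entire sequence. This has the advantage of identifying high-probability sequences that start with lower-probability initial tokens and would have been ignored by greedy search.

To enable this decoding strategy, specify num_beams (the number of hypotheses to keep track of) greater than 1.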

>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> prompt = "It is astonishing how one can"
>>> checkpoint = "gpt2-medium"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)

>>> outputs = model.generate(**inputs, num_beams=5, max_new_tokens=50)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['It is astonishing how one can have such a profound impact on the lives of so many people in such a short period of
time."\n\nHe added: "I am very proud of the work I have been able to do in the last few years.\n\n"I have']
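The advantage of keeping several hypotheses can be reproduced on a toy example. As before, the next-token table is hypothetical and stands in for a real model; beam_search is a bare-bones illustration without length penalties or end-of-sequence handling.

```python
# Hypothetical next-token table: maps a context (tuple of words) to {token: probability}.
NEXT = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
}


def beam_search(prompt, steps, num_beams):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [(tuple(prompt), 1.0)]
    for _ in range(steps):
        candidates = []
        for seq, prob in beams:
            dist = NEXT.get(seq)
            if dist is None:  # no continuation known: carry the hypothesis over unchanged
                candidates.append((seq, prob))
                continue
            for token, p in dist.items():
                candidates.append((seq + (token,), prob * p))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams


best = beam_search(["The"], steps=2, num_beams=2)[0]
greedy = beam_search(["The"], steps=2, num_beams=1)[0]
print(" ".join(best[0]), round(best[1], 2))      # The dog has 0.36
print(" ".join(greedy[0]), round(greedy[1], 2))  # The nice woman 0.2
```

With a single beam the procedure is identical to greedy search; with two beams, the hypothesis starting with the less likely "dog" survives the first step and wins overall.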

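Beam-search multinomial sampling

As the name implies, this decoding strategy combines beam search with multinomial sampling. You need to specify num_beams greater than 1 and set do_sample=True to use this decoding strategy.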

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, set_seed
>>> set_seed(0)  # For reproducibility

>>> prompt = "translate English to German: The house is wonderful."
>>> checkpoint = "t5-small"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

>>> outputs = model.generate(**inputs, num_beams=5, do_sample=True)
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
'Das Haus ist wunderbar.'

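Diverse beam search decoding

The diverse beam search decoding strategy is an extension of the beam search strategy that allows for generating a more diverse set of beam sequences to choose from. To learn how it works, refer to Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models. This approach has three main parameters: num_beams, num_beam_groups, and diversity_penalty. The diversity penalty ensures the outputs are distinct across groups, and beam search is used within each group.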

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

>>> checkpoint = "google/pegasus-xsum"
>>> prompt = (
...     "The Permaculture Design Principles are a set of universal design principles "
...     "that can be applied to any location, climate and culture, and they allow us to design "
...     "the most efficient and sustainable human habitation and food production systems. "
...     "Permaculture is a design system that encompasses a wide variety of disciplines, such "
...     "as ecology, landscape design, environmental science and energy conservation, and the "
...     "Permaculture design principles are drawn from these various disciplines. Each individual "
...     "design principle itself embodies a complete conceptual framework based on sound "
...     "scientific principles. When we bring all these separate  principles together, we can "
...     "create a design system that both looks at whole systems, the parts that these systems "
...     "consist of, and how those parts interact with each other to create a complex, dynamic, "
...     "living system. Each design principle serves as a tool that allows us to integrate all "
...     "the separate parts of a design, referred to as elements, into a functional, synergistic, "
...     "whole system, where the elements harmoniously interact and work together in the most "
...     "efficient way possible."
... )

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

>>> outputs = model.generate(**inputs, num_beams=5, num_beam_groups=5, max_new_tokens=30, diversity_penalty=1.0)
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
'The Design Principles are a set of universal design principles that can be applied to any location, climate and
culture, and they allow us to design the'
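To make the role of diversity_penalty more concrete, the following is a toy sketch of a Hamming-style penalty: at a given decoding step, every token already chosen by an earlier group has its score lowered in proportion to how often it was chosen. The function name, the three-token vocabulary and the scores are all hypothetical, and this is not the library's implementation, only an illustration of the idea.

```python
from collections import Counter


def apply_diversity_penalty(scores, previous_group_tokens, diversity_penalty):
    """Lower the score of each token already picked by earlier groups at this step."""
    counts = Counter(previous_group_tokens)
    return [score - diversity_penalty * counts.get(token_id, 0)
            for token_id, score in enumerate(scores)]


# Hypothetical log-scores for a 3-token vocabulary (token ids 0, 1, 2).
scores = [2.0, 1.5, 0.5]

# Two earlier groups both picked token 0 at this step, so it is penalized twice.
penalized = apply_diversity_penalty(scores, previous_group_tokens=[0, 0], diversity_penalty=1.0)
best = max(range(len(penalized)), key=penalized.__getitem__)

print(penalized, best)
```

With the penalty applied, the current group prefers token 1 even though token 0 had the highest raw score, which is how the groups end up exploring different continuations.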

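This guide illustrates the main parameters that enable various decoding strategies. More advanced parameters exist for the generate method, which give you even further control over its behavior. For the complete list of available parameters, refer to the API documentation.

Speculative decoding

Speculative decoding (also known as assisted decoding) is a modification of the decoding strategies above that uses an assistant model (ideally a much smaller one) with the same tokenizer to generate a few candidate tokens. The main model then validates the candidate tokens in a single forward pass, which speeds up the decoding process. If do_sample=True, the token validation with resampling introduced in the speculative decoding paper is used.

Currently, only greedy search and sampling are supported with assisted decoding, and assisted decoding doesn't support batched inputs. To learn more about assisted decoding, check this blog post.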
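The draft-then-verify loop described above can be sketched with two toy "models" that are plain Python functions. Everything here (main_model, assistant_model, assisted_generate, the integer "tokens") is a hypothetical stand-in chosen for illustration, and only the greedy case is covered. The point of the sketch is that the result matches what the main model would produce on its own; the assistant only decides how many tokens can be accepted per round.

```python
def main_model(tokens):
    # Hypothetical "large" model: always predicts the previous token plus one (mod 10).
    return (tokens[-1] + 1) % 10


def assistant_model(tokens):
    # Hypothetical "small" model: agrees with the main model except right after token 4.
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


def greedy_generate(model, prompt, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def assisted_generate(main, assistant, prompt, max_new_tokens, num_candidates=3):
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1. The assistant cheaply drafts a few candidate tokens.
        draft = list(tokens)
        for _ in range(num_candidates):
            draft.append(assistant(draft))
        candidates = draft[len(tokens):]
        # 2. The main model checks the candidates in order. A real implementation
        #    scores all of them in one forward pass; here we call it per token.
        for candidate in candidates:
            verified = main(tokens)
            tokens.append(verified)  # always keep the main model's own choice
            if verified != candidate or len(tokens) == target_len:
                break  # first mismatch (or length limit): drop the rest of the draft
    return tokens


greedy = greedy_generate(main_model, [1], max_new_tokens=6)
assisted = assisted_generate(main_model, assistant_model, [1], max_new_tokens=6)
print(greedy, assisted)
```

Both calls return the same sequence. The speed-up in practice comes from the fact that verifying several drafted tokens costs the main model a single forward pass, which this per-token toy does not model.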

To enable assisted decoding, set the assistant_model argument with a model.

>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> prompt = "Alice and Bob"
>>> checkpoint = "EleutherAI/pythia-1.4b-deduped"
>>> assistant_checkpoint = "EleutherAI/pythia-160m-deduped"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
>>> outputs = model.generate(**inputs, assistant_model=assistant_model)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']

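When using assisted decoding with sampling methods, you can use the temperature argument to control the randomness, just like in multinomial sampling. However, in assisted decoding, reducing the temperature may help improve latency.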

>>> from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
>>> set_seed(42)  # For reproducibility

>>> prompt = "Alice and Bob"
>>> checkpoint = "EleutherAI/pythia-1.4b-deduped"
>>> assistant_checkpoint = "EleutherAI/pythia-160m-deduped"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
>>> outputs = model.generate(**inputs, assistant_model=assistant_model, do_sample=True, temperature=0.5)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Alice and Bob are going to the same party. It is a small party, in a small']

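Prompting

Image tasks with IDEFICS

Original text: huggingface.co/docs/transformers/v4.37.2/en/tasks/idefics

While individual tasks can be tackled by fine-tuning specialized models, an alternative approach that has recently emerged and gained popularity is to use large models for a diverse set of tasks without fine-tuning. For instance, large language models can handle NLP tasks such as summarization, translation, classification, and more. This approach is no longer limited to a single modality, such as text, and in this guide we will illustrate how you can solve image-text tasks with a large multimodal model called IDEFICS.

IDEFICS is an open-access vision and language model based on Flamingo, a state-of-the-art visual language model initially developed by DeepMind. The model accepts arbitrary sequences of image and text inputs and generates coherent text as output. It can answer questions about images, describe visual content, create stories grounded in multiple images, and so on. IDEFICS comes in two variants, 80 billion parameters and 9 billion parameters, both of which are available on the 🤗 Hub. For each variant, you can also find fine-tuned instructed versions of the model adapted for conversational use cases.

This model is exceptionally versatile and can be used for a wide range of image and multimodal tasks. However, being a large model means it requires significant computational resources and infrastructure. It is up to you to decide whether this approach suits your use case better than fine-tuning specialized models for each individual task.

In this guide, you'll learn how to:

Load IDEFICS and load the quantized version of the model
Use IDEFICS for:
- Image captioning
- Prompted image captioning
- Few-shot prompting
- Visual question answering
- Image classification
- Image-guided text generation
Run inference in batch mode
Run IDEFICS instruct for conversational use

Before you begin, make sure you have all the necessary libraries installed.

pip install -q bitsandbytes sentencepiece accelerate transformers

To run the following examples with a non-quantized version of the model checkpoint you will need at least 20GB of GPU memory.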
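As a rough sanity check on that figure, the memory taken by the weights alone can be estimated from the parameter count and the bits stored per parameter. The sketch below assumes exactly 9 billion parameters, decimal gigabytes, and 16-bit weights (as with bfloat16) versus 4-bit weights. It ignores activations, the generation cache and any quantization overhead, so actual usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights only, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


params_9b = 9e9  # assumption: a round 9 billion parameters

bf16_gb = weight_memory_gb(params_9b, 16)  # 16-bit weights
int4_gb = weight_memory_gb(params_9b, 4)   # 4-bit quantized weights

print(f"16-bit weights: ~{bf16_gb:.1f} GB, 4-bit weights: ~{int4_gb:.1f} GB")
```

The 16-bit estimate of roughly 18 GB for weights alone is consistent with the 20GB requirement stated above once some working memory is added.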

Loading the model

Let's start by loading the model's 9 billion parameters checkpoint:

>>> checkpoint = "HuggingFaceM4/idefics-9b"

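Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint. The IDEFICS processor wraps a LlamaTokenizer and the IDEFICS image processor into a single processor to take care of preparing text and image inputs for the model.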

>>> import torch

>>> from transformers import IdeficsForVisionText2Text, AutoProcessor

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")

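Setting device_map to "auto" will automatically determine how to load and store the model weights in the most optimized manner given existing devices.

Quantized model

If high-memory GPU availability is an issue, you can load the quantized version of the model. To load the model and the processor in 4-bit precision, pass a BitsAndBytesConfig to the from_pretrained method and the model will be compressed on the fly while loading.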

>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor, BitsAndBytesConfig

>>> quantization_config = BitsAndBytesConfig(
...     load_in_4bit=True,
...     bnb_4bit_compute_dtype=torch.float16,
... )

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(
...     checkpoint,
...     quantization_config=quantization_config,
...     device_map="auto"
... )

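Now that you have the model loaded in one of the suggested ways, let's move on to exploring tasks that you can use IDEFICS for.

Image captioning

Image captioning is the task of predicting a caption for a given image. A common application is to help visually impaired people navigate different situations, for instance, exploring image content online.

To illustrate the task, get an image to be captioned, e.g.:

Photo by Hendo Wang.

IDEFICS accepts text and image prompts. However, to caption an image, you do not have to provide a text prompt to the model, only the preprocessed input image. Without a text prompt, the model will start generating text from the BOS (beginning-of-sequence) token, thus creating a caption.

As image input to the model, you can use either an image object (PIL.Image) or a url from which the image can be retrieved.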

>>> prompt = [
...     "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
A puppy in a flower bed

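It is a good idea to include bad_words_ids in the call to generate to avoid errors that arise when increasing max_new_tokens: the model may try to generate a new <image> or <fake_token_around_image> token even though no image is being generated, which raises an error. You can set it on the fly as in this guide, or store it in the GenerationConfig as described in the Text generation strategies guide.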
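Conceptually, banning a token amounts to making it impossible to select at each step. The following toy sketch shows that idea for single-token bans only, on a hypothetical four-token vocabulary where id 1 plays the role of <image>. The helper name ban_tokens is made up for this illustration, and the real bad_words_ids argument also handles multi-token sequences, which this sketch does not.

```python
import math


def ban_tokens(logits, banned_ids):
    """Return a copy of the logits with banned single-token ids set to -inf."""
    banned = set(banned_ids)
    return [-math.inf if token_id in banned else score
            for token_id, score in enumerate(logits)]


# Hypothetical next-token scores; pretend token id 1 is "<image>".
logits = [0.1, 2.5, 0.3, 1.7]

filtered = ban_tokens(logits, banned_ids=[1])
next_token = max(range(len(filtered)), key=filtered.__getitem__)

print(filtered, next_token)
```

Without the ban, greedy selection would pick id 1; with it, the next best token (id 3) is chosen instead.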

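Prompted image captioning

You can extend image captioning by providing a text prompt, which the model will continue given the image. Let's take another image to illustrate:

Photo by Denys Nevozhai.

Textual and image prompts can be passed to the model's processor as a single list to create appropriate inputs.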

>>> prompt = [
...     "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...     "This is an image of ",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
This is an image of the Eiffel Tower in Paris, France.

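Few-shot prompting

While IDEFICS demonstrates great zero-shot results, your task may require a certain format of the caption, or come with other restrictions or requirements that increase the task's complexity. Few-shot prompting can be used to enable in-context learning. By providing examples in the prompt, you can steer the model to generate results that mimic the format of the given examples.

Let's use the previous image of the Eiffel Tower as an example for the model and build a prompt that shows the model that, in addition to learning what the object in an image is, we would also like to get some interesting information about it. Then, let's see if we can get the same response format for an image of the Statue of Liberty:

Photo by Juan Mayobre.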

>>> prompt = ["User:",
...            "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...            "Describe this image.\nAssistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.\n",
...            "User:",
...            "https://images.unsplash.com/photo-1524099163253-32b7f0256868?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3387&q=80",
...            "Describe this image.\nAssistant:"
...            ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
User: Describe this image.
Assistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building. 
User: Describe this image.
Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall.

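Notice that just from a single example (i.e., 1-shot) the model has learned how to perform the task. For more complex tasks, feel free to experiment with a larger number of examples (e.g., 3-shot, 5-shot, etc.).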
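When moving from 1-shot to several shots, it can help to assemble the interleaved list programmatically. The helper below is a hypothetical convenience function (not part of the library) that follows the same "User: / image / instruction / Assistant:" layout as the prompt above. The example.com URLs are placeholders rather than real images.

```python
def build_few_shot_prompt(examples, query_image, instruction="Describe this image."):
    """Interleave (image_url, answer) examples, ending with an open Assistant turn."""
    prompt = []
    for image_url, answer in examples:
        prompt += ["User:", image_url, f"{instruction}\nAssistant: {answer}\n"]
    prompt += ["User:", query_image, f"{instruction}\nAssistant:"]
    return prompt


# Placeholder URLs; substitute real image URLs or PIL.Image objects.
examples = [
    ("https://example.com/eiffel.jpg",
     "An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building."),
]
prompt = build_few_shot_prompt(examples, "https://example.com/liberty.jpg")

for part in prompt:
    print(repr(part))
```

The resulting list has the same shape as the hand-written prompt and could be passed to the processor in the same way.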

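Visual question answering

Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. Similar to image captioning, it can be used in accessibility applications, but also in education (reasoning about visual materials), customer service (questions about products based on images), and image retrieval.

Let's get a new image for this task:

Photo by Jarritos Mexican Soda.

You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions: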

>>> prompt = [
...     "Instruction: Provide an answer to the question. Use the image to answer.\n",
...     "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Question: Where are these people and what's the weather like? Answer:"
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Provide an answer to the question. Use the image to answer.
 Question: Where are these people and what's the weather like? Answer: They're in a park in New York City, and it's a beautiful day.

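Image classification

IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing labeled examples from those specific categories. Given a list of categories and using its image and text understanding capabilities, the model can infer which category the image likely belongs to.

Say we have this image of a vegetable stand:

Photo by Peter Wendt.

We can instruct the model to classify the image into one of the categories that we have: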

>>> categories = ['animals','vegetables', 'city landscape', 'cars', 'office']
>>> prompt = [f"Instruction: Classify the following image into a single category from the following list: {categories}.\n",
...     "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",    
...     "Category: "
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=6, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Classify the following image into a single category from the following list: ['animals', 'vegetables', 'city landscape', 'cars', 'office'].
Category: Vegetables

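In the example above we instruct the model to classify the image into a single category; however, you can also prompt the model to do rank classification.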
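Because the decoded text echoes the whole prompt, a downstream application usually needs to pull out just the predicted label. The sketch below is one hedged way to do that: extract_category is a hypothetical helper that takes whatever follows the last "Category:" marker and matches it case-insensitively against the known categories by prefix, which is a simplification (a category that is a prefix of another would need stricter matching).

```python
def extract_category(generated_text, categories):
    """Return the first known category that the model's answer starts with, or None."""
    answer = generated_text.rsplit("Category:", 1)[-1].strip().lower()
    for category in categories:
        if answer.startswith(category.lower()):
            return category
    return None


categories = ['animals', 'vegetables', 'city landscape', 'cars', 'office']
generated_text = (
    "Instruction: Classify the following image into a single category from the following list: "
    "['animals', 'vegetables', 'city landscape', 'cars', 'office'].\n"
    "Category: Vegetables"
)

predicted = extract_category(generated_text, categories)
print(predicted)
```

Returning None for unrecognized answers makes it easy to detect cases where the model drifted away from the allowed labels.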

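Image-guided text generation

For more creative applications, you can use image-guided text generation to generate text based on an image. This can be useful for creating product descriptions, ads, descriptions of a scene, etc.

Let's prompt IDEFICS to write a story based on a simple image of a red door:

Photo by Craig Tidball.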

>>> prompt = ["Instruction: Use the image to write a story. \n",
...     "https://images.unsplash.com/photo-1517086822157-2b0358e7684a?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2203&q=80",
...     "Story: \n"]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, num_beams=2, max_new_tokens=200, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0]) 
Instruction: Use the image to write a story. 
 Story: 
Once upon a time, there was a little girl who lived in a house with a red door.  She loved her red door.  It was the prettiest door in the whole world.

One day, the little girl was playing in her yard when she noticed a man standing on her doorstep.  He was wearing a long black coat and a top hat.

The little girl ran inside and told her mother about the man.

Her mother said, “Don’t worry, honey.  He’s just a friendly ghost.”

The little girl wasn’t sure if she believed her mother, but she went outside anyway.

When she got to the door, the man was gone.

The next day, the little girl was playing in her yard again when she noticed the man standing on her doorstep.

He was wearing a long black coat and a top hat.

The little girl ran

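Looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Halloween story about a ghost.

For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help you significantly improve the quality of the generated output. Check out Text generation strategies to learn more.

Running inference in batch mode

All of the earlier sections illustrated IDEFICS for a single example. In a very similar fashion, you can run inference for a batch of examples by passing a list of prompts: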

>>> prompts = [
...     [   "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
... ]

>>> inputs = processor(prompts, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i,t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n") 
0:
This is an image of the Eiffel Tower in Paris, France.

1:
This is an image of a couple on a picnic blanket.

2:
This is an image of a vegetable stand.

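IDEFICS instruct for conversational use

For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub: HuggingFaceM4/idefics-80b-instruct and HuggingFaceM4/idefics-9b-instruct.

These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction fine-tuning datasets, which boosts the downstream performance while making the models more usable in conversational settings.

The use and prompting for conversational use is very similar to using the base models: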

>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor

>>> device = "cuda" if torch.cuda.is_available() else "cpu"

>>> checkpoint = "HuggingFaceM4/idefics-9b-instruct"
>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> prompts = [
...     [
...         "User: What is in this image?",
...         "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
...         "<end_of_utterance>",
...         "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",
...         "\nUser:",
...         "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
...         "And who is that?<end_of_utterance>",
...         "\nAssistant:",
...     ],
... ]

>>> # --batched mode
>>> inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device)
>>> # --single sample mode
>>> # inputs = processor(prompts[0], return_tensors="pt").to(device)

>>> # Generation args
>>> exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i, t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n")

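LLM prompting guide

Original link: huggingface.co/docs/transformers/v4.37.2/en/tasks/prompting

Large Language Models such as Falcon, LLaMA, etc. are pretrained transformer models initially trained to predict the next token given some input text. They typically have billions of parameters and have been trained on trillions of tokens for an extended period of time. As a result, these models become quite powerful and versatile, and you can use them to solve multiple NLP tasks by instructing the models with natural language prompts.

Designing such prompts to ensure the optimal output is often called "prompt engineering". Prompt engineering is an iterative process that requires a fair amount of experimentation. Natural languages are much more flexible and expressive than programming languages; however, they can also introduce some ambiguity. At the same time, prompts in natural language are quite sensitive to changes. Even minor modifications in prompts can lead to wildly different outputs.

While there is no exact recipe for creating prompts to match all cases, researchers have worked out a number of best practices that help to achieve optimal results more consistently.

This guide covers the prompt engineering best practices to help you craft better LLM prompts and solve various NLP tasks. You'll learn:

Basics of prompting
Best practices of LLM prompting
Advanced prompting techniques: few-shot prompting and chain-of-thought
When to fine-tune instead of prompting

Prompt engineering is only a part of the LLM output optimization process. Another essential component is choosing the optimal text generation strategy. You can customize how your LLM selects each of the subsequent tokens when generating the text without modifying any of the trainable parameters. By tweaking the text generation parameters, you can reduce repetition in the generated text and make it more coherent and human-sounding. Text generation strategies and parameters are out of scope for this guide, but you can learn more about these topics in the following guides:

Generation with LLMs
Text generation strategies

Basics of prompting

Types of models

The majority of modern LLMs are decoder-only transformers. Some examples include: LLaMA, Llama2, Falcon, GPT2. However, you may encounter encoder-decoder transformer LLMs as well, for instance, Flan-T5 and BART.

Encoder-decoder-style models are typically used in generative tasks where the output heavily relies on the input, for example, in translation and summarization. The decoder-only models are used for all other types of generative tasks.

When using a pipeline to generate text with an LLM, it's important to know what type of LLM you are using, because they use different pipelines.

Run inference with decoder-only models with the text-generation pipeline: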

>>> from transformers import pipeline
>>> import torch

>>> torch.manual_seed(0)
>>> generator = pipeline('text-generation', model = 'gpt2')
>>> prompt = "Hello, I'm a language model"

>>> generator(prompt, max_length = 30)
[{'generated_text': "Hello, I'm a language model expert, so I'm a big believer in the concept that I know very well and then I try to look into"}]

To run inference with an encoder-decoder, use the text2text-generation pipeline:

>>> text2text_generator = pipeline("text2text-generation", model = 'google/flan-t5-base')
>>> prompt = "Translate from English to French: I'm very happy to see you"

>>> text2text_generator(prompt)
[{'generated_text': 'Je suis très heureuse de vous rencontrer.'}]

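Base vs instruct/chat models

Most of the recent LLM checkpoints available on the 🤗 Hub come in two versions: base and instruct (or chat). For example, tiiuae/falcon-7b and tiiuae/falcon-7b-instruct.

Base models are excellent at completing the text when given an initial prompt; however, they are not ideal for NLP tasks where they need to follow instructions, or for conversational use. This is where the instruct (chat) versions come in. These checkpoints are the result of further fine-tuning of the pre-trained base versions on instructions and conversational data. This additional fine-tuning makes them a better choice for many NLP tasks.

Let's illustrate some simple prompts that you can use with tiiuae/falcon-7b-instruct to solve some common NLP tasks.

NLP tasks

First, let's set up the environment:

pip install -q transformers accelerate

Next, let's load the model with the appropriate pipeline ("text-generation"):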

>>> from transformers import pipeline, AutoTokenizer
>>> import torch

>>> torch.manual_seed(0)
>>> model = "tiiuae/falcon-7b-instruct"

>>> tokenizer = AutoTokenizer.from_pretrained(model)
>>> pipe = pipeline(
...     "text-generation",
...     model=model,
...     tokenizer=tokenizer,
...     torch_dtype=torch.bfloat16,
...     device_map="auto",
... )

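Note that Falcon models were trained using the bfloat16 datatype, so we recommend you use the same. This requires a recent version of CUDA and works best on modern cards.

Now that we have the model loaded via the pipeline, let's explore how you can use prompts to solve NLP tasks.

Text classification

One of the most common forms of text classification is sentiment analysis, which assigns a label like "positive", "negative", or "neutral" to a sequence of text. Let's write a prompt that instructs the model to classify a given text (a movie review). We'll start by giving the instruction, and then specify the text to classify. Note that instead of leaving it at that, we also add the beginning of the response, "Sentiment:":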

>>> torch.manual_seed(0)
>>> prompt = """Classify the text into neutral, negative or positive. 
... Text: This movie is definitely one of my favorite movies of its kind. The interaction between respectable and morally strong characters is an ode to chivalry and the honor code amongst thieves and policemen.
... Sentiment:
... """

>>> sequences = pipe(
...     prompt,
...     max_new_tokens=10,
... )

>>> for seq in sequences:
...     print(f"Result: {seq['generated_text']}")
Result: Classify the text into neutral, negative or positive. 
Text: This movie is definitely one of my favorite movies of its kind. The interaction between respectable and morally strong characters is an ode to chivalry and the honor code amongst thieves and policemen.
Sentiment:
Positive

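As a result, the output contains a classification label from the list we provided in the instructions, and it is the correct one!

You may notice that in addition to the prompt, we pass a max_new_tokens parameter. It controls the number of tokens the model should generate, and it is one of the many text generation parameters that you can learn about in the Text generation strategies guide.

Named entity recognition

Named Entity Recognition (NER) is the task of finding named entities in a piece of text, such as a person, location, or organization. Let's modify the instructions in the prompt to make the LLM perform this task. Here, let's also set return_full_text = False so that the output doesn't contain the prompt: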

>>> torch.manual_seed(1)
>>> prompt = """Return a list of named entities in the text.
... Text: The Golden State Warriors are an American professional basketball team based in San Francisco.
... Named entities:
... """

>>> sequences = pipe(
...     prompt,
...     max_new_tokens=15,
...     return_full_text = False,    
... )

>>> for seq in sequences:
...     print(f"{seq['generated_text']}")
- Golden State Warriors
- San Francisco

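As you can see, the model correctly identified two named entities from the given text.

Translation

Another task LLMs can perform is translation. You can choose to use encoder-decoder models for this task; however, here, for the simplicity of the examples, we'll keep using Falcon-7b-instruct, which does a decent job. Once again, here's how you can write a basic prompt to instruct a model to translate a piece of text from English to Italian: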

>>> torch.manual_seed(2)
>>> prompt = """Translate the English text to Italian.
... Text: Sometimes, I've believed as many as six impossible things before breakfast.
... Translation:
... """

>>> sequences = pipe(
...     prompt,
...     max_new_tokens=20,
...     do_sample=True,
...     top_k=10,
...     return_full_text = False,
... )

>>> for seq in sequences:
...     print(f"{seq['generated_text']}")
A volte, ho creduto a sei impossibili cose prima di colazione.

Here we've added do_sample=True and top_k=10 to allow the model to be a bit more flexible when generating output.
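To illustrate what these two parameters do together, the following toy sketch keeps only the k highest-scoring tokens and then samples among them in proportion to their softmax weights. The function top_k_sample, the five-token vocabulary and the scores are hypothetical. This is a simplified illustration of top-k sampling, not the pipeline's internal code.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one token id from the k highest-scoring tokens."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept tokens (shifted by the max for numerical stability).
    weights = [math.exp(logits[i] - logits[top[0]]) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)                 # seeded for repeatable runs
logits = [3.0, 2.5, 0.1, -1.0, 2.9]    # hypothetical next-token scores

samples = {top_k_sample(logits, k=3, rng=rng) for _ in range(200)}
print(sorted(samples))
```

Only token ids 0, 1 and 4 can ever be drawn here, since the two lowest-scoring tokens are filtered out before sampling.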

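Text summarization

Similar to translation, text summarization is another generative task where the output heavily relies on the input, and encoder-decoder models can be a better choice. However, decoder-style models can be used for this task as well. Previously, we placed the instructions at the very beginning of the prompt. However, the very end of the prompt can also be a suitable location for instructions. Typically, it's better to place the instruction at one of the two ends.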

>>> torch.manual_seed(3)
>>> prompt = """Permaculture is a design process mimicking the diversity, functionality and resilience of natural ecosystems. The principles and practices are drawn from traditional ecological knowledge of indigenous cultures combined with modern scientific understanding and technological innovations. Permaculture design provides a framework helping individuals and communities develop innovative, creative and effective strategies for meeting basic needs while preparing for and mitigating the projected impacts of climate change.
... Write a summary of the above text.
... Summary:
... """

>>> sequences = pipe(
...     prompt,
...     max_new_tokens=30,
...     do_sample=True,
...     top_k=10,
...     return_full_text = False,
... )

>>> for seq in sequences:
...     print(f"{seq['generated_text']}")
Permaculture is an ecological design mimicking natural ecosystems to meet basic needs and prepare for climate change. It is based on traditional knowledge and scientific understanding.

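Question answering

For the question answering task, we can structure the prompt into the following logical components: instructions, context, question, and the leading word or phrase ("Answer:") to nudge the model to start generating the answer: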

>>> torch.manual_seed(4)
>>> prompt = """Answer the question using the context below.
... Context: Gazpacho is a cold soup and drink made of raw, blended vegetables. Most gazpacho includes stale bread, tomato, cucumbers, onion, bell peppers, garlic, olive oil, wine vinegar, water, and salt. Northern recipes often include cumin and/or pimentón (smoked sweet paprika). Traditionally, gazpacho was made by pounding the vegetables in a mortar with a pestle; this more laborious method is still sometimes used as it helps keep the gazpacho cool and avoids the foam and silky consistency of smoothie versions made in blenders or food processors.
... Question: What modern tool is used to make gazpacho?
... Answer:
... """

>>> sequences = pipe(
...     prompt,
...     max_new_tokens=10,
...     do_sample=True,
...     top_k=10,
...     return_full_text = False,
... )

>>> for seq in sequences:
...     print(f"Result: {seq['generated_text']}")
Result: Modern tools are used, such as immersion blenders
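If you need to ask many questions against different contexts, the four components can be assembled by a small helper. The function build_qa_prompt below is a hypothetical convenience wrapper that reproduces the layout of the prompt above. The shortened context is only for illustration.

```python
def build_qa_prompt(context, question,
                    instruction="Answer the question using the context below."):
    """Assemble instruction, context, question and the leading word into one prompt."""
    return f"{instruction}\nContext: {context}\nQuestion: {question}\nAnswer:\n"


prompt = build_qa_prompt(
    context="Gazpacho is a cold soup and drink made of raw, blended vegetables.",
    question="What is gazpacho made of?",
)
print(prompt)
```

Keeping the template in one place also makes it easier to experiment with the wording of the instruction without touching every call site.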

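Reasoning

Reasoning is one of the most difficult tasks for LLMs, and achieving good results often requires applying advanced prompting techniques, like chain-of-thought.

Let's see if we can make a model reason about a simple arithmetic task with a basic prompt: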

>>> torch.manual_seed(5)
>>> prompt = """There are 5 groups of students in the class. Each group has 4 students. How many students are there in the class?"""

>>> sequences = pipe(
...     prompt,
...     max_new_tokens=30,
...     do_sample=True,
...     top_k=10,
...     return_full_text = False,
... )

>>> for seq in sequences:
...     print(f"Result: {seq['generated_text']}")
Result: 
There are a total of 5 groups, so there are 5 x 4=20 students in the class.

Correct! Let's increase the complexity a little and see if we can still get away with a basic prompt:

>>> torch.manual_seed(6)
>>> prompt = """I baked 15 muffins. I ate 2 muffins and gave 5 muffins to a neighbor. My partner then bought 6 more muffins and ate 2. How many muffins do we now have?"""

>>> sequences = pipe(
...     prompt,
...     max_new_tokens=10,
...     do_sample=True,
...     top_k=10,
...     return_full_text = False,
... )

>>> for seq in sequences:
...     print(f"Result: {seq['generated_text']}")
Result: 
The total number of muffins now is 21

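This is a wrong answer; it should be 12. In this case, this can be due to the prompt being too basic, or due to the choice of model: after all, we've picked the smallest version of Falcon. Reasoning is difficult for models of all sizes, but larger models are likely to perform better.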
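The expected answer can be checked by replaying the steps of the word problem directly:

```python
muffins = 15   # baked 15 muffins
muffins -= 2   # I ate 2
muffins -= 5   # gave 5 to a neighbor
muffins += 6   # my partner bought 6 more
muffins -= 2   # my partner ate 2

print(muffins)
```

This prints 12, confirming that the model's answer of 21 is wrong.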

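Best practices of LLM prompting

In this section of the guide we have compiled a list of best practices that tend to improve the prompt results:

When choosing the model to work with, the latest and most capable models are likely to perform better.
Start with a simple and short prompt, and iterate from there.
Put the instructions at the beginning of the prompt, or at the very end. When working with large context, models apply various optimizations to prevent attention complexity from scaling quadratically. This may make a model more attentive to the beginning or end of a prompt than the middle.
Clearly separate instructions from the text they apply to (more on this in the next section).
Be specific and descriptive about the task and the desired outcome: its format, length, style, language, etc.
Avoid ambiguous descriptions and instructions.
Favor instructions that say "what to do" instead of those that say "what not to do".
"Lead" the output in the right direction by writing the first word (or even beginning the first sentence for the model).
Use advanced techniques like few-shot prompting and chain-of-thought.
Test your prompts with different models to assess their robustness.
Version and track the performance of your prompts.

Advanced prompting techniques

Few-shot prompting

The basic prompts in the sections above are examples of "zero-shot" prompts, meaning the model has been given instructions and context, but no examples with solutions. LLMs that have been fine-tuned on instruction datasets generally perform well on such "zero-shot" tasks. However, you may find that your task has more complexity or nuance, and perhaps you have some requirements for the output that the model doesn't pick up from the instructions alone. In this situation, you can try the technique called few-shot prompting.

In few-shot prompting, we provide examples in the prompt, giving the model more context to improve performance. The examples condition the model to generate output that follows the patterns in the examples.

Here's an example: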

>>> torch.manual_seed(0)
>>> prompt = """Text: The first human went into space and orbited the Earth on April 12, 1961.
... Date: 04/12/1961
... Text: The first-ever televised presidential debate in the United States took place on September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon. 
... Date:"""

>>> sequences = pipe(
...     prompt,
...     max_new_tokens=8,
...     do_sample=True,
...     top_k=10,
... )

>>> for seq in sequences:
...     print(f"Result: {seq['generated_text']}")
Result: Text: The first human went into space and orbited the Earth on April 12, 1961.
Date: 04/12/1961
Text: The first-ever televised presidential debate in the United States took place on September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon. 
Date: 09/28/1960

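In the above code snippet we used a single example to demonstrate the desired output to the model, so this can be called "one-shot" prompting. However, depending on the task complexity you may need to use more than one example.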
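Going from one example to several is mostly a matter of string assembly. The helper below, build_few_shot, is a hypothetical sketch that joins any number of (text, label) pairs in the same "Text: / Date:" layout used above and leaves the last label open for the model to complete.

```python
def build_few_shot(examples, query, input_label="Text", output_label="Date"):
    """Join solved examples and end with an unsolved one for the model to complete."""
    blocks = [f"{input_label}: {text}\n{output_label}: {label}" for text, label in examples]
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n".join(blocks)


examples = [
    ("The first human went into space and orbited the Earth on April 12, 1961.", "04/12/1961"),
]
query = (
    "The first-ever televised presidential debate in the United States took place on "
    "September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon."
)

prompt = build_few_shot(examples, query)
print(prompt)
```

Adding more tuples to examples turns this into a 2-shot or 3-shot prompt without changing the rest of the code.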

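Limitations of the few-shot prompting technique:

While LLMs can pick up on the patterns in the examples, these techniques don't work well on complex reasoning tasks.
Few-shot prompting requires creating lengthy prompts. Prompts with a large number of tokens can increase computation and latency. There's also a limit to the length of the prompts.
Sometimes when given a number of examples, models can learn patterns that you didn't intend them to learn, e.g. that the third movie review is always negative.

Chain-of-thought

Chain-of-thought (CoT) prompting is a technique that nudges a model to produce intermediate reasoning steps, thus improving the results on complex reasoning tasks.

There are two ways of steering a model to produce the reasoning steps:

Few-shot prompting, by illustrating examples with detailed answers to questions, showing the model how to work through a problem.
Instructing the model to reason by adding phrases like "Let's think step by step" or "Take a deep breath and work through the problem step by step."

If we apply the CoT technique to the muffins example from the reasoning section and use a larger model, such as tiiuae/falcon-180B-chat, which you can play with in HuggingChat, we'll get a significant improvement in the reasoning result: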

Let's go through this step-by-step:
1. You start with 15 muffins.
2. You eat 2 muffins, leaving you with 13 muffins.
3. You give 5 muffins to your neighbor, leaving you with 8 muffins.
4. Your partner buys 6 more muffins, bringing the total number of muffins to 14.
5. Your partner eats 2 muffins, leaving you with 12 muffins.
Therefore, you now have 12 muffins.

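Prompting vs fine-tuning

You can achieve great results by optimizing your prompts; however, you may still wonder whether fine-tuning a model would work better for your case. Here are some scenarios when fine-tuning a smaller model may be the preferred option:

Your domain is wildly different from what LLMs were pre-trained on and extensive prompt optimization did not yield sufficient results.
You need your model to work well in a low-resource language.
You need the model to be trained on sensitive data that is under strict regulations.
You have to use a small model due to cost, privacy, infrastructure or other limitations.

In all of the above examples, you will need to make sure that you either already have or can easily obtain a large enough domain-specific dataset at a reasonable cost to fine-tune a model. You will also need to have enough time and resources to fine-tune a model.

If the above examples are not the case for you, optimizing prompts can prove to be more beneficial.

Developer guides

Use tokenizers from 🤗 Tokenizers

Original text: huggingface.co/docs/transformers/v4.37.2/en/fast_tokenizers

The PreTrainedTokenizerFast depends on the 🤗 Tokenizers library. The tokenizers obtained from the 🤗 Tokenizers library can be loaded very simply into 🤗 Transformers.

Before getting into the specifics, let's first start by creating a dummy tokenizer in a few lines: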

>>> from tokenizers import Tokenizer
>>> from tokenizers.models import BPE
>>> from tokenizers.trainers import BpeTrainer
>>> from tokenizers.pre_tokenizers import Whitespace

>>> tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
>>> trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

>>> tokenizer.pre_tokenizer = Whitespace()
>>> files = [...]
>>> tokenizer.train(files, trainer)

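We now have a tokenizer trained on the files we defined. We can either continue using it in that runtime, or save it to a JSON file for future re-use.

Loading directly from the tokenizer object

Let's see how to leverage this tokenizer object in the 🤗 Transformers library. The PreTrainedTokenizerFast class allows for easy instantiation, by accepting the instantiated tokenizer object as an argument: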

>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

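This object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head to the tokenizer page for more information.

Loading from a JSON file

In order to load a tokenizer from a JSON file, let's first start by saving our tokenizer: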

>>> tokenizer.save("tokenizer.json")

The path to which we saved this file can be passed to the PreTrainedTokenizerFast initialization method using the tokenizer_file parameter:

>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

This object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head to the tokenizer page for more information.

Originally published: 2024-06-26
