Transformers 4.37 Chinese Documentation (Part 4)

ApacheCN_飞龙

Published on 2024-06-26 14:30:11

Original: huggingface.co/docs/transformers

Audio

Audio classification

Original text: huggingface.co/docs/transformers/v4.37.2/en/tasks/audio_classification

www.youtube-nocookie.com/embed/KWwzcmG98Ds

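Audio classification, like text classification, assigns a class label to the input data. The only difference is that the input is a raw audio waveform rather than text. Practical applications of audio classification include identifying speaker intent, classifying languages, and even identifying animal species by their sounds.

This guide will show you how to:

Fine-tune Wav2Vec2 on the MInDS-14 dataset to classify speaker intent.
Use your fine-tuned model for inference.

The task illustrated in this tutorial is supported by the following model architectures:

Audio Spectrogram Transformer, Data2VecAudio, Hubert, SEW, SEW-D, UniSpeech, UniSpeechSat, Wav2Vec2, Wav2Vec2-BERT, Wav2Vec2-Conformer, WavLM, Whisper

Before you begin, make sure you have all the necessary libraries installed: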

pip install transformers datasets evaluate

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

Load the MInDS-14 dataset

Start by loading the MInDS-14 dataset from the 🤗 Datasets library:

>>> from datasets import load_dataset, Audio

>>> minds = load_dataset("PolyAI/minds14", name="en-US", split="train")

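Split the dataset's train split into a smaller train and test set with the train_test_split method. This gives you a chance to experiment and make sure everything works before spending more time on the full dataset.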

>>> minds = minds.train_test_split(test_size=0.2)

Then take a look at the dataset:

>>> minds
DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 450
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 113
    })
})
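The 450/113 split above follows from the test_size fraction. A minimal sketch of the arithmetic, assuming the test set size is rounded up and the remainder goes to training (an assumption about the rounding that is consistent with the output shown; split_sizes is a hypothetical helper, not a 🤗 Datasets function):

```python
import math


def split_sizes(num_rows: int, test_size: float) -> tuple[int, int]:
    """Return (train_rows, test_rows) for a fractional test_size."""
    # Assumption: the test set is rounded up, the rest is used for training.
    test_rows = math.ceil(num_rows * test_size)
    train_rows = num_rows - test_rows
    return train_rows, test_rows


# 563 = 450 + 113, the total number of rows in the output above.
train_rows, test_rows = split_sizes(563, 0.2)
print(train_rows, test_rows)
```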

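While the dataset contains a lot of useful information, such as lang_id and english_transcription, this guide focuses on audio and intent_class. Remove the other columns with the remove_columns method: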

>>> minds = minds.remove_columns(["path", "transcription", "english_transcription", "lang_id"])

Now take a look at an example:

>>> minds["train"][0]
{'audio': {'array': array([ 0.        ,  0.        ,  0.        , ..., -0.00048828,
         -0.00024414, -0.00024414], dtype=float32),
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602b9a5fbb1e6d0fbce91f52.wav',
  'sampling_rate': 8000},
 'intent_class': 2}

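There are two fields:

audio: a one-dimensional array of the speech signal; accessing this column loads and resamples the audio file.
intent_class: the class ID of the speaker's intent.

To make it easier for the model to get the label name from the label ID, create a dictionary that maps label names to integers and vice versa: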

>>> labels = minds["train"].features["intent_class"].names
>>> label2id, id2label = dict(), dict()
>>> for i, label in enumerate(labels):
...     label2id[label] = str(i)
...     id2label[str(i)] = label

Now you can convert a label ID to a label name:

>>> id2label[str(2)]
'app_error'

Preprocess

The next step is to load a Wav2Vec2 feature extractor to process the audio signal:

>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

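The MInDS-14 dataset has a sampling rate of 8 kHz (8000 Hz; you can find this information in its dataset card), which means you need to resample the dataset to 16 kHz (16000 Hz) to use the pretrained Wav2Vec2 model: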

>>> minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
>>> minds["train"][0]
{'audio': {'array': array([ 2.2098757e-05,  4.6582241e-05, -2.2803260e-05, ...,
         -2.8419291e-04, -2.3305941e-04, -1.1425107e-04], dtype=float32),
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602b9a5fbb1e6d0fbce91f52.wav',
  'sampling_rate': 16000},
 'intent_class': 2}
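Casting the column to 16000 Hz means every clip is resampled when it is accessed, so the same duration is represented by twice as many samples. The sketch below illustrates this with plain linear interpolation. This is a simplification for illustration only and is not the resampler 🤗 Datasets uses; resample_linear is a hypothetical helper.

```python
def resample_linear(samples: list[float], orig_sr: int, target_sr: int) -> list[float]:
    """Resample a waveform by linear interpolation (illustration only)."""
    duration = len(samples) / orig_sr
    n_out = round(duration * target_sr)
    out = []
    for i in range(n_out):
        # Position of output sample i on the input sample grid.
        pos = i * orig_sr / target_sr
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


wave_8k = [0.0, 1.0, 0.0, -1.0]  # 4 samples at 8 kHz
wave_16k = resample_linear(wave_8k, 8000, 16000)
print(len(wave_8k), len(wave_16k))  # same duration, twice as many samples
```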

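Now create a preprocessing function that:

Calls the audio column to load, and if necessary resample, the audio file.
Checks that the sampling rate of the audio file matches the sampling rate of the audio data the model was pretrained with. You can find this information in the Wav2Vec2 model card.
Sets a maximum input length so that longer inputs are truncated to that length and can be batched together.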

>>> def preprocess_function(examples):
...     audio_arrays = [x["array"] for x in examples["audio"]]
...     inputs = feature_extractor(
...         audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True
...     )
...     return inputs
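With max_length=16000 and a 16000 Hz sampling rate, each clip is capped at one second of audio. A minimal sketch of that relationship, assuming truncation keeps the beginning of the clip (truncate is a hypothetical helper and the 3-second clip is made up for illustration):

```python
SAMPLING_RATE = 16_000  # Hz, matches the resampled dataset
MAX_LENGTH = 16_000     # samples, as passed to the feature extractor above


def truncate(samples: list[float], max_length: int) -> list[float]:
    """Keep at most max_length samples from the start of the clip."""
    return samples[:max_length]


clip = [0.0] * (3 * SAMPLING_RATE)  # hypothetical 3-second clip
kept = truncate(clip, MAX_LENGTH)
print(len(kept) / SAMPLING_RATE)    # seconds of audio that remain
```

If the intents in your data are typically stated later than the first second, a larger max_length may be worth trying.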

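To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map function. You can speed up map by setting batched=True to process multiple elements of the dataset at once. Remove the columns you don't need, and rename intent_class to label because that is the name the model expects: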

>>> encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True)
>>> encoded_minds = encoded_minds.rename_column("intent_class", "label")

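Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the accuracy metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):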

>>> import evaluate

>>> accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to compute to calculate the accuracy:

>>> import numpy as np

>>> def compute_metrics(eval_pred):
...     predictions = np.argmax(eval_pred.predictions, axis=1)
...     return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)
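To see what this function computes, here is the same calculation written out in plain Python, without the Evaluate library. The logits and labels are made-up values for illustration, and accuracy_from_logits is a hypothetical name.

```python
def argmax(row: list[float]) -> int:
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits: list[list[float]], label_ids: list[int]) -> dict:
    """Take the highest-scoring class per example and compare with the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, label_ids))
    return {"accuracy": correct / len(label_ids)}


# Hypothetical logits for 4 examples and 3 classes.
logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.3, 0.2, 0.1]]
label_ids = [1, 0, 1, 0]
print(accuracy_from_logits(logits, label_ids))
```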

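Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.

Train

PyTorch

If you aren't familiar with fine-tuning a model with the Trainer, take a look at the basic tutorial here […/training#train-with-pytorch-trainer]!

You're ready to start training your model now! Load Wav2Vec2 with AutoModelForAudioClassification along with the number of expected labels and the label mappings: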

>>> from transformers import AutoModelForAudioClassification, TrainingArguments, Trainer

>>> num_labels = len(id2label)
>>> model = AutoModelForAudioClassification.from_pretrained(
...     "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
... )

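At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. Push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the Trainer evaluates the accuracy and saves the training checkpoint.
Pass the training arguments to the Trainer along with the model, datasets, feature extractor (through the tokenizer argument), and compute_metrics function.
Call train() to fine-tune your model.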

>>> training_args = TrainingArguments(
...     output_dir="my_awesome_mind_model",
...     evaluation_strategy="epoch",
...     save_strategy="epoch",
...     learning_rate=3e-5,
...     per_device_train_batch_size=32,
...     gradient_accumulation_steps=4,
...     per_device_eval_batch_size=32,
...     num_train_epochs=10,
...     warmup_ratio=0.1,
...     logging_steps=10,
...     load_best_model_at_end=True,
...     metric_for_best_model="accuracy",
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=encoded_minds["train"],
...     eval_dataset=encoded_minds["test"],
...     tokenizer=feature_extractor,
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()
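The batch size and gradient accumulation settings interact: gradients from several batches are summed before each optimizer update. The sketch below works through the numbers from the training arguments above. It assumes a single device and gives only a rough estimate of the update count; the number the Trainer reports may differ by rounding.

```python
import math

num_examples = 450               # rows in the train split shown earlier
per_device_train_batch_size = 32
gradient_accumulation_steps = 4
num_train_epochs = 10
num_devices = 1                  # assumption: a single GPU or CPU

# Number of examples that contribute to one optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Rough estimate: one update per `gradient_accumulation_steps` batches.
batches_per_epoch = math.ceil(num_examples / (per_device_train_batch_size * num_devices))
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_updates = updates_per_epoch * num_train_epochs

print(effective_batch_size, updates_per_epoch, total_updates)
```

With so few updates per epoch, logging_steps=10 produces only a handful of log lines over the whole run.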

Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:

>>> trainer.push_to_hub()

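For a more in-depth example of how to fine-tune a model for audio classification, take a look at the corresponding PyTorch notebook.

Inference

Great, now that you've fine-tuned a model, you can use it for inference!

Load an audio file you'd like to run inference on. Remember to resample the audio file to match the sampling rate of the model if you need to!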

>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> audio_file = dataset[0]["audio"]["path"]

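The simplest way to try out your fine-tuned model for inference is to use it in a pipeline(). Instantiate a pipeline for audio classification with your model, and pass your audio file to it: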

>>> from transformers import pipeline

>>> classifier = pipeline("audio-classification", model="stevhliu/my_awesome_minds_model")
>>> classifier(audio_file)
[
    {'score': 0.09766869246959686, 'label': 'cash_deposit'},
    {'score': 0.07998877018690109, 'label': 'app_error'},
    {'score': 0.0781070664525032, 'label': 'joint_account'},
    {'score': 0.07667109370231628, 'label': 'pay_bill'},
    {'score': 0.0755252093076706, 'label': 'balance'}
]

You can also manually replicate the results of the pipeline if you'd like:

PyTorch

Load a feature extractor to preprocess the audio file and return the input as PyTorch tensors:

>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("stevhliu/my_awesome_minds_model")
>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

Pass your inputs to the model and return the logits:

>>> import torch
>>> from transformers import AutoModelForAudioClassification

>>> model = AutoModelForAudioClassification.from_pretrained("stevhliu/my_awesome_minds_model")
>>> with torch.no_grad():
...     logits = model(**inputs).logits

Get the class with the highest probability, and use the model's id2label mapping to convert it to a label:

>>> predicted_class_ids = torch.argmax(logits).item()
>>> predicted_label = model.config.id2label[predicted_class_ids]
>>> predicted_label
'cash_deposit'
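The scores reported by the pipeline and the argmax over the logits are two views of the same output: a softmax turns logits into probabilities without changing which class ranks highest. A minimal sketch in plain Python, using made-up logits and a hypothetical three-label subset of id2label:

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert logits to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values; a real model has one logit per intent class.
id2label = {0: "balance", 1: "cash_deposit", 2: "app_error"}
logits = [0.2, 1.3, 0.9]

probs = softmax(logits)
predicted_id = max(range(len(probs)), key=probs.__getitem__)
print(id2label[predicted_id], round(probs[predicted_id], 3))
```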

Automatic speech recognition

Original link: huggingface.co/docs/transformers/v4.37.2/en/tasks/asr

www.youtube-nocookie.com/embed/TksaY_FDgnk

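Automatic speech recognition (ASR) converts a speech signal to text, mapping a sequence of audio inputs to text outputs. Virtual assistants like Siri and Alexa use ASR models to help users every day, and there are many other useful user-facing applications such as live captioning and note-taking during meetings.

This guide will show you how to:

Fine-tune Wav2Vec2 on the MInDS-14 dataset to transcribe audio to text.
Use your fine-tuned model for inference.

The task illustrated in this tutorial is supported by the following model architectures:

Data2VecAudio, Hubert, M-CTC-T, SEW, SEW-D, UniSpeech, UniSpeechSat, Wav2Vec2, Wav2Vec2-BERT, Wav2Vec2-Conformer, WavLM

Before you begin, make sure you have all the necessary libraries installed: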

pip install transformers datasets evaluate jiwer

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

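Load the MInDS-14 dataset

Start by loading a smaller subset of the MInDS-14 dataset from the 🤗 Datasets library. This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.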

>>> from datasets import load_dataset, Audio

>>> minds = load_dataset("PolyAI/minds14", name="en-US", split="train[:100]")

Split the dataset's train split into a train and test set with the train_test_split method:

>>> minds = minds.train_test_split(test_size=0.2)

Then take a look at the dataset:

>>> minds
DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 80
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 20
    })
})

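While the dataset contains a lot of useful information, such as lang_id and english_transcription, this guide focuses on audio and transcription. Remove the other columns with the remove_columns method: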

>>> minds = minds.remove_columns(["english_transcription", "intent_class", "lang_id"])

Take a look at the example again:

>>> minds["train"][0]
{'audio': {'array': array([-0.00024414,  0.        ,  0.        , ...,  0.00024414,
          0.00024414,  0.00024414], dtype=float32),
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
  'sampling_rate': 8000},
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
 'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"}

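There are two fields:

audio: a one-dimensional array of the speech signal; accessing this column loads and resamples the audio file.
transcription: the target text.

Preprocess

The next step is to load a Wav2Vec2 processor to process the audio signal: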

>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base")

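The MInDS-14 dataset has a sampling rate of 8 kHz (8000 Hz; you can find this information in its dataset card), which means you need to resample the dataset to 16 kHz (16000 Hz) to use the pretrained Wav2Vec2 model: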

>>> minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
>>> minds["train"][0]
{'audio': {'array': array([-2.38064706e-04, -1.58618059e-04, -5.43987835e-06, ...,
          2.78103951e-04,  2.38446111e-04,  1.18740834e-04], dtype=float32),
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
  'sampling_rate': 16000},
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
 'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"}

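As you can see in the transcription above, the text contains a mix of uppercase and lowercase characters. The Wav2Vec2 tokenizer is only trained on uppercase characters, so you need to make sure the text matches the tokenizer's vocabulary: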

>>> def uppercase(example):
...     return {"transcription": example["transcription"].upper()}

>>> minds = minds.map(uppercase)

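Now create a preprocessing function that:

Calls the audio column to load and resample the audio file.
Extracts the input_values from the audio file and tokenizes the transcription column with the processor.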

>>> def prepare_dataset(batch):
...     audio = batch["audio"]
...     batch = processor(audio["array"], sampling_rate=audio["sampling_rate"], text=batch["transcription"])
...     batch["input_length"] = len(batch["input_values"][0])
...     return batch

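To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map function. You can speed up map by increasing the number of processes with the num_proc parameter. Remove the columns you don't need with the remove_columns method: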

>>> encoded_minds = minds.map(prepare_dataset, remove_columns=minds.column_names["train"], num_proc=4)

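🤗 Transformers doesn't have a data collator for ASR, so you need to adapt DataCollatorWithPadding to create a batch of examples. It also dynamically pads your text and labels to the length of the longest element in its batch (instead of the entire dataset) so they have a uniform length. While it is possible to pad your text in the tokenizer function by setting padding=True, dynamic padding is more efficient.

Unlike other data collators, this specific data collator needs to apply a different padding method to input_values and labels: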

>>> import torch

>>> from dataclasses import dataclass, field
>>> from typing import Any, Dict, List, Optional, Union

>>> @dataclass
... class DataCollatorCTCWithPadding:
...     processor: AutoProcessor
...     padding: Union[bool, str] = "longest"

...     def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
...         # split inputs and labels since they have to be of different lengths and need
...         # different padding methods
...         input_features = [{"input_values": feature["input_values"][0]} for feature in features]
...         label_features = [{"input_ids": feature["labels"]} for feature in features]

...         batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")

...         labels_batch = self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt")

...         # replace padding with -100 to ignore loss correctly
...         labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

...         batch["labels"] = labels

...         return batch

Now instantiate your DataCollatorCTCWithPadding:

>>> data_collator = DataCollatorCTCWithPadding(processor=processor, padding="longest")
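The key step in the collator is replacing label padding with -100 so that padded positions do not contribute to the loss. The sketch below shows that step in isolation with plain Python lists. pad_labels is a hypothetical helper and the token IDs are made up; the collator above does the same thing on tensors through the attention mask.

```python
IGNORE_INDEX = -100  # label value that the loss ignores


def pad_labels(label_seqs: list[list[int]], pad_value: int = IGNORE_INDEX) -> list[list[int]]:
    """Pad every label sequence to the longest one in the batch."""
    longest = max(len(seq) for seq in label_seqs)
    return [seq + [pad_value] * (longest - len(seq)) for seq in label_seqs]


batch_labels = [[7, 4, 11], [5], [9, 2]]  # hypothetical token IDs
print(pad_labels(batch_labels))
```

Padding to the longest sequence in each batch, rather than in the whole dataset, is what makes the padding dynamic.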

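Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the word error rate (WER) metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):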

>>> import evaluate

>>> wer = evaluate.load("wer")

Then create a function that passes your predictions and labels to compute to calculate the WER:

>>> import numpy as np

>>> def compute_metrics(pred):
...     pred_logits = pred.predictions
...     pred_ids = np.argmax(pred_logits, axis=-1)

...     pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

...     pred_str = processor.batch_decode(pred_ids)
...     label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

...     # Use a new name: assigning to `wer` here would shadow the global metric object.
...     wer_score = wer.compute(predictions=pred_str, references=label_str)

...     return {"wer": wer_score}
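To make the metric concrete: WER is the word-level edit distance (substitutions, deletions, and insertions) between the prediction and the reference, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the implementation used by the Evaluate library; the two sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one missing word and one misspelled word out of five.
print(word_error_rate("SET UP A JOINT ACCOUNT", "SET UP JOINT ACOUNT"))
```

Lower is better, which is why the training arguments later set greater_is_better=False for this metric.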

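Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.

Train

PyTorch

If you aren't familiar with fine-tuning a model with the Trainer, take a look at the basic tutorial here [training#train-with-pytorch-trainer]!

You're ready to start training your model now! Load Wav2Vec2 with AutoModelForCTC. Specify the reduction to apply with the ctc_loss_reduction parameter. It is often better to use the average instead of the default summation: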

>>> from transformers import AutoModelForCTC, TrainingArguments, Trainer

>>> model = AutoModelForCTC.from_pretrained(
...     "facebook/wav2vec2-base",
...     ctc_loss_reduction="mean",
...     pad_token_id=processor.tokenizer.pad_token_id,
... )

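At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. Push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At each evaluation step, the Trainer evaluates the WER and saves the training checkpoint.
Pass the training arguments to the Trainer along with the model, datasets, processor (through the tokenizer argument), data collator, and compute_metrics function.
Call train() to fine-tune your model.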

>>> training_args = TrainingArguments(
...     output_dir="my_awesome_asr_mind_model",
...     per_device_train_batch_size=8,
...     gradient_accumulation_steps=2,
...     learning_rate=1e-5,
...     warmup_steps=500,
...     max_steps=2000,
...     gradient_checkpointing=True,
...     fp16=True,
...     group_by_length=True,
...     evaluation_strategy="steps",
...     per_device_eval_batch_size=8,
...     save_steps=1000,
...     eval_steps=1000,
...     logging_steps=25,
...     load_best_model_at_end=True,
...     metric_for_best_model="wer",
...     greater_is_better=False,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=encoded_minds["train"],
...     eval_dataset=encoded_minds["test"],
...     tokenizer=processor,
...     data_collator=data_collator,
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()

Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:

>>> trainer.push_to_hub()

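For a more in-depth example of how to fine-tune a model for automatic speech recognition, take a look at this blog post for English ASR and this post for multilingual ASR.

Inference

Great, now that you've fine-tuned a model, you can use it for inference!

Load an audio file you'd like to run inference on. Remember to resample the audio file to match the sampling rate of the model if you need to!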

>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> audio_file = dataset[0]["audio"]["path"]

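The simplest way to try out your fine-tuned model for inference is to use it in a pipeline(). Instantiate a pipeline for automatic speech recognition with your model, and pass your audio file to it: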

>>> from transformers import pipeline

>>> transcriber = pipeline("automatic-speech-recognition", model="stevhliu/my_awesome_asr_minds_model")
>>> transcriber(audio_file)
{'text': 'I WOUD LIKE O SET UP JOINT ACOUNT WTH Y PARTNER'}

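The transcription is decent, but it could be better! Try fine-tuning your model on more examples to get even better results!

You can also manually replicate the results of the pipeline if you'd like:

PyTorch

Load a processor to preprocess the audio file and transcription and return the input as PyTorch tensors: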

>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("stevhliu/my_awesome_asr_mind_model")
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

Pass your inputs to the model and return the logits:

>>> import torch
>>> from transformers import AutoModelForCTC

>>> model = AutoModelForCTC.from_pretrained("stevhliu/my_awesome_asr_mind_model")
>>> with torch.no_grad():
...     logits = model(**inputs).logits

Get the predicted input_ids with the highest probability, and use the processor to decode the predicted input_ids back into text:

>>> predicted_ids = torch.argmax(logits, dim=-1)
>>> transcription = processor.batch_decode(predicted_ids)
>>> transcription
['I WOUL LIKE O SET UP JOINT ACOUNT WTH Y PARTNER']
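The decode step does more than map IDs to characters: a CTC model emits one token per audio frame, so consecutive repeats are merged and blank tokens are dropped. The sketch below illustrates that greedy decoding rule on made-up frame outputs. It assumes that the pad token acts as the CTC blank and that "|" marks word boundaries; both are assumptions about the tokenizer, and ctc_greedy_decode is a hypothetical helper rather than the processor's own code.

```python
BLANK = "<pad>"        # assumption: the pad token doubles as the CTC blank
WORD_DELIMITER = "|"   # assumption: "|" separates words in the vocabulary


def ctc_greedy_decode(frame_tokens: list[str]) -> str:
    """Merge consecutive repeats, drop blanks, and join the rest into text."""
    collapsed = []
    previous = None
    for token in frame_tokens:
        if token != previous:  # merge runs of the same token
            collapsed.append(token)
        previous = token
    chars = [t for t in collapsed if t != BLANK]  # blanks are removed after merging
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# Hypothetical per-frame argmax tokens; the blank between the two L's keeps them distinct.
frames = ["H", "H", "<pad>", "I", "|", "|", "A", "<pad>", "L", "<pad>", "L"]
print(ctc_greedy_decode(frames))
```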

Computer vision

Image classification

Original text: huggingface.co/docs/transformers/v4.37.2/en/tasks/image_classification

www.youtube-nocookie.com/embed/tjAIM7BOYhw

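Image classification assigns a label or class to an image. Unlike text or audio classification, the inputs are the pixel values that make up an image. Image classification has many applications, such as detecting damage after a natural disaster, monitoring crop health, or helping screen medical images for signs of disease.

This guide illustrates how to:

Fine-tune ViT on the Food-101 dataset to classify a food item in an image.
Use your fine-tuned model for inference.

The task illustrated in this tutorial is supported by the following model architectures:

BEiT, BiT, ConvNeXT, ConvNeXTV2, CvT, Data2VecVision, DeiT, DiNAT, DINOv2, EfficientFormer, EfficientNet, FocalNet, ImageGPT, LeViT, MobileNetV1, MobileNetV2, MobileViT, MobileViTV2, NAT, Perceiver, PoolFormer, PVT, RegNet, ResNet, SegFormer, SwiftFormer, Swin Transformer, Swin Transformer V2, VAN, ViT, ViT Hybrid, ViTMSN

Before you begin, make sure you have all the necessary libraries installed: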

pip install transformers datasets evaluate

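We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in: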

>>> from huggingface_hub import notebook_login

>>> notebook_login()

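Load the Food-101 dataset

Start by loading a smaller subset of the Food-101 dataset from the 🤗 Datasets library. This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.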

>>> from datasets import load_dataset

>>> food = load_dataset("food101", split="train[:5000]")

Split the dataset's train split into a train and test set with the train_test_split method:

>>> food = food.train_test_split(test_size=0.2)

Then take a look at an example:

>>> food["train"][0]
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F52AFC8AC50>,
 'label': 79}

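Each example in the dataset has two fields:

image: a PIL image of the food item
label: the label class of the food item

To make it easier for the model to get the label name from the label ID, create a dictionary that maps label names to integers and vice versa: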

>>> labels = food["train"].features["label"].names
>>> label2id, id2label = dict(), dict()
>>> for i, label in enumerate(labels):
...     label2id[label] = str(i)
...     id2label[str(i)] = label

Now you can convert a label ID to a label name:

>>> id2label[str(79)]
'prime_rib'

Preprocess

The next step is to load a ViT image processor to process the image into a tensor:

>>> from transformers import AutoImageProcessor

>>> checkpoint = "google/vit-base-patch16-224-in21k"
>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)

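PyTorch

Apply some image transformations to the images to make the model more robust against overfitting. Here you'll use torchvision's transforms module, but you can also use any image library you like.

Crop a random part of the image, resize it, and normalize it with the image mean and standard deviation: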

>>> from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor

>>> normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
>>> size = (
...     image_processor.size["shortest_edge"]
...     if "shortest_edge" in image_processor.size
...     else (image_processor.size["height"], image_processor.size["width"])
... )
>>> _transforms = Compose([RandomResizedCrop(size), ToTensor(), normalize])
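The normalization step rescales each channel as (value - mean) / std, applied to pixel values that ToTensor has already scaled to the range 0 to 1. A minimal sketch of that arithmetic for a single value, assuming a mean and standard deviation of 0.5 (hypothetical values for illustration; in practice they come from image_processor.image_mean and image_processor.image_std):

```python
def normalize_pixel(value: float, mean: float, std: float) -> float:
    """Normalize one channel value that has already been scaled to [0, 1]."""
    return (value - mean) / std


mean, std = 0.5, 0.5  # hypothetical; read the real values from the image processor
for value in (0.0, 0.5, 1.0):
    print(value, "->", normalize_pixel(value, mean, std))
```

With these particular values, the 0 to 1 range is mapped onto -1 to 1.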

Then create a preprocessing function to apply the transforms and return the pixel_values, which are the model's inputs for the image:

>>> def transforms(examples):
...     examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
...     del examples["image"]
...     return examples

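To apply the preprocessing function over the entire dataset, use the 🤗 Datasets with_transform method. The transforms are applied on the fly when you load an element of the dataset: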

>>> food = food.with_transform(transforms)

Now create a batch of examples using DefaultDataCollator. Unlike other data collators in 🤗 Transformers, the DefaultDataCollator does not apply additional preprocessing such as padding.

>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator()

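TensorFlow

To avoid overfitting and to make the model more robust, add some data augmentation to the training part of the dataset. Here we use Keras preprocessing layers to define the transformations for the training data (including data augmentation) and the transformations for the validation data (only center cropping, resizing, and normalizing). You can use tf.image or any other library you prefer.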

>>> from tensorflow import keras
>>> from tensorflow.keras import layers

>>> size = (image_processor.size["height"], image_processor.size["width"])

>>> train_data_augmentation = keras.Sequential(
...     [
...         layers.RandomCrop(size[0], size[1]),
...         layers.Rescaling(scale=1.0 / 127.5, offset=-1),
...         layers.RandomFlip("horizontal"),
...         layers.RandomRotation(factor=0.02),
...         layers.RandomZoom(height_factor=0.2, width_factor=0.2),
...     ],
...     name="train_data_augmentation",
... )

>>> val_data_augmentation = keras.Sequential(
...     [
...         layers.CenterCrop(size[0], size[1]),
...         layers.Rescaling(scale=1.0 / 127.5, offset=-1),
...     ],
...     name="val_data_augmentation",
... )

Next, create functions to apply the appropriate transformations to a batch of images, instead of one image at a time.

>>> import numpy as np
>>> import tensorflow as tf
>>> from PIL import Image

>>> def convert_to_tf_tensor(image: Image):
...     np_image = np.array(image)
...     tf_image = tf.convert_to_tensor(np_image)
...     # `expand_dims()` is used to add a batch dimension since
...     # the TF augmentation layers operates on batched inputs.
...     return tf.expand_dims(tf_image, 0)

>>> def preprocess_train(example_batch):
...     """Apply train_transforms across a batch."""
...     images = [
...         train_data_augmentation(convert_to_tf_tensor(image.convert("RGB"))) for image in example_batch["image"]
...     ]
...     example_batch["pixel_values"] = [tf.transpose(tf.squeeze(image)) for image in images]
...     return example_batch

>>> def preprocess_val(example_batch):
...     """Apply val_transforms across a batch."""
...     images = [
...         val_data_augmentation(convert_to_tf_tensor(image.convert("RGB"))) for image in example_batch["image"]
...     ]
...     example_batch["pixel_values"] = [tf.transpose(tf.squeeze(image)) for image in images]
...     return example_batch

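Use 🤗 Datasets set_transform to apply the transformations on the fly:

>>> food["train"].set_transform(preprocess_train)
>>> food["test"].set_transform(preprocess_val)

As a final preprocessing step, create a batch of examples using DefaultDataCollator. Unlike other data collators in 🤗 Transformers, the DefaultDataCollator does not apply additional preprocessing such as padding.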

>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator(return_tensors="tf")

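Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the accuracy metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):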

>>> import evaluate

>>> accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to compute to calculate the accuracy:

>>> import numpy as np

>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     predictions = np.argmax(predictions, axis=1)
...     return accuracy.compute(predictions=predictions, references=labels)

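Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.

Train

PyTorch

If you aren't familiar with fine-tuning a model with the Trainer, take a look at the basic tutorial here!

You're ready to start training your model now! Load ViT with AutoModelForImageClassification. Specify the number of expected labels along with the label mappings: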

>>> from transformers import AutoModelForImageClassification, TrainingArguments, Trainer

>>> model = AutoModelForImageClassification.from_pretrained(
...     checkpoint,
...     num_labels=len(labels),
...     id2label=id2label,
...     label2id=label2id,
... )

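At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments. It is important that you don't remove unused columns, because that would drop the image column. Without the image column, you can't create pixel_values. Set remove_unused_columns=False to prevent this behavior! The only other required parameter is output_dir, which specifies where to save your model. Push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the Trainer evaluates the accuracy and saves the training checkpoint.
Pass the training arguments to the Trainer along with the model, datasets, image processor (through the tokenizer argument), data collator, and compute_metrics function.
Call train() to fine-tune your model.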

>>> training_args = TrainingArguments(
...     output_dir="my_awesome_food_model",
...     remove_unused_columns=False,
...     evaluation_strategy="epoch",
...     save_strategy="epoch",
...     learning_rate=5e-5,
...     per_device_train_batch_size=16,
...     gradient_accumulation_steps=4,
...     per_device_eval_batch_size=16,
...     num_train_epochs=3,
...     warmup_ratio=0.1,
...     logging_steps=10,
...     load_best_model_at_end=True,
...     metric_for_best_model="accuracy",
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     data_collator=data_collator,
...     train_dataset=food["train"],
...     eval_dataset=food["test"],
...     tokenizer=image_processor,
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()

Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:

>>> trainer.push_to_hub()

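TensorFlow

If you are unfamiliar with fine-tuning a model with Keras, check out the basic tutorial first!

To fine-tune a model in TensorFlow, follow these steps:

Define the training hyperparameters, and set up an optimizer and a learning rate schedule.
Instantiate a pretrained model.
Convert a 🤗 Dataset to a tf.data.Dataset.
Compile your model.
Add callbacks and use the fit() method to run the training.
Upload your model to the 🤗 Hub to share with the community.

Start by defining the hyperparameters, optimizer, and learning rate schedule: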

>>> from transformers import create_optimizer

>>> batch_size = 16
>>> num_epochs = 5
>>> num_train_steps = len(food["train"]) * num_epochs
>>> learning_rate = 3e-5
>>> weight_decay_rate = 0.01

>>> optimizer, lr_schedule = create_optimizer(
...     init_lr=learning_rate,
...     num_train_steps=num_train_steps,
...     weight_decay_rate=weight_decay_rate,
...     num_warmup_steps=0,
... )
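The returned schedule lowers the learning rate as training progresses. The sketch below shows a linear decay from the initial rate to zero over num_train_steps with no warmup. It assumes that create_optimizer's default schedule behaves this way, which is an assumption about the library defaults rather than something stated in this guide; linear_decay_lr is a hypothetical helper.

```python
def linear_decay_lr(step: int, init_lr: float, num_train_steps: int) -> float:
    """Learning rate after `step` updates under linear decay to zero, no warmup."""
    progress = min(step, num_train_steps) / num_train_steps
    return init_lr * (1.0 - progress)


init_lr = 3e-5
# len(food["train"]) * num_epochs, with 4000 training images (80% of 5000) and 5 epochs.
num_train_steps = 4000 * 5

for step in (0, 10_000, 20_000):
    print(step, linear_decay_lr(step, init_lr, num_train_steps))
```

Note that num_train_steps here counts examples times epochs, while the training log in this section shows 250 batches per epoch at batch_size=16. Under the assumption above, training would therefore cover only the first 1250 steps of the schedule, so the learning rate would decay only slightly.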

Then, load ViT with TFAutoModelForImageClassification along with the label mappings:

>>> from transformers import TFAutoModelForImageClassification

>>> model = TFAutoModelForImageClassification.from_pretrained(
...     checkpoint,
...     id2label=id2label,
...     label2id=label2id,
... )

Convert your datasets to the tf.data.Dataset format using to_tf_dataset and your data_collator:

>>> # converting our train dataset to tf.data.Dataset
>>> tf_train_dataset = food["train"].to_tf_dataset(
...     columns="pixel_values", label_cols="label", shuffle=True, batch_size=batch_size, collate_fn=data_collator
... )

>>> # converting our test dataset to tf.data.Dataset
>>> tf_eval_dataset = food["test"].to_tf_dataset(
...     columns="pixel_values", label_cols="label", shuffle=True, batch_size=batch_size, collate_fn=data_collator
... )

Configure the model for training with compile():

>>> from tensorflow.keras.losses import SparseCategoricalCrossentropy

>>> loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
>>> model.compile(optimizer=optimizer, loss=loss)

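To compute the accuracy from the predictions and push your model to the 🤗 Hub, use Keras callbacks. Pass your compute_metrics function to KerasMetricCallback, and use the PushToHubCallback to upload the model: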

>>> from transformers.keras_callbacks import KerasMetricCallback, PushToHubCallback

>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_eval_dataset)
>>> push_to_hub_callback = PushToHubCallback(
...     output_dir="food_classifier",
...     tokenizer=image_processor,
...     save_strategy="no",
... )
>>> callbacks = [metric_callback, push_to_hub_callback]

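Finally, you are ready to train your model! Call fit() with your training and validation datasets, the number of epochs, and your callbacks to fine-tune the model: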

>>> model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=num_epochs, callbacks=callbacks)
Epoch 1/5
250/250 [==============================] - 313s 1s/step - loss: 2.5623 - val_loss: 1.4161 - accuracy: 0.9290
Epoch 2/5
250/250 [==============================] - 265s 1s/step - loss: 0.9181 - val_loss: 0.6808 - accuracy: 0.9690
Epoch 3/5
250/250 [==============================] - 252s 1s/step - loss: 0.3910 - val_loss: 0.4303 - accuracy: 0.9820
Epoch 4/5
250/250 [==============================] - 251s 1s/step - loss: 0.2028 - val_loss: 0.3191 - accuracy: 0.9900
Epoch 5/5
250/250 [==============================] - 238s 949ms/step - loss: 0.1232 - val_loss: 0.3259 - accuracy: 0.9890

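Congratulations! You have fine-tuned your model and shared it on the 🤗 Hub. You can now use it for inference!

For a more in-depth example of how to fine-tune a model for image classification, take a look at the corresponding PyTorch notebook.

Inference

Great, now that you've fine-tuned a model, you can use it for inference!

Load an image you'd like to run inference on: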

>>> ds = load_dataset("food101", split="validation[:10]")
>>> image = ds["image"][0]

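The simplest way to try out your fine-tuned model for inference is to use it in a pipeline(). Instantiate a pipeline for image classification with your model, and pass your image to it: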

>>> from transformers import pipeline

>>> classifier = pipeline("image-classification", model="my_awesome_food_model")
>>> classifier(image)
[{'score': 0.31856709718704224, 'label': 'beignets'},
 {'score': 0.015232225880026817, 'label': 'bruschetta'},
 {'score': 0.01519392803311348, 'label': 'chicken_wings'},
 {'score': 0.013022331520915031, 'label': 'pork_chop'},
 {'score': 0.012728818692266941, 'label': 'prime_rib'}]

You can also manually replicate the results of the pipeline if you'd like:

PyTorch

Load an image processor to preprocess the image and return the input as PyTorch tensors:

>>> from transformers import AutoImageProcessor
>>> import torch

>>> image_processor = AutoImageProcessor.from_pretrained("my_awesome_food_model")
>>> inputs = image_processor(image, return_tensors="pt")

Pass your inputs to the model and return the logits:

>>> from transformers import AutoModelForImageClassification

>>> model = AutoModelForImageClassification.from_pretrained("my_awesome_food_model")
>>> with torch.no_grad():
...     logits = model(**inputs).logits

Get the predicted label with the highest probability, and use the model's id2label mapping to convert it to a label:

>>> predicted_label = logits.argmax(-1).item()
>>> model.config.id2label[predicted_label]
'beignets'

TensorFlow

Load an image processor to preprocess the image and return the input as TensorFlow tensors:

>>> from transformers import AutoImageProcessor

>>> image_processor = AutoImageProcessor.from_pretrained("MariaK/food_classifier")
>>> inputs = image_processor(image, return_tensors="tf")

Pass your inputs to the model and return the logits:

>>> from transformers import TFAutoModelForImageClassification

>>> model = TFAutoModelForImageClassification.from_pretrained("MariaK/food_classifier")
>>> logits = model(**inputs).logits

Get the predicted label with the highest probability, and use the model's id2label mapping to convert it to a label:

>>> predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
>>> model.config.id2label[predicted_class_id]
'beignets'

Image segmentation

Original text: huggingface.co/docs/transformers/v4.37.2/en/tasks/semantic_segmentation

www.youtube-nocookie.com/embed/dKE8SIt9C-w

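Image segmentation models separate the areas of an image that correspond to different regions of interest. These models work by assigning a label to each pixel. There are several types of segmentation: semantic segmentation, instance segmentation, and panoptic segmentation.

In this guide, we will:

Take a look at the different types of segmentation.
Walk through an end-to-end fine-tuning example for semantic segmentation.

Before you begin, make sure you have all the necessary libraries installed: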

pip install -q datasets transformers evaluate

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

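Types of segmentation

Semantic segmentation assigns a label or class to every single pixel in an image. Let's take a look at the output of a semantic segmentation model. It assigns the same class to every instance of an object it comes across in an image; for example, all cats are labeled "cat" instead of "cat-1", "cat-2". We can use the transformers image segmentation pipeline to quickly run inference with a semantic segmentation model. Let's take a look at the example image.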

from transformers import pipeline
from PIL import Image
import requests

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/segmentation_input.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image

We will use nvidia/segformer-b1-finetuned-cityscapes-1024-1024.

semantic_segmentation = pipeline("image-segmentation", "nvidia/segformer-b1-finetuned-cityscapes-1024-1024")
results = semantic_segmentation(image)
results

The segmentation pipeline output includes a mask for every predicted class.

[{'score': None,
  'label': 'road',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': None,
  'label': 'sidewalk',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': None,
  'label': 'building',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': None,
  'label': 'wall',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': None,
  'label': 'pole',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': None,
  'label': 'traffic sign',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': None,
  'label': 'vegetation',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': None,
  'label': 'terrain',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': None,
  'label': 'sky',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': None,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=612x415>}]

Taking a look at the mask for the car class, we can see that every car is covered by the same mask.

results[-1]["mask"]

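In instance segmentation, the goal is not to classify every pixel, but to predict a mask for every instance of an object in a given image. It works very similarly to object detection, where every instance has a bounding box; here every instance has a segmentation mask instead. We will use facebook/mask2former-swin-large-cityscapes-instance.

instance_segmentation = pipeline("image-segmentation", "facebook/mask2former-swin-large-cityscapes-instance")
# `image` is already a PIL image, so pass it directly instead of calling Image.open again.
results = instance_segmentation(image)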
results

As you can see below, multiple cars are classified, and no pixels are classified other than those belonging to car and person instances.

[{'score': 0.999944,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.999945,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.999652,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.903529,
  'label': 'person',
  'mask': <PIL.Image.Image image mode=L size=612x415>}]
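Because instance segmentation returns one entry per detected object, counting entries per label gives the number of instances. The following is a minimal sketch using the labels and scores from the output above (masks omitted); the `count_instances` helper and its `min_score` parameter are hypothetical names introduced for illustration, not part of the pipeline API.

```python
from collections import Counter

# Labels and scores copied from the instance-segmentation output shown above;
# the PIL masks are omitted because only the labels are needed for counting.
results = [
    {"score": 0.999944, "label": "car"},
    {"score": 0.999945, "label": "car"},
    {"score": 0.999652, "label": "car"},
    {"score": 0.903529, "label": "person"},
]


def count_instances(predictions, min_score=0.5):
    """Count predicted instances per label, skipping low-confidence ones."""
    return Counter(p["label"] for p in predictions if p["score"] >= min_score)


instance_counts = count_instances(results)
print(instance_counts)
```

Raising `min_score` is one simple way to drop uncertain detections before counting.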

Check out one of the car masks below.

results[2]["mask"]

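Panoptic segmentation combines semantic segmentation and instance segmentation: every pixel is classified into a class and an instance of that class, and there are multiple masks for each instance of a class. We can use facebook/mask2former-swin-large-cityscapes-panoptic.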

panoptic_segmentation = pipeline("image-segmentation", "facebook/mask2former-swin-large-cityscapes-panoptic")
results = panoptic_segmentation(image)  # `image` is already a PIL image, so it is passed directly
results

As you can see below, we have more classes. We will later illustrate that every pixel is classified into one of these classes.

[{'score': 0.999981,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.999958,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.99997,
  'label': 'vegetation',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.999575,
  'label': 'pole',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.999958,
  'label': 'building',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.999634,
  'label': 'road',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.996092,
  'label': 'sidewalk',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.999221,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.99987,
  'label': 'sky',
  'mask': <PIL.Image.Image image mode=L size=612x415>}]
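One way to check the claim that every pixel receives a class is to combine all returned masks and measure how much of the image they cover. The sketch below uses small synthetic arrays in place of the pipeline's PIL masks and assumes that pixels inside a segment are nonzero; `coverage_fraction` is a hypothetical helper name.

```python
import numpy as np


def coverage_fraction(masks):
    """Fraction of pixels covered by at least one binary mask."""
    stacked = np.stack([np.asarray(m) > 0 for m in masks])
    return float(stacked.any(axis=0).mean())


# Synthetic 2x4 masks standing in for the pipeline's PIL masks
# (assumption: nonzero means "inside the segment").
left = np.array([[255, 255, 0, 0], [255, 255, 0, 0]], dtype=np.uint8)
right = np.array([[0, 0, 255, 255], [0, 0, 255, 255]], dtype=np.uint8)

print(coverage_fraction([left, right]))  # the two halves together
print(coverage_fraction([left]))         # a single segment on its own
```

With real pipeline output, the same function could be called as `coverage_fraction([r["mask"] for r in results])`, since `np.asarray` accepts PIL images.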

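Let's have a side-by-side comparison of all the segmentation types.

Having seen all the types of segmentation, let's take a deep dive into fine-tuning a model for semantic segmentation.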

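Common real-world applications of semantic segmentation include training self-driving cars to identify pedestrians and important traffic information, identifying cells and abnormalities in medical imagery, and monitoring environmental changes from satellite imagery.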

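Fine-tuning a model for segmentation

We will now:

Fine-tune SegFormer on the SceneParse150 dataset.
Use your fine-tuned model for inference.

The task illustrated in this tutorial is supported by the following model architectures: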

BEiT, Data2VecVision, DPT, MobileNetV2, MobileViT, MobileViTV2, SegFormer, UPerNet

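Load the SceneParse150 dataset

Start by loading a smaller subset of the SceneParse150 dataset from the 🤗 Datasets library. This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.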

>>> from datasets import load_dataset

>>> ds = load_dataset("scene_parse_150", split="train[:50]")

Split the dataset's train split into a training set and a test set with the train_test_split method:

>>> ds = ds.train_test_split(test_size=0.2)
>>> train_ds = ds["train"]
>>> test_ds = ds["test"]

Then take a look at an example:

>>> train_ds[0]
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x683 at 0x7F9B0C201F90>,
 'annotation': <PIL.PngImagePlugin.PngImageFile image mode=L size=512x683 at 0x7F9B0C201DD0>,
 'scene_category': 368}

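image: a PIL image of the scene.
annotation: a PIL image of the segmentation map, which is also the model's target.
scene_category: a category id that describes the image scene, such as "kitchen" or "office". In this guide, you only need image and annotation, both of which are PIL images.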

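You will also want to create a dictionary that maps a label id to a label class, which is useful when you set up the model later. Download the mappings from the Hub and create the id2label and label2id dictionaries: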

>>> import json
>>> from huggingface_hub import hf_hub_download

>>> repo_id = "huggingface/label-files"
>>> filename = "ade20k-id2label.json"
>>> # hf_hub_download replaces the deprecated cached_download + hf_hub_url pair
>>> id2label = json.load(open(hf_hub_download(repo_id, filename, repo_type="dataset"), "r"))
>>> id2label = {int(k): v for k, v in id2label.items()}
>>> label2id = {v: k for k, v in id2label.items()}
>>> num_labels = len(id2label)
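The `int(k)` conversion above is needed because JSON object keys are always strings. The following small sketch illustrates this with a three-entry stand-in for the downloaded file; the three class names are only illustrative.

```python
import json

# A tiny stand-in for the downloaded mapping file (the real one has 150 entries).
raw = json.loads(json.dumps({0: "wall", 1: "building", 2: "sky"}))
print(raw)  # after a JSON round trip the keys are strings, not integers

# Restore integer ids, then build the reverse mapping.
id2label = {int(k): v for k, v in raw.items()}
label2id = {v: k for k, v in id2label.items()}
num_labels = len(id2label)
print(id2label, label2id, num_labels)
```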

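Custom dataset

You can also create and use your own dataset if you prefer to train with the run_semantic_segmentation.py script instead of a notebook instance. The script requires:

A DatasetDict with two Image columns, "image" and "label".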

from datasets import Dataset, DatasetDict, Image

image_paths_train = ["path/to/image_1.jpg", "path/to/image_2.jpg", ..., "path/to/image_n.jpg"]
label_paths_train = ["path/to/annotation_1.png", "path/to/annotation_2.png", ..., "path/to/annotation_n.png"]

image_paths_validation = [...]
label_paths_validation = [...]

def create_dataset(image_paths, label_paths):
    dataset = Dataset.from_dict({"image": sorted(image_paths),
                                "label": sorted(label_paths)})
    dataset = dataset.cast_column("image", Image())
    dataset = dataset.cast_column("label", Image())
    return dataset

# step 1: create Dataset objects
train_dataset = create_dataset(image_paths_train, label_paths_train)
validation_dataset = create_dataset(image_paths_validation, label_paths_validation)

# step 2: create DatasetDict
dataset = DatasetDict({
     "train": train_dataset,
     "validation": validation_dataset,
     }
)

# step 3: push to Hub (assumes you have run the huggingface-cli login command in a terminal/notebook)
dataset.push_to_hub("your-name/dataset-repo")

# optionally, you can push to a private repo on the Hub
# dataset.push_to_hub("name of repo on the hub", private=True)

An id2label dictionary mapping the class integers to their class names.

import json
# simple example
id2label = {0: 'cat', 1: 'dog'}
with open('id2label.json', 'w') as fp:
    json.dump(id2label, fp)

As an example, take a look at this example dataset, which was created with the steps shown above.
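The create_dataset function above sorts the image paths and the label paths independently, so images and annotations are only paired correctly if both lists sort into the same order. A minimal sanity check is sketched below; it assumes a hypothetical naming convention in which each file name ends in a numeric id (as in image_1.jpg and annotation_1.png), and `sample_id` and `check_pairs` are illustrative names.

```python
import re
from pathlib import Path


def sample_id(path):
    """Return the trailing number in a file name, e.g. 'image_12.jpg' -> 12."""
    match = re.search(r"(\d+)$", Path(path).stem)
    if match is None:
        raise ValueError(f"no numeric suffix in {path!r}")
    return int(match.group(1))


def check_pairs(image_paths, label_paths):
    """Raise if the sorted image and label lists do not line up one-to-one."""
    images, labels = sorted(image_paths), sorted(label_paths)
    if len(images) != len(labels):
        raise ValueError("different number of images and labels")
    for img, lab in zip(images, labels):
        if sample_id(img) != sample_id(lab):
            raise ValueError(f"mismatched pair: {img} / {lab}")
    return list(zip(images, labels))


pairs = check_pairs(
    ["path/to/image_2.jpg", "path/to/image_1.jpg"],
    ["path/to/annotation_1.png", "path/to/annotation_2.png"],
)
print(pairs)
```

Running a check like this before calling create_dataset can catch a missing or misnamed annotation early.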

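Preprocess

The next step is to load a SegFormer image processor to prepare the images and annotations for the model. Some datasets, like this one, use the zero index as the background class. However, the background class isn't actually included in the 150 classes, so you need to set reduce_labels=True to subtract one from all the labels. The zero index is replaced by 255 so that it is ignored by SegFormer's loss function: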

>>> from transformers import AutoImageProcessor

>>> checkpoint = "nvidia/mit-b0"
>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint, reduce_labels=True)
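To make the effect of reduce_labels=True concrete, the following NumPy sketch reproduces the described behavior on a tiny annotation: background (0) becomes 255 and every other label shifts down by one. It is an illustration of the idea under those stated rules, not the library's implementation, and `reduce_labels` here is a standalone helper.

```python
import numpy as np


def reduce_labels(label_map, ignore_index=255):
    """Shift labels down by one and send the background (0) to ignore_index."""
    label_map = np.asarray(label_map).astype(np.int64)
    # Background pixels, and pixels already marked as ignored, stay ignored.
    keep_ignored = (label_map == 0) | (label_map == ignore_index)
    reduced = label_map - 1
    reduced[keep_ignored] = ignore_index
    return reduced.astype(np.uint8)


annotation = np.array([[0, 1, 2], [150, 0, 3]], dtype=np.uint8)
print(reduce_labels(annotation))
```

After the shift, the 150 real classes occupy ids 0 to 149, which matches num_labels = 150.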

PyTorch

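It is common to apply some data augmentations to an image dataset to make a model more robust against overfitting. In this guide, you'll use the ColorJitter function from torchvision to randomly change the color properties of an image, but you can use any image library you like.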

>>> from torchvision.transforms import ColorJitter

>>> jitter = ColorJitter(brightness=0.25, contrast=0.25, saturation=0.25, hue=0.1)

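Now create two preprocessing functions to prepare the images and annotations for the model. These functions convert the images into pixel_values and the annotations into labels. For the training set, jitter is applied before the images are passed to the image processor. For the test set, the image processor crops and normalizes the images and only crops the labels, because no data augmentation is applied during testing.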

>>> def train_transforms(example_batch):
...     images = [jitter(x) for x in example_batch["image"]]
...     labels = [x for x in example_batch["annotation"]]
...     inputs = image_processor(images, labels)
...     return inputs

>>> def val_transforms(example_batch):
...     images = [x for x in example_batch["image"]]
...     labels = [x for x in example_batch["annotation"]]
...     inputs = image_processor(images, labels)
...     return inputs

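To apply the jitter over the entire dataset, use the 🤗 Datasets set_transform function. The transform is applied on the fly, which is faster and consumes less disk space: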

>>> train_ds.set_transform(train_transforms)
>>> test_ds.set_transform(val_transforms)

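TensorFlow

It is common to apply some data augmentations to an image dataset to make a model more robust against overfitting. In this guide, you'll use tf.image to randomly change the color properties of an image, but you can use any image library you like. Define two separate transformation functions:

training data transformations that include image augmentation
validation data transformations that only transpose the images, since computer vision models in 🤗 Transformers expect a channels-first layout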

>>> import tensorflow as tf

>>> def aug_transforms(image):
...     image = tf.keras.utils.img_to_array(image)
...     image = tf.image.random_brightness(image, 0.25)
...     image = tf.image.random_contrast(image, 0.5, 2.0)
...     image = tf.image.random_saturation(image, 0.75, 1.25)
...     image = tf.image.random_hue(image, 0.1)
...     image = tf.transpose(image, (2, 0, 1))
...     return image

>>> def transforms(image):
...     image = tf.keras.utils.img_to_array(image)
...     image = tf.transpose(image, (2, 0, 1))
...     return image

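Next, create two preprocessing functions to prepare batches of images and annotations for the model. These functions apply the image transformations and use the previously loaded image_processor to convert the images into pixel_values and the annotations into labels. The image processor also takes care of resizing and normalizing the images.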

>>> def train_transforms(example_batch):
...     images = [aug_transforms(x.convert("RGB")) for x in example_batch["image"]]
...     labels = [x for x in example_batch["annotation"]]
...     inputs = image_processor(images, labels)
...     return inputs

>>> def val_transforms(example_batch):
...     images = [transforms(x.convert("RGB")) for x in example_batch["image"]]
...     labels = [x for x in example_batch["annotation"]]
...     inputs = image_processor(images, labels)
...     return inputs

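To apply the preprocessing transformations over the entire dataset, use the 🤗 Datasets set_transform function. The transform is applied on the fly, which is faster and consumes less disk space: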

>>> train_ds.set_transform(train_transforms)
>>> test_ds.set_transform(val_transforms)

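Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the mean Intersection over Union (IoU) metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):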

>>> import evaluate

>>> metric = evaluate.load("mean_iou")

Then create a function to compute the metrics. Your predictions need to be converted to logits first, and then reshaped to match the size of the labels before you can call compute:

PyTorch

>>> import numpy as np
>>> import torch
>>> from torch import nn

>>> def compute_metrics(eval_pred):
...     with torch.no_grad():
...         logits, labels = eval_pred
...         logits_tensor = torch.from_numpy(logits)
...         logits_tensor = nn.functional.interpolate(
...             logits_tensor,
...             size=labels.shape[-2:],
...             mode="bilinear",
...             align_corners=False,
...         ).argmax(dim=1)

...         pred_labels = logits_tensor.detach().cpu().numpy()
...         metrics = metric.compute(
...             predictions=pred_labels,
...             references=labels,
...             num_labels=num_labels,
...             ignore_index=255,
...             reduce_labels=False,
...         )
...         for key, value in metrics.items():
...             if isinstance(value, np.ndarray):
...                 metrics[key] = value.tolist()
...         return metrics

TensorFlow

>>> def compute_metrics(eval_pred):
...     logits, labels = eval_pred
...     logits = tf.transpose(logits, perm=[0, 2, 3, 1])
...     logits_resized = tf.image.resize(
...         logits,
...         size=tf.shape(labels)[1:],
...         method="bilinear",
...     )

...     pred_labels = tf.argmax(logits_resized, axis=-1)
...     metrics = metric.compute(
...         predictions=pred_labels,
...         references=labels,
...         num_labels=num_labels,
...         ignore_index=-1,
...         reduce_labels=image_processor.do_reduce_labels,
...     )

...     per_category_accuracy = metrics.pop("per_category_accuracy").tolist()
...     per_category_iou = metrics.pop("per_category_iou").tolist()

...     metrics.update({f"accuracy_{id2label[i]}": v for i, v in enumerate(per_category_accuracy)})
...     metrics.update({f"iou_{id2label[i]}": v for i, v in enumerate(per_category_iou)})
...     return {"val_" + k: v for k, v in metrics.items()}

Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.
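For intuition about what the metric measures, here is a minimal NumPy sketch of mean IoU on a single pair of label maps: per class, the intersection of predicted and reference pixels divided by their union, averaged over the classes that occur. It is a simplified illustration and not the 🤗 Evaluate implementation (which, among other things, aggregates over a whole dataset); the `mean_iou` helper below is a standalone function.

```python
import numpy as np


def mean_iou(pred, target, num_labels, ignore_index=255):
    """Mean of per-class IoU over classes that occur in pred or target."""
    pred, target = np.asarray(pred), np.asarray(target)
    valid = target != ignore_index  # drop pixels marked as ignored in the reference
    pred, target = pred[valid], target[valid]
    ious = []
    for cls in range(num_labels):
        intersection = np.sum((pred == cls) & (target == cls))
        union = np.sum((pred == cls) | (target == cls))
        if union > 0:  # skip classes absent from both maps
            ious.append(intersection / union)
    return float(np.mean(ious)) if ious else float("nan")


target = np.array([[0, 0, 1], [1, 255, 1]])
pred = np.array([[0, 1, 1], [1, 0, 1]])
# Class 0: 1 shared pixel out of 2 -> 0.5; class 1: 3 shared out of 4 -> 0.75.
print(mean_iou(pred, target, num_labels=2))
```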

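Train

PyTorch

If you aren't familiar with fine-tuning a model with the Trainer, take a look at the basic tutorial here[…/training#finetune-with-trainer]!

You're ready to start training your model now! Load SegFormer with AutoModelForSemanticSegmentation, and pass the model the mapping between label ids and label classes: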

>>> from transformers import AutoModelForSemanticSegmentation, TrainingArguments, Trainer

>>> model = AutoModelForSemanticSegmentation.from_pretrained(checkpoint, id2label=id2label, label2id=label2id)

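At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments. It is important that you don't remove unused columns, because this would drop the image column. Without the image column, you can't create pixel_values. Set remove_unused_columns=False to prevent this behavior! The only other required parameter is output_dir, which specifies where to save your model. You'll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). With the settings below, the Trainer evaluates the IoU metric and saves a training checkpoint every 20 steps.
Pass the training arguments to Trainer along with the model, the datasets, and the compute_metrics function.
Call train() to fine-tune your model.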

>>> training_args = TrainingArguments(
...     output_dir="segformer-b0-scene-parse-150",
...     learning_rate=6e-5,
...     num_train_epochs=50,
...     per_device_train_batch_size=2,
...     per_device_eval_batch_size=2,
...     save_total_limit=3,
...     evaluation_strategy="steps",
...     save_strategy="steps",
...     save_steps=20,
...     eval_steps=20,
...     logging_steps=1,
...     eval_accumulation_steps=5,
...     remove_unused_columns=False,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=train_ds,
...     eval_dataset=test_ds,
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()

Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:

>>> trainer.push_to_hub()

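TensorFlow

If you are unfamiliar with fine-tuning a model with Keras, check out the basic tutorial first!

To fine-tune a model in TensorFlow, follow these steps:

Define the training hyperparameters, and set up an optimizer and a learning rate schedule.
Instantiate a pretrained model.
Convert a 🤗 Dataset to a tf.data.Dataset.
Compile your model.
Add callbacks to calculate metrics and upload your model to the 🤗 Hub.
Use the fit() method to run the training.

Start by defining the hyperparameters, optimizer, and learning rate schedule: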

>>> from transformers import create_optimizer

>>> batch_size = 2
>>> num_epochs = 50
>>> num_train_steps = len(train_ds) * num_epochs
>>> learning_rate = 6e-5
>>> weight_decay_rate = 0.01

>>> optimizer, lr_schedule = create_optimizer(
...     init_lr=learning_rate,
...     num_train_steps=num_train_steps,
...     weight_decay_rate=weight_decay_rate,
...     num_warmup_steps=0,
... )

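Then, load SegFormer with TFAutoModelForSemanticSegmentation along with the label mappings, and compile it with the optimizer. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to: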

>>> from transformers import TFAutoModelForSemanticSegmentation

>>> model = TFAutoModelForSemanticSegmentation.from_pretrained(
...     checkpoint,
...     id2label=id2label,
...     label2id=label2id,
... )
>>> model.compile(optimizer=optimizer)  # No loss argument!

Convert your datasets to the tf.data.Dataset format using to_tf_dataset and the DefaultDataCollator:

>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator(return_tensors="tf")

>>> tf_train_dataset = train_ds.to_tf_dataset(
...     columns=["pixel_values", "label"],
...     shuffle=True,
...     batch_size=batch_size,
...     collate_fn=data_collator,
... )

>>> tf_eval_dataset = test_ds.to_tf_dataset(
...     columns=["pixel_values", "label"],
...     shuffle=True,
...     batch_size=batch_size,
...     collate_fn=data_collator,
... )

To compute the accuracy from the predictions and push your model to the 🤗 Hub, use Keras callbacks. Pass your compute_metrics function to KerasMetricCallback, and use the PushToHubCallback to upload the model:

>>> from transformers.keras_callbacks import KerasMetricCallback, PushToHubCallback

>>> metric_callback = KerasMetricCallback(
...     metric_fn=compute_metrics, eval_dataset=tf_eval_dataset, batch_size=batch_size, label_cols=["labels"]
... )

>>> push_to_hub_callback = PushToHubCallback(output_dir="scene_segmentation", tokenizer=image_processor)

>>> callbacks = [metric_callback, push_to_hub_callback]

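Finally, you are ready to train your model! Call fit() with your training and validation datasets, the number of epochs, and your callbacks to fine-tune the model: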

>>> model.fit(
...     tf_train_dataset,
...     validation_data=tf_eval_dataset,
...     callbacks=callbacks,
...     epochs=num_epochs,
... )

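Congratulations! You have fine-tuned your model and shared it on the 🤗 Hub. You can now use it for inference!

Inference

Great, now that you've fine-tuned a model, you can use it for inference!

Load an image for inference: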

>>> image = ds[0]["image"]
>>> image

PyTorch

We will now see how to run inference without a pipeline. Process the image with an image processor and place the pixel_values on a GPU:

>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # use GPU if available, otherwise use a CPU
>>> encoding = image_processor(image, return_tensors="pt")
>>> pixel_values = encoding.pixel_values.to(device)

Pass your input to the model and return the logits:

>>> outputs = model(pixel_values=pixel_values)
>>> logits = outputs.logits.cpu()

Next, rescale the logits to the original image size:

>>> upsampled_logits = nn.functional.interpolate(
...     logits,
...     size=image.size[::-1],
...     mode="bilinear",
...     align_corners=False,
... )

>>> pred_seg = upsampled_logits.argmax(dim=1)[0]

TensorFlow

Load an image processor to preprocess the image and return the input as TensorFlow tensors:

>>> from transformers import AutoImageProcessor

>>> image_processor = AutoImageProcessor.from_pretrained("MariaK/scene_segmentation")
>>> inputs = image_processor(image, return_tensors="tf")

Pass your input to the model and return the logits:

>>> from transformers import TFAutoModelForSemanticSegmentation

>>> model = TFAutoModelForSemanticSegmentation.from_pretrained("MariaK/scene_segmentation")
>>> logits = model(**inputs).logits

Next, rescale the logits to the original image size and apply argmax on the class dimension:

>>> logits = tf.transpose(logits, [0, 2, 3, 1])

>>> upsampled_logits = tf.image.resize(
...     logits,
...     # We reverse the shape of `image` because `image.size` returns width and height.
...     image.size[::-1],
... )

>>> pred_seg = tf.math.argmax(upsampled_logits, axis=-1)[0]

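To visualize the results, load the dataset color palette as ade_palette(), which maps each class to its RGB values (ade_palette() is not defined in this guide and has to be supplied separately). Then you can combine and plot your image and the predicted segmentation map: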

>>> import matplotlib.pyplot as plt
>>> import numpy as np

>>> color_seg = np.zeros((pred_seg.shape[0], pred_seg.shape[1], 3), dtype=np.uint8)
>>> palette = np.array(ade_palette())
>>> for label, color in enumerate(palette):
...     color_seg[pred_seg == label, :] = color
>>> color_seg = color_seg[..., ::-1]  # convert to BGR

>>> img = np.array(image) * 0.5 + color_seg * 0.5  # plot the image with the segmentation map
>>> img = img.astype(np.uint8)

>>> plt.figure(figsize=(15, 10))
>>> plt.imshow(img)
>>> plt.show()

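Video classification

Original text: huggingface.co/docs/transformers/v4.37.2/en/tasks/video_classification

Video classification is the task of assigning a label or class to an entire video. Each video is expected to have only one class. Video classification models take a video as input and return a prediction about which class the video belongs to. These models can be used to categorize what a video is about. A real-world application of video classification is action/activity recognition, which is useful for fitness applications. It is also helpful for vision-impaired individuals, especially when they are commuting.

This guide will show you how to:

Fine-tune VideoMAE on a subset of the UCF101 dataset.
Use your fine-tuned model for inference.

The task illustrated in this tutorial is supported by the following model architectures: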

TimeSformer, VideoMAE, ViViT

Before you begin, make sure you have all the necessary libraries installed:

pip install -q pytorchvideo transformers evaluate

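You will use PyTorchVideo (dubbed pytorchvideo) to process and prepare the videos.

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in: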

>>> from huggingface_hub import notebook_login

>>> notebook_login()

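Load the UCF101 dataset

Start by loading a subset of the UCF-101 dataset. This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.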

>>> from huggingface_hub import hf_hub_download

>>> hf_dataset_identifier = "sayakpaul/ucf101-subset"
>>> filename = "UCF101_subset.tar.gz"
>>> file_path = hf_hub_download(repo_id=hf_dataset_identifier, filename=filename, repo_type="dataset")

After the subset has been downloaded, you need to extract the compressed archive:

>>> import tarfile

>>> with tarfile.open(file_path) as t:
...      t.extractall(".")

At a high level, the dataset is organized like so:

UCF101_subset/
    train/
        BandMarching/
            video_1.mp4
            video_2.mp4
            ...
        Archery/
            video_1.mp4
            video_2.mp4
            ...
        ...
    val/
        BandMarching/
            video_1.mp4
            video_2.mp4
            ...
        Archery/
            video_1.mp4
            video_2.mp4
            ...
        ...
    test/
        BandMarching/
            video_1.mp4
            video_2.mp4
            ...
        Archery/
            video_1.mp4
            video_2.mp4
            ...
        ...
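Later snippets in this guide refer to dataset_root_path and all_video_file_paths without defining them. The sketch below shows one plausible way to build them with pathlib, assuming the archive was extracted into the current directory and that the videos use the .avi extension seen in the path listing; `collect_video_paths` is a hypothetical helper name.

```python
from pathlib import Path


def collect_video_paths(root, splits=("train", "val", "test"), suffix=".avi"):
    """Return sorted paths of all videos stored as <root>/<split>/<class>/<file>."""
    root = Path(root)
    paths = []
    for split in splits:
        paths.extend(root.glob(f"{split}/*/*{suffix}"))
    return sorted(paths)


# Assumed location of the extracted archive.
dataset_root_path = Path("UCF101_subset")
all_video_file_paths = collect_video_paths(dataset_root_path)
print(f"Total videos: {len(all_video_file_paths)}")
```

If the directory does not exist, the list is simply empty, which makes a wrong extraction path easy to spot.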

The (sorted) video paths look like this:

...
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c04.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c06.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g09_c02.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g09_c06.avi'
...

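You will notice that there are video clips belonging to the same group/scene, where the group is denoted by g in the video file paths. For example, v_ApplyEyeMakeup_g07_c04.avi and v_ApplyEyeMakeup_g07_c06.avi.

For the validation and evaluation splits, you wouldn't want video clips from the same group/scene as the training split, to prevent data leakage. The subset used in this tutorial takes this into account.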
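The leakage concern can be checked mechanically by parsing the group id out of each file name and intersecting the groups of two splits. The following is a minimal sketch that assumes the v_<Class>_g<group>_c<clip>.avi naming pattern shown above; `group_id` and `leaked_groups` are hypothetical helpers, and the validation path in the example is made up for illustration.

```python
import re


def group_id(path):
    """Extract (class, group) from a file name like v_ApplyEyeMakeup_g07_c04.avi."""
    name = path.rsplit("/", 1)[-1]
    match = re.match(r"v_(?P<cls>.+)_g(?P<group>\d+)_c\d+\.avi$", name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return match.group("cls"), match.group("group")


def leaked_groups(train_paths, eval_paths):
    """Return the (class, group) pairs that occur in both splits."""
    return {group_id(p) for p in train_paths} & {group_id(p) for p in eval_paths}


train_paths = [
    "UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c04.avi",
    "UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c06.avi",
]
# Hypothetical validation entry, used only to exercise the check.
val_paths = ["UCF101_subset/val/ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi"]

print(leaked_groups(train_paths, val_paths))  # an empty set means no shared groups
```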

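Next, you will derive the set of labels present in the dataset. Also, create two dictionaries that will be helpful when initializing the model:

label2id: maps the class names to integers.
id2label: maps the integers to class names.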

>>> class_labels = sorted({str(path).split("/")[2] for path in all_video_file_paths})
>>> label2id = {label: i for i, label in enumerate(class_labels)}
>>> id2label = {i: label for label, i in label2id.items()}

>>> print(f"Unique classes: {list(label2id.keys())}.")

# Unique classes: ['ApplyEyeMakeup', 'ApplyLipstick', 'Archery', 'BabyCrawling', 'BalanceBeam', 'BandMarching', 'BaseballPitch', 'Basketball', 'BasketballDunk', 'BenchPress'].

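There are 10 unique classes. For each class, there are 30 videos in the training set.

Load a model to fine-tune

Instantiate a video classification model from a pretrained checkpoint and its associated image processor. The model's encoder comes with pretrained parameters, and the classification head is randomly initialized. The image processor will come in handy when writing the preprocessing pipeline for our dataset.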

>>> from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

>>> model_ckpt = "MCG-NJU/videomae-base"
>>> image_processor = VideoMAEImageProcessor.from_pretrained(model_ckpt)
>>> model = VideoMAEForVideoClassification.from_pretrained(
...     model_ckpt,
...     label2id=label2id,
...     id2label=id2label,
...     ignore_mismatched_sizes=True,  # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
... )

While the model is loading, you might notice the following warning:

Some weights of the model checkpoint at MCG-NJU/videomae-base were not used when initializing VideoMAEForVideoClassification: [..., 'decoder.decoder_layers.1.attention.output.dense.bias', 'decoder.decoder_layers.2.attention.attention.key.weight']
- This IS expected if you are initializing VideoMAEForVideoClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing VideoMAEForVideoClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of VideoMAEForVideoClassification were not initialized from the model checkpoint at MCG-NJU/videomae-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

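The warning tells us that we are throwing away some weights (here, the decoder weights listed in the warning) and randomly initializing others (the weights and bias of the new classifier layer). This is expected in this case, because we are adding a new head for which we don't have pretrained weights, so the library warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

Note that this checkpoint leads to better performance on this task, because it was obtained by fine-tuning on a similar downstream task with considerable domain overlap. You can check out this checkpoint, which was obtained by fine-tuning MCG-NJU/videomae-base-finetuned-kinetics.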

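Prepare the datasets for training

For preprocessing the videos, you will leverage the PyTorchVideo library. Start by importing the dependencies we need.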

>>> import pytorchvideo.data

>>> from pytorchvideo.transforms import (
...     ApplyTransformToKey,
...     Normalize,
...     RandomShortSideScale,
...     RemoveKey,
...     ShortSideScale,
...     UniformTemporalSubsample,
... )

>>> from torchvision.transforms import (
...     Compose,
...     Lambda,
...     RandomCrop,
...     RandomHorizontalFlip,
...     Resize,
... )

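For the training dataset transformations, use a combination of uniform temporal subsampling, pixel normalization, random cropping, and random horizontal flipping. For the validation and evaluation dataset transformations, keep the same transformation chain except for random cropping and horizontal flipping. To learn more about the details of these transformations, check out the official documentation of PyTorchVideo.

Use the image_processor associated with the pretrained model to obtain the following information:

The image mean and standard deviation with which the video frame pixels will be normalized.
The spatial resolution to which the video frames will be resized.

Start by defining some constants.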

>>> mean = image_processor.image_mean
>>> std = image_processor.image_std
>>> if "shortest_edge" in image_processor.size:
...     height = width = image_processor.size["shortest_edge"]
... else:
...     height = image_processor.size["height"]
...     width = image_processor.size["width"]
>>> resize_to = (height, width)

>>> num_frames_to_sample = model.config.num_frames
>>> sample_rate = 4
>>> fps = 30
>>> clip_duration = num_frames_to_sample * sample_rate / fps
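As a worked example of the clip_duration formula: sampling N frames while keeping every sample_rate-th frame spans N * sample_rate source frames, and dividing by the frame rate converts that to seconds. The value 16 for num_frames_to_sample below is an assumption standing in for model.config.num_frames; check your own model's configuration.

```python
num_frames_to_sample = 16  # assumed stand-in for model.config.num_frames
sample_rate = 4            # keep every 4th frame
fps = 30                   # frame rate used in the guide

# Number of source frames one clip spans, and its duration in seconds.
frames_spanned = num_frames_to_sample * sample_rate
clip_duration = frames_spanned / fps

print(f"{frames_spanned} source frames -> {clip_duration:.2f} s per clip")
```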

Now, define the dataset-specific transformations and the datasets respectively. Start with the training set:

>>> train_transform = Compose(
...     [
...         ApplyTransformToKey(
...             key="video",
...             transform=Compose(
...                 [
...                     UniformTemporalSubsample(num_frames_to_sample),
...                     Lambda(lambda x: x / 255.0),
...                     Normalize(mean, std),
...                     RandomShortSideScale(min_size=256, max_size=320),
...                     RandomCrop(resize_to),
...                     RandomHorizontalFlip(p=0.5),
...                 ]
...             ),
...         ),
...     ]
... )

>>> import os  # needed for os.path.join below

>>> train_dataset = pytorchvideo.data.Ucf101(
...     data_path=os.path.join(dataset_root_path, "train"),
...     clip_sampler=pytorchvideo.data.make_clip_sampler("random", clip_duration),
...     decode_audio=False,
...     transform=train_transform,
... )

The same sequence of steps can be applied to the validation and evaluation sets:

>>> val_transform = Compose(
...     [
...         ApplyTransformToKey(
...             key="video",
...             transform=Compose(
...                 [
...                     UniformTemporalSubsample(num_frames_to_sample),
...                     Lambda(lambda x: x / 255.0),
...                     Normalize(mean, std),
...                     Resize(resize_to),
...                 ]
...             ),
...         ),
...     ]
... )

>>> val_dataset = pytorchvideo.data.Ucf101(
...     data_path=os.path.join(dataset_root_path, "val"),
...     clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration),
...     decode_audio=False,
...     transform=val_transform,
... )

>>> test_dataset = pytorchvideo.data.Ucf101(
...     data_path=os.path.join(dataset_root_path, "test"),
...     clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration),
...     decode_audio=False,
...     transform=val_transform,
... )

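Note: The above dataset pipelines are taken from the official PyTorchVideo example. We use the pytorchvideo.data.Ucf101() function because it is tailored to the UCF-101 dataset. Under the hood, it returns a pytorchvideo.data.labeled_video_dataset.LabeledVideoDataset object. The LabeledVideoDataset class is the base class for everything video-related in the PyTorchVideo datasets. So, if you want to use a custom dataset not supported off the shelf by PyTorchVideo, you can extend the LabeledVideoDataset class accordingly. Refer to the data API documentation to learn more. Also, if your dataset follows a similar structure (as shown above), then using pytorchvideo.data.Ucf101() should work just fine.

You can access the num_videos attribute to find out the number of videos in each dataset.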

>>> print(train_dataset.num_videos, val_dataset.num_videos, test_dataset.num_videos)
# (300, 30, 75)

Visualize the preprocessed video for better debugging

>>> import imageio
>>> import numpy as np
>>> from IPython.display import Image

>>> def unnormalize_img(img):
...     """Un-normalizes the image pixels."""
...     img = (img * std) + mean
...     img = (img * 255).clip(0, 255).astype("uint8")  # clip before the cast to avoid uint8 wrap-around
...     return img

>>> def create_gif(video_tensor, filename="sample.gif"):
...     """Prepares a GIF from a video tensor.
...     
...     The video tensor is expected to have the following shape:
...     (num_frames, num_channels, height, width).
...     """
...     frames = []
...     for video_frame in video_tensor:
...         frame_unnormalized = unnormalize_img(video_frame.permute(1, 2, 0).numpy())
...         frames.append(frame_unnormalized)
...     kargs = {"duration": 0.25}
...     imageio.mimsave(filename, frames, "GIF", **kargs)
...     return filename

>>> def display_gif(video_tensor, gif_name="sample.gif"):
...     """Prepares and displays a GIF from a video tensor."""
...     video_tensor = video_tensor.permute(1, 0, 2, 3)
...     gif_filename = create_gif(video_tensor, gif_name)
...     return Image(filename=gif_filename)

>>> sample_video = next(iter(train_dataset))
>>> video_tensor = sample_video["video"]
>>> display_gif(video_tensor)

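Train the model

Leverage the Trainer from 🤗 Transformers to train the model. To instantiate a Trainer, you need to define the training configuration and an evaluation metric. The most important piece is TrainingArguments, a class that contains all the attributes needed to configure the training. It requires an output folder name, which is used to save the model checkpoints. It also helps sync all the information in the model repository on the 🤗 Hub.

Most of the training arguments are self-explanatory, but one that is quite important here is remove_unused_columns=False. When left enabled, this option drops any features not used by the model's call function. It is True by default because it is usually ideal to drop unused feature columns, which makes it easier to unpack inputs into the model's call function. In this case, however, you need the unused features ('video' in particular) in order to create pixel_values (a mandatory key the model expects in its inputs).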

>>> from transformers import TrainingArguments, Trainer

>>> model_name = model_ckpt.split("/")[-1]
>>> new_model_name = f"{model_name}-finetuned-ucf101-subset"
>>> num_epochs = 4

>>> args = TrainingArguments(
...     new_model_name,
...     remove_unused_columns=False,
...     evaluation_strategy="epoch",
...     save_strategy="epoch",
...     learning_rate=5e-5,
...     per_device_train_batch_size=batch_size,
...     per_device_eval_batch_size=batch_size,
...     warmup_ratio=0.1,
...     logging_steps=10,
...     load_best_model_at_end=True,
...     metric_for_best_model="accuracy",
...     push_to_hub=True,
...     max_steps=(train_dataset.num_videos // batch_size) * num_epochs,
... )

The dataset returned by pytorchvideo.data.Ucf101() doesn't implement the __len__ method. As such, we must define max_steps when instantiating TrainingArguments.
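The max_steps expression used in the TrainingArguments above is plain integer arithmetic, sketched here with concrete numbers. The 300 training videos and 4 epochs come from this guide; batch_size=8 is only an example value, since the guide does not state one, and `compute_max_steps` is a hypothetical helper name.

```python
def compute_max_steps(num_videos, batch_size, num_epochs):
    """Optimizer steps for num_epochs passes, dropping the last partial batch."""
    return (num_videos // batch_size) * num_epochs


# 300 videos and 4 epochs are taken from the guide; batch_size=8 is an assumption.
max_steps = compute_max_steps(num_videos=300, batch_size=8, num_epochs=4)
print(max_steps)
```

Note that the floor division drops the final partial batch of each epoch, so a few clips per epoch go unseen when the dataset size is not a multiple of the batch size.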

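Next, you need to define a function to compute the metrics from the predictions, which will use the metric you'll load now. The only preprocessing you have to do is to take the argmax of the predicted logits: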

import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
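To see what compute_metrics does without loading the metric, the following sketch applies the same argmax step to a small made-up logits array and computes accuracy by hand. The numbers are illustrative only.

```python
import numpy as np

# Made-up logits for 3 clips and 3 classes.
logits = np.array([
    [2.0, 0.1, -1.0],  # highest score at index 0
    [0.3, 0.2, 1.5],   # highest score at index 2
    [0.0, 3.0, 0.5],   # highest score at index 1
])
label_ids = np.array([0, 2, 2])

# Same preprocessing as in compute_metrics: pick the class with the largest logit.
predictions = np.argmax(logits, axis=1)
accuracy = float((predictions == label_ids).mean())
print(predictions, accuracy)
```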

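A note on evaluation:

In the VideoMAE paper, the authors use the following evaluation strategy: they evaluate the model on several clips from the test videos, apply different crops to those clips, and report the aggregate score. However, in the interest of simplicity and brevity, we don't consider that in this tutorial.

Also, define a collate_fn, which will be used to batch examples together. Each batch consists of 2 keys, namely pixel_values and labels.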

>>> import torch

>>> def collate_fn(examples):
...     # permute to (num_frames, num_channels, height, width)
...     pixel_values = torch.stack(
...         [example["video"].permute(1, 0, 2, 3) for example in examples]
...     )
...     labels = torch.tensor([example["label"] for example in examples])
...     return {"pixel_values": pixel_values, "labels": labels}

Then, pass all of this along with the datasets to the Trainer:

>>> trainer = Trainer(
...     model,
...     args,
...     train_dataset=train_dataset,
...     eval_dataset=val_dataset,
...     tokenizer=image_processor,
...     compute_metrics=compute_metrics,
...     data_collator=collate_fn,
... )

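You might wonder why you passed the image_processor as a tokenizer when the data has already been preprocessed. This is only to make sure the image processor configuration file (stored as JSON) is also uploaded to the repository on the Hub.

Now fine-tune the model by calling the train method: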

>>> train_results = trainer.train()

Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:

>>> trainer.push_to_hub()

Inference

Great, now that you have fine-tuned a model, you can use it for inference!

Load a video for inference:

>>> sample_test_video = next(iter(test_dataset))

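The simplest way to try out your fine-tuned model for inference is to use it in a pipeline. Instantiate a pipeline for video classification with your model, and pass your video to it: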

>>> from transformers import pipeline

>>> video_cls = pipeline(model="my_awesome_video_cls_model")
>>> video_cls("https://huggingface.co/datasets/sayakpaul/ucf101-subset/resolve/main/v_BasketballDunk_g14_c06.avi")
[{'score': 0.9272987842559814, 'label': 'BasketballDunk'},
 {'score': 0.017777055501937866, 'label': 'BabyCrawling'},
 {'score': 0.01663011871278286, 'label': 'BalanceBeam'},
 {'score': 0.009560945443809032, 'label': 'BandMarching'},
 {'score': 0.0068979403004050255, 'label': 'BaseballPitch'}]

You can also manually replicate the results of the pipeline if you'd like.

>>> def run_inference(model, video):
...     # (num_frames, num_channels, height, width)
...     permuted_sample_test_video = video.permute(1, 0, 2, 3)
...     inputs = {
...         "pixel_values": permuted_sample_test_video.unsqueeze(0),
...         "labels": torch.tensor(
...             [sample_test_video["label"]]
...         ),  # this can be skipped if you don't have labels available.
...     }

...     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
...     inputs = {k: v.to(device) for k, v in inputs.items()}
...     model = model.to(device)

...     # forward pass
...     with torch.no_grad():
...         outputs = model(**inputs)
...         logits = outputs.logits

...     return logits

Now, pass your input to the model and return the logits:

>>> logits = run_inference(model, sample_test_video["video"])  # `model` is the instance fine-tuned by trainer.train()

Decoding the logits, we get:

>>> predicted_class_idx = logits.argmax(-1).item()
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])
# Predicted class: BasketballDunk
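The pipeline output shown earlier lists several labels with scores rather than a single argmax. A comparable ranking can be sketched from raw logits by applying a softmax and keeping the highest-scoring entries. This is an illustration under that assumption, not the pipeline's actual code; `top_k`, the three-class id2label subset, and the logit values are all made up for the example.

```python
import math


def top_k(logits, id2label, k=5):
    """Softmax the logits and return the k most probable labels with scores."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    scored = [{"score": e / total, "label": id2label[i]} for i, e in enumerate(exps)]
    return sorted(scored, key=lambda item: item["score"], reverse=True)[:k]


id2label = {0: "ApplyEyeMakeup", 1: "Archery", 2: "BasketballDunk"}  # illustrative subset
example_logits = [0.2, -1.0, 3.1]                                     # made-up values
top = top_k(example_logits, id2label, k=2)
print(top)
```

With a real model, the list passed in would be `logits[0].tolist()` and the mapping would be `model.config.id2label`.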


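This article is part of the Tencent Cloud self-media syndication program and is shared from the author's personal site/blog.

Originally published: 2024-06-26. In case of infringement, please contact cloudcommunity@tencent.com for removal.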
