Transformers 4.37 Chinese Documentation (Part 96)

ApacheCN_飞龙

Published 2024-06-26 18:54:45

Included in the column: 信数据得永生

Source: huggingface.co/docs/transformers

VipLlava

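Original text: huggingface.co/docs/transformers/v4.37.2/en/model_doc/vipllava

Overview

The VipLlava model was proposed in Making Large Multimodal Models Understand Arbitrary Visual Prompts by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, and Yong Jae Lee.

VipLlava enhances the training protocol of Llava by marking images during training and interacting with the model through natural cues such as a "red bounding box" or a "pointed arrow".

The abstract from the paper is the following:

While existing large vision-language multimodal models focus on whole image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow". Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings, yet achieves state-of-the-art performance on region-understanding tasks like Visual7W, PointQA, and the Visual Commonsense Reasoning benchmark. Furthermore, we present ViP-Bench, a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions, enabling future research in this domain. Code, data, and model are publicly available.

Tips:

The architecture is similar to the llava architecture, except that the multimodal projector takes a set of concatenated vision hidden states and has an additional layernorm layer on that module.
We advise users to use padding_side="left" when computing batched generation, as it leads to more accurate results. Simply make sure to set processor.tokenizer.padding_side = "left" before generating.
Note that the model has not been explicitly trained to process multiple images in the same prompt. Although this is technically possible, you may experience inaccurate results.
For better results, we recommend prompting the model with the correct prompt format: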

A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n<prompt>###Assistant:

For multi-turn conversations:

A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n<prompt1>###Assistant: <answer1>###Human: <prompt2>###Assistant:
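The two templates above can be assembled programmatically. The following is a minimal sketch in plain Python; the helper name build_vipllava_prompt and the (question, answer) pair representation are assumptions made for this illustration and are not part of the Transformers API.

```python
SYSTEM_PROMPT = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)


def build_vipllava_prompt(turns):
    """Build a prompt string from a list of (question, answer) pairs.

    The answer of the last pair should be None, since the model is expected
    to generate it. The <image> placeholder is added to the first turn only.
    """
    if not turns:
        raise ValueError("at least one turn is required")
    parts = [SYSTEM_PROMPT]
    for index, (question, answer) in enumerate(turns):
        image_prefix = "<image>\n" if index == 0 else ""
        parts.append(f"###Human: {image_prefix}{question}###Assistant:")
        if answer is not None:
            # Completed turns carry the earlier answer after a single space.
            parts.append(f" {answer}")
    return "".join(parts)


single_turn = build_vipllava_prompt([("Describe the image.", None)])
multi_turn = build_vipllava_prompt([("What is shown?", "A cat."), ("What color is it?", None)])
```

Keeping the template in one helper makes it harder to drop a ### separator or to repeat the <image> placeholder in later turns.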

The original code can be found here.

This model was contributed by Younes Belkada.
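To illustrate the padding_side="left" tip from the overview, here is a small sketch in plain Python of what left and right padding do to a batch of token ids. The function pad_batch and the toy ids are hypothetical; in practice the tokenizer performs this step. The likely reason left padding helps is that a decoder-only model appends new tokens at the right end of each row, so with left padding every row ends in a real prompt token rather than in padding.

```python
def pad_batch(sequences, pad_id, padding_side="left"):
    """Pad token-id lists to a common length and build attention masks."""
    if padding_side not in ("left", "right"):
        raise ValueError("padding_side must be 'left' or 'right'")
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = [pad_id] * (max_len - len(seq))
        mask_pad = [0] * len(pad)   # padding positions are masked out
        real = [1] * len(seq)       # real tokens are attended to
        if padding_side == "left":
            input_ids.append(pad + list(seq))
            attention_mask.append(mask_pad + real)
        else:
            input_ids.append(list(seq) + pad)
            attention_mask.append(real + mask_pad)
    return input_ids, attention_mask


# Toy token ids; 0 is a hypothetical pad id.
batch = [[5, 6, 7, 8], [9, 10]]
left_ids, left_mask = pad_batch(batch, pad_id=0, padding_side="left")
right_ids, right_mask = pad_batch(batch, pad_id=0, padding_side="right")
```

With right padding, the second row would end in pad tokens, and generation would continue from padding instead of from the prompt.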

VipLlavaConfig

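`class transformers.VipLlavaConfig`

<source>

( vision_config = None text_config = None ignore_index = -100 image_token_index = 32000 projector_hidden_act = 'gelu' projector_layernorm_eps = 1e-05 vision_feature_layers = [-2, -5, -8, -11, 6] vocab_size = 32000 **kwargs )

Parameters

vision_config (VipLlavaVisionConfig, optional) - Custom vision config or dict.
text_config (Union[AutoConfig, dict], optional) - The config object of the text backbone. Can be either LlamaConfig or MistralConfig.
ignore_index (int, optional, defaults to -100) - The ignore index for the loss function.
image_token_index (int, optional, defaults to 32000) - The image token index used to encode the image prompt.
projector_hidden_act (str, optional, defaults to "gelu") - The activation function used by the multimodal projector.
projector_layernorm_eps (float, optional, defaults to 1e-05) - The layer norm epsilon of the projector layernorm.
vision_feature_layers (List[int], optional, defaults to [-2, -5, -8, -11, 6]) - The list of layers from which to select the vision features.
vocab_size (int, optional, defaults to 32000) - Vocabulary size of the VipLlava model. Defines the number of different tokens that can be represented when calling VipLlavaForConditionalGeneration.

This is the configuration class to store the configuration of a VipLlavaForConditionalGeneration. It is used to instantiate a VipLlava model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of VipLlava-9B.

e.g. ybelkada/vip-llava-7b-hf

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example: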

>>> from transformers import VipLlavaForConditionalGeneration, VipLlavaConfig, CLIPVisionConfig, LlamaConfig

>>> # Initializing a CLIP-vision config
>>> vision_config = CLIPVisionConfig()

>>> # Initializing a Llama config
>>> text_config = LlamaConfig()

>>> # Initializing a VipLlava vipllava-7b style configuration
>>> configuration = VipLlavaConfig(vision_config, text_config)

>>> # Initializing a model from the vipllava-7b style configuration
>>> model = VipLlavaForConditionalGeneration(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
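The vision_feature_layers parameter is easier to picture with a toy example. The sketch below uses plain Python lists in place of tensors. It assumes a hypothetical vision tower with 24 layers and a feature width of 1024, and it assumes the selected states are concatenated along the feature axis, as the overview tip about "concatenated vision hidden states" suggests. None of these numbers come from the configuration above except the default layer list.

```python
def select_feature_layers(hidden_states, feature_layers):
    """Pick the hidden states named by feature_layers.

    Negative indices count from the end of the list, as in normal Python indexing.
    """
    return [hidden_states[index] for index in feature_layers]


# Hypothetical vision tower: one entry for the embeddings output plus one per layer.
num_layers = 24
hidden_size = 1024  # assumed per-layer feature width
hidden_states = [f"hidden_state_{i}" for i in range(num_layers + 1)]

feature_layers = [-2, -5, -8, -11, 6]  # default value of vision_feature_layers
selected = select_feature_layers(hidden_states, feature_layers)

# Concatenating the selected states along the feature axis multiplies the width
# by the number of selected layers.
projector_input_size = len(feature_layers) * hidden_size
```

Under these assumptions the projector would receive features five times wider than a single layer's output, which is why the layer list and the projector dimensions have to be changed together.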

VipLlavaForConditionalGeneration

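`class transformers.VipLlavaForConditionalGeneration`

<source>

( config: VipLlavaConfig )

Parameters

config (VipLlavaConfig or VipLlavaVisionConfig) - Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The VIPLLAVA model consists of a vision backbone and a language model. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch module and refer to the PyTorch documentation for all matters related to general usage and behavior.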

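`forward`

<source>

( input_ids: LongTensor = None pixel_values: FloatTensor = None attention_mask: Optional = None position_ids: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None vision_feature_layers: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.models.vipllava.modeling_vipllava.VipLlavaCausalLMOutputWithPast or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) - Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. What are input IDs?
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) - The tensors corresponding to the input images. Pixel values can be obtained using AutoImageProcessor. See CLIPImageProcessor.call() for details ([LlavaProcessor] uses CLIPImageProcessor for processing images).
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) - Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks? Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values). If you want to change the padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify it to your needs. See diagram 1 in the paper for more information on the default strategy.
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) - Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1]. What are position IDs?
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) - Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding (see the past_key_values input). If past_key_values is used, the user can optionally input only the last decoder_input_ids (those that do not have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) - Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix provides.
use_cache (bool, optional) - If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) - Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) - Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) - Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) - Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].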
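The role of the -100 label value can be shown with a small loss computation. The following is a simplified sketch in plain Python, not the library's implementation: it averages the cross-entropy over positions whose label is not -100, and it omits the one-position shift between logits and labels that causal language models apply. Returning 0.0 when every position is masked is a choice made for this example only.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over positions whose label is not ignore_index.

    logits: list of per-position score lists (one score per vocabulary entry).
    labels: list of target ids, same length as logits.
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked position: contributes nothing to the loss
        max_score = max(scores)
        # Numerically stable log of the softmax normalizer.
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_norm - scores[label]
        count += 1
    return total / count if count else 0.0


# Toy example with a 2-token vocabulary; only the first position is supervised.
toy_logits = [[2.0, 0.0], [0.0, 2.0], [5.0, -5.0]]
toy_labels = [0, -100, -100]
toy_loss = masked_cross_entropy(toy_logits, toy_labels)
```

Setting prompt or padding positions to -100 therefore removes them from the average without changing the sequence length.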

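transformers.models.vipllava.modeling_vipllava.VipLlavaCausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.models.vipllava.modeling_vipllava.VipLlavaCausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VipLlavaConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) - Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) - Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) - Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden states (keys and values in the self-attention blocks) that can be used to speed up sequential decoding (see the past_key_values input).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) - Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) - Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
image_hidden_states (tuple(torch.FloatTensor), optional) - Tuple of torch.FloatTensor (one for the output of the image embeddings) of shape (batch_size, num_images, sequence_length, hidden_size). Image hidden states of the model produced by the vision encoder, and optionally by the perceiver.

The VipLlavaForConditionalGeneration forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example: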

>>> import torch
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, VipLlavaForConditionalGeneration

>>> model = VipLlavaForConditionalGeneration.from_pretrained("llava-hf/vip-llava-7b-hf", device_map="auto", torch_dtype=torch.float16)
>>> processor = AutoProcessor.from_pretrained("llava-hf/vip-llava-7b-hf")

>>> prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n{}###Assistant:"
>>> question = "Can you please describe this image?"
>>> prompt = prompt.format(question)
>>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-neg.png"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(text=prompt, images=image, return_tensors="pt").to(0, torch.float16)

>>> # Generate
>>> generate_ids = model.generate(**inputs, max_new_tokens=20)
>>> processor.decode(generate_ids[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
The image features a brown and white cat sitting on a green surface, with a red ball in its

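Vision Encoder Decoder Models

Original text: huggingface.co/docs/transformers/v4.37.2/en/model_doc/vision-encoder-decoder

Overview

The VisionEncoderDecoderModel can be used to initialize an image-to-text model with any pretrained Transformer-based vision model as the encoder (e.g. ViT, BEiT, DeiT, Swin) and any pretrained language model as the decoder (e.g. RoBERTa, GPT2, BERT, DistilBERT).

The effectiveness of initializing image-to-text-sequence models with pretrained checkpoints has been shown in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei.

After such a VisionEncoderDecoderModel has been trained or fine-tuned, it can be saved and loaded just like any other model (see the examples below for more information).

An example application is image captioning, in which the encoder encodes the image and an autoregressive language model then generates the caption. Another example is optical character recognition. Refer to TrOCR, which is an instance of VisionEncoderDecoderModel.

Randomly initializing VisionEncoderDecoderModel from model configurations

VisionEncoderDecoderModel can be randomly initialized from an encoder config and a decoder config. The following example shows how to do this using the default ViTModel configuration for the encoder and the default BertForCausalLM configuration for the decoder.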

>>> from transformers import BertConfig, ViTConfig, VisionEncoderDecoderConfig, VisionEncoderDecoderModel

>>> config_encoder = ViTConfig()
>>> config_decoder = BertConfig()

>>> config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
>>> model = VisionEncoderDecoderModel(config=config)

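Initializing VisionEncoderDecoderModel from a pretrained encoder and a pretrained decoder

VisionEncoderDecoderModel can be initialized from a pretrained encoder checkpoint and a pretrained decoder checkpoint. Note that any pretrained Transformer-based vision model, e.g. Swin, can serve as the encoder. Pretrained auto-encoding models, e.g. BERT, pretrained causal language models, e.g. GPT2, and the pretrained decoder part of sequence-to-sequence models, e.g. the decoder of BART, can all serve as the decoder. Depending on which architecture you choose as the decoder, the cross-attention layers might be randomly initialized. Initializing VisionEncoderDecoderModel from a pretrained encoder and decoder checkpoint requires the model to be fine-tuned on a downstream task, as shown in the Warm-starting-encoder-decoder blog post. To do so, the VisionEncoderDecoderModel class provides a VisionEncoderDecoderModel.from_encoder_decoder_pretrained() method.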

>>> from transformers import VisionEncoderDecoderModel

>>> model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
...     "microsoft/swin-base-patch4-window7-224-in22k", "bert-base-uncased"
... )

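Loading an existing VisionEncoderDecoderModel checkpoint and performing inference

To load fine-tuned checkpoints of the VisionEncoderDecoderModel class, VisionEncoderDecoderModel provides the from_pretrained(...) method, just like any other model architecture in Transformers.

To perform inference, use the generate method, which autoregressively generates text. This method supports various forms of decoding, such as greedy decoding, beam search, and multinomial sampling.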

>>> import requests
>>> from PIL import Image

>>> from transformers import GPT2TokenizerFast, ViTImageProcessor, VisionEncoderDecoderModel

>>> # load a fine-tuned image captioning model and corresponding tokenizer and image processor
>>> model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
>>> tokenizer = GPT2TokenizerFast.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
>>> image_processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

>>> # let's perform inference on an image
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values

>>> # autoregressively generate caption (uses greedy decoding by default)
>>> generated_ids = model.generate(pixel_values)
>>> generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
>>> print(generated_text)
a cat laying on a blanket next to a cat laying on a bed
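The decoding strategies mentioned before the example differ in how the next token is chosen from the model's scores. The sketch below is a toy illustration in plain Python, not the generate implementation: it contrasts one greedy step with one multinomial sampling step over made-up scores. Beam search, which tracks several candidate sequences at once, is left out for brevity.

```python
import math
import random


def softmax(scores):
    """Convert raw scores into probabilities."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_step(scores):
    """Greedy decoding: always pick the highest-scoring token id."""
    return max(range(len(scores)), key=lambda i: scores[i])


def sampling_step(scores, rng):
    """Multinomial sampling: draw a token id in proportion to its probability."""
    probs = softmax(scores)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


# Toy next-token scores over a 4-token vocabulary (made-up numbers).
scores = [0.1, 2.5, 0.3, 1.9]
greedy_token = greedy_step(scores)
sampled_token = sampling_step(scores, random.Random(0))
```

Greedy decoding is deterministic for a given input, while sampling can return a different token on each call unless the random generator is seeded.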

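Loading a PyTorch checkpoint into TFVisionEncoderDecoderModel

TFVisionEncoderDecoderModel.from_pretrained() currently does not support initializing the model from a PyTorch checkpoint. Passing from_pt=True to this method will throw an exception. If only PyTorch checkpoints exist for a particular vision encoder-decoder model, the following workaround can be used: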

>>> from transformers import VisionEncoderDecoderModel, TFVisionEncoderDecoderModel

>>> _model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

>>> _model.encoder.save_pretrained("./encoder")
>>> _model.decoder.save_pretrained("./decoder")

>>> model = TFVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
...     "./encoder", "./decoder", encoder_from_pt=True, decoder_from_pt=True
... )
>>> # This is only for copying some specific attributes of this particular model.
>>> model.config = _model.config

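Training

Once the model is created, it can be fine-tuned on a dataset of (image, text) pairs, similar to BART, T5, or any other encoder-decoder model. As you can see, the model needs only 2 inputs to compute a loss: pixel_values (the images) and labels (the input_ids of the encoded target sequence).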

>>> from transformers import ViTImageProcessor, BertTokenizer, VisionEncoderDecoderModel
>>> from datasets import load_dataset

>>> image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
>>> model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
...     "google/vit-base-patch16-224-in21k", "bert-base-uncased"
... )

>>> model.config.decoder_start_token_id = tokenizer.cls_token_id
>>> model.config.pad_token_id = tokenizer.pad_token_id

>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]
>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values

>>> labels = tokenizer(
...     "an image of two cats chilling on a couch",
...     return_tensors="pt",
... ).input_ids

>>> # the forward function automatically creates the correct decoder_input_ids
>>> loss = model(pixel_values=pixel_values, labels=labels).loss
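The example above uses a single target sequence. When several targets of different lengths are batched, they are padded, and a common practice is to replace the padding ids in labels with -100 so that those positions are ignored by the loss. The sketch below shows this step in plain Python. The helper name and the toy ids are hypothetical, and real code would normally operate on tensors.

```python
def mask_padding_in_labels(label_ids, pad_token_id, ignore_index=-100):
    """Replace padding ids in a batch of label sequences with ignore_index."""
    return [
        [ignore_index if token == pad_token_id else token for token in row]
        for row in label_ids
    ]


# Toy batch: 0 plays the role of the pad token id (hypothetical value).
padded_labels = [[101, 7, 8, 102], [101, 9, 102, 0]]
masked_labels = mask_padding_in_labels(padded_labels, pad_token_id=0)
```

Note that if a tokenizer reuses its end-of-sequence id as the pad id, this simple replacement would also mask the genuine end-of-sequence token.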

This model was contributed by nielsr. The TensorFlow and Flax versions of this model were contributed by ydshieh.

VisionEncoderDecoderConfig

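`class transformers.VisionEncoderDecoderConfig`

<source>

( **kwargs )

Parameters

kwargs (optional) - Dictionary of keyword arguments. Notably:
- encoder (PretrainedConfig, optional) - An instance of a configuration object that defines the encoder config.
- decoder (PretrainedConfig, optional) - An instance of a configuration object that defines the decoder config.

VisionEncoderDecoderConfig is the configuration class to store the configuration of a VisionEncoderDecoderModel. It is used to instantiate a Vision-Encoder-Text-Decoder model according to the specified arguments, defining the encoder and decoder configs.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example: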

>>> from transformers import BertConfig, ViTConfig, VisionEncoderDecoderConfig, VisionEncoderDecoderModel

>>> # Initializing a ViT & BERT style configuration
>>> config_encoder = ViTConfig()
>>> config_decoder = BertConfig()

>>> config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)

>>> # Initializing a ViTBert model (with random weights) from a ViT & bert-base-uncased style configurations
>>> model = VisionEncoderDecoderModel(config=config)

>>> # Accessing the model configuration
>>> config_encoder = model.config.encoder
>>> config_decoder = model.config.decoder
>>> # set decoder config to causal lm
>>> config_decoder.is_decoder = True
>>> config_decoder.add_cross_attention = True

>>> # Saving the model, including its configuration
>>> model.save_pretrained("my-model")

>>> # loading model and config from pretrained folder
>>> encoder_decoder_config = VisionEncoderDecoderConfig.from_pretrained("my-model")
>>> model = VisionEncoderDecoderModel.from_pretrained("my-model", config=encoder_decoder_config)

`from_encoder_decoder_configs`

<source>

( encoder_config: PretrainedConfig decoder_config: PretrainedConfig **kwargs ) → VisionEncoderDecoderConfig

VisionEncoderDecoderConfig

An instance of a configuration object

Instantiate a VisionEncoderDecoderConfig (or a derived class) from a pretrained encoder model configuration and a decoder model configuration.

VisionEncoderDecoderModel

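`class transformers.VisionEncoderDecoderModel`

<source>

( config: Optional = None encoder: Optional = None decoder: Optional = None )

Parameters

config (VisionEncoderDecoderConfig) - Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

This class can be used to initialize an image-to-text-sequence model with any pretrained vision autoencoding model as the encoder and any pretrained text autoregressive model as the decoder. The encoder is loaded via the from_pretrained() function and the decoder is loaded via the from_pretrained() function. Cross-attention layers are automatically added to the decoder and should be fine-tuned on a downstream generative task, like image captioning.

The effectiveness of initializing sequence-to-sequence models with pretrained checkpoints for sequence generation tasks was shown in Leveraging Pre-trained Checkpoints for Sequence Generation Tasks by Sascha Rothe, Shashi Narayan, Aliaksei Severyn, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.

Additionally, TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models shows that leveraging large pretrained vision models for optical character recognition (OCR) yields a significant performance improvement.

After such a Vision-Encoder-Text-Decoder model has been trained or fine-tuned, it can be saved and loaded just like any other model (see the examples for more information).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch module and refer to the PyTorch documentation for all matters related to general usage and behavior.

VisionEncoderDecoderModel is a generic model class that is instantiated as a transformer architecture with one of the library's base vision model classes as the encoder and another base model class as the decoder when created with the AutoModel.from_pretrained() class method.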

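`forward`

<source>

( pixel_values: Optional = None decoder_input_ids: Optional = None decoder_attention_mask: Optional = None encoder_outputs: Optional = None past_key_values: Optional = None decoder_inputs_embeds: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None **kwargs ) → transformers.modeling_outputs.Seq2SeqLMOutput or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) - Pixel values. Pixel values can be obtained using an image processor (e.g. if you use ViT as the encoder, you should use AutoImageProcessor). See ViTImageProcessor.call() for details.
decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) - Indices of decoder input sequence tokens in the vocabulary. Indices can be obtained using PreTrainedTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. What are input IDs? If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values). For training, decoder_input_ids are automatically created by the model by shifting the labels to the right, replacing -100 with the pad_token_id, and prepending the decoder_start_token_id.
decoder_attention_mask (torch.BoolTensor of shape (batch_size, target_sequence_length), optional) - Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. A causal mask is also used by default.
encoder_outputs (tuple(torch.FloatTensor), optional) - This tuple must consist of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) is a tensor of hidden states at the output of the last layer of the encoder. It is used in the cross-attention of the decoder.
past_key_values (tuple(tuple(torch.FloatTensor)) of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)) - Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If past_key_values is used, the user can optionally input only the last decoder_input_ids (those that do not have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) - Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model's internal embedding lookup matrix provides.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) - Labels for computing the masked language modeling loss for the decoder. Indices should be in [-100, 0, ..., config.vocab_size] (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
use_cache (bool, optional) - If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) - Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) - Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) - If set to True, the model will return a ~utils.Seq2SeqLMOutput instead of a plain tuple.
kwargs (optional) - Remaining dictionary of keyword arguments. Keyword arguments come in two flavors:
- Without a prefix, which will be input as **encoder_kwargs for the encoder forward function.
- With a decoder_ prefix, which will be input as **decoder_kwargs for the decoder forward function.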
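The decoder_input_ids description above says that, for training, the model derives decoder_input_ids from labels by shifting them to the right, replacing -100 with pad_token_id, and prepending decoder_start_token_id. The sketch below restates those three steps in plain Python on nested lists. It is an illustration of the described behavior rather than the library's own function, and the token ids are made up.

```python
def shift_tokens_right(labels, pad_token_id, decoder_start_token_id, ignore_index=-100):
    """Derive decoder input ids from labels.

    Each row is shifted one position to the right, decoder_start_token_id is
    placed first, and any ignore_index entry is replaced by pad_token_id.
    """
    shifted = []
    for row in labels:
        # Drop the last label and put the start token in front.
        new_row = [decoder_start_token_id] + list(row[:-1])
        shifted.append([pad_token_id if t == ignore_index else t for t in new_row])
    return shifted


# Toy values: 101 as the start id and 0 as the pad id are assumptions for the example.
example_labels = [[7, 8, 9, 102], [5, 102, -100, -100]]
example_decoder_input_ids = shift_tokens_right(
    example_labels, pad_token_id=0, decoder_start_token_id=101
)
```

This is why decoder_start_token_id and pad_token_id must be set on the model config before training with labels only.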

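transformers.modeling_outputs.Seq2SeqLMOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.Seq2SeqLMOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisionEncoderDecoderConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) - Language modeling loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) - Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) - Tuple of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding (see the past_key_values input).
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) - Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) - Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) - Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) - Sequence of hidden states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) - Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) - Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

The VisionEncoderDecoderModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example: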

>>> from transformers import AutoProcessor, VisionEncoderDecoderModel
>>> import requests
>>> from PIL import Image
>>> import torch

>>> processor = AutoProcessor.from_pretrained("microsoft/trocr-base-handwritten")
>>> model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

>>> # load image from the IAM dataset
>>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

>>> # training
>>> model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
>>> model.config.pad_token_id = processor.tokenizer.pad_token_id
>>> model.config.vocab_size = model.config.decoder.vocab_size

>>> pixel_values = processor(image, return_tensors="pt").pixel_values
>>> text = "hello world"
>>> labels = processor.tokenizer(text, return_tensors="pt").input_ids
>>> outputs = model(pixel_values=pixel_values, labels=labels)
>>> loss = outputs.loss

>>> # inference (generation)
>>> generated_ids = model.generate(pixel_values)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

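`from_encoder_decoder_pretrained`

<source>

( encoder_pretrained_model_name_or_path: str = None decoder_pretrained_model_name_or_path: str = None *model_args **kwargs )

Parameters

encoder_pretrained_model_name_or_path (str, optional) - Information necessary to initiate the image encoder. Can be either:
- A string, the model id of a pretrained model hosted inside a model repo on huggingface.co. An example is google/vit-base-patch16-224-in21k.
- A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
- A path or url to a tensorflow index checkpoint file (e.g., ./tf_model/model.ckpt.index). In this case, from_tf should be set to True and a configuration object should be provided as the config argument. This loading path is slower than converting the TensorFlow checkpoint to a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.
decoder_pretrained_model_name_or_path (str, optional, defaults to None) - Information necessary to initiate the text decoder. Can be either:
- A string, the model id of a pretrained model hosted inside a model repo on huggingface.co. Valid model ids can be located at the root level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.
- A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
- A path or url to a tensorflow index checkpoint file (e.g., ./tf_model/model.ckpt.index). In this case, from_tf should be set to True and a configuration object should be provided as the config argument. This loading path is slower than converting the TensorFlow checkpoint to a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.
model_args (remaining positional arguments, optional) - All remaining positional arguments are passed to the underlying model's __init__ method.
kwargs (remaining dictionary of keyword arguments, optional) - Can be used to update the configuration object (after it has been loaded) and initiate the model (e.g., output_attentions=True).
- To update the encoder configuration, use the prefix encoder_ for each configuration parameter.
- To update the decoder configuration, use the prefix decoder_ for each configuration parameter.
- To update the parent model configuration, do not use a prefix for each configuration parameter.
Behaves differently depending on whether a config is provided.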
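The prefix convention for kwargs can be sketched as a three-way split. The helper below is a plain-Python illustration, not the library's code. The function name is hypothetical, and it assumes the encoder_ or decoder_ prefix is stripped before the value reaches the corresponding sub-configuration.

```python
def split_config_kwargs(kwargs):
    """Split keyword arguments into encoder, decoder, and parent-model groups."""
    encoder_kwargs, decoder_kwargs, parent_kwargs = {}, {}, {}
    for name, value in kwargs.items():
        if name.startswith("encoder_"):
            encoder_kwargs[name[len("encoder_"):]] = value
        elif name.startswith("decoder_"):
            decoder_kwargs[name[len("decoder_"):]] = value
        else:
            # No prefix: the argument applies to the parent model configuration.
            parent_kwargs[name] = value
    return encoder_kwargs, decoder_kwargs, parent_kwargs


enc, dec, parent = split_config_kwargs(
    {
        "encoder_hidden_dropout_prob": 0.1,
        "decoder_output_attentions": True,
        "tie_encoder_decoder": False,
    }
)
```

Under this reading, passing decoder_output_attentions=True would affect only the decoder configuration, while an unprefixed argument would update the combined model's configuration.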

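Instantiate an encoder and a decoder from one or two base classes of the library from pretrained model checkpoints.

The model is set in evaluation mode by default using model.eval() (Dropout modules are deactivated). To train the model, you need to first set it back in training mode with model.train().

Example: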

>>> from transformers import VisionEncoderDecoderModel

>>> # initialize a vit-bert from a pretrained ViT and a pretrained BERT model. Note that the cross-attention layers will be randomly initialized
>>> model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
...     "google/vit-base-patch16-224-in21k", "bert-base-uncased"
... )
>>> # saving model after fine-tuning
>>> model.save_pretrained("./vit-bert")
>>> # load fine-tuned model
>>> model = VisionEncoderDecoderModel.from_pretrained("./vit-bert")

TFVisionEncoderDecoderModel

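`class transformers.TFVisionEncoderDecoderModel`

<source>

( config: Optional[PretrainedConfig] = None encoder: Optional[TFPreTrainedModel] = None decoder: Optional[TFPreTrainedModel] = None )

Parameters

config (VisionEncoderDecoderConfig) - Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

This class can be used to initialize an image-to-text-sequence model with any pretrained vision autoencoding model as the encoder and any pretrained text autoregressive model as the decoder. The encoder is loaded via the from_pretrained() function and the decoder is loaded via the from_pretrained() function. Cross-attention layers are automatically added to the decoder and should be fine-tuned on a downstream generative task, like image captioning.

The effectiveness of initializing sequence-to-sequence models with pretrained checkpoints for sequence generation tasks was shown in Leveraging Pre-trained Checkpoints for Sequence Generation Tasks by Sascha Rothe, Shashi Narayan, Aliaksei Severyn, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.

Additionally, TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models shows that leveraging large pretrained vision models for optical character recognition (OCR) yields a significant performance improvement.

After such a Vision-Encoder-Text-Decoder model has been trained or fine-tuned, it can be saved and loaded just like any other model (see the examples for more information).

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a tf.keras.Model subclass. Use it as a regular TF 2.0 Keras model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

TFVisionEncoderDecoderModel is a generic model class that is instantiated as a transformer architecture with one of the library's base vision model classes as the encoder and another base model class as the decoder when created with the from_pretrained() class method.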

`call`

<source>

( pixel_values: np.ndarray | tf.Tensor | None = None decoder_input_ids: np.ndarray | tf.Tensor | None = None decoder_attention_mask: np.ndarray | tf.Tensor | None = None encoder_outputs: Optional[Union[Tuple, TFBaseModelOutput]] = None past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None decoder_inputs_embeds: np.ndarray | tf.Tensor | None = None labels: np.ndarray | tf.Tensor | None = None use_cache: Optional[bool] = None output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None return_dict: Optional[bool] = None training: bool = False **kwargs ) → transformers.modeling_tf_outputs.TFSeq2SeqLMOutput or tuple(tf.Tensor)

Parameters

pixel_values (np.ndarray, tf.Tensor, List[tf.Tensor], Dict[str, tf.Tensor] or Dict[str, np.ndarray], where each example must have the shape (batch_size, num_channels, height, width)) - Pixel values. Pixel values can be obtained using the vision model's image processor, for example AutoImageProcessor. See ViTImageProcessor.call() for details.
decoder_input_ids (np.ndarray or tf.Tensor of shape (batch_size, target_sequence_length), optional) - Indices of decoder input sequence tokens in the vocabulary. Indices can be obtained using PreTrainedTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. What are input IDs? If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values). Provide these to the decoder for sequence-to-sequence training.
decoder_attention_mask (np.ndarray or tf.Tensor of shape (batch_size, target_sequence_length), optional) - Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. A causal mask is also used by default.
encoder_outputs (tuple(tuple(tf.Tensor)), optional) - This tuple must consist of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) is a tensor of hidden states at the output of the last layer of the encoder. It is used in the cross-attention of the decoder.
past_key_values (tuple(tuple(tf.Tensor)) of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)) - Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If past_key_values is used, the user can optionally input only the last decoder_input_ids (those that do not have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
decoder_inputs_embeds (np.ndarray or tf.Tensor of shape (batch_size, target_sequence_length, hidden_size), optional) - Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model's internal embedding lookup matrix provides.
labels (np.ndarray or tf.Tensor of shape (batch_size, sequence_length), optional) - Labels for computing the masked language modeling loss for the decoder. Indices should be in [-100, 0, ..., config.vocab_size] (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
use_cache (bool, optional) - If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) - Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) - Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) - If set to True, the model will return a ~utils.Seq2SeqLMOutput instead of a plain tuple.
training (bool, optional, defaults to False) - Whether or not to use the model in training mode (some modules, like dropout modules, behave differently between training and evaluation).
kwargs (optional) - Remaining dictionary of keyword arguments. Keyword arguments come in two flavors:
- Without a prefix, which will be input as **encoder_kwargs for the encoder forward function.
- With a decoder_ prefix, which will be input as **decoder_kwargs for the decoder forward function.

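transformers.modeling_tf_outputs.TFSeq2SeqLMOutput or tuple(tf.Tensor)

A transformers.modeling_tf_outputs.TFSeq2SeqLMOutput or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisionEncoderDecoderConfig) and inputs.

loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) - Language modeling loss.
logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) - Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) - List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden states (keys and values in the attention blocks) of the decoder that can be used to speed up sequential decoding.
decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) - Tuple of tf.Tensor of shape (batch_size, sequence_length, hidden_size). Hidden states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) - Tuple of tf.Tensor of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) - Tuple of tf.Tensor of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) - Sequence of hidden states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) - Tuple of tf.Tensor (one for the output of the embeddings plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) - Tuple of tf.Tensor of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

The TFVisionEncoderDecoderModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example: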

>>> from transformers import AutoImageProcessor, AutoTokenizer, TFVisionEncoderDecoderModel
>>> from PIL import Image
>>> import requests

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> decoder_tokenizer = AutoTokenizer.from_pretrained("gpt2")

>>> # initialize a vit-gpt2 model from a pretrained ViT and a pretrained GPT2 model. Note that the cross-attention layers will be randomly initialized
>>> model = TFVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
...     "google/vit-base-patch16-224-in21k", "gpt2"
... )

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> img = Image.open(requests.get(url, stream=True).raw)

>>> # forward
>>> pixel_values = image_processor(images=img, return_tensors="tf").pixel_values  # Batch size 1
>>> decoder_input_ids = decoder_tokenizer("Linda Davis", return_tensors="tf").input_ids  # Batch size 1
>>> outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)

>>> # training
>>> outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids, labels=decoder_input_ids)
>>> loss, logits = outputs.loss, outputs.logits

>>> # save and load from pretrained
>>> model.save_pretrained("vit-gpt2")
>>> model = TFVisionEncoderDecoderModel.from_pretrained("vit-gpt2")

>>> # generation
>>> generated = model.generate(pixel_values, decoder_start_token_id=model.config.decoder.bos_token_id)

`from_encoder_decoder_pretrained`

<来源>

( encoder_pretrained_model_name_or_path: str = None decoder_pretrained_model_name_or_path: str = None *model_args **kwargs )

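Parameters

encoder_pretrained_model_name_or_path (str, optional) — Information necessary to initialize the encoder. Can be either:
- A string, the model id of a pretrained model hosted inside a model repo on huggingface.co. An example is google/vit-base-patch16-224-in21k.
- A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
- A path or url to a PyTorch index checkpoint file (e.g., ./pt_model/). In this case, encoder_from_pt should be set to True.
decoder_pretrained_model_name_or_path (str, optional, defaults to None) — Information necessary to initialize the decoder. Can be either:
- A string, the model id of a pretrained model hosted inside a model repo on huggingface.co. Valid model ids can be located at the root level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.
- A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
- A path or url to a PyTorch checkpoint file (e.g., ./pt_model/). In this case, decoder_from_pt should be set to True.
model_args (remaining positional arguments, optional) — All remaining positional arguments will be passed to the underlying model's __init__ method.
kwargs (remaining dictionary of keyword arguments, optional) — Can be used to update the configuration object (after it is loaded) and to initialize the model (e.g., output_attentions=True).
- To update the encoder configuration, use the prefix encoder_ for each configuration parameter.
- To update the decoder configuration, use the prefix decoder_ for each configuration parameter.
- To update the parent model configuration, do not use a prefix for each configuration parameter.
Behaves differently depending on whether a config is provided or automatically loaded.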
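The prefix convention for kwargs described above can be illustrated with a small, library-independent sketch. The helper name split_prefixed_kwargs and the example keys are hypothetical and exist only for illustration; this is not the library's implementation, just one way the routing by prefix could look.

```python
def split_prefixed_kwargs(kwargs):
    """Route keyword arguments to the encoder, decoder, or parent config by prefix.

    Hypothetical helper for illustration only; not part of transformers.
    """
    encoder, decoder, parent = {}, {}, {}
    for key, value in kwargs.items():
        if key.startswith("encoder_"):
            # strip the prefix so the encoder config sees the bare parameter name
            encoder[key[len("encoder_"):]] = value
        elif key.startswith("decoder_"):
            decoder[key[len("decoder_"):]] = value
        else:
            # un-prefixed keys are meant for the parent (composite) config
            parent[key] = value
    return encoder, decoder, parent


encoder_kwargs, decoder_kwargs, parent_kwargs = split_prefixed_kwargs(
    {
        "encoder_hidden_dropout_prob": 0.0,
        "decoder_output_attentions": True,
        "tie_word_embeddings": False,
    }
)
print(encoder_kwargs, decoder_kwargs, parent_kwargs)
```

Under this reading, a call such as from_encoder_decoder_pretrained(..., decoder_output_attentions=True) would only affect the decoder configuration.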

Instantiate an encoder and a decoder from one or two base classes of the library, using pretrained model checkpoints.

Example:

>>> from transformers import TFVisionEncoderDecoderModel

>>> # initialize a vit-bert from a pretrained ViT and a pretrained BERT model. Note that the cross-attention layers will be randomly initialized
>>> model = TFVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
...     "google/vit-base-patch16-224-in21k", "bert-base-uncased"
... )
>>> # saving model after fine-tuning
>>> model.save_pretrained("./vit-bert")
>>> # load fine-tuned model
>>> model = TFVisionEncoderDecoderModel.from_pretrained("./vit-bert")

FlaxVisionEncoderDecoderModel

`class transformers.FlaxVisionEncoderDecoderModel`

<来源>

( config: VisionEncoderDecoderConfig input_shape: Optional = None seed: int = 0 dtype: dtype = <class 'jax.numpy.float32'> _do_init: bool = True **kwargs )

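Parameters

config (VisionEncoderDecoderConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
dtype (jax.numpy.dtype, optional, defaults to jax.numpy.float32) — The data type of the computation. Can be one of jax.numpy.float32, jax.numpy.float16 (on GPUs) and jax.numpy.bfloat16 (on TPUs). This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified, all the computation will be performed with the given dtype. Note that this only specifies the dtype of the computation and does not influence the dtype of the model parameters. If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().

This class can be used to initialize an image-to-text-sequence model with any pretrained vision autoencoding model as the encoder and any pretrained text autoregressive model as the decoder. Both the encoder and the decoder are loaded via their respective from_pretrained() functions. Cross-attention layers are automatically added to the decoder and should be fine-tuned on a downstream generative task, such as image captioning.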

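The effectiveness of initializing sequence-to-sequence models with pretrained checkpoints for sequence generation tasks was shown in Leveraging Pre-trained Checkpoints for Sequence Generation Tasks by Sascha Rothe, Shashi Narayan, and Aliaksei Severyn.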

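Additionally, TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models shows how leveraging large pretrained vision models for optical character recognition (OCR) yields a significant performance improvement.

After such a Vision-Encoder-Text-Decoder model has been trained/fine-tuned, it can be saved/loaded just like any other model (see the examples for more information).

This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a Flax Linen flax.linen.Module subclass. Use it as a regular Flax module and refer to the Flax documentation for all matters related to general usage and behavior.

FlaxVisionEncoderDecoderModel is a generic model class that will be instantiated as a transformer architecture with the module (flax.linen.Module) of one of the base vision model classes of the library as the encoder module and another one as the decoder module, when created with the FlaxAutoModel.from_pretrained() class method for the encoder and the FlaxAutoModelForCausalLM.from_pretrained() class method for the decoder.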

`__call__`

< source >

( pixel_values: Array decoder_input_ids: Optional = None decoder_attention_mask: Optional = None decoder_position_ids: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None train: bool = False params: dict = None dropout_rng: PRNGKey = None ) → transformers.modeling_flax_outputs.FlaxSeq2SeqLMOutput or tuple(jnp.ndarray)

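Parameters

pixel_values (jnp.ndarray of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using the vision model's image processor, for example AutoImageProcessor. See ViTImageProcessor.__call__() for details.
decoder_input_ids (jnp.ndarray of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary. Indices can be obtained using PreTrainedTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. What are decoder input IDs?
decoder_attention_mask (jnp.ndarray of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. A causal mask is also used by default.
decoder_position_ids (jnp.ndarray of shape (batch_size, sequence_length), optional) — Indices of positions of each decoder input sequence token in the position embeddings. Selected in the range [0, config.decoder.max_position_embeddings - 1].
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — If set to True, the model will return a FlaxSeq2SeqLMOutput instead of a plain tuple.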
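The decoder_attention_mask description above combines two ideas: ignoring pad tokens and applying a causal mask. The NumPy sketch below shows how those two masks can be combined conceptually. The function name default_decoder_mask, the token ids, and the choice of 0 as pad id are assumptions for illustration; this is not the library's internal code.

```python
import numpy as np


def default_decoder_mask(decoder_input_ids, pad_token_id):
    """Conceptual sketch: padding mask combined with a causal mask (hypothetical helper)."""
    # 1 where the token is real, 0 where it is padding: shape (batch, tgt_len)
    padding_mask = (decoder_input_ids != pad_token_id).astype(np.int32)
    tgt_len = decoder_input_ids.shape[1]
    # Lower-triangular matrix: query position i may attend to key positions <= i
    causal_mask = np.tril(np.ones((tgt_len, tgt_len), dtype=np.int32))
    # Broadcast to (batch, tgt_len, tgt_len): a key is visible only if it is
    # both in the past (causal) and not a padding token
    combined = causal_mask[None, :, :] * padding_mask[:, None, :]
    return padding_mask, combined


ids = np.array([[5, 6, 7, 0]])  # hypothetical token ids, with 0 assumed to be the pad id
padding_mask, combined = default_decoder_mask(ids, pad_token_id=0)
print(padding_mask)
print(combined[0])
```

Each row of the combined matrix corresponds to one query position; the last column stays zero because that key position is padding.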

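transformers.modeling_flax_outputs.FlaxSeq2SeqLMOutput or tuple(jnp.ndarray)

A transformers.modeling_flax_outputs.FlaxSeq2SeqLMOutput or a tuple of jnp.ndarray (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisionEncoderDecoderConfig) and inputs.

logits (jnp.ndarray of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before softmax).
past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains precomputed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding (see the past_key_values input).
decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings, one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings, one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

The FlaxVisionEncoderDecoderModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of calling this function directly, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example: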

>>> from transformers import FlaxVisionEncoderDecoderModel, AutoImageProcessor, AutoTokenizer
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

>>> # load output tokenizer
>>> tokenizer_output = AutoTokenizer.from_pretrained("gpt2")

>>> # initialize a vit-gpt2 from pretrained ViT and GPT2 models. Note that the cross-attention layers will be randomly initialized
>>> model = FlaxVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
...     "google/vit-base-patch16-224-in21k", "gpt2"
... )

>>> pixel_values = image_processor(images=image, return_tensors="np").pixel_values

>>> # use GPT2's eos_token as the pad as well as eos token
>>> model.config.eos_token_id = model.config.decoder.eos_token_id
>>> model.config.pad_token_id = model.config.eos_token_id

>>> # generation
>>> sequences = model.generate(pixel_values, num_beams=4, max_length=12).sequences

>>> captions = tokenizer_output.batch_decode(sequences, skip_special_tokens=True)

`from_encoder_decoder_pretrained`

<来源>

( encoder_pretrained_model_name_or_path: Union = None decoder_pretrained_model_name_or_path: Union = None *model_args **kwargs )

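Parameters

encoder_pretrained_model_name_or_path (Union[str, os.PathLike], optional) — Information necessary to initialize the encoder. Can be either:
- A string, the model id of a pretrained model hosted inside a model repo on huggingface.co. An example is google/vit-base-patch16-224-in21k.
- A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
decoder_pretrained_model_name_or_path (Union[str, os.PathLike], optional, defaults to None) — Information necessary to initialize the decoder. Can be either:
- A string, the model id of a pretrained model hosted inside a model repo on huggingface.co. Valid model ids can be located at the root level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.
- A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
model_args (remaining positional arguments, optional) — All remaining positional arguments will be passed to the underlying model's __init__ method.
kwargs (remaining dictionary of keyword arguments, optional) — Can be used to update the configuration object (after it is loaded) and to initialize the model (e.g., output_attentions=True).
- To update the encoder configuration, use the prefix encoder_ for each configuration parameter.
- To update the decoder configuration, use the prefix decoder_ for each configuration parameter.
- To update the parent model configuration, do not use a prefix for each configuration parameter.
Behaves differently depending on whether a config is provided or automatically loaded.

Instantiate an encoder and a decoder from one or two base classes of the library, using pretrained model checkpoints.

Example: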

>>> from transformers import FlaxVisionEncoderDecoderModel

>>> # initialize a vit-gpt2 from a pretrained ViT and a pretrained GPT2 model. Note that the cross-attention layers will be randomly initialized
>>> model = FlaxVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
...     "google/vit-base-patch16-224-in21k", "gpt2"
... )
>>> # saving model after fine-tuning
>>> model.save_pretrained("./vit-gpt2")
>>> # load fine-tuned model
>>> model = FlaxVisionEncoderDecoderModel.from_pretrained("./vit-gpt2")

VisionTextDualEncoder

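Original text: huggingface.co/docs/transformers/v4.37.2/en/model_doc/vision-text-dual-encoder

Overview

The VisionTextDualEncoderModel can be used to initialize a vision-text dual encoder model with any pretrained vision autoencoding model as the vision encoder (e.g. ViT, BEiT, DeiT) and any pretrained text autoencoding model as the text encoder (e.g. RoBERTa, BERT). Two projection layers are added on top of the vision and text encoders to project the output embeddings into a shared latent space. The projection layers are randomly initialized, so the model should be fine-tuned on a downstream task. The model can be used to align vision and text embeddings with CLIP-like contrastive image-text training, and can then be used for zero-shot vision tasks such as image classification or retrieval.

LiT: Zero-Shot Transfer with Locked-image Text Tuning shows how leveraging pretrained (locked/frozen) image and text models for contrastive learning yields a significant improvement on new zero-shot vision tasks such as image classification or retrieval.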

VisionTextDualEncoderConfig

`class transformers.VisionTextDualEncoderConfig`

< source >

( projection_dim = 512 logit_scale_init_value = 2.6592 **kwargs )

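Parameters

projection_dim (int, optional, defaults to 512) — Dimensionality of the text and vision projection layers.
logit_scale_init_value (float, optional, defaults to 2.6592) — The initial value of the logit_scale parameter. The default follows the original CLIP implementation.
kwargs (optional) — Dictionary of keyword arguments.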
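The default logit_scale_init_value of 2.6592 listed above is easier to read once it is related to a temperature. In the original CLIP implementation the parameter is initialized to the logarithm of 1 / 0.07 and the similarity logits are multiplied by its exponential. The short check below is only a worked piece of arithmetic under that assumption; the variable names are illustrative.

```python
import math

logit_scale_init_value = 2.6592  # default value from the signature above
temperature = 0.07               # assumption: the initial temperature used by CLIP

# The model multiplies similarities by exp(logit_scale), i.e. roughly 1 / temperature.
scale = math.exp(logit_scale_init_value)
print(scale, 1 / temperature)

# Conversely, log(1 / 0.07) is approximately the documented default.
print(math.log(1 / temperature))
```

In other words, cosine similarities in [-1, 1] are stretched by a factor of about 14.3 before the softmax at initialization.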

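VisionTextDualEncoderConfig is the configuration class to store the configuration of a VisionTextDualEncoderModel. It is used to instantiate a VisionTextDualEncoderModel according to the specified arguments, defining the text model and vision model configurations.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example: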

>>> from transformers import ViTConfig, BertConfig, VisionTextDualEncoderConfig, VisionTextDualEncoderModel

>>> # Initializing a BERT and ViT configuration
>>> config_vision = ViTConfig()
>>> config_text = BertConfig()

>>> config = VisionTextDualEncoderConfig.from_vision_text_configs(config_vision, config_text, projection_dim=512)

>>> # Initializing a BERT and ViT model (with random weights)
>>> model = VisionTextDualEncoderModel(config=config)

>>> # Accessing the model configuration
>>> config_vision = model.config.vision_config
>>> config_text = model.config.text_config

>>> # Saving the model, including its configuration
>>> model.save_pretrained("vit-bert")

>>> # loading model and config from pretrained folder
>>> vision_text_config = VisionTextDualEncoderConfig.from_pretrained("vit-bert")
>>> model = VisionTextDualEncoderModel.from_pretrained("vit-bert", config=vision_text_config)

`from_vision_text_configs`

< source >

( vision_config: PretrainedConfig text_config: PretrainedConfig **kwargs ) → VisionTextDualEncoderConfig

VisionTextDualEncoderConfig

An instance of a configuration object.

Instantiate a VisionTextDualEncoderConfig (or a derived class) from a text model configuration and a vision model configuration.

VisionTextDualEncoderProcessor

`class transformers.VisionTextDualEncoderProcessor`

< source >

( image_processor = None tokenizer = None **kwargs )

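Parameters

image_processor (AutoImageProcessor, optional) — The image processor is a required input.
tokenizer (PreTrainedTokenizer, optional) — The tokenizer is a required input.

Constructs a VisionTextDualEncoder processor which wraps an image processor and a tokenizer into a single processor.

VisionTextDualEncoderProcessor offers all the functionalities of AutoImageProcessor and AutoTokenizer. See __call__() and decode() for more information.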

`batch_decode`

<来源>

( *args **kwargs )

This method forwards all its arguments to the batch_decode() method of the wrapped tokenizer. Please refer to the docstring of that method for more information.

`decode`

<来源>

( *args **kwargs )

This method forwards all its arguments to the decode() method of the wrapped tokenizer. Please refer to the docstring of that method for more information.

VisionTextDualEncoderModel

`class transformers.VisionTextDualEncoderModel`

<来源>

( config: Optional = None vision_model: Optional = None text_model: Optional = None )

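Parameters

config (VisionTextDualEncoderConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

This class can be used to initialize a vision-text dual encoder model with any pretrained vision autoencoding model as the vision encoder and any pretrained text model as the text encoder. The vision and text encoders are loaded via the from_pretrained() method. The projection layers are automatically added to the model and should be fine-tuned on a downstream task, such as contrastive image-text modeling.

LiT: Zero-Shot Transfer with Locked-image Text Tuning shows how leveraging pretrained (locked/frozen) image and text models for contrastive learning yields a significant improvement on new zero-shot vision tasks such as image classification or retrieval.

After such a Vision-Text-Dual-Encoder model has been trained/fine-tuned, it can be saved/loaded just like any other model (see the examples for more information).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.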

`forward`

< source >

( input_ids: Optional = None pixel_values: Optional = None attention_mask: Optional = None position_ids: Optional = None return_loss: Optional = None token_type_ids: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.models.clip.modeling_clip.CLIPOutput or tuple(torch.FloatTensor)

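Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. What are input IDs?
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]. What are position IDs?
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using an image processor (e.g. if you use ViT as the encoder, you should use AutoImageProcessor). See ViTImageProcessor.__call__() for details.
return_loss (bool, optional) — Whether or not to return the contrastive loss.
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.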

transformers.models.clip.modeling_clip.CLIPOutput or tuple(torch.FloatTensor)

A transformers.models.clip.modeling_clip.CLIPOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisionTextDualEncoderConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when return_loss is True) — Contrastive loss for image-text similarity.
logits_per_image (torch.FloatTensor of shape (image_batch_size, text_batch_size)) — The scaled dot product scores between image_embeds and text_embeds. This represents the image-text similarity scores.
logits_per_text (torch.FloatTensor of shape (text_batch_size, image_batch_size)) — The scaled dot product scores between text_embeds and image_embeds. This represents the text-image similarity scores.
text_embeds (torch.FloatTensor of shape (batch_size, output_dim)) — The text embeddings obtained by applying the projection layer to the pooled output of CLIPTextModel.
image_embeds (torch.FloatTensor of shape (batch_size, output_dim)) — The image embeddings obtained by applying the projection layer to the pooled output of CLIPVisionModel.
text_model_output (BaseModelOutputWithPooling) — The output of the CLIPTextModel.
vision_model_output (BaseModelOutputWithPooling) — The output of the CLIPVisionModel.
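To make the relationship between the outputs listed above more concrete, the following NumPy sketch computes CLIP-style similarity logits and a symmetric contrastive loss from two batches of already-projected embeddings. It is a simplified illustration under stated assumptions (random embeddings, a fixed logit scale of 2.6592, hypothetical function names), not the model's actual forward pass.

```python
import numpy as np


def log_softmax(x, axis):
    # numerically stable log-softmax
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_style_outputs(image_embeds, text_embeds, logit_scale=2.6592):
    """Hypothetical helper: similarity logits and symmetric contrastive loss."""
    # L2-normalize so that the dot product is a cosine similarity
    image_embeds = image_embeds / np.linalg.norm(image_embeds, axis=-1, keepdims=True)
    text_embeds = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    # scaled dot products; the image-side logits are the transpose of the text-side logits
    logits_per_text = np.exp(logit_scale) * text_embeds @ image_embeds.T
    logits_per_image = logits_per_text.T
    # matching (image, text) pairs sit on the diagonal
    idx = np.arange(logits_per_text.shape[0])
    text_loss = -log_softmax(logits_per_text, axis=1)[idx, idx].mean()
    image_loss = -log_softmax(logits_per_image, axis=1)[idx, idx].mean()
    return logits_per_image, logits_per_text, (text_loss + image_loss) / 2


rng = np.random.default_rng(0)
image_embeds = rng.normal(size=(2, 512))  # stand-in for projected image features
text_embeds = rng.normal(size=(2, 512))   # stand-in for projected text features
logits_per_image, logits_per_text, loss = clip_style_outputs(image_embeds, text_embeds)

# softmax over the text axis gives, for each image, a distribution over the candidate texts
probs = np.exp(log_softmax(logits_per_image, axis=1))
print(logits_per_image.shape, float(loss), probs.sum(axis=1))
```

Averaging the two cross-entropy terms treats image-to-text and text-to-image retrieval symmetrically, which is the design choice CLIP-style training relies on.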

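The VisionTextDualEncoderModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of calling this function directly, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example: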

>>> from PIL import Image
>>> import requests
>>> from transformers import (
...     VisionTextDualEncoderModel,
...     VisionTextDualEncoderProcessor,
...     AutoImageProcessor,
...     AutoTokenizer,
... )

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)
>>> model = VisionTextDualEncoderModel.from_vision_text_pretrained(
...     "google/vit-base-patch16-224", "bert-base-uncased"
... )

>>> # contrastive training
>>> urls = [
...     "http://images.cocodataset.org/val2017/000000039769.jpg",
...     "https://farm3.staticflickr.com/2674/5850229113_4fe05d5265_z.jpg",
... ]
>>> images = [Image.open(requests.get(url, stream=True).raw) for url in urls]
>>> inputs = processor(
...     text=["a photo of a cat", "a photo of a dog"], images=images, return_tensors="pt", padding=True
... )
>>> outputs = model(
...     input_ids=inputs.input_ids,
...     attention_mask=inputs.attention_mask,
...     pixel_values=inputs.pixel_values,
...     return_loss=True,
... )
>>> loss, logits_per_image = outputs.loss, outputs.logits_per_image  # this is the image-text similarity score

>>> # save and load from pretrained
>>> model.save_pretrained("vit-bert")
>>> model = VisionTextDualEncoderModel.from_pretrained("vit-bert")

>>> # inference
>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities

FlaxVisionTextDualEncoderModel

`class transformers.FlaxVisionTextDualEncoderModel`

< source >

( config: VisionTextDualEncoderConfig input_shape: Optional = None seed: int = 0 dtype: dtype = <class 'jax.numpy.float32'> _do_init: bool = True **kwargs )

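Parameters

config (VisionTextDualEncoderConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
dtype (jax.numpy.dtype, optional, defaults to jax.numpy.float32) — The data type of the computation. Can be one of jax.numpy.float32, jax.numpy.float16 (on GPUs) and jax.numpy.bfloat16 (on TPUs). This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified, all the computation will be performed with the given dtype. Note that this only specifies the dtype of the computation and does not influence the dtype of the model parameters. If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().

This class can be used to initialize a vision-text dual encoder model with any pretrained vision autoencoding model as the vision encoder and any pretrained text model as the text encoder. The vision and text encoders are loaded via the from_pretrained() method. The projection layers are automatically added to the model and should be fine-tuned on a downstream task, such as contrastive image-text modeling.

LiT: Zero-Shot Transfer with Locked-image Text Tuning shows how leveraging pretrained (locked/frozen) image and text models for contrastive learning yields a significant improvement on new zero-shot vision tasks such as image classification or retrieval.

After such a Vision-Text-Dual-Encoder model has been trained/fine-tuned, it can be saved/loaded just like any other model (see the examples for more information).

This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a flax.linen.Module subclass. Use it as a regular Flax Linen module and refer to the Flax documentation for all matters related to general usage and behavior.

Finally, this model supports inherent JAX features such as just-in-time (JIT) compilation, automatic differentiation, vectorization, and parallelization.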

`__call__`

<来源>

( input_ids pixel_values attention_mask = None position_ids = None token_type_ids = None params: dict = None dropout_rng: PRNGKey = None train: bool = False output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.models.clip.modeling_flax_clip.FlaxCLIPOutput or tuple(jnp.ndarray)

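Parameters

input_ids (numpy.ndarray of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. What are input IDs?
attention_mask (numpy.ndarray of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
position_ids (numpy.ndarray of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]. What are position IDs?
pixel_values (numpy.ndarray of shape (batch_size, num_channels, height, width)) — Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using an image processor (e.g. if you use ViT as the encoder, you should use AutoImageProcessor). See ViTImageProcessor.__call__() for details.
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.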

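transformers.models.clip.modeling_flax_clip.FlaxCLIPOutput or tuple(jnp.ndarray)

A transformers.models.clip.modeling_flax_clip.FlaxCLIPOutput or a tuple of jnp.ndarray (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisionTextDualEncoderConfig) and inputs.

logits_per_image (jnp.ndarray of shape (image_batch_size, text_batch_size)) — The scaled dot product scores between image_embeds and text_embeds. This represents the image-text similarity scores.
logits_per_text (jnp.ndarray of shape (text_batch_size, image_batch_size)) — The scaled dot product scores between text_embeds and image_embeds. This represents the text-image similarity scores.
text_embeds (jnp.ndarray of shape (batch_size, output_dim)) — The text embeddings obtained by applying the projection layer to the pooled output of FlaxCLIPTextModel.
image_embeds (jnp.ndarray of shape (batch_size, output_dim)) — The image embeddings obtained by applying the projection layer to the pooled output of FlaxCLIPVisionModel.
text_model_output (FlaxBaseModelOutputWithPooling) — The output of the FlaxCLIPTextModel.
vision_model_output (FlaxBaseModelOutputWithPooling) — The output of the FlaxCLIPVisionModel.

The FlaxVisionTextDualEncoderModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of calling this function directly, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example: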

>>> from PIL import Image
>>> import requests
>>> import jax
>>> from transformers import (
...     FlaxVisionTextDualEncoderModel,
...     VisionTextDualEncoderProcessor,
...     AutoImageProcessor,
...     AutoTokenizer,
... )

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)
>>> model = FlaxVisionTextDualEncoderModel.from_vision_text_pretrained(
...     "google/vit-base-patch16-224", "bert-base-uncased"
... )

>>> # contrastive training
>>> urls = [
...     "http://images.cocodataset.org/val2017/000000039769.jpg",
...     "https://farm3.staticflickr.com/2674/5850229113_4fe05d5265_z.jpg",
... ]
>>> images = [Image.open(requests.get(url, stream=True).raw) for url in urls]
>>> inputs = processor(
...     text=["a photo of a cat", "a photo of a dog"], images=images, return_tensors="np", padding=True
... )
>>> outputs = model(
...     input_ids=inputs.input_ids,
...     attention_mask=inputs.attention_mask,
...     pixel_values=inputs.pixel_values,
... )
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score

>>> # save and load from pretrained
>>> model.save_pretrained("vit-bert")
>>> model = FlaxVisionTextDualEncoderModel.from_pretrained("vit-bert")

>>> # inference
>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = jax.nn.softmax(logits_per_image, axis=1)  # we can take the softmax to get the label probabilities

TFVisionTextDualEncoderModel

`class transformers.TFVisionTextDualEncoderModel`

< source >

( config: Optional[VisionTextDualEncoderConfig] = None vision_model: Optional[TFPreTrainedModel] = None text_model: Optional[TFPreTrainedModel] = None )

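Parameters

config (VisionTextDualEncoderConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

This class can be used to initialize a vision-text dual encoder model with any pretrained vision autoencoding model as the vision encoder and any pretrained text model as the text encoder. The vision and text encoders are loaded via the from_pretrained() method. The projection layers are automatically added to the model and should be fine-tuned on a downstream task, such as contrastive image-text modeling.

LiT: Zero-Shot Transfer with Locked-image Text Tuning shows how leveraging pretrained (locked/frozen) image and text models for contrastive learning yields a significant improvement on new zero-shot vision tasks such as image classification or retrieval.

After such a Vision-Text-Dual-Encoder model has been trained/fine-tuned, it can be saved/loaded just like any other model (see the examples for more information).

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a Keras Model subclass. Use it as a regular Keras Model and refer to the TF documentation for all matters related to general usage and behavior.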

`__call__`

<来源>

( input_ids: tf.Tensor | None = None pixel_values: tf.Tensor | None = None attention_mask: tf.Tensor | None = None position_ids: tf.Tensor | None = None return_loss: Optional[bool] = None token_type_ids: tf.Tensor | None = None output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None return_dict: Optional[bool] = None training: bool = False ) → transformers.models.clip.modeling_tf_clip.TFCLIPOutput or tuple(tf.Tensor)

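Parameters

input_ids (tf.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. What are input IDs?
attention_mask (tf.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
position_ids (tf.Tensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]. What are position IDs?
pixel_values (tf.Tensor of shape (batch_size, num_channels, height, width)) — Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using an image processor (e.g. if you use ViT as the encoder, you should use AutoImageProcessor). See ViTImageProcessor.__call__() for details.
return_loss (bool, optional) — Whether or not to return the contrastive loss.
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.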

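transformers.models.clip.modeling_tf_clip.TFCLIPOutput or tuple(tf.Tensor)

A transformers.models.clip.modeling_tf_clip.TFCLIPOutput or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisionTextDualEncoderConfig) and inputs.

loss (tf.Tensor of shape (1,), optional, returned when return_loss is True) — Contrastive loss for image-text similarity.
logits_per_image (tf.Tensor of shape (image_batch_size, text_batch_size)) — The scaled dot product scores between image_embeds and text_embeds. This represents the image-text similarity scores.
logits_per_text (tf.Tensor of shape (text_batch_size, image_batch_size)) — The scaled dot product scores between text_embeds and image_embeds. This represents the text-image similarity scores.
text_embeds (tf.Tensor of shape (batch_size, output_dim)) — The text embeddings obtained by applying the projection layer to the pooled output of TFCLIPTextModel.
image_embeds (tf.Tensor of shape (batch_size, output_dim)) — The image embeddings obtained by applying the projection layer to the pooled output of TFCLIPVisionModel.
text_model_output (TFBaseModelOutputWithPooling) — The output of the TFCLIPTextModel.
vision_model_output (TFBaseModelOutputWithPooling) — The output of the TFCLIPVisionModel.

The TFVisionTextDualEncoderModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of calling this function directly, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example: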

>>> from PIL import Image
>>> import requests
>>> import tensorflow as tf
>>> from transformers import (
...     TFVisionTextDualEncoderModel,
...     VisionTextDualEncoderProcessor,
...     AutoImageProcessor,
...     AutoTokenizer,
... )

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)
>>> model = TFVisionTextDualEncoderModel.from_vision_text_pretrained(
...     "google/vit-base-patch16-224", "bert-base-uncased"
... )

>>> # contrastive training
>>> urls = [
...     "http://images.cocodataset.org/val2017/000000039769.jpg",
...     "https://farm3.staticflickr.com/2674/5850229113_4fe05d5265_z.jpg",
... ]
>>> images = [Image.open(requests.get(url, stream=True).raw) for url in urls]
>>> inputs = processor(
...     text=["a photo of a cat", "a photo of a dog"], images=images, return_tensors="np", padding=True
... )
>>> outputs = model(
...     input_ids=inputs.input_ids,
...     attention_mask=inputs.attention_mask,
...     pixel_values=inputs.pixel_values,
...     return_loss=True,
... )
>>> loss, logits_per_image = outputs.loss, outputs.logits_per_image  # this is the image-text similarity score

>>> # save and load from pretrained
>>> model.save_pretrained("vit-bert")
>>> model = TFVisionTextDualEncoderModel.from_pretrained("vit-bert")

>>> # inference
>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = tf.nn.softmax(logits_per_image, axis=1)  # we can take the softmax to get the label probabilities

VisualBERT

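Original text: huggingface.co/docs/transformers/v4.37.2/en/model_doc/visual_bert

Overview

The VisualBERT model was proposed in VisualBERT: A Simple and Performant Baseline for Vision and Language by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang. VisualBERT is a neural network trained on a variety of (image, text) pairs.

The abstract from the paper is the following:

We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks. VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention. We further propose two visually-grounded language model objectives for pre-training VisualBERT on image caption data. Experiments on four vision-and-language tasks including VQA, VCR, NLVR2, and Flickr30K show that VisualBERT outperforms or rivals state-of-the-art models while being significantly simpler. Further analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.

This model was contributed by gchhablani. The original code can be found here.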

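Usage tips

Most of the checkpoints provided work with the VisualBertForPreTraining configuration. The other checkpoints provided are fine-tuned checkpoints for downstream tasks: VQA ('visualbert-vqa'), VCR ('visualbert-vcr'), and NLVR2 ('visualbert-nlvr2'). Hence, if you are not working on these downstream tasks, it is recommended that you use the pretrained checkpoints.
For the VCR task, the authors use a fine-tuned detector to generate the visual embeddings for all the checkpoints. The detector and its weights are not provided as part of the package, but they will be available in the research projects, and the states can be loaded directly into the detector provided.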

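VisualBERT is a multi-modal vision and language model. It can be used for visual question answering, multiple choice, visual reasoning and region-to-phrase correspondence tasks. VisualBERT uses a BERT-like transformer to prepare embeddings for image-text pairs. Both the text and visual features are then projected to a latent space with identical dimension.

To feed images to the model, each image is passed through a pretrained object detector, and the regions and bounding boxes are extracted. The authors use the features generated after passing these regions through a pretrained CNN such as ResNet as visual embeddings. They also add absolute position embeddings, and feed the resulting sequence of vectors to a standard BERT model. In the embedding layer, the text input is concatenated in front of the visual embeddings, and is expected to be bounded by [CLS] and [SEP] tokens, as in BERT. The segment IDs must also be set appropriately for the textual and visual parts.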
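The concatenation described above can be pictured with plain arrays. The NumPy sketch below only illustrates the shapes involved: text embeddings first, visual embeddings after, with segment IDs of 0 for text and 1 for the visual part (matching the torch.ones visual_token_type_ids used in the example further down). All sizes and variable names are illustrative assumptions, and the visual embeddings are assumed to have already been projected to the hidden size.

```python
import numpy as np

# illustrative sizes, not taken from any checkpoint
batch_size, text_len, num_regions, hidden_size = 1, 8, 36, 768

rng = np.random.default_rng(0)
text_embeds = rng.normal(size=(batch_size, text_len, hidden_size))
# assumption: region features already projected to hidden_size
visual_embeds = rng.normal(size=(batch_size, num_regions, hidden_size))

# the text sequence comes first, the visual sequence is appended along the sequence axis
joint_embeds = np.concatenate([text_embeds, visual_embeds], axis=1)

# segment IDs: 0 for the textual part, 1 for the visual part
token_type_ids = np.zeros((batch_size, text_len), dtype=np.int64)
visual_token_type_ids = np.ones((batch_size, num_regions), dtype=np.int64)
joint_token_type_ids = np.concatenate([token_type_ids, visual_token_type_ids], axis=1)

# attention masks are concatenated the same way
attention_mask = np.ones((batch_size, text_len), dtype=np.int64)
visual_attention_mask = np.ones((batch_size, num_regions), dtype=np.int64)
joint_attention_mask = np.concatenate([attention_mask, visual_attention_mask], axis=1)

print(joint_embeds.shape, joint_token_type_ids.shape, joint_attention_mask.shape)
```

The resulting sequence length is the sum of the text length and the number of detected regions, which is what the self-attention layers then operate on.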

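The BertTokenizer is used to encode the text. A custom detector/image processor must be used to get the visual embeddings. The following example notebooks show how to use VisualBERT with Detectron-like models:

VisualBERT VQA demo notebook: this notebook contains an example of VisualBERT VQA.
Generate Embeddings for VisualBERT (Colab notebook): this notebook contains an example of how to generate visual embeddings.

The following example shows how to get the last hidden state using VisualBertModel: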

>>> import torch
>>> from transformers import BertTokenizer, VisualBertModel

>>> model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

>>> inputs = tokenizer("What is the man eating?", return_tensors="pt")
>>> # this is a custom function that returns the visual embeddings given the image path
>>> visual_embeds = get_visual_embeddings(image_path)

>>> visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
>>> visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
>>> inputs.update(
...     {
...         "visual_embeds": visual_embeds,
...         "visual_token_type_ids": visual_token_type_ids,
...         "visual_attention_mask": visual_attention_mask,
...     }
... )
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state

VisualBertConfig

`class transformers.VisualBertConfig`

< source >

( vocab_size = 30522 hidden_size = 768 visual_embedding_dim = 512 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout_prob = 0.1 attention_probs_dropout_prob = 0.1 max_position_embeddings = 512 type_vocab_size = 2 initializer_range = 0.02 layer_norm_eps = 1e-12 bypass_transformer = False special_visual_initialize = True pad_token_id = 1 bos_token_id = 0 eos_token_id = 2 **kwargs )

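Parameters

vocab_size (int, optional, defaults to 30522) - Vocabulary size of the VisualBERT model. Defines the number of different tokens that can be represented by the input_ids passed to the forward method of VisualBertModel.
hidden_size (int, optional, defaults to 768) - Dimensionality of the encoder layers and the pooler layer.
visual_embedding_dim (int, optional, defaults to 512) - Dimensionality of the visual embeddings to be passed to the model.
num_hidden_layers (int, optional, defaults to 12) - Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12) - Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, optional, defaults to 3072) - Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (str or function, optional, defaults to "gelu") - The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu", "selu" and "gelu_new" are supported.
hidden_dropout_prob (float, optional, defaults to 0.1) - The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (float, optional, defaults to 0.1) - The dropout ratio for the attention probabilities.
max_position_embeddings (int, optional, defaults to 512) - The maximum sequence length that this model might ever be used with. Typically set to something large (e.g., 512, 1024 or 2048).
type_vocab_size (int, optional, defaults to 2) - The vocabulary size of the token_type_ids passed when calling VisualBertModel.
initializer_range (float, optional, defaults to 0.02) - The standard deviation of the truncated normal initializer used to initialize all weight matrices.
layer_norm_eps (float, optional, defaults to 1e-12) - The epsilon used by the layer normalization layers.
bypass_transformer (bool, optional, defaults to False) - Whether or not the model should bypass the Transformer for the visual embeddings. If set to True, the model directly concatenates the visual embeddings from VisualBertEmbeddings with the text output from the Transformer, and then passes the result to a self-attention layer.
special_visual_initialize (bool, optional, defaults to True) - Whether or not the visual token type and position type embedding weights should be initialized the same way as the textual token type and position type embeddings. When set to True, the weights of the textual token type and position type embeddings are copied to the respective visual embedding layers.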
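
As a quick sanity check on how these defaults relate to each other, the following sketch uses a plain dictionary (not the real VisualBertConfig class) to derive the per-head dimension. It assumes, as in BERT-style self-attention, that hidden_size is split evenly across num_attention_heads; the helper name per_head_dim is hypothetical.

```python
# Plain-Python sketch; the dict mirrors a few of the defaults listed above.
DEFAULTS = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "visual_embedding_dim": 512,
}


def per_head_dim(hidden_size: int, num_attention_heads: int) -> int:
    """Return the size of each attention head (hypothetical helper)."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be a multiple of num_attention_heads")
    return hidden_size // num_attention_heads


head_dim = per_head_dim(DEFAULTS["hidden_size"], DEFAULTS["num_attention_heads"])
print(head_dim)  # 64
```

With the defaults, each of the 12 heads works on a 64-dimensional slice of the 768-dimensional hidden state.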

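This is the configuration class to store the configuration of a VisualBertModel. It is used to instantiate a VisualBERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults yields a configuration similar to that of the VisualBERT uclanlp/visualbert-vqa-coco-pre architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation of PretrainedConfig for more information.

Example: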

>>> from transformers import VisualBertConfig, VisualBertModel

>>> # Initializing a VisualBERT visualbert-vqa-coco-pre style configuration
>>> configuration = VisualBertConfig.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

>>> # Initializing a model (with random weights) from the visualbert-vqa-coco-pre style configuration
>>> model = VisualBertModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

VisualBertModel

`class transformers.VisualBertModel`

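< source >

( config add_pooling_layer = True )

Parameters

config (VisualBertConfig) - Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare VisualBert Model transformer outputting raw hidden states without any specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

The model can behave as an encoder (with only self-attention), following the architecture described in Attention Is All You Need by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.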

`forward`

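< source >

( input_ids: Optional = None attention_mask: Optional = None token_type_ids: Optional = None position_ids: Optional = None head_mask: Optional = None inputs_embeds: Optional = None visual_embeds: Optional = None visual_attention_mask: Optional = None visual_token_type_ids: Optional = None image_text_alignment: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) - Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. What are input IDs?
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) - Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) - Segment token indices to indicate the first and second portions of the inputs. Indices are selected in [0, 1]:
- 0 corresponds to a sentence A token,
- 1 corresponds to a sentence B token.
What are token type IDs?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) - Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]. What are position IDs?
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) - Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) - Optionally, instead of passing input_ids you can directly pass an embedded representation. This is useful if you want more control over how input_ids indices are converted into associated vectors than the model's internal embedding lookup matrix provides.
visual_embeds (torch.FloatTensor of shape (batch_size, visual_seq_length, visual_embedding_dim), optional) - The embedded representation of the visual inputs, generally derived using an object detector.
visual_attention_mask (torch.FloatTensor of shape (batch_size, visual_seq_length), optional) - Mask to avoid performing attention on visual embeddings. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?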
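visual_token_type_ids (torch.LongTensor of shape (batch_size, visual_seq_length), optional) - Segment token indices to indicate different portions of the visual embeddings. What are token type IDs? The authors of VisualBERT set visual_token_type_ids to 1 for all tokens.
image_text_alignment (torch.LongTensor of shape (batch_size, visual_seq_length, alignment_number), optional) - Image-text alignment used to decide the position IDs of the visual embeddings.
output_attentions (bool, optional) - Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) - Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) - Whether or not to return a ModelOutput instead of a plain tuple.

transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisualBertConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) - Sequence of hidden states at the output of the last layer of the model.
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) - Last-layer hidden state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. For example, for BERT-family models this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) - Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) - Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.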
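
The shapes above can be tied together with a small bookkeeping sketch. It assumes, consistent with the labels of length input_ids.shape[-1] + visual_embeds.shape[-2] used in the pretraining example on this page, that the output sequence dimension covers the text tokens followed by the visual tokens. The function name expected_shapes is hypothetical and nothing here calls the library.

```python
def expected_shapes(batch_size, text_len, visual_len,
                    hidden_size=768, num_hidden_layers=12, num_heads=12):
    """Return the tensor shapes one would expect around VisualBertModel (sketch)."""
    # Assumption: the encoder sees text tokens followed by visual tokens.
    total_len = text_len + visual_len
    return {
        "visual_attention_mask": (batch_size, visual_len),
        "visual_token_type_ids": (batch_size, visual_len),
        "last_hidden_state": (batch_size, total_len, hidden_size),
        "pooler_output": (batch_size, hidden_size),
        # One entry for the embeddings plus one per layer.
        "num_hidden_states": num_hidden_layers + 1,
        "attentions_per_layer": (batch_size, num_heads, total_len, total_len),
    }


shapes = expected_shapes(batch_size=1, text_len=9, visual_len=36)
print(shapes["last_hidden_state"])  # (1, 45, 768)
```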

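The VisualBertModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example: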

# Assumption: *get_visual_embeddings(image)* gets the visual embeddings of the image.
from transformers import AutoTokenizer, VisualBertModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

inputs = tokenizer("The capital of France is Paris.", return_tensors="pt")
visual_embeds = get_visual_embeddings(image).unsqueeze(0)
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)

inputs.update(
    {
        "visual_embeds": visual_embeds,
        "visual_token_type_ids": visual_token_type_ids,
        "visual_attention_mask": visual_attention_mask,
    }
)

outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state

VisualBertForPreTraining

`class transformers.VisualBertForPreTraining`

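< source >

( config )

Parameters

config (VisualBertConfig) - Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

VisualBert Model with two heads on top, as used during pretraining: a masked language modeling head and a sentence-image prediction (classification) head.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.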

`forward`

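< source >

( input_ids: Optional = None attention_mask: Optional = None token_type_ids: Optional = None position_ids: Optional = None head_mask: Optional = None inputs_embeds: Optional = None visual_embeds: Optional = None visual_attention_mask: Optional = None visual_token_type_ids: Optional = None image_text_alignment: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None labels: Optional = None sentence_image_labels: Optional = None ) → transformers.models.visual_bert.modeling_visual_bert.VisualBertForPreTrainingOutput or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) - Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. What are input IDs?
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) - Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) - Segment token indices to indicate the first and second portions of the inputs. Indices are selected in [0, 1]:
- 0 corresponds to a sentence A token,
- 1 corresponds to a sentence B token.
What are token type IDs?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) - Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]. What are position IDs?
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) - Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) - Optionally, instead of passing input_ids you can directly pass an embedded representation. This is useful if you want more control over how input_ids indices are converted into associated vectors than the model's internal embedding lookup matrix provides.
visual_embeds (torch.FloatTensor of shape (batch_size, visual_seq_length, visual_embedding_dim), optional) - The embedded representation of the visual inputs, generally derived using an object detector.
visual_attention_mask (torch.FloatTensor of shape (batch_size, visual_seq_length), optional) - Mask to avoid performing attention on visual embeddings. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
visual_token_type_ids (torch.LongTensor of shape (batch_size, visual_seq_length), optional) - Segment token indices to indicate different portions of the visual embeddings. What are token type IDs? The authors of VisualBERT set visual_token_type_ids to 1 for all tokens.
image_text_alignment (torch.LongTensor of shape (batch_size, visual_seq_length, alignment_number), optional) - Image-text alignment used to decide the position IDs of the visual embeddings.
output_attentions (bool, optional) - Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) - Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) - Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size, total_sequence_length), optional) - Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for tokens with labels in [0, ..., config.vocab_size].
sentence_image_labels (torch.LongTensor of shape (batch_size,), optional) - Labels for computing the sentence-image prediction (classification) loss. The input should be a sequence pair (see the input_ids docstring). Indices should be in [0, 1]:
- 0 indicates sequence B is a matching pair of sequence A for the given image,
- 1 indicates sequence B is a random sequence with respect to A for the given image.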
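
A minimal sketch of how labels with the ignore index might be assembled, using plain Python lists rather than tensors. It assumes the masked language modeling loss should only be computed at positions holding the mask token and that the visual positions carry no label; the token IDs and the helper name build_mlm_labels are illustrative and not taken from the library.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def build_mlm_labels(masked_ids, original_ids, mask_token_id, visual_len):
    """Keep the original ID only where the input was masked; ignore the rest."""
    if len(masked_ids) != len(original_ids):
        raise ValueError("masked and original sequences must have equal length")
    text_labels = [
        orig if masked == mask_token_id else IGNORE_INDEX
        for masked, orig in zip(masked_ids, original_ids)
    ]
    # Pad to total_sequence_length = text length + visual length.
    return text_labels + [IGNORE_INDEX] * visual_len


# Illustrative IDs only: 103 stands in for the [MASK] token.
masked = [101, 1996, 3007, 103, 102]
original = [101, 1996, 3007, 3000, 102]
labels = build_mlm_labels(masked, original, mask_token_id=103, visual_len=3)
print(labels)  # [-100, -100, -100, 3000, -100, -100, -100, -100]
```

The resulting list has text length plus visual length entries, matching the total_sequence_length dimension documented for labels.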

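transformers.models.visual_bert.modeling_visual_bert.VisualBertForPreTrainingOutput or tuple(torch.FloatTensor)

A transformers.models.visual_bert.modeling_visual_bert.VisualBertForPreTrainingOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisualBertConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) - Total loss as the sum of the masked language modeling loss and the sentence-image prediction (classification) loss.
prediction_logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) - Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
seq_relationship_logits (torch.FloatTensor of shape (batch_size, 2)) - Prediction scores of the sentence-image prediction (classification) head (scores of True/False continuation before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) - Tuple of torch.FloatTensor (one for the output of the embeddings plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) - Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The VisualBertForPreTraining forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example: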

# Assumption: *get_visual_embeddings(image)* gets the visual embeddings of the image in the batch.
from transformers import AutoTokenizer, VisualBertForPreTraining
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForPreTraining.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
visual_embeds = get_visual_embeddings(image).unsqueeze(0)
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)

inputs.update(
    {
        "visual_embeds": visual_embeds,
        "visual_token_type_ids": visual_token_type_ids,
        "visual_attention_mask": visual_attention_mask,
    }
)
max_length = inputs["input_ids"].shape[-1] + visual_embeds.shape[-2]
labels = tokenizer(
    "The capital of France is Paris.", return_tensors="pt", padding="max_length", max_length=max_length
)["input_ids"]
sentence_image_labels = torch.tensor(1).unsqueeze(0)  # Batch_size

outputs = model(**inputs, labels=labels, sentence_image_labels=sentence_image_labels)
loss = outputs.loss
prediction_logits = outputs.prediction_logits
seq_relationship_logits = outputs.seq_relationship_logits

VisualBertForQuestionAnswering

`class transformers.VisualBertForQuestionAnswering`

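< source >

( config )

Parameters

config (VisualBertConfig) - Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

VisualBert Model with a classification/regression head on top (a dropout and a linear layer on top of the pooled output) for VQA.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.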

`forward`

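< source >

( input_ids: Optional = None attention_mask: Optional = None token_type_ids: Optional = None position_ids: Optional = None head_mask: Optional = None inputs_embeds: Optional = None visual_embeds: Optional = None visual_attention_mask: Optional = None visual_token_type_ids: Optional = None image_text_alignment: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None labels: Optional = None ) → transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) - Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. What are input IDs?
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) - Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) - Segment token indices to indicate the first and second portions of the inputs. Indices are selected in [0, 1]:
- 0 corresponds to a sentence A token,
- 1 corresponds to a sentence B token.
What are token type IDs?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) - Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]. What are position IDs?
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) - Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) - Optionally, instead of passing input_ids you can directly pass an embedded representation. This is useful if you want more control over how input_ids indices are converted into associated vectors than the model's internal embedding lookup matrix provides.
visual_embeds (torch.FloatTensor of shape (batch_size, visual_seq_length, visual_embedding_dim), optional) - The embedded representation of the visual inputs, generally derived using an object detector.
visual_attention_mask (torch.FloatTensor of shape (batch_size, visual_seq_length), optional) - Mask to avoid performing attention on visual embeddings. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
visual_token_type_ids (torch.LongTensor of shape (batch_size, visual_seq_length), optional) - Segment token indices to indicate different portions of the visual embeddings. What are token type IDs? The authors of VisualBERT set visual_token_type_ids to 1 for all tokens.
image_text_alignment (torch.LongTensor of shape (batch_size, visual_seq_length, alignment_number), optional) - Image-text alignment used to decide the position IDs of the visual embeddings.
output_attentions (bool, optional) - Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) - Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) - Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size, total_sequence_length), optional) - Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. A KLDivLoss is computed between the labels and the returned logits.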
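
The KLDivLoss mentioned for labels can be illustrated without tensors. The sketch below computes the KL divergence between a soft target distribution and the softmax of a logits vector for a single example. It assumes the targets form a probability distribution over answers, and the function names are hypothetical; the library's exact reduction over the batch is not reproduced here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def kl_divergence(target_probs, logits):
    """KL(target || softmax(logits)); zero-probability targets contribute 0."""
    log_probs = log_softmax(logits)
    return sum(
        t * (math.log(t) - lp)
        for t, lp in zip(target_probs, log_probs)
        if t > 0.0
    )


loss_match = kl_divergence([0.5, 0.5], [2.0, 2.0])  # prediction equals the target
loss_off = kl_divergence([0.0, 1.0], [3.0, 0.0])    # prediction favors the wrong answer
print(loss_match, loss_off)
```

The loss is close to zero when the predicted distribution matches the target and grows as the prediction moves away from it.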

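Returns

transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.SequenceClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisualBertConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) - Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) - Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) - Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) - Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The VisualBertForQuestionAnswering forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example: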

# Assumption: *get_visual_embeddings(image)* gets the visual embeddings of the image in the batch.
from transformers import AutoTokenizer, VisualBertForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForQuestionAnswering.from_pretrained("uclanlp/visualbert-vqa")

text = "Who is eating the apple?"
inputs = tokenizer(text, return_tensors="pt")
visual_embeds = get_visual_embeddings(image).unsqueeze(0)
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)

inputs.update(
    {
        "visual_embeds": visual_embeds,
        "visual_token_type_ids": visual_token_type_ids,
        "visual_attention_mask": visual_attention_mask,
    }
)

labels = torch.tensor([[0.0, 1.0]]).unsqueeze(0)  # Batch size 1, Num labels 2

outputs = model(**inputs, labels=labels)
loss = outputs.loss
scores = outputs.logits

VisualBertForMultipleChoice

`class transformers.VisualBertForMultipleChoice`

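< source >

( config )

Parameters

config (VisualBertConfig) - Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

VisualBert Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a softmax), e.g. for VCR tasks.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.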

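`forward`

< source >

( input_ids: Optional = None attention_mask: Optional = None token_type_ids: Optional = None position_ids: Optional = None head_mask: Optional = None inputs_embeds: Optional = None visual_embeds: Optional = None visual_attention_mask: Optional = None visual_token_type_ids: Optional = None image_text_alignment: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None labels: Optional = None ) → transformers.modeling_outputs.MultipleChoiceModelOutput or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape (batch_size, num_choices, sequence_length)) - Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. What are input IDs?
attention_mask (torch.FloatTensor of shape (batch_size, num_choices, sequence_length), optional) - Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
token_type_ids (torch.LongTensor of shape (batch_size, num_choices, sequence_length), optional) - Segment token indices to indicate the first and second portions of the inputs. Indices are selected in [0, 1]:
- 0 corresponds to a sentence A token,
- 1 corresponds to a sentence B token.
What are token type IDs?
position_ids (torch.LongTensor of shape (batch_size, num_choices, sequence_length), optional) - Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]. What are position IDs?
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) - Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, num_choices, sequence_length, hidden_size), optional) - Optionally, instead of passing input_ids you can directly pass an embedded representation. This is useful if you want more control over how input_ids indices are converted into associated vectors than the model's internal embedding lookup matrix provides.
visual_embeds (torch.FloatTensor of shape (batch_size, visual_seq_length, visual_embedding_dim), optional) - The embedded representation of the visual inputs, generally derived using an object detector.
visual_attention_mask (torch.FloatTensor of shape (batch_size, visual_seq_length), optional) - Mask to avoid performing attention on visual embeddings. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
visual_token_type_ids (torch.LongTensor of shape (batch_size, visual_seq_length), optional) - Segment token indices to indicate different portions of the visual embeddings. What are token type IDs? The authors of VisualBERT set visual_token_type_ids to 1 for all tokens.
image_text_alignment (torch.LongTensor of shape (batch_size, visual_seq_length, alignment_number), optional) - Image-text alignment used to decide the position IDs of the visual embeddings.
output_attentions (bool, optional) - Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) - Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) - Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size,), optional) - Labels for computing the multiple choice classification loss. Indices should be in [0, ..., num_choices-1], where num_choices is the size of the second dimension of the input tensors (see input_ids above).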
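
Multiple-choice inputs carry an extra num_choices dimension. A common way to handle this (stated here as an assumption about the general pattern, not as a description of this model's internals) is to fold the choices into the batch dimension before the encoder and to regroup the per-choice scores afterwards. The sketch uses nested lists and hypothetical helper names.

```python
def flatten_choices(batch):
    """(batch_size, num_choices, seq_len) -> (batch_size * num_choices, seq_len)."""
    return [sequence for example in batch for sequence in example]


def unflatten_scores(scores, num_choices):
    """(batch_size * num_choices,) -> (batch_size, num_choices)."""
    if len(scores) % num_choices != 0:
        raise ValueError("number of scores must be a multiple of num_choices")
    return [scores[i:i + num_choices] for i in range(0, len(scores), num_choices)]


# One example with two choices, each a sequence of three token IDs.
input_ids = [[[101, 7, 102], [101, 8, 102]]]
flat = flatten_choices(input_ids)
print(len(flat), len(flat[0]))  # 2 3

# One score per flattened sequence, regrouped per example.
logits = unflatten_scores([0.3, 1.2], num_choices=2)
print(logits)  # [[0.3, 1.2]]
```

A label in [0, ..., num_choices-1] then indexes into the regrouped row for its example.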

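transformers.modeling_outputs.MultipleChoiceModelOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.MultipleChoiceModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisualBertConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) - Classification loss.
logits (torch.FloatTensor of shape (batch_size, num_choices)) - Classification scores (before SoftMax). num_choices is the second dimension of the input tensors (see input_ids above).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) - Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) - Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The VisualBertForMultipleChoice forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example: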

# Assumption: *get_visual_embeddings(image)* gets the visual embeddings of the image in the batch.
from transformers import AutoTokenizer, VisualBertForMultipleChoice
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForMultipleChoice.from_pretrained("uclanlp/visualbert-vcr")

prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
choice0 = "It is eaten with a fork and a knife."
choice1 = "It is eaten while held in the hand."

visual_embeds = get_visual_embeddings(image)
# (batch_size, num_choices, visual_seq_length, visual_embedding_dim)
visual_embeds = visual_embeds.expand(1, 2, *visual_embeds.shape)
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)

labels = torch.tensor(0).unsqueeze(0)  # choice0 is correct (according to Wikipedia ;)), batch size 1

encoding = tokenizer([[prompt, prompt], [choice0, choice1]], return_tensors="pt", padding=True)
# batch size is 1
inputs_dict = {k: v.unsqueeze(0) for k, v in encoding.items()}
inputs_dict.update(
    {
        "visual_embeds": visual_embeds,
        "visual_attention_mask": visual_attention_mask,
        "visual_token_type_ids": visual_token_type_ids,
        "labels": labels,
    }
)
outputs = model(**inputs_dict)

loss = outputs.loss
logits = outputs.logits
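
The logits returned above are pre-softmax scores with one entry per choice. As a sketch of the post-processing step, assuming the scores have already been converted to a plain list (for example with logits[0].tolist()), the helper below turns them into probabilities and picks the most likely choice. The values and helper names are made up for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_choice(scores):
    """Return the index of the highest-probability choice and all probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


best, probs = pick_choice([1.5, -0.5])
print(best)  # 0
```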

VisualBertForVisualReasoning

`class transformers.VisualBertForVisualReasoning`

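< source >

( config )

Parameters

config (VisualBertConfig) - Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

VisualBert Model with a sequence classification head on top (a dropout and a linear layer on top of the pooled output) for visual reasoning, e.g. for the NLVR task.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.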

`forward`

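< source >

( input_ids: Optional = None attention_mask: Optional = None token_type_ids: Optional = None position_ids: Optional = None head_mask: Optional = None inputs_embeds: Optional = None visual_embeds: Optional = None visual_attention_mask: Optional = None visual_token_type_ids: Optional = None image_text_alignment: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None labels: Optional = None ) → transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) - Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. What are input IDs?
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) - Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) - Segment token indices to indicate the first and second portions of the inputs. Indices are selected in [0, 1]:
- 0 corresponds to a sentence A token,
- 1 corresponds to a sentence B token.
What are token type IDs?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) - Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]. What are position IDs?
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) - Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) - Optionally, instead of passing input_ids you can directly pass an embedded representation. This is useful if you want more control over how input_ids indices are converted into associated vectors than the model's internal embedding lookup matrix provides.
visual_embeds (torch.FloatTensor of shape (batch_size, visual_seq_length, visual_embedding_dim), optional) - The embedded representation of the visual inputs, generally derived using an object detector.
visual_attention_mask (torch.FloatTensor of shape (batch_size, visual_seq_length), optional) - Mask to avoid performing attention on visual embeddings. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
visual_token_type_ids (torch.LongTensor of shape (batch_size, visual_seq_length), optional) - Segment token indices to indicate different portions of the visual embeddings. What are token type IDs? The authors of VisualBERT set visual_token_type_ids to 1 for all tokens.
image_text_alignment (torch.LongTensor of shape (batch_size, visual_seq_length, alignment_number), optional) - Image-text alignment used to decide the position IDs of the visual embeddings.
output_attentions (bool, optional) - Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) - Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) - Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size,), optional) - Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. A classification loss (cross-entropy) is computed against these labels.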
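
To make the cross-entropy on labels concrete, here is a plain-Python sketch for a single example: the loss is the negative log-probability that the softmax assigns to the correct class. The logits are invented and the function name is hypothetical; batching and averaging are left out.

```python
import math


def cross_entropy(logits, label):
    """Negative log-softmax of the logit at index `label`."""
    if not 0 <= label < len(logits):
        raise ValueError("label must be in [0, num_labels - 1]")
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


confident_right = cross_entropy([0.0, 4.0], label=1)
confident_wrong = cross_entropy([0.0, 4.0], label=0)
print(confident_right < confident_wrong)  # True
```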

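transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.SequenceClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisualBertConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) - Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) - Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) - Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) - Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The VisualBertForVisualReasoning forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example: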

# Assumption: *get_visual_embeddings(image)* gets the visual embeddings of the image in the batch.
from transformers import AutoTokenizer, VisualBertForVisualReasoning
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForVisualReasoning.from_pretrained("uclanlp/visualbert-nlvr2")

text = "Who is eating the apple?"
inputs = tokenizer(text, return_tensors="pt")
visual_embeds = get_visual_embeddings(image).unsqueeze(0)
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)

inputs.update(
    {
        "visual_embeds": visual_embeds,
        "visual_token_type_ids": visual_token_type_ids,
        "visual_attention_mask": visual_attention_mask,
    }
)

labels = torch.tensor(1).unsqueeze(0)  # Batch size 1, Num choices 2

outputs = model(**inputs, labels=labels)
loss = outputs.loss
scores = outputs.logits

VisualBertForRegionToPhraseAlignment

`class transformers.VisualBertForRegionToPhraseAlignment`

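< source >

( config )

Parameters

config (VisualBertConfig) - Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

VisualBert Model with a masked language modeling head and an attention layer on top for region-to-phrase alignment, e.g. for the Flickr30 Entities task.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.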

`forward`

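< source >

( input_ids: Optional = None attention_mask: Optional = None token_type_ids: Optional = None position_ids: Optional = None head_mask: Optional = None inputs_embeds: Optional = None visual_embeds: Optional = None visual_attention_mask: Optional = None visual_token_type_ids: Optional = None image_text_alignment: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None region_to_phrase_position: Optional = None labels: Optional = None ) → transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) - Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. What are input IDs?
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) - Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) - Segment token indices to indicate the first and second portions of the inputs. Indices are selected in [0, 1]:
- 0 corresponds to a sentence A token,
- 1 corresponds to a sentence B token.
What are token type IDs?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) - Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]. What are position IDs?
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) - Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) - Optionally, instead of passing input_ids you can directly pass an embedded representation. This is useful if you want more control over how input_ids indices are converted into associated vectors than the model's internal embedding lookup matrix provides.
visual_embeds (torch.FloatTensor of shape (batch_size, visual_seq_length, visual_embedding_dim), optional) - The embedded representation of the visual inputs, generally derived using an object detector.
visual_attention_mask (torch.FloatTensor of shape (batch_size, visual_seq_length), optional) - Mask to avoid performing attention on visual embeddings. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
visual_token_type_ids (torch.LongTensor of shape (batch_size, visual_seq_length), optional) - Segment token indices to indicate different portions of the visual embeddings. What are token type IDs? The authors of VisualBERT set visual_token_type_ids to 1 for all tokens.
image_text_alignment (torch.LongTensor of shape (batch_size, visual_seq_length, alignment_number), optional) - Image-text alignment used to decide the position IDs of the visual embeddings.
output_attentions (bool, optional) - Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) - Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) - Whether or not to return a ModelOutput instead of a plain tuple.
region_to_phrase_position (torch.LongTensor of shape (batch_size, total_sequence_length), optional) - The positions of the image embeddings that correspond to the textual tokens.
labels (torch.LongTensor of shape (batch_size, total_sequence_length, visual_sequence_length), optional) - Labels for computing the masked language modeling loss. A KLDivLoss is computed against these labels and the outputs from the attention layer.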
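
The shapes of region_to_phrase_position and labels are easy to mix up. The sketch below builds placeholder values as nested lists with the documented dimensions, where total_sequence_length is taken as text length plus visual length (as in the example for this class). The uniform target distribution over regions is an illustrative assumption, not a recommended labeling scheme, and the helper name is hypothetical.

```python
def make_alignment_inputs(batch_size, text_len, visual_len):
    """Build placeholder inputs with the documented shapes (sketch)."""
    total_len = text_len + visual_len
    # (batch_size, total_sequence_length): placeholder position per token
    region_to_phrase_position = [[0] * total_len for _ in range(batch_size)]
    # (batch_size, total_sequence_length, visual_sequence_length):
    # a uniform distribution over regions for every token (illustrative only)
    uniform = 1.0 / visual_len
    labels = [
        [[uniform] * visual_len for _ in range(total_len)]
        for _ in range(batch_size)
    ]
    return region_to_phrase_position, labels


positions, labels = make_alignment_inputs(batch_size=1, text_len=8, visual_len=4)
print(len(positions[0]), len(labels[0]), len(labels[0][0]))  # 12 12 4
```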

transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.SequenceClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisualBertConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) - Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) - Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) - Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) - Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The VisualBertForRegionToPhraseAlignment forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

# Assumption: *get_visual_embeddings(image)* gets the visual embeddings of the image in the batch.
from transformers import AutoTokenizer, VisualBertForRegionToPhraseAlignment
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForRegionToPhraseAlignment.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

text = "Who is eating the apple?"
inputs = tokenizer(text, return_tensors="pt")
visual_embeds = get_visual_embeddings(image).unsqueeze(0)
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
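# dtype=torch.long matches the documented torch.LongTensor type of region_to_phrase_position
region_to_phrase_position = torch.ones(
    (1, inputs["input_ids"].shape[-1] + visual_embeds.shape[-2]), dtype=torch.long
)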

inputs.update(
    {
        "region_to_phrase_position": region_to_phrase_position,
        "visual_embeds": visual_embeds,
        "visual_token_type_ids": visual_token_type_ids,
        "visual_attention_mask": visual_attention_mask,
    }
)

labels = torch.ones(
    (1, inputs["input_ids"].shape[-1] + visual_embeds.shape[-2], visual_embeds.shape[-2])
)  # Batch size 1

outputs = model(**inputs, labels=labels)
loss = outputs.loss
scores = outputs.logits


Originally published: 2024-06-26. For infringement concerns, contact cloudcommunity@tencent.com to request removal.
