Which Is the Strongest Multimodal Model? The VLMEvalKit Evaluation Framework Takes a Full Look at Multimodal Capabilities

OpenMMLab Official Account

Published 2024-01-19 09:06:34

Part of the column: OpenMMLab

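Since the release of pioneering multimodal understanding projects such as OpenFlamingo, LLaVA, and MiniGPT-4, we have seen more than a hundred new multimodal models and numerous evaluation benchmarks emerge. This rapid growth exposes a problem:

Different multimodal models typically report results on different benchmarks, yet so far no unified open-source evaluation framework has covered this wide range of models and benchmarks.

To address this, the OpenCompass team developed VLMEvalKit, a new open-source multimodal evaluation framework. It aims to provide reliable, reproducible results and to help the community compare multimodal models more accurately across tasks.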

GitHub:

https://github.com/open-compass/VLMEvalKit

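Key Features

The main features of VLMEvalKit are summarized below:

1. Scope:

VLMEvalKit currently targets the evaluation of image-text multimodal models. Depending on what a model supports, it can handle a single image-text pair or an arbitrary number of interleaved images and text segments. The code below shows both kinds of inference with VLMEvalKit: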

from vlmeval.config import supported_VLM
model = supported_VLM['idefics_9b_instruct']()
# Single image-text pair inference with VLMEvalKit
ret = model.generate('apple.jpg', 'What is in this image?')
# ret: "The image features a red apple with a leaf on it."
# Interleaved image-text inference with VLMEvalKit
ret = model.interleave_generate(
    ['apple.jpg', 'apple.jpg', 'How many apples are there in the provided images? ']
)
# ret: "There are two apples in the provided images."

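2. Broad model and benchmark support:

Supports three mainstream multimodal API models: GPT-4v, GeminiPro, and QwenVLPlus
Supports more than thirty open-source multimodal models, including llava-v1.5, mPLUG-Owl2, XComposer, and CogVLM
Supports more than ten open-source multimodal benchmarks, including MME, MMBench, SEEDBench, and MMMU
Detailed evaluations were run on the supported models and benchmarks, and the results are published on the OpenCompass overall multimodal leaderboard: https://opencompass.org.cn/leaderboard-multimodal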

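3. Convenient one-stop evaluation:

No manual data preprocessing is needed for any dataset that VLMEvalKit supports
A single command evaluates multiple multimodal models on multiple benchmarks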

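4. Easy to extend: With the VLMEvalKit framework, you can easily add new multimodal models or benchmarks. Once a model or benchmark has been added, every existing benchmark can be used to evaluate the new model, and every existing model can be evaluated on the new benchmark.

Adding a new benchmark: In most cases you only need to convert the custom benchmark into the TSV format that VLMEvalKit supports and provide the corresponding prompt construction logic. [AI2D](https://github.com/open-compass/VLMEvalKit/pull/51) is an example to follow.
Adding a new multimodal model: Implement a new class; it only needs to provide a simple generate(image, prompt) interface. This works for both API models (QwenVLPlus, see https://github.com/open-compass/VLMEvalKit/pull/27/) and open-source models (Monkey, see https://github.com/open-compass/VLMEvalKit/pull/45).
Custom prompts per benchmark: Developers may prefer different prompt templates for different benchmarks to get the best results, so VLMEvalKit supports this as well.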
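
To make the extension points above more concrete, here is a minimal sketch of a class that exposes a generate(image, prompt) method together with a per-benchmark prompt lookup. Everything in it is an assumption made for illustration: `ToyVLM`, `build_prompt`, `PROMPT_BUILDERS`, and the prompt templates are hypothetical names, not part of VLMEvalKit, and the toy model only echoes its inputs instead of running a real network. The linked pull requests show how actual models are integrated.

```python
from typing import Callable, Dict


def default_prompt(question: str) -> str:
    """Use the question unchanged when no custom template is registered."""
    return question


def yes_no_prompt(question: str) -> str:
    """Hypothetical template for a yes/no style benchmark."""
    return f"{question} Please answer yes or no."


# Hypothetical registry mapping benchmark names to prompt builders.
# The benchmark name comes from the article; the template is illustrative.
PROMPT_BUILDERS: Dict[str, Callable[[str], str]] = {
    "MME": yes_no_prompt,
}


class ToyVLM:
    """Stand-in model exposing a generate(image, prompt) interface.

    A real wrapper would load weights or call an API here; this toy
    version only echoes its inputs so the sketch runs without a model.
    """

    def generate(self, image_path: str, prompt: str) -> str:
        return f"[{image_path}] {prompt}"

    def build_prompt(self, question: str, dataset: str = "") -> str:
        # Fall back to the raw question for benchmarks without a template.
        builder = PROMPT_BUILDERS.get(dataset, default_prompt)
        return builder(question)


model = ToyVLM()
prompt = model.build_prompt("Is there an apple in the image?", dataset="MME")
print(model.generate("apple.jpg", prompt))
```

Keeping the prompt logic in a lookup table separate from `generate` means a new benchmark only needs one new entry, while the inference path stays untouched.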

How to Use

Installation

git clone git@github.com:open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .

Demo: check that the installation succeeded

from vlmeval.config import supported_VLM
model = supported_VLM['idefics_9b_instruct']()
ret = model.generate('apple.jpg', 'What is in this image?')
# ret: "The image features a red apple with a leaf on it."
ret = model.interleave_generate(
    ['apple.jpg', 'apple.jpg', 'How many apples are there in the provided images? ']
)
# ret: "There are two apples in the provided images."

Running an Evaluation

# Model: qwen_chat; benchmark: MME; hardware: 2x A100 GPUs
torchrun --nproc-per-node=2 run.py --data MME --model qwen_chat --verbose
# Models: IDEFICS-9B-Instruct, Qwen-VL-Chat, mPLUG-Owl2
# Benchmarks: MMBench_DEV_EN, MME, SEEDBench_IMG
# Hardware: 8x A100 GPUs
torchrun --nproc-per-node=8 run.py --data MMBench_DEV_EN MME SEEDBench_IMG \
         --model idefics_9b_instruct qwen_chat mPLUG-Owl2 --verbose
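
When the model and benchmark lists come from a script rather than being typed by hand, the same command line can be assembled programmatically. The sketch below is a hypothetical helper (`build_eval_command` is not part of VLMEvalKit); it only uses the flags that appear in the commands above and prints the resulting command instead of running it.

```python
import shlex
from typing import List


def build_eval_command(models: List[str], datasets: List[str], num_gpus: int) -> List[str]:
    """Assemble the torchrun invocation as an argument list."""
    if not models or not datasets:
        raise ValueError("need at least one model and one dataset")
    return [
        "torchrun", f"--nproc-per-node={num_gpus}", "run.py",
        "--data", *datasets,
        "--model", *models,
        "--verbose",
    ]


cmd = build_eval_command(["qwen_chat"], ["MME"], num_gpus=2)
# Prints: torchrun --nproc-per-node=2 run.py --data MME --model qwen_chat --verbose
print(shlex.join(cmd))
```

To launch the evaluation, the list could be passed to `subprocess.run(cmd, check=True)` from the repository root; keeping the arguments as a list avoids shell quoting problems.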

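Evaluation Results

We publish the results on the OpenCompass overall leaderboard for large multimodal models (https://opencompass.org.cn/leaderboard-multimodal). The leaderboard currently covers the performance of every multimodal model in VLMEvalKit on 9 benchmarks. The figure below shows an excerpt of the results: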

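Quantitative Results

Overall, we have the following findings:

1. Closed-source multimodal API models still lead overall: when models are ordered by their average rank across the benchmarks, the top three (GeminiPro, GPT-4v, and QwenVLPlus) are all closed-source API models.

2. Open-source multimodal models fall short in reasoning: on benchmarks that demand stronger reasoning (such as MMMU, MMVet, and MathVista), open-source models (such as InternLM-XComposer) still trail the closed-source models.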
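
The first finding relies on an average rank across benchmarks. One plausible way to compute such a number is sketched below; it is an assumption about the procedure, not the leaderboard's actual code, and the model names, benchmark names, and scores are made up for illustration. The sketch assumes every model has a score on every benchmark and that higher scores are better.

```python
from typing import Dict

# Made-up scores for illustration only; they are NOT leaderboard results.
scores: Dict[str, Dict[str, float]] = {
    "model_a": {"bench_1": 70.0, "bench_2": 55.0, "bench_3": 62.0},
    "model_b": {"bench_1": 65.0, "bench_2": 60.0, "bench_3": 58.0},
    "model_c": {"bench_1": 50.0, "bench_2": 40.0, "bench_3": 66.0},
}


def average_rank(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Rank models per benchmark (1 = best), then average the ranks."""
    benchmarks = sorted({b for per_model in scores.values() for b in per_model})
    totals = {model: 0.0 for model in scores}
    for bench in benchmarks:
        # Higher score means better rank; ties keep insertion order.
        ordered = sorted(scores, key=lambda m: scores[m][bench], reverse=True)
        for rank, model in enumerate(ordered, start=1):
            totals[model] += rank
    return {model: totals[model] / len(benchmarks) for model in scores}


avg = average_rank(scores)
for name in sorted(avg, key=avg.get):
    # Prints model_a 1.67, model_b 2.0, model_c 2.33 (one per line)
    print(name, round(avg[name], 2))
```

Averaging ranks rather than raw scores keeps benchmarks with very different score scales, such as a total-points metric next to a percentage accuracy, from dominating the comparison.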

To make it easier to compare multimodal models, we selected 9 mainstream models and visualized their performance:

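Qualitative Results

To understand where current multimodal models still fall short, we collected the questions from the nine benchmarks in the figure above that none of the multimodal models answered correctly and visualized them. Some examples follow: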

1. Questions that require external knowledge

Source: MathVista

Question: What is the age gap between these two people in image? (Unit: years)

Answer: 11

Source: MMMU

Question: In the Section of left leg, identify the 170 structure.

Options: A. Tibialis anterior B. Tibialis posterior C. Flexor hallucis longus D. Peroneus longus

Answer: D

Source: MMMU

Question: The maximum number of stereoisomers that could exist for the compound below ?

Options: A. 6 B. 8 C. 10 D. 16

Answer: C

Source: MME

Question: Is the person inside the red bounding box called Michael Keaton? Please answer yes or no.

Answer: Yes

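2. Complex multimodal reasoning

Source: MathVista

Question: As shown in the figure, in parallelogram ABCD, AB = AC and ∠CAB = 40°. The measure of ∠D is ( )

Options: (A) 40° (B) 50° (C) 60° (D) 70°

Answer: (D) 70°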

Source: MathVista

Question: How many Triangles do you see in the picture?

Answer: 12

3. Complex chart analysis

Source: MathVista

Question: What is the difference between genres of tv shows watched by highest female and lowest female?

Answer: 39

Source: MMMU

Question: Each of the following situations relates to a different company. For company B, find the missing amounts.

Options: A. 63,020 B. 58,410 C. 71,320 D. 77,490

Answer: D
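
The examples in this section were chosen as questions that no model answered correctly. A minimal sketch of that selection step is shown below, assuming per-model correctness records keyed by question ID. The record layout, model names, and question IDs are hypothetical and do not reflect how VLMEvalKit stores its results.

```python
from typing import Dict, Set

# Hypothetical records: model name -> {question id -> answered correctly?}
results: Dict[str, Dict[str, bool]] = {
    "model_a": {"q1": True, "q2": False, "q3": False, "q4": False},
    "model_b": {"q1": False, "q2": False, "q3": True, "q4": False},
    "model_c": {"q1": True, "q2": False, "q3": False, "q4": False},
}


def failed_by_all(results: Dict[str, Dict[str, bool]]) -> Set[str]:
    """Return the question ids that every model answered incorrectly."""
    per_model_failures = [
        {qid for qid, correct in record.items() if not correct}
        for record in results.values()
    ]
    if not per_model_failures:
        return set()
    # A question missing from any model's record is excluded, because it
    # cannot appear in that model's failure set.
    return set.intersection(*per_model_failures)


print(sorted(failed_by_all(results)))  # prints ['q2', 'q4']
```

Intersecting the per-model failure sets is a conservative choice: a question counts only when every evaluated model got it wrong.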

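Appendix

VLMEvalKit project:

https://github.com/open-compass/VLMEvalKit

MMBench leaderboard:

https://mmbench.opencompass.org.cn/leaderboard

OpenCompass overall leaderboard for large multimodal models:

https://opencompass.org.cn/leaderboard-multimodal

How to join the MMBench leaderboard:

Send your predictions on the MMBench benchmark, or your evaluation ID from the official site, to opencompass@pjlab.org.cn;
An evaluation ID is obtained by submitting predictions to the evaluation service (https://mmbench.opencompass.org.cn/mmbench-submission).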

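How to join the overall leaderboard for large multimodal models:

Submit a PR to the VLMEvalKit project, and the leaderboard will be updated afterwards. References:

[New model] Support Monkey (#45): https://github.com/open-compass/VLMEvalKit/pull/45/files
[New dataset] Support AI2D (#51): https://github.com/open-compass/VLMEvalKit/pull/51/files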
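
For the new-dataset route, the article says a custom benchmark mainly needs to be converted to a TSV file. The sketch below shows one way such a conversion could look for a multiple-choice benchmark. The column names (`index`, `image`, `question`, `A` to `D`, `answer`), the base64 image encoding, and the sample record are assumptions made for this example; check the AI2D pull request linked above for the layout that VLMEvalKit actually expects.

```python
import base64
import csv
import io
from typing import Dict, List

# Assumed column layout for a multiple-choice benchmark (illustrative only).
FIELDS = ["index", "image", "question", "A", "B", "C", "D", "answer"]


def encode_image(raw: bytes) -> str:
    """Encode raw image bytes as base64 text so they fit in one TSV cell."""
    return base64.b64encode(raw).decode("ascii")


def to_tsv(records: List[Dict[str, str]]) -> str:
    """Serialize records to TSV text, adding a running integer index."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for i, rec in enumerate(records):
        writer.writerow({"index": i, **rec})
    return buf.getvalue()


# Hypothetical sample record; the image bytes are a placeholder, not a real image.
records = [{
    "image": encode_image(b"\x89PNG placeholder bytes, not a real image"),
    "question": "Which label points to the leaf?",
    "A": "1", "B": "2", "C": "3", "D": "4",
    "answer": "B",
}]
tsv_text = to_tsv(records)
print(tsv_text.splitlines()[0])  # header row, tab-separated
```

Storing the image inline as base64 keeps each question self-contained in a single row, at the cost of a larger file.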

Originally published: 2024-01-17
