《书生大模型实战营第3期》基础岛第2关：8G 显存玩转书生大模型 Demo

流川疯

发布于 2024-08-11 08:26:10

4790

文章被收录于专栏：流川疯编写程序的艺术流川疯编写程序的艺术

简介

课程学习笔记

环境配置

我们首先来为 Demo 创建一个可用的环境。

# 创建环境
conda create -n demo python=3.10 -y
# 激活环境
conda activate demo
# 安装 torch
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia -y
# 安装其他依赖
pip install transformers==4.38
pip install sentencepiece==0.1.99
pip install einops==0.8.0
pip install protobuf==5.27.2
pip install accelerate==0.33.0
pip install streamlit==1.37.0

Cli Demo 部署 InternLM2-Chat-1.8B 模型

首先，我们创建一个目录，用于存放我们的代码。并创建一个 cli_demo.py。

mkdir -p /root/demo
touch /root/demo/cli_demo.py

然后，我们将下面的代码复制到 cli_demo.py 中。

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


model_name_or_path = "/root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, device_map='cuda:0')
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map='cuda:0')
model = model.eval()

system_prompt = """You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.
"""

messages = [(system_prompt, '')]

print("=============Welcome to InternLM chatbot, type 'exit' to exit.=============")

while True:
    input_text = input("\nUser  >>> ")
    input_text = input_text.replace(' ', '')
    if input_text == "exit":
        break

    length = 0
    for response, _ in model.stream_chat(tokenizer, input_text, messages):
        if response is not None:
            print(response[length:], flush=True, end="")
            length = len(response)

接下来，我们便可以通过 python /root/demo/cli_demo.py 来启动我们的 Demo。

效果如下图所示：

Streamlit Web Demo 部署 InternLM2-Chat-1.8B 模型

在本节中，我们将演示如何使用 Streamlit 部署 InternLM2-Chat-1.8B 模型。

我们执行如下代码来把本教程仓库 clone 到本地，以执行后续的代码。

cd /root/demo
git clone https://github.com/InternLM/Tutorial.git

然后，我们执行如下代码来启动一个 Streamlit 服务。

cd /root/demo
streamlit run /root/demo/Tutorial/tools/streamlit_demo.py --server.address 127.0.0.1 --server.port 6006

接下来，我们在本地的 PowerShell 中输入以下命令，将端口映射到本地。

ssh -CNg -L 6006:127.0.0.1:6006 root@ssh.intern-ai.org.cn -p 你的 ssh 端口号

然后将 SSH 密码复制并粘贴到 PowerShell 中，回车，即可完成端口映射。正常情况下，我们会看到如下输出：

在完成端口映射后，我们便可以通过浏览器访问 http://localhost:6006 来启动我们的 Demo。

效果如下图所示：

如果遇到了 OSError: [Errno 28] inotify watch limit reached 的问题，请不要慌张。稍等一段时间后重新执行即可。

LMDeploy 部署 InternLM-XComposer2-VL-1.8B 模型

InternLM-XComposer2 是一款基于 InternLM2 的视觉语言大模型，其擅长自由形式的文本图像合成和理解。其主要特点包括：

自由形式的交错文本图像合成：InternLM-XComposer2 可以根据大纲、详细文本要求和参考图像等不同输入，生成连贯且上下文相关，具有交错图像和文本的文章，从而实现高度可定制的内容创建。
准确的视觉语言问题解决：InternLM-XComposer2 基于自由形式的指令准确地处理多样化和具有挑战性的视觉语言问答任务，在识别，感知，详细标签，视觉推理等方面表现出色。
令人惊叹的性能：基于 InternLM2-7B 的InternLM-XComposer2 在多个基准测试中位于开源多模态模型第一梯队，而且在部分基准测试中与 GPT-4V 和 Gemini Pro 相当甚至超过它们。

LMDeploy 是一个用于压缩、部署和服务 LLM 的工具包，由 MMRazor 和 MMDeploy 团队开发。它具有以下核心功能：

高效的推理：LMDeploy 通过引入持久化批处理、块 KV 缓存、动态分割与融合、张量并行、高性能 CUDA 内核等关键技术，提供了比 vLLM 高 1.8 倍的推理性能。
有效的量化：LMDeploy 支持仅权重量化和 k/v 量化，4bit 推理性能是 FP16 的 2.4 倍。量化后模型质量已通过 OpenCompass 评估确认。
轻松的分发：利用请求分发服务，LMDeploy 可以在多台机器和设备上轻松高效地部署多模型服务。
交互式推理模式：通过缓存多轮对话过程中注意力的 k/v，推理引擎记住对话历史，从而避免重复处理历史会话。
优秀的兼容性：LMDeploy支持 KV Cache Quant，AWQ 和自动前缀缓存同时使用。

LMDeploy 已经支持了 InternLM-XComposer2 系列的部署，但值得注意的是 LMDeploy 仅支持了 InternLM-XComposer2 系列模型的视觉对话功能。

接下来，我们将演示如何使用 LMDeploy 部署 InternLM-XComposer2-VL-1.8B 模型。

首先，我们激活环境并安装 LMDeploy 以及其他依赖。

conda activate demo
pip install lmdeploy[all]==0.5.1
pip install timm==1.0.7

接下来，我们使用 LMDeploy 启动一个与 InternLM-XComposer2-VL-1.8B 模型交互的 Gradio 服务。

lmdeploy serve gradio /share/new_models/Shanghai_AI_Laboratory/internlm-xcomposer2-vl-1_8b --cache-max-entry-count 0.1

在使用 Upload Image 上传图片后，我们输入 Instruction 后按下回车，便可以看到模型的输出。

LMDeploy 部署 InternVL2-2B 模型

InternVL2 是上海人工智能实验室推出的新一代视觉-语言多模态大模型，是首个综合性能媲美国际闭源商业模型的开源多模态大模型。InternVL2 系列从千亿大模型到端侧小模型全覆盖，通专融合，支持多种模态。

LMDeploy 也已经支持了 InternVL2 系列模型的部署，让我们一起来使用 LMDeploy 部署 InternVL2-2B 模型。

我们可以通过下面的命令来启动 InternVL2-2B 模型的 Gradio 服务。

conda activate demo
lmdeploy serve gradio /share/new_models/OpenGVLab/InternVL2-2B --cache-max-entry-count 0.1

在完成端口映射后，我们便可以通过浏览器访问 http://localhost:6006 来启动我们的 Demo。

在使用 Upload Image 上传图片后，我们输入 Instruction 后按下回车，便可以看到模型的输出。

任务

https://github.com/InternLM/Tutorial/blob/camp3/docs/L1/Demo/task.md

基础任务

使用 Cli Demo 完成 InternLM2-Chat-1.8B 模型的部署，并生成 300 字小故事，记录复现过程并截图。

提示词请基于真实历史和事实，创作一个800字左右包含智慧，启迪人心的故事，并且最好能着重说明如何启发人们

故事不错，就是启动和回复的太慢了。

回答全文：

在远古时期，人类社会经历了漫长而艰难的探索与进化，从狩猎采集到农业文明，再到工业革命，科技的飞速发展改变了人们的生活方式和社会结构。然而，在这个过程中，我们不断追求智慧与进步的同时，也必须警惕其可能带来的负面影响。曾经有一位名叫“智慧老人”的智者，他拥有着丰富的智慧和经验，对于人类社会的演变有着深刻的理解。在人类历史的某个时刻，智慧老人发现，虽然科技进步带来了许多便利和进步，但同时也加剧了社会的分化和贫富差距。智慧老人决定提出一个名为“智慧之河”的概念，希望通过智慧的力量，引导人类走向更加平衡和和谐的未来。他希望通过教育、文化传承和社区建设，让每个人都能够从中受益，并成为推动社会进步的力量。智慧老人提出了以下智慧之河的三个阶段：

教育之河：智慧老人认为，教育是通向智慧之河的起点。他提倡普及教育，让每个人都有机会接受良好的教育，掌握知识和技能，从而在未来的社会中拥有更多的选择和机会。同时，教育也应注重培养人们的道德观念和人文素养，帮助他们成为有责任感和同理心的公民。
文化之河：智慧老人强调文化的重要性，认为文化是智慧之河的源泉。通过保护和传承文化遗产，人类可以保持对过去的尊重和记忆，同时也能从中汲取智慧和灵感。智慧老人提倡保护和推广本土文化，同时也要尊重其他文化的独特性，促进不同文化之间的交流与融合，以促进全球化视野下的和谐共存。
社区之河：智慧老人认为，社区是智慧之河的纽带。社区的建设需要关注社会公正和公平，确保每个人都有平等的机会参与决策和管理。智慧老人提倡建立多元化的社区，鼓励不同背景和经验的人们共同参与，以解决社会问题并创造共同的繁荣。

智慧老人的智慧之河不仅是一种理念的表达，更是一种行动的指南。他希望通过自己的智慧和行动，激发人们的内在动力，让每个人都能在智慧的指引下，为构建一个更加美好、公正和繁荣的社会而努力。在实践过程中，智慧老人不仅以身作则，积极参与社区建设，还通过各种方式传播他的理念，如举办讲座、组织培训和参与社区活动。他鼓励人们在日常生活中实践智慧之河的理念，比如通过志愿服务、文化交流和环保行动，来推动社会的和谐发展。智慧老人的智慧之河不仅是一种愿景，更是一种行动指南，它提醒我们，在追求进步的同时，我们必须时刻警惕智慧的过度应用可能带来的负面影响，并努力寻求平衡，以实现人类社会的持续进步和繁荣。

进阶任务

使用 LMDeploy 完成 InternLM-XComposer2-VL-1.8B 的部署，并完成一次图文理解对话，记录复现过程并截图。

(demo) root@intern-studio-50001213:~/demo# lmdeploy serve gradio /share/new_models/Shanghai_AI_Laboratory/internlm-xcomposer2-vl-1_8b --cache-max-entry-count 0.1
You are using a model of type internlmxcomposer2 to instantiate a model of type internlm. This is not supported for all configurations of models and can yield errors.
You are using a model of type internlmxcomposer2 to instantiate a model of type internlm. This is not supported for all configurations of models and can yield errors.
You are using a model of type internlmxcomposer2 to instantiate a model of type internlm. This is not supported for all configurations of models and can yield errors.
2024-08-08 22:21:26,969 - lmdeploy - INFO - matching vision model: Xcomposer2VisionModel
Set max length to 4096
Could not locate the ixc_utils.py inside /share/new_models/Shanghai_AI_Laboratory/internlm-xcomposer2-vl-1_8b.
Could not locate the ixc_utils.py inside /share/new_models/Shanghai_AI_Laboratory/internlm-xcomposer2-vl-1_8b.
2024-08-08 22:21:44,404 - lmdeploy - INFO - matching type of ModelType.XCOMPOSER2
2024-08-08 22:22:19,666 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name=None, model_format=None, tp=1, session_len=8192, max_batch_size=128, cache_max_entry_count=0.1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-08-08 22:22:19,667 - lmdeploy - INFO - input chat_template_config=ChatTemplateConfig(model_name=None, system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability='chat', stop_words=None)
2024-08-08 22:22:19,850 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='internlm-xcomposer2', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability='chat', stop_words=None)
2024-08-08 22:22:19,850 - lmdeploy - INFO - model_source: hf_model
You are using a model of type internlmxcomposer2 to instantiate a model of type internlm. This is not supported for all configurations of models and can yield errors.
You are using a model of type internlmxcomposer2 to instantiate a model of type internlm. This is not supported for all configurations of models and can yield errors.
You are using a model of type internlmxcomposer2 to instantiate a model of type internlm. This is not supported for all configurations of models and can yield errors.
You are using a model of type internlmxcomposer2 to instantiate a model of type internlm. This is not supported for all configurations of models and can yield errors.
You are using a model of type internlmxcomposer2 to instantiate a model of type internlm. This is not supported for all configurations of models and can yield errors.
You are using a model of type internlmxcomposer2 to instantiate a model of type internlm. This is not supported for all configurations of models and can yield errors.
Could not locate the ixc_utils.py inside /share/new_models/Shanghai_AI_Laboratory/internlm-xcomposer2-vl-1_8b.
Could not locate the ixc_utils.py inside /share/new_models/Shanghai_AI_Laboratory/internlm-xcomposer2-vl-1_8b.
Could not locate the ixc_utils.py inside /share/new_models/Shanghai_AI_Laboratory/internlm-xcomposer2-vl-1_8b.
Could not locate the ixc_utils.py inside /share/new_models/Shanghai_AI_Laboratory/internlm-xcomposer2-vl-1_8b.
2024-08-08 22:22:25,044 - lmdeploy - INFO - model_config:

[llama]
model_name = internlm-xcomposer2
model_arch = InternLMXComposer2ForCausalLM
tensor_para_size = 1
head_num = 16
kv_head_num = 8
vocab_size = 92544
num_layer = 24
inter_size = 8192
norm_eps = 1e-05
attn_bias = 0
start_id = 1
end_id = 2
session_len = 8192
weight_type = bf16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 128
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.1
cache_block_seq_len = 64
cache_chunk_size = -1
enable_prefix_caching = False
num_tokens_per_iter = 8192
max_prefill_iters = 1
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 2.0
use_dynamic_ntk = 1
use_logn_attn = 0
lora_policy = plora
lora_r = 256
lora_scale = 1.0
lora_max_wo_r = 256
lora_rank_pattern = 
lora_scale_pattern = 


[TM][WARNING] [LlamaTritonModel] `max_context_token_num` = 8192.
2024-08-08 22:22:25,502 - lmdeploy - WARNING - get 411 model params
2024-08-08 22:22:54,298 - lmdeploy - INFO - updated backend_config=TurbomindEngineConfig(model_name=None, model_format=None, tp=1, session_len=8192, max_batch_size=128, cache_max_entry_count=0.1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
[TM][INFO] [BlockManager] block_size = 6 MB
[TM][INFO] [BlockManager] max_block_count = 48
[TM][INFO] [BlockManager] chunk_size = 48
[TM][WARNING] No enough blocks for `session_len` (8192), `session_len` truncated to 3072.
[TM][INFO] LlamaBatch<T>::Start()
Running on local URL:  http://0.0.0.0:6006

Could not create share link. Missing file: /root/.conda/envs/demo/lib/python3.10/site-packages/gradio/frpc_linux_amd64_v0.2. 

Please check your internet connection. This can happen if your antivirus software blocks the download of this file. You can install manually by following these steps: 

1. Download this file: https://cdn-media.huggingface.co/frpc-gradio-0.2/frpc_linux_amd64
2. Rename the downloaded file to: frpc_linux_amd64_v0.2
3. Move the file to this location: /root/.conda/envs/demo/lib/python3.10/site-packages/gradio
2024-08-08 22:24:58,737 - lmdeploy - INFO - prompt: ('图中有神马？', [<PIL.Image.Image image mode=RGB size=768x1152 at 0x7F950D92A3B0>])
2024-08-08 22:24:58,737 - lmdeploy - WARNING - Can not found event loop in current thread. Create a new event loop.
2024-08-08 22:24:58,737 - lmdeploy - WARNING - auto append <IMAGE_TOKEN> at the beginning, the user can manually insert the token to prompt
2024-08-08 22:24:58,738 - lmdeploy - INFO - start ImageEncoder._forward_loop
2024-08-08 22:24:58,738 - lmdeploy - INFO - ImageEncoder received 1 images, left 1 images.
2024-08-08 22:24:58,738 - lmdeploy - INFO - ImageEncoder process 1 images, left 0 images.
2024-08-08 22:25:03,816 - lmdeploy - INFO - ImageEncoder forward 1 images, cost 5.079s
2024-08-08 22:25:03,817 - lmdeploy - INFO - ImageEncoder done 1 images, left 0 images.
2024-08-08 22:25:03,819 - lmdeploy - INFO - preprocess cost 5.082s
2024-08-08 22:25:03,819 - lmdeploy - INFO - input_ids: [1, 92543, 9081, 364, 2770, 657, 589, 15358, 17993, 6843, 963, 505, 4576, 11146, 30778, 1234, 20248, 451, 62442, 60752, 60721, 61255, 61104, 4452, 285, 4576, 11146, 30778, 1234, 20248, 451, 62442, 60752, 60721, 61255, 61104, 313, 505, 395, 7445, 17218, 2881, 7659, 1813, 4287, 1762, 560, 505, 8020, 684, 36956, 15358, 31288, 451, 68589, 76659, 71581, 699, 1226, 505, 6342, 442, 517, 11100, 328, 10894, 328, 454, 51978, 756, 285, 4576, 11146, 30778, 1234, 20248, 451, 62442, 60752, 60721, 61255, 61104, 313, 777, 3696, 454, 19187, 19829, 4563, 435, 410, 4287, 12032, 684, 410, 1341, 1893, 569, 6519, 454, 262, 69093, 756, 285, 4576, 11146, 30778, 1234, 20248, 451, 62442, 60752, 60721, 61255, 61104, 313, 505, 13026, 446, 12824, 2613, 454, 27943, 15613, 14644, 13585, 3285, 519, 410, 4054, 2321, 281, 92542, 364, 92543, 1008, 364, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 60472, 70635, 60780, 60832, 60504, 92542, 364, 92543, 525, 11353, 364]
2024-08-08 22:25:03,820 - lmdeploy - INFO - Register stream callback for 0
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][INFO] [ProcessInferRequests] Request for 0 received.
[TM][INFO] [Forward] [0, 1), dc_bsz = 0, pf_bsz = 1, n_tok = 1373, max_q = 1373, max_k = 1373

# 这里卡住，很长时间不出结果。。。