Getting Started with chatglm.cpp, a Model Quantization Tool

Original

Luoyger

Last modified on 2024-03-13 12:30:52

This article is included in the column: AI Technology Exploration and Application

chatglm.cpp can quantize models in the ChatGLM series so that inference can run on low-spec machines. This tutorial walks through how to use it.

Download the code

git clone --recursive https://github.com/li-plus/chatglm.cpp.git && cd chatglm.cpp

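Quantize the model

Supported models include ChatGLM-6B, ChatGLM2-6B, CodeGeeX2, and their quantized variants.

-i specifies the original model, which can be either a model on HuggingFace or a local path.

-t <type> selects the output type: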

q4_0: 4-bit integer quantization with fp16 scales.
q4_1: 4-bit integer quantization with fp16 scales and minimum values.
q5_0: 5-bit integer quantization with fp16 scales.
q5_1: 5-bit integer quantization with fp16 scales and minimum values.
q8_0: 8-bit integer quantization with fp16 scales.
f16: half precision floating point weights without quantization.
f32: single precision floating point weights without quantization.

-l <lora_model_name_or_path> optionally merges LoRA weights into the base model.

python3 chatglm_cpp/convert.py -i THUDM/chatglm-6b -t q4_0 -o chatglm-ggml.bin
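
To make the "4-bit integer quantization with fp16 scales" description of q4_0 a little more concrete, here is a minimal sketch of block-wise quantization in plain Python. It is only an illustration of the idea: the block length of 32, the symmetric [-8, 7] range, and the helper names are assumptions made for this example, and the actual on-disk layout written by convert.py may differ.

```python
import struct

BLOCK_SIZE = 32  # assumed block length, chosen for illustration only


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest half-precision (fp16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_block(values):
    """Map one block of floats to 4-bit integers in [-8, 7] plus one fp16 scale."""
    amax = max(abs(v) for v in values)
    scale = to_fp16(amax / 7.0) if amax > 0 else 0.0
    if scale == 0.0:
        # All-zero block (or a scale that underflows in fp16): store zeros.
        return 0.0, [0] * len(values)
    # Clamp after rounding because the fp16-rounded scale can be slightly too small.
    quants = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the scale and the 4-bit integers."""
    return [scale * q for q in quants]


# Round-trip one block of small, weight-like values.
weights = [i / (BLOCK_SIZE - 1) - 0.5 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.5f} max_error={max_error:.5f}")
```

Each block stores 32 small integers and a single scale, which is why the quantized file is so much smaller than the f16 or f32 versions. The q4_1 and q5_1 variants additionally keep a per-block minimum value, which this sketch leaves out.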

Run the model

Running with the C++ tool

Build the tool

cmake -B build
cmake --build build -j --config Release

Run

./build/bin/main -m chatglm-ggml.bin -p 你好

Interactive mode. In this mode, the chat history is carried over into the next turn of the conversation.

./build/bin/main -m chatglm-ggml.bin -i

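Using the Python library

Install the Python library.

Note: the repository directory contains a folder named chatglm_cpp, which has the same name as the package you import. Any script run from that directory will pick up the folder instead of the installed package and fail. Put your script in a different directory and run it from there, taking care to adjust the model path. Alternatively, rename the chatglm_cpp folder after installation, for example to chatglm_cpp.origin.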

pip install -U chatglm-cpp
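
The naming conflict described in the note can be checked before importing anything. The sketch below uses only the standard library to ask where Python would load a package from and whether that location is a folder inside the current working directory. The function name is_shadowed_by_local_dir is made up for this example and is not part of chatglm.cpp.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def is_shadowed_by_local_dir(package: str = "chatglm_cpp",
                             cwd: Optional[Path] = None) -> bool:
    """Return True if `import package` would resolve to a folder inside `cwd`."""
    base = (cwd or Path.cwd()).resolve()
    spec = importlib.util.find_spec(package)
    if spec is None:
        # Package is not importable at all, so nothing is shadowing it.
        return False
    locations = [Path(p) for p in (spec.submodule_search_locations or [])]
    if spec.origin:
        locations.append(Path(spec.origin).parent)
    local_dir = base / package
    return any(loc.resolve() == local_dir for loc in locations)


if is_shadowed_by_local_dir():
    print("chatglm_cpp resolves to the local source folder; "
          "run from another directory or rename the folder.")
```

Running this from the repository root before starting a script gives an early warning instead of a confusing import error later.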

Load the model. With this call the reply is returned all at once rather than streamed.

import chatglm_cpp

pipeline = chatglm_cpp.Pipeline("../chatglm-ggml.bin")
pipeline.chat(["你好"])
# '你好👋！我是人工智能助手 ChatGLM-6B，很高兴见到你，欢迎问我任何问题。'

Streaming output

Run cli_chat.py from the examples/ directory

python3 cli_chat.py -m ../chatglm-ggml.bin -i
# python3 cli_chat.py -m ../chatglm2-ggml.bin -p 你好 --temp 0.8 --top_p 0.8  # CLI demo

Chat in the browser

Run web_demo.py from the examples/ directory

python3 web_demo.py -m ../chatglm-ggml.bin
# python3 web_demo.py -m ../chatglm2-ggml.bin --temp 0.8 --top_p 0.8  # web demo

Quantize automatically and run

The model path can be a HuggingFace model name or a local path.

import chatglm_cpp

pipeline = chatglm_cpp.Pipeline("THUDM/chatglm-6b", dtype="q4_0")
pipeline.chat(["你好"])
# '你好👋！我是人工智能助手 ChatGLM-6B，很高兴见到你，欢迎问我任何问题。'

API Server

pip install 'chatglm-cpp[api]'

LangChain API

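If you run into dependency conflicts, create a new conda environment, then reinstall the dependencies and run again. Remember to update the name and path of the quantized model.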

MODEL=./chatglm-ggml.bin uvicorn chatglm_cpp.langchain_api:app --host 127.0.0.1 --port 8000

Test the API with curl

curl http://127.0.0.1:8000 -H 'Content-Type: application/json' -d '{"prompt": "你好"}'
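
For readers who prefer to test from Python, the curl call above can be reproduced with the standard library alone. This is a minimal sketch: the helper names build_request and ask are made up for this example, and since the response schema is not described here, the function simply returns the decoded JSON body.

```python
import json
import urllib.request


def build_request(prompt: str, url: str = "http://127.0.0.1:8000") -> urllib.request.Request:
    """Build the same POST request as the curl command: a JSON body with a prompt field."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, url: str = "http://127.0.0.1:8000", timeout: float = 60.0):
    """Send the prompt to the running server and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(prompt, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the uvicorn server from the previous step running, calling ask("你好") and printing the result shows the raw response structure.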

Test with LangChain as the client

from langchain.llms import ChatGLM

llm = ChatGLM(endpoint_url="http://127.0.0.1:8000")
llm.predict("你好")
# '你好👋！我是人工智能助手 ChatGLM2-6B，很高兴见到你，欢迎问我任何问题。'

OpenAI API

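If you run into dependency conflicts, create a new conda environment, then reinstall the dependencies and run again. Remember to update the name and path of the quantized model.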

MODEL=./chatglm-ggml.bin uvicorn chatglm_cpp.openai_api:app --host 127.0.0.1 --port 8000

Test the API with curl

curl http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' \
    -d '{"messages": [{"role": "user", "content": "你好"}]}'
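
The same request can be assembled in Python without any third-party client. The sketch below builds the messages payload from a list of (role, content) pairs, which makes multi-turn history easy to pass along, and reads the reply from choices[0].message.content, the usual shape of an OpenAI-style chat completion response. The function names are made up for this example.

```python
import json
import urllib.request


def build_payload(history):
    """Turn a list of (role, content) pairs into the chat completions request body."""
    return {"messages": [{"role": role, "content": content} for role, content in history]}


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response["choices"][0]["message"]["content"]


def chat(history, base_url: str = "http://127.0.0.1:8000/v1", timeout: float = 60.0) -> str:
    """POST the conversation to /chat/completions and return the assistant reply."""
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(build_payload(history)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return extract_reply(json.loads(resp.read().decode("utf-8")))
```

With the server running, chat([("user", "你好")]) returns the reply text. Appending the reply as an ("assistant", ...) pair before the next user turn keeps the conversation context.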

Test with the OpenAI client

import openai  # this snippet uses the pre-1.0 openai package interface (openai.ChatCompletion)

openai.api_base = "http://127.0.0.1:8000/v1"
response = openai.ChatCompletion.create(model="default-model", messages=[{"role": "user", "content": "你好"}])
response["choices"][0]["message"]["content"]
# '你好👋！我是人工智能助手 ChatGLM2-6B，很高兴见到你，欢迎问我任何问题。'

Streaming output from the client

OPENAI_API_BASE=http://127.0.0.1:8000/v1 python3 examples/openai_client.py --stream --prompt 你好
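
To show what a streaming client has to do with the response, here is a small sketch that reassembles text from a stream of server-sent event lines. It assumes the server emits OpenAI-style chunks, that is, lines of the form "data: {json}" with the new text under choices[0].delta.content and a final "data: [DONE]" marker. That format is an assumption and is not confirmed in this article, so check the actual output of your server before relying on it.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines (assumed format)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:
            yield text


# Example with hand-written lines standing in for a real HTTP response body.
sample = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    b'data: {"choices": [{"delta": {"content": "lo"}}]}',
    b"data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

Because the function accepts any iterable of lines, it can be fed an open HTTP response object directly, printing each fragment as soon as it arrives.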

Performance reference

Environment:

CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, 16 threads.
CUDA: V100-SXM2-32GB GPU, 1 thread.
MPS: Apple M2 Ultra, 1 thread (currently only supports ChatGLM2).

ChatGLM-6B:

	Q4_0	Q4_1	Q5_0	Q5_1	Q8_0	F16	F32
ms/token (CPU @ Platinum 8260)	74	77	86	89	114	189	357
ms/token (CUDA @ V100 SXM2)	10	9.8	10.7	10.6	14.6	19.8	34.2
file size	3.3GB	3.7GB	4.0GB	4.4GB	6.2GB	12GB	23GB
mem usage	4.0GB	4.4GB	4.7GB	5.1GB	6.9GB	13GB	24GB

ChatGLM2-6B:

	Q4_0	Q4_1	Q5_0	Q5_1	Q8_0	F16	F32
ms/token (CPU @ Platinum 8260)	64	71	79	83	106	189	372
ms/token (CUDA @ V100 SXM2)	9.7	9.4	10.3	10.2	14	19.1	33
ms/token (MPS @ M2 Ultra)	11	11.7	N/A	N/A	N/A	32.1	N/A
file size	3.3GB	3.7GB	4.0GB	4.4GB	6.2GB	12GB	24GB
mem usage	3.4GB	3.8GB	4.1GB	4.5GB	6.2GB	12GB	23GB
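
The file sizes in the tables can be roughly reproduced with some back-of-the-envelope arithmetic. The sketch below assumes about 6.2 billion parameters, blocks of 32 weights, and 16 bits per stored scale or minimum value. None of these figures come from the tables themselves, so treat the result as an order-of-magnitude check rather than an exact prediction.

```python
BLOCK = 32  # assumed number of weights sharing one scale (and one minimum)

# Effective bits per weight: the integer width plus the per-block fp16 overhead.
BITS_PER_WEIGHT = {
    "q4_0": 4 + 16 / BLOCK,   # fp16 scale
    "q4_1": 4 + 32 / BLOCK,   # fp16 scale + fp16 minimum
    "q5_0": 5 + 16 / BLOCK,
    "q5_1": 5 + 32 / BLOCK,
    "q8_0": 8 + 16 / BLOCK,
    "f16": 16.0,
    "f32": 32.0,
}


def estimate_gib(dtype: str, n_params: float = 6.2e9) -> float:
    """Estimate model file size in GiB for a given type (parameter count is assumed)."""
    return n_params * BITS_PER_WEIGHT[dtype] / 8 / 2**30


for name in BITS_PER_WEIGHT:
    print(f"{name}: {estimate_gib(name):.1f} GiB")
```

The estimates land close to the measured file sizes and tend to come out slightly lower. One possible reason is that a real model file also carries metadata and may keep some tensors at higher precision.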