Qwen/Qwen-7B-Chat

通義千問-7B(Qwen-7B)是阿裏雲研發的通義千問大模型系列的70億參數規模的模型。Qwen-7B是基於Transformer的大語言模型, 在超大規模的預訓練數據上進行訓練得到。預訓練數據類型多樣,覆蓋廣泛,包括大量網絡文本、專業書籍、代碼等。同時,在Qwen-7B的基礎上,我們使用對齊機製打造了基於大語言模型的AI助手Qwen-7B-Chat。本倉庫為Qwen-7B-Chat的倉庫。

如果您想了解更多關於通義千問-7B開源模型的細節,我們建議您參閱Github代碼庫。

Qwen-7B is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, code, etc. Additionally, based on the pretrained Qwen-7B, we release Qwen-7B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. This repository is the one for Qwen-7B-Chat.

For more details about the open-source model of Qwen-7B, please refer to the Github code repository.

要求(Requirements)
python 3.8及以上版本
pytorch 1.12及以上版本,推薦2.0及以上版本
建議使用CUDA 11.4及以上(GPU用戶、flash-attention用戶等需考慮此選項)
python 3.8 and above
pytorch 1.12 and above, 2.0 and above are recommended
CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.)
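
As a quick sanity check (an optional sketch, not part of the original requirements), you can verify the local environment with:

import sys
import torch

# Optional sanity check against the requirements above.
print("python:", sys.version.split()[0])      # expect 3.8 or above
print("torch:", torch.__version__)            # expect 1.12 or above, ideally 2.0+
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("cuda (torch build):", torch.version.cuda)  # 11.4 or above recommended for flash-attention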
依賴項(Dependency)
運行Qwen-7B-Chat,請確保滿足上述要求,再執行以下pip命令安裝依賴庫

To run Qwen-7B-Chat, please make sure you meet the above requirements, and then execute the following pip commands to install the dependent libraries.

pip install transformers==4.31.0 accelerate tiktoken einops

另外,推薦安裝flash-attention庫,以實現更高的效率和更低的顯存占用。

In addition, it is recommended to install the flash-attention library for higher efficiency and lower memory usage.

git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .

下方安裝可選,安裝可能比較緩慢。

The installs below are optional; building them may be slow.

pip install csrc/layer_norm

pip install csrc/rotary

快速使用(Quickstart)
下面我們展示了一個使用Qwen-7B-Chat模型,進行多輪對話交互的樣例:

We show an example of multi-turn interaction with Qwen-7B-Chat in the following code:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()

# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()

# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()

# use auto mode, automatically select precision based on the device.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()

# Specify hyperparameters for generation (可指定不同的生成長度、top_p等相關超參)
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)

# 第一輪對話 1st dialogue turn
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
# 你好!很高興為你提供幫助。

# 第二輪對話 2nd dialogue turn
response, history = model.chat(tokenizer, "給我講一個年輕人奮鬥創業最終取得成功的故事。", history=history)
print(response)
# 這是一個關於一個年輕人奮鬥創業最終取得成功的故事。
# 故事的主人公叫李明,他來自一個普通的家庭,父母都是普通的工人。從小,李明就立下了一個目標:要成為一名成功的企業家。
# 為了實現這個目標,李明勤奮學習,考上了大學。在大學期間,他積極參加各種創業比賽,獲得了不少獎項。他還利用課余時間去實習,積累了寶貴的經驗。
# 畢業後,李明決定開始自己的創業之路。他開始尋找投資機會,但多次都被拒絕了。然而,他並沒有放棄。他繼續努力,不斷改進自己的創業計劃,並尋找新的投資機會。
# 最終,李明成功地獲得了一筆投資,開始了自己的創業之路。他成立了一家科技公司,專註於開發新型軟件。在他的領導下,公司迅速發展起來,成為了一家成功的科技企業。
# 李明的成功並不是偶然的。他勤奮、堅韌、勇於冒險,不斷學習和改進自己。他的成功也證明了,只要努力奮鬥,任何人都有可能取得成功。

# 第三輪對話 3rd dialogue turn
response, history = model.chat(tokenizer, "給這個故事起一個標題", history=history)
print(response)
# 《奮鬥創業:一個年輕人的成功之路》
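
The chat interface above can also be wrapped in a simple interactive loop; the following is an illustrative sketch (not from the original documentation) that reuses the model and tokenizer loaded in the Quickstart:

# Minimal interactive loop around model.chat (illustrative sketch; assumes the
# model and tokenizer from the Quickstart above are already loaded).
history = None
while True:
    query = input("User> ").strip()
    if query.lower() in ("exit", "quit", ""):
        break
    response, history = model.chat(tokenizer, query, history=history)
    print("Qwen>", response)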

關於更多的使用說明,請參考我們的Github repo獲取更多信息。

For more usage instructions, please refer to our GitHub repo.

Tokenizer
註:作為術語的「tokenization」在中文中尚無共識的概念對應,本文檔采用英文表達以利說明。

Note: as a term, "tokenization" has no commonly agreed Chinese equivalent, so this document uses the English word for clarity.

基於tiktoken的分詞器有別於其他分詞器,比如sentencepiece分詞器。尤其在微調階段,需要特別註意特殊token的使用。關於tokenizer的更多信息,以及微調時涉及的相關使用,請參閱文檔。

Our tokenizer based on tiktoken is different from other tokenizers, e.g., sentencepiece tokenizer. You need to pay attention to special tokens, especially in finetuning. For more detailed information on the tokenizer and related use in fine-tuning, please refer to the documentation.
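
As a small illustration (a sketch for reference, not from the original documentation), the tokenizer can be exercised through the standard Hugging Face interface:

from transformers import AutoTokenizer

# Illustrative sketch: basic encode/decode with the tiktoken-based tokenizer.
# Handling of special tokens during fine-tuning is covered in the linked documentation.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)

ids = tokenizer.encode("通義千問 Qwen-7B-Chat")
print(ids)                    # token ids produced by the BPE vocabulary
print(tokenizer.decode(ids))  # round-trips back to the original text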

量化 (Quantization)
用法 (Usage)
請註意:我們更新量化方案為基於AutoGPTQ的量化,提供Qwen-7B-Chat的Int4量化模型點擊這裏。相比此前方案,該方案在模型評測效果幾乎無損,且存儲需求更低,推理速度更優。

Note: We have updated our quantization solution to one based on AutoGPTQ, and released an Int4 quantized model for Qwen-7B-Chat (click here). Compared with the previous solution, it is nearly lossless in model evaluation while requiring less memory and delivering faster inference.

以下我們提供示例說明如何使用Int4量化模型。在開始使用前,請先保證滿足AutoGPTQ的要求,並使用源代碼安裝(由於最新支持Qwen的代碼未發布到PyPI):

Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of AutoGPTQ and install it from source (the code supporting Qwen is not yet included in the latest PyPI release):

git clone https://github.com/PanQiWei/AutoGPTQ && cd AutoGPTQ
pip install .

隨後便能輕松讀取量化模型:

Then you can load the quantized model easily as shown below

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# load the tokenizer shipped with the quantized checkpoint
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
model = AutoGPTQForCausalLM.from_quantized("Qwen/Qwen-7B-Chat-Int4", device_map="auto", trust_remote_code=True, use_safetensors=True).eval()

推理方法和基礎用法類似,但註意需要從外部傳入generation config:

To run inference, it is similar to the basic usage demonstrated above, but remember to pass in the generation configuration explicitly:

from transformers import GenerationConfig
config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
response, history = model.chat(tokenizer, "Hi", history=None, generation_config=config)

效果評測 (Performance)
我們對BF16和Int4模型在基準評測上做了測試,發現量化模型效果損失較小,結果如下所示:

We illustrate the model performance of both BF16 and Int4 models on the benchmark, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below:

Quantization MMLU C-Eval (val) GSM8K HumanEval
BF16 53.9 54.2 41.1 24.4
Int4 52.6 52.9 38.1 23.8
推理速度 (Inference Speed)
我們測算了BF16和Int4模型生成2048和8192個token的平均推理速度。如圖所示:

We measured the average inference speed of generating 2048 and 8192 tokens under BF16 precision and Int4 quantization level, respectively.

Quantization Speed (2048 tokens, tokens/s) Speed (8192 tokens, tokens/s)
BF16 30.53 28.51
Int4 45.60 33.83
具體而言,我們記錄在長度為1的上下文的條件下生成8192個token的性能。評測運行於單張A100-SXM4-80G GPU,使用PyTorch 2.0.1和CUDA 11.4。推理速度是生成8192個token的速度均值。

In detail, the setting of profiling is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the generated 8192 tokens.

顯存使用 (GPU Memory Usage)
我們還測算了BF16和Int4模型編碼2048個token及生成8192個token的峰值顯存占用情況。結果如下所示:

We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context), under the BF16 and Int4 quantization levels respectively. The results are shown below.

Quantization Level Peak Usage for Encoding 2048 Tokens Peak Usage for Generating 8192 Tokens
BF16 18.99GB 24.40GB
Int4 10.20GB 15.61GB
上述性能測算使用此腳本完成。

The above speed and memory profiling are conducted using this script.
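
For readers who want to reproduce a comparable measurement, the following is a rough sketch (an approximation, not the official profiling script) that times generation and records peak GPU memory with standard PyTorch utilities:

import time
import torch

# Rough profiling sketch (not the official script): generate max_new_tokens from a
# short context, then report the average speed and the peak GPU memory usage.
def profile_generation(model, tokenizer, max_new_tokens=8192):
    torch.cuda.reset_peak_memory_stats()
    input_ids = tokenizer("你好", return_tensors="pt").input_ids.to(model.device)
    torch.cuda.synchronize()
    start = time.time()
    # min_new_tokens forces the full length so an early EOS does not cut the run short
    output = model.generate(input_ids, max_new_tokens=max_new_tokens,
                            min_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    new_tokens = output.shape[1] - input_ids.shape[1]
    print(f"speed: {new_tokens / elapsed:.2f} tokens/s")
    print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024 ** 3:.2f} GB")

Exact numbers depend on the GPU, driver, and library versions, so small deviations from the tables above are expected.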

模型細節(Model)
與Qwen-7B預訓練模型相同,Qwen-7B-Chat模型規模基本情況如下所示

Qwen-7B-Chat shares the same architecture as the pretrained Qwen-7B model; the details of the model architecture are listed as follows

Hyperparameter Value
n_layers 32
n_heads 32
d_model 4096
vocab size 151851
sequence length 2048
在位置編碼、FFN激活函數和normalization的實現方式上,我們也采用了目前最流行的做法, 即RoPE相對位置編碼、SwiGLU激活函數、RMSNorm(可選安裝flash-attention加速)。

在分詞器方面,相比目前主流開源模型以中英詞表為主,Qwen-7B-Chat使用了約15萬token大小的詞表。 該詞表在GPT-4使用的BPE詞表cl100k_base基礎上,對中文、多語言進行了優化,在對中、英、代碼數據的高效編解碼的基礎上,對部分多語言更加友好,方便用戶在不擴展詞表的情況下對部分語種進行能力增強。 詞表對數字按單個數字位切分。調用較為高效的tiktoken分詞庫進行分詞。

For position encoding, FFN activation function, and normalization calculation methods, we adopt the prevalent practices, i.e., RoPE relative position encoding, SwiGLU for activation function, and RMSNorm for normalization (optional installation of flash-attention for acceleration).

For tokenization, unlike most mainstream open-source models, whose vocabularies focus on Chinese and English, Qwen-7B-Chat uses a vocabulary of over 150K tokens. Built on the cl100k_base BPE vocabulary used by GPT-4, the vocabulary is optimized for Chinese and multilingual text: it encodes Chinese, English, and code data efficiently, and is also friendlier to a number of other languages, so users can strengthen those languages directly without extending the vocabulary. Numbers are split into single digits, and tokenization is performed with the efficient tiktoken library.
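
The digit-by-digit splitting mentioned above can be checked directly; this is an illustrative sketch that reuses the tokenizer loaded earlier:

# Illustrative check of digit-by-digit splitting (reusing the tokenizer loaded above).
ids = tokenizer.encode("1234567890")
print(len(ids), [tokenizer.decode([i]) for i in ids])  # expected: one token per digit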

評測效果(Evaluation)
對於Qwen-7B-Chat模型,我們同樣評測了常規的中文理解(C-Eval)、英文理解(MMLU)、代碼(HumanEval)和數學(GSM8K)等權威任務,同時包含了長序列任務的評測結果。由於Qwen-7B-Chat模型經過對齊後,激發了較強的外部系統調用能力,我們還進行了工具使用能力方面的評測。

提示:由於硬件和框架造成的舍入誤差,復現結果如有波動屬於正常現象。

For Qwen-7B-Chat, we also evaluate the model on standard benchmarks for Chinese understanding (C-Eval), English understanding (MMLU), coding (HumanEval), and mathematics (GSM8K), as well as on long-context understanding. Since alignment gives Qwen-7B-Chat a fairly strong ability to call external systems, we additionally evaluate its tool-usage capability.

Note: Due to rounding errors caused by hardware and framework, differences in reproduced results are possible.

中文評測(Chinese Evaluation)
C-Eval
在C-Eval驗證集上,我們評價了Qwen-7B-Chat模型的zero-shot準確率

We demonstrate the zero-shot accuracy of Qwen-7B-Chat on the C-Eval validation set below:

Model Avg. Acc.
LLaMA2-7B-Chat 31.9
LLaMA2-13B-Chat 40.6
Chinese-Alpaca-2-7B 41.3
Chinese-Alpaca-Plus-13B 43.3
Baichuan-13B-Chat 50.4
ChatGLM2-6B-Chat 50.7
InternLM-7B-Chat 53.2
Qwen-7B-Chat 54.2
C-Eval測試集上,Qwen-7B-Chat模型的zero-shot準確率結果如下:

The zero-shot accuracy of Qwen-7B-Chat on C-Eval testing set is provided below:

Model Avg. STEM Social Sciences Humanities Others
Chinese-Alpaca-Plus-13B 41.5 36.6 49.7 43.1 41.2
Chinese-Alpaca-2-7B 40.3 - - - -
ChatGLM2-6B-Chat 50.1 46.4 60.4 50.6 46.9
Baichuan-13B-Chat 51.5 43.7 64.6 56.2 49.2
Qwen-7B-Chat 54.6 47.8 67.6 59.3 50.6
在7B規模模型上,經過人類指令對齊的Qwen-7B-Chat模型,準確率在同類相近規模模型中仍然處於前列。

Compared with other aligned models of comparable size, the human-aligned Qwen-7B-Chat remains among the top performers in C-Eval accuracy.

英文評測(English Evaluation)
MMLU
MMLU評測集上,Qwen-7B-Chat模型的zero-shot準確率如下,效果同樣在同類對齊模型中同樣表現較優。

The zero-shot accuracy of Qwen-7B-Chat on MMLU is provided below. Qwen-7B-Chat again ranks near the top among human-aligned models of comparable size.

Model Avg. Acc.
ChatGLM2-6B-Chat 45.5
LLaMA2-7B-Chat 47.0
InternLM-7B-Chat 50.8
Baichuan-13B-Chat 52.1
ChatGLM2-12B-Chat 52.1
Qwen-7B-Chat 53.9
代碼評測(Coding Evaluation)
Qwen-7B-Chat在HumanEval的zero-shot Pass@1效果如下

The zero-shot Pass@1 of Qwen-7B-Chat on HumanEval is demonstrated below

Model Pass@1
LLaMA2-7B-Chat 12.2
InternLM-7B-Chat 14.0
Baichuan-13B-Chat 16.5
LLaMA2-13B-Chat 18.9
Qwen-7B-Chat 24.4
數學評測(Mathematics Evaluation)
在評測數學能力的GSM8K上,Qwen-7B-Chat的準確率結果如下

The accuracy of Qwen-7B-Chat on GSM8K is shown below

Model Zero-shot Acc. 4-shot Acc.
ChatGLM2-6B-Chat - 28.0
LLaMA2-7B-Chat 20.4 28.2
LLaMA2-13B-Chat 29.4 36.7
InternLM-7B-Chat 32.6 34.5
Baichuan-13B-Chat - 36.3
ChatGLM2-12B-Chat - 38.1
Qwen-7B-Chat 41.1 43.5
長序列評測(Long-Context Understanding)
通過NTK插值,LogN註意力縮放可以擴展Qwen-7B-Chat的上下文長度。在長文本摘要數據集VCSUM上(文本平均長度在15K左右),Qwen-7B-Chat的Rouge-L結果如下:

(若要啟用這些技巧,請將config.json裏的use_dynamic_ntk和use_logn_attn設置為true)

We apply NTK-aware interpolation and LogN attention scaling to extend the context length of Qwen-7B-Chat. The Rouge-L results of Qwen-7B-Chat on the long-text summarization dataset VCSUM (whose average text length is around 15K) are shown below:

(To use these tricks, please set use_dynamic_ntk and use_logn_attn to true in config.json; a code-level sketch follows the table below.)

Model VCSUM (zh)
GPT-3.5-Turbo-16k 16.0
LLaMA2-7B-Chat 0.2
InternLM-7B-Chat 13.0
ChatGLM2-6B-Chat 16.3
Qwen-7B-Chat 16.6
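
Instead of editing config.json on disk, the same two flags can be flipped on the loaded config object; the snippet below is an illustrative sketch, not the only supported way:

from transformers import AutoConfig, AutoModelForCausalLM

# Illustrative sketch: enable the long-context tricks on the loaded config
# rather than editing config.json by hand.
config = AutoConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
config.use_dynamic_ntk = True   # NTK-aware interpolation
config.use_logn_attn = True     # LogN attention scaling
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", config=config, device_map="auto", trust_remote_code=True
).eval()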
工具使用能力的評測(Tool Usage)
ReAct Prompting
千問支持通過 ReAct Prompting 調用插件/工具/API。ReAct 也是 LangChain 框架采用的主要方式之一。在我們開源的、用於評估工具使用能力的評測基準上,千問的表現如下:

Qwen-7B-Chat supports calling plugins/tools/APIs through ReAct Prompting. ReAct is also one of the main approaches used by the LangChain framework. In our evaluation benchmark for assessing tool usage capabilities, Qwen-7B-Chat’s performance is as follows:

Model Tool Selection (Acc.↑) Tool Input (Rouge-L↑) False Positive Error↓
GPT-4 95% 0.90 15%
GPT-3.5 85% 0.88 75%
Qwen-7B-Chat 99% 0.89 9.7%
評測基準中出現的插件均沒有出現在千問的訓練集中。該基準評估了模型在多個候選插件中選擇正確插件的準確率、傳入插件的參數的合理性、以及假陽率。假陽率(False Positive)定義:在處理不該調用插件的請求時,錯誤地調用了插件。

The plugins that appear in the evaluation set do not appear in the training set of Qwen-7B-Chat. This benchmark evaluates the accuracy of the model in selecting the correct plugin from multiple candidate plugins, the rationality of the parameters passed into the plugin, and the false positive rate. False Positive: Incorrectly invoking a plugin when it should not have been called when responding to a query.

關於 ReAct Prompting 的 prompt 怎麽寫、怎麽使用,請參考 ReAct 樣例說明。使用工具能使模型更好地完成任務。基於千問的工具使用能力,我們能實現下圖所展示的效果:

For how to write and use prompts for ReAct Prompting, please refer to the ReAct examples. Using tools enables the model to complete tasks more effectively.
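
As a rough illustration of the prompt structure (the authoritative template and parsing code are in the ReAct examples linked above; the tool shown here is hypothetical):

# Hypothetical ReAct-style prompt sketch; the official template lives in the ReAct examples.
TOOL_DESC = "search: a search engine. Input: a search query."

react_prompt = f"""Answer the following questions as best you can. You have access to the following tools:

{TOOL_DESC}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [search]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: 北京今天天氣怎麼樣?"""

response, _ = model.chat(tokenizer, react_prompt, history=None)
print(response)  # expect a Thought / Action / Action Input block that can be parsed to call the tool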

Huggingface Agent
千問還具備作為 HuggingFace Agent 的能力。它在 Huggingface 提供的run模式評測基準上的表現如下:

Qwen-7B-Chat also has the capability to be used as a HuggingFace Agent. Its performance on the run-mode benchmark provided by HuggingFace is as follows:

Model Tool Selection↑ Tool Used↑ Code↑
GPT-4 100 100 97.41
GPT-3.5 95.37 96.30 87.04
StarCoder-15.5B 87.04 87.96 68.89
Qwen-7B-Chat 90.74 92.59 74.07
FAQ
如遇到問題,敬請查閱FAQ以及issue區,如仍無法解決再提交issue。

If you run into problems, please consult the FAQ and existing issues to look for a solution before opening a new issue.

使用協議(License Agreement)
我們的代碼和模型權重對學術研究完全開放,並支持商用。請查看LICENSE了解具體的開源協議細節。如需商用,請填寫問卷申請。

Our code and model weights are fully open for academic research and also support commercial use. Check the LICENSE for details. For commercial use, please fill out the form to apply.

聯系我們(Contact Us)
如果你想給我們的研發團隊和產品團隊留言,請通過郵件([email protected])聯系我們。

If you would like to leave a message for our research or product team, feel free to email us at [email protected].