Qwen/Qwen-7B

Introduction

Qwen-7B is the 7B-parameter version of the large language model series Qwen (abbr. of Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model pretrained on a very large volume of diverse data, including web texts, professional books, code, etc. Additionally, based on the pretrained Qwen-7B, we release Qwen-7B-Chat, a large-model-based AI assistant trained with alignment techniques. This repository is the one for Qwen-7B.

The features of Qwen-7B include:

Large-scale, high-quality training corpora: the model is pretrained on over 2.2 trillion tokens, including high-quality Chinese, English, multilingual, code, and mathematics data, covering both general and professional fields. The distribution of the pretraining corpus was optimized through extensive ablation experiments.
Competitive performance: Qwen-7B significantly surpasses existing open-source models of similar scale on multiple Chinese and English downstream evaluation tasks (covering commonsense reasoning, code, mathematics, translation, etc.), and is competitive with larger-scale models on several benchmarks. See the detailed evaluation results below.
More comprehensive vocabulary coverage: compared with other open-source models built mainly around Chinese and English vocabularies, Qwen-7B uses a vocabulary of over 150K tokens. This vocabulary is friendlier to multiple languages, enabling users to enhance and extend capabilities for certain languages directly, without expanding the vocabulary.

For more details about the open-source Qwen-7B model, please refer to the GitHub code repository.

Requirements

python 3.8 and above
pytorch 1.12 and above; 2.0 and above is recommended
CUDA 11.4 and above is recommended (relevant for GPU users, flash-attention users, etc.)
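As a quick sanity check, the snippet below is a minimal sketch that verifies these requirements in the current environment (it assumes torch is already installed; packaging ships as a transformers dependency):

import sys

import torch
from packaging import version

# Check interpreter and framework versions against the requirements above.
assert sys.version_info >= (3, 8), "Python 3.8+ is required"
assert version.parse(torch.__version__.split("+")[0]) >= version.parse("1.12"), \
    "PyTorch 1.12+ is required (2.0+ recommended)"
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA version:", torch.version.cuda)  # 11.4+ recommended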
Dependencies

To run Qwen-7B, please make sure the above requirements are met, and then execute the following pip command to install the dependent libraries:

pip install transformers==4.31.0 accelerate tiktoken einops

In addition, installing the flash-attention library is recommended for higher efficiency and lower memory usage.

git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .

The installs below are optional, and building them can be slow.

pip install csrc/layer_norm
pip install csrc/rotary

Quickstart

You can easily call the model with the following code:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# Note: the default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="cpu", trust_remote_code=True).eval()
# use auto mode: automatically select precision based on the device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True).eval()

# Specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

inputs = tokenizer('蒙古國的首都是烏蘭巴托(Ulaanbaatar)\n冰島的首都是雷克雅未克(Reykjavik)\n埃塞俄比亞的首都是', return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

# Expected output: 蒙古國的首都是烏蘭巴托(Ulaanbaatar)\n冰島的首都是雷克雅未克(Reykjavik)\n埃塞俄比亞的首都是亞的斯亞貝巴(Addis Ababa)…
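Decoding hyperparameters can also be adjusted on the loaded generation config. A brief sketch follows; top_p and max_new_tokens are standard transformers GenerationConfig fields, and the values shown are illustrative, not the model's defaults:

# Illustrative overrides of standard GenerationConfig fields (example values).
model.generation_config.top_p = 0.8
model.generation_config.max_new_tokens = 128
pred = model.generate(**inputs)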

For more usage instructions, please refer to our GitHub repo.

Tokenizer

Our tokenizer, based on tiktoken, is different from other tokenizers such as the sentencepiece tokenizer. You need to pay particular attention to special tokens, especially during fine-tuning. For more detailed information on the tokenizer and its use in fine-tuning, please refer to the documentation.
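A minimal sketch of these points is shown below; the eod_id attribute is an assumption about the remote tokenizer code, and the exact special-token conventions are described in the linked documentation:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

# Ordinary text encodes and decodes as expected.
ids = tokenizer("Hello, Qwen!")["input_ids"]
print(ids)
print(tokenizer.decode(ids))

# When preparing fine-tuning data, append end-of-document markers by token id
# rather than pasting special-token strings into the raw text.
print(tokenizer.eod_id)  # assumption: the remote code exposes eod_id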

Model

The details of the model architecture of Qwen-7B are listed as follows:

Hyperparameter Value
n_layers 32
n_heads 32
d_model 4096
vocab size 151851
sequence length 2048
For position encoding, the FFN activation function, and normalization, we adopt the prevalent practices, i.e., RoPE relative position encoding, SwiGLU as the activation function, and RMSNorm for normalization (with optional flash-attention acceleration).

For tokenization, compared to the current mainstream open-source models built around Chinese and English vocabularies, Qwen-7B uses a vocabulary of over 150K tokens. It extends the cl100k_base BPE vocabulary used by GPT-4 with optimizations for Chinese and multilingual text: it encodes Chinese, English, and code data efficiently while remaining friendly to many other languages, enabling users to enhance capabilities for certain languages directly, without expanding the vocabulary. The vocabulary splits numbers into single digits, and tokenization is performed with the efficient tiktoken library.

We randomly sampled 1 million documents per language to compare the encoding compression rates of different models (with XLM-R, which supports 100 languages, as the baseline value of 1; lower is better). The results are shown in the figure.

As can be seen, while preserving efficient decoding of Chinese, English, and code, Qwen-7B also achieves a high compression rate on many widely spoken languages (Thai th, Hebrew he, Arabic ar, Korean ko, Vietnamese vi, Japanese ja, Turkish tr, Indonesian id, Polish pl, Russian ru, Dutch nl, Portuguese pt, Italian it, German de, Spanish es, French fr, etc.), giving the model strong scalability as well as high training and inference efficiency in these languages.
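This comparison can be approximated with the sketch below; assumptions: the "compression rate" is taken here as the Qwen-7B token count relative to the XLM-R token count on the same text, xlm-roberta-base stands in for XLM-R, and the sample text is illustrative:

from transformers import AutoTokenizer

qwen = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
xlmr = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Illustrative single sample; the real comparison used 1M documents per language.
text = "สวัสดีครับ นี่คือตัวอย่างข้อความภาษาไทย"
ratio = len(qwen.tokenize(text)) / len(xlmr.tokenize(text))
print(f"tokens relative to the XLM-R baseline: {ratio:.2f}")  # lower is better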

For pretraining data, Qwen-7B uses part of the open-source generic corpus on the one hand, and a massive amount of accumulated web corpus and high-quality text content on the other. The corpus exceeds 2.2T tokens after deduplication and filtering, encompassing web text, encyclopedias, books, code, mathematics, and various vertical domains.

Evaluation

Chinese Evaluation

C-Eval

C-Eval is a common evaluation benchmark for testing the knowledge and common-sense capability of pretrained models in Chinese. It covers 52 subjects in four major directions: humanities, social sciences, STEM, and other specialties. Following standard practice, we use the development-set samples as the few-shot source and report the 5-shot accuracy of the Qwen-7B pretrained model on the validation and test sets.
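For illustration, a 5-shot prompt for this kind of multiple-choice evaluation might be assembled as in the hypothetical sketch below (the field names are assumptions, not C-Eval's actual schema; the official scripts are linked in the Reproduction section):

def build_five_shot_prompt(dev_examples, question, choices):
    """dev_examples: dicts with 'question', 'choices', 'answer' (hypothetical schema)."""
    parts = []
    for ex in dev_examples[:5]:  # few-shot demonstrations drawn from the dev set
        options = "\n".join(f"{k}. {v}" for k, v in ex["choices"].items())
        parts.append(f"{ex['question']}\n{options}\n答案:{ex['answer']}")
    options = "\n".join(f"{k}. {v}" for k, v in choices.items())
    parts.append(f"{question}\n{options}\n答案:")
    return "\n\n".join(parts)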

The accuracy of Qwen-7B and other models on the C-Eval validation set is compared below:

Model Avg.
Alpaca-7B 28.9
Vicuna-7B 31.2
ChatGLM-6B 37.1
Baichuan-7B 42.7
ChatGLM2-6B 50.9
InternLM-7B 53.4
ChatGPT 53.5
Claude-v1.3 55.5
Qwen-7B 60.8
The performance of Qwen-7B and other models on the C-Eval test set is compared in the following table:

Model Avg. Avg. (Hard) STEM Social Sciences Humanities Others
ChatGLM-6B 38.9 29.2 33.3 48.3 41.3 38.0
Chinese-Alpaca-Plus-13B 41.5 30.5 36.6 49.7 43.1 41.2
Baichuan-7B 42.8 31.5 38.2 52.0 46.2 39.3
WestlakeLM-19B 44.6 34.9 41.6 51.0 44.3 44.5
AndesLM-13B 46.0 29.7 38.1 61.0 51.0 41.9
BatGPT-15B-sirius 47.0 31.9 42.7 57.5 48.6 43.6
ChatGLM2-6B 51.7 37.1 48.6 60.5 51.3 49.8
InternLM-7B 52.8 37.1 48.0 67.4 55.4 45.8
Baichuan-13B 53.6 36.7 47.0 66.8 57.3 49.8
Claude-v1.3 54.2 39.0 51.9 61.7 52.1 53.7
ChatGPT 54.4 41.4 52.9 61.8 50.9 53.6
Qwen-7B 59.6 41.0 52.8 74.1 63.1 55.2
As can be seen, Qwen-7B achieves the best performance among existing models of similar scale, and is competitive even with larger-scale models.

English Evaluation

MMLU

MMLU is currently one of the most recognized benchmarks for evaluating English comprehension, covering 57 subtasks across different academic fields and difficulty levels. The MMLU 5-shot accuracy of Qwen-7B is shown in the following table:

Model Avg. STEM Social Sciences Humanities Others
LLaMA-7B 35.1 30.5 38.3 34.0 38.1
Baichuan-7B 42.3 35.6 48.9 38.4 48.1
LLaMA2-7B 45.3 36.4 51.2 42.9 52.2
LLaMA-13B 46.9 35.8 53.8 45.0 53.3
ChatGLM2-6B 47.9 41.2 54.4 43.7 54.5
InternLM-7B 51.0 - - - -
Baichuan-13B 51.6 41.6 60.9 47.4 58.5
LLaMA2-13B 54.8 44.1 62.6 52.8 61.1
ChatGLM2-12B 56.2 48.2 65.1 52.6 60.9
Qwen-7B 56.7 47.6 65.9 51.5 64.7
On English tasks, Qwen-7B likewise surpasses other comparable open-source pretrained models, and is competitive with larger versions of other models.

Coding Evaluation

We compared the code capabilities of pretrained models on HumanEval (0-shot), with the following results:
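For reference, Pass@1 here is the fraction of problems solved with a single sample. The general unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) reduces to c/n for k=1:

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n = samples per problem, c = correct samples, k = budget."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 0.3, i.e. c/n when k=1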

Model Pass@1
Baichuan-7B 9.2
ChatGLM2-6B 9.2
InternLM-7B 10.4
LLaMA-7B 10.5
LLaMA2-7B 12.8
Baichuan-13B 12.8
LLaMA-13B 15.8
MPT-7B 18.3
LLaMA2-13B 18.3
Qwen-7B 24.4
Mathematics Evaluation

We evaluated the math capabilities of pretrained models on the widely used GSM8K dataset (8-shot), with the following results:

Model Acc.
MPT-7B 6.8
Falcon-7B 6.8
Baichuan-7B 9.7
LLaMA-7B 11.0
LLaMA2-7B 14.6
LLaMA-13B 17.8
Baichuan-13B 26.6
LLaMA2-13B 28.7
InternLM-7B 31.2
ChatGLM2-6B 32.4
ChatGLM2-12B 40.9
Qwen-7B 51.6
Translation Evaluation

We compared the translation capabilities of pretrained models on the WMT22 zh-en and en-zh test sets (5-shot BLEU), with the following results:
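Scoring follows the usual corpus-BLEU recipe; below is a minimal sketch using sacrebleu (an additional dependency, not listed above; WMT22 data loading is elided and the sentences are placeholders):

import sacrebleu

# Hypothetical model outputs and references.
hypotheses = ["The cat sat on the mat."]
references = [["The cat is sitting on the mat."]]  # one reference stream

print(sacrebleu.corpus_bleu(hypotheses, references).score)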

Model Avg. zh-en en-zh
InternLM-7B 11.8 9.0 14.5
LLaMA-7B 12.7 16.7 8.7
LLaMA-13B 15.8 19.5 12.0
LLaMA2-7B 19.9 21.9 17.9
Bloom-7B 20.3 19.1 21.4
LLaMA2-13B 23.3 22.4 24.2
PolyLM-13B 23.6 20.2 27.0
Baichuan-7B 24.6 22.6 26.6
Qwen-7B 27.5 24.3 30.6
Long-Context Evaluation

We introduce techniques such as NTK-aware interpolation, LogN attention scaling, and window attention to extend the model's context length to over 8K tokens. We conduct language-modeling experiments on the arXiv dataset with perplexity (PPL) evaluation at different sequence lengths; the results are shown below.

(To enable NTK interpolation and LogN attention scaling, set use_dynamic_ntk and use_logn_attn to true in config.json.)
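Equivalently, rather than editing config.json by hand, the same flags can be set on the loaded config before instantiating the model; a sketch, with the key names taken from the note above:

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
config.use_dynamic_ntk = True  # NTK-aware interpolation
config.use_logn_attn = True    # LogN attention scaling

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B", config=config, device_map="auto", trust_remote_code=True
).eval()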

Model / Sequence Length (PPL) 1024 2048 4096 8192 16384
Qwen-7B 4.23 3.78 39.35 469.81 2645.09
+ dynamic_ntk 4.23 3.78 3.59 3.66 5.71
+ dynamic_ntk + logn 4.23 3.78 3.58 3.56 4.62
+ dynamic_ntk + logn + window_attn 4.23 3.78 3.58 3.49 4.32
Reproduction

We provide evaluation scripts so that the model's performance can be reproduced; see the link for details. Note: small fluctuations in reproduced results are normal, owing to rounding differences caused by hardware and frameworks.

FAQ

If you run into problems, please consult the FAQ and the existing issues for a solution before opening a new issue.

License Agreement

Our code and model weights are fully open for academic research, and commercial use is supported. Check the LICENSE for more details about the license. For commercial use, please fill out the questionnaire to apply.

Contact Us

If you would like to leave a message for our research or product team, feel free to email us at [email protected].