TheBloke/BLOOMChat-176B-v1-GPTQ

SambaNova Systems’ BLOOMChat 1.0

These files are GPTQ 4-bit model files for SambaNova Systems’ BLOOMChat 1.0.

It is the result of quantising to 4-bit using AutoGPTQ.

This is a BIG model! 2 x 80GB or 3 x 48GB GPUs are required.

Important note: files must be joined before use

It is not currently possible to shard GPTQ files, so the model is one single 94GB safetensors file.

Hugging Face Hub has a 50GB per-file limit. I have therefore been forced to split the file into three parts for upload.

I did this using the simple *nix command split.

To join the files on any *nix system, you can run:

cat gptq_model-4bit--1g.JOINBEFOREUSE.split-*.safetensors > gptq_model-4bit--1g.safetensors

To join the files on Windows, open a Command Prompt and run:

COPY /B gptq_model-4bit--1g.JOINBEFOREUSE.split-a.safetensors + gptq_model-4bit--1g.JOINBEFOREUSE.split-b.safetensors + gptq_model-4bit--1g.JOINBEFOREUSE.split-c.safetensors gptq_model-4bit--1g.safetensors

For Python code to join the files, see the Python section below.

The SHA256SUM of the joined file will be:

9cc359fa266d2523566e818ca58e8782718b25cc2e714cb5449b7841e1c59830  gptq_model-4bit--1g.safetensors

Once you have the joined file, you can safely delete gptq_model-4bit--1g.JOINBEFOREUSE.split-*.safetensors.

Repositories available

Two files provided - separate branches

  • Main branch: gptq_model-4bit--1g.safetensors
    • Group Size = None
    • Desc Act (act-order) = True
    • This version will use the least possible VRAM, and should have higher inference performance in CUDA mode
  • Branch group_size_128g: gptq_model-4bit-128g.safetensors
    • Group Size = 128g
    • Desc Act (act-order) = True
    • This version will use more VRAM, but that shouldn’t be a problem as it should still fit within 2 x 80GB or 3 x 48GB cards.
    • However, CUDA inference performance is likely to be a lot slower, possibly necessitating the use of Triton mode.

By default you will download the first file, unless you choose to download from branch group_size_128g.

Prompt template:

<human>: prompt
<bot>:

How to easily download and use this model in text-generation-webui

Please make sure you’re using the latest version of text-generation-webui.

Note 1: this is a non-Llama model which cannot be used with ExLlama. Use Loader: AutoGPTQ.

Note 2: As described above, you must join the files after downloading and before loading in text-generation-webui.

  1. Click the Model tab.
  2. Under Download custom model or LoRA, enter TheBloke/BLOOMChat-176B-v1-GPTQ.
    • If you would rather download the group_size 128g version, enter TheBloke/BLOOMChat-176B-v1-GPTQ:group_size_128g
  3. Click Download.
  4. The model will start downloading. Once it’s finished it will say “Done”. This is a huge model so it may take a while!
  5. Now follow the steps described above to join the model files to get a single .safetensors file.
  6. Untick Autoload model.
  7. In the top left, click the refresh icon next to Model.
  8. In the Model dropdown, choose the model you just downloaded: BLOOMChat-176B-v1-GPTQ
  9. Make sure Loader is set to AutoGPTQ.
  10. This model cannot load on one GPU, so you should set GPU Memory accordingly:
    • If using two 80GB GPUs, try: GPU0 = 60GB, GPU1 = 79GB
    • If using three 48GB GPUs, try: GPU0 = 30GB, GPU1 = 47GB, GPU2 = 47GB
  11. Click Save settings to save your settings, and then Reload to load the model.
  12. The model will load, and is now ready for use!
  13. Once you’re ready, click the Text Generation tab and enter a prompt to get started!

How to use this GPTQ model from Python code

First make sure you have AutoGPTQ installed:

GITHUB_ACTIONS=true pip install auto-gptq

Because this model has to be joined locally, you must first download it. Example download code:

from huggingface_hub import snapshot_download
snapshot_download(repo_id="TheBloke/BLOOMChat-176B-v1-GPTQ",
  local_dir="/workspace/models/BLOOMChat-176B-v1-GPTQ",
  local_dir_use_symlinks=False)

If you want to download the group_size 128g file instead, add revision="group_size_128g" to the above command.
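
For example, a sketch of the same call with the revision parameter added (the split files on that branch presumably follow the gptq_model-4bit-128g naming rather than gptq_model-4bit--1g, so adjust the join step below to match):

from huggingface_hub import snapshot_download
snapshot_download(repo_id="TheBloke/BLOOMChat-176B-v1-GPTQ",
  revision="group_size_128g",
  local_dir="/workspace/models/BLOOMChat-176B-v1-GPTQ",
  local_dir_use_symlinks=False)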

Now join the three split files, which can be done with the following Python code:

import glob
import os

# Directory the split files were downloaded to (see the snapshot_download step above)
model_dir = "/workspace/models/BLOOMChat-176B-v1-GPTQ"

# Get the list of all split files, sorted so they are joined in the correct order
files = sorted(glob.glob(os.path.join(model_dir, 'gptq_model-4bit--1g.JOINBEFOREUSE.split-*.safetensors')))

# Open the output file in binary write mode and concatenate each part into it
with open(os.path.join(model_dir, 'gptq_model-4bit--1g.safetensors'), 'wb') as outfile:
    for filename in files:
        with open(filename, 'rb') as infile:
            outfile.write(infile.read())
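
Optionally, you can verify the joined file against the SHA256 checksum given earlier. This is a minimal sketch using Python's hashlib; it reads the ~94GB file in chunks, so it will take a while:

import hashlib

sha256 = hashlib.sha256()
with open('/workspace/models/BLOOMChat-176B-v1-GPTQ/gptq_model-4bit--1g.safetensors', 'rb') as f:
    # Read in 16MB chunks so the whole file is never held in memory at once
    for chunk in iter(lambda: f.read(16 * 1024 * 1024), b''):
        sha256.update(chunk)

print(sha256.hexdigest())
# Expected: 9cc359fa266d2523566e818ca58e8782718b25cc2e714cb5449b7841e1c59830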

Then try the following example code:

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Use the local path you downloaded the model to and joined the split files in
model_name_or_path = "/workspace/models/BLOOMChat-176B-v1-GPTQ"
model_basename = "gptq_model-4bit--1g"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        max_memory={0: '60GiB', 1: '79GiB'}, # max_memory is for 2 x 80GB GPUs; adjust if your config is different!
        use_safetensors=True,
        trust_remote_code=False,
        use_triton=use_triton,
        quantize_config=None)

prompt = "Write a story about llamas"
prompt_template=f'''<human>: {prompt}
<bot>:
'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])
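
If you downloaded the group_size_128g branch instead, the example above should only need the model basename changed, and as noted earlier Triton mode may be needed for acceptable speed with group_size plus desc_act. A sketch of the two lines to change:

model_basename = "gptq_model-4bit-128g"
use_triton = True  # CUDA inference with group_size + desc_act is likely to be much slower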

Provided files

Main branch:

gptq_model-4bit--1g.safetensors

This will work with AutoGPTQ. It is untested with GPTQ-for-LLaMa. It will not work with ExLlama.

It was created with group_size none (-1) to reduce VRAM usage, and with --act-order (desc_act) to improve accuracy of responses.

  • gptq_model-4bit--1g.safetensors
    • Works with AutoGPTQ in CUDA or Triton modes.
    • Does NOT work with ExLlama as it’s not a Llama model.
    • Untested with GPTQ-for-LLaMa.
    • Works with text-generation-webui, including one-click-installers.
    • Parameters: Groupsize = -1. Act Order / desc_act = True.
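
For reference, these parameters correspond roughly to the following AutoGPTQ quantisation config. This is a sketch only, not the exact command used to produce the files:

from auto_gptq import BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,         # 4-bit quantisation
    group_size=-1,  # no grouping (None / -1), to minimise VRAM usage
    desc_act=True   # act-order, to improve response accuracy
)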

Branch group_size_128g

gptq_model-4bit-128g.safetensors

This will work with AutoGPTQ. It is untested with GPTQ-for-LLaMa. It will not work with ExLlama.

It was created with both group_size 128g and --act-order (desc_act) for even higher inference accuracy, at the cost of increased VRAM usage. Because we already need 2 x 80GB or 3 x 48GB GPUs, I don’t expect the increased VRAM usage to change the GPU requirements.

  • gptq_model-4bit-128g.safetensors
    • Works with AutoGPTQ in CUDA or Triton modes.
    • Does NOT work with ExLlama as it’s not a Llama model.
    • Untested with GPTQ-for-LLaMa.
    • Works with text-generation-webui, including one-click-installers.
    • Parameters: Groupsize = 128. Act Order / desc_act = True.

Discord

For further support, and discussions on these models and AI in general, join us at:

TheBloke AI’s Discord server

Thanks, and how to contribute.

Thanks to the chirper.ai team!

I’ve had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.

If you’re able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.

Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

Special thanks to: Luke from CarbonQuill, Aemon Algiz, Dmitriy Samsonov.

Patreon special mentions: zynix , ya boyyy, Trenton Dambrowitz, Imad Khwaja, Alps Aficionado, chris gileta, John Detwiler, Willem Michiel, RoA, Mano Prime, Rainer Wilmers, Fred von Graf, Matthew Berman, Ghost , Nathan LeClaire, Iucharbius , Ai Maven, Illia Dulskyi, Joseph William Delisle, Space Cruiser, Lone Striker, Karl Bernard, Eugene Pentland, Greatston Gnanesh, Jonathan Leane, Randy H, Pierre Kircher, Willian Hasse, Stephen Murray, Alex , terasurfer , Edmond Seymore, Oscar Rangel, Luke Pendergrass, Asp the Wyvern, Junyu Yang, David Flickinger, Luke, Spiking Neurons AB, subjectnull, Pyrater, Nikolai Manek, senxiiz, Ajan Kanaga, Johann-Peter Hartmann, Artur Olbinski, Kevin Schuppel, Derek Yates, Kalila, K, Talal Aujan, Khalefa Al-Ahmad, Gabriel Puliatti, John Villwock, WelcomeToTheClub, Daniel P. Andersen, Preetika Verma, Deep Realms, Fen Risland, trip7s trip, webtim, Sean Connelly, Michael Levine, Chris McCloskey, biorpg, vamX, Viktor Bowallius, Cory Kujawski.

Thank you to all my generous patrons and donaters!

Original model card: SambaNova Systems’ BLOOMChat V1.0

BLOOMChat V1.0

BLOOMChat is a 176 billion parameter multilingual chat model. It is instruction tuned from BLOOM (176B) on assistant-style conversation datasets and supports conversation, question answering and generative answers in multiple languages.

Model Details

Model Description

Basic Information

Licensing

To increase accessibility and to support the open-source community, SambaNova is releasing BLOOMChat under a modified version of the Apache 2.0 license, which includes use-based restrictions from BLOOM’s RAIL license. While use-based restrictions are necessarily passed through, there are no blanket restrictions on reuse, distribution, commercialization or adaptation. Please review SambaNova’s BLOOMChat-176B License.

Uses

How to Get Started with the Model

Some example completions for English

Some example completions for Multilingual

Evaluation Graphs

Training Details

Bias, Risks, and Limitations

Like all LLMs, BLOOMChat has certain limitations:

  • Hallucination: BLOOMChat may sometimes generate responses that contain plausible-sounding but factually incorrect or irrelevant information.
  • Code Switching: The model might unintentionally switch between languages or dialects within a single response, affecting the coherence and understandability of the output.
  • Repetition: BLOOMChat may produce repetitive phrases or sentences, leading to less engaging and informative responses.
  • Coding and Math: The model’s performance in generating accurate code or solving complex mathematical problems may be limited.
  • Toxicity: BLOOMChat may inadvertently generate responses containing inappropriate or harmful content.

Acknowledgment

We would like to extend our gratitude to Together for their insightful technical discussions on overall project planning, data processing, model training, human evaluation experiment design, open-source endeavors, and their contributions on data processing code on OpenChatKit, OASST1, and Dolly 2.0.

We are grateful to the various researchers and open-source projects that have contributed to the development of BLOOMChat. We thank BigScience for providing the BLOOM model, which served as the base for our instruction tuning. We also thank LAION for their OIG dataset and OpenAssistant Conversations Dataset (OASST1), and Databricks for providing Dolly 2.0; these are the datasets that we instruction tuned on.

We appreciate lm-eval-harness and BigScience for their essential benchmarking contributions, which were very helpful in evaluating BLOOMChat’s performance. We appreciate the inspiration from the wave of various recent open-source chat models, including OpenAssistant-30B, LLaMA-Adapter-V2-65B, Vicuna-13b, Koala-13b, OASST-Pythia-12b, Alpaca-13b, ChatGLM-6b, FastChat-T5-3b, Dolly-v2-12b, LLaMA-13b, StableLM-Tuned-Alpha-7b, RedPajama-INCITE-Chat-7B-v0.1, RedPajama-INCITE-Chat-3B-v1, MPT-7B-Chat and so on. We look forward to witnessing the continued growth and success of open-source chat-based models.

We highly appreciate the hard work and dedication of these researchers and organizations towards the advancement of the open-source community. Their contributions were invaluable in the development of BLOOMChat, and we hope that our model can contribute to further advancements in the field.

Cite BLOOMChat

@software{bloomchat,
  title = {{BLOOMChat: a New Open Multilingual Chat LLM}},
  author = {SambaNova Systems, Together Computer},
  url = {https://huggingface.co/sambanovasystems/BLOOMChat-176B-v1},
  month = {5},
  year = {2023},
  version = {1.0},
}