qCammel 70 - GGML
- Model creator: augtoma
- Original model: qCammel 70
Description
This repo contains GGML format model files for augtoma’s qCammel 70.
GPU acceleration is now available for Llama 2 70B GGML files, with both CUDA (NVIDIA) and Metal (macOS). The following clients/libraries are known to work with these files, including with CUDA GPU acceleration:
- llama.cpp, commit e76d630 and later.
- text-generation-webui, the most widely used web UI.
- KoboldCpp, version 1.37 and later. A powerful GGML web UI, especially good for story telling.
- LM Studio, a fully featured local GUI with GPU acceleration for both Windows and macOS. Use 0.1.11 or later for macOS GPU acceleration with 70B models.
- llama-cpp-python, version 0.1.77 and later. A Python library with LangChain support, and OpenAI-compatible API server.
- ctransformers, version 0.2.15 and later. A Python library with LangChain support, and OpenAI-compatible API server.
Repositories available
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference
- augtoma’s original unquantised fp16 model in pytorch format, for GPU inference and for further conversions
Prompt template: Vicuna
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: {prompt}
ASSISTANT:
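As a minimal sketch, the template above can be filled in programmatically. The helper name `build_prompt` is hypothetical and not part of any library, and the sketch assumes the line breaks shown in the template are literal newlines.

```python
# Vicuna-style template as given in this README; {prompt} is the user's message.
SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(user_message: str) -> str:
    """Return the full prompt string (hypothetical helper for illustration)."""
    # Assumption: system text, USER line and ASSISTANT line are newline-separated.
    return f"{SYSTEM}\nUSER: {user_message.strip()}\nASSISTANT:"


if __name__ == "__main__":
    print(build_prompt("What are common symptoms of anemia?"))
```

The model's reply is whatever it generates after the trailing `ASSISTANT:` marker.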
Compatibility
Requires llama.cpp commit e76d630 or later, or one of the other tools and libraries listed above.
To use in llama.cpp, you must add the -gqa 8 argument.
For other UIs and libraries, please check the docs.
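A rough sketch of assembling a llama.cpp command line that includes the required -gqa 8 argument. The helper `build_llama_cpp_args`, the binary path `./main`, and the example model filename are assumptions for illustration; check your llama.cpp build for the exact binary name and supported flags.

```python
import shlex
from typing import List


def build_llama_cpp_args(
    model_path: str,
    prompt: str,
    n_gpu_layers: int = 0,
    binary: str = "./main",  # assumption: default llama.cpp example binary
) -> List[str]:
    """Build an argument list for llama.cpp (hypothetical helper)."""
    # -gqa 8 is required for 70B models, as stated in this README.
    args = [binary, "-m", model_path, "-gqa", "8", "-p", prompt]
    if n_gpu_layers > 0:
        # -ngl offloads that many layers to the GPU.
        args += ["-ngl", str(n_gpu_layers)]
    return args


if __name__ == "__main__":
    cmd = build_llama_cpp_args(
        "qcammel-70-x.ggmlv3.q4_K_M.bin",
        "USER: Hello\nASSISTANT:",
        n_gpu_layers=32,
    )
    # Print a shell-safe version; run with subprocess.run(cmd, check=True) if desired.
    print(" ".join(shlex.quote(a) for a in cmd))
```

Passing the arguments as a list avoids shell-quoting problems with multi-line prompts.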
Explanation of the new k-quant methods
Provided files
Name | Quant method | Bits | Size | Max RAM required | Use case |
---|---|---|---|---|---|
qcammel-70-x.ggmlv3.q2_K.bin | q2_K | 2 | 28.59 GB | 31.09 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
qcammel-70-x.ggmlv3.q3_K_L.bin | q3_K_L | 3 | 36.15 GB | 38.65 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
qcammel-70-x.ggmlv3.q3_K_M.bin | q3_K_M | 3 | 33.04 GB | 35.54 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
qcammel-70-x.ggmlv3.q3_K_S.bin | q3_K_S | 3 | 29.75 GB | 32.25 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
qcammel-70-x.ggmlv3.q4_0.bin | q4_0 | 4 | 38.87 GB | 41.37 GB | Original quant method, 4-bit. |
qcammel-70-x.ggmlv3.q4_1.bin | q4_1 | 4 | 43.17 GB | 45.67 GB | Original quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
qcammel-70-x.ggmlv3.q4_K_M.bin | q4_K_M | 4 | 41.38 GB | 43.88 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K |
qcammel-70-x.ggmlv3.q4_K_S.bin | q4_K_S | 4 | 38.87 GB | 41.37 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
qcammel-70-x.ggmlv3.q5_0.bin | q5_0 | 5 | 47.46 GB | 49.96 GB | Original quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
qcammel-70-x.ggmlv3.q5_K_M.bin | q5_K_M | 5 | 48.75 GB | 51.25 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K |
qcammel-70-x.ggmlv3.q5_K_S.bin | q5_K_S | 5 | 47.46 GB | 49.96 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
qcammel-70-x.ggmlv3.q5_1.bin | q5_1 | 5 | 51.76 GB | 54.26 GB | Original quant method, 5-bit. Higher accuracy, slower inference than q5_0. |
qcammel-70-x.ggmlv3.q6_K.bin | q6_K | 6 | 56.59 GB | 59.09 GB | New k-quant method. Uses GGML_TYPE_Q6_K (6-bit quantization) for all tensors |
qcammel-70-x.ggmlv3.q8_0.bin | q8_0 | 8 | 73.23 GB | 75.73 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
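In the table above, every "Max RAM required" value equals the file size plus 2.5 GB. The sketch below uses that observed pattern to pick the largest file that fits a given amount of RAM, with no GPU offloading. The helper names are hypothetical, and treating file size as a rough proxy for quality is an assumption, not a claim from this README.

```python
from typing import Optional

# File sizes in GB, copied from the "Provided files" table.
SIZES_GB = {
    "q2_K": 28.59, "q3_K_S": 29.75, "q3_K_M": 33.04, "q3_K_L": 36.15,
    "q4_0": 38.87, "q4_K_S": 38.87, "q4_K_M": 41.38, "q4_1": 43.17,
    "q5_0": 47.46, "q5_K_S": 47.46, "q5_K_M": 48.75, "q5_1": 51.76,
    "q6_K": 56.59, "q8_0": 73.23,
}

# Observed in the table: Max RAM required = file size + 2.5 GB.
OVERHEAD_GB = 2.5


def max_ram_gb(quant: str) -> float:
    """Estimated RAM needed for a quant, without GPU offloading."""
    return round(SIZES_GB[quant] + OVERHEAD_GB, 2)


def largest_fitting(ram_gb: float) -> Optional[str]:
    """Return the largest quant whose estimated RAM fits, or None."""
    fitting = [q for q in SIZES_GB if max_ram_gb(q) <= ram_gb]
    # Ties in size (e.g. q4_0 vs q4_K_S) resolve to the first dict entry.
    return max(fitting, key=SIZES_GB.get) if fitting else None


if __name__ == "__main__":
    for ram in (32, 64):
        print(ram, "GB ->", largest_fitting(ram))
```

Offloading layers to the GPU lowers the RAM requirement, so these estimates are upper bounds for CPU-only use.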
q5_1, q6_K and q8_0 files require expansion from archive
Note: HF does not support uploading files larger than 50GB. Therefore I have uploaded the q5_1, q6_K and q8_0 files as multi-part ZIP files. They are not compressed; they simply store a .bin file in two parts.