TheBloke/qCammel-70-x-GGML

Flower0 · September 5, 2023, 7:36am

qCammel 70 - GGML

Model creator: augtoma
Original model: qCammel 70

Description

This repo contains GGML format model files for augtoma’s qCammel 70.

GPU acceleration is now available for Llama 2 70B GGML files, with both CUDA (NVIDIA) and Metal (macOS). The following clients and libraries are known to work with these files, including with CUDA GPU acceleration:

llama.cpp, commit e76d630 and later.
text-generation-webui, the most widely used web UI.
KoboldCpp, version 1.37 and later. A powerful GGML web UI, especially good for storytelling.
LM Studio, a fully featured local GUI with GPU acceleration for both Windows and macOS. Use 0.1.11 or later for macOS GPU acceleration with 70B models.
llama-cpp-python, version 0.1.77 and later. A Python library with LangChain support, and OpenAI-compatible API server.
ctransformers, version 0.2.15 and later. A Python library with LangChain support, and OpenAI-compatible API server.

Prompt template: Vicuna

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: {prompt}
ASSISTANT:
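
A minimal sketch of filling this template in Python. The function name `build_vicuna_prompt` and the sample question are illustrative assumptions and are not part of the model card; only the template text comes from above.

```python
# System line copied from the prompt template above.
SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_vicuna_prompt(user_message: str, system: str = SYSTEM) -> str:
    """Return the system line, a blank line, the USER turn, and an open ASSISTANT turn."""
    # build_vicuna_prompt is a hypothetical helper name, used here for illustration only.
    return f"{system}\n\nUSER: {user_message}\nASSISTANT:"


if __name__ == "__main__":
    print(build_vicuna_prompt("What are common symptoms of anemia?"))
```

The returned string ends with an open `ASSISTANT:` turn so that the model continues from there.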

Compatibility

Requires llama.cpp commit `e76d630` or later.

Or one of the other tools and libraries listed above.

To use these files in llama.cpp, you must add the `-gqa 8` argument.

For other UIs and libraries, please check the docs.
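
As a rough sketch, the llama.cpp invocation can be assembled in Python so that `-gqa 8` is never forgotten. The binary path `./main`, the chosen model file, and the helper name `llama_cpp_command` are assumptions for illustration; `-m`, `-p` and `-ngl` are the usual llama.cpp options for model path, prompt and GPU layers, but check `./main --help` for your build.

```python
import shlex


def llama_cpp_command(model_path, prompt, gpu_layers=0, binary="./main"):
    """Build an argument list for llama.cpp that always includes -gqa 8."""
    # llama_cpp_command is a hypothetical helper; "./main" is an assumed binary location.
    cmd = [binary, "-m", model_path, "-gqa", "8", "-p", prompt]
    if gpu_layers > 0:
        # Offload some layers to the GPU only when requested.
        cmd += ["-ngl", str(gpu_layers)]
    return cmd


if __name__ == "__main__":
    cmd = llama_cpp_command(
        "qcammel-70-x.ggmlv3.q4_K_M.bin",
        "USER: Hello\nASSISTANT:",
    )
    # Print a shell-safe version of the command instead of running it.
    print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd)` once the model file is in place.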

Provided files

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ---- |
| qcammel-70-x.ggmlv3.q2_K.bin | q2_K | 2 | 28.59 GB | 31.09 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
| qcammel-70-x.ggmlv3.q3_K_L.bin | q3_K_L | 3 | 36.15 GB | 38.65 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
| qcammel-70-x.ggmlv3.q3_K_M.bin | q3_K_M | 3 | 33.04 GB | 35.54 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
| qcammel-70-x.ggmlv3.q3_K_S.bin | q3_K_S | 3 | 29.75 GB | 32.25 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
| qcammel-70-x.ggmlv3.q4_0.bin | q4_0 | 4 | 38.87 GB | 41.37 GB | Original quant method, 4-bit. |
| qcammel-70-x.ggmlv3.q4_1.bin | q4_1 | 4 | 43.17 GB | 45.67 GB | Original quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However, it has quicker inference than the q5 models. |
| qcammel-70-x.ggmlv3.q4_K_M.bin | q4_K_M | 4 | 41.38 GB | 43.88 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K |
| qcammel-70-x.ggmlv3.q4_K_S.bin | q4_K_S | 4 | 38.87 GB | 41.37 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
| qcammel-70-x.ggmlv3.q5_0.bin | q5_0 | 5 | 47.46 GB | 49.96 GB | Original quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
| qcammel-70-x.ggmlv3.q5_K_M.bin | q5_K_M | 5 | 48.75 GB | 51.25 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K |
| qcammel-70-x.ggmlv3.q5_K_S.bin | q5_K_S | 5 | 47.46 GB | 49.96 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
| qcammel-70-x.ggmlv3.q5_1.bin | q5_1 | 5 | 51.76 GB | 54.26 GB | Original quant method, 5-bit. Higher accuracy, slower inference than q5_0. |
| qcammel-70-x.ggmlv3.q6_K.bin | q6_K | 6 | 56.59 GB | 59.09 GB | New k-quant method. Uses GGML_TYPE_Q8_K - 6-bit quantization - for all tensors |
| qcammel-70-x.ggmlv3.q8_0.bin | q8_0 | 8 | 73.23 GB | 75.73 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
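
In the table, every "Max RAM required" value is the file size plus 2.50 GB. The sketch below uses the listed figures to find the largest file whose stated RAM requirement fits a given budget, assuming no GPU offloading. The helper name `largest_fit` is hypothetical, and "largest file" is only a rough stand-in for quality, not a recommendation from the model card.

```python
# (quant method, file size in GB, max RAM required in GB), copied from the table above.
FILES = [
    ("q2_K", 28.59, 31.09), ("q3_K_S", 29.75, 32.25), ("q3_K_M", 33.04, 35.54),
    ("q3_K_L", 36.15, 38.65), ("q4_0", 38.87, 41.37), ("q4_K_S", 38.87, 41.37),
    ("q4_K_M", 41.38, 43.88), ("q4_1", 43.17, 45.67), ("q5_0", 47.46, 49.96),
    ("q5_K_S", 47.46, 49.96), ("q5_K_M", 48.75, 51.25), ("q5_1", 51.76, 54.26),
    ("q6_K", 56.59, 59.09), ("q8_0", 73.23, 75.73),
]


def largest_fit(ram_gb):
    """Return the largest listed file whose max RAM fits in ram_gb, or None."""
    # largest_fit is a hypothetical helper name, used here for illustration only.
    candidates = [f for f in FILES if f[2] <= ram_gb]
    return max(candidates, key=lambda f: f[1], default=None)


if __name__ == "__main__":
    for budget in (30, 48, 64):
        print(budget, "GB ->", largest_fit(budget))
```

With GPU offloading, the RAM actually needed would be lower than these figures, so treat the result as a conservative upper bound.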

q5_1, q6_K and q8_0 files require expansion from archive

Note: HF does not support uploading files larger than 50 GB. Therefore I have uploaded the q5_1, q6_K and q8_0 files as multi-part ZIP files. They are not compressed; the archive is used only to store a .bin file in two parts.
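
A quick check against the sizes in the "Provided files" table shows which files exceed the 50 GB limit. This is only a sketch: the 50 GB threshold is taken from the note above, and the GB figures are used exactly as listed, without converting between GB and GiB.

```python
# File sizes in GB, copied from the "Provided files" table.
SIZES_GB = {
    "q2_K": 28.59, "q3_K_S": 29.75, "q3_K_M": 33.04, "q3_K_L": 36.15,
    "q4_0": 38.87, "q4_K_S": 38.87, "q4_K_M": 41.38, "q4_1": 43.17,
    "q5_0": 47.46, "q5_K_S": 47.46, "q5_K_M": 48.75, "q5_1": 51.76,
    "q6_K": 56.59, "q8_0": 73.23,
}

UPLOAD_LIMIT_GB = 50  # limit stated in the note above

# Quant methods whose single .bin file would be over the limit.
oversized = [name for name, size in SIZES_GB.items() if size > UPLOAD_LIMIT_GB]

if __name__ == "__main__":
    print(oversized)
```

The result matches the three files named in the heading of this section.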
