TheBloke/qCammel-70-x-GGML

qCammel 70 - GGML

Description

This repo contains GGML format model files for augtoma’s qCammel 70.

GPU acceleration is now available for Llama 2 70B GGML files, with both CUDA (NVIDIA) and Metal (macOS). The following clients/libraries are known to work with these files, including with CUDA GPU acceleration:

  • llama.cpp, commit e76d630 and later.
  • text-generation-webui, the most widely used web UI.
  • KoboldCpp, version 1.37 and later. A powerful GGML web UI, especially good for storytelling.
  • LM Studio, a fully featured local GUI with GPU acceleration for both Windows and macOS. Use 0.1.11 or later for macOS GPU acceleration with 70B models.
  • llama-cpp-python, version 0.1.77 and later. A Python library with LangChain support and an OpenAI-compatible API server.
  • ctransformers, version 0.2.15 and later. A Python library with LangChain support and an OpenAI-compatible API server.

Repositories available

Prompt template: Vicuna

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: {prompt}
ASSISTANT:
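
As a concrete illustration, the template above can be filled in from a shell script like the following; the question string is just an example:

```shell
# Build a Vicuna-format prompt (the question string is an illustrative example).
SYSTEM="A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions."
QUESTION="What is GGML?"
PROMPT="${SYSTEM}

USER: ${QUESTION}
ASSISTANT:"
printf '%s\n' "$PROMPT"
```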

Compatibility

Requires llama.cpp commit e76d630 or later.

Or one of the other tools and libraries listed above.

To use in llama.cpp, you must add the -gqa 8 argument.
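
As a sketch, a llama.cpp invocation might look like the following. The thread count, GPU layer count, and sampling settings are illustrative assumptions, not recommendations; adjust them to your hardware:

```shell
# Illustrative llama.cpp invocation for a 70B GGML file.
# -gqa 8 is required for Llama 2 70B models; -ngl offloads layers to the GPU.
./main -t 10 -ngl 40 -gqa 8 \
  -m qcammel-70-x.ggmlv3.q4_K_M.bin \
  --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 \
  -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Write a story about llamas ASSISTANT:"
```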

For other UIs and libraries, please check the docs.

Explanation of the new k-quant methods


Provided files

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| --- | --- | --- | --- | --- | --- |
| qcammel-70-x.ggmlv3.q2_K.bin | q2_K | 2 | 28.59 GB | 31.09 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
| qcammel-70-x.ggmlv3.q3_K_L.bin | q3_K_L | 3 | 36.15 GB | 38.65 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K. |
| qcammel-70-x.ggmlv3.q3_K_M.bin | q3_K_M | 3 | 33.04 GB | 35.54 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K. |
| qcammel-70-x.ggmlv3.q3_K_S.bin | q3_K_S | 3 | 29.75 GB | 32.25 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors. |
| qcammel-70-x.ggmlv3.q4_0.bin | q4_0 | 4 | 38.87 GB | 41.37 GB | Original quant method, 4-bit. |
| qcammel-70-x.ggmlv3.q4_1.bin | q4_1 | 4 | 43.17 GB | 45.67 GB | Original quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However, has quicker inference than q5 models. |
| qcammel-70-x.ggmlv3.q4_K_M.bin | q4_K_M | 4 | 41.38 GB | 43.88 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K. |
| qcammel-70-x.ggmlv3.q4_K_S.bin | q4_K_S | 4 | 38.87 GB | 41.37 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors. |
| qcammel-70-x.ggmlv3.q5_0.bin | q5_0 | 5 | 47.46 GB | 49.96 GB | Original quant method, 5-bit. Higher accuracy, higher resource usage, and slower inference. |
| qcammel-70-x.ggmlv3.q5_K_M.bin | q5_K_M | 5 | 48.75 GB | 51.25 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K. |
| qcammel-70-x.ggmlv3.q5_K_S.bin | q5_K_S | 5 | 47.46 GB | 49.96 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors. |
| qcammel-70-x.ggmlv3.q5_1.bin | q5_1 | 5 | 51.76 GB | 54.26 GB | Original quant method, 5-bit. Higher accuracy, slower inference than q5_0. |
| qcammel-70-x.ggmlv3.q6_K.bin | q6_K | 6 | 56.59 GB | 59.09 GB | New k-quant method. Uses GGML_TYPE_Q8_K - 6-bit quantization - for all tensors. |
| qcammel-70-x.ggmlv3.q8_0.bin | q8_0 | 8 | 73.23 GB | 75.73 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

q5_1, q6_K and q8_0 files require expansion from archive

Note: HF does not support uploading files larger than 50 GB. Therefore I have uploaded the q5_1, q6_K and q8_0 files as multi-part ZIP files. The ZIPs are not compressed; they simply store each .bin file in two parts.
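
As a sketch, assuming the parts follow zip's standard split naming (a .z01 part alongside the .zip), the archive can be rejoined with zip -FF and then extracted. The filenames below are taken from the q6_K row of the table; substitute the q5_1 or q8_0 names as needed:

```shell
# Rejoin the split archive into a single zip (assumes the .z01 part sits next
# to the .zip in the current directory), then extract the .bin file.
zip -FF qcammel-70-x.ggmlv3.q6_K.zip --out qcammel-70-x.ggmlv3.q6_K.joined.zip
unzip qcammel-70-x.ggmlv3.q6_K.joined.zip
```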



