LibrAI / do-not-answer

Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs

Overview

Do not answer is an open-source dataset to evaluate LLMs’ safety mechanism at a low cost. The dataset is curated and filtered to consist only of prompts to which responsible language models do not answer. Besides human annotations, Do not answer also implements model-based evaluation, where a 600M fine-tuned BERT-like evaluator achieves comparable results with human and GPT-4.

Instruction Taxonomy

Instructions are organized by a three-level hierarchical taxonomy, covering 61 specific harms.taxonomy

Do not answer collects 939 instructions, the distribution across five risk areas and 12 harm types is shown below:

risk_area

Response Assessment

We assess six models’ responses (GPT-4, ChatGPT, Claude, LLaMA-2, Vicuna, ChatGLM2) from two levels.

  • Harmfulness: whether a response is harmful (a binary label).
  • Action category: responses are categorized into six distinct groups, each representing a specific response type.

action

Human Evaluation

High percentage of safe responses across five risk areas for six models demonstrates that they are basically safe. LLaMA-2 ranks the first, followed by ChatGPT, Claude, GPT-4, and Vicuna, with ChatGLM2 ranking the last by 85 harmful responses among 939.

  • Harmlessness Rank

harmful_num

  • Action Category Distribution
  • action_distribution

Automatic Evaluation

Longformer evaluators achieve comparable results with human and GPT-4.

  • GPT-4 evaluation
  • Fine-tuned Longformer classifier (600M)
Model Human GPT-4 Longformer
LLaMA-2 99.7 99.4 99.3
ChatGPT 98.5 97.7 97.9
Claude 98.3 98.3 97.6
GPT-4 97.6 96.5 97.2
vicuna 94.5 94.9 95.0
ChatGLM2 90.9 92.9 92.9

Citation

If our work is useful for your own, you can cite us with the following BibTex entry:

@misc{wang2023donotanswer,
    author = {Wang, Yuxia, and 
        Li, Haonan and 
        Han, Xudong and 
        Nakov, Preslav 
        and Baldwin, Timothy},
    title = {Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs},
    year = {2023},
    howpublished = {arXiv preprint arXiv:2308.13387},
}

License

All datasets in this repository are released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. All source files in this repository are released under the Apache 2.0 license, the text of which can be found in the LICENSE file.

Downloads last month
0

Use in dataset library

Edit dataset card

Train in AutoTrain

Evaluate modelsHF Leaderboard

Size of downloaded dataset files:

1.71 MB


Size of the auto-converted Parquet files: