Luisroque / instruct-python-llama2-500k

Fine-tuning Instruct Llama2 Stack Overflow Python Q&A

Transformed Dataset

Objective

The transformed dataset is designed for fine-tuning LLMs on Python coding assistance, using high-quality question-answer pairs from Stack Overflow. It contains around 500k instructions.

Structure

  • Question-Answer Pairing: Questions and answers are paired using the ParentId linkage.
  • Quality Focus: Only top-rated answers for each question are retained.
  • HTML Tag Removal: All HTML tags in the content are removed.
  • Combined Question Field: Each question’s title and body are merged.
  • Filtering: Entries with negative scores or those not containing Python code structures are excluded.

Final columns:

  • score_question
  • score_answer
  • question
  • answer
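The steps listed under Structure can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the script that produced the dataset. The input field names (Id, ParentId, Score, Title, Body) are assumed from the usual Stack Overflow export layout, the code-detection regex is a hypothetical stand-in because the card does not say how "Python code structures" were detected, and "top-rated" is read here as the single highest-scored answer per question.

```python
import html
import re

# Naive tag stripper; it can also eat text such as "a < b and c > d".
TAG_RE = re.compile(r"<[^>]+>")

# Hypothetical heuristic for "contains Python code structures".
CODE_RE = re.compile(
    r"\bdef\s+\w+\s*\(|\bimport\s+\w+|\bprint\s*\(|\bfor\s+\w+\s+in\b"
)


def strip_html(text):
    """Remove HTML tags and unescape entities such as &amp;."""
    return html.unescape(TAG_RE.sub("", text)).strip()


def build_pairs(questions, answers):
    """Pair each question with its highest-scored answer and apply the filters."""
    # Keep only the best answer per question, linked through ParentId.
    best = {}
    for a in answers:
        current = best.get(a["ParentId"])
        if current is None or a["Score"] > current["Score"]:
            best[a["ParentId"]] = a

    rows = []
    for q in questions:
        a = best.get(q["Id"])
        if a is None:
            continue  # question has no answer
        # Assumption: a negative score on either side excludes the pair.
        if q["Score"] < 0 or a["Score"] < 0:
            continue
        # Merge title and body into one question field, then strip HTML.
        question = strip_html(q["Title"] + "\n" + q["Body"])
        answer = strip_html(a["Body"])
        if not CODE_RE.search(question + "\n" + answer):
            continue  # no recognizable Python code
        rows.append(
            {
                "score_question": q["Score"],
                "score_answer": a["Score"],
                "question": question,
                "answer": answer,
            }
        )
    return rows
```

Each returned row carries the four final columns listed above. For the full dump, a DataFrame-based join would likely be more practical; plain dictionaries are used here only to keep the sketch self-contained.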

Llama2 Transformation

The dataset has been transformed to match the Llama2 prompt structure used when fine-tuning the model. The format is as follows:

<s>[INST] <<SYS>> {{ system_prompt }} <</SYS>> {{ user_message }} [/INST]

Where:

  • system_prompt gives context or instructions to the model.
  • user_message is the user’s query, which follows the system prompt and is what the model is expected to answer.

This structure keeps the training examples consistent with the prompt format Llama2 expects, which should help fine-tuning quality.
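As an illustration, a small helper that fills in the template might look like the sketch below. The function name build_llama2_prompt is hypothetical, the single-space separators simply follow the template as printed in this card (Meta's reference code places newlines around the <<SYS>> block), and appending the answer with a closing </s> is an assumption about how training examples are completed, not something this card states.

```python
def build_llama2_prompt(system_prompt, user_message, answer=None):
    """Fill the Llama2 prompt template from the dataset card.

    If `answer` is given, it is appended after [/INST] and closed with </s>
    (an assumed layout for a complete training example).
    """
    prompt = f"<s>[INST] <<SYS>> {system_prompt} <</SYS>> {user_message} [/INST]"
    if answer is not None:
        prompt += f" {answer} </s>"
    return prompt


example = build_llama2_prompt(
    "You are a helpful Python assistant.",  # placeholder system prompt
    "How do I reverse a list?",
    "Use reversed(xs) or xs[::-1].",
)
```

Leaving answer as None yields the inference-time prompt, so the same helper can serve both training and generation.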

Original Dataset

The dataset contains questions and answers from Stack Overflow with the python tag, covering the period from August 2, 2008, to October 19, 2016.

License

All contributions are licensed under CC-BY-SA 3.0. Attribution is required. The original dataset was posted here.
