Fine-tuning Instruct Llama2 Stack Overflow Python Q&A
Transformed Dataset
Objective
The transformed dataset is designed for fine-tuning LLMs to improve Python coding assistance by focusing on high-quality content from Stack Overflow. It has around 500k instructions.
Structure
- Question-Answer Pairing: Questions and answers are paired using the ParentId linkage.
- Quality Focus: Only top-rated answers for each question are retained.
- HTML Tag Removal: All HTML tags in the content are removed.
- Combined Question Field: Each question’s title and body are merged.
- Filtering: Entries with negative scores or those not containing Python code structures are excluded.
Final columns:
- score_question
- score_answer
- question
- answer
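The preparation steps listed under Structure can be sketched in plain Python as follows. This is a minimal illustration, not the author's actual pipeline. The field names Id, Score, Title and Body are assumed to follow the Stack Overflow export (only ParentId is named above). The PYTHON_HINT pattern is a hypothetical stand-in for the unspecified "Python code structures" check. The sketch also assumes that "top-rated" means the single highest-scored answer, and that the negative-score filter applies to both the question and the answer.

```python
import re
from html import unescape

# Hypothetical heuristic: the source does not say how Python code is detected.
PYTHON_HINT = re.compile(r"\b(def|import|class|print|return|for|while)\b")
TAG = re.compile(r"<[^>]+>")


def strip_html(text):
    """Remove HTML tags, then decode entities such as &lt; and &amp;."""
    return unescape(TAG.sub("", text)).strip()


def build_pairs(questions, answers):
    """Pair each question with its highest-scored answer and apply the filters."""
    # Keep only the best answer per question, linked through ParentId.
    best = {}
    for a in answers:
        current = best.get(a["ParentId"])
        if current is None or a["Score"] > current["Score"]:
            best[a["ParentId"]] = a

    rows = []
    for q in questions:
        a = best.get(q["Id"])
        if a is None:
            continue  # question has no answer
        if q["Score"] < 0 or a["Score"] < 0:
            continue  # assumed: drop the pair if either score is negative
        # Merge title and body into a single question field.
        question = strip_html(q["Title"] + "\n" + q["Body"])
        answer = strip_html(a["Body"])
        if not PYTHON_HINT.search(question + "\n" + answer):
            continue  # no Python-looking content
        rows.append({
            "score_question": q["Score"],
            "score_answer": a["Score"],
            "question": question,
            "answer": answer,
        })
    return rows
```

Each returned dictionary carries the four final columns listed above.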
Llama2 Transformation
The dataset has been transformed to match the Llama2 prompt structure used for the model’s fine-tuning. The format is as follows:
<s>[INST] <<SYS>> {{ system_prompt }} <</SYS>> {{ user_message }} [/INST]
Where:
- system_prompt gives context or instructions to the model.
- user_message is the user’s query following the system prompt, expecting a particular response from the model.
Keeping this structure aligns the training data with the prompt format Llama2 expects.
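As a rough illustration, a question-answer pair could be rendered into this template with a small helper like the one below. The function name to_llama2_prompt is hypothetical, and appending the answer followed by a closing </s> token is an assumption about how training examples are completed, since the template above only covers the prompt part. The whitespace follows the template exactly as shown here; the reference Llama 2 chat format places newlines around the system block, so adjust the separators if your training code expects that.

```python
def to_llama2_prompt(system_prompt, user_message, answer=None):
    """Render one example using the template shown above.

    If `answer` is given, it is appended after [/INST] and closed with </s>
    (an assumption; the source only specifies the prompt portion).
    """
    prompt = f"<s>[INST] <<SYS>> {system_prompt} <</SYS>> {user_message} [/INST]"
    if answer is not None:
        prompt += f" {answer} </s>"
    return prompt


# Example usage with made-up text:
example = to_llama2_prompt(
    "You are a helpful Python assistant.",
    "How do I reverse a list?",
    "Use my_list[::-1] or my_list.reverse().",
)
```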
Original Dataset
The dataset contains questions and answers from Stack Overflow with the python tag, covering the period from August 2, 2008, to October 19, 2016.
License
All contributions are licensed under CC-BY-SA 3.0. Attribution is required. The original dataset was posted here.