NLPinas / ph_en_text_detoxed

https://huggingface.co/datasets/NLPinas/ph_en_text_detoxed

PhEnText Detoxed is a large-scale, multi-domain lexical dataset written in Philippine English and Taglish. The news articles, religious articles, and court decisions collated by the original researchers were filtered for toxicity, and special characters were further preprocessed. This dataset has been configured for easy fine-tuning of LLaMA-based models (Alpaca, Guanaco, Vicuna, LLaMA 2, etc.). In total, it contains 6.29 million rows of training data and 2.7 million rows of testing data.
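
As a minimal sketch, the dataset can be loaded with the Hugging Face `datasets` library. The split name `"train"` and the column layout shown in the comments are assumptions; inspect the loaded object to confirm the actual schema.

```python
# Minimal loading sketch using the Hugging Face `datasets` library.
# Assumption: the repository exposes standard train/test splits; the
# actual column names should be confirmed by inspecting the output.
from datasets import load_dataset

dataset = load_dataset("NLPinas/ph_en_text_detoxed")

print(dataset)              # lists the available splits and their columns
print(dataset["train"][0])  # inspect one training example (assumed split name)
```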

Sources

According to Canon et al. (2022), here is the original breakdown of the dataset sources:

| Source | Website | Year | Number of Documents |
|---|---|---|---|
| Online news (Philippine Daily Inquirer) | inquirer.net | 2009-2021 | 834,630 |
| Online news (Manila Bulletin) | mb.com.ph | 2018-2021 | 248,408 |
| Jurisprudence | lawphil.net | 1901-2021 | 59,905 |
| Old digital periodicals | repository.mainlib.upd.edu.ph | 1904-1981 | 20,999 |
| Religious texts | cbcponline.net | 2009-2022 | 2,281 |
| Laws and Issuances | officialgazette.gov.ph | 1906-2016 | 30,215 |

Ethical Considerations

Before and after training/fine-tuning a model on this dataset, it is important to take note of the following:

  1. Fairness and Bias: The model’s responses may reflect biases present in the training data. Be aware of potential biases and make an effort to evaluate responses critically and fairly.
  2. Transparency: The model operates as a predictive text generator based on patterns learned from the training data.
  3. User Responsibility: Users should take responsibility for their own decisions and not solely rely on the information provided by the model. Consult with the appropriate professionals or reliable sources for specific advice or recommendations.
  4. NSFW Content: The data has already been detoxified; however, it may still contain sensitive topics, including violence, gore, and sexual content. If you plan to further refine your model for safe and aligned usage, you are highly encouraged to implement guardrails alongside it (see the sketch after this list).
  5. Timeliness: The data’s cutoff date is December 2021. The data must not be used to generate content that heavily relies on events after the cutoff date.
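
As one illustration of such a guardrail, the sketch below screens generated text with an off-the-shelf toxicity classifier before showing it to users. The `unitary/toxic-bert` checkpoint and the 0.5 threshold are illustrative assumptions, not part of this dataset.

```python
# Guardrail sketch: score generated text with an off-the-shelf toxicity
# classifier and withhold outputs above a threshold. The model name
# ("unitary/toxic-bert") and the 0.5 threshold are illustrative assumptions.
from transformers import pipeline

toxicity_classifier = pipeline("text-classification", model="unitary/toxic-bert")

def is_safe(text: str, threshold: float = 0.5) -> bool:
    """Return True if the toxicity score falls below the chosen threshold."""
    result = toxicity_classifier(text)[0]
    return result["score"] < threshold

generated = "Some model output to check before showing to the user."
if is_safe(generated):
    print(generated)
else:
    print("[Output withheld by toxicity guardrail]")
```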

References

@INPROCEEDINGS{9923429,
  author={Canon, Mary Joy P. and Sy, Christian Y. and Palaoag, Thelma D. and Roxas, Rachel Edita O. and Maceda, Lany L.},
  booktitle={2022 International Conference on Advanced Computer Science and Information Systems (ICACSIS)}, 
  title={Language Resource Construction of Multi-Domain Philippine English Text for Pre-training Objective}, 
  year={2022},
  pages={149--154},
  doi={10.1109/ICACSIS56558.2022.9923429}}

@misc{phentext_detoxed,
  author = {Catapang, Jasper Kyle and Peramo, Elmer},
  title = {PhEnText Detoxed},
  year = {2023},
  publisher = {Hugging Face},
  journal = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/datasets/NLPinas/ph_en_text_detoxed}}
}