NLPinas / ph_en_text_detoxed

https://huggingface.co/datasets/NLPinas/ph_en_text_detoxed

PhEnText Detoxed is a large-scale, multi-domain lexical dataset written in Philippine English and Taglish. The news articles, religious articles, and court decisions collated by the original researchers were filtered for toxicity, and special characters were further preprocessed. This dataset has been configured for easy fine-tuning of LLaMA-based models (Alpaca, Guanaco, Vicuna, LLaMA 2, etc.). In total, it contains 6.29 million rows of training data and 2.7 million rows of testing data.
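
As a minimal sketch, the dataset can be loaded with the Hugging Face `datasets` library. The split name `"train"` and the column layout shown in the comments are assumptions; inspect the loaded object to confirm the actual schema.

```python
# Minimal loading sketch using the Hugging Face `datasets` library.
# Assumption: the repository exposes standard train/test splits; the
# actual column names should be confirmed by inspecting the output.
from datasets import load_dataset

dataset = load_dataset("NLPinas/ph_en_text_detoxed")

print(dataset)              # lists the available splits and their columns
print(dataset["train"][0])  # inspect one training example (assumed split name)
```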

Sources

According to Canon et al. (2022), here is the original breakdown of the dataset sources:

| Source | Website | Year | Number of Documents |
|---|---|---|---|
| Online news (Philippine Daily Inquirer) | inquirer.net | 2009-2021 | 834,630 |
| Online news (Manila Bulletin) | mb.com.ph | 2018-2021 | 248,408 |
| Jurisprudence | lawphil.net | 1901-2021 | 59,905 |
| Old digital periodicals | repository.mainlib.upd.edu.ph | 1904-1981 | 20,999 |
| Religious texts | cbcponline.net | 2009-2022 | 2,281 |
| Laws and Issuances | officialgazette.gov.ph | 1906-2016 | 30,215 |

Ethical Considerations

Before and after training/fine-tuning a model on this dataset, it is important to take note of the following:

  1. Fairness and Bias: The model’s responses may reflect biases present in the training data. Be aware of potential biases and make an effort to evaluate responses critically and fairly.
  2. Transparency: The model operates as a predictive text generator based on patterns learned from the training data.
  3. User Responsibility: Users should take responsibility for their own decisions and not solely rely on the information provided by the model. Consult with the appropriate professionals or reliable sources for specific advice or recommendations.
  4. NSFW Content: The data has already been detoxified; however, it may still contain sensitive topics, including violence, gore, and sexual content. If you plan to further refine your model for safe and aligned usage, you are highly encouraged to implement guardrails alongside it (see the sketch after this list).
  5. Timeliness: The data’s cutoff date is December 2021. The data must not be used to generate content that heavily relies on events after the cutoff date.
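
As one illustration of such a guardrail, the sketch below screens generated text with an off-the-shelf toxicity classifier before showing it to users. The `unitary/toxic-bert` checkpoint and the 0.5 threshold are illustrative assumptions, not part of this dataset.

```python
# Guardrail sketch: score generated text with an off-the-shelf toxicity
# classifier and withhold outputs above a threshold. The model name
# ("unitary/toxic-bert") and the 0.5 threshold are illustrative assumptions.
from transformers import pipeline

toxicity_classifier = pipeline("text-classification", model="unitary/toxic-bert")

def is_safe(text: str, threshold: float = 0.5) -> bool:
    """Return True if the toxicity score falls below the chosen threshold."""
    result = toxicity_classifier(text)[0]
    return result["score"] < threshold

generated = "Some model output to check before showing to the user."
if is_safe(generated):
    print(generated)
else:
    print("[Output withheld by toxicity guardrail]")
```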

References

@INPROCEEDINGS{9923429,
  author={Canon, Mary Joy P. and Sy, Christian Y. and Palaoag, Thelma D. and Roxas, Rachel Edita O. and Maceda, Lany L.},
  booktitle={2022 International Conference on Advanced Computer Science and Information Systems (ICACSIS)}, 
  title={Language Resource Construction of Multi-Domain Philippine English Text for Pre-training Objective}, 
  year={2022},
  pages={149--154},
  doi={10.1109/ICACSIS56558.2022.9923429}}

@misc{phentext_detoxed,
  author = {Catapang, Jasper Kyle and Peramo, Elmer},
  title = {PhEnText Detoxed},
  year = {2023},
  publisher = {Hugging Face},
  journal = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/datasets/NLPinas/ph_en_text_detoxed}}
}