Theblackcat102 / crossvalidated-posts

Cross Validated / stats.stackexchange.com

Dataset Summary

This dataset contains all posts submitted to stats.stackexchange.com before 30 August 2023, formatted as Markdown text.
The data is sourced from the Internet Archive StackExchange Data Dump and follows the format of mikex86/stackoverflow-posts.

Dataset Structure

Each record corresponds to one post of a particular type. Original ordering from the data dump is not exactly preserved due to parallelism in the script used to process the data dump. The markdown content of each post is contained in the Body field. The license for a particular post is contained in the ContentLicense field.
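If a deterministic order or a per-license overview is needed, something along these lines may help. This is a minimal sketch: the helper name `sort_and_tally` and the demo records are hypothetical, and whether ascending Id matches the original dump order is an assumption, not something this card confirms.

```python
from collections import Counter

def sort_and_tally(posts):
    """Return posts sorted by Id plus a count of posts per ContentLicense."""
    # Sorting materializes the whole iterable in memory.
    ordered = sorted(posts, key=lambda post: post["Id"])
    # ContentLicense may be null, so missing values are grouped as "unknown".
    licenses = Counter(post["ContentLicense"] or "unknown" for post in ordered)
    return ordered, licenses

# Hypothetical records for illustration; real rows carry every field of the schema.
demo_posts = [
    {"Id": 7, "ContentLicense": "CC BY-SA 4.0", "Body": "second"},
    {"Id": 3, "ContentLicense": "CC BY-SA 3.0", "Body": "first"},
    {"Id": 9, "ContentLicense": None, "Body": "third"},
]

ordered, licenses = sort_and_tally(demo_posts)
print([post["Id"] for post in ordered])
print(dict(licenses))
```

The function accepts any iterable of dict-like records, so it should also work on rows loaded with the datasets library, at the cost of holding them all in memory.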

Data Fields

{
  Id: long,
  PostTypeId: long, // 1=Question, 2=Answer, 3=Orphaned tag wiki, 4=Tag wiki excerpt, 5=Tag wiki, 6=Moderator nomination, 7=Wiki Placeholder, 8=Privilege Wiki
  AcceptedAnswerId: long | null, // only present if PostTypeId=1
  ParentId: long | null, // only present if PostTypeId=2
  Score: long,
  ViewCount: long | null,
  Body: string | null,
  Title: string | null,
  ContentLicense: string | null,
  FavoriteCount: long | null,
  CreationDate: string | null,
  LastActivityDate: string | null,
  LastEditDate: string | null,
  LastEditorUserId: long | null,
  OwnerUserId: long | null,
  Tags: array<string> | null
}

See also the StackExchange Data Dump Schema Documentation, as all fields have analogs in the original dump format.
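The Id, PostTypeId and AcceptedAnswerId fields make it possible to join each question to its accepted answer. The following is a rough sketch of that join over plain dictionaries; the function name `pair_with_accepted_answers` and the demo records are made up for illustration, and the approach assumes the records fit in memory.

```python
QUESTION, ANSWER = 1, 2  # PostTypeId values from the field list

def pair_with_accepted_answers(posts):
    """Return (title, question_body, answer_body) for questions with an accepted answer."""
    posts = list(posts)
    # Index answers by Id so each lookup is a dictionary access.
    answers = {p["Id"]: p for p in posts if p["PostTypeId"] == ANSWER}
    pairs = []
    for post in posts:
        if post["PostTypeId"] != QUESTION:
            continue
        # AcceptedAnswerId may be null, or point to an answer that is absent.
        accepted = answers.get(post.get("AcceptedAnswerId"))
        if accepted is not None:
            pairs.append((post["Title"], post["Body"], accepted["Body"]))
    return pairs

# Hypothetical records; only the fields used by the join are filled in.
demo_posts = [
    {"Id": 1, "PostTypeId": 1, "AcceptedAnswerId": 3, "ParentId": None,
     "Title": "What is a p-value?", "Body": "Question text"},
    {"Id": 2, "PostTypeId": 2, "AcceptedAnswerId": None, "ParentId": 1,
     "Title": None, "Body": "Another answer"},
    {"Id": 3, "PostTypeId": 2, "AcceptedAnswerId": None, "ParentId": 1,
     "Title": None, "Body": "Accepted answer"},
    {"Id": 4, "PostTypeId": 1, "AcceptedAnswerId": None, "ParentId": None,
     "Title": "Unanswered", "Body": "No accepted answer yet"},
]

pairs = pair_with_accepted_answers(demo_posts)
print(pairs)
```

Questions without an accepted answer are skipped rather than paired with an empty string, which keeps the output usable as question/answer training pairs.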

How to use?

from datasets import load_dataset

# download the full dataset up front
ds = load_dataset('theblackcat102/crossvalidated-posts', split='train')

# dataset streaming (will only download the data as needed)
ds = load_dataset('theblackcat102/crossvalidated-posts', split='train', streaming=True)

for sample in ds:
    print(sample["Body"])
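When streaming, it is often enough to look at a handful of records instead of iterating over everything. The sketch below takes the first few questions from any iterable of post records; `first_questions` and the stand-in records are hypothetical, and with the real dataset the streaming `ds` object would be passed in place of `sample_posts`.

```python
from itertools import islice

QUESTION = 1  # PostTypeId value for questions

def first_questions(posts, n=3):
    """Return up to n question records from any iterable of post dicts."""
    questions = (post for post in posts if post["PostTypeId"] == QUESTION)
    # islice stops pulling records once n questions have been found.
    return list(islice(questions, n))

# Stand-in records so the example runs without downloading anything.
sample_posts = [
    {"Id": 10, "PostTypeId": 1, "Title": "First question", "Body": "..."},
    {"Id": 11, "PostTypeId": 2, "Title": None, "Body": "..."},
    {"Id": 12, "PostTypeId": 1, "Title": "Second question", "Body": "..."},
    {"Id": 13, "PostTypeId": 1, "Title": "Third question", "Body": "..."},
]

for post in first_questions(sample_posts, n=2):
    print(post["Id"], post["Title"])
```

Because the filter is a generator, no more records are consumed than needed, which suits the streaming mode where data is downloaded on demand.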

How is the text stored?

The original Data Dump formats the "Body" field as HTML, using tags such as <code>, <h1>, <ul>, etc. This HTML has been converted to Markdown following the conversion rules of mikex86/stackoverflow-posts.
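To give a feel for what such a conversion involves, here is a toy converter built on Python's standard `html.parser`. It is not the converter used for this dataset, and the rules of mikex86/stackoverflow-posts are not reproduced here; the class name and the handful of tag mappings are assumptions chosen purely for illustration.

```python
from html.parser import HTMLParser

class TinyHtmlToMarkdown(HTMLParser):
    """Toy converter for a few tags; not the dataset's actual conversion."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.parts.append("# ")
        elif tag == "li":
            self.parts.append("- ")
        elif tag == "code":
            self.parts.append("`")

    def handle_endtag(self, tag):
        if tag == "code":
            self.parts.append("`")
        elif tag in ("h1", "p"):
            self.parts.append("\n\n")  # blank line after headings and paragraphs
        elif tag in ("li", "ul"):
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

def html_to_markdown(html):
    parser = TinyHtmlToMarkdown()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts).strip()

html = "<h1>Title</h1><p>Use <code>fit()</code> here.</p><ul><li>one</li><li>two</li></ul>"
print(html_to_markdown(html))
```

A real converter also has to handle nested lists, code blocks, links, images and escaping, which this sketch deliberately ignores.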

Example:

After differencing I saw that my constant/intercept is not statistically significant. Does anybody know how to fit the same model without the const term? im using statsmodels.tsa.arima.model To give a relative example I have: ARIMA(data, order=(3,0,0)) an AR(3) model and say it that the second coefficient is insignificant. I can get rid of it by typing

ARMA(data,order=([1, 3], 0, 0)

but how can I get rid of coefficient??
