
import os
from datetime import datetime

import pytz
from datasets.download.download_config import DownloadConfig
from datasets.utils.file_utils import cached_path
from datasets.utils.hub import hf_hub_url

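# Update cadence, read from the FREQUENCY environment variable and
# interpolated into the generated README text below.
frequency = os.environ.get("FREQUENCY", "").lower()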


def get_readme_path(dataset_name):
    """Download the dataset's README.md from the Hub and return its local cached path."""
    readme_path = hf_hub_url(dataset_name, "README.md")
    return cached_path(readme_path, download_config=DownloadConfig())


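def update_readme(dataset_name, subreddit, latest_date, new_rows):
    """Regenerate the generated part of the dataset README.

    Note: `latest_date` is accepted but not used; the timestamp written to the
    README is the current UTC hour.
    """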
    path = get_readme_path(dataset_name=dataset_name)
    latest_hour = datetime.now(pytz.utc).replace(minute=0, second=0, microsecond=0)
    latest_hour_str = latest_hour.strftime('%Y-%m-%d %H:00:00 %Z%z')

    readme_text = f"""
## Dataset Overview
The goal is to have an open dataset of [r/{subreddit}](https://www.reddit.com/r/{subreddit}/) submissions. I'm leveraging PRAW and the Reddit API to download them.

There is a limit of 1000 results per API call and limited search functionality, so this is run {frequency} to collect new submissions.

## Creation Details
This dataset was created by [derek-thomas/dataset-creator-reddit-{subreddit}](https://huggingface.co/spaces/derek-thomas/dataset-creator-reddit-{subreddit})

## Update Frequency
The dataset is updated {frequency}. The most recent update was `{latest_hour_str}`, which added **{new_rows} new rows**.

## Licensing
[Reddit Licensing terms](https://www.redditinc.com/policies/data-api-terms) as accessed on October 25:
> The Content created with or submitted to our Services by Users (“User Content”) is owned by Users and not by Reddit. Subject to your complete and ongoing compliance with the Data API Terms, Reddit grants you a non-exclusive, non-transferable, non-sublicensable, and revocable license to copy and display the User Content using the Data API solely as necessary to develop, deploy, distribute, and run your App to your App Users. You may not modify the User Content except to format it for such display. You will comply with any requirements or restrictions imposed on usage of User Content by their respective owners, which may include "all rights reserved" notices, Creative Commons licenses, or other terms and conditions that may be agreed upon between you and the owners. Except as expressly permitted by this section, no other rights or licenses are granted or implied, including any right to use User Content for other purposes, such as for training a machine learning or AI model, without the express permission of rightsholders in the applicable User Content

My take is that you can't use this data for *training* without getting permission.

## Opt-out
To opt out of this dataset, please make a request in the Community tab.
"""

    append_readme(path=path, readme_text=readme_text)


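def append_readme(path, readme_text):
    """Write `readme_text` below the generated-section marker in the README at `path`.

    Anything above the marker is preserved; anything below it is replaced.
    """
    generated_below_marker = "--- Generated Part of README Below ---"
    # Read and write as UTF-8 explicitly: the README text contains non-ASCII quotes.
    with open(path, "r", encoding="utf-8") as file:
        content = file.read()

    if generated_below_marker in content:
        # Keep everything up to and including the marker, then swap in the new text.
        index = content.index(generated_below_marker) + len(generated_below_marker)
        content = content[:index] + "\n\n" + readme_text
    else:
        # First run: add the marker, then the generated text.
        content += "\n\n" + generated_below_marker + "\n\n" + readme_text + "\n"

    with open(path, "w", encoding="utf-8") as file:
        file.write(content)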
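

# --- Illustrative sketch, not part of the original module ---
# The marker handling in append_readme can be tried out without touching the
# disk. `merge_generated_section` is a hypothetical helper name introduced only
# for this example; it mirrors the string logic above, assuming that everything
# after the marker is meant to be regenerated on each run.
def merge_generated_section(
    content, readme_text, marker="--- Generated Part of README Below ---"
):
    """Return `content` with the text after `marker` replaced by `readme_text`."""
    if marker in content:
        # Keep the hand-written part and the marker, drop the old generated text.
        index = content.index(marker) + len(marker)
        return content[:index] + "\n\n" + readme_text
    # No marker yet: append the marker, then the generated text.
    return content + "\n\n" + marker + "\n\n" + readme_text + "\n"


# Usage: the first call adds the marker, the second replaces what follows it.
#   first = merge_generated_section("# My dataset", "generated v1")
#   second = merge_generated_section(first, "generated v2")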