⛏ Subreddit text downloader

🔖 Introduction

Download all the text comments from a subreddit

🔗 Github: pistocop/subreddit-comments-dl

Reddit is a great place to gather large numbers of user comments about specific topics.
This makes it very attractive for NLP tasks, e.g. sentiment analysis of specific products or political topics.

I therefore built this scraper/tool, which downloads the text comments from specific subreddits.

ℹ Update 20 Feb 2021

I wrote a more detailed article for Towards Data Science on Medium; you can read it here.


🚀 Usage

Basic usage to download submissions and their comments from the
AskReddit and News subreddits:

# Install the dependencies
pip install -r requirements.txt

# Download the AskReddit comments of the last 30 submissions
python src/subreddit_downloader.py AskReddit --batch-size 10 --laps 3 --reddit-id <reddit_id> --reddit-secret <reddit_secret> --reddit-username <reddit_username>

# Download the News comments after 1 January 2021
python src/subreddit_downloader.py News --batch-size 512 --laps 3 --reddit-id <reddit_id> --reddit-secret <reddit_secret> --reddit-username <reddit_username> --utc-after 1609459200

# Build the dataset and check the results under `./dataset/` path
python src/dataset_builder.py 
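
The `--utc-after` value in the News example is a Unix timestamp. Below is a minimal sketch of how such a value can be derived from a calendar date with Python's standard library, assuming the date is meant as midnight UTC. The helper name `to_utc_timestamp` is hypothetical and not part of the repository.

```python
from datetime import datetime, timezone


def to_utc_timestamp(year: int, month: int, day: int) -> int:
    """Return the Unix timestamp (in seconds) of midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


# 1 January 2021, the value passed to --utc-after in the News example
print(to_utc_timestamp(2021, 1, 1))  # 1609459200
```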

ℹ Where can I get the Reddit parameters?

  • These are the parameters indicated with <...> in the previous commands

  • Official Reddit guide

  • TL;DR: read this Stack Overflow post

    | Parameter name  | Description                                | How to get it                | Example value                  |
    |-----------------|--------------------------------------------|------------------------------|--------------------------------|
    | reddit_id       | The Client ID generated from the apps page | Official guide               | 40oK80pF8ac3Cn                 |
    | reddit_secret   | The secret generated from the apps page    | Copy the value as shown here | 9KEUOE7pi8dsjs9507asdeurowGCcg |
    | reddit_username | The reddit account name                    | The name you use to log in   | pistoSniffer                   |
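
As a rough illustration of where these three values end up, the sketch below maps them onto the keyword arguments of `praw.Reddit`. It is an assumption about how one might wire them together, not the repository's actual code: the helper names and the user-agent string are hypothetical, and the values are the example placeholders listed above.

```python
def build_praw_kwargs(reddit_id: str, reddit_secret: str, reddit_username: str) -> dict:
    """Map the three CLI parameters onto praw.Reddit keyword arguments."""
    return {
        "client_id": reddit_id,
        "client_secret": reddit_secret,
        # Reddit asks for a descriptive user agent that names the account;
        # the app name and version used here are made up for illustration.
        "user_agent": f"python:subreddit-comments-dl:v0 (by /u/{reddit_username})",
    }


def make_read_only_client(reddit_id: str, reddit_secret: str, reddit_username: str):
    """Create a read-only praw client (requires `pip install praw`)."""
    import praw  # imported lazily so build_praw_kwargs works without praw installed

    return praw.Reddit(**build_praw_kwargs(reddit_id, reddit_secret, reddit_username))


kwargs = build_praw_kwargs("40oK80pF8ac3Cn", "9KEUOE7pi8dsjs9507asdeurowGCcg", "pistoSniffer")
print(kwargs["user_agent"])
```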

📖 Glossary

  • subreddit: a section of the reddit website focused on a particular topic
  • submission: a post that appears in a subreddit. When you open a subreddit page, the posts you see are submissions. Each submission has a tree of comments.
  • comment: text written by a reddit user under a submission in a subreddit
    • The main goal of this repository is to gather the comments belonging to a subreddit
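
To make the "tree of comments" concrete, here is a small self-contained sketch that models a submission with nested replies and flattens it into a list of comment texts. The `Comment` and `Submission` classes are hypothetical stand-ins for illustration, not the objects returned by the reddit API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Comment:
    body: str
    replies: List["Comment"] = field(default_factory=list)


@dataclass
class Submission:
    title: str
    comments: List[Comment] = field(default_factory=list)


def iter_comments(comments: List[Comment]) -> Iterator[Comment]:
    """Yield every comment in the tree, depth-first, parents before replies."""
    for comment in comments:
        yield comment
        yield from iter_comments(comment.replies)


submission = Submission(
    title="Example post",
    comments=[
        Comment("top-level A", [Comment("reply to A", [Comment("nested reply")])]),
        Comment("top-level B"),
    ],
)

bodies = [c.body for c in iter_comments(submission.comments)]
print(bodies)
# ['top-level A', 'reply to A', 'nested reply', 'top-level B']
```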

✍ Notes

  • Under the hood, the script uses pushshift to gather the submission IDs
    and praw to collect the comments of each submission
    • With this approach we request less data from pushshift
    • Because the praw API is used, reddit credentials are required
  • More info about the subreddit_downloader.py script is available through its --help option.
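
The two-stage approach described in these notes (submission IDs first, comments second) can be sketched as follows. This is a sketch under assumptions, not the repository's code: `collect_comments` and `fetch_with_praw` are hypothetical names, and the pushshift step is represented only by a plain list of IDs. The praw calls used (`reddit.submission`, `comments.replace_more`, `comments.list`) are part of praw's public API.

```python
from typing import Callable, Dict, Iterable, List


def collect_comments(
    submission_ids: Iterable[str],
    fetch_comments: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Stage 2: look up the comments of each submission ID found in stage 1."""
    return {sid: fetch_comments(sid) for sid in submission_ids}


def fetch_with_praw(reddit, submission_id: str) -> List[str]:
    """Fetch all comment bodies of one submission through a praw client."""
    submission = reddit.submission(id=submission_id)
    submission.comments.replace_more(limit=None)  # expand "load more comments" stubs
    return [comment.body for comment in submission.comments.list()]


# Offline demo: a fake fetcher stands in for praw, and the ID list stands in
# for what stage 1 (pushshift) would return.
fake_store = {"abc123": ["first comment", "second comment"], "def456": []}
result = collect_comments(["abc123", "def456"], lambda sid: fake_store[sid])
print(result)
```

With a real praw client, the fetcher could be passed as `lambda sid: fetch_with_praw(reddit, sid)`.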

🙏 Technologies

Every adventure brings some new discoveries. These are the technologies I used:

  • plac - a very handy pip package to manage the script arguments
    • Discarded because it can’t handle Python 3 type hints
  • typer - from the author of FastAPI, a script arguments manager based on the Python 3 typing system
  • praw - the Python Reddit API Wrapper, a Python client for the official reddit API
  • loguru - logging manager
  • PushshiftAPI - unofficial reddit API
  • PrettyErrors - prettifies Python exception output to make it legible
  • PyCharm - why mention this IDE? Because after an “incident” with git from the CLI, it restored my deleted file thanks to its internal history ♥