Download all the text comments from a subreddit
🔗 Github: pistocop/subreddit-comments-dl
Reddit is a great place to gather a large number of user comments about specific topics.
This makes it very attractive for NLP tasks, e.g. sentiment analysis on specific products or political topics.
ℹ Update 20 Feb 2021
I wrote a better article for Towards Data Science on Medium; you can read it here.
    # Install the dependencies
    pip install -r requirements.txt

    # Download the AskReddit comments of the last 30 submissions
    python src/subreddit_downloader.py AskReddit --batch-size 10 --laps 3 --reddit-id <reddit_id> --reddit-secret <reddit_secret> --reddit-username <reddit_username>

    # Download the News comments after 1 January 2021
    python src/subreddit_downloader.py News --batch-size 512 --laps 3 --reddit-id <reddit_id> --reddit-secret <reddit_secret> --reddit-username <reddit_username> --utc-after 1609459200

    # Build the dataset and check the results under `./dataset/` path
    python src/dataset_builder.py
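The `--utc-after` value in the News example is a Unix timestamp in seconds. As a minimal sketch (the helper name `to_utc_epoch` is just an illustration, not part of the repository), this is one way to compute it for any date:

```python
from datetime import datetime, timezone


def to_utc_epoch(year: int, month: int, day: int) -> int:
    """Return the Unix timestamp (in seconds) for midnight UTC of the given date."""
    moment = datetime(year, month, day, tzinfo=timezone.utc)
    return int(moment.timestamp())


if __name__ == "__main__":
    # 1 January 2021, the date used in the News example
    print(to_utc_epoch(2021, 1, 1))  # → 1609459200
```

Setting `tzinfo=timezone.utc` explicitly matters: a naive `datetime` would be interpreted in the local timezone and could shift the result by a few hours.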
ℹ Where can I get the Reddit parameters?
Parameters indicated with `<...>` in the previous script:
Official Reddit guide
TLDR: read this stack overflow
| Parameter name | Description | How to get it | Example value |
|---|---|---|---|
| `reddit_id` | The Client ID generated from the apps page | Official guide | 40oK80pF8ac3Cn |
| `reddit_secret` | The secret generated from the apps page | Copy the value as shown here | 9KEUOE7pi8dsjs9507asdeurowGCcg |
| `reddit_username` | The Reddit account name | The name you use to log in | pistoSniffer |
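To avoid typing the three values on the command line each time, one option is to keep them in environment variables and build the command from there. The sketch below is only an illustration: the function `build_download_command` and the variable names `REDDIT_ID`, `REDDIT_SECRET` and `REDDIT_USERNAME` are assumptions, not something the repository defines. Only the flags come from the commands shown earlier.

```python
import os
import sys


def build_download_command(subreddit, batch_size, laps, env=None):
    """Build the argument list for subreddit_downloader.py from environment variables."""
    env = os.environ if env is None else env
    # Hypothetical variable names: pick whatever fits your setup.
    required = {
        "--reddit-id": "REDDIT_ID",
        "--reddit-secret": "REDDIT_SECRET",
        "--reddit-username": "REDDIT_USERNAME",
    }
    missing = [name for name in required.values() if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))

    command = [
        sys.executable, "src/subreddit_downloader.py", subreddit,
        "--batch-size", str(batch_size),
        "--laps", str(laps),
    ]
    for flag, name in required.items():
        command += [flag, env[name]]
    return command
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root, for example with `build_download_command("AskReddit", 10, 3)`.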
- subreddit: a section of the Reddit website focused on a particular topic
- submission: a post that appears in a subreddit. When you open a subreddit page, the posts you see are submissions. Each submission has a tree of comments.
- comment: text written by a Reddit user under a submission inside a subreddit
- The main goal of this repository is to gather the comments that belong to a subreddit
- Under the hood, the script uses pushshift to gather the submission ids, and praw to collect each submission's comments
- More info about the `subreddit_downloader.py` script under the
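The two-step flow from the notes above (first gather submission ids, then collect the comments of each submission) can be sketched roughly as follows. This is not the repository's code: `collect_comments`, `fetch_ids` and `fetch_comments` are hypothetical names, the two fetchers are plain callables standing in for the real API clients, and paging by creation timestamp is an assumption made for the illustration.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical signatures for the two steps:
#   fetch_ids(batch_size, before_utc) -> [(submission_id, created_utc), ...]
#   fetch_comments(submission_id)     -> iterable of comment texts
FetchIds = Callable[[int, Optional[int]], List[Tuple[str, int]]]
FetchComments = Callable[[str], Iterable[str]]


def collect_comments(fetch_ids: FetchIds, fetch_comments: FetchComments,
                     batch_size: int, laps: int) -> List[str]:
    """Run `laps` rounds, each asking for up to `batch_size` submissions."""
    comments: List[str] = []
    before: Optional[int] = None  # None means "start from the newest submissions"
    for _ in range(laps):
        batch = fetch_ids(batch_size, before)
        if not batch:
            break  # nothing older is left
        for submission_id, _created in batch:
            comments.extend(fetch_comments(submission_id))
        # Assumption: the next lap continues with submissions older than this batch.
        before = min(created for _, created in batch)
    return comments
```

With `batch_size=10` and `laps=3` this visits at most 30 submissions, which matches the AskReddit example.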
Each adventure brings some new discoveries with it. These are the technologies I used:
- plac - a very handy pip package to manage the script arguments
  - Discarded because it can’t handle python3 typing
- typer - from the author of FastAPI, a script arguments manager based on the python3 typing system
- praw - Python wrapper for the official Reddit API
- loguru - logging manager
- PushshiftAPI - unofficial reddit API
- PrettyErrors - prettifies Python exception output to make it legible
- PyCharm - why mention this IDE? Because after an “incident” with git from the CLI, it restored my deleted file thanks to its internal history ♥