Data Analysis of Reddit's /r/Politics
Using Praw to Access API Data from Reddit
Notebook Created by: David Rusho (Github Blog | Tableau | Linkedin)
What is Reddit?
Reddit is an American social news aggregation, web content rating, and discussion website. Registered members submit content to the site such as links, text posts, images, and videos, which are then voted up or down by other members.
Subreddits
Posts are organized by subject into user-created boards called "communities" or "subreddits", which cover a variety of topics such as news, politics, religion, science, movies, video games, music, books, sports, fitness, cooking, pets, and image-sharing.
Upvotes/Downvotes
Submissions with more up-votes appear towards the top of their subreddit and, if they receive enough up-votes, ultimately on the site's front page.
Subreddit Tabs
At the top of each page on Reddit, you will see a selection of tabs marked Hot, New, Rising, Controversial, Top, Gilded, and Wiki.
Hot posts are the posts that have been getting the most upvotes and comments recently on that subreddit. This is the tab that will be used for this notebook.
This notebook focuses on posts from the 'Hot' subreddit tab because of their emphasis on upvotes and recent comments. Data from /r/politics will be scraped using the Python library Praw. The analysis will identify the top posts in this subreddit and examine which factors beyond raw upvote and comment counts contributed to their ranking, such as the correlation between comments and score, word frequency, and sentiment analysis of post titles.
Correlation of Post Score and Number of Comments
A heatmap generated with Seaborn showed a strong positive correlation (0.89) between the number of comments and the score of a post.
Word Frequency of Post Titles
Word frequency showed that 'Biden' and 'Trump' were the most common keywords, followed by 'GOP'.
Sentiment Analysis
The majority of posts in /r/politics were found to be neutral, followed by negative.
!pip install praw
!pip install vaderSentiment
!pip install texthero
from configparser import ConfigParser
import datetime as dt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import praw
import seaborn as sns
import texthero as hero
from texthero import preprocessing
from texthero import stopwords
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import warnings
warnings.filterwarnings('ignore')
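The credential variables used in the next cell (`cid`, `csec`, `username`, `password`, `ua`) are never defined in the cells shown. One plausible setup, sketched here with the `ConfigParser` imported above (the section and key names are assumptions, and the values are placeholders):

```python
from configparser import ConfigParser

# In the real notebook this would read a local file, e.g. config.read('secrets.ini');
# the section/key names below are assumptions, and the values are placeholders.
SAMPLE = """
[reddit]
client_id = CID_PLACEHOLDER
client_secret = CSEC_PLACEHOLDER
username = USER_PLACEHOLDER
password = PASS_PLACEHOLDER
user_agent = script:politics-notebook:v1
"""

config = ConfigParser()
config.read_string(SAMPLE)

cid = config.get('reddit', 'client_id')        # personal use script id
csec = config.get('reddit', 'client_secret')   # secret token
username = config.get('reddit', 'username')    # profile username
password = config.get('reddit', 'password')    # profile password
ua = config.get('reddit', 'user_agent')        # user agent string
```

Keeping credentials in an untracked config file avoids committing secrets to the notebook itself.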
# praw setup
# praw setup (credential variables are assumed to be loaded beforehand,
# e.g. from a config file via ConfigParser)
reddit = praw.Reddit(client_id=cid,        # personal use script
                     client_secret=csec,   # secret token
                     username=username,    # profile username
                     password=password,    # profile password
                     user_agent=ua,        # user agent
                     check_for_async=False)
# list for df conversion
posts = []
# select a subreddit to scrape
sub = 'politics'
# fetch up to 500 posts from the 'Hot' tab
hot_posts = reddit.subreddit(sub).hot(limit=500)
# collect selected attributes from each post
for post in hot_posts:
posts.append([post.title,
post.selftext,
post.score,
post.upvote_ratio,
post.num_comments,
post.created_utc,
post.is_original_content,
post.url])
# create df, rename columns, and make dtype for all data a str
posts = pd.DataFrame(posts,
columns=['title',
'post',
'score',
'upvote_ratio',
'comments',
'created',
'original_content',
'url'],
dtype='str')
posts.sample(3)
Column Descriptions
Heading | Description |
---|---|
title | The title of the submission. |
post | The submissions’ selftext - an empty string if a link post. |
score | The number of upvotes for the submission. |
upvote_ratio | The percentage of upvotes from all votes on the submission. |
comments | The number of comments on the submission. |
created | Time the submission was created, represented in Unix Time. |
original_content | Whether or not the submission has been set as original content. |
url | The URL the submission links to, or the permalink if a selfpost. |
# convert the created column from Unix time to a datetime
posts['created'] = pd.to_datetime(posts['created'], unit='s')
posts['created'].head(1)
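For reference, `unit='s'` tells pandas to interpret the raw values as seconds since the Unix epoch:

```python
import pandas as pd

# 1612137600 seconds after 1970-01-01 00:00 UTC is 2021-02-01 00:00 UTC
ts = pd.to_datetime(1612137600, unit='s')
print(ts)  # 2021-02-01 00:00:00
```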
# change dtype of score and comments cols to int
posts[['comments','score']] = posts[['comments','score']].astype('int')
posts['upvote_ratio'] = posts['upvote_ratio'].astype('float')
posts.info()
# clean post titles using texthero
posts['clean_title'] = hero.clean(posts['title'])
posts['clean_title'].sample(3)
# top posts based on score
top_posts = posts.groupby('title')[['score', 'upvote_ratio']].sum().sort_values(by='score', ascending=False).reset_index()
top_posts[['score','upvote_ratio','title']].head(3)
# Word cloud of top words from clean_title
hero.wordcloud(posts.clean_title,
               max_words=200,
               background_color='white',
               colormap='Blues',  # cmaps was undefined here; 'Blues' is an assumed stand-in
               height=500, width=800)
# create new dataframe of top words
tw = hero.visualization.top_words(posts['clean_title']).head(20).to_frame()
tw.reset_index(inplace=True)
tw.rename(columns={'index':'word','clean_title':'freq'},inplace=True)
# remove words shorter than 2 characters
tw2 = tw[tw['word'].str.len() >= 2]
tw2 = tw2.sort_values(by='freq',ascending=False)
tw2.head(3)
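The chart cells below reference `wtbckgnd`, `mcolors`, and `today`, which are defined outside the cells shown; a minimal stand-in (names kept, exact values assumed) could be:

```python
import datetime as dt

# shared plot settings used by the update_layout / update_traces calls below
wtbckgnd = dict(plot_bgcolor='white', paper_bgcolor='white')  # white chart background
mcolors = '#97C3E1'                                           # light-blue markers
today = dt.date.today().strftime('%m/%d/%Y')                  # date shown in chart titles
```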
# Top 20 Words From Post Titles
fig = go.Figure([go.Bar(x=tw2.word,
y=tw2.freq,
textposition='auto')])
fig.update_layout(wtbckgnd, # set background to white
                  title={'text': f'Top 20 Words in /r/politics Post Titles ({today})',
                         'y': 0.88, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
                  yaxis=dict(title='Word Count'))
fig.update_traces(marker_color=mcolors) # set marker colors to light blue
fig.show()
# Post Scores vs Comments
fig = go.Figure(data=go.Scatter(x=posts.comments,
y=posts.score,
mode='markers',
text=posts.title)) # hover text goes here
fig.update_layout(wtbckgnd, # set background to white
                  title={'text': f"/r/politics Posts' Scores vs Comments ({today})",
                         'y': 0.88, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
                  xaxis_title="No. of Comments", yaxis_title="Post Score")
fig.update_traces(marker_color=mcolors) # set marker colors to light blue
fig.show()
fig = px.histogram(posts, x="score")
fig.update_layout(wtbckgnd, # set background to white
                  title={'text': 'Post Scores by Post Count',
                         'y': 0.88, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
                  yaxis=dict(title='Post Count'),
                  xaxis=dict(title='Post Score'))
fig.update_traces(marker_color=mcolors) # set marker colors to light blue
fig.show()
#Sentiment Analysis of Post Titles
analyzer = SentimentIntensityAnalyzer()
posts['neg'] = posts['title'].apply(lambda x:analyzer.polarity_scores(x)['neg'])
posts['neu'] = posts['title'].apply(lambda x:analyzer.polarity_scores(x)['neu'])
posts['pos'] = posts['title'].apply(lambda x:analyzer.polarity_scores(x)['pos'])
posts['compound'] = posts['title'].apply(lambda x:analyzer.polarity_scores(x)['compound'])
posts[['title','neg','neu','pos','compound']].sample(3)
# sentiment col
def sentiment(compscore):
if compscore >= 0.05:
return 'positive'
elif -0.05 < compscore < 0.05:
return 'neutral'
elif compscore <=-0.05:
return 'negative'
posts['sentiment'] = posts.compound.apply(sentiment)
posts[['title','neg','neu','pos','compound','sentiment']].sample(3)
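The cutoffs above follow VADER's recommended ±0.05 convention for labeling compound scores; a standalone check of the boundary behavior (the helper is restated here so this cell runs on its own):

```python
def label_sentiment(compscore):
    """Same mapping as sentiment() above: >= 0.05 positive, <= -0.05 negative, else neutral."""
    if compscore >= 0.05:
        return 'positive'
    elif -0.05 < compscore < 0.05:
        return 'neutral'
    return 'negative'

# the boundary values themselves fall into the positive/negative buckets
print([label_sentiment(s) for s in (0.05, 0.0, -0.05)])  # ['positive', 'neutral', 'negative']
```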
# posts.sentiment.value_counts().to_frame().reset_index()
fig = px.histogram(posts, x="compound", color="sentiment",
# color_discrete_sequence= px.colors.sequential.Blues
color_discrete_sequence=["#1f77b4",
"#97C3E1",
"#559ACA"])
fig.update_layout(wtbckgnd, #set background to white
title={'text': f"Sentiment of /r/politics Posts ({today})",
'y':0.95,'x':0.5,'xanchor': 'center','yanchor': 'top'},
xaxis_title="Compound Score", yaxis_title="No. of Posts",)
# fig.update_traces(marker_color=mcolors) # set marker colors to light blue
fig.show()
# Post Scores vs Compound Sentiment Score
fig = go.Figure(data=go.Scatter(x=posts.compound,
y=posts.score,
mode='markers',
text=posts.title)) # hover text goes here
fig.update_layout(wtbckgnd, # set background to white
                  title={'text': "/r/politics Posts' Scores vs Compound Sentiment Score",
                         'y': 0.88, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
                  xaxis_title="Compound Sentiment Score",
                  yaxis_title="Post Score")
fig.update_traces(marker_color=mcolors) # set marker colors to light blue
fig.show()
# Heatmap of Dataframe
mask = np.triu(np.ones_like(posts.corr(), dtype=bool)) # mask the upper triangle
mask = mask[1:, :-1]
corr = posts.corr().iloc[1:, :-1].copy() # trim the redundant row/column
fig, ax = plt.subplots(figsize=(11, 9))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap='Blues',
vmin=-1, vmax=1, cbar_kws={"shrink": .8})# yticks
plt.yticks(rotation=0)
plt.show()
Correlation of Post Score and Number of Comments
A heatmap generated with Seaborn showed a strong positive correlation (0.89) between the number of comments and the score of a post.
Word Frequency of Post Titles
Word frequency showed that 'Biden' and 'Trump' were the most common keywords, followed by 'GOP'.
Sentiment Analysis
The majority of posts in /r/politics were found to be neutral, followed by negative.
1. Pandas
2. Plotly
3. Praw (Reddit API tool)
4. Texthero (NLP tool)