Notebook Created by: David Rusho (GitHub Blog | Tableau | LinkedIn)

Introduction

About the Data

What is Reddit?

Reddit is an American social news aggregation, web content rating, and discussion website. Registered members submit content to the site such as links, text posts, images, and videos, which are then voted up or down by other members.

Subreddits

Posts are organized by subject into user-created boards called "communities" or "subreddits", which cover a variety of topics such as news, politics, religion, science, movies, video games, music, books, sports, fitness, cooking, pets, and image-sharing.

Upvotes/Downvotes

Submissions with more upvotes appear towards the top of their subreddit and, if they receive enough upvotes, ultimately on the site's front page.

Subreddit Tabs

At the top of each page on Reddit, you will see a selection of tabs marked Hot, New, Rising, Controversial, Top, Gilded, and Wiki.

Hot posts are the posts that have been getting the most upvotes and comments recently on that subreddit. This is the tab that will be used for this notebook.

Project Goals

This notebook focuses on posts from the 'Hot' subreddit tab because that tab ranks posts by recent upvotes and comments. Data from /r/politics is scraped using the Python library PRAW. The analysis identifies the top posts in this subreddit and looks at factors related to their ranking beyond raw upvotes and comments, such as the correlation between comments and score, word frequency, and sentiment analysis of post titles.

Summary of Results

Correlation of Post Score and Number of Comments

A Seaborn heatmap showed a strong positive correlation (0.89) between a post's number of comments and its score.

Word Frequency of Post Titles

Word frequency showed that 'Biden' and 'Trump' were the most frequent keywords in post titles, followed by 'GOP'.

Sentiment Analysis

The majority of posts in /r/politics were found to be neutral, followed by negative.

Data Collection and Cleaning

Import Libraries

!pip install praw
!pip install vaderSentiment
!pip install texthero

from configparser import ConfigParser 
import datetime as dt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import praw
import seaborn as sns
import texthero as herofrom 
from texthero import preprocessing
from texthero import stopwords
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import warnings
warnings.filterwarnings('ignore')

Praw (Reddit API) Setup

# praw setup
reddit = praw.Reddit(client_id = cid, #personal use script
                    client_secret = csec, #secret token
                    username = username, #profile username
                    password = password, #profile password
                    user_agent = ua, #user agent 
                    check_for_async=False)
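
The setup above refers to cid, csec, username, password, and ua without showing where they come from. Because the notebook imports ConfigParser, one plausible approach is to read them from a local INI file. The sketch below assumes a file named config.ini with a [reddit] section; the file name, section name, key names, and the helper load_reddit_credentials are all hypothetical and not taken from the original notebook.

```python
from configparser import ConfigParser


def load_reddit_credentials(path="config.ini", section="reddit"):
    """Read Reddit API credentials from an INI file (hypothetical layout)."""
    parser = ConfigParser()
    # ConfigParser.read returns the list of files it managed to parse
    if not parser.read(path):
        raise FileNotFoundError(f"could not read config file: {path}")
    creds = parser[section]
    # Map the assumed INI keys onto the variable names used in the setup cell
    return {
        "cid": creds["client_id"],
        "csec": creds["client_secret"],
        "username": creds["username"],
        "password": creds["password"],
        "ua": creds["user_agent"],
    }
```

With such a file in place, creds = load_reddit_credentials() would supply the values for the setup cell, for example cid = creds["cid"]. Keeping the file out of version control avoids publishing the secret token.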

Organize and Clean Data

Scrape 500 Reddit Posts from the /r/politics 'Hot' Tab

# list for df conversion
posts = [] 

# select a subreddit to scrape
sub = 'politics' 

# return up to 500 posts from the 'Hot' tab
hot_posts = reddit.subreddit(sub).hot(limit=500) 

# return selected reddit post attributes
for post in hot_posts:
    posts.append([post.title, 
                  post.selftext, 
                  post.score, 
                  post.upvote_ratio,
                  post.num_comments, 
                  post.created_utc,
                  post.is_original_content,
                  post.url]) 

# create df, rename columns, and make dtype for all data a str
posts = pd.DataFrame(posts,
                     columns=['title', 
                              'post', 
                              'score', 
                              'upvote_ratio',
                              'comments', 
                              'created',
                              'original_content',
                              'url'],
                     dtype='str')

posts.sample(3)

Column Descriptions

Heading	Description
title	The title of the submission.
post	The submission's selftext; an empty string if a link post.
score	The number of upvotes for the submission.
upvote_ratio	The percentage of upvotes from all votes on the submission.
comments	The number of comments on the submission.
created	Time the submission was created, represented in Unix Time.
original_content	Whether or not the submission has been set as original content.
url	The URL the submission links to, or the permalink if a selfpost.

Change 'created' Column Dtype to datetime

# convert 'created' from Unix time (seconds, stored as str) to datetime
posts['created'] = pd.to_datetime(posts['created'].astype('float'), unit='s')
posts['created'].head(1)

0   2021-07-05 16:00:02
Name: created, dtype: datetime64[ns]
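
Reddit's created_utc value is a Unix timestamp in seconds, which is why unit='s' is passed above. As a quick cross-check using only the standard library, the sketch below converts one timestamp taken from the sample rows shown in this notebook; the variable names are illustrative.

```python
import datetime as dt

created_utc = 1625166787.0  # value taken from the sample rows in this notebook

# Interpret the value as seconds since 1970-01-01 00:00:00 UTC
created = dt.datetime.fromtimestamp(created_utc, tz=dt.timezone.utc)
print(created.isoformat())  # 2021-07-01T19:13:07+00:00
```

pd.to_datetime with unit='s' yields the same wall-clock time, but as a timezone-naive value that should be read as UTC.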

Show Dataframe Dtypes

# change dtype of score and comments cols to int, and upvote_ratio to float
posts[['comments','score']] = posts[['comments','score']].astype('int')
posts['upvote_ratio'] = posts['upvote_ratio'].astype('float')

posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   title             500 non-null    object        
 1   post              500 non-null    object        
 2   score             500 non-null    int64         
 3   upvote_ratio      500 non-null    float64       
 4   comments          500 non-null    int64         
 5   created           500 non-null    datetime64[ns]
 6   original_content  500 non-null    object        
 7   url               500 non-null    object        
dtypes: datetime64[ns](1), float64(1), int64(2), object(4)
memory usage: 31.4+ KB

Clean Post Titles (NLP Preprocessing)

#Clean post titles using texthero
posts['clean_title'] = herofrom.clean(posts['title'])
posts['clean_title'].sample(3)

497    nancy pelosi signals hard line formation janua...
430    foreign media skewer joe biden 'barely cogent ...
281    biden administration freezes u assets myanmar ...
Name: clean_title, dtype: object

Data Exploration
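
Several plotting cells in this section use the names today, wtbckgnd, mcolors, and cmaps, whose definitions are not shown in this notebook. The values below are assumptions chosen to match the inline comments (white background, light blue markers); the original values may differ.

```python
import datetime as dt

# All four values are illustrative assumptions, not the notebook's originals.
today = dt.date.today().strftime("%Y-%m-%d")  # date string used in plot titles

# Layout dict passed positionally to fig.update_layout(): white background
wtbckgnd = {"plot_bgcolor": "white", "paper_bgcolor": "white"}

mcolors = "lightblue"  # marker/bar color, per the "light blue" comments
cmaps = "Blues"        # Matplotlib colormap name for the word cloud
```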

Word Frequency of Post Titles (Wordcloud)

# Word cloud of top words from clean_title
herofrom.wordcloud(posts.clean_title,
                   max_words=200,
                   contour_color='', 
                   background_color='white',
                   colormap=cmaps,
                   height = 500, width=800)

Word Frequency of Post Titles (Bar Plot)

# Top 25 Words From Post Titles 

fig = go.Figure([go.Bar(x=tw2.word, 
                        y=tw2.freq,
                        textposition='auto')])

fig.update_layout(wtbckgnd, #set background to white
                  title={'text': f'Top 25 Words in /r/politics Post Titles ({today})',
                  'y':0.88,'x':0.5,'xanchor': 'center','yanchor': 'top'},
                  yaxis=dict(title='Word Count'))

fig.update_traces(marker_color=mcolors) #set marker colors to light blue

fig.show()

Post Scores vs Comments (Scatter Plot)

# Post Scores vs Comments 

fig = go.Figure(data=go.Scatter(x=posts.comments,
                                y=posts.score,
                                mode='markers',
                                text=posts.title))  # hover text goes here 
                              
fig.update_layout(wtbckgnd, #set background to white
                  title={'text': f"/r/politics Posts' Scores vs Comments ({today})", 
                         'y':0.88,'x':0.5,'xanchor': 'center','yanchor': 'top'},
                  xaxis_title="No. of Comments", yaxis_title="Post Score",)

fig.update_traces(marker_color=mcolors) #set marker colors to light blue

fig.show()

Post Scores by Post Counts (Histogram Plot)

fig = px.histogram(posts, x="score")

fig.update_layout(wtbckgnd, #set background to white
                  title={'text': 'Post Scores by Post Counts',
                  'y':0.88,'x':0.5,'xanchor': 'center','yanchor': 'top'},
                  yaxis=dict(title='Post Count'),
                  xaxis=dict(title='Post Score'))

fig.update_traces(marker_color=mcolors) #set marker colors to light blue

fig.show()

Sentiment Analysis of Post Titles

Scale for determining sentiment

positive: compound score>=0.05
neutral: compound score between -0.05 and 0.05
negative: compound score<=-0.05

#Sentiment Analysis of Post Titles
analyzer = SentimentIntensityAnalyzer()

# score each title once, then copy each component into its own column
scores = posts['title'].apply(analyzer.polarity_scores)
for key in ['neg', 'neu', 'pos', 'compound']:
    posts[key] = scores.apply(lambda s: s[key])

posts[['title','neg','neu','pos','compound']].sample(3)

Create Sentiment Column Using Compound Numbers

# sentiment col
def sentiment(compscore):
  if compscore >= 0.05:
    return 'positive'
  elif  -0.05 < compscore < 0.05: 
    return 'neutral'
  elif compscore <=-0.05:
    return 'negative'

posts['sentiment'] = posts.compound.apply(sentiment)
posts[['title','neg','neu','pos','compound','sentiment']].sample(3)
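
To see how the thresholds behave, the sketch below applies the same rule to the six compound scores visible in the sample outputs of this notebook and tallies the labels. The helper label_sentiment is a restatement of the function above so the example is self-contained. The six scores are only the sampled rows, so the tally says nothing about the full dataset.

```python
from collections import Counter


def label_sentiment(compound):
    """Map a VADER compound score to a label using the scale above."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


# Compound scores copied from the sample rows shown in this notebook
sample_scores = [-0.3612, 0.3818, 0.1280, 0.0000, 0.0258, -0.4939]

counts = Counter(label_sentiment(c) for c in sample_scores)
print(counts)
```

On the full dataframe, posts['sentiment'].value_counts() gives the same kind of tally.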

Sentiment of Post Titles (Histogram Plot)

# posts.sentiment.value_counts().to_frame().reset_index()

fig = px.histogram(posts, x="compound", color="sentiment",
                  #  color_discrete_sequence= px.colors.sequential.Blues
                   color_discrete_sequence=["#1f77b4",
                                            "#97C3E1",
                                            "#559ACA"])


fig.update_layout(wtbckgnd, #set background to white
                  title={'text': f"Sentiment of /r/politics Posts ({today})", 
                         'y':0.95,'x':0.5,'xanchor': 'center','yanchor': 'top'},
                  xaxis_title="Compound Score", yaxis_title="No. of Posts",)

# fig.update_traces(marker_color=mcolors) #set marker colors to light blue

Post Scores vs Compound Sentiment Score (Scatter Plot)

# Post Scores vs Compound Sentiment Score

fig = go.Figure(data=go.Scatter(x=posts.compound,
                                y=posts.score,
                                mode='markers',
                                text=posts.title))  # hover text goes here 
                              
fig.update_layout(wtbckgnd, #set background to white
                  title={'text': "/r/politics Posts' Scores vs Compound Sentiment Score",
                         'y':0.88,'x':0.5,'xanchor': 'center','yanchor': 'top'},
                  xaxis_title="Compound Sentiment Score", 
                  yaxis_title="Scores",)

fig.update_traces(marker_color=mcolors) #set marker colors to light blue

fig.show()

Correlation of Dataframe (Heatmap)

Note: Plotly currently doesn't have a simple solution for using dataframes directly with heatmaps, so Seaborn is used here.

# Heatmap of Dataframe
# correlation matrix of the numeric columns only
corr_full = posts.select_dtypes('number').corr()

# mask the upper triangle, then drop the first row and last column
mask = np.triu(np.ones_like(corr_full, dtype=bool))
mask = mask[1:, :-1]
corr = corr_full.iloc[1:, :-1].copy()

# plot heatmap
fig, ax = plt.subplots(figsize=(11, 9))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap='Blues',
            vmin=-1, vmax=1, cbar_kws={"shrink": .8})
plt.yticks(rotation=0)
plt.show()
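
The annotated values in the heatmap are Pearson correlation coefficients, which is the default method of DataFrame.corr(). As a sketch of what that number measures, the function below computes it by hand for score and comments. It uses only the three sample rows printed in this notebook, so the result will not match the 0.89 reported for the full dataset. The function name pearson is illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Covariance numerator and the two sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# score and comments from the three sampled rows shown in this notebook
scores = [1563, 147, 3784]
comments = [107, 1, 245]

r = pearson(scores, comments)
print(round(r, 2))
```

A value near 1 means posts with higher scores also tend to have more comments; it does not say which one drives the other.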

Conclusion

Correlation of Post Score and Number of Comments

A Seaborn heatmap showed a strong positive correlation (0.89) between a post's number of comments and its score.

Word Frequency of Post Titles

Word frequency showed that the names of Presidents Biden and Trump were the most frequent keywords in post titles, followed by 'GOP'.

Sentiment Analysis

The majority of posts in /r/politics were found to be neutral, followed by negative.

Resources

Tools Used

Pandas
Plotly
Praw (Reddit API tool)
Texthero (NLP tool)

	title	score	upvote_ratio	comments	created	original_content	url
428	A judge blocked Florida Gov. Ron DeSantis' 'de...	1563	0.98	107	1625166787.0	False	https://www.businessinsider.com/florida-ron-de...
483	Garland orders halt to any further federal exe...	147	0.92	1	1625182268.0	False	https://abcnews.go.com/Politics/garland-orders...
218	Biden administration formally launches effort ...	3784	0.98	245	1625270143.0	False	https://www.inquirer.com/news/nation-world/bid...

	score	upvote_ratio	title
0	59394	0.82	Charles Booker makes it official, announces ru...
1	56462	0.89	Dominion has subpoenaed Rudy Giuliani, Sidney ...
2	51924	1.83	Biden says teachers deserve ‘a raise, not just...

	title	neg	neu	pos	compound
392	Biden struggles to answer Russia question at p...	0.200	0.800	0.000	-0.3612
354	Child tax credit checks will start arriving th...	0.000	0.794	0.206	0.3818
271	Trump under fire for provocative email to supp...	0.147	0.675	0.178	0.1280

	title	neg	neu	pos	compound	sentiment
126	Op-Ed: What does it mean to be American? Ask a...	0.000	1.000	0.000	0.0000	neutral
11	Want Better Policing? Make It Easier To Fire B...	0.330	0.279	0.391	0.0258	neutral
176	They kept the wheels on democracy as Trump tri...	0.158	0.842	0.000	-0.4939	negative

Data Analysis of Reddit's /r/Politics


	word	freq
0	biden	85
1	trump	67
2	gop	43
