Introduction


About the Data

This Data comes from the Florida Department of Health. They release several reports throughout the years that list the number of positive covid-19 cases by school. The types of schools include Elementary, Middle, High School, Charter/Prepatory Schools, and University/Colleges. The data is presented on a pdf, is in wide format, and the school types are note labeled.

Also, the version of Tableau that is being used is Tableau Public. This means that the data that is exported will need to be in an excel format.

Data Cleaning


Import Data

import pandas as pd

#import pdf and filter cols/rows
df = pd.read_csv('schools_latest.csv',usecols=[0,1,12,13,14,15,16,17,18,19],skiprows=5)
df.sample(2)

School County Unnamed: 12 Students.1 Teachers.1 Staff.1 Unknown.2 Yes.1 No.1 Unknown.3
4230 NaN NaN NaN NaN Cumulative (Sep 6 - May 22) NaN NaN NaN NaN NaN
6209 THE WEBSTER SCHOOL St. Johns 25 14 2 5 4 17 8 0

Rename Columns

#rename columns
df.columns=['School','County','Total_Cases','Students','Teachers','Staff','Unknown','Symptoms_Yes','Symptoms_No','Symptoms_Unknown']
df.sample(2)

School County Total_Cases Students Teachers Staff Unknown Symptoms_Yes Symptoms_No Symptoms_Unknown
2900 BOYETTE SPRINGS ELEMENTARY SCHOOL Hillsborough 24 15 3 4 2 15 3 6
3641 SIX MILE CHARTER ACADEMY Lee 17 15 1 0 1 7 7 3

Remove nan values from 'School' col

Remove nan values from 'County' col

# new df with nan values removed
df2 = df1[df1['County'].isna()==False]
df2.sample(3)

School County Total_Cases Students Teachers Staff Unknown Symptoms_Yes Symptoms_No Symptoms_Unknown
995 COCONUT GROVE MONTESSORI SCHOOL Dade 4 2 0 0 2 1 2 1
3805 MIAMI NORLAND SENIOR HIGH SCHOOL Dade 27 16 2 2 7 19 8 0
2619 HENRY S. WEST LABORATORY SCHOOL Dade 6 5 0 0 1 3 2 1

df2.isna().sum()

School                0
County                0
Total_Cases         135
Students              0
Teachers              0
Staff                 0
Unknown               0
Symptoms_Yes          0
Symptoms_No           0
Symptoms_Unknown      0
dtype: int64

Remove nan values from 'Total Cases' col

df3 = df2[df2.Total_Cases.isna()==False]
df3.sample(3)
School County Total_Cases Students Teachers Staff Unknown Symptoms_Yes Symptoms_No Symptoms_Unknown
4686 PORT CHARLOTTE MIDDLE SCHOOL Charlotte 34 28 2 0 4 25 9 0
5681 SURGE CHRISTIAN ACADEMY Pinellas 4 4 0 0 0 3 1 0
6204 VALRICO LAKE ADVANTAGE ACADEMY Hillsborough 15 10 3 0 2 7 0 8

Dataframe contains no nan values.

#df is balanced with nan values removed
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6064 entries, 0 to 6603
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   School            6064 non-null   object
 1   County            6064 non-null   object
 2   Total_Cases       6064 non-null   object
 3   Students          6064 non-null   object
 4   Teachers          6064 non-null   object
 5   Staff             6064 non-null   object
 6   Unknown           6064 non-null   object
 7   Symptoms_Yes      6064 non-null   object
 8   Symptoms_No       6064 non-null   object
 9   Symptoms_Unknown  6064 non-null   object
dtypes: object(10)
memory usage: 521.1+ KB

Drop unnecessary cols

# No way to identify which goups symptoms belong to
df4 = df3.drop(columns=['Symptoms_Yes','Symptoms_No','Symptoms_Unknown']).reset_index(drop=True)
df4.head(3)
School County Total_Cases Students Teachers Staff Unknown
0 IRVING & BEATRICE PESKOE K-8 CENTER Dade 1 0 0 0 1
1 2010 E HILLSBOROUGH AVE, Hillsborough 1 1 0 0 0
2 A BLESSED ACADEMY Polk 1 1 0 0 0

Concat 'School' and 'County' cols

Several schools have the same name. Combining School and County columns creates a unique identifier that can be split in Tableau to contain more useful data visualizations.

df4['School_County'] = (df4.School.str.cat(df4.County, sep=" (") + " County)").str.title()
df4.head()
School County Total_Cases Students Teachers Staff Unknown School_County
0 IRVING & BEATRICE PESKOE K-8 CENTER Dade 1 0 0 0 1 Irving & Beatrice Peskoe K-8 Center (Dade Cou...
1 2010 E HILLSBOROUGH AVE, Hillsborough 1 1 0 0 0 2010 E Hillsborough Ave, (Hillsborough County)
2 A BLESSED ACADEMY Polk 1 1 0 0 0 A Blessed Academy (Polk County)
3 A CHILD'S PLACE MONTESSORI SCHOOL Duval 2 2 0 0 0 A Child'S Place Montessori School (Duval County)
4 A E F SCHOOLS ALTERNATIVE EDU. FOUNDATION Broward 3 0 1 2 0 A E F Schools Alternative Edu. Foundation (Bro...

Convert df from wide to long using melt()

df_long = df4.melt(id_vars=['School','County','School_County','Total_Cases'], 
                   var_name='Individual', 
                   value_name='pos_covid')
df_long.sample(3)
School County School_County Total_Cases Individual pos_covid
6492 BENNETT ELEMENTARY SCHOOL Broward Bennett Elementary School (Broward County) 22 Teachers 5
9414 MENI NINHO MENI NINHA INC Dade Meni Ninho Meni Ninha Inc (Dade County) 1 Teachers 0
2962 LEE ADOLESCENT MOTHERS PROGRAM Lee Lee Adolescent Mothers Program (Lee County) 3 Students 3

Drop 'Total Cases' col

# Drop '_Total Cases_' col
df_long2 = df_long.drop(columns='Total_Cases')
df_long2.head(2)

School County School_County Individual pos_covid
0 IRVING & BEATRICE PESKOE K-8 CENTER Dade Irving & Beatrice Peskoe K-8 Center (Dade Cou... Students 0
1 2010 E HILLSBOROUGH AVE, Hillsborough 2010 E Hillsborough Ave, (Hillsborough County) Students 1

Clean 'School' names.

Convert to title() case

#Title Case for Schools
df_long2.School = df_long2.School.str.title()
df_long2.head(2)

School County School_County Individual pos_covid
0 Irving & Beatrice Peskoe K-8 Center Dade Irving & Beatrice Peskoe K-8 Center (Dade Cou... Students 0
1 2010 E Hillsborough Ave, Hillsborough 2010 E Hillsborough Ave, (Hillsborough County) Students 1

Label University/Colleges

#Label University/Colleges
df_long2.loc[((df_long2.School.str.contains('University')==True) | (df_long2.School.str.contains('College')==True)) & 
         (df_long2.School.str.contains('Middle')==False) &
         (df_long2.School.str.contains('Elementary')==False) & 
         (df_long2.School.str.contains('High')==False) & 
         (df_long2.School.str.contains('Academy')==False) & 
         (df_long2.School.str.contains('Charter')==False) &
         (df_long2.School.str.contains('Prep')==False) & 
         (df_long2.School.str.contains('Preparatory')==False) &
         (df_long2.School.str.contains('1st College')==False) & 
         (df_long2.School.str.contains('Kids')==False) &
         (df_long2.School.str.contains('Little College')==False), 'School_Type'] = 'University/College'

df_long2[df_long2['School'].str.contains('Uni')].sample(3)

School County School_County Individual pos_covid School_Type
1718 First United Methodist School Okaloosa First United Methodist School (Okaloosa County) Students 0 NaN
5646 University Of South Florida- St. Petersburg Pinellas University Of South Florida- St. Petersburg (P... Students 14 University/College
9222 Lynn University Palm Beach Lynn University (Palm Beach County) Teachers 0 University/College

Label High Schools

#Label High Schools
df_long2.loc[((df_long2.School.str.contains('High')==True) | (df_long2.School.str.contains('Hs')==True)) & 
                       (df_long2.School.str.contains('Middle')==False) &
                       (df_long2.School.str.contains('Elementary')==False) & 
                       (df_long2.School.str.contains('Adult')==False),'School_Type'] = 'High School'

df_long2[df_long2['School'].str.contains('High')].sample(3)

School County School_County Individual pos_covid School_Type
21009 Lake Highland Preparatory School Orange Lake Highland Preparatory School (Orange County) Unknown 2 High School
15033 Land O' Lakes High Adult Education Pasco Land O' Lakes High Adult Education (Pasco County) Staff 0 NaN
18987 Central High School Hernando Central High School (Hernando County) Unknown 1 High School

Label Middle Schools

#Identify Middle Schools

df_long2.loc[(df_long2.School.str.contains('High')==False) & 
                       (df_long2.School.str.contains('Middle')==True) &
                       (df_long2.School.str.contains('Elementary')==False) & 
                       (df_long2.School.str.contains('Adult')==False),'School_Type'] = 'Middle School'

df_long2[df_long2['School'].str.contains('Middle')].sample(3)

School County School_County Individual pos_covid School_Type
1526 Eisenhower Middle School Hillsborough Eisenhower Middle School (Hillsborough County) Students 21 Middle School
16696 Ruben Dario Middle School Dade Ruben Dario Middle School (Dade County) Staff 0 Middle School
16010 Osceola Middle School Marion Osceola Middle School (Marion County) Staff 6 Middle School

Label Elementary

#identify Elementary Schools
#some elementary schools have 'University' or 'College' in their names

df_long2.loc[(df_long2.School.str.contains('High')==False) & 
                       (df_long2.School.str.contains('Middle')==False) &
                       (df_long2.School.str.contains('Elementary')==True) & 
                       (df_long2.School.str.contains('Adult')==False),'School_Type'] = "Elementary School"

df_long2[df_long2['School'].str.contains('Elem')].sample(3)

School County School_County Individual pos_covid School_Type
1591 Estates Elementary School Collier Estates Elementary School (Collier County) Students 25 Elementary School
3460 Molino Park Elementary Escambia Molino Park Elementary (Escambia County) Students 4 Elementary School
20557 Howard Drive Elementary School Dade Howard Drive Elementary School (Dade County) Unknown 2 Elementary School

Label unknown schools a 'Other'

df_long2.loc[df_long2.School_Type.isna(),'School_Type'] = 'No Label'
df_long2[df_long2['School'].str.contains('Hs')]

School County School_County Individual pos_covid School_Type
565 Brandon Hs Hillsborough Brandon Hs (Hillsborough County) Students 1 High School
2629 Jp Taravella Hs Broward Jp Taravella Hs (Broward County) Students 0 High School
6629 Brandon Hs Hillsborough Brandon Hs (Hillsborough County) Teachers 0 High School
8693 Jp Taravella Hs Broward Jp Taravella Hs (Broward County) Teachers 0 High School
12693 Brandon Hs Hillsborough Brandon Hs (Hillsborough County) Staff 0 High School
14757 Jp Taravella Hs Broward Jp Taravella Hs (Broward County) Staff 0 High School
18757 Brandon Hs Hillsborough Brandon Hs (Hillsborough County) Unknown 0 High School
20821 Jp Taravella Hs Broward Jp Taravella Hs (Broward County) Unknown 1 High School

Change dtype of pos_covid to 'int'

#Change dtype of pos_covid to 'int'
df_long2.pos_covid = df_long2.pos_covid.str.replace(',','').astype('int')
df_long2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24256 entries, 0 to 24255
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   School         24256 non-null  object
 1   County         24256 non-null  object
 2   School_County  24256 non-null  object
 3   Individual     24256 non-null  object
 4   pos_covid      24256 non-null  int64 
 5   School_Type    24256 non-null  object
dtypes: int64(1), object(5)
memory usage: 1.1+ MB

Data Analysis

Groupby school type

#grouby school type
df_long2.groupby('School_Type').count()

School County School_County Individual pos_covid
School_Type
Elementary School 6748 6748 6748 6748 6748
High School 2536 2536 2536 2536 2536
Middle School 2236 2236 2236 2236 2236
No Label 11564 11564 11564 11564 11564
University/College 1172 1172 1172 1172 1172

Groupby School Type and Individuals

df_long2.groupby(['School_Type','Individual'])['pos_covid'].sum().to_frame().reset_index().

School_Type Individual pos_covid
0 Elementary School Staff 1978
1 Elementary School Students 23831
2 Elementary School Teachers 3300
3 Elementary School Unknown 3763
4 High School Staff 1161
5 High School Students 29353
6 High School Teachers 1491
7 High School Unknown 1962
8 Middle School Staff 720
9 Middle School Students 13081
10 Middle School Teachers 1057
11 Middle School Unknown 1248
12 No Label Staff 1680
13 No Label Students 25826
14 No Label Teachers 2515
15 No Label Unknown 3307
16 University/College Staff 1308
17 University/College Students 19201
18 University/College Teachers 286
19 University/College Unknown 1313

Groupby County and Positive Covid Cases

# groupby county and postive covid cases
df_long2.groupby('County')['pos_covid'].sum().to_frame().reset_index().sort_values(by='pos_covid',ascending=False).head(15)

County pos_covid
12 Dade 16672
5 Broward 10523
28 Hillsborough 10268
49 Palm Beach 8801
47 Orange 8610
36 Leon 6139
35 Lee 5588
52 Polk 5524
15 Duval 4996
51 Pinellas 4712
50 Pasco 4541
0 Alachua 4504
63 Volusia 3450
4 Brevard 3137
56 Seminole 2675

Groupby Select County and Individuals

County Selected = 'Orange' county

#collapse
# groupby one county and individuals
var = 'Orange'
sel_county = df_long2[df_long2['County']==var]
sel_county.groupby(['County','Individual'])['pos_covid'].sum().to_frame().reset_index()

County Individual pos_covid
0 Orange Staff 239
1 Orange Students 6976
2 Orange Teachers 229
3 Orange Unknown 1166

Export Data to Excel Format

df_long2 = df_long2.sort_values(by=['County','School']).reset_index(drop=True)
df_long2.to_excel('Florida_Schools_COVID19_Cases_2020_2021.xls')