How to Do Exploratory Data Analysis in Python Step by Step
Introduction to Exploratory Data Analysis in Python
The outbreak of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), the virus that causes COVID-19, was first reported in Wuhan, a city in China's Hubei province.
It is no surprise how rapidly this virus spread, killed millions of people, and caused fear around the globe.
The first COVID-19 case in India was recorded on January 30, 2020. Since then, the domestic outbreak has infected a large part of the population, with case counts rising sharply.
According to the statistics available when this article was written, India had recorded a total of about 14.1 million confirmed cases by mid-April 2021.
It is unfortunate that thousands of people who follow every recommended safety rule still get infected because of the carelessness of others.
Although the government has continuously announced lockdowns and other safety guidelines, it is still important to draw insights from the daily case data generated since the outbreak began.
The goal is to help people understand how the virus is spreading and how it affects different groups, such as men and women, across the country.
In this article, I will carry out a step-by-step exploratory data analysis in Python of the novel coronavirus outbreak in India. It will reveal how severe a crisis this virus has caused across the country.
What is Exploratory Data Analysis
Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
Before starting our hands-on exploratory data analysis with Python step by step, let’s look at the dependencies, setup, and installation needed to start with the data analysis.
Installation, Setup, and Dependencies for the EDA Python Example
- Programming Language – Python 3.x
- External Libraries – NumPy, Pandas, Matplotlib, Seaborn
- Data – download from Kaggle, or clone my GitHub repository
- Workspace/IDE – Jupyter Notebook (you can use your preferred IDE too)
Note: The data consists of CSV (comma-separated values) files plus one Excel workbook (ISPA.xlsx). Place all the data files in the same directory where you save your code.
Otherwise, the relative file paths used in the code will not resolve and pandas will raise a FileNotFoundError. Also, if you work with several versions of Python, create a virtual environment for this project.
So, now the setup is complete, and you have done a great job so far!
Let us start the step-by-step code walkthrough of EDA in Python. I will explain each snippet of Python code, one by one, and show its output.
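As a quick sanity check before loading anything, the note above can be turned into a small helper. This is only a sketch: the function name find_missing_files is my own, and the file names are the ones used in the loading step of this article.

```python
from pathlib import Path

# File names taken from the loading step of this article.
EXPECTED_FILES = ["covid_19_india.csv", "IndividualDetails.csv", "ISPA.xlsx"]


def find_missing_files(directory, names=EXPECTED_FILES):
    """Return the names from `names` that are not present in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


if __name__ == "__main__":
    # "." is the current working directory, i.e. where the notebook runs.
    missing = find_missing_files(".")
    if missing:
        print("Missing data files:", ", ".join(missing))
    else:
        print("All data files found.")
```

If the list that comes back is empty, the relative paths used in the following steps should resolve.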
Step 1: Importing the libraries on a fresh Jupyter Notebook (.ipynb file)
Let’s import all the necessary libraries in one go and run the cell. Before importing, make sure they are installed on your system; reading the .xlsx file in step 2 also requires the openpyxl package.
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import math
import numpy as np
import matplotlib.ticker as ticker
import matplotlib.animation as animation
%matplotlib inline
sns.set(font_scale=1.4)
So, we are done with step 1. Now, let us load the data.
Step 2: Load the data
We can easily load the data using the pandas function pd.read_csv. Load the data and print the first 5 rows of the covid19_df data to check if it is properly loaded with .head(), and also the shape of the data (rows, cols) with .shape.
covid19_df = pd.read_csv("covid_19_india.csv")
individuals_df = pd.read_csv("IndividualDetails.csv")
excel_file = pd.ExcelFile("ISPA.xlsx")
indian_states_df = excel_file.parse('Sheet1')
covid19_df.head()
covid19_df.shape
You will see a table with the first five rows and all columns, followed by the shape as a tuple (m, n), where m is the number of rows and n is the number of columns.
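If you do not have the files at hand yet, the same calls can be tried on a tiny in-memory CSV. The rows below are made up for illustration and only mimic a few columns of covid_19_india.csv.

```python
import io

import pandas as pd

# Made-up sample rows that mimic a few columns of covid_19_india.csv.
sample_csv = io.StringIO(
    "Sno,Date,State/UnionTerritory,Confirmed\n"
    "1,30/01/20,Kerala,1\n"
    "2,31/01/20,Kerala,1\n"
    "3,01/02/20,Kerala,2\n"
)

sample_df = pd.read_csv(sample_csv)

print(sample_df.head(2))  # first two rows
print(sample_df.shape)    # (rows, columns)
print(sample_df.dtypes)   # Date stays plain text unless it is parsed
```

Note that read_csv does not turn the Date column into real dates on its own, which matters as soon as you want to sort or compare by date.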
Step 3: Data Wrangling and Cleaning
In step 3, we perform data wrangling and cleaning.
Data wrangling is the process of cleaning bad data and reshaping messy, complex data sets so that they are easy to access and analyze.
Let us check whether there are any missing values in our data with .isna().sum(), which counts the missing values in each column.
covid19_df.isna().sum()
Output looks similar to this –
Sno 0
Date 0
Time 0
State/UnionTerritory 0
ConfirmedIndianNational 0
ConfirmedForeignNational 0
Cured 0
Deaths 0
Confirmed 0
dtype: int64
Here, we see that there are no missing values in this dataset which makes our job easier.
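To see what the check reports when values are missing, here is a minimal sketch on a made-up frame. It also adds the share of missing values per column, which is often easier to read than raw counts.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "State": ["A", "B", "C", "D"],
    "Confirmed": [10, np.nan, 7, 3],
    "Deaths": [1, 0, np.nan, np.nan],
})

missing_counts = toy.isna().sum()                        # missing values per column
missing_share = toy.isna().mean().mul(100).round(1)      # same, as a percentage
summary = pd.DataFrame({"missing": missing_counts, "percent": missing_share})
print(summary)
```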
Let us now have a look at the most recent records for each state to gain an idea about where we stand currently. From the last set of records, we can see that we have data till 1st July 2020.
Step 4: Fetch out the latest data as of this dataset, i.e. 1st July 2020
Let us create a separate variable called covid19_df_latest to store the records of the most recent cases as of this data, which is 1st July 2020. Once again use the .head() function to check if you fetched the data correctly.
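# .copy() gives an independent DataFrame, so columns added in later steps do not trigger a SettingWithCopyWarning
covid19_df_latest = covid19_df[covid19_df['Date']=="01/07/20"].copy()
covid19_df_latest.head()
Now, compute the total number of confirmed cases on 1st July 2020 with .sum().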
covid19_df_latest['Confirmed'].sum()
The output must look similar to this –
585493
So, there were a total of 585,493 confirmed cases as of 1st July 2020.
So far the EDA Python tutorial is going smoothly. Let’s move on.
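Instead of typing the latest date by hand, it can also be looked up from the data. The sketch below uses made-up rows and assumes the dates follow the day/month/two-digit-year pattern seen in "01/07/20". The dates have to be parsed first, because the largest date string is not the latest date.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Date": ["30/06/20", "01/07/20", "01/07/20", "29/06/20"],
    "State/UnionTerritory": ["A", "A", "B", "B"],
    "Confirmed": [90, 100, 50, 40],
})

# Assumption: dates are written as day/month/two-digit year.
parsed = pd.to_datetime(toy["Date"], format="%d/%m/%y")
latest_day = parsed.max()
latest_rows = toy[parsed == latest_day].copy()

print(toy["Date"].max())               # string comparison picks 30/06/20
print(latest_day.date())               # parsed comparison picks the real latest day
print(latest_rows["Confirmed"].sum())
```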
You can also download more recent COVID-19 data from Kaggle or IEEE forums, replace the dates in the code snippets, and get results for the period you care about.
It is always good practice to repeat the analysis on your own case study!
Step 5: Visualizing State-wise Figures- States with maximum Confirmed Cases
Now we will use the latest records to see which five states had the most confirmed COVID-19 cases.
covid19_df_latest = covid19_df_latest.sort_values(by=['Confirmed'],ascending = False)
plt.figure(figsize=(12,8),dpi = 80)
plt.bar(covid19_df_latest['State/UnionTerritory'][:5],covid19_df_latest['Confirmed'][:5],align='center',color='lightgrey')
plt.ylabel('Number of confirmed cases')
plt.title('States with maximum confirmed cases')
plt.show()
The output is a bar chart of the five states with the most confirmed cases.
Step 6: Visualizing the Top 5 States with Maximum Deaths
Now we will use the same records to see which five states had the most COVID-19 deaths.
First, let us sum up the total death count as of 1st July 2020.
covid19_df_latest['Deaths'].sum()
Output –
17400
As per the data in the dataset, India has had 17400 deaths across all states. We will now see which states have the most deaths.
covid19_df_latest = covid19_df_latest.sort_values(by=['Deaths'],ascending = False)
plt.figure(figsize=(12,8), dpi=80)
plt.bar(covid19_df_latest['State/UnionTerritory'][:5], covid19_df_latest['Deaths'][:5], align='center',color='lightgrey')
plt.ylabel('Number of Deaths')
plt.title('States with maximum deaths')
plt.show()
Let’s visualize the graph generated as the output of the code –
We can see that Maharashtra again tops the count, with deaths crossing the 7,000 mark, followed by Delhi, which is about to touch the 3,000 mark, whereas Gujarat, Tamil Nadu, and Uttar Pradesh are below the 2,000 mark.
Step 7: Number of Deaths per Confirmed Cases in the Different Indian States
Next up, I wanted to look at the number of deaths per confirmed case in the different Indian states, as a rough indicator of how well the healthcare facilities in each state are coping.
We will create a new column ‘Deaths/Confirmed Cases’ to hold the values.
# Deaths divided by confirmed cases, matching the column name
covid19_df_latest['Deaths/Confirmed Cases'] = (covid19_df_latest['Deaths']/covid19_df_latest['Confirmed']).round(4)
# A state with zero confirmed cases would give inf; treat it as missing
covid19_df_latest['Deaths/Confirmed Cases'] = covid19_df_latest['Deaths/Confirmed Cases'].replace([np.inf, -np.inf], np.nan)
# Highest ratio first, so the worst-affected states come out on top
covid19_df_latest = covid19_df_latest.sort_values(by=['Deaths/Confirmed Cases'], ascending=False, na_position='last')
covid19_df_latest.iloc[:10]
So, after creating this new measure and sorting the states by it, the first ten rows show the ten worst states in this regard.
We see that there are some states, like Meghalaya, Puducherry, Punjab, and Rajasthan, where the numbers of cases and deaths are pretty low as of now, and things appear to be under control.
But other states, like Gujarat, Maharashtra, and Madhya Pradesh, look hard hit. We leave West Bengal out of the equation, since news has been emerging from the state about misreported numbers.
However, these statistics do not always give a clear picture. India is a country of varying demographics, and no two states are the same. Relating the figures to the estimated population of each state may give a better idea of the overall picture.
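The deaths-per-confirmed-case measure from this step can be sketched on made-up numbers as follows. Expressing it as deaths per 100 confirmed cases is my own choice for readability; states with zero confirmed cases are treated as undefined rather than as zero.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "State": ["A", "B", "C"],
    "Confirmed": [200, 50, 0],
    "Deaths": [10, 0, 0],
})

# Keep Confirmed only where it is positive; zero becomes NaN, so the
# division below yields NaN instead of inf or a misleading 0.
confirmed = toy["Confirmed"].where(toy["Confirmed"] > 0)
toy["Deaths per 100 cases"] = (toy["Deaths"] / confirmed * 100).round(2)

# Highest ratio first; undefined ratios go to the bottom.
ranked = toy.sort_values("Deaths per 100 cases", ascending=False, na_position="last")
print(ranked)
```

Sorting in descending order puts the states with the highest ratio at the top, which is usually what "worst states" is meant to convey.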
Step 8: Cases Per 10 Million
We will keep the ‘Aadhaar assigned as of 2019’ column as a stand-in for population, renaming it to Population, and discard the Area feature, since I decided against using it due to recent changes to the states and union territories of India.
indian_states_df = indian_states_df[['State','Aadhaar assigned as of 2019']]
indian_states_df.columns = ['State/UnionTerritory','Population']
indian_states_df.head()
We will now merge the population dataset with our main dataset and create a new feature called Cases/10million to get a better idea of which states are hit harder by the COVID-19 crisis.
I feel this is a fairer measure, as it accounts for the population differences between states.
covid19_df_latest = pd.merge(covid19_df_latest, indian_states_df, on='State/UnionTerritory')
covid19_df_latest['Cases/10million']=(covid19_df_latest['Confirmed']/covid19_df_latest['Population'])*10000000
Let us fill the missing values with 0 using .fillna() and sort the data frame by cases per 10 million in descending order by setting ascending=False.
covid19_df_latest.fillna(0, inplace=True)
covid19_df_latest.sort_values(by='Cases/10million',ascending=False)
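One thing worth knowing about this step: pd.merge uses an inner join by default, so a state whose name does not appear in the population table is dropped without any message. The sketch below, on made-up numbers, uses how="left" and indicator=True to list such states before computing the per-10-million figure.

```python
import pandas as pd

# Toy data, invented for illustration only.
cases = pd.DataFrame({
    "State/UnionTerritory": ["A", "B", "C"],
    "Confirmed": [5000, 300, 40],
})
population = pd.DataFrame({
    "State/UnionTerritory": ["A", "B"],
    "Population": [10_000_000, 2_000_000],
})

# how="left" keeps every state from `cases`; indicator=True adds a
# "_merge" column that says whether a population row was found.
merged = pd.merge(cases, population, on="State/UnionTerritory",
                  how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only",
                       "State/UnionTerritory"].tolist()

merged["Cases/10million"] = merged["Confirmed"] / merged["Population"] * 10_000_000

print("States without a population figure:", unmatched)
print(merged.drop(columns="_merge"))
```

States in the unmatched list end up with a missing Cases/10million value, which is more honest than a 0 when the population is simply unknown.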
Now let’s plot the visualization for the Number of Confirmed Cases Vs. Number of Cases per 10 million people!
df = covid19_df_latest[(covid19_df_latest['Confirmed']>=30000)| (covid19_df_latest['Cases/10million']>=4000)]
plt.figure(figsize=(12,8),dpi=80)
plt.scatter(covid19_df_latest['Confirmed'],covid19_df_latest['Cases/10million'],alpha=0.5)
plt.xlabel('Number of confirmed cases',size=12)
plt.ylabel('Number of cases per 10 million people',size=12)
plt.scatter(df['Confirmed'],df['Cases/10million'],color='red')
for i in range(df.shape[0]):
    plt.annotate(df['State/UnionTerritory'].tolist()[i], xy=(df['Confirmed'].tolist()[i],df['Cases/10million'].tolist()[i]),
                 xytext = (df['Confirmed'].tolist()[i]+1.0, df['Cases/10million'].tolist()[i]+12.0),size=11)
plt.tight_layout()
plt.title('Visualization to display the variation in COVID 19 figures in different Indian states', size=16)
plt.show()
Let us visualize our output –
Step 9: Plot a Correlation Heatmap to Visualize Correlation Coefficients between Different Columns
A correlation heatmap colors each cell of the correlation matrix according to its coefficient. A value close to 1 means the two features are strongly positively correlated, a value close to 0 means there is little or no linear relationship between them, and a value close to -1 means they are strongly negatively correlated. Which shade stands for which value depends on the colormap, so read the color bar next to the plot.
Let us plot it and gain some good insight!
plt.figure(figsize=(10,8),dpi=80)
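# numeric_only=True (pandas 1.5+) skips text columns such as Date and State/UnionTerritory
sns.heatmap(covid19_df_latest.corr(numeric_only=True),annot=True)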
Our heatmap is generated after running this code! Let us look at it!
We notice that some measures, like Confirmed, Cured, Deaths, and Cases/10million, are strongly correlated with one another, so we need to take these figures seriously!
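To make the meaning of the coefficients concrete, here is a minimal sketch on made-up numbers. The column "Tests per case" is hypothetical and only there to produce a negative correlation; text columns are removed with select_dtypes before calling corr().

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "State": ["A", "B", "C", "D"],
    "Confirmed": [100, 200, 300, 400],
    "Deaths": [2, 4, 6, 8],             # rises in step with Confirmed
    "Tests per case": [40, 30, 20, 10],  # falls as Confirmed rises
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))
```

Confirmed and Deaths move together perfectly here, so their coefficient is 1, while Confirmed and the hypothetical "Tests per case" column move in opposite directions and get -1.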
Step 10: Analyzing Individual Data
Next up, we look at the individual case data. On initial inspection, we see that a large amount of data is missing from this dataset, which we must take into consideration as we move forward with our analysis.
individuals_df.isna().sum()
Output –
id 0
government_id 25185
diagnosed_date 0
age 25836
gender 22869
detected_city 25832
detected_district 6984
detected_state 0
nationality 25473
current_status 0
status_change_date 402
notes 1335
dtype: int64
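So, should we simply remove these missing values?
The call below returns a copy with every incomplete row removed. That would discard most of this dataset, so we only preview it here and do not assign the result; the rest of the analysis keeps using the full individuals_df.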
individuals_df.dropna()
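As a sketch of what dropna() does and does not do, consider the made-up frame below. It shows that the original frame is left untouched, and that the subset argument lets you require only the columns you actually need.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "gender": ["F", np.nan, "M", "M"],
    "age": [20, np.nan, np.nan, 45],
    "detected_state": ["Kerala", "Kerala", "Delhi", "Goa"],
})

all_complete = toy.dropna()                   # drops rows with any missing value
gender_known = toy.dropna(subset=["gender"])  # only requires gender to be present

print(len(toy), len(all_complete), len(gender_known))  # 4 2 3
```

For a dataset with as many gaps as this one, dropping by subset right before each analysis keeps far more rows than a blanket dropna().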
Let us see where the first case was detected!
individuals_df.iloc[0]
Let us see the output-
id 0
government_id KL-TS-P1
diagnosed_date 30/01/2020
age 20
gender F
detected_city Thrissur
detected_district Thrissur
detected_state Kerala
nationality India
current_status Recovered
status_change_date 14/02/2020
notes Travelled from Wuhan
Name: 0, dtype: object
The first COVID-19 case in India was recorded on 30 January 2020. It was detected in the city of Thrissur in Kerala, and the individual had travelled from Wuhan.
# Count the recorded ids in each district and show the five largest counts
individuals_grouped_district = individuals_df.groupby('detected_district')['id'].count()
individuals_grouped_district.sort_values(ascending=False).head()
The output shows how many individual cases were detected in the five districts with the most cases.
detected_district
Mumbai 3149
Ahmedabad 2181
Indore 1176
Jaipur 808
Pune 706
Name: id, dtype: int64
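The same kind of count can also be obtained with value_counts(), which sorts in descending order by default. A minimal sketch on made-up rows follows; note that both approaches skip rows where the district is missing.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "detected_district": ["Mumbai", "Pune", "Mumbai", None, "Mumbai"],
})

by_groupby = (toy.groupby("detected_district")["id"]
                 .count()
                 .sort_values(ascending=False))
by_value_counts = toy["detected_district"].value_counts()

print(by_value_counts)
```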
Let’s find the ratio of males to females among those infected!
individuals_grouped_gender = individuals_df.groupby('gender')
individuals_grouped_gender = pd.DataFrame(individuals_grouped_gender.size().reset_index(name='count'))
individuals_grouped_gender.head()
plt.figure(figsize=(10,6),dpi=80)
barlist= plt.bar(individuals_grouped_gender['gender'],individuals_grouped_gender['count'],align='center',color='grey',alpha=0.3)
barlist[1].set_color('r')
plt.ylabel('Count',size=12)
plt.title('Count on the basis of gender',size=16)
plt.show()
Now look at the graph output!
Uh oh!
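The graph shows that, among the records with a known gender, males are affected almost twice as often as females. Keep in mind that gender is missing for 22,869 records, so this ratio covers only part of the data.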
Step 11: Progression of Case Count in India
In this step, we will have a look at how the number of cases increased in India. Afterward, we will inspect this curve and find similarities with the state-level curves.
For this analysis, I had to modify the dataset a bit. I grouped the data by the diagnosed_date feature so that I had a count of the number of cases detected each day throughout India.
I followed this up by taking a cumulative sum of this count and adding it as a new column.
individuals_grouped_date = individuals_df.groupby('diagnosed_date')
individuals_grouped_date = pd.DataFrame(individuals_grouped_date.size().reset_index(name="count"))
individuals_grouped_date[['Day','Month','Year']] = individuals_grouped_date.diagnosed_date.apply(
    lambda x : pd.Series(str(x).split("/")))
individuals_grouped_date.sort_values(by=['Year','Month','Day'],inplace=True,ascending=True)
individuals_grouped_date.reset_index(inplace=True)
individuals_grouped_date['Cumulative Count'] = individuals_grouped_date['count'].cumsum()
individuals_grouped_date = individuals_grouped_date.drop(['index','Day','Month','Year'],axis=1)
individuals_grouped_date.head()
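An alternative to splitting the date strings into day, month, and year is to parse them into real dates and sort on those. This is only a sketch on made-up rows, assuming the day/month/four-digit-year pattern seen in the diagnosed_date values above.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "diagnosed_date": ["02/03/2020", "30/01/2020", "02/03/2020", "03/02/2020"],
})

# Assumption: dates are written as day/month/four-digit year.
dates = pd.to_datetime(toy["diagnosed_date"], format="%d/%m/%Y")

# Cases per day in chronological order, then the running total.
daily = dates.value_counts().sort_index().rename("count").to_frame()
daily["Cumulative Count"] = daily["count"].cumsum()
print(daily)
```

Because the index holds real dates, sort_index() orders the days chronologically without any helper columns.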
Step 12: How the Case Counts Increased in India
Now we are at the most important section of this analysis. It shows how quickly the case count climbed as the outbreak in India became part of the global pandemic!
Let us first code it –
individuals_grouped_date = individuals_grouped_date.iloc[3:]
individuals_grouped_date.reset_index(inplace=True)
individuals_grouped_date.columns = ['Day Number','diagnosed_date','count','Cumulative Count']
individuals_grouped_date['Day Number'] = individuals_grouped_date['Day Number']-2
plt.figure(figsize=(12,8), dpi=80)
plt.plot(individuals_grouped_date['Day Number'],individuals_grouped_date['Cumulative Count'],color="grey",alpha=0.5)
plt.xlabel('Number of Days', size = 12)
plt.ylabel('Number of Cases', size = 12)
plt.title('How the case count increased in India', size=16)
plt.show()
Now, let’s visualize the graph.
In the above curve, we see that the rise was more or less steady until the 20-day mark. Between days 20 and 30, the curve became a little steeper. The slope kept increasing, and after the 30-day mark we see a steady, steep climb with no signs of flattening. These are ominous indications.
In the next few code snippets, I prepare and process the dataset to group the data by state. I used the following five states for this analysis:
Maharashtra
covid19_maharashtra = covid19_df[covid19_df['State/UnionTerritory']=="Maharashtra"]
covid19_maharashtra.head()
covid19_maharashtra.reset_index(inplace=True)
covid19_maharashtra= covid19_maharashtra.drop(['index', 'Sno', 'Time', 'ConfirmedIndianNational', 'ConfirmedForeignNational','Cured'], axis = 1)
covid19_maharashtra.reset_index(inplace = True)
covid19_maharashtra.columns = ['Day Count', 'Date', 'State/UnionTerritory', 'Deaths', 'Confirmed']
covid19_maharashtra['Day Count'] = covid19_maharashtra['Day Count'] +8
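# Maharashtra has no rows for days 1-7 (02/03/20 to 08/03/20), so add zero-filled rows
missing_values = pd.DataFrame({"Day Count":[x for x in range(1,8)],
                               "Date": ["0"+ str(x)+"/03/20" for x in range(2,9)],
                               "State/UnionTerritory": ["Maharashtra"]*7,
                               "Deaths": [0]*7,
                               "Confirmed": [0]*7})
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
covid19_maharashtra = pd.concat([covid19_maharashtra, missing_values], ignore_index=True)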
covid19_maharashtra = covid19_maharashtra.sort_values(by="Day Count", ascending = True)
covid19_maharashtra.reset_index(drop=True, inplace=True)
print(covid19_maharashtra.shape)
covid19_maharashtra.head()
Tamil Nadu
covid19_tamilnadu = covid19_df[covid19_df['State/UnionTerritory'] == "Tamil Nadu"]
covid19_tamilnadu.reset_index(inplace = True)
covid19_tamilnadu = covid19_tamilnadu.drop(['index','Sno', 'Time', 'ConfirmedIndianNational', 'ConfirmedForeignNational','Cured'], axis = 1)
covid19_tamilnadu.reset_index(inplace = True)
covid19_tamilnadu.columns = ['Day Count', 'Date', 'State/UnionTerritory', 'Deaths', 'Confirmed']
covid19_tamilnadu['Day Count'] = covid19_tamilnadu['Day Count'] + 1
print(covid19_tamilnadu.shape)
covid19_tamilnadu.head()
Delhi
covid19_delhi = covid19_df[covid19_df['State/UnionTerritory'] == "Delhi"]
covid19_delhi.reset_index(inplace = True)
covid19_delhi = covid19_delhi.drop(['index','Sno', 'Time', 'ConfirmedIndianNational', 'ConfirmedForeignNational','Cured'], axis = 1)
covid19_delhi.reset_index(inplace = True)
covid19_delhi.columns = ['Day Count', 'Date', 'State/UnionTerritory', 'Deaths', 'Confirmed']
covid19_delhi['Day Count'] = covid19_delhi['Day Count'] + 1
print(covid19_delhi.shape)
covid19_delhi.head()
Gujarat
covid19_gujarat = covid19_df[covid19_df['State/UnionTerritory'] == "Gujarat"]
covid19_gujarat.reset_index(inplace = True)
covid19_gujarat = covid19_gujarat.drop(['index','Sno', 'Time', 'ConfirmedIndianNational', 'ConfirmedForeignNational','Cured'], axis = 1)
covid19_gujarat.reset_index(inplace = True)
covid19_gujarat.columns = ['Day Count', 'Date', 'State/UnionTerritory', 'Deaths', 'Confirmed']
covid19_gujarat['Day Count'] = covid19_gujarat['Day Count'] + 19
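# Gujarat has no rows for days 1-18 (02/03/20 to 19/03/20), so add zero-filled rows
missing_values = pd.DataFrame({"Day Count": [x for x in range(1,19)],
                               "Date": [("0" + str(x) if x < 10 else str(x))+"/03/20" for x in range(2,20)],
                               "State/UnionTerritory": ["Gujarat"]*18,
                               "Deaths": [0]*18,
                               "Confirmed": [0]*18})
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
covid19_gujarat = pd.concat([covid19_gujarat, missing_values], ignore_index=True)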
covid19_gujarat = covid19_gujarat.sort_values(by="Day Count", ascending = True)
covid19_gujarat.reset_index(drop=True, inplace=True)
print(covid19_gujarat.shape)
covid19_gujarat.head()
Kerala
covid19_kerala = covid19_df[covid19_df['State/UnionTerritory'] == "Kerala"]
covid19_kerala = covid19_kerala.iloc[32:]
covid19_kerala.reset_index(inplace = True)
covid19_kerala = covid19_kerala.drop(['index','Sno', 'Time', 'ConfirmedIndianNational', 'ConfirmedForeignNational','Cured'], axis = 1)
covid19_kerala.reset_index(inplace = True)
covid19_kerala.columns = ['Day Count', 'Date', 'State/UnionTerritory', 'Deaths', 'Confirmed']
covid19_kerala['Day Count'] = covid19_kerala['Day Count'] + 1
print(covid19_kerala.shape)
covid19_kerala.head()
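The five state blocks above repeat the same filter, drop, and renumber steps. As a sketch, they could be folded into one helper. The name prepare_state and its first_day parameter are my own, not part of the original code, and the rows below are made up for illustration.

```python
import pandas as pd


def prepare_state(df, state, first_day=1):
    """Return the rows of one state with a running 'Day Count' column.

    `first_day` is the day number given to the state's first record.
    """
    columns = ["Date", "State/UnionTerritory", "Deaths", "Confirmed"]
    subset = df.loc[df["State/UnionTerritory"] == state, columns]
    subset = subset.reset_index(drop=True)
    day_numbers = list(range(first_day, first_day + len(subset)))
    subset.insert(0, "Day Count", day_numbers)
    return subset


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Date": ["02/03/20", "02/03/20", "03/03/20", "04/03/20"],
    "State/UnionTerritory": ["Delhi", "Kerala", "Delhi", "Delhi"],
    "Deaths": [0, 0, 0, 1],
    "Confirmed": [1, 3, 1, 2],
})

delhi = prepare_state(toy, "Delhi")
print(delhi)
```

With the real data, a call such as prepare_state(covid19_df, "Gujarat", first_day=19) would reproduce the day numbering used above for Gujarat; the zero-filled rows for the earlier days would still have to be added separately.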
Now, let us see which states were flattening the curve.
plt.figure(figsize=(12,8), dpi=80)
plt.plot(covid19_kerala['Day Count'], covid19_kerala['Confirmed'])
plt.plot(covid19_maharashtra['Day Count'], covid19_maharashtra['Confirmed'])
plt.plot(covid19_delhi['Day Count'], covid19_delhi['Confirmed'])
plt.plot(covid19_tamilnadu['Day Count'], covid19_tamilnadu['Confirmed'])
plt.plot(covid19_gujarat['Day Count'], covid19_gujarat['Confirmed'])
plt.legend(['Kerala', 'Maharashtra', 'Delhi', 'Tamil Nadu', 'Gujarat'], loc='upper left')
plt.xlabel('Day Count', size=12)
plt.ylabel('Confirmed Cases Count', size=12)
plt.title('Which states are flattening the curve ?', size = 16)
plt.show()
Let us visualize the graph generated in the output.
We see that almost all the curves follow the one displayed by the nation as a whole. The only odd one out is Kerala. Kerala’s curve saw a gradual incline in the period between days 20 and 30, as seen in the other curves.
But Kerala did not let the curve climb further and managed to flatten it. As a result, the state has been able to contain the situation.
The situation in Maharashtra looks very grave indeed. The curve has climbed very steeply and shows no signs of slowing down. Gujarat’s curve steepened later than the rest. It remained under control until the 30-day mark, and the slope grew steeper after 40 days.
The only way we can, as a whole, prevent this impending crisis is by flattening the curve.
You can get the code for this exploratory data analysis in my GitHub repo. Links are provided in the conclusion.
I hope you now have an idea of how to do exploratory data analysis in Python.
Conclusion
This EDA Python tutorial, with its examples and code, should have given you a basic step-by-step idea of how such projects work. There are many exploratory data analysis Python projects on GitHub; refer to them if you feel like practicing more EDA using Python. To know more about such topics, go through My Blind Bird.
From the author:
I hope this article helped you gain insight into the spread of the virus and its impact on the country’s population. Also, feel free to download my code and data from my GitHub repository and get hands-on with EDA!
Feel free to connect with me on GitHub, Medium, and LinkedIn for further discussions!
This concludes the topic of exploratory data analysis in Python. Thank you! Sukanya Bag