Hands-on Exploratory Data Analysis in Python: Code & Steps (2021)

Introduction to Exploratory Data Analysis in Python

The outbreak of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), the deadly virus that causes COVID-19, was first reported in the city of Wuhan, in China's Hubei province.

It is no longer a surprise how rapidly this virus spread, killing millions of people and causing fear around the globe.

The first Covid-19 case in India was recorded on January 30, 2020. Since then, this domestic outbreak infected a huge part of the population, with infected cases increasing exponentially every single day.

As of April 16, 2021, India had recorded a total of 14.1 million cases.

It is very unfortunate that thousands of people are getting infected, despite following every safety rule they can, because of the carelessness of others.

Though the government has been continuously announcing lockdowns and other safety guidelines, still it is important to get some valuable insights from the data generated from the daily cases since the outbreak of the virus.

The goal is to help people understand how the virus is spreading and how it is affecting the population of our country, including by gender.

In this article, I will carry out a step-by-step exploratory data analysis in Python of the novel coronavirus in India. This will reveal a lot of insights into how severe a crisis the virus has caused across the country.

What is Exploratory Data Analysis

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

Before starting our hands-on exploratory data analysis with python step by step, let’s look at the dependencies, setup, and installation needed to start with the data analysis.

Installations, Setup, and Dependencies for eda python example

  • Programming Language – Python 3.x
  • External Libraries – NumPy, Pandas, Matplotlib, Seaborn
  • Data – Kaggle, or clone my GitHub repository
  • Workspace/IDE – Jupyter Notebook (you can carry on with your preferred IDE too)

Note:   The data is composed of CSV (comma-separated value) files. You need to place all the CSV files in the same directory where you will be saving your code.

Otherwise, it cannot recognize the path and will throw you an error. Also, if you work with any other versions of Python, make sure to create a virtual environment for this project.
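
Before loading anything, you may want to confirm that the files are where the notebook expects them. The check below is only a sketch: the helper name find_missing_files is hypothetical, and the file names are the ones loaded later in Step 2.

```python
from pathlib import Path

# File names taken from the loading step of this tutorial.
REQUIRED_FILES = ["covid_19_india.csv", "IndividualDetails.csv", "ISPA.xlsx"]

def find_missing_files(directory, names=REQUIRED_FILES):
    """Return the names from `names` that do not exist inside `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]

# Check the current working directory, where the notebook is saved.
missing = find_missing_files(".")
if missing:
    print("Missing data files:", ", ".join(missing))
else:
    print("All data files found.")
```

If the list is not empty, move the listed files next to your notebook before continuing.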

So, now the setup is complete, and you did a great job so far!

Let us start with the code walkthrough of eda in python step by step. I will be explaining each snippet of python code, one by one, and provide its respective output!

Step 1: Importing the libraries on a fresh Jupyter Notebook (.ipynb file)

Let’s import all the necessary libraries we will work with at once and run the cell. Before importing, make sure you have them installed on your system.

import pandas as pd
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import math
import numpy as np
import matplotlib.ticker as ticker
import matplotlib.animation as animation
%matplotlib inline
sns.set(font_scale=1.4)

So, we are done with step 1. Now, let us load the data.

Step 2: Load the data

We can easily load the data using pandas function pd.read_csv. Load the data and print the first 5 rows of the covid19_df data to check if it is properly loaded with .head(), and also the shape of the data (rows,cols) with .shape.

covid19_df = pd.read_csv("covid_19_india.csv")
individuals_df = pd.read_csv("IndividualDetails.csv")
excel_file = pd.ExcelFile("ISPA.xlsx")
indian_states_df = excel_file.parse('Sheet1')
print(covid19_df.shape)  # (rows, columns); printed because a cell only displays its last expression
covid19_df.head()

You will see a data table with all rows, columns, and their respective entries/values along with the shape displayed in (m,n) format where m = rows, and, n=columns.

Step 3: Data Wrangling and Cleaning

In step 3 of our exploratory data analysis in Python, we perform data wrangling and cleaning.

Data wrangling is the process of cleaning bad data and configuring messy and complex data sets for easy access and analysis.

Let us check whether there are any missing values in our data with .isna().sum(), which counts the missing values in each column.

covid19_df.isna().sum()

Output looks similar to this –

Sno                         0
Date                        0
Time                        0
State/UnionTerritory        0
ConfirmedIndianNational     0
ConfirmedForeignNational    0
Cured                       0
Deaths                      0
Confirmed                   0
dtype: int64

Here, we see that there are no missing values in this dataset which makes our job easier.
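
To see what .isna().sum() reports when values really are missing, here is a minimal sketch on a made-up four-row frame. The data is invented for illustration and is not from the Kaggle dataset. It also shows the share of missing values per column, which is often easier to read on large tables.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "State": ["A", "B", "C", "D"],
    "Confirmed": [10, np.nan, 30, 40],
    "Deaths": [np.nan, np.nan, 1, 2],
})

missing_counts = toy.isna().sum()          # number of missing values per column
missing_share = toy.isna().mean() * 100    # percentage of missing values per column

print(missing_counts)
print(missing_share.round(1))
```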

Let us now have a look at the most recent records for each state to gain an idea about where we stand currently. From the last set of records, we can see that we have data till 1st July 2020.

Step 4: Fetch out the latest data as of this dataset, i.e. 1st July 2020

Let us create a separate variable called covid19_df_latest to store the records of the most recent cases as of this data, which is 1st July 2020. Once again use the .head() function to check if you fetched the data correctly.  

covid19_df_latest = covid19_df[covid19_df['Date']=="01/07/20"]
covid19_df_latest.head()

Now, confirm the total confirmed cases on 1st July 2020 with .sum()

covid19_df_latest['Confirmed'].sum()

The output must look similar to this –

585493

So, we now know that there were a total of 585493 confirmed cases as of 1st July 2020.

So far, this EDA Python tutorial is going smoothly. Let’s keep going.

If you run into any issues, refer to my exploratory data analysis Python book. The link is at the end.

You can also download recent COVID-19 data from Kaggle or IEEE forums, replace the dates in the code snippets, and get your desired results.

It is always a good practice to implement one for your own case study!
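
Hard-coding the date string works for this snapshot of the data. If you swap in a newer file, one possible approach is to parse the Date column and pick the most recent date programmatically. The sketch below uses invented rows and assumes the dates are in day/month/two-digit-year form, as in the filter used in this step.

```python
import pandas as pd

# Toy rows, invented for illustration; the column names mirror the dataset.
toy = pd.DataFrame({
    "Date": ["30/06/20", "30/06/20", "01/07/20", "01/07/20"],
    "State/UnionTerritory": ["Kerala", "Delhi", "Kerala", "Delhi"],
    "Confirmed": [4300, 85000, 4400, 87000],
})

# Parse day/month/year strings into timestamps so they compare chronologically.
toy["ParsedDate"] = pd.to_datetime(toy["Date"], format="%d/%m/%y")

latest_date = toy["ParsedDate"].max()
toy_latest = toy[toy["ParsedDate"] == latest_date]

print(latest_date.date())
print(toy_latest["Confirmed"].sum())
```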

Step 5: Visualizing State-wise Figures- States with maximum Confirmed Cases

Now we will use our recently fetched data to see which five states had the most confirmed Covid-19 cases.

covid19_df_latest = covid19_df_latest.sort_values(by=['Confirmed'],ascending = False)
plt.figure(figsize=(12,8),dpi = 80)
plt.bar(covid19_df_latest['State/UnionTerritory'][:5],covid19_df_latest['Confirmed'][:5],align='center',color='lightgrey')
plt.ylabel('Number of confirmed cases')
plt.title('States with maximum confirmed cases')
plt.show()

The output is a graph like this –

[Figure: bar chart titled 'States with maximum confirmed cases']

If you are facing any difficulty with the exploratory data analysis so far, feel free to connect with me.
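
As an alternative to sorting the whole frame and slicing the first five rows, pandas also offers nlargest. Here is a small sketch on invented numbers; the state names are placeholders.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "State/UnionTerritory": ["A", "B", "C", "D", "E", "F"],
    "Confirmed": [120, 950, 40, 610, 330, 75],
})

# Keep the five rows with the highest 'Confirmed' values, largest first.
top5 = toy.nlargest(5, "Confirmed")
print(top5)
```

The resulting frame can be passed to plt.bar in the same way as the sliced columns above.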

Step 6: Visualizing the Top 5 States with Maximum Deaths

Now we will use our recently fetched data to see which five states recorded the most Covid-19 deaths.

First, let us sum up the total death count as of 1st July 2020.

covid19_df_latest['Deaths'].sum()

Output –

17400

As per the data in the dataset, India has had 17400 deaths across all states. We will now see which states have the most deaths.

covid19_df_latest = covid19_df_latest.sort_values(by=['Deaths'],ascending = False)
plt.figure(figsize=(12,8), dpi=80)
plt.bar(covid19_df_latest['State/UnionTerritory'][:5], covid19_df_latest['Deaths'][:5], align='center',color='lightgrey')
plt.ylabel('Number of Deaths')
plt.title('States with maximum deaths')
plt.show()

Let’s look at the graph generated as the output of the code –

[Figure: bar chart titled 'States with maximum deaths']

We can see that Maharashtra again tops the count, with deaths crossing the 7000 mark, followed by Delhi, which is about to touch the 3000 mark, whereas Gujarat, Tamil Nadu, and Uttar Pradesh are below the 2000 mark.

Step 7:  Number of Deaths per Confirmed Cases in the Different Indian States

Next up, I wanted to look at the number of deaths per confirmed cases in different Indian states to gain a better idea about the healthcare facilities available.

We will create a new column ‘Deaths/Confirmed Cases’ for getting the values.

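# Deaths per confirmed case, so the values match the column name
covid19_df_latest['Deaths/Confirmed Cases'] = (covid19_df_latest['Deaths']/covid19_df_latest['Confirmed']).round(4)
# A state with zero confirmed cases would give inf; treat it as missing
covid19_df_latest['Deaths/Confirmed Cases'] = covid19_df_latest['Deaths/Confirmed Cases'].replace([np.inf, -np.inf], np.nan)
# Highest ratio first, so the worst-affected states come out on top
covid19_df_latest = covid19_df_latest.sort_values(by=['Deaths/Confirmed Cases'], ascending=False, na_position='last')
covid19_df_latest.iloc[:10]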

So, after creating this new measure and sorting the states based on this figure, look at the ten worst states in this regard.

We see that there are some states like Meghalaya, Puducherry, Punjab, and Rajasthan where the number of cases and deaths are pretty low as of now, and it appears things are in control.

But other states like Gujarat, Maharashtra, and Madhya Pradesh look hard hit. We leave West Bengal out of the entire equation since there has been news emerging from the state regarding mis-publishing of numbers.

However, these statistics do not always lend a clear picture. India is a country of varying demographics and no two states are the same. Maybe equating the figures to the estimated population of a state may lend a better idea to the entire picture.
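
To make the deaths-per-confirmed-case measure concrete, here is a minimal sketch on invented numbers. It includes a state with no confirmed cases to show how a division by zero can be handled; the state names and counts are placeholders.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "State/UnionTerritory": ["A", "B", "C"],
    "Confirmed": [1000, 200, 0],
    "Deaths": [50, 2, 0],
})

# Deaths per confirmed case; in pandas 0/0 gives NaN and x/0 gives inf.
ratio = toy["Deaths"] / toy["Confirmed"]
toy["Deaths/Confirmed Cases"] = ratio.replace([np.inf, -np.inf], np.nan).round(4)

# Highest ratio first, rows without a defined ratio last.
toy = toy.sort_values("Deaths/Confirmed Cases", ascending=False, na_position="last")
print(toy)
```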

Step 8:  Cases Per 10 Million

We will keep the column that holds the number of Aadhaar cards assigned and rename it Population, using it as a stand-in for each state's population. I discard the Area feature, since I decided against using it due to recent changes to the States and UTs in India.

indian_states_df = indian_states_df[['State','Aadhaar assigned as of 2019']]
indian_states_df.columns = ['State/UnionTerritory','Population']
indian_states_df.head()

We will now merge the Population dataset with our main dataset and create a new feature called Cases/10 Million to get a better idea of which states are hit hardest by the COVID-19 crisis.

I feel this new measure is now a more level-headed measure as it takes care of the population differences which exist between different states.

covid19_df_latest = pd.merge(covid19_df_latest, indian_states_df, on='State/UnionTerritory')
covid19_df_latest['Cases/10million']=(covid19_df_latest['Confirmed']/covid19_df_latest['Population'])*10000000

Let us fill the missing values with 0 using .fillna() and sort the data frame by cases per 10 million in descending order by setting ascending=False.

covid19_df_latest.fillna(0, inplace=True)
covid19_df_latest.sort_values(by='Cases/10million',ascending=False)
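
Because pd.merge performs an inner join by default, any state whose name is spelled differently in the two tables silently drops out of the result. One way to check for that, sketched here on invented data, is a left join with indicator=True. The population figures are made-up round numbers, and one state name is misspelled on purpose.

```python
import pandas as pd

# Toy data, invented for illustration only.
cases = pd.DataFrame({
    "State/UnionTerritory": ["Kerala", "Delhi", "Telengana"],  # last name deliberately mismatched
    "Confirmed": [4000, 87000, 16000],
})
population = pd.DataFrame({
    "State/UnionTerritory": ["Kerala", "Delhi", "Telangana"],
    "Population": [35_000_000, 19_000_000, 39_000_000],  # made-up round figures
})

# A left join keeps every row of `cases`; '_merge' records whether a match was found.
merged = pd.merge(cases, population, on="State/UnionTerritory", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "State/UnionTerritory"].tolist()
print("States without a population match:", unmatched)

# Cases per 10 million people; unmatched states get NaN instead of disappearing.
merged["Cases/10million"] = merged["Confirmed"] / merged["Population"] * 10_000_000
print(merged[["State/UnionTerritory", "Cases/10million"]])
```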

Now let’s plot the visualization for Number of Confirmed Cases Vs. Number of Cases per 10 million people!

df = covid19_df_latest[(covid19_df_latest['Confirmed']>=30000)| (covid19_df_latest['Cases/10million']>=4000)]
plt.figure(figsize=(12,8),dpi=80)
plt.scatter(covid19_df_latest['Confirmed'],covid19_df_latest['Cases/10million'],alpha=0.5)
plt.xlabel('Number of confirmed cases',size=12)
plt.ylabel('Number of cases per 10 million people',size=12)
plt.scatter(df['Confirmed'],df['Cases/10million'],color='red')

for i in range(df.shape[0]):
    plt.annotate(df['State/UnionTerritory'].tolist()[i], xy=(df['Confirmed'].tolist()[i],df['Cases/10million'].tolist()[i]),
    xytext = (df['Confirmed'].tolist()[i]+1.0, df['Cases/10million'].tolist()[i]+12.0),size=11)

plt.tight_layout()
plt.title('Visualization to display the variation in COVID 19 figures in different Indian states', size=16)
plt.show()

Let us visualize our output –

[Figure: scatter plot of the number of confirmed cases against the number of cases per 10 million people, with the highlighted states labelled]

Step 9: Plot a Correlation Heatmap to Visualize Correlation Coefficients between Different Columns

A correlation heatmap color-codes the correlation coefficient of every pair of columns. A value near 1 means a strong positive linear relationship between the two features, a value near 0 means little or no linear relationship, and a value near -1 means a strong negative relationship, where one feature goes up as the other goes down. Which shade stands for which value depends on the colormap, so read the color bar next to the plot.

Let us plot it and gain some good insight!

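plt.figure(figsize=(10,8),dpi=80)
# Correlation is only defined for numeric columns, so select those first
sns.heatmap(covid19_df_latest.select_dtypes(include='number').corr(),annot=True)
plt.show()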

Our heatmap is generated after running this code! Let us visualize it!

[Figure: correlation heatmap of the numeric columns in covid19_df_latest]

We notice that some measures like Confirmed, Cured, Deaths, and Cases/10 million are strongly correlated with one another, so we need to take these figures seriously!
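
To make the values 1, 0, and -1 concrete, here is a small sketch with invented columns: one that rises with x, one that falls with it, and one built so that it has no linear relationship to it.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "rises_with_x": [10, 20, 30, 40, 50],   # perfect positive linear relation
    "falls_with_x": [5, 4, 3, 2, 1],        # perfect negative linear relation
    "unrelated": [2, 1, 3, 1, 2],           # constructed so its correlation with x is 0
})

corr = toy.corr()  # Pearson correlation by default
print(corr.round(2))
```

Passing corr to sns.heatmap would draw the same kind of chart as in this step.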

Enjoying exploratory data analysis python? Drop a big YES below in the comments!

Step 10: Analyzing Individual Data

Next up, we look at the individual case data. On initial inspection, we see that a huge number of values are missing in this dataset, which we must take into consideration as we move forward with our analysis.

individuals_df.isna().sum()

Output –

id                        0
government_id         25185
diagnosed_date            0
age                   25836
gender                22869
detected_city         25832
detected_district      6984
detected_state            0
nationality           25473
current_status            0
status_change_date      402
notes                  1335
dtype: int64

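So, should we remove these missing values?

Here’s how to do it. Note that .dropna() returns a new DataFrame without the rows that contain missing values. It does not change individuals_df unless you assign the result back. Since a large share of the rows have at least one missing field, we only preview the result here and keep the full data for the steps that follow.

individuals_df.dropna()  # preview only; individuals_df itself is unchanged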
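
The difference between dropping every incomplete row and dropping rows only where a specific column is missing can be sketched on a toy frame. The values are invented; only the column names echo the dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "age": [20, np.nan, 35, np.nan],
    "gender": ["F", "M", None, None],
})

all_complete = toy.dropna()                 # drops rows with any missing value
has_gender = toy.dropna(subset=["gender"])  # drops rows only where 'gender' is missing

# The original frame keeps all of its rows because nothing was assigned back.
print(len(toy), len(all_complete), len(has_gender))
```

Using subset lets you keep as many rows as possible for each individual question, such as the gender breakdown later in this step.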

Let us see where the first case was detected!

individuals_df.iloc[0]

Let us see the output-

id                                       0
government_id                     KL-TS-P1
diagnosed_date                  30/01/2020
age                                     20
gender                                   F
detected_city                     Thrissur
detected_district                 Thrissur
detected_state                      Kerala
nationality                          India
current_status                   Recovered
status_change_date              14/02/2020
notes                 Travelled from Wuhan
Name: 0, dtype: object

The first case in India due to COVID-19 was noticed on 30th January 2020. It was detected in the city of Thrissur in Kerala. The individual had a travel history in Wuhan.

# Count the recorded cases in each district and show the five largest
individuals_grouped_district = individuals_df.groupby('detected_district')['id'].count()
individuals_grouped_district.sort_values(ascending=False).head()

Now we will see in the output, how many individual cases were detected in different districts.

detected_district
Mumbai       3149
Ahmedabad    2181
Indore       1176
Jaipur        808
Pune          706
Name: id, dtype: int64

Let’s find the ratio of males to females who got infected by the disease!

individuals_grouped_gender = individuals_df.groupby('gender')
individuals_grouped_gender = pd.DataFrame(individuals_grouped_gender.size().reset_index(name='count'))
individuals_grouped_gender.head()

plt.figure(figsize=(10,6),dpi=80)
barlist= plt.bar(individuals_grouped_gender['gender'],individuals_grouped_gender['count'],align='center',color='grey',alpha=0.3)
barlist[1].set_color('r')
plt.ylabel('Count',size=12)
plt.title('Count on the basis of gender',size=16)
plt.show()

Now look at the graph output!

[Figure: bar chart titled 'Count on the basis of gender']

Uh oh!

Among the records that include gender, males are affected far more than females, almost double! Keep in mind that gender is missing for most records in this dataset, so read this ratio with caution.
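
If you want the ratio as a number rather than reading it off the chart, value_counts is one option. The sketch below uses a handful of invented values, including missing ones, to show that they are left out of the counts by default.

```python
import pandas as pd

# Toy values, invented for illustration only.
genders = pd.Series(["M", "F", "M", None, "M", "F", None, "M"])

counts = genders.value_counts()                 # missing entries are excluded by default
share = genders.value_counts(normalize=True)    # share among the non-missing entries
male_to_female = counts["M"] / counts["F"]

print(counts)
print(share.round(2))
print(male_to_female)
```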

Step 11: Progression of Case Count in India

In this step, we will have a look at how the number of cases increased in India. Afterward, we will inspect this curve and find similarities with the state-level curves.

For this analysis, I had to modify the dataset a bit. I grouped the data on the basis of the diagnosed_date feature so that I had a count of the number of cases detected each day throughout India.

I followed this up by doing a cumulative sum of this feature and adding it to a new column.

individuals_grouped_date = individuals_df.groupby('diagnosed_date')
individuals_grouped_date = pd.DataFrame(individuals_grouped_date.size().reset_index(name="count"))
individuals_grouped_date[['Day','Month','Year']] = individuals_grouped_date.diagnosed_date.apply(
    lambda x : pd.Series(str(x).split("/")))
individuals_grouped_date.sort_values(by=['Year','Month','Day'],inplace=True,ascending=True)
individuals_grouped_date.reset_index(inplace=True)
individuals_grouped_date['Cumulative Count'] = individuals_grouped_date['count'].cumsum()
individuals_grouped_date = individuals_grouped_date.drop(['index','Day','Month','Year'],axis=1)
individuals_grouped_date.head()
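
An alternative to splitting the date strings into Day, Month, and Year is to parse them into real dates, which then sort chronologically on their own. This is only a sketch on invented rows, assuming the day/month/four-digit-year form seen in the first record above.

```python
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "diagnosed_date": ["30/01/2020", "02/02/2020", "02/02/2020", "03/02/2020", "01/03/2020"],
})

# Parse day/month/year strings so that sorting is chronological.
toy["diagnosed_date"] = pd.to_datetime(toy["diagnosed_date"], format="%d/%m/%Y")

# Cases per day, in date order, followed by the running total.
daily = toy.groupby("diagnosed_date").size().reset_index(name="count")
daily = daily.sort_values("diagnosed_date").reset_index(drop=True)
daily["Cumulative Count"] = daily["count"].cumsum()
print(daily)
```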

Step 12: How the Case Counts Increased in India

Now we are at the most crucial section of this analysis. This section shows how quickly the case count climbed as the outbreak spread across the country.

Let us first code it –

individuals_grouped_date = individuals_grouped_date.iloc[3:]
individuals_grouped_date.reset_index(inplace=True)
individuals_grouped_date.columns = ['Day Number','diagnosed_date','count','Cumulative Count']
individuals_grouped_date['Day Number']  = individuals_grouped_date['Day Number']-2
plt.figure(figsize=(12,8), dpi=80)
plt.plot(individuals_grouped_date['Day Number'],individuals_grouped_date['Cumulative Count'],color="grey",alpha=0.5)
plt.xlabel('Number of Days', size = 12)
plt.ylabel('Number of Cases', size = 12)
plt.title('How the case count increased in India', size=16)
plt.show()

Now, let’s visualize the graph.

[Figure: line chart titled 'How the case count increased in India']

In the above curve, we see that the rise was more or less steady until the 20-day mark. Between days 20 and 30, the curve became a little steeper. The slope kept increasing, and after the 30-day mark we see a steady, steep climb with no signs of flattening. These are ominous indications.
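
One way to put a number on how steep such a curve is would be to estimate a doubling time from the cumulative counts. The sketch below is not part of the original analysis: it uses an invented series that doubles every five days and fits a straight line to the base-2 logarithm of the counts.

```python
import numpy as np

# Invented series for illustration: starts at 100 cases and doubles every 5 days.
days = np.arange(10)
cumulative = 100 * 2 ** (days / 5)

# Fit log2(cases) = slope * day + intercept; the doubling time is 1 / slope.
slope, intercept = np.polyfit(days, np.log2(cumulative), 1)
doubling_time = 1 / slope
print(round(doubling_time, 2))
```

On real data the fit will not be exact, and it only makes sense over a stretch where growth is roughly exponential.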

In the next few code snippets, I prepare and process the dataset to group the data by state. I used the following five states for this next analysis:

Maharashtra

covid19_maharashtra = covid19_df[covid19_df['State/UnionTerritory']=="Maharashtra"]
covid19_maharashtra.head()
covid19_maharashtra.reset_index(inplace=True)
covid19_maharashtra= covid19_maharashtra.drop(['index', 'Sno', 'Time', 'ConfirmedIndianNational', 'ConfirmedForeignNational','Cured'],  axis = 1)
covid19_maharashtra.reset_index(inplace = True)
covid19_maharashtra.columns = ['Day Count', 'Date', 'State/UnionTerritory', 'Deaths', 'Confirmed']
covid19_maharashtra['Day Count'] = covid19_maharashtra['Day Count'] +8
# Pad the seven days before the state's first record with zero counts
# (dates use month 03, matching the Gujarat padding further down)
missing_values = pd.DataFrame({"Day Count":[x for x in range(1,8)],
                              "Date": ["0"+ str(x)+"/03/20" for x in range(2,9)],
                              "State/UnionTerritory": ["Maharashtra"]*7,
                              "Deaths": [0]*7,
                              "Confirmed": [0]*7})
# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
covid19_maharashtra = pd.concat([covid19_maharashtra, missing_values], ignore_index=True)
covid19_maharashtra = covid19_maharashtra.sort_values(by="Day Count", ascending = True)

covid19_maharashtra.reset_index(drop=True, inplace=True)
print(covid19_maharashtra.shape)
covid19_maharashtra.head()

Tamil Nadu

covid19_tamilnadu = covid19_df[covid19_df['State/UnionTerritory'] == "Tamil Nadu"]
covid19_tamilnadu.reset_index(inplace = True)
covid19_tamilnadu = covid19_tamilnadu.drop(['index','Sno', 'Time', 'ConfirmedIndianNational', 'ConfirmedForeignNational','Cured'], axis = 1)
covid19_tamilnadu.reset_index(inplace = True)
covid19_tamilnadu.columns = ['Day Count', 'Date', 'State/UnionTerritory', 'Deaths', 'Confirmed']
# Use Tamil Nadu's own day count (covid19_delhi is not defined yet at this point)
covid19_tamilnadu['Day Count'] = covid19_tamilnadu['Day Count'] + 1
print(covid19_tamilnadu.shape)
covid19_tamilnadu.head()

Delhi

covid19_delhi = covid19_df[covid19_df['State/UnionTerritory'] == "Delhi"]
covid19_delhi.reset_index(inplace = True)
covid19_delhi = covid19_delhi.drop(['index','Sno', 'Time', 'ConfirmedIndianNational', 'ConfirmedForeignNational','Cured'], axis = 1)
covid19_delhi.reset_index(inplace = True)
covid19_delhi.columns = ['Day Count', 'Date', 'State/UnionTerritory', 'Deaths', 'Confirmed']
covid19_delhi['Day Count'] = covid19_delhi['Day Count'] + 1
print(covid19_delhi.shape)
covid19_delhi.head()

Gujarat

covid19_gujarat = covid19_df[covid19_df['State/UnionTerritory'] == "Gujarat"]
covid19_gujarat.reset_index(inplace = True)
covid19_gujarat = covid19_gujarat.drop(['index','Sno', 'Time', 'ConfirmedIndianNational', 'ConfirmedForeignNational','Cured'], axis = 1)
covid19_gujarat.reset_index(inplace = True)
covid19_gujarat.columns = ['Day Count', 'Date', 'State/UnionTerritory', 'Deaths', 'Confirmed']
covid19_gujarat['Day Count'] = covid19_gujarat['Day Count'] + 19
missing_values = pd.DataFrame({"Day Count": [x for x in range(1,19)],
                           "Date": [("0" + str(x) if x < 10 else str(x))+"/03/20" for x in range(2,20)],
                           "State/UnionTerritory": ["Gujarat"]*18,
                           "Deaths": [0]*18,
                           "Confirmed": [0]*18})
# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
covid19_gujarat = pd.concat([covid19_gujarat, missing_values], ignore_index = True)
covid19_gujarat = covid19_gujarat.sort_values(by="Day Count", ascending = True)
covid19_gujarat.reset_index(drop=True, inplace=True)
print(covid19_gujarat.shape)
covid19_gujarat.head()

Kerala

covid19_kerala = covid19_df[covid19_df['State/UnionTerritory'] == "Kerala"]
covid19_kerala = covid19_kerala.iloc[32:]
covid19_kerala.reset_index(inplace = True)
covid19_kerala = covid19_kerala.drop(['index','Sno', 'Time', 'ConfirmedIndianNational', 'ConfirmedForeignNational','Cured'], axis = 1)
covid19_kerala.reset_index(inplace = True)
covid19_kerala.columns = ['Day Count', 'Date', 'State/UnionTerritory', 'Deaths', 'Confirmed']
covid19_kerala['Day Count'] = covid19_kerala['Day Count'] + 1
print(covid19_kerala.shape)
covid19_kerala.head()
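
Since the same steps repeat for every state, they could be wrapped in a small helper. The function below is a hypothetical sketch, not the code used for the plots in this article: the name prepare_state is my own, it runs on invented rows, and it leaves out the zero-padding of missing days done for Maharashtra and Gujarat.

```python
import pandas as pd

def prepare_state(df, state, drop_cols=("Sno", "Time", "ConfirmedIndianNational",
                                         "ConfirmedForeignNational", "Cured")):
    """Return the rows for `state` with a 1-based 'Day Count' column in front."""
    out = df[df["State/UnionTerritory"] == state].copy()
    # Drop the unused columns, skipping any that are not present.
    out = out.drop(columns=[c for c in drop_cols if c in out.columns])
    out = out.reset_index(drop=True)
    out.insert(0, "Day Count", list(range(1, len(out) + 1)))
    return out

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "Sno": [1, 2, 3, 4],
    "Date": ["01/03/20", "01/03/20", "02/03/20", "02/03/20"],
    "State/UnionTerritory": ["Delhi", "Kerala", "Delhi", "Kerala"],
    "Deaths": [0, 0, 0, 0],
    "Confirmed": [1, 3, 2, 3],
})

toy_delhi = prepare_state(toy, "Delhi")
print(toy_delhi)
```

Working on a copy of the filtered rows also avoids the chained-assignment warnings that pandas can raise when a slice is modified in place.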

Now, let us see which states were flattening the curve.

plt.figure(figsize=(12,8), dpi=80)
plt.plot(covid19_kerala['Day Count'], covid19_kerala['Confirmed'])
plt.plot(covid19_maharashtra['Day Count'], covid19_maharashtra['Confirmed'])
plt.plot(covid19_delhi['Day Count'], covid19_delhi['Confirmed'])
plt.plot(covid19_tamilnadu['Day Count'], covid19_tamilnadu['Confirmed'])
plt.plot(covid19_gujarat['Day Count'], covid19_gujarat['Confirmed'])
plt.legend(['Kerala', 'Maharashtra', 'Delhi', 'Tamil Nadu', 'Gujarat'], loc='upper left')
plt.xlabel('Day Count', size=12)
plt.ylabel('Confirmed Cases Count', size=12)
plt.title('Which states are flattening the curve ?', size = 16)
plt.show()

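Let us visualize the graph generated in the output.

[Figure: line chart titled 'Which states are flattening the curve ?', showing confirmed cases against day count for Kerala, Maharashtra, Delhi, Tamil Nadu, and Gujarat]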

We see almost all the curves follow the curve which is displayed by the nation as a whole. The only odd one out is that of Kerala. Kerala’s curve saw the gradual incline in the period between 20-30 days as seen in other curves.

But what Kerala managed to do was keep the curve from inclining further and flatten it. As a result, the state has been able to contain the situation.

The situation in Maharashtra looks very grave indeed. The curve has had an immensely steep incline and shows no signs of slowing down. Gujarat’s curve steepened later than the rest. It remained under control until the 30-day mark, and the slope grew steeper after 40 days.

The only way we can as a whole prevent this impending crisis is by flattening the curve.

You can get the code for this exploratory data analysis in my GitHub repository. The links are provided in the conclusion.

I hope you now have an idea of how to do exploratory data analysis in Python.

Conclusion

This EDA tutorial in Python, with its example and code, should have given you a basic step-by-step idea of how EDA projects work. There are many exploratory data analysis projects in Python on GitHub. Refer to them if you feel like practicing more EDA using Python.

By Author,

I hope this article helped you gain a lot of insights into the spread of the virus and its impact on the country’s population. Also, feel free to download my code and data from my GitHub repository and get hands-on with EDA!

Feel free to connect with me on GitHub, Medium, and LinkedIn for further discussions!

This concludes the topic of exploratory data analysis in Python. Thank you! Sukanya Bag
