Analyzing COVID-19 Time Series Data with Pandas in Python
Chapter 1: Introduction to Time Series Data
In the realm of data science, time series datasets are among the most frequently encountered. This tutorial aims to provide a concise introduction to utilizing Pandas for manipulating and analyzing the confirmed COVID-19 case dataset sourced from Johns Hopkins University (JHU) CSSE.
Let's dive right in!
If you are new to Pandas or Python, begin by installing a recent version of Python, then install Pandas from your console. The plotting calls later in this tutorial also need Matplotlib, so install both:
$ pip install pandas matplotlib
Section 1.1: Preparing the Dataframe
Start by creating a project folder and downloading the time-series CSV file (time_series_covid19_confirmed_global.csv) from Johns Hopkins CSSE into it.
Next, create a new Python file and load the CSV into a Pandas DataFrame:
import pandas as pd
df = pd.read_csv('time_series_covid19_confirmed_global.csv')
print(df)
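Before cleaning anything, it helps to see how the file is laid out. The sketch below does not read the real file; it assumes a tiny made-up sample in the same wide layout (four descriptive columns followed by one column per date in m/d/yy form) and inspects it the way you would inspect the full DataFrame. The sample rows and numbers are illustrative only.

```python
import io

import pandas as pd

# Made-up sample in the same wide layout as the JHU file:
# four descriptive columns, then one column per date (m/d/yy).
sample_csv = io.StringIO(
    "Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20\n"
    ",Afghanistan,33.9,67.7,0,0,0\n"
    "New South Wales,Australia,-33.9,151.2,0,0,3\n"
    "Victoria,Australia,-37.8,145.0,0,1,1\n"
)
df = pd.read_csv(sample_csv)

print(df.shape)                  # (number of rows, number of columns)
print(df.columns[:4].tolist())   # the descriptive columns
print(df.head())                 # first rows; an empty Province/State is read as NaN
```

With the real file, the same three print calls show one row per province or country and one column per day.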
Section 1.2: Cleaning the Data
Now, let's clean the data. The dataset includes cases at the Province/State level in certain areas, so we will aggregate this information to the Country/Region level using the groupby function to sum the total cases.
Before aggregation, we will remove unnecessary columns such as Province/State, Latitude, and Longitude:
df = df.drop(columns=['Province/State', 'Lat', 'Long'])
df = df.groupby('Country/Region').agg('sum')
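To see what the aggregation does, here is a minimal sketch on a made-up three-row table in the same layout, in which one country is split across two states. The numbers are invented for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Made-up numbers in the JHU wide layout; Australia is split by state.
raw = pd.DataFrame({
    "Province/State": [None, "New South Wales", "Victoria"],
    "Country/Region": ["Afghanistan", "Australia", "Australia"],
    "Lat": [33.9, -33.9, -37.8],
    "Long": [67.7, 151.2, 145.0],
    "1/22/20": [0, 0, 0],
    "1/23/20": [0, 0, 1],
    "1/24/20": [0, 3, 1],
})

# Drop the descriptive columns first so that only case counts are summed,
# then add up the rows that share a Country/Region value.
by_country = (
    raw.drop(columns=["Province/State", "Lat", "Long"])
       .groupby("Country/Region")
       .sum()
)
print(by_country)
```

The two Australian rows collapse into one, and Country/Region becomes the index of the result.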
Chapter 2: Preparing the Datetime Index
Next, we need to create a DateTime index for our DataFrame. To do this, we will transpose the DataFrame first:
df = df.T
In this transposed DataFrame, the index now holds the dates, but they are still strings such as '1/22/20'. Convert them with pd.to_datetime, which returns a DatetimeIndex when it is given an index of strings, and assign the result back as the index (wrapping it in pd.DatetimeIndex again is not needed):
df.index = pd.to_datetime(df.index, format='%m/%d/%y')
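The transpose-and-convert step can be checked on a small example. This sketch assumes a made-up two-country table whose column labels are date strings in m/d/yy form, and passes an explicit format so the parsing does not depend on inference.

```python
import pandas as pd

# Made-up country-level table: one row per country, one column per date string.
by_country = pd.DataFrame(
    {"1/22/20": [0, 0], "1/23/20": [0, 1], "1/24/20": [0, 4]},
    index=pd.Index(["Afghanistan", "Australia"], name="Country/Region"),
)

ts = by_country.T                      # dates become the row index
print(type(ts.index).__name__)         # a plain Index of strings at this point

# Parse the strings; the explicit format is month/day/two-digit year.
ts.index = pd.to_datetime(ts.index, format="%m/%d/%y")
print(type(ts.index).__name__)         # now a DatetimeIndex
print(ts)
```

Once the index is a DatetimeIndex, attributes such as ts.index.day and ts.index.month and date-string selection with .loc become available.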
Section 2.1: Exploring the Time-Series Data
Our time-series DataFrame is now set up with a DateTime index. You can extract data from specific dates. For instance:
- To select confirmed COVID-19 cases for the 15th of any month:
df[df.index.day == 15]
- To select data from April:
df[df.index.month == 4] # April of every year
# or
df.loc['2020-04'] # April 2020 only (df['2020-04'] no longer selects rows in pandas 2.x)
- To select data from April 1, 2020, to April 5, 2020 (both dates included):
df.loc['2020-04-01':'2020-04-05']
- To keep only the six countries with the most confirmed cases on the latest date (note that this overwrites df, so the later steps work on these six countries only):
df = df.sort_values(by=df.index[-1], axis=1, ascending=False)
df = df.iloc[:, :6]
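The selections above can be tried without the real dataset. The sketch below assumes a hypothetical table of eight made-up countries with steadily growing cumulative counts, and uses Series.nlargest as an alternative way to pick the top six columns without overwriting the original DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative counts for 8 made-up countries over 120 days.
dates = pd.date_range("2020-03-01", periods=120, freq="D")
countries = [f"Country {c}" for c in "ABCDEFGH"]
daily_new = np.arange(1, 9)  # Country A gains 1 case per day, B gains 2, ...
ts = pd.DataFrame(
    np.outer(np.arange(1, 121), daily_new), index=dates, columns=countries
)

on_15th = ts[ts.index.day == 15]                 # boolean mask on the day of month
april = ts.loc["2020-04"]                        # partial date string: all of April 2020
first_days = ts.loc["2020-04-01":"2020-04-05"]   # slice; both endpoints are included

# Top six columns by value on the last available date; ts itself is left intact.
top6 = ts[ts.iloc[-1].nlargest(6).index]

print(len(on_15th), len(april), len(first_days))
print(top6.columns.tolist())
```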
Chapter 3: Resampling the Time-Series Data
Now for the most useful part: changing the frequency of the time series with the resample method. resample needs a frequency alias (a string) as its first argument:
df.resample(timeinterval)
Commonly used aliases include:
- 'nD' for n days
- 'nM' for n months, labeled by month end (written 'nME' in pandas 2.2 and later, where 'M' is deprecated)
- 'nW' for n weeks
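On its own, resample only groups the rows into time bins; chain an aggregation to get a result. Here timeinterval stands for one of the aliases above:
df.resample(timeinterval).sum() # Sum of the values in each bin
df.resample(timeinterval).mean() # Mean of the values in each bin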
For example, to compute the weekly mean of the confirmed-case counts and plot it (the JHU counts are cumulative, so this is the average running total within each week):
df.resample('W').mean()
df.resample('W').mean().plot()
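Because the counts are cumulative, it can be useful to compare the weekly mean with the end-of-week total. The sketch below assumes a hypothetical country that gains exactly 10 cases a day for four full weeks, and contrasts .mean() with .last(); the choice between them depends on what you want the weekly figure to represent.

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative counts: 10 new cases every day for 4 full weeks,
# starting on a Monday so that each weekly bin (ending on Sunday) has 7 days.
dates = pd.date_range("2020-03-02", periods=28, freq="D")
ts = pd.DataFrame({"Country A": np.arange(1, 29) * 10}, index=dates)

weekly_mean = ts.resample("W").mean()   # average cumulative level within each week
weekly_last = ts.resample("W").last()   # cumulative total at the end of each week

print(weekly_mean)
print(weekly_last)
```

Each weekly row is labeled with the Sunday that closes the week, which is the default for the 'W' alias.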
Chapter 4: Analyzing Percentage Growth
To analyze the week-over-week growth of COVID-19 cases, chain .pct_change(), which returns the fractional change from the previous row (0.25 means 25% growth):
df.resample('W').mean().pct_change()
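You can visualize this by skipping the first rows, where the change is undefined (the first row is always NaN because there is no earlier week to compare against):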
df.resample('W').mean().pct_change().iloc[2:].plot(marker="v", figsize=(15, 5))
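A small worked example makes the output of pct_change easier to read. The weekly values below are invented for a single hypothetical country; dropna is used here as one possible alternative to slicing with iloc for discarding the undefined first row.

```python
import pandas as pd

# Hypothetical weekly values for one made-up country.
weekly = pd.DataFrame(
    {"Country A": [100.0, 150.0, 300.0, 330.0]},
    index=pd.date_range("2020-03-08", periods=4, freq="W"),
)

# Each row is (current - previous) / previous; the first row has no previous value.
growth = weekly.pct_change()
print(growth)

# Drop the undefined first row and express the rest as percentages.
growth_pct = growth.dropna() * 100
print(growth_pct.round(1))
```

Going from 100 to 150 is a change of 0.5 (50%), from 150 to 300 a change of 1.0 (100%), and from 300 to 330 a change of about 0.1 (10%).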
Section 4.1: Identifying Trends
A rolling average smooths out day-to-day fluctuations and makes the underlying trend easier to see. For instance, to calculate and plot the mean over a rolling 10-day window with the rolling method:
df.rolling('10D').mean().plot(marker="v", figsize=(15, 5))
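The string '10D' defines a time-based window, which behaves slightly differently from an integer window of 10 rows. The sketch below, on made-up daily values, shows the difference: the time-based window returns a value from the first row onward, while the integer window yields NaN until 10 rows are available.

```python
import numpy as np
import pandas as pd

# Hypothetical daily values for one made-up country: 10, 20, ..., 150.
dates = pd.date_range("2020-03-01", periods=15, freq="D")
ts = pd.DataFrame({"Country A": np.arange(1, 16) * 10.0}, index=dates)

# Time-based window: every row averages whatever falls in the last 10 calendar days.
rolling_10d = ts.rolling("10D").mean()

# Integer window: exactly 10 rows are required, so the first 9 results are NaN.
rolling_10 = ts.rolling(10).mean()

print(pd.concat(
    [rolling_10d["Country A"].rename("10D"), rolling_10["Country A"].rename("10 rows")],
    axis=1,
))
```

On gap-free daily data the two agree once 10 days have accumulated; they diverge when dates are missing, because the time-based window then simply contains fewer rows.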
Conclusion
This article is a basic guide to manipulating and analyzing time series data with Pandas, using the confirmed COVID-19 cases dataset as the example. It covers only the core operations: selecting by date, resampling, percentage change, and rolling windows. Many more advanced techniques are available for time series analysis; for details, refer to the official Pandas time-series documentation.
I hope this guide proves beneficial for your projects and daily tasks. Feel free to reach out with any questions or feedback.
Stay safe and healthy! 💪
Thank you for reading. 📚