General requirements for the assignment

DEADLINE FOR THIS ASSIGNMENT IS 29 OCTOBER 2021 BEFORE 23:59


Assignment

In the past 7 weeks, you have been working with Google mobility data. Now, let's combine that data with covid-19 data to see if we can derive some interesting insights. There are multiple sources of COVID-19 data. Maybe the country that you chose has its separate data source.

Feel free to use either of these data sources or something you found on your own!

Import libraries

Part I - Data import

  1. [5 points] Create a new dataframe

This dataframe should combine mobility data and covid-19 data of your chosen country. There are different types of covid data available such as the number of positively tested cases, hospital admission, fatality rates, government stringency index, etc. Provide a brief explanation or data dictionary of your new dataframe. Keep in mind that you need to associate these two datasets, then pick municipal, provincial, or nationwide data accordingly.

Data origin and location

The mobility data (NZ_nation.csv) for New Zealand was retrieved from Google and then processed to contain only the national data.
The Covid-19 data (owid-covid-data.csv) was retrieved from OurWorldInData. This file contains all countries and is filtered in the next step to contain only New Zealand.

The files: 'Final Assignment.ipynb', 'NZ_nation.csv' and 'owid-covid-data.csv' are all placed in the same directory.

1. Create the new dataframe

Explanation of the new dataframe 'df_nz'

The dataframe 'df_nz' is made from the mobility data (df_mobi_nz) and the Covid-19 data (df_covid_nz) of New Zealand.
The dataframe of the mobility data 'df_mobi_nz' shows the changes in mobility (in percentages) during 2020/2021.
With this data the impact of Covid-19 on the change in mobility in different places can be analysed and visualised.
The dataframe of the Covid-19 data 'df_covid_nz' shows different parameters on Covid-19, for example: the total cases, weekly ICU admissions and the stringency index.
With the dataframe 'df_nz', the correlation between the changes in, and the peaks / valleys of, the mobility and Covid-19 data can be analysed and visualised.
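
As an illustration of how 'df_nz' could be built, here is a minimal sketch that filters the worldwide Covid-19 data to New Zealand and joins it to the mobility data on the date. The small inline frames stand in for the two CSV files, the mobility column name is hypothetical, and the inner join is an assumption about how the two sources were associated.

```python
import pandas as pd

# Tiny stand-ins for the two CSV files; in the notebook these would come from
# pd.read_csv("NZ_nation.csv") and pd.read_csv("owid-covid-data.csv").
df_mobi_nz = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-02", "2020-03-03"],
    "retail_and_recreation": [2.0, -1.0, -5.0],  # hypothetical column name
})
df_covid = pd.DataFrame({
    "location": ["New Zealand", "New Zealand", "Australia", "New Zealand"],
    "date": ["2020-03-01", "2020-03-02", "2020-03-02", "2020-03-04"],
    "new_cases": [1.0, 0.0, 4.0, 2.0],
})

# Keep only the New Zealand rows of the worldwide Covid-19 file.
df_covid_nz = df_covid[df_covid["location"] == "New Zealand"].drop(columns="location")

# Parse the dates so both frames share the same key type.
df_mobi_nz["date"] = pd.to_datetime(df_mobi_nz["date"])
df_covid_nz = df_covid_nz.assign(date=pd.to_datetime(df_covid_nz["date"]))

# Inner join (an assumption): keep only dates present in both sources.
df_nz = df_mobi_nz.merge(df_covid_nz, on="date", how="inner").reset_index(drop=True)
print(df_nz)
```

With an inner join, the positional index of 'df_nz' runs from 0 without gaps, which is what allows indices to stand in for dates later on.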

Part II - Data processing

As you already know, there are various peaks/valleys in the changes of mobility activity data. In this assignment, find peaks/valleys (if available) in the covid data.

After identifying peaks from two datasets, you need to check if there are common peaks. Most likely, the peaks do not intersect on the same day, so it should be possible to provide a certain offset to combine peaks/valleys that are close to each other. A visual representation of this problem is shown in the following image:

drawing

Below are the challenges that need to be solved for this part:

  2. [8 points] Provide pseudo-code or logic behind the offset algorithms that you will develop for the following questions (3. and 4.) Use bullet points/flow chart/pseudocode/other means to explain the logic.
  3. [10 points] Find all the common peaks/valleys of mobility activity patterns of a municipality/provinces/nation within a range of time offsets. e.g. find common peaks between 1 activity of two municipalities OR find common peaks between 2 activities of the same municipality
  4. [2 points] Find all the common peaks/valleys of the selected covid data of municipality/provinces/nation within a range of time offsets. e.g. find common peaks between 1 type of covid data (e.g. vaccinations) of two municipalities OR find common peaks between 2 types of covid data (e.g. vaccinations and deaths) of the same municipality
  5. [8 points] Relationship between common peaks/valleys (municipal/provincial/nationwide) in activities and covid data (municipal/provincial/nationwide) (time-offset) (either through observation or using programmable logic). If you only use visual observational methods, you won't get maximum points for this question. e.g. compare peaks of 1 activity and 1 type of covid data of the same municipality OR compare common peaks of all activities and common peaks of all types of covid data of the same municipality

Motivate your selection for the data choice for finding the common peaks

2. Explanation and pseudo-code of the offset algorithm

Algorithm method to find the common peaks and valleys

This algorithm uses the indices of the data, since they correspond to the dates of the data. The algorithm starts after the indices of the peaks or valleys of both activities (question 3) or both types of Covid-19 data (question 4) have been found with the scipy function 'find_peaks'.

Lists are made for the matched indices of the first activity, matched indices of the second activity and the dates.
An offset is set, where the numeric value corresponds to the number of days.
Only the date column from the 'df_nz' dataframe is selected into another dataframe.
Then three if-statements are executed:

  1. For each value (i) in the list of peaks or valleys from the first activity / type of Covid-19 data, and for each value (j) in the list of peaks or valleys in the second activity / type of Covid-19 data, if there is a match, then the values (i) and (j) are appended to the lists made for the matched indices of the activities / types of Covid-19 data. The dates are retrieved from the 'df_nz' dataframe by locating them with the (i) values and then appending them to the list of dates.

  2. For each value (i) - two days in the list of peaks or valleys from the first activity / type of Covid-19 data, and for each value (j) in the list of peaks or valleys in the second activity / type of Covid-19 data, if there is a match, then the values (i) and (j) are appended to the lists made for the matched indices of the activities / types of Covid-19 data. The dates are retrieved from the 'df_nz' dataframe by locating them with the (i) - 2 values and then appending them to the list of dates.

  3. For each value (i) + two days in the list of peaks or valleys from the first activity / type of Covid-19 data, and for each value (j) in the list of peaks or valleys in the second activity / type of Covid-19 data, if there is a match, then the values (i) and (j) are appended to the lists made for the matched indices of the activities / types of Covid-19 data. The dates are retrieved from the 'df_nz' dataframe by locating them with the (i) + 2 values and then appending them to the list of dates.

After the execution of the algorithm, with each list of indices, the corresponding values are located in the 'df' dataframe and stored in separate arrays.
Then the three arrays are combined into a dataframe with the dates and the corresponding peaks / valleys.

Pseudo-code of the offset algorithm:

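Input: first_data, second_data (indices of the peaks or valleys of the two series), df_data (dataframe with the date column)

Output: match_ind, match_ind_two, date_list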

Initialise:
match_ind ← [ ]
match_ind_two ← [ ]
date_list ← [ ]
df_date ← df_data [ date column ]
offset ← set value

for i in first_data:
  for j in second_data:
    if i == j:
      match_ind.add(i)
      match_ind_two.add(j)
      date_list.add(df_date[i])
    elif i - offset == j:
      match_ind.add(i)
      match_ind_two.add(j)
      date_list.add(df_date[i - offset])
    elif i + offset == j:
      match_ind.add(i)
      match_ind_two.add(j)
      date_list.add(df_date[i + offset])
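
The pseudo-code above can be sketched in plain Python as follows. The function name `common_extrema` and the example indices and dates are invented for illustration; the three tests mirror the pseudo-code, and the stored date is always the date at position j, which equals i, i - offset or i + offset depending on the branch that matched.

```python
def common_extrema(first_idx, second_idx, dates, offset):
    """Pair extrema indices that coincide exactly or lie `offset` positions apart."""
    match_ind, match_ind_two, date_list = [], [], []
    for i in first_idx:
        for j in second_idx:
            # Same three tests as the pseudo-code: j == i, j == i - offset, j == i + offset.
            if j in (i, i - offset, i + offset):
                match_ind.append(i)
                match_ind_two.append(j)
                date_list.append(dates[j])  # date of the matched position
    return match_ind, match_ind_two, date_list


# Invented example: nine consecutive days and two short lists of peak indices.
dates = ["2020-03-0%d" % day for day in range(1, 10)]
first_peaks = [1, 5, 8]
second_peaks = [1, 7, 20]

matched_first, matched_second, matched_dates = common_extrema(
    first_peaks, second_peaks, dates, offset=2
)
print(matched_first, matched_second, matched_dates)
# → [1, 5] [1, 7] ['2020-03-02', '2020-03-08']
```

Note that this only pairs indices that differ by exactly 0 or exactly `offset` days. To accept every distance within the range instead, the condition could be replaced by `abs(i - j) <= offset`.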

3. Common peaks and valleys between two activities in New Zealand

The two activities that are chosen for comparing the common peaks and valleys are 'Retail and recreation' and 'Grocery and pharmacy'.
Data in the columns of these activities shows the change of these activities in percentages compared to baseline.

I chose these activities because they are closely related. To perform these activities outdoors you need to go to stores. Usually different types of stores are located in a shopping centre or area. Therefore, I expect that when one of the two activities shows a peak, the other one might also show a peak around the same time.

In order not to copy or write the names of the activities every time, a list was made with the names of the columns of the activities.

Find peaks and valleys of the two activities
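
A minimal sketch of this step, assuming the default settings of scipy's 'find_peaks' (the notebook may have passed extra arguments such as a minimum height or distance). Valleys are found by running the same function on the negated values; the numbers are invented.

```python
import numpy as np
from scipy.signal import find_peaks

# Invented percentage changes compared to baseline for one activity.
values = np.array([0.0, 2.0, 1.0, 3.0, 0.5, -1.0, 0.0, 4.0, 1.0])

# Peaks: positions that are strictly higher than both neighbours.
peak_idx, _ = find_peaks(values)

# Valleys: peaks of the negated series.
valley_idx, _ = find_peaks(-values)

print(peak_idx.tolist())    # → [1, 3, 7]
print(valley_idx.tolist())  # → [2, 5]
```

The first and last positions are never reported, because they have only one neighbour.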

Find the common peaks

Find the common valleys

4. Common peaks and valleys between two types of Covid-19 data in New Zealand

The two types of Covid-19 data that are chosen for comparing the common peaks and valleys are 'New tests' and 'New cases'.
Data in the columns of these types show the numbers of the new tests and the new cases of Covid-19, respectively.

I chose these types of Covid-19 data because I think they could be related. If there is a sharp increase (peak) in new tests being done, I would expect that there could also be an increase in new cases. This does of course depend on how fast the results of the tests are provided and registered.

Find peaks and valleys of two types of Covid-19 data

Find the common peaks

Find the common valleys

5. Compare peaks of one activity and one type of Covid-data in New Zealand

The activity chosen for comparing the common peaks / valleys is 'Residential' and the type of Covid-19 data is 'New cases'.
Data in the columns of the activity and type of Covid-19 data show the change in residential percentage compared to baseline and the new cases of Covid-19, respectively.

I chose the activity and the type of Covid-19 data because I thought there might be a correlation, since a peak in new cases could result in restrictions, which itself could result in an increase in change in residential percentages compared to baseline.

I used a different method (a built-in method) for the time offset in this question than in questions 3 and 4, since the assignment information said that the offset algorithm only applied to those two questions.

Built-in method to find the common peaks / valleys

The built-in method uses the function pandas.Grouper to group dates (and the values corresponding to them) based on weeks starting on Monday, using an offset alias (W-MON).
After the indices of the peaks or valleys are found of both the activity and the type of Covid-19 data with the scipy function 'find_peaks', the values of the activity and type of Covid-19 data and the dates are retrieved from the 'df' dataframe with the use of the indices.
For each activity and type of Covid-19 data the values and the date column are stored in two separate dataframes.
Then, for each of the two newly made dataframes the date column is converted to datetime format.
The data from each of these dataframes are grouped each week starting at Monday.
The largest value of the activity / type of Covid-19 data is chosen when it concerns peaks, otherwise in the case of valleys the smallest value is chosen.
All the values are stored in a new dataframe.
Hereafter, the NaN values of the newly created dataframe are dropped.
The two dataframes with the grouped values are then merged into one dataframe on the date column resulting in a dataframe with the common peaks / valleys.
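
The steps above can be sketched as follows. The helper name `weekly_extreme`, the column names and the values are invented for illustration. By default, pandas closes and labels `W-MON` bins on the Monday that ends them, so the sketch passes `closed="left"` and `label="left"` to obtain weeks that start on Monday; whether the notebook did the same is an assumption.

```python
import pandas as pd


def weekly_extreme(dates, values, name, kind="peak"):
    """Keep one value per Monday-based week: the max for peaks, the min for valleys."""
    df = pd.DataFrame({"date": pd.to_datetime(dates), name: values})
    # Weeks run from Monday up to (not including) the next Monday.
    grouper = pd.Grouper(key="date", freq="W-MON", closed="left", label="left")
    grouped = df.groupby(grouper)[name]
    weekly = grouped.max() if kind == "peak" else grouped.min()
    # Weeks without any peak / valley produce NaN and are dropped.
    return weekly.dropna().reset_index()


# Invented peaks of the activity and of the Covid-19 data.
res_weekly = weekly_extreme(
    ["2020-03-24", "2020-03-27", "2020-04-15"], [10, 14, 20], "residential"
)
cases_weekly = weekly_extreme(["2020-03-26", "2020-04-07"], [70, 50], "new_cases")

# Weeks in which both series have a peak are the common peaks.
common = res_weekly.merge(cases_weekly, on="date", how="inner")
print(common)
```

In this example only the week starting on Monday 2020-03-23 contains a peak in both series, so the merged dataframe has a single row.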

Find peaks and valleys of the activity and the type of Covid-19 data

Find the common peaks

Visual analysis of the common peaks

Analysis:
From the plot above it can be seen that most of the common peaks are concentrated at two periods: one period is at the end of March and throughout April 2020 and the other period is at the end of August 2021.
According to Wikipedia, New Zealand moved to alert level 4, which corresponds to a lockdown, on 25 March 2020 and again on 17 August 2021. This explains the increase in residential percentages compared to baseline around these periods.
The first lockdown was initiated due to an increase in new cases, which could imply an indirect correlation between new cases and the change in residential percentage: increase in new cases → lockdown → increase in change of residential percentage.
However, this was not the case with the second lockdown, since that lockdown was issued due to a specific case of the Delta variant (according to Wikipedia). A possible explanation for the common peaks in that period is that after the discovery of the Delta case more people were tested, which resulted in an increase in new cases. In that case, there is no correlation, direct or indirect, between new cases and the changes in residential percentages compared to baseline.

Wikipedia source:
https://en.wikipedia.org/wiki/COVID-19_alert_levels_in_New_Zealand

Statistical analysis of common peaks

Analysis:
The covariance matrix and Pearson's correlation coefficient are used to measure the linear relationship between two variables.
From the results of the statistical analysis above, it can be concluded that there is no positive direct correlation between the peaks of the new cases and the residential percentage compared to baseline. There is even a small negative direct correlation.
As said before, there might however be an indirect relationship.
That indirect relationship is only causal in the sense that an increase in new cases → lockdown → increase in change of residential percentage.
It does not determine the values themselves. For example, an increase of 50 new cases does not mean an increase of 50 % in the change of residential percentage compared to baseline.
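
For reference, a minimal sketch of how the two statistics relate, using NumPy and invented values rather than the notebook's results. Pearson's coefficient is the covariance divided by the product of the two standard deviations, so it always lies between -1 and 1, whereas the covariance depends on the units of both variables.

```python
import numpy as np

# Invented common-peak values, not the results from the notebook.
residential = np.array([14.0, 20.0, 18.0, 25.0])
new_cases = np.array([70.0, 50.0, 60.0, 20.0])

# 2x2 covariance matrix: variances on the diagonal, covariance off the diagonal.
cov_matrix = np.cov(residential, new_cases)

# Pearson's r taken from the correlation matrix.
r = np.corrcoef(residential, new_cases)[0, 1]

# The same value computed by hand from the covariance matrix.
r_manual = cov_matrix[0, 1] / np.sqrt(cov_matrix[0, 0] * cov_matrix[1, 1])

print(cov_matrix)
print(r, r_manual)
```

A negative r, as in this invented example, means that higher residential peaks tend to coincide with lower peaks in new cases.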

Find the common valleys

Analysis:
Since no common valleys were found, no further analysis was performed.

Part III - Data visualisation

  6. [12 points] Use visualization to tell the mobility and covid data story of a specific municipality/province/nationwide. This is more of an exploration question. Explain the logic behind your story and also your visualization choices.

6. Compare New Zealand and Australia

I chose to visualise the mobility and (several parameters of) the Covid-19 data of New Zealand and Australia, in order to compare the two countries. With this comparison the difference between mobility trends and the Covid-19 effects and response can be seen and analysed.

Prepare the data from Australia

Although the data from New Zealand is ready for use, the data of Australia also needs to be prepared and made ready for use.
The data of the mobility reports of Australia ('2020_AU_Region_Mobility_Report.csv', '2021_AU_Region_Mobility_Report.csv') were retrieved from Google and stored in the folder with the assignment and the other data from New Zealand. The Covid-19 of Australia was retrieved from the same file of OurWorldInData as that of New Zealand (owid-covid-data.csv).

Comparison of mobility data between New Zealand and Australia

In order to compare both dataframes, visually but also statistically, both dataframes need to cover the same time period. Therefore, two if-statements are made to test which dataframe is longer, after which that dataframe is shortened to the length of the other one. Since the dates are in chronological order, dataframes of the same length also cover the same time period, provided that both start on the same date and have no missing days.
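
A minimal sketch of this length check, with invented frames; the name 'df_mobi_au' and the column name are hypothetical. The last line verifies the condition on which the approach relies, namely that the remaining dates are identical row by row.

```python
import pandas as pd

# Invented stand-ins: both frames start on the same date, but one is longer.
df_mobi_nz = pd.DataFrame(
    {"date": pd.date_range("2020-02-15", periods=6), "workplaces": range(6)}
)
df_mobi_au = pd.DataFrame(
    {"date": pd.date_range("2020-02-15", periods=4), "workplaces": range(4)}
)

# Shorten whichever frame is longer to the length of the other one.
if len(df_mobi_nz) > len(df_mobi_au):
    df_mobi_nz = df_mobi_nz.iloc[: len(df_mobi_au)]
if len(df_mobi_au) > len(df_mobi_nz):
    df_mobi_au = df_mobi_au.iloc[: len(df_mobi_nz)]

# Equal length only means equal period if the dates match row by row.
same_period = bool(
    (df_mobi_nz["date"].to_numpy() == df_mobi_au["date"].to_numpy()).all()
)
print(len(df_mobi_nz), len(df_mobi_au), same_period)
```

If the check fails, filtering both frames on their shared dates, for example with `isin` on the date column, would be a more robust alternative.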

All types of mobility data of New Zealand

Pearson's correlation coefficient - New Zealand

Analysis:
From the heatmap above, several strong correlations can be seen.
Between the 'Residential' activity and the three activities 'Retail and recreation', 'Public transport' and 'Workplaces' there is a strong negative correlation. When the three activities, which are outdoor, decrease, 'Residential', which is indoor, increases.
Between 'Retail and recreation' and the two activities 'Grocery and pharmacy' and 'Public transport' there is a strong positive correlation. When people go shopping in retail stores, they often also shop for groceries and might use public transport to get to the stores.
Between 'Workplaces' and 'Public transport' there is a strong correlation. When people go to work, they often use public transport.

All types of mobility data of Australia

Pearson's correlation coefficient - Australia

Analysis:
From the heatmap above, several strong correlations can be seen.
Between the 'Residential' activity and the three activities 'Retail and recreation', 'Public transport' and 'Workplaces' there is a strong negative correlation. When the three activities, which are outdoor, decrease, 'Residential', which is indoor, increases.
Between 'Workplaces' and 'Public transport' there is a strong correlation. When people go to work, they often use public transport.

Same types of mobility data for New Zealand and Australia

Pearson's correlation coefficient - Same types of mobility data of New Zealand and Australia

Analysis:
The highest values of Pearson's correlation coefficient can be seen for the activities: 'Retail and recreation', 'Grocery and pharmacy', 'Public transport', and 'Residential'.
This implies that when the percentage of change increased or decreased for these activities in New Zealand, this also largely happened in Australia.
Although an increase or decrease in percentage of change in activity in New Zealand does not cause an increase or decrease in percentage of change in activity in Australia, it does suggest that people in both countries act in a similar way, resulting in similar mobility trends with regard to these activities.
For the other activities this correlation was a lot weaker.
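
A minimal sketch of such a per-activity comparison between the two countries, using `Series.corr`, which computes Pearson's coefficient by default. The column names and values are invented, and both frames are assumed to be aligned row by row on the same dates.

```python
import pandas as pd

activities = ["retail_and_recreation", "residential"]  # hypothetical column names

# Invented values, assumed to cover the same dates in the same order.
df_nz = pd.DataFrame({
    "retail_and_recreation": [0, -10, -40, -20, -5],
    "residential": [0, 3, 20, 10, 2],
})
df_au = pd.DataFrame({
    "retail_and_recreation": [1, -8, -30, -25, -4],
    "residential": [0, 2, 15, 12, 1],
})

# One Pearson coefficient per activity: New Zealand versus Australia.
cross_corr = pd.Series({col: df_nz[col].corr(df_au[col]) for col in activities})
print(cross_corr.round(2))
```

Values close to 1 indicate that the activity moved in the same direction at the same time in both countries.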

Comparison of Covid-19 data between New Zealand and Australia

In order to compare both dataframes, visually but also statistically, both dataframes need to cover the same time period. Therefore, two if-statements are made to test which dataframe is longer, after which that dataframe is shortened to the length of the other one. Since the dates are in chronological order, dataframes of the same length also cover the same time period, provided that both start on the same date and have no missing days.

In order not to copy or write the names of the types of Covid-19 data every time, a list was made with the names of the columns of the types of Covid-19 data that were to be visualised.

Analysis:
The three graphs above visualise the differences between New Zealand and Australia with regard to these Covid-19 parameters.
It can be seen that in the beginning New Zealand and Australia had a similar number of cases and deaths per million inhabitants.
However, after July 2020, the number of cases and deaths per million rose sharply in Australia, prompting more restrictions, which resulted in a higher Stringency Index.
From October 2020 till August 2021 the number of cases and deaths per million rose a little for New Zealand and was relatively steady for Australia.
After August 2021 the number of cases and deaths per million rose sharply in Australia, resulting in a higher Stringency Index. Although the number of cases per million in New Zealand also went up, the number of deaths per million did not increase as much. This might explain why the Stringency Index only went up a little.

Comparison of stringency indices of New Zealand and Australia

A comparison was visualised of the stringency indices of New Zealand and Australia.
The total cases per million inhabitants were set as the x-values and the total deaths per million inhabitants were set as the y-values.
With this visualisation the correlation (if there is any) can be seen between stricter government policy with regards to Corona restrictions, total cases and total deaths in New Zealand and Australia.
The data of New Zealand and Australia will both be grouped on weeks starting on Monday in order to make the animation of the visualisation run better and make it more clear.

Analysis:
The above graph shows an animation of the change in total cases and deaths per million, with the Stringency Index (SI) shown as the circle, as a function of the date.
While the governments usually look at (an increase in) new cases to increase the restrictions, I think that the total cases and deaths per million might also be interesting to visualise and analyse to see if there is a correlation.
In the beginning (week starting at 2020-03-23) both countries have about the same SI.
Then, till the week starting at 2020-05-18, the SI of New Zealand was higher than that of Australia, even though both countries had a comparable number of cases and deaths per million.
Hereafter, the SI of New Zealand decreased a lot, while that of Australia first decreased a bit and then increased a lot. The number of cases and deaths per million in Australia also increased strongly.
From the week starting 2020-08-24 till the week starting at 2020-09-21, the SI of New Zealand increased after which it decreased a bit.
From the week starting at 2020-08-10 till the week starting at 2020-09-21, the SI of Australia stayed the same. And even though there was not a sharp increase of number of cases per million, there was however a very sharp increase in number of deaths per million.
Hereafter, from the week starting at 2020-09-21 till the week starting at 2021-08-16, the SIs of New Zealand and Australia increased and decreased a bit.
Then, in the week starting at 2021-08-23 the SI of New Zealand increased very sharply and became higher than that of Australia, even though Australia had a larger number of total cases and deaths per million. The increase in SI of New Zealand might have to do with a new case of the Delta variant that occurred the week before on 17 August 2021 (Wikipedia source: https://en.wikipedia.org/wiki/COVID-19_alert_levels_in_New_Zealand).
From the week starting at 2021-08-23, the SI of New Zealand stayed the same for two weeks, after which (from the week starting at 2021-09-13) it decreased a bit and stayed the same till the week starting at 2021-11-01 (the end of the data). The number of cases per million did increase during this time period, but the number of deaths per million increased only a little.
For Australia, after the week starting at 2021-08-23 till the week starting at 2021-11-01 (the end of the data), the SI stayed the same, although the number of cases per million did increase a lot, faster than that of New Zealand, and the number of deaths per million increased a bit.

Rubrics

Overall grading

drawing

Rubrics for each question in the assignment

Criteria:

  1. Consistent dataset throughout the assignment, combined dataframe
  2. Correctness, generalisability, clarity, simplicity
  3. Working code, visualization of the result, generalisability
  4. Generalisability
  5. Logic, visualization
  6. Logic, story, visuals, clarity, correctness, readability

You can obtain maximum points for the question if you have:

You can obtain bonus points if you use: