Day 74 - Advanced - Google Trends Data: Resampling and Visualising Time Series

本章主要練習:

提示:下方黑色向右箭頭 :arrow_forward: 可以點擊看更多

74-622

Day 74 Goals: what you will make by the end of the day

  • How to make time-series data comparable by resampling and converting to the same periodicity (e.g., from daily data to monthly data).

  • Fine-tuning the styling of Matplotlib charts by using limits, labels, linestyles, markers, colours, and the chart’s resolution.

  • Using grids to help visually identify seasonality in a time series.

  • Finding the number of missing and NaN values and how to locate NaN values in a DataFrame.

  • How to work with Locators to better style the time axis on a chart

  • Review the concepts learned in the previous three days and apply them to new datasets

74-623

Data Exploration - Making Sense of Google Search Data

本節目的在讓學生快速複習前面幾章的 pandas method。以下以 tesla 資料為例:

  • read_csv()
    df_tesla = pd.read_csv(‘TESLA Search Trend vs Price.csv’)

  • shape
    df_tesla.shape

  • head()
    df_tesla.head()

  • describe()
    df_tesla.describe()

SAMPLE_CODE
import pandas as pd
import matplotlib.pyplot as plt

df_tesla = pd.read_csv('TESLA Search Trend vs Price.csv')

df_btc_search = pd.read_csv('Bitcoin Search Trend.csv')
df_btc_price = pd.read_csv('Daily Bitcoin Price.csv')

df_unemployment = pd.read_csv('UE Benefits Search vs UE Rate 2004-19.csv')

print(df_tesla.shape)
df_tesla.head()

print(f'Largest value for Tesla in Web Search: {df_tesla.TSLA_WEB_SEARCH.max()}')
print(f'Smallest value for Tesla in Web Search: {df_tesla.TSLA_WEB_SEARCH.min()}')

df_tesla.describe()

74-624

Data Cleaning - Resampling Time Series Data

3 + 1 Challenges:

Challenge 1: 檢查資料是否齊全

df_tesla.isna().values.any()

print(f'Missing values for Tesla?: {df_tesla.isna().values.any()}')
print(f'Missing values for U/E?: {df_unemployment.isna().values.any()}')
print(f'Missing values for BTC Search?: {df_btc_search.isna().values.any()}')
print(f'Missing values for BTC price?: {df_btc_price.isna().values.any()}')
Challenge 2 & 3: 如果資料有缺,找出所在位置並刪除

.dropna()

df_btc_price[df_btc_price.CLOSE.isna()]
# To remove a missing value we can use .dropna()
# method 1
df_btc_price = df_btc_price.dropna()
# method 2: The inplace argument allows to overwrite our DataFrame
df_btc_price.dropna(inplace=True)
Challenge 4: 將資料型態由 str 改為 Datetime object

.to_datetime()

# type: str
type(df_tesla['MONTH'][1])

# Convert strings into objects: 
df_tesla.MONTH = pd.to_datetime(df_tesla.MONTH)
df_btc_search.MONTH = pd.to_datetime(df_btc_search.MONTH)
df_unemployment.MONTH = pd.to_datetime(df_unemployment.MONTH)

# type: pandas._libs.tslibs.timestamps.Timestamp
type(df_tesla['MONTH'][1])

type(df_btc_price['DATE'][1])
df_btc_price.DATE = pd.to_datetime(df_btc_price.DATE)
參考資料:resample frequency
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects

# This particular day contains a day light savings time transition
ts = pd.Timestamp("2016-10-30 00:00:00", tz="Europe/Helsinki")

# Respects absolute time
ts + pd.Timedelta(days=1)
Out[145]: Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki')

# Respects calendar time
ts + pd.DateOffset(days=1)
Out[146]: Timestamp('2016-10-31 00:00:00+0200', tz='Europe/Helsinki')

friday = pd.Timestamp("2018-01-05")

friday.day_name()
Out[148]: 'Friday'

# Add 2 business days (Friday --> Tuesday)
two_business_days = 2 * pd.offsets.BDay()

two_business_days.apply(friday)
Out[150]: Timestamp('2018-01-09 00:00:00')

friday + two_business_days
Out[151]: Timestamp('2018-01-09 00:00:00')

(friday + two_business_days).day_name()
Out[152]: 'Tuesday'

74-625

Data Visualisation - Tesla Line Charts in Matplotlib

.twinx() 兩條縱軸

SAMPLE_CODE
# Increase the figure size (e.g., to 14 by 8).
plt.figure(figsize=(14,8), dpi=120) 

# Rotate the text on the x-axis by 45 degrees.
plt.xticks(fontsize=14, rotation=45)

ax1 = plt.gca()
ax2 = ax1.twinx()

# Increase the font sizes for the labels and the ticks on the x-axis to 14.
ax1.set_ylabel('TSLA Stock Price', color='#E6232E', fontsize=14)
ax2.set_ylabel('Search Trend', color='skyblue', fontsize=14)

# Rotate the text on the x-axis by 45 degrees.
# plt.xticks(fontsize=14, rotation=45)

# Add a title that reads 'Tesla Web Search vs Price'
plt.title('Tesla Web Search vs Price', fontsize=18)

# Make the lines on the chart thicker.
ax1.plot(df_tesla.MONTH, df_tesla.TSLA_USD_CLOSE, color='#E6232E', linewidth=3)
ax2.plot(df_tesla.MONTH, df_tesla.TSLA_WEB_SEARCH, color='skyblue', linewidth=3)

# Keep the chart looking sharp by changing the dots-per-inch or DPI value.
plt.figure(figsize=(14,8), dpi=120) 

# Set minimum and maximum values for the y and x-axis. 
# Hint: check out methods like set_xlim().
ax1.set_ylim([0, 600])
ax1.set_xlim([df_tesla.MONTH.min(), df_tesla.MONTH.max()])

# Finally use plt.show() to display the chart below the cell instead of relying on the automatic notebook output.
plt.show()

74-626

Using Locators and DateFormatters to generate Tick Marks on a Time Line

三步驟

步驟一:import matplotlib.dates

import matplotlib.dates as mdates

步驟二:Notebook Formatting & Style Helpers
a YearLocator() and a MonthLocator() objects, which will help Matplotlib find the years and the months.
DateFormatter(), which will help us specify how we want to display the dates.

# Create locators for ticks on the time axis
years = mdates.YearLocator()
months = mdates.MonthLocator()
years_fmt = mdates.DateFormatter('%Y')

# Register date converters to avoid warning messages
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

步驟三:Adding Locator Tick Marks

# format the ticks
ax1.xaxis.set_major_locator(years)
ax1.xaxis.set_major_formatter(years_fmt)
ax1.xaxis.set_minor_locator(months)
SAMPLE_CODE
# ...
import matplotlib.dates as mdates

# Create locators for ticks on the time axis
years = mdates.YearLocator()
months = mdates.MonthLocator()
years_fmt = mdates.DateFormatter('%Y')

# Register date converters to avoid warning messages
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

# ...

# Increase the figure size (e.g., to 14 by 8).
plt.figure(figsize=(14,8), dpi=120) 

# Rotate the text on the x-axis by 45 degrees.
plt.xticks(fontsize=14, rotation=45)

ax1 = plt.gca()
ax2 = ax1.twinx()

# ■ ■ format the ticks
ax1.xaxis.set_major_locator(years)
ax1.xaxis.set_major_formatter(years_fmt)
ax1.xaxis.set_minor_locator(months)

# Increase the font sizes for the labels and the ticks on the x-axis to 14.
ax1.set_ylabel('TSLA Stock Price', color='#E6232E', fontsize=14)
ax2.set_ylabel('Search Trend', color='skyblue', fontsize=14)

# Rotate the text on the x-axis by 45 degrees.
# plt.xticks(fontsize=14, rotation=45)

# Add a title that reads 'Tesla Web Search vs Price'
plt.title('Tesla Web Search vs Price', fontsize=18)

# Make the lines on the chart thicker.
ax1.plot(df_tesla.MONTH, df_tesla.TSLA_USD_CLOSE, color='#E6232E', linewidth=3)
ax2.plot(df_tesla.MONTH, df_tesla.TSLA_WEB_SEARCH, color='skyblue', linewidth=3)

# Keep the chart looking sharp by changing the dots-per-inch or DPI value.
plt.figure(figsize=(14,8), dpi=120) 

# Set minimum and maximum values for the y and x-axis. 
# Hint: check out methods like set_xlim().
ax1.set_ylim([0, 600])
ax1.set_xlim([df_tesla.MONTH.min(), df_tesla.MONTH.max()])

# Finally use plt.show() to display the chart below the cell instead of relying on the automatic notebook output.
plt.show()

74-627

Data Visualisation - Bitcoin: Line Style and Markers

6 Challenges
plt.figure(figsize=(14,8), dpi=120)

# Challenge 1: Modify the chart title to read 'Bitcoin News Search vs Resampled Price'

plt.title('Bitcoin News Search vs Resampled Price', fontsize=18)

# Rotate the text on the x-axis by 45 degrees

plt.xticks(fontsize=14, rotation=45)

ax1 = plt.gca()

ax2 = ax1.twinx()

# Challenge 2: Change the y-axis label to 'BTC Price'

ax1.set_ylabel('BTC Price', color='#F08F2E', fontsize=14)

ax2.set_ylabel('Search Trend', color='skyblue', fontsize=14)

ax1.xaxis.set_major_locator(years)

ax1.xaxis.set_major_formatter(years_fmt)

ax1.xaxis.set_minor_locator(months)

# Challenge 3: Change the y- and x-axis limits to improve the appearance

ax1.set_ylim(bottom=0, top=15000)

ax1.set_xlim([df_btc_monthly.index.min(), df_btc_monthly.index.max()])

# Challenge 4: Investigate the linestyles to make the BTC closing price a dashed line

ax1.plot(df_btc_monthly.index, df_btc_monthly.CLOSE,  color='#F08F2E', linewidth=3, linestyle='--')

# Challenge 5: Investigate the marker types to make the search datapoints little circles

ax2.plot(df_btc_monthly.index, df_btc_search.BTC_NEWS_SEARCH, color='skyblue', linewidth=3, marker='o')

plt.show()

# Challenge 6: Were big increases in searches for Bitcoin accompanied by big increases in the price?

參考資料

linestyles
matplotlib.pyplot.plot — Matplotlib 3.2.1 documentation

marker types
matplotlib.markers — Matplotlib 3.2.1 documentation

74-628

Data Visualisation - Unemployment: How to use Grids

本節重點:Grids 格線

5 Challenges
plt.figure(figsize=(14,8), dpi=120)

# Challenge 1: Change the title to: Monthly Search of "Unemployment Benefits" in the U.S. vs the U/E Rate

plt.title('Monthly Search of "Unemployment Benefits" in the U.S. vs the U/E Rate', fontsize=18)

plt.yticks(fontsize=14)

plt.xticks(fontsize=14, rotation=45)

ax1 = plt.gca()

ax2 = ax1.twinx()

# Challenge 2: Change the y-axis label to: FRED U/E Rate

ax1.set_ylabel('FRED U/E Rate', color='purple', fontsize=14)

ax2.set_ylabel('Search Trend', color='skyblue', fontsize=14)

ax1.xaxis.set_major_locator(years)

ax1.xaxis.set_major_formatter(years_fmt)

ax1.xaxis.set_minor_locator(months)

# Challenge 3: Change the axis limits

ax1.set_ylim(bottom=3, top=10.5)

ax1.set_xlim([df_unemployment.MONTH.min(), df_unemployment.MONTH.max()])

# Challenge 4: Add a grey grid to the chart to better see the years and the U/E rate values. Use dashed lines for the line style.

ax1.grid(color='grey', linestyle='--')

# Show the grid lines as dark grey lines

ax1.grid(color='grey', linestyle='--')

  

# Change the dataset used

ax1.plot(df_unemployment.MONTH, df_unemployment.UNRATE, color='purple', linewidth=3, linestyle='--')

ax2.plot(df_unemployment.MONTH, df_unemployment.UE_BENEFITS_WEB_SEARCH, color='skyblue', linewidth=3)

  

plt.show()

# Challenge 5: Can you discern any seasonality in the searches? Is there a pattern?

參考資料:grid
https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.grid.html

更改取樣頻率
plt.figure(figsize=(14,8), dpi=120)

# Challenge 1: Change the title to: Monthly Search of "Unemployment Benefits" in the U.S. vs the U/E Rate
plt.title('Monthly Search of "Unemployment Benefits" in the U.S. vs the U/E Rate', fontsize=18)

plt.yticks(fontsize=14)
plt.xticks(fontsize=14, rotation=45)

ax1 = plt.gca()
ax2 = ax1.twinx()

ax1.xaxis.set_major_locator(years)
ax1.xaxis.set_major_formatter(years_fmt)
ax1.xaxis.set_minor_locator(months)

# Challenge 2: Change the y-axis label to: FRED U/E Rate
ax1.set_ylabel('FRED U/E Rate', color='purple', fontsize=14)
ax2.set_ylabel('Search Trend', color='skyblue', fontsize=14)

# Challenge 3: Change the axis limits
ax1.set_ylim(bottom=3, top=10.5)
# ax1.set_xlim([df_unemployment.MONTH.min(), df_unemployment.MONTH.max()])
ax1.set_xlim([df_unemployment.MONTH[0], df_unemployment.MONTH.max()])

# Challenge 4: Add a grey grid to the chart to better see the years and the U/E rate values. Use dashed lines for the line style.
ax1.grid(color='grey', linestyle='--')

# Show the grid lines as dark grey lines
ax1.grid(color='grey', linestyle='--')
  
# ■ ■ Calculate the rolling average over a 6 month window
roll_df = df_unemployment[['UE_BENEFITS_WEB_SEARCH', 'UNRATE']].rolling(window=6).mean()

# ■ ■ Change the dataset used
ax1.plot(df_unemployment.MONTH, roll_df.UNRATE, color='purple', linewidth=3, linestyle='--')
ax2.plot(df_unemployment.MONTH, roll_df.UE_BENEFITS_WEB_SEARCH, color='skyblue', linewidth=3)
  
plt.show()

# Challenge 5: Can you discern any seasonality in the searches? Is there a pattern?

74-629

Data Visualisation - Unemployment: The Effect of New Data

SAMPLE_CODE
df_ue_2020 = pd.read_csv('UE Benefits Search vs UE Rate 2004-20.csv')
df_ue_2020.MONTH = pd.to_datetime(df_ue_2020.MONTH)

plt.figure(figsize=(14,8), dpi=120)
plt.yticks(fontsize=14)
plt.xticks(fontsize=14, rotation=45)
plt.title('Monthly US "Unemployment Benefits" Web Search vs UNRATE incl 2020', fontsize=18)
 
ax1 = plt.gca()
ax2 = ax1.twinx()
 
ax1.set_ylabel('FRED U/E Rate', color='purple', fontsize=16)
ax2.set_ylabel('Search Trend', color='skyblue', fontsize=16)
 
ax1.set_xlim([df_ue_2020.MONTH.min(), df_ue_2020.MONTH.max()])
 
ax1.plot(df_ue_2020.MONTH, df_ue_2020.UNRATE, 'purple', linewidth=3)
ax2.plot(df_ue_2020.MONTH, df_ue_2020.UE_BENEFITS_WEB_SEARCH, 'skyblue', linewidth=3)
 
plt.show()

74-630

Learning Points & Summary

  • How to use .describe() to quickly see some descriptive statistics at a glance.

  • How to use .resample() to make a time-series data comparable to another by changing the periodicity.

  • How to work with matplotlib.dates Locators to better style a timeline (e.g., an axis on a chart).

  • How to find the number of NaN values with .isna().values.sum()

  • How to change the resolution of a chart using the figure’s dpi

  • How to create dashed ‘–’ and dotted ‘-.’ lines using linestyles

  • How to use different kinds of markers (e.g., ‘o’ or ‘^’) on charts.

  • Fine-tuning the styling of Matplotlib charts by using limits, labels, linewidth and colours (both in the form of named colours and HEX codes).

  • Using .grid() to help visually identify seasonality in a time series.