Day 74 Goals: what you will make by the end of the day
How to make time-series data comparable by resampling and converting to the same periodicity (e.g., from daily data to monthly data).
Fine-tuning the styling of Matplotlib charts by using limits, labels, linestyles, markers, colours, and the chart’s resolution.
Using grids to help visually identify seasonality in a time series.
Finding the number of missing and NaN values and how to locate NaN values in a DataFrame.
How to work with Locators to better style the time axis on a chart.
Reviewing the concepts learned in the previous three days and applying them to new datasets.
Data Exploration - Making Sense of Google Search Data
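This section is a quick review of the pandas methods from the previous chapters, using the Tesla data as an example:
df_tesla = pd.read_csv('TESLA Search Trend vs Price.csv')  # load the CSV into a DataFrame
df_tesla.shape  # (number of rows, number of columns)
df_tesla.head()  # first five rows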
import pandas as pd
import matplotlib.pyplot as plt
df_tesla = pd.read_csv('TESLA Search Trend vs Price.csv')
df_btc_search = pd.read_csv('Bitcoin Search Trend.csv')
df_btc_price = pd.read_csv('Daily Bitcoin Price.csv')
df_unemployment = pd.read_csv('UE Benefits Search vs UE Rate 2004-19.csv')
print(f'Largest value for Tesla in Web Search: {df_tesla.TSLA_WEB_SEARCH.max()}')
print(f'Smallest value for Tesla in Web Search: {df_tesla.TSLA_WEB_SEARCH.min()}')
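Besides .max() and .min(), .describe() reports several descriptive statistics in one call. The following is a minimal sketch on a small made-up DataFrame; the values are hypothetical and do not come from the CSV files above.

```python
import pandas as pd

# Hypothetical stand-in for the Tesla data; the values are made up for illustration
df_demo = pd.DataFrame({
    'TSLA_WEB_SEARCH': [2, 5, 9, 31],
    'TSLA_USD_CLOSE': [3.9, 47.2, 215.5, 498.3],
})

print(df_demo.shape)         # (number of rows, number of columns)
print(df_demo.head())        # first rows of the DataFrame
stats = df_demo.describe()   # count, mean, std, min, quartiles and max per column
print(stats)
```

Individual statistics can then be read with a label lookup such as stats.loc['max', 'TSLA_WEB_SEARCH'].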
Data Cleaning - Resampling Time Series Data
3 + 1 Challenges:
Challenge 1: Check whether any data is missing
print(f'Missing values for Tesla?: {df_tesla.isna().values.any()}')
print(f'Missing values for U/E?: {df_unemployment.isna().values.any()}')
print(f'Missing values for BTC Search?: {df_btc_search.isna().values.any()}')
print(f'Missing values for BTC price?: {df_btc_price.isna().values.any()}')
Challenge 2 & 3: If data is missing, find where it is and remove it
# To remove a missing value we can use .dropna()
# method 1
df_btc_price = df_btc_price.dropna()
# method 2: the inplace argument overwrites our DataFrame directly
df_btc_price.dropna(inplace=True)
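The three steps (detect, count and locate, remove) can be sketched end to end on a small made-up DataFrame. The column names mirror the daily price data, but the values here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily price data with one incomplete row (values are made up)
df_demo = pd.DataFrame({
    'DATE': ['2020-08-03', '2020-08-04', '2020-08-05'],
    'CLOSE': [11246.3, np.nan, 11747.0],
    'VOLUME': [2.0e10, np.nan, 2.4e10],
})

has_missing = df_demo.isna().values.any()      # True if any cell is NaN
n_missing = df_demo.isna().values.sum()        # total number of NaN cells
rows_with_nan = df_demo[df_demo.CLOSE.isna()]  # the rows where CLOSE is NaN

df_clean = df_demo.dropna()  # new DataFrame without the incomplete rows
print(n_missing)
print(rows_with_nan)
```

Filtering with a boolean mask, as in rows_with_nan, shows the index of each affected row before it is dropped.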
Challenge 4: Convert the data type from str to a Datetime object
# Before conversion, the type of each date entry is str
# Convert the date strings into Datetime objects:
df_tesla.MONTH = pd.to_datetime(df_tesla.MONTH)
df_btc_search.MONTH = pd.to_datetime(df_btc_search.MONTH)
df_unemployment.MONTH = pd.to_datetime(df_unemployment.MONTH)
df_btc_price.DATE = pd.to_datetime(df_btc_price.DATE)
# After conversion, the type is pandas._libs.tslibs.timestamps.Timestamp
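The Bitcoin chart later in these notes plots a monthly table called df_btc_monthly. One way such a table could be derived from daily data is with .resample(). The sketch below uses a made-up DataFrame; the 'MS' (month start) frequency alias and the choice of .last() are assumptions for illustration. Month-end aliases ('M' in older pandas, 'ME' from pandas 2.2 onwards) label each bin with the last day of the month instead.

```python
import pandas as pd

# Hypothetical daily closing prices covering two months (values are made up)
df_daily = pd.DataFrame({
    'DATE': pd.date_range('2020-01-01', '2020-02-29', freq='D'),
})
df_daily['CLOSE'] = range(len(df_daily))  # 0, 1, 2, ... one value per day

# Group the rows by calendar month and keep the last available price.
# 'MS' labels each monthly bin with the first day of that month.
df_monthly = df_daily.resample('MS', on='DATE').last()

# Alternative aggregation: the average price within each month
monthly_mean = df_daily.resample('MS', on='DATE')['CLOSE'].mean()

print(df_monthly)
print(monthly_mean)
```

Using .last() keeps the price at the end of each month, while .mean() smooths over the month; which one is appropriate depends on what the monthly series is compared against.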
Reference: resample frequency
# This particular day contains a daylight saving time transition
ts = pd.Timestamp("2016-10-30 00:00:00", tz="Europe/Helsinki")
# Respects absolute time
ts + pd.Timedelta(days=1)
# Out: Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki')
# Respects calendar time
ts + pd.DateOffset(days=1)
# Out: Timestamp('2016-10-31 00:00:00+0200', tz='Europe/Helsinki')
friday = pd.Timestamp("2018-01-05")
friday.day_name()
# Out: 'Friday'
# Add 2 business days (Friday --> Tuesday)
two_business_days = 2 * pd.offsets.BDay()
friday + two_business_days
# Out: Timestamp('2018-01-09 00:00:00')
(friday + two_business_days).day_name()
# Out: 'Tuesday'
Data Visualisation - Tesla Line Charts in Matplotlib
.twinx(): two y-axes sharing the same x-axis
Using Locators and DateFormatters to generate Tick Marks on a Time Line
Step 1: import matplotlib.dates
import matplotlib.dates as mdates
Step 2: Notebook Formatting & Style Helpers
We need two kinds of helper objects:
YearLocator() and MonthLocator() objects, which help Matplotlib find the years and the months.
A DateFormatter(), which lets us specify how we want to display the dates.
# Create locators for ticks on the time axis
years = mdates.YearLocator()
months = mdates.MonthLocator()
years_fmt = mdates.DateFormatter('%Y')
# Register date converters to avoid warning messages
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
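To see what the locators and the formatter do once attached to an axis, here is a minimal self-contained sketch. The date range and values are made up, and the Agg backend is chosen only so that the example runs without a display; neither is part of the original notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd

# Hypothetical monthly series spanning three years (values are made up)
dates = pd.date_range('2017-01-01', '2019-12-01', freq='MS')
values = list(range(len(dates)))

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(dates, values)

# Major ticks at the start of each year, labelled with the 4-digit year;
# minor (unlabelled) ticks at the start of each month
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
ax.xaxis.set_minor_locator(mdates.MonthLocator())

fig.canvas.draw()  # force Matplotlib to compute the tick positions and labels
major_labels = [label.get_text() for label in ax.get_xticklabels()]
print(major_labels)
```

The major locator decides where the labelled ticks go, the formatter decides what the labels say, and the minor locator adds the smaller month ticks in between.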
Step 3: Adding Locator Tick Marks
# format the ticks
# ...
# Increase the figure size (e.g., to 14 by 8).
plt.figure(figsize=(14,8), dpi=120)
# Rotate the text on the x-axis by 45 degrees.
plt.xticks(fontsize=14, rotation=45)
ax1 = plt.gca()
ax2 = ax1.twinx()
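# Format the ticks: years as labelled major ticks, months as minor ticks
ax1.xaxis.set_major_locator(years)
ax1.xaxis.set_major_formatter(years_fmt)
ax1.xaxis.set_minor_locator(months)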
# Increase the font sizes for the labels and the ticks on the x-axis to 14.
ax1.set_ylabel('TSLA Stock Price', color='#E6232E', fontsize=14)
ax2.set_ylabel('Search Trend', color='skyblue', fontsize=14)
# Add a title that reads 'Tesla Web Search vs Price'
plt.title('Tesla Web Search vs Price', fontsize=18)
# Make the lines on the chart thicker.
ax1.plot(df_tesla.MONTH, df_tesla.TSLA_USD_CLOSE, color='#E6232E', linewidth=3)
ax2.plot(df_tesla.MONTH, df_tesla.TSLA_WEB_SEARCH, color='skyblue', linewidth=3)
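# Keep the chart looking sharp by changing the dots-per-inch or DPI value.
# The dpi is set once in the plt.figure() call at the top of the cell;
# calling plt.figure() again at this point would open a second, empty figure.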
# Set minimum and maximum values for the y and x-axis.
# Hint: check out methods like set_xlim().
ax1.set_ylim([0, 600])
ax1.set_xlim([df_tesla.MONTH.min(), df_tesla.MONTH.max()])
# Finally use plt.show() to display the chart below the cell instead of relying on the automatic notebook output.
plt.show()
Data Visualisation - Bitcoin: Line Style and Markers
6 Challenges
plt.figure(figsize=(14,8), dpi=120)
# Challenge 1: Modify the chart title to read 'Bitcoin News Search vs Resampled Price'
plt.title('Bitcoin News Search vs Resampled Price', fontsize=18)
# Rotate the text on the x-axis by 45 degrees
plt.xticks(fontsize=14, rotation=45)
ax1 = plt.gca()
ax2 = ax1.twinx()
# Challenge 2: Change the y-axis label to 'BTC Price'
ax1.set_ylabel('BTC Price', color='#F08F2E', fontsize=14)
ax2.set_ylabel('Search Trend', color='skyblue', fontsize=14)
# Challenge 3: Change the y- and x-axis limits to improve the appearance
ax1.set_ylim(bottom=0, top=15000)
ax1.set_xlim([df_btc_monthly.index.min(), df_btc_monthly.index.max()])
# Challenge 4: Investigate the linestyles to make the BTC closing price a dashed line
ax1.plot(df_btc_monthly.index, df_btc_monthly.CLOSE, color='#F08F2E', linewidth=3, linestyle='--')
# Challenge 5: Investigate the marker types to make the search datapoints little circles
ax2.plot(df_btc_monthly.index, df_btc_search.BTC_NEWS_SEARCH, color='skyblue', linewidth=3, marker='o')
# Challenge 6: Were big increases in searches for Bitcoin accompanied by big increases in the price?
References:
Line styles: matplotlib.pyplot.plot, Matplotlib 3.2.1 documentation
Marker types: matplotlib.markers, Matplotlib 3.2.1 documentation
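As a quick reference for the line style and marker codes, here is a minimal sketch that draws one line per style. The data is made up, and the Agg backend is used only so that the example runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]  # made-up x values

fig, ax = plt.subplots()
solid, = ax.plot(x, [0, 1, 2, 3], linestyle='-')     # solid line (the default)
dashed, = ax.plot(x, [1, 2, 3, 4], linestyle='--')   # dashed line
dashdot, = ax.plot(x, [2, 3, 4, 5], linestyle='-.')  # alternating dashes and dots
dotted, = ax.plot(x, [3, 4, 5, 6], linestyle=':')    # dotted line
circles, = ax.plot(x, [4, 5, 6, 7], marker='o')      # circle at every data point
triangles, = ax.plot(x, [5, 6, 7, 8], marker='^')    # upward triangle at every data point
```

The linestyle and marker arguments can be combined in a single call, for example a dashed line with circle markers.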
Data Visualisation - Unemployment: How to use Grids
Focus of this section: grids
5 Challenges
plt.figure(figsize=(14,8), dpi=120)
# Challenge 1: Change the title to: Monthly Search of "Unemployment Benefits" in the U.S. vs the U/E Rate
plt.title('Monthly Search of "Unemployment Benefits" in the U.S. vs the U/E Rate', fontsize=18)
plt.xticks(fontsize=14, rotation=45)
ax1 = plt.gca()
ax2 = ax1.twinx()
# Challenge 2: Change the y-axis label to: FRED U/E Rate
ax1.set_ylabel('FRED U/E Rate', color='purple', fontsize=14)
ax2.set_ylabel('Search Trend', color='skyblue', fontsize=14)
# Challenge 3: Change the axis limits
ax1.set_ylim(bottom=3, top=10.5)
# ax1.set_xlim([df_unemployment.MONTH.min(), df_unemployment.MONTH.max()])
ax1.set_xlim([df_unemployment.MONTH[0], df_unemployment.MONTH.max()])
# Challenge 4: Add a grey grid to the chart to better see the years and the U/E rate values. Use dashed lines for the line style.
ax1.grid(color='grey', linestyle='--')
# Calculate the rolling average over a 6 month window
roll_df = df_unemployment[['UE_BENEFITS_WEB_SEARCH', 'UNRATE']].rolling(window=6).mean()
# Plot the rolling averages instead of the raw monthly values
ax1.plot(df_unemployment.MONTH, roll_df.UNRATE, color='purple', linewidth=3, linestyle='--')
ax2.plot(df_unemployment.MONTH, roll_df.UE_BENEFITS_WEB_SEARCH, color='skyblue', linewidth=3)
# Challenge 5: Can you discern any seasonality in the searches? Is there a pattern?
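To make the effect of .rolling(window=6).mean() concrete, here is a minimal sketch on a short made-up series. Each output value is the mean of the current entry and the five before it, so the first five positions have no full window and come out as NaN.

```python
import pandas as pd

# Hypothetical monthly readings (values are made up)
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Mean over a sliding window of 6 consecutive entries
rolled = s.rolling(window=6).mean()
print(rolled)
```

This is why a chart of rolling averages starts slightly later than the raw data: the first window - 1 points are undefined.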
Data Visualisation - Unemployment: The Effect of New Data
df_ue_2020 = pd.read_csv('UE Benefits Search vs UE Rate 2004-20.csv')
df_ue_2020.MONTH = pd.to_datetime(df_ue_2020.MONTH)
plt.figure(figsize=(14,8), dpi=120)
plt.xticks(fontsize=14, rotation=45)
plt.title('Monthly US "Unemployment Benefits" Web Search vs UNRATE incl 2020', fontsize=18)
ax1 = plt.gca()
ax2 = ax1.twinx()
ax1.set_ylabel('FRED U/E Rate', color='purple', fontsize=16)
ax2.set_ylabel('Search Trend', color='skyblue', fontsize=16)
ax1.set_xlim([df_ue_2020.MONTH.min(), df_ue_2020.MONTH.max()])
ax1.plot(df_ue_2020.MONTH, df_ue_2020.UNRATE, 'purple', linewidth=3)
ax2.plot(df_ue_2020.MONTH, df_ue_2020.UE_BENEFITS_WEB_SEARCH, 'skyblue', linewidth=3)
Learning Points & Summary
How to use .describe() to quickly see some descriptive statistics at a glance.
How to use .resample() to make a time-series data comparable to another by changing the periodicity.
How to work with matplotlib.dates Locators to better style a timeline (e.g., an axis on a chart).
How to find the number of NaN values with .isna().values.sum().
How to change the resolution of a chart using the figure's dpi.
How to create dashed '--', dash-dot '-.' and dotted ':' lines using linestyles.
How to use different kinds of markers (e.g., 'o' or '^') on charts.
Fine-tuning the styling of Matplotlib charts by using limits, labels, linewidth and colours (both in the form of named colours and HEX codes).
Using .grid() to help visually identify seasonality in a time series.