Day 80 - Advanced - Capstone Project - Predict House Prices


Day 80 Goals: what you will make by the end of the day

Boston real estate development in the 1970s

Build a model that provides price estimates based on features such as:

• the number of rooms

• the distance to employment centres

• how rich or poor the area is

• how many students there are per teacher at local schools, etc.

import train_test_split()

In machine learning, the raw data is usually split into a "training set" and a "test set" in a given proportion, by calling the train_test_split function from sklearn.model_selection.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_data,             # the feature matrix to split
    train_target,           # the target values to split
    test_size=0.4,          # proportion of samples to hold out (an integer means an absolute count)
    random_state=0,         # random seed
    stratify=train_target)  # optional: keep class proportions equal in both splits

Reference: Deep Learning | What each parameter of sklearn's train_test_split() means

Understand the Boston House Price Dataset

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive. The Median Value (attribute 14) is the target.

:Attribute Information (in order):

1. CRIM     per capita crime rate by town
2. ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS    proportion of non-retail business acres per town
4. CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX      nitric oxides concentration (parts per 10 million)
6. RM       average number of rooms per dwelling
7. AGE      proportion of owner-occupied units built prior to 1940
8. DIS      weighted distances to five Boston employment centres
9. RAD      index of accessibility to radial highways
10. TAX      full-value property-tax rate per $10,000
11. PTRATIO  pupil-teacher ratio by town
12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT    % lower status of the population
14. PRICE    Median value of owner-occupied homes in $1000's

Reference: Boston house price evaluation

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.
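
The rest of the notebook assumes the usual imports and that the dataset has already been loaded into a pandas DataFrame called data. A minimal sketch, assuming the data ships as a boston.csv file with the 14 columns above (the filename is an assumption, not part of the original lesson):

import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the 506 rows x 14 columns into a DataFrame named `data`
data = pd.read_csv('boston.csv')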


PART I: Preliminary Data Exploration


Challenge

• What is the shape of data?
• How many rows and columns does it have?
    data.shape
    (506, 14)
    data.describe()
    data.head()
    data.tail()
    
• What are the column names?
    data.columns
    
    Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT', 'PRICE'],
      dtype='object')

• Are there any NaN values or duplicates?
    print(f'Any NaN values? {data.isna().values.any()}')
    Any NaN values? False
    print(f'Any duplicated data? {data.duplicated().values.any()}')
    Any duplicated data? False

PART II: Data Cleaning

Check for Missing Values and Duplicates
This was already covered by the last checks in the code above.
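
Since there are no missing values or duplicates, nothing needs to be cleaned here. If the checks above had come back True, a typical pandas clean-up sketch would be:

# Only needed if NaNs or duplicates had actually been found
data = data.dropna()            # drop rows with missing values
data = data.drop_duplicates()   # drop exact duplicate rows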


PART III: Descriptive Statistics


Challenge

• How many students are there per teacher on average?
    data.describe()
   11. PTRATIO  pupil-teacher ratio by town
                on average 18.46 students per teacher
• What is the average price of a home in the dataset?
   14. PRICE    Median value of owner-occupied homes in $1000's
                22.53 => about $22,530 (roughly NT$676,000)
• What is the CHAS feature?
    4. CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
                i.e., is the home next to the Charles River? => next to it: 1; not: 0
• What are the minimum and the maximum value of the CHAS and why?
    4. CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
                because it is a 0/1 dummy variable:
                min: 0.00
                max: 1.00
• What is the maximum and the minimum number of rooms per dwelling in the dataset?
    6. RM       average number of rooms per dwelling
                max: 8.78
                min: 3.56
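
All of these answers can be read off data.describe(); a sketch of pulling the individual values directly with pandas:

data['PTRATIO'].mean()                    # ~18.46 students per teacher on average
data['PRICE'].mean()                      # ~22.53, i.e., roughly $22,530
data['CHAS'].min(), data['CHAS'].max()    # 0.0 and 1.0 (dummy variable)
data['RM'].min(), data['RM'].max()        # ~3.56 and ~8.78 rooms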

PART IV: Visualise the Features

Challenge
Having looked at some descriptive statistics, visualise the data for your model. Use Seaborn’s .displot() to create a bar chart and superimpose the Kernel Density Estimate (KDE) for the following variables:

• PRICE: The home price in thousands. (bar chart)
   14. PRICE    Median value of owner-occupied homes in $1000's
    sns.displot(data['PRICE'], 
                bins=50, 
                aspect=2,
                kde=True)

• RM: the average number of rooms per owner unit. (bar chart)
    6. RM       average number of rooms per dwelling
    sns.displot(data.RM, 
                aspect=2,
                kde=True)

• DIS: the weighted distance to the 5 Boston employment centres i.e., the estimated length of the commute.
    8. DIS      weighted distances to five Boston employment centres
    sns.displot(data.DIS, 
                bins=50, 
                aspect=2,
                kde=True)

• RAD: the index of accessibility to highways.
    9. RAD      index of accessibility to radial highways
    sns.displot(data.RAD, 
                bins=50, 
                aspect=2,
                kde=True)

Next to the River?

Challenge
Create a bar chart with plotly for CHAS to show how many more homes are away from the river versus next to it.
You can make your life easier by providing a list of values for the x-axis (e.g., x=['No', 'Yes']).

river_access = data['CHAS'].value_counts()

bar = px.bar(x=['No', 'Yes'],
			 y=river_access.values,
			 color=river_access.values,
			 color_continuous_scale=px.colors.sequential.haline,
			 title='Next to Charles River?')

bar.update_layout(xaxis_title='Property Located Next to the River?', 
				  yaxis_title='Number of Homes',
				  coloraxis_showscale=False)
bar.show()

PART V: Understand the Relationships in the Data

Run a Pair Plot

A few different ways of presenting the relationships in the data graphically.

seaborn.pairplot official documentation

seaborn.pairplot(data, *, hue=None, hue_order=None, palette=None, vars=None, x_vars=None, y_vars=None, kind='scatter', diag_kind='auto', markers=None, height=2.5, aspect=1, corner=False, dropna=False, plot_kws=None, diag_kws=None, grid_kws=None, size=None)

There might be some relationships in the data that we should know about. Before you run the code, make some predictions:

• What would you expect the relationship to be between pollution (NOX) and the distance to employment (DIS)?
    5. NOX      nitric oxides concentration (parts per 10 million)
    8. DIS      weighted distances to five Boston employment centres

• What kind of relationship do you expect between the number of rooms (RM) and the home value (PRICE)?
    6. RM       average number of rooms per dwelling
   14. PRICE    Median value of owner-occupied homes in $1000's

• What about the amount of poverty in an area (LSTAT) and home prices?
   13. LSTAT    % lower status of the population
   14. PRICE    Median value of owner-occupied homes in $1000's

Run a Seaborn .pairplot() to visualise all the relationships at the same time. Note, this is a big task and can take 1-2 minutes! After it’s finished check your intuition regarding the questions above on the pairplot.
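
A minimal sketch of that call (plotting all 14 columns against each other in one grid is what makes it slow):

sns.pairplot(data)
plt.show()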

Challenge

seaborn.jointplot official documentation

seaborn.jointplot(*, x=None, y=None, data=None, kind='scatter', color=None, height=6, ratio=5, space=0.2, dropna=False, xlim=None, ylim=None, marginal_ticks=False, joint_kws=None, marginal_kws=None, hue=None, palette=None, hue_order=None, hue_norm=None, **kwargs)

Use Seaborn’s .jointplot() to look at some of the relationships in more detail. Create a jointplot for:

• DIS and NOX
    8. DIS      weighted distances to five Boston employment centres
    5. NOX      nitric oxides concentration (parts per 10 million)
    with sns.axes_style('darkgrid'):
      sns.jointplot(x=data['DIS'], 
                    y=data['NOX'], 
                    height=8, 
                    kind='scatter',
                    color='deeppink', 
                    joint_kws={'alpha':0.5})

    plt.show()

• INDUS vs NOX
    3. INDUS    proportion of non-retail business acres per town
    5. NOX      nitric oxides concentration (parts per 10 million)
    with sns.axes_style('darkgrid'):
      sns.jointplot(x=data.NOX, 
                    y=data.INDUS, 
                    # kind='hex', 
                    height=7, 
                    color='darkgreen',
                    joint_kws={'alpha':0.5})
    plt.show()

• LSTAT vs RM
   13. LSTAT    % lower status of the population
    6. RM       average number of rooms per dwelling
    with sns.axes_style('darkgrid'):
      sns.jointplot(x=data['LSTAT'], 
                    y=data['RM'], 
                    # kind='hex', 
                    height=7, 
                    color='orange',
                    joint_kws={'alpha':0.5})
    plt.show()

• LSTAT vs PRICE
   13. LSTAT    % lower status of the population
   14. PRICE    Median value of owner-occupied homes in $1000's
    with sns.axes_style('darkgrid'):
      sns.jointplot(x=data.LSTAT, 
                    y=data.PRICE, 
                    # kind='hex', 
                    height=7, 
                    color='crimson',
                    joint_kws={'alpha':0.5})
    plt.show()


• RM vs PRICE
    6. RM       average number of rooms per dwelling
   14. PRICE    Median value of owner-occupied homes in $1000's
    with sns.axes_style('whitegrid'):
      sns.jointplot(x=data.RM, 
                    y=data.PRICE, 
                    height=7, 
                    color='darkblue',
                    joint_kws={'alpha':0.5})
    plt.show()

Try adding some opacity or alpha to the scatter plots using keyword arguments under joint_kws.


PART VI: Split Training & Test Dataset

We can’t use all 506 entries in our dataset to train our model. The reason is that we want to evaluate our model on data that it hasn’t seen yet (i.e., out-of-sample data). That way we can get a better idea of its performance in the real world.

sklearn.model_selection.train_test_split documentation

In machine learning, the raw data is usually split into a "training set" and a "test set" in a given proportion, by calling the train_test_split function from sklearn.model_selection.

The random seed is essentially an ID for a particular sequence of random numbers: when an experiment needs to be repeatable, it guarantees you get the same random numbers every time.
For example, if you always pass random_state=1 (with the other arguments unchanged), you will get exactly the same split every time.
Any fixed integer (including 0) is reproducible in the same way; leaving random_state unset (None) gives a different split on every run.
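
A quick toy check of this behaviour (not part of the lesson): split the same array twice with the same seed and the two results match; with random_state=None they generally will not.

import numpy as np
from sklearn.model_selection import train_test_split

a_train1, a_test1 = train_test_split(np.arange(10), test_size=0.3, random_state=0)
a_train2, a_test2 = train_test_split(np.arange(10), test_size=0.3, random_state=0)
print((a_train1 == a_train2).all())   # True: same seed, same split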

Challenge

A quick review of the parameters covered above:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_data,             # the feature matrix to split
    train_target,           # the target values to split
    test_size=0.4,          # proportion of samples to hold out (an integer means an absolute count)
    random_state=0,         # random seed
    stratify=train_target)  # optional: keep class proportions equal in both splits

• Import the train_test_split() function from sklearn

• Create 4 subsets: X_train, X_test, y_train, y_test

• Split the training and testing data roughly 80/20.

• To get the same random split every time you run your notebook use random_state=10. This helps us get the same results every time and avoid confusion while we’re learning.

Hint: Remember, your target is your home PRICE, and your features are all the other columns you’ll use to predict the price.

target = data['PRICE']
features = data.drop('PRICE', axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    features,         # the feature matrix to split
    target,           # the target values to split
    test_size=0.2,    # hold out 20% of the samples for testing
    random_state=10)  # random seed for reproducibility

SAMPLE_CODE

# % of training set
train_pct = 100*len(X_train)/len(features)
print(f'Training data is {train_pct:.3}% of the total data.')

# % of test data set
test_pct = 100*X_test.shape[0]/features.shape[0]
print(f'Test data makes up the remaining {test_pct:0.3}%.')

OUTPUT

Training data is 79.8% of the total data.
Test data makes up the remaining 20.2%.

PART VII: Multivariable Regression

In a previous lesson, we had a linear model with only a single feature (our movie budgets).
This time we have a total of 13 features. Therefore, our Linear Regression model will have the following form:

PRICE-hat = θ0 + θ1RM + θ2NOX + θ3DIS + θ4CHAS + … + θ13LSTAT

Run Your First Regression

Challenge
Use sklearn to run the regression on the training dataset.
How high is the r-squared for the regression on the training data?

SAMPLE_CODE

regr = LinearRegression()
regr.fit(X_train, y_train)
rsquared = regr.score(X_train, y_train)
print(f'Training data r-squared: {rsquared:.2}')

OUTPUT

Training data r-squared: 0.75

A higher R-Squared value is generally better; see this article for details:

R-Squared - Definition, Interpretation, and How to Calculate
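
For reference, r-squared compares the residual sum of squares with the total variation of the target: R² = 1 - SS_res / SS_tot, where SS_res = Σ(yᵢ - ŷᵢ)² and SS_tot = Σ(yᵢ - ȳ)². A value of 0.75 therefore means the model explains roughly 75% of the variance in the training-set prices.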

Evaluate the Coefficients of the Model
Here we do a sense check on our regression coefficients. The first thing to look for is if the coefficients have the expected sign (positive or negative).

Challenge
Print out the coefficients (the thetas in the equation above) for the features.
Hint: You’ll see a nice table if you stick the coefficients in a DataFrame.

• We already saw that RM on its own had a positive relation to PRICE based on the scatter plot. 
  Is RM's coefficient also positive?
    Positive correlation: the more rooms, the higher the price. The RM coefficient is 3.11.
    
• What is the sign on the LSTAT coefficient? Does it match your intuition and the scatter plot above?
    Negative correlation: the higher the proportion of lower-status residents, the lower the price. The LSTAT coefficient is -0.58.

• Check the other coefficients. Do they have the expected sign?

• Based on the coefficients, how much more expensive is a room with 6 rooms compared to a room with 5 rooms? According to the model, what is the premium you would have to pay for an extra room?
    The RM coefficient is 3.11, so the price rises by roughly 3.11 * $1,000. (Note: 3.11 is a rounded display value, so multiplying the exact coefficient by 1,000 gives a slightly different figure.)
    premium = regr_coef.loc['RM'].values[0] * 1000

SAMPLE_CODE

regr_coef = pd.DataFrame(data=regr.coef_, index=X_train.columns, columns=['Coefficient'])
regr_coef

OUTPUT

		Coefficient
CRIM	-0.13
ZN	    0.06
INDUS	-0.01
CHAS	1.97
NOX	    -16.27
RM	    3.11
AGE	    0.02
DIS	    -1.48
RAD	    0.30
TAX	    -0.01
PTRATIO	-0.82
B	    0.01
LSTAT	-0.58

SAMPLE_CODE

# Premium for having an extra room
premium = regr_coef.loc['RM'].values[0] * 1000  # i.e., ~3.11 * 1000
print(f'The price premium for having an extra room is ${premium:.5}')

OUTPUT

The price premium for having an extra room is $3108.5

Analyse the Estimated Values & Regression Residuals
The next step is to evaluate our regression. How good our regression is depends not only on the r-squared; it also depends on the residuals - the differences between the model's predictions (ŷᵢ) and the true values (yᵢ) inside y_train.

predicted_values = regr.predict(X_train)
residuals = (y_train - predicted_values)

Challenge

• The first plot should be actual values (y_train) against the predicted values:
The cyan line in the middle shows y_train against y_train. If the predictions had been 100% accurate then all the dots would be on this line. The further away the dots are from the line, the worse the prediction was. That makes the distance to the cyan line, you guessed it, our residuals.

SAMPLE_CODE

predicted_vals = regr.predict(X_train)
residuals = (y_train - predicted_vals)

# Original Regression of Actual vs. Predicted Prices
plt.figure(dpi=100)
plt.scatter(x=y_train, y=predicted_vals, c='indigo', alpha=0.6)
plt.plot(y_train, y_train, color='cyan')
plt.title(f'Actual vs Predicted Prices: $y _i$ vs $\hat y_i$', fontsize=17)
plt.xlabel('Actual prices 000s $y _i$', fontsize=14)
plt.ylabel('Predicted prices 000s $\hat y _i$', fontsize=14)
plt.show()

• The second plot should be the residuals against the predicted prices. Here’s what we’re looking for:

SAMPLE_CODE

# Residuals vs Predicted values
plt.figure(dpi=100)
plt.scatter(x=predicted_vals, y=residuals, c='indigo', alpha=0.6)
plt.title('Residuals vs Predicted Values', fontsize=17)
plt.xlabel('Predicted Prices $\hat y _i$', fontsize=14)
plt.ylabel('Residuals', fontsize=14)
plt.show()

Challenge
• Calculate the mean and the skewness of the residuals.

• Again, use Seaborn’s .displot() to create a histogram and superimpose the Kernel Density Estimate (KDE)

• Is the skewness different from zero? If so, by how much?

• Is the mean different from zero?

SAMPLE_CODE

# Residual Distribution Chart
resid_mean = round(residuals.mean(), 2)
resid_skew = round(residuals.skew(), 2)

sns.displot(residuals, kde=True, color='indigo')
plt.title(f'Residuals Skew ({resid_skew}) Mean ({resid_mean})')
plt.show()

We see that the residuals have a skewness of 1.46. There could be some room for improvement here.

Data Transformations for a Better Fit
We have two options at this point:

  1. Change our model entirely. Perhaps a linear model is not appropriate.
  2. Transform our data to make it fit better with our linear model.

Let’s try a data transformation approach.

Challenge
Investigate if the target data['PRICE'] could be a suitable candidate for a log transformation.

• Use Seaborn’s .displot() to show a histogram and KDE of the price data.

• Calculate the skew of that distribution.

tgt_skew = data['PRICE'].skew()
sns.displot(data['PRICE'], kde=True, color='green')
plt.title(f'Normal Prices. Skew is {tgt_skew:.3}')
plt.show()

Result: skew = 1.11

• Use NumPy’s log() function to create a Series that has the log prices
• Plot the log prices using Seaborn’s .displot() and calculate the skew.

y_log = np.log(data['PRICE'])
sns.displot(y_log, kde=True)
plt.title(f'Log Prices. Skew is {y_log.skew():.3}')
plt.show()

• Which distribution has a skew that’s closer to zero?

Result: skew = -0.33

The log prices have a skew that’s closer to zero. This makes them a good candidate for use in our linear model. Perhaps using log prices will improve our regression’s r-squared and our model’s residuals.

How does the log transformation work?
Using a log transformation does not affect every price equally. Large prices are affected more than smaller prices in the dataset.

We can see this when we plot the actual prices against the (transformed) log prices.
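
A quick numeric illustration (natural log, as computed by np.log): log(10) ≈ 2.30 while log(50) ≈ 3.91, so a 5x gap in price shrinks to a gap of only about 1.61 (= log 5) on the log scale - large prices get pulled in far more than small ones.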

SAMPLE_CODE

plt.figure(dpi=150)
plt.scatter(data.PRICE, np.log(data.PRICE))  # y-axis: the log prices

plt.title('Mapping the Original Price to a Log Price')
plt.ylabel('Log Price')
plt.xlabel('Actual $ Price in 000s')
plt.show()

PART VIII: Regression using Log Prices

Using log prices instead, our model has changed to:
log(PRICE-hat) = θ0 + θ1RM + θ2NOX + θ3DIS + θ4CHAS + … + θ13LSTAT

Challenge

• Use train_test_split() with the same random state as before to make the results comparable.

• Run a second regression, but this time use the transformed target data.

• What is the r-squared of the regression on the training data?

• Have we improved the fit of our model compared to before based on this measure?

SAMPLE_CODE

new_target = np.log(data['PRICE']) # Use log prices
features = data.drop('PRICE', axis=1)

X_train, X_test, log_y_train, log_y_test = train_test_split(
    features,         # the feature matrix to split
    new_target,       # the log-transformed target to split
    test_size=0.2,    # hold out 20% of the samples for testing
    random_state=10)  # same seed as before, so the split is identical

log_regr = LinearRegression()
log_regr.fit(X_train, log_y_train)
log_rsquared = log_regr.score(X_train, log_y_train)

log_predictions = log_regr.predict(X_train)
log_residuals = (log_y_train - log_predictions)

print(f'Training data r-squared: {log_rsquared:.2}')

OUTPUT

Training data r-squared: 0.79

This time we got an r-squared of 0.79 compared to 0.75. This looks like a promising improvement.


PART IX: Evaluating Coefficients with Log Prices

Challenge

Print out the coefficients of the new regression model.

• Do the coefficients still have the expected sign?

• Is being next to the river a positive based on the data?

• How does the quality of the schools affect property prices? What happens to prices as there are more students per teacher?

Hint: Use a DataFrame to make the output look pretty.

SAMPLE_CODE

df_coef = pd.DataFrame(data=log_regr.coef_, index=X_train.columns, columns=['coef'])
df_coef

OUTPUT

		coef
CRIM	-0.01
ZN	    0.00
INDUS	0.00
CHAS	0.08
NOX	    -0.70
RM	    0.07
AGE	    0.00
DIS	    -0.05
RAD	    0.01
TAX	    -0.00
PTRATIO	-0.03
B	    0.00
LSTAT	-0.03

So how can we interpret the coefficients? The key thing we look for is still the sign - being close to the river results in higher property prices because CHAS has a coefficient greater than zero. Therefore property prices are higher next to the river.

More students per teacher - a higher PTRATIO - is a clear negative. Smaller class sizes are indicative of higher-quality education, so PTRATIO has a negative coefficient.
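
Because the target is now the log price, each coefficient can also be read as an approximate percentage effect: a one-unit increase in a feature changes the price by roughly coef * 100%, or exactly by (e^coef - 1) * 100%. A sketch of that reading (the exact numbers depend on your fitted model):

import numpy as np

# Approximate % impact on price of being next to the river (CHAS going from 0 to 1)
chas_coef = df_coef.loc['CHAS', 'coef']
print(f'River premium: ~{(np.exp(chas_coef) - 1) * 100:.1f}%')

# Approximate % impact of one extra student per teacher (PTRATIO + 1)
ptratio_coef = df_coef.loc['PTRATIO', 'coef']
print(f'Each extra student per teacher: ~{(np.exp(ptratio_coef) - 1) * 100:.1f}%')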


PART X: Regression with Log Prices & Residual Plots

Challenge

• Copy-paste the cell where you’ve created scatter plots of the actual versus the predicted home prices as well as the residuals versus the predicted values.

• Add 2 more plots to the cell so that you can compare the regression outcomes with the log prices side by side.

• Use indigo as the colour for the original regression and navy for the color using log prices.

SAMPLE_CODE

# Graph of Actual vs. Predicted Log Prices
plt.scatter(x=log_y_train, y=log_predictions, c='navy', alpha=0.6)
plt.plot(log_y_train, log_y_train, color='cyan')
plt.title(f'Actual vs Predicted Log Prices: $y _i$ vs $\hat y_i$ (R-Squared {log_rsquared:.2})', fontsize=17)
plt.xlabel('Actual Log Prices $y _i$', fontsize=14)
plt.ylabel('Predicted Log Prices $\hat y _i$', fontsize=14)
plt.show()

# Original Regression of Actual vs. Predicted Prices
plt.scatter(x=y_train, y=predicted_vals, c='indigo', alpha=0.6)
plt.plot(y_train, y_train, color='cyan')
plt.title(f'Original Actual vs Predicted Prices: $y _i$ vs $\hat y_i$ (R-Squared {rsquared:.3})', fontsize=17)
plt.xlabel('Actual prices 000s $y _i$', fontsize=14)
plt.ylabel('Predicted prices 000s $\hat y _i$', fontsize=14)
plt.show()

# Residuals vs Predicted values (Log prices)
plt.scatter(x=log_predictions, y=log_residuals, c='navy', alpha=0.6)
plt.title('Residuals vs Fitted Values for Log Prices', fontsize=17)
plt.xlabel('Predicted Log Prices $\hat y _i$', fontsize=14)
plt.ylabel('Residuals', fontsize=14)
plt.show()

# Residuals vs Predicted values
plt.scatter(x=predicted_vals, y=residuals, c='indigo', alpha=0.6)
plt.title('Original Residuals vs Fitted Values', fontsize=17)
plt.xlabel('Predicted Prices $\hat y _i$', fontsize=14)
plt.ylabel('Residuals', fontsize=14)
plt.show()

It's hard to see a difference here just by eye. The predicted values seem slightly closer to the cyan line, but eyeballing the charts is not terribly helpful in this case.

Challenge

Calculate the mean and the skew for the residuals using log prices. Are the mean and skew closer to 0 for the regression using log prices?

SAMPLE_CODE

# Distribution of Residuals (log prices) - checking for normality
log_resid_mean = round(log_residuals.mean(), 2)
log_resid_skew = round(log_residuals.skew(), 2)

sns.displot(log_residuals, kde=True, color='navy')
plt.title(f'Log price model: Residuals Skew ({log_resid_skew}) Mean ({log_resid_mean})')
plt.show()

sns.displot(residuals, kde=True, color='indigo')
plt.title(f'Original model: Residuals Skew ({resid_skew}) Mean ({resid_mean})')
plt.show()

Our new regression residuals have a skew of 0.09 compared to a skew of 1.46. The mean is still around 0. From both a residuals perspective and an r-squared perspective we have improved our model with the data transformation.


PART XI: Compare Out of Sample Performance

The real test is how our model performs on data that it has not “seen” yet. This is where our X_test comes in.

Challenge
Compare the r-squared of the two models on the test dataset. Which model does better? Is the r-squared higher or lower than for the training dataset? Why?

SAMPLE_CODE

print(f'Original Model Test Data r-squared: {regr.score(X_test, y_test):.2}')
print(f'Log Model Test Data r-squared: {log_regr.score(X_test, log_y_test):.2}')

OUTPUT

Original Model Test Data r-squared: 0.67
Log Model Test Data r-squared: 0.74

By definition, the model has not been optimised for the testing data. Therefore performance will be worse than on the training data. However, our r-squared still remains high, so we have built a useful model.


PART XII: Predict a Property’s Value using the Regression Coefficients

Our preferred model now has an equation that looks like this:
log(PRICE-hat) = θ0 + θ1RM + θ2NOX + θ3DIS + θ4CHAS + … + θ13LSTAT

The average property has the mean value for all its characteristics:

SAMPLE_CODE

# Starting Point: Average Values in the Dataset
features = data.drop(['PRICE'], axis=1)
average_vals = features.mean().values
property_stats = pd.DataFrame(data=average_vals.reshape(1, len(features.columns)), 
							  columns=features.columns)
property_stats

OUTPUT

	CRIM	ZN	    INDUS	CHAS	NOX	    RM	    AGE	    DIS	    RAD	    TAX	    PTRATIO	B	    LSTAT
0	3.61	11.36	11.14	0.07	0.55	6.28	68.57	3.80	9.55	408.24	18.46	356.67	12.65

Challenge

Predict how much the average property is worth using the stats above. What is the log price estimate and what is the dollar estimate? You’ll have to reverse the log transformation with .exp() to find the dollar value.

SAMPLE_CODE

# Make prediction
log_estimate = log_regr.predict(property_stats)[0]
print(f'The log price estimate is ${log_estimate:.3}')

# Convert Log Prices to Actual Dollar Values
dollar_est = np.e**log_estimate * 1000
# or use
dollar_est = np.exp(log_estimate) * 1000
print(f'The property is estimated to be worth ${dollar_est:.6}')

OUTPUT

The log price estimate is $3.03
The property is estimated to be worth $20703.2

A property with an average value for all the features has a value of $20,700.

Challenge

Keeping the average values for CRIM, RAD, INDUS and others, value a property with the following characteristics:

SAMPLE_CODE

# Define Property Characteristics
next_to_river = True
nr_rooms = 8
students_per_classroom = 20 
distance_to_town = 5
pollution = data.NOX.quantile(q=0.75) # high
amount_of_poverty =  data.LSTAT.quantile(q=0.25) # low

# Solution
# Set Property Characteristics
property_stats['RM'] = nr_rooms
property_stats['PTRATIO'] = students_per_classroom
property_stats['DIS'] = distance_to_town

if next_to_river:
	property_stats['CHAS'] = 1
else:
	property_stats['CHAS'] = 0

property_stats['NOX'] = pollution
property_stats['LSTAT'] = amount_of_poverty

# Make prediction
log_estimate = log_regr.predict(property_stats)[0]
print(f'The log price estimate is ${log_estimate:.3}')

# Convert Log Prices to Actual Dollar Values
dollar_est = np.e**log_estimate * 1000
print(f'The property is estimated to be worth ${dollar_est:.6}')

OUTPUT

The log price estimate is $3.25
The property is estimated to be worth $25792.0

Solution & Learning Points

What we learned today:

• How to quickly find relationships in a dataset with Seaborn's .pairplot().

• How to split data into training and test datasets to better evaluate a model's performance.

• How to run a multivariable regression.

• How to evaluate a regression based on the signs of its coefficients.

• How to analyse and look for patterns in a model's residuals.

• How to improve a regression model using a (log) data transformation.

• How to specify your own values for the various features and use the model to make a prediction.


References:

While looking for resources, I found that many people learning Python or AI use this problem as a practice exercise. Here are a few examples: