Hands-On Machine Learning — Simple Linear Regression

Mark Subra
Published in Analytics Vidhya
2 min read, Nov 7, 2020


Perhaps one of the best books on Python and data science is Géron’s Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow. In my opinion it is especially good for beginners, and for self-taught data scientists it is an absolute must. It includes several guided projects, with code, that walk aspiring data scientists through each step.

Knowing Scikit-Learn, Keras, and TensorFlow is also an absolute must for data scientists, as employers are keen on these libraries and their modules. A brief summary:

Scikit-Learn is a machine learning library for Python featuring various algorithms for statistical analysis and for supervised and unsupervised learning.

Keras is a library for artificial neural networks.

TensorFlow is another machine learning library, focused on training deep neural networks.

This book guides the reader with code, along with a great explanation of what the code is doing and how it works. It also explains how the various algorithms and neural networks work, using diagrams and visuals. In my opinion it’s a great balance between theory and practice.

Chapter 1 Simple Linear Regression

The first chapter introduces scikit-learn's linear_model module through a very simple linear regression problem. The dataset is the OECD 2015 Better Life Index, combined with a separate file of GDP per capita figures. Below is the code for a linear model between GDP per capita and life satisfaction.

In [1]:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

In [2]:

oecd_bli = pd.read_csv('oecd_bli_2015.csv', thousands=',')
gdp_per_capita = pd.read_csv('gdp_per_capita.csv', thousands=',', delimiter='\t',
                             encoding='latin1', na_values='n/a')

In [7]:

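def prepare_country_stats(oecd_bli, gdp_per_capita):
    # Keep only the rows where INEQUALITY is 'TOT' (values for the total population)
    oecd_bli = oecd_bli[oecd_bli['INEQUALITY'] == 'TOT']
    # Reshape to one row per country and one column per indicator
    oecd_bli = oecd_bli.pivot(index='Country', columns='Indicator', values='Value')
    # Note: these two calls modify the gdp_per_capita DataFrame that was passed in
    gdp_per_capita.rename(columns={'2015': 'GDP per capita'}, inplace=True)
    gdp_per_capita.set_index('Country', inplace=True)
    # Join the two tables on the country index and sort by GDP per capita
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita, left_index=True, right_index=True)
    full_country_stats.sort_values(by='GDP per capita', inplace=True)
    # Drop a handful of countries by position and keep the remaining rows
    remove_indices = [0, 1, 6, 8, 33, 34, 35]
    keep_indices = list(set(range(36)) - set(remove_indices))
    return full_country_stats[['GDP per capita', 'Life satisfaction']].iloc[keep_indices]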
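The pivot-then-merge step in the function above can be easier to follow on a tiny table. The sketch below uses made-up countries and values (not the OECD or GDP files), and the names toy_bli, toy_gdp, wide, and merged are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Toy long-format table, invented for illustration (not the real OECD file)
toy_bli = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Indicator": ["Life satisfaction", "Air pollution",
                  "Life satisfaction", "Air pollution"],
    "Value": [6.5, 12.0, 7.1, 9.0],
})

# Toy GDP table indexed by country, also invented
toy_gdp = pd.DataFrame(
    {"Country": ["A", "B"], "GDP per capita": [30000.0, 45000.0]}
).set_index("Country")

# Long to wide: one row per country, one column per indicator
wide = toy_bli.pivot(index="Country", columns="Indicator", values="Value")

# Join on the country index, then sort by GDP per capita
merged = pd.merge(left=wide, right=toy_gdp, left_index=True, right_index=True)
merged = merged.sort_values(by="GDP per capita")

print(merged[["GDP per capita", "Life satisfaction"]])
```

After the pivot, each indicator becomes its own column, which is what lets the final line select 'GDP per capita' and 'Life satisfaction' side by side.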

In [8]:

country_stats = prepare_country_stats(oecd_bli,gdp_per_capita)

In [10]:

X = np.c_[country_stats['GDP per capita']]
y = np.c_[country_stats['Life satisfaction']]

In [11]:

country_stats.plot(kind = 'scatter',x='GDP per capita', y='Life satisfaction')
plt.show()

In [12]:

model = sklearn.linear_model.LinearRegression()

In [13]:

model.fit(X,y)

Out[13]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [14]:

X_new = [[22587]] # GDP per capita for Cyprus

In [15]:

print(model.predict(X_new))

[[5.96242338]]

Life satisfaction data for Cyprus is not available in the dataset, but the linear model predicts a value of about 5.96. For comparison, Slovenia, which is in the dataset, has a life satisfaction of 5.7 with a GDP per capita of $20,732. This is a very simple (x, y) model with only two variables: one feature (GDP per capita) and one target (life satisfaction). The dataset has many other indicators, which other scikit-learn models could use to fit the data better.
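To make the "simple (x, y) model" concrete, here is a minimal sketch showing that a fitted LinearRegression prediction is just the intercept plus the slope times the input. The numbers are made up for illustration (they are not the OECD data), and the names X_toy, y_toy, and toy_model are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data, invented for illustration: y = 4.0 + 0.00005 * x exactly
X_toy = np.array([[10000.0], [20000.0], [30000.0], [40000.0]])
y_toy = 4.0 + 5e-5 * X_toy

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

# With a 2D target, intercept_ has shape (1,) and coef_ has shape (1, 1)
intercept = toy_model.intercept_[0]
slope = toy_model.coef_[0][0]

# The prediction for a new x is intercept + slope * x
x_new = 25000.0
manual = intercept + slope * x_new
predicted = toy_model.predict([[x_new]])[0][0]

print(intercept, slope)
print(manual, predicted)
```

The same two attributes, intercept_ and coef_, are available on the model fitted to the country data above, so the Cyprus prediction could be reproduced by hand in the same way.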


I am a data scientist who recently graduated from the Flatiron School Immersive Data Science Bootcamp.