Machine Learning with Python 101 (Lesson 3): General Linear Regression Models with Scikit-learn.
This week we’ll cover the ‘Generalized Linear Models’ section of the scikit-learn documentation, and we’ll complement what we learn with material from other books. I have made a point of writing these tutorials in advance so that you get one post every day this week (the originally intended publication pace).
In the following code we’ll get started with simple linear regression, and we’ll expand to other linear models. We’ll be using the iris dataset. Just to make matters more comfortable for me (and hopefully for you too), I’ll start by transforming it from dictionary form to a pandas dataframe. Here is the code I used:
from sklearn import linear_model
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
iris = load_iris()
IrisData = pd.DataFrame(iris.data, columns=['sepal_length','sepal_width', 'petal_length', 'petal_width'])
IrisData['Target'] = iris.target
Next, we'll use the following code to describe our data:
print(IrisData.describe())
print(IrisData.info())
We’ll proceed by creating a correlation matrix:
import seaborn as sns
corr = IrisData.corr()  # compute the matrix first so we can reuse its labels
sns.heatmap(corr, cmap='YlGnBu', annot=True, linewidths=.3, linecolor='white',
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
According to these correlations, we can expect petal_width and petal_length to be the best predictors in a simple linear model. However, we’ll encounter collinearity, because they are highly correlated with each other.
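To make the collinearity claim concrete, here is a minimal sketch (using only the iris data loaded above) that prints the Pearson correlation between the two candidate predictors; a value close to 1 confirms they carry largely redundant information:

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
IrisData = pd.DataFrame(iris.data,
                        columns=['sepal_length', 'sepal_width',
                                 'petal_length', 'petal_width'])

# Pearson correlation between the two strongest candidate predictors;
# a value near 1 signals the collinearity discussed above.
r = IrisData['petal_length'].corr(IrisData['petal_width'])
print(round(r, 3))
```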
Before we continue, let's remember the four assumptions of linear regression (linearity, independence of errors, homoscedasticity, and normality of the residuals) and ensure our data complies with them. This way we can trust that our model fits the data appropriately. If the assumptions aren't met, it would be a good idea to use a different model whose assumptions are.
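A common first step in checking those assumptions is to look at the residuals of a fitted model. The sketch below (self-contained, fitting all four iris features against the target purely for illustration) computes the residuals; note that with an intercept term, ordinary least squares forces the residual mean to be essentially zero, so the informative diagnostics are the residuals' spread and any pattern against the fitted values, usually inspected with a plot:

```python
import pandas as pd
from sklearn import linear_model
from sklearn.datasets import load_iris

iris = load_iris()
X = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width',
                                     'petal_length', 'petal_width'])
y = iris.target

reg = linear_model.LinearRegression().fit(X, y)
residuals = y - reg.predict(X)

# With an intercept, OLS residuals always average to ~0; the real checks
# are homoscedasticity (constant spread) and linearity (no pattern vs.
# the fitted values), e.g. plt.scatter(reg.predict(X), residuals).
print(abs(residuals.mean()))
```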
Let's get started with our simple regression model. Here is the code we used to create our estimator and fit it to our data:
# note: the old normalize= parameter was removed in scikit-learn 1.2;
# scale your features beforehand (e.g. with StandardScaler) if needed
reg = linear_model.LinearRegression(fit_intercept=True, copy_X=True, n_jobs=None)
reg.fit(IrisData.iloc[:, 0:3], IrisData['Target'])
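Once the estimator is fitted, it is worth inspecting what it learned. A minimal self-contained sketch (rebuilding the same dataframe and fit as above) that prints the fitted coefficients, the intercept, and the R² score on the training data:

```python
from sklearn import linear_model
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
IrisData = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width',
                                            'petal_length', 'petal_width'])
IrisData['Target'] = iris.target

X = IrisData.iloc[:, 0:3]   # the first three features, as in the fit above
y = IrisData['Target']
reg = linear_model.LinearRegression().fit(X, y)

print(reg.coef_)        # one fitted weight per predictor
print(reg.intercept_)   # fitted intercept
print(reg.score(X, y))  # R^2 on the training data
```

Keep in mind that R² on the training data is optimistic; a train/test split gives a fairer estimate.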
Before we proceed, it is critically important that you feel comfortable with the statistical concepts we’ll be covering. To help, I’ll include a small section here to guide you if you need a recap.
Simple Linear Regression Formulas:
For this I’ll be using the following book:
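As a quick refresher, these are the standard textbook formulas for simple linear regression (not taken from any particular book): the model, and the ordinary least squares estimates of its slope and intercept.

```latex
% Simple linear regression model
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

% Ordinary least squares estimates
\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```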
In the upcoming lessons (starting with Lesson 4), we’ll be (1) using the following github blog: https://zhiyzuo.github.io/Linear-Regression-Diagnostic-in-Python/ to test how well the iris dataset complies with the linear regression assumptions, (2) continuing the general regression model in Python tutorial (and moving towards advanced topics in model fitting), and (3) working through a couple more examples to solidify what we have learned.
Coming up after ordinary least squares: Ridge Regression and Lasso.