Machine Learning with Python 101 (Lesson 1)
Welcome to the blog where I read the Python documentation (and other resources) so you don’t have to. In this case, credit goes 100% to the scikit-learn documentation; here I am just summarizing the things I think will be important for you and me as we embark on this machine learning journey. This post is mostly theory, with a few small code sketches to make the ideas concrete; proper hands-on practice will come later. We will start by summarizing what I think is important from the ‘Tutorials’ section, which you can find here: https://scikit-learn.org/stable/tutorial/index.html. Today we are tackling the “An introduction to machine learning with scikit-learn” section.
The learning process of a model can be categorized into (1) supervised learning and (2) unsupervised learning.
Supervised learning encompasses models where the algorithm has a defined target variable it is trying to predict or explain; the problem is either classification (predicting discrete labels) or regression (predicting continuous values). Unsupervised learning, on the other hand, has no target variable: the model simply tries to learn relationships between the variables within the data, which makes it widely used for exploratory analysis.
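To make the distinction concrete, here is a minimal sketch (my own illustration, not from the tutorial) that fits a supervised classifier and an unsupervised clustering model on the same data. The iris dataset and the choice of SVC and KMeans are assumptions I made purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the algorithm sees both the features X and the target y,
# and learns to predict y from X.
clf = SVC()
clf.fit(X, y)
print(clf.predict(X[:3]))   # predicted class labels

# Unsupervised: the algorithm sees only X and looks for structure on its
# own -- here, grouping the samples into 3 clusters. No target is involved.
km = KMeans(n_clusters=3, n_init=10)
km.fit(X)
print(km.labels_[:3])       # cluster assignments, not class predictions
```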
Throughout this series of exercises and blog posts, we will use the datasets that come readily available in the sklearn.datasets package: the iris dataset, the digits dataset, and the Boston house prices dataset. All three come in dictionary-like form (and include metadata). As the scikit-learn documentation indicates, you can find more information about the datasets at the following link: https://scikit-learn.org/stable/datasets/index.html#datasets.
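As a quick taste of what these loaders return, here is a small sketch using iris and digits (note that the Boston house prices loader has been removed from recent scikit-learn releases, so depending on your version you may not be able to load that one):

```python
from sklearn.datasets import load_iris, load_digits

iris = load_iris()
digits = load_digits()

# Each loader returns a dictionary-like Bunch object: the data plus metadata.
print(iris.keys())          # 'data', 'target', 'target_names', 'DESCR', ...
print(iris.data.shape)      # (150, 4): 150 samples, 4 features each
print(iris.target[:5])      # the target variable, species encoded as 0-2
print(iris.DESCR[:200])     # metadata: a plain-text description of the dataset
print(digits.images.shape)  # (1797, 8, 8): digits also keeps the raw 8x8 images
```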
Supervised learning:
New concept: estimator
As the documentation puts it: “In scikit-learn, an estimator for classification is a Python object that implements the methods fit(X, y) and predict(T).” The estimator creates a trained model based on the parameters that were specified: we let it learn from the data by calling fit(X, y), and our model is created. We then use predict(T) when we want the trained model to predict the outcome for new samples T, for example to test its accuracy on data it has not seen.
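This is easiest to see with the tutorial’s own running example: training a support vector classifier on the digits dataset and then predicting the one image held out of training (the gamma and C values below are the ones the tutorial happens to pick):

```python
from sklearn import datasets, svm

digits = datasets.load_digits()

# The estimator object, with its hyper-parameters specified up front.
clf = svm.SVC(gamma=0.001, C=100.0)

# fit(X, y): learn from every image except the last one.
clf.fit(digits.data[:-1], digits.target[:-1])

# predict(T): ask the trained model for the label of the held-out image.
print(clf.predict(digits.data[-1:]))   # e.g. array([8])
print(digits.target[-1])               # the true label, for comparison
```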
As per the documentation, “It is possible to save a model in scikit-learn by using Python’s built-in persistence model, pickle.” However, “In the specific case of scikit-learn, it may be more interesting to use joblib’s replacement for pickle (joblib.dump & joblib.load), which is more efficient on big data but it can only pickle to the disk and not to a string.” To learn more: https://joblib.readthedocs.io/en/latest/persistence.html.
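Here is a short sketch of both persistence routes, assuming a trained classifier like the one above (the ‘model.joblib’ filename is just a placeholder of my choosing):

```python
import pickle
import joblib
from sklearn import datasets, svm

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC().fit(X, y)

# pickle can serialize the model to an in-memory bytes string...
s = pickle.dumps(clf)
clf_from_pickle = pickle.loads(s)

# ...while joblib is more efficient for models carrying big numpy arrays,
# but can only persist to disk, not to a string.
joblib.dump(clf, 'model.joblib')
clf_from_joblib = joblib.load('model.joblib')
print(clf_from_joblib.predict(X[:3]))
```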
Conventions:
Please visit https://scikit-learn.org/stable/tutorial/basic/tutorial.html and https://scikit-learn.org/stable/glossary.html#glossary for more information about the different conventions that scikit-learn estimators follow to make programmers’ lives easier.
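Two conventions from that page are worth a quick illustration: unless told otherwise, estimators cast input to float64, and hyper-parameters can be changed after construction with set_params(), with a later fit() simply overwriting whatever was learned before. A sketch of both (RBFSampler is just a convenient transformer for showing the type cast, and the iris data is my choice for the refitting example):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.kernel_approximation import RBFSampler
from sklearn.svm import SVC

# Convention 1: float32 input is silently cast to float64.
rng = np.random.RandomState(0)
X32 = rng.rand(10, 20).astype(np.float32)
X_new = RBFSampler().fit_transform(X32)
print(X_new.dtype)   # float64

# Convention 2: set_params() updates hyper-parameters after construction,
# and re-fitting overwrites the previously learned model.
X, y = load_iris(return_X_y=True)
clf = SVC()
clf.set_params(kernel='linear').fit(X, y)   # first model: linear kernel
clf.set_params(kernel='rbf').fit(X, y)      # refit: rbf kernel replaces it
```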
Up Next: A tutorial on statistical-learning for scientific data processing (summary)
Documentation: https://scikit-learn.org/stable/tutorial/statistical_inference/index.html
While I deviated from the topic a bit in the accompanying video (it starts getting into linear models, which is not the direct approach I had in mind for this series of written blog posts), it overlaps somewhat. I’ll start getting the video content and the written content more synchronized as I gain experience in the content creation area.