Machine Learning with Python 101 (Lesson 2)
Welcome back to my blog tutorial series: the blog and YouTube channel where I read the Python documentation (as well as other resources) so you don't have to. Today I will give you another theoretical summary of the 'tutorial' section of the scikit-learn documentation, available here: https://scikit-learn.org/stable/tutorial/basic/tutorial.html#loading-an-example-dataset. The section is called "A tutorial on statistical-learning for scientific data processing".
As the tutorial says: "Statistical inference: drawing conclusions on the data at hand" and "Scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (NumPy, SciPy, matplotlib)."
The section starts by describing the datasets and estimator objects that scikit-learn makes available for you to use. I will not cover these in detail; I touched on them briefly in the previous lesson. You can find more information here: https://scikit-learn.org/stable/tutorial/statistical_inference/settings.html.
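Just to give you a taste of what that page covers, here is a minimal sketch of loading one of the example datasets bundled with scikit-learn (I'm using the iris dataset, but the other built-in loaders work the same way):

```python
from sklearn import datasets

# Load the iris dataset that ships with scikit-learn
iris = datasets.load_iris()

# `data` holds the features (one row per observation),
# `target` holds the label for each observation
print(iris.data.shape)    # (150, 4)
print(iris.target.shape)  # (150,)
```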
Supervised Learning models: classification vs regression.
Whether you need classification or regression depends on what you want the model to output. If you want the output to be a discrete variable (one of a set of categories to which you are assigning the observation), you will use a classification model. On the other hand, if you want the output to be a prediction of a continuous variable, you will use regression. The sketch below makes the distinction concrete.
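Here is a minimal sketch (my own, not from the documentation) contrasting the two, using a logistic-regression classifier and a linear-regression model on two of scikit-learn's bundled datasets:

```python
from sklearn.datasets import load_iris, load_diabetes
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: discrete output (which of three iris species?)
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
print(clf.predict(X[:2]))  # class labels, e.g. [0 0]

# Regression: continuous output (a disease-progression score)
X, y = load_diabetes(return_X_y=True)
reg = LinearRegression()
reg.fit(X, y)
print(reg.predict(X[:2]))  # real-valued predictions
```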
Here the documentation goes on to give short explanations of what each model is and how it works. I am not going to do that; I would only end up confusing you and not doing each model the justice it deserves. If you want to read more, go here: https://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html.
I will, however, go in-depth on each of these models in the future (and give you the statistical background you need, drawing on other resources) so you can understand them, as well as link some Kaggle practice examples you can play with. For now, let's move on to model selection.
As the scikit-learn documentation explains: "every estimator exposes a score method that can judge the quality of the fit (or the prediction) on new data. Bigger is better." For this, however, you will first need to split your data into training and test sets. You build a model using the training portion, and then test the accuracy of your model on the test set. Cross-validation extends this idea by repeating the split several times and averaging the scores, and it's pretty easy to understand. We'll get into the actual code in later lessons; if you would like real examples right now, instead of waiting for my practice exercises using Kaggle.com datasets, go here: https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html. I believe cross-validation is such an extensive topic that it deserves its own post and video tutorial.
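As a small preview, here is a minimal sketch of what that looks like in code (a logistic-regression classifier on the iris dataset; the split size and parameters are just illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Hold out a test set: fit on the training portion only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Every estimator exposes a score method; bigger is better
# (for classifiers this is mean accuracy on the given data)
print(model.score(X_test, y_test))

# Cross-validation repeats the split several times and averages
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())
```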
As a closing point, I will leave you with the image the scikit-learn documentation provides to help you find the right estimator for your data. With this section, the very vague part of the lessons ends, and from here on we will be tackling things both (1) in detail and (2) with actual practice examples to show for it. You rock. Thank you for following my blog, and please write to me with any questions.