We discussed TPOT, a tool that uses genetic programming to optimize machine learning pipelines and produce accurate models.
Genetic programming for automated machine learning
Automated machine learning (or simply AutoML) refers to automating the construction of a data analysis pipeline. AutoML can cover data preprocessing, feature selection, and feature engineering, along with the choice of machine learning methods and parameter settings that are optimized for your data. Its biggest benefit is that it automates algorithm selection, which can then take hours rather than the months that manual selection may require.
This case study focused on TPOT, an open source AutoML tool that explores thousands of candidate pipelines to find the best one for a given dataset. It uses genetic programming: candidate pipelines are scored on the data, the best performers are kept, and new candidates are derived from them through mutation and crossover. Once the search has finished, TPOT provides the user with a short list of the best pipelines found.
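To make the search idea concrete, here is a deliberately small sketch of an evolutionary search over scikit-learn pipelines. It is not TPOT's actual algorithm (TPOT evolves tree-shaped pipelines with a far richer operator set); the fixed three-stage search space, the helper names (`build`, `fitness`, `mutate`, `crossover`, `evolve`), and the tiny population are all assumptions chosen for illustration.

```python
import random

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical search space: one choice per pipeline stage.
SCALERS = [StandardScaler, MinMaxScaler]
K_FEATURES = [2, 3, 4]
MODELS = [
    lambda: LogisticRegression(max_iter=1000),
    lambda: KNeighborsClassifier(),
    lambda: DecisionTreeClassifier(random_state=0),
]
STAGE_SIZES = (len(SCALERS), len(K_FEATURES), len(MODELS))


def build(genome):
    """Turn a genome (one index per stage) into a scikit-learn pipeline."""
    s, k, m = genome
    return make_pipeline(
        SCALERS[s](), SelectKBest(f_classif, k=K_FEATURES[k]), MODELS[m]()
    )


def fitness(genome, X, y):
    """Mean 3-fold cross-validated accuracy of the pipeline."""
    return cross_val_score(build(genome), X, y, cv=3).mean()


def mutate(genome, rng):
    """Re-draw the choice at one randomly picked stage."""
    i = rng.randrange(len(genome))
    child = list(genome)
    child[i] = rng.randrange(STAGE_SIZES[i])
    return tuple(child)


def crossover(a, b, rng):
    """Take each stage from one of the two parents at random."""
    return tuple(rng.choice(pair) for pair in zip(a, b))


def evolve(X, y, population_size=6, generations=3, seed=0):
    rng = random.Random(seed)
    population = [
        tuple(rng.randrange(n) for n in STAGE_SIZES)
        for _ in range(population_size)
    ]
    for _ in range(generations):
        # Keep the better half, refill with mutated offspring of the survivors.
        ranked = sorted(population, key=lambda g: fitness(g, X, y), reverse=True)
        parents = ranked[: population_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    best = max(population, key=lambda g: fitness(g, X, y))
    return best, fitness(best, X, y)


X, y = load_iris(return_X_y=True)
best_genome, best_score = evolve(X, y)
print(build(best_genome))
print(f"cross-validated accuracy: {best_score:.3f}")
```

The structure mirrors the description above: score every candidate, keep the strongest, and derive new candidates by recombining and perturbing them. A real tool also caches scores and searches hyperparameters, which this sketch omits for brevity.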
TPOT doesn't automate the entire machine learning process, but it does cover a large part of it: feature preprocessing and selection, model selection, and parameter optimization.
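In practice, that part of the process is driven through a scikit-learn style estimator. The following is a minimal sketch that assumes the classic TPOT 0.x interface (`TPOTClassifier` with `generations`, `population_size`, `verbosity`, and an `export` method); newer TPOT releases changed parts of this API, so treat the argument names as assumptions to check against your installed version. The function name `run_tpot_search`, the digits dataset, and the output file name are illustrative choices, not part of the case study.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split


def run_tpot_search(generations=5, population_size=20, random_state=42):
    """Run a small TPOT search and return (best pipeline, held-out accuracy)."""
    # Imported lazily so the rest of the sketch loads without TPOT installed.
    from tpot import TPOTClassifier  # assumption: classic TPOT 0.x API

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state
    )

    tpot = TPOTClassifier(
        generations=generations,          # number of search iterations
        population_size=population_size,  # pipelines kept per generation
        verbosity=2,                      # print progress during the search
        random_state=random_state,
    )
    tpot.fit(X_train, y_train)  # the genetic search happens here
    test_score = tpot.score(X_test, y_test)

    # Write the best pipeline found as a standalone Python script.
    tpot.export("tpot_best_pipeline.py")
    return tpot.fitted_pipeline_, test_score
```

Calling `run_tpot_search()` with these small budgets keeps the run short; larger values of `generations` and `population_size` widen the search at the cost of compute time. Data loading, cleaning, and the final review of the exported pipeline remain manual steps.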
How it works
Scikit-learn lets you define pipelines, which are sequences of operations in which the output of one operation becomes the input of the next. TPOT builds its candidate solutions from such pipelines. The scikit-learn documentation has an overview of all classes that can be put into a pipeline. There are many, and operators from other libraries can be added as well.
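As a small illustration of such a pipeline, the sketch below chains a scaler, a feature selector, and a classifier by hand. The iris dataset, the step names, and the specific operators are assumptions picked for brevity; they are not the pipeline used in the case study.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Each step receives the output of the previous step as its input.
pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),                         # preprocessing
    ("select", SelectKBest(score_func=f_classif, k=2)),  # feature selection
    ("model", LogisticRegression(max_iter=1000)),        # final estimator
])

pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Choosing these three steps and their parameters by hand is exactly the work that a tool like TPOT tries to automate.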
The engineers of Bi4 Group tested the accuracy of TPOT by entering a competition on Kaggle, a platform that hosts machine learning challenges. A dataset of 200,000 rows and 200 columns was processed twice using TPOT, yielding a TPOT score of 0.751 compared to a Kaggle score of 0.800. This suggests that TPOT can automatically explore solutions and return one that is usually good enough. TPOT does require scalable (cloud) infrastructure to process the data.
TPOT helps substantially in the ML process
TPOT is a great tool for generating accurate pipelines and automating a large part of the machine learning process.
It saves the user a lot of time and effort, as it is much faster than having data scientists select algorithms by hand, and it improves the efficiency of machine learning processes.