GENETIC PROGRAMMING FOR AUTOMATED MACHINE LEARNING
Automated machine learning (AutoML) refers to automating the construction of a data analysis pipeline. An AutoML pipeline can include data preprocessing, feature selection, and feature engineering steps, along with machine learning methods and parameter settings that are optimized for your data. The biggest benefit of AutoML is that it automates algorithm selection, which can then take hours instead of the months that manual selection may require.
This case study focuses on TPOT, an open source AutoML tool that explores thousands of possible pipelines to find the best one for a given dataset. It uses genetic programming, an evolutionary search technique, to evolve candidate pipelines toward the best outcome. Once the search process has finished, TPOT provides the user with a short list of the best pipelines found.
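To make the idea of an evolutionary pipeline search more concrete, here is a minimal sketch in plain Python. It is not TPOT's implementation: TPOT evolves tree-shaped pipelines and scores them on real data, whereas this toy uses a fixed three-stage pipeline and a placeholder fitness function. The search space, the hidden `TARGET` pipeline, and all function names are assumptions made up for illustration.

```python
import random

# Hypothetical search space: each candidate pipeline picks one option per stage.
SEARCH_SPACE = {
    "preprocessor": ["none", "standard_scaler", "min_max_scaler"],
    "selector": ["none", "select_k_best", "variance_threshold"],
    "model": ["logistic_regression", "random_forest", "gradient_boosting"],
}
STAGES = list(SEARCH_SPACE)

# Toy stand-in for "the best pipeline": only used by the placeholder fitness below.
TARGET = {
    "preprocessor": "standard_scaler",
    "selector": "select_k_best",
    "model": "random_forest",
}


def toy_fitness(pipeline):
    """Placeholder score: counts stages that match TARGET.

    A real AutoML tool would return a cross-validated score on the dataset.
    """
    return sum(pipeline[stage] == TARGET[stage] for stage in STAGES)


def random_pipeline(rng):
    """Draw one random option for every stage."""
    return {stage: rng.choice(SEARCH_SPACE[stage]) for stage in STAGES}


def crossover(a, b, rng):
    """Build a child by taking each stage from one of the two parents."""
    return {stage: rng.choice([a[stage], b[stage]]) for stage in STAGES}


def mutate(pipeline, rng):
    """Re-draw the option of one randomly chosen stage."""
    child = dict(pipeline)
    stage = rng.choice(STAGES)
    child[stage] = rng.choice(SEARCH_SPACE[stage])
    return child


def evolve(generations=20, population_size=10, seed=0):
    """Keep the better half each generation and refill it with mutated offspring."""
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=toy_fitness, reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return max(population, key=toy_fitness)


best = evolve()
print(best, toy_fitness(best))
```

Because the better half of the population always survives, the best score never gets worse from one generation to the next. Swapping `toy_fitness` for a function that trains and cross-validates a real pipeline turns the same loop into a very basic AutoML search.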
TPOT does not automate the entire machine learning process, but it does cover a large part of it: feature selection, feature preprocessing, model selection, and parameter optimization.
HOW IT WORKS
Scikit-learn lets you define pipelines, which are collections of sequenced operations. TPOT builds on this: within a pipeline, the output of one operation becomes the input of the next. The scikit-learn documentation gives an overview of all classes that can be put into a pipeline. There are many, and operators from other libraries can be added as well.
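As an illustration of what such a pipeline looks like, here is a small hand-written scikit-learn example. The dataset (scikit-learn's bundled iris data), the three steps, and their parameters are assumptions chosen for brevity; a tool like TPOT would search over choices of this kind rather than have them fixed by hand.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small bundled dataset, used here only to keep the example self-contained.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Each step receives the output of the previous one:
# scaling -> feature selection -> classifier.
pipeline = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("select", SelectKBest(score_func=f_classif, k=2)),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)

# Fitting the pipeline fits every step in order on the training data.
pipeline.fit(X_train, y_train)

# Scoring pushes the test data through the same steps before predicting.
accuracy = pipeline.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.3f}")
```

Treating the whole chain as a single estimator is what makes automated search practical: a candidate pipeline can be fitted, scored, and compared with others through one `fit` and one `score` call.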
The engineers of Bi4 Group tested the accuracy of TPOT by participating in a contest on Kaggle, a platform that hosts machine learning competitions. A dataset of 200,000 rows and 200 columns was processed twice using TPOT, yielding a TPOT score of 0.751 compared to a Kaggle score of 0.800. This shows that TPOT can automatically explore solutions and return one that is usually good enough. TPOT does require scalable (cloud) infrastructure to process the data.
Alongside the open source alternative, tech giants also offer boxed solutions. Google recently launched Cloud AutoML, which is part of a suite of machine learning products geared to developers with limited ML experience. In reality, however, most of these solutions are just "standard" machine learning tools that, for instance, don't automate the machine learning pipeline. We recommend using an open source project implemented by a specialist company, which can then be customized. We don't believe in a one-size-fits-all solution, as each company has different requirements and wishes.
TPOT HELPS SUBSTANTIALLY IN THE ML PROCESS
TPOT is a great tool for generating accurate pipelines and automating a large part of the machine learning process.
It saves the user a lot of time and effort, as it is much faster than data scientists selecting algorithms by hand, and it improves the efficiency of machine learning processes.