What is a Machine Learning Pipeline?
A machine learning pipeline is a mechanism that automates a machine learning workflow by chaining multiple steps together, so that the output of each step becomes the input to the next.
In the real world, data is messy and requires a series of preprocessing steps before it is ready for training a machine learning model.
In an end-to-end machine learning project that is headed for production, new data must go through the same preprocessing before it is passed to the model, and repeating all of those steps by hand is a lot of work.
To make the preprocessing steps easier and smoother, there is the concept of pipelining, in which we perform all the preprocessing steps as a single chain. Python's scikit-learn provides a Pipeline utility to help automate machine learning workflows.
Transformers in a machine learning pipeline
Transformers are data processing and transformation components used in machine learning to convert data into a desired format. They are quite useful because our data is rarely in a form that can be fed directly to a model for training or prediction; transformers are the classes that apply these transformations while preprocessing the data.
ColumnTransformer in scikit-learn:
ColumnTransformer is a scikit-learn class used to create and apply separate transformations to numerical and categorical columns.
To create one, we pass a list of tuples, each containing a name, a transformer object, and the columns on which that transformation should be applied.
Transformers are usually combined with classifiers, regressors, or other estimators to build one composite estimator.
We can use the Pipeline class to build transformers, and we can also use it to chain multiple estimators together and use the whole pipeline for training and prediction as the main model.
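As a quick illustration, here is a minimal sketch of this idea with made-up column names (we build the real version for the sloth dataset below):
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Apply different transformations to different columns...
preprocess = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age', 'height']),  # numeric columns
    ('cat', OneHotEncoder(), ['city']),            # categorical column
])
# ...then chain the preprocessing and the estimator into one composite model.
composite = Pipeline(steps=[('prep', preprocess), ('clf', LogisticRegression())])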
In this post, we will create a simple machine learning model using the Pipeline utility from sklearn and then build a simple website around it using Flask. At the end, we will also host the website on Heroku.
For now we do not plan on creating a more advanced web app; we will only create the input form through which we can enter the input for a prediction from our model.
I've downloaded the sloth species dataset from Kaggle; you can download it from here if you plan to follow along with the same dataset.
About the dataset:
The dataset contains different species of sloths, along with their endangered classification and their sub-species.
It could be used to distinguish between two-toed and three-toed sloths, or between the sub-species.
The file contains the sizes of sloths, with each row being a different observed sloth: its dimensions, its weight, and whether the species is endangered.
Feature names:
'claw_length_cm', 'endangered', 'size_cm', 'specie', 'sub_specie', 'tail_length_cm', 'weight_kg'
Task:
- Create a model to predict whether a sloth is two-toed or three-toed using a machine learning pipeline
The steps included for this entire project are:
- Import Libraries
- Loading Data
- Data Engineering
- Creating Pipeline for the model
- Train the model
- Validation and Evaluation of model
- Saving Model
- Create a Template file
- Create routes using Flask and implement our model
- Deploy on Heroku
1. Import Libraries
Here, we will import only pandas and numpy. We will import the other libraries as and when needed.
import pandas as pd
import numpy as np
2. Loading Data
Load the dataset using the pandas library:
df = pd.read_csv('sloth_data.csv')
df.head()
Check if there are any null values:
df.isnull().sum()
3. Data Engineering
i. Drop the first column, which we don't need (it is an unnamed index column).
df = df.drop('Unnamed: 0', axis=1)
df.head()
ii. Understand the data.
Check the datatypes of different columns:
df.dtypes
Check the value counts in different columns:
df.endangered.value_counts()
df.specie.value_counts()
df.sub_specie.value_counts()
Shape of our dataset:
df.shape
iii. Separate the features from the label, storing the features as X and the label as y.
X = df.drop('specie', axis=1)
y = df['specie']
print(y.head())
iv. Split the dataset into training and testing sets.
from sklearn.model_selection import train_test_split
# random_state fixes the split so results are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
4. Creating Pipeline for the model
Import the libraries needed for creating pipelines:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
Select the numeric columns and the categorical columns from the dataset:
numeric_cols = X.select_dtypes(include=['float64']).columns
print(numeric_cols)
# categorical columns
categorical_cols = X.select_dtypes(include=['object']).columns
print(categorical_cols)
Store the numeric column indices in numeric_index:
numeric_index = [X.columns.get_loc(col) for col in numeric_cols]
numeric_index
And store the categorical column indices in categorical_index:
categorical_index = [X.columns.get_loc(col) for col in categorical_cols]
categorical_index
Building the Numeric Transformation Pipeline:
Although this dataset has no missing values, we will not skip the imputation step, because most real-world data contains nulls and the pipeline should be able to handle them. Here we use SimpleImputer to fill null values with the mean and then StandardScaler to standardize the numeric data.
n_transformer = Pipeline(steps=
[
    ('imputeN', SimpleImputer(strategy='mean')),  # fill numeric missing values with the column mean
    ('scale', StandardScaler())                   # standardize to zero mean, unit variance
])
Building the Categorical Transformation Pipeline:
For the categorical columns, we will impute with the most frequent value in each column and encode the values using OneHotEncoder.
c_transformer = Pipeline(steps=
[
    ('imputeC', SimpleImputer(strategy='most_frequent')),  # fill categorical missing values with the mode
    ('onehot', OneHotEncoder(handle_unknown='ignore'))     # one-hot encode; ignore unseen categories at prediction time
])
Now combine the numeric and categorical transformers:
from sklearn.compose import ColumnTransformer
pre = ColumnTransformer(transformers=
[
    ('numeric', n_transformer, numeric_index),
    ('categoric', c_transformer, categorical_index)
])
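As a side note, when the input is a DataFrame, ColumnTransformer also accepts column names directly, so the positional indices above are optional; an equivalent sketch:
pre = ColumnTransformer(transformers=
[
    ('numeric', n_transformer, list(numeric_cols)),
    ('categoric', c_transformer, list(categorical_cols))
])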
Create the estimator for training:
from sklearn.linear_model import LogisticRegression
estimator = LogisticRegression()
Finally combine the individual blocks to form the main pipeline:
main_pipeline = Pipeline(
steps = [
('preprocessor', pre), # Preprocessing
('classifier' , estimator) # Model
]
)
5. Train the model
i. Training the model using the pipeline.
ii. Displaying the pipeline: setting scikit-learn's display config to 'diagram' renders the pipeline as a diagram in a notebook.
from sklearn import set_config
set_config(display='diagram')
# fit the pipeline on the training data
main_pipeline.fit(X_train, y_train)
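After fitting, a quick sanity check before the full evaluation in the next step (Pipeline exposes the classifier's score method, which returns mean accuracy on the held-out split):
print(main_pipeline.score(X_test, y_test))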
6. Validation and Evaluation of model
Import all the metrics for validation and evaluation:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
import seaborn as sns
Predict on the X_test data:
y_pred = main_pipeline.predict(X_test)
print(y_pred)
Summarize the fitted model with a classification report. Note that target_names must follow the sorted order of the class labels, and 'three_toed' sorts before 'two_toed':
report = classification_report(y_test, y_pred, target_names=['three_toed', 'two_toed'])
print("Report : \n{}".format(report))
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
# print(cm)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')  # fmt='d' displays the counts as integers
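Since the individual metric functions are imported as well, we can also compute them directly (a quick sketch; for string labels, pos_label must name one of the classes):
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, pos_label='two_toed'))
print("Recall   :", recall_score(y_test, y_pred, pos_label='two_toed'))
print("F1 score :", f1_score(y_test, y_pred, pos_label='two_toed'))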
7. Saving Model
Save the model using pickle:
import pickle
pickle.dump(main_pipeline, open("model.pkl","wb"))
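As a side note, joblib is a common alternative to pickle for persisting scikit-learn models, since it handles the large numpy arrays inside fitted estimators more efficiently (a sketch; the filename model.joblib is just an example):
import joblib
joblib.dump(main_pipeline, 'model.joblib')  # save the fitted pipeline
loaded = joblib.load('model.joblib')        # load it back later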
To use the saved model, load the pickled pipeline:
model = pickle.load(open("model.pkl", "rb"))
model
And predict using the loaded model:
model.predict(X_test)
Check a row of X_test so that when we pass new input later, we use the same shape and column order:
X_test.head(1)
The column order is shown below; we will use this same order later when we read the data from our HTML form:
['claw_length_cm', 'endangered', 'size_cm','sub_specie','tail_length_cm','weight_kg']
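To sanity-check the loaded pipeline on a single hand-built row before wiring up the web form (the sample values here are the same ones reused as a comment in app.py later):
sample = pd.DataFrame(
    [[9.266, 'least_concern', 65.49, 'Linnaeus’s two-toed sloth', 0.691, 4.724]],
    columns=['claw_length_cm', 'endangered', 'size_cm', 'sub_specie', 'tail_length_cm', 'weight_kg'])
print(model.predict(sample))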
8. Create a Template file
Create a new project folder; I called it 'slothspecie'. Inside this folder, create another folder called templates.
Create a simple HTML file and save it inside the templates directory. We will use this form to enter the details needed to predict the species of a sloth. Create an input field for each of the columns listed above:
home.html
<!DOCTYPE html>
<html>
<head>
<meta charset='utf-8'>
<meta http-equiv='X-UA-Compatible' content='IE=edge'>
<title>Sloth's species</title>
<meta name='viewport' content='width=device-width, initial-scale=1'>
<!-- CSS only -->
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.0.1/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-+0n0xVW2eSR5OomGNYDnhzAbDsOXxcvSN1TPprVMTNDbiYZCxYbOOl7+AMvyTG2x" crossorigin="anonymous">
<style>
.container {
margin: 8rem;
}
.login--wrapper {
display: flex;
align-items: center;
height: 80vh;
justify-content: center;
}
.form {
width: 500px;
}
</style>
</head>
<body>
<div class="container">
<div class="login--wrapper">
<form method='POST' class="form" action="/predict">
<h3 style="text-align: center;">Welcome, </h3> <br>
<p>Enter the details in the following input fields to predict the species of the sloth.</p>
<p></p>
<div class="mb-3">
<label for="input_claw_length" class="form-label">Claw_length_cm: </label>
<input type="int" class="form-control" name="claw_length_cm" id="input_claw_length" value="9.266">
</div>
<div class="mb-3">
<label for="input_endangered" class="form-label">Endangered: </label>
<select name="endangered" id="input_endangered" class="form-control">
<option value="least_concern">least_concern</option>
<option value="vulnerable">vulnerable</option>
<option value="critically_endangered">critically_endangered</option>
</select>
</div>
<div class="mb-3">
<label for="input_size_cm" class="form-label">Size_cm: </label>
<input type="int" class="form-control" name="size_cm" id="input_size_cm" value="65.49">
</div>
<!-- <div class="mb-3">
<label for="input_specie">Specie: </label>
<select name="specie" id="input_specie" class="form-control">
<option value="three_toed">three_toed</option>
<option value="two_toed">two_toed</option>
</select>
</div> -->
<div class="mb-3">
<label for="input_sub_specie" class="form-label">sub_specie</label>
<select name="sub_specie" id="input_sub_specie" class="form-control">
<option value="Hoffman’s two-toed sloth">Hoffman’s two-toed sloth</option>
<option value="Linnaeus’s two-toed sloth">Linnaeus’s two-toed sloth</option>
<option value="Pale-throated sloth">Pale-throated sloth</option>
<option value="Brown-throated sloth ">Brown-throated sloth </option>
<option value="Maned three-toed sloth">Maned three-toed sloth</option>
<option value="Pygmy three-toed sloth">Pygmy three-toed sloth</option>
</select>
</div>
<div class="mb-3">
<label for="input_tail_length_cm" class="form-label">Tail_length_cm: </label>
<input type="int" class="form-control" name="tail_length_cm" id="input_tail_length_cm" value="0.691">
</div>
<div class="mb-3">
<label for="input_weight_kg" class="form-label">Weight_kg: </label>
<input type="int" class="form-control" name="weight_kg" id="input_weight_kg" value="4.724
">
</div>
<h5 style="text-align: center;">Result: </h5><b><p style="text-align: center;">{{predicted}}</p></b> <br>
<button type="submit" class="btn btn-dark btn-sm">Predict</button>
<p></p>
<p>Predict <a href="{{ url_for('home') }}">again?</a> </p>
</form>
</div>
</div>
</body>
</html>
9. Create routes using Flask and implement our model
We need to install Flask so that we can define our routes inside the Flask app. We will create a virtual environment in which to install the packages.
Open a new terminal and cd into the project directory, then create the environment:
$ python -m venv env
Activate the environment (on Windows):
.\env\Scripts\activate
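On macOS/Linux, the activation command is:
$ source env/bin/activate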
Install Flask, scikit-learn, numpy and pandas:
$ pip install Flask
$ pip install scikit-learn
$ pip install numpy
$ pip install pandas
Create the app.py file inside the project directory:
from flask import Flask, render_template, request
import pickle
import pandas as pd
import numpy as np

# Load the trained pipeline once at startup
model = pickle.load(open('model.pkl', 'rb'))
app = Flask(__name__)

@app.route('/')
def home():
    return render_template('home.html')

@app.route('/predict', methods=['POST'])
def predict():
    # Form values arrive as strings, so cast the numeric fields to float
    claw_length_cm = float(request.form['claw_length_cm'])
    endangered = request.form['endangered']
    size_cm = float(request.form['size_cm'])
    sub_specie = request.form['sub_specie']
    tail_length_cm = float(request.form['tail_length_cm'])
    weight_kg = float(request.form['weight_kg'])
    # sample_columns = ['claw_length_cm', 'endangered', 'size_cm','sub_specie','tail_length_cm','weight_kg']
    # sample = [[9.266, 'least_concern', 65.49, 'Linnaeus’s two-toed sloth', 0.691, 4.724]]
    arr = [[claw_length_cm, endangered, size_cm, sub_specie, tail_length_cm, weight_kg]]
    df = pd.DataFrame(arr, columns=['claw_length_cm', 'endangered', 'size_cm', 'sub_specie', 'tail_length_cm', 'weight_kg'])
    pred = model.predict(df)
    return render_template('home.html', predicted=pred[0])

if __name__ == "__main__":
    app.run()
Don't forget to put the saved pickle model (model.pkl) inside the project directory. The whole directory structure should look roughly as follows:
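slothspecie/
├── app.py
├── model.pkl
├── env/
└── templates/
    └── home.html
(wsgi.py, Procfile, and requirements.txt will be added in the deployment step below.)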
Now run the app from the terminal:
$ python app.py
10. Deploy on Heroku
Flask's built-in server is not meant for production, so for this tutorial we will use the most common WSGI HTTP server, Gunicorn. For this, we will create a new wsgi.py file.
i. Install Gunicorn
$ pip install gunicorn
wsgi.py:
from app import app

if __name__ == "__main__":
    app.run()
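To check the entry point locally before deploying, you can run Gunicorn directly (Gunicorn runs on Linux/macOS, which is also what Heroku uses; it does not run natively on Windows):
$ gunicorn wsgi:app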
ii. Sign up for Heroku (link)
Creating a Heroku Application:
Log in to your Heroku account, then click on New → Create new app.
Give your application a name and choose the region where you want to host your app, then click on Create app.
Creating a Procfile
A Procfile is a file that specifies the commands the app executes on startup. Create a file named Procfile (no extension) next to app.py and wsgi.py, containing the single line below; it tells Heroku to start a web process by running Gunicorn with the app object from the wsgi module:
web: gunicorn wsgi:app
Create a requirements.txt file so that Heroku knows which packages to install. Inside the terminal, type:
$ pip freeze > requirements.txt
iii. Install Git for Windows/Linux/macOS (link)
iv. Download and install the Heroku CLI from Here
Log in to Heroku via the CLI:
To use any Heroku CLI command, we first need to log in with the following command:
$ heroku login
You'll be prompted to press any key, which opens your web browser to complete the login.
We must make our project a Git repository in order to push it to Heroku's servers.
Open a terminal and change the current directory to the project's root directory, then run:
$ git init
Then we need to add a new remote (Heroku remote) to our repository.
$ heroku git:remote -a sloth-specie
Add the files, create a commit, and push the project to the Heroku remote:
$ git add .
$ git commit -am "first commit"
$ git push heroku master
Now that our app is deployed, check it by opening the URL: https://sloth-specie.herokuapp.com/