1. Introduction

In this tutorial, we’ll explain the Scikit-learn (Sklearn) Pipeline class and how to use it.

2. What is Scikit-Learn?

Scikit-learn, or Sklearn, is a popular machine learning library for the Python programming language. It provides algorithms and utilities for classification, regression, clustering, model selection, data preprocessing, and more. Sklearn is well-documented and user-friendly, making it a popular choice for both beginners and experienced developers.

One of its useful but perhaps less commonly utilized classes is Pipeline, which we’ll explain further below.

3. What is Scikit-Learn Pipeline Class?

The Pipeline class in Sklearn is a utility that helps automate the process of transforming data and applying models. Often in machine learning modeling, we need to apply the same sequence of steps to both the training and test data. For example, we might want to standardize the input features, apply PCA, and then predict with logistic regression.

With the Pipeline class, these steps can be easily combined into one object and then applied to training and test data.
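As a minimal sketch of the standardize, PCA, and logistic regression example mentioned above, the three steps can be chained as follows. The Iris data, the choice of two principal components, and the step names ('scaler', 'pca', 'classifier') are assumptions made only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data: the Iris dataset with an 80/20 split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Each step is a (name, estimator) pair; the names are arbitrary labels
pca_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),  # assumed number of components
    ('classifier', LogisticRegression()),
])

# fit() fits and applies each transformer in order, then trains the classifier
pca_pipeline.fit(X_train, y_train)

# score() runs the test data through the fitted transformers before scoring
test_accuracy = pca_pipeline.score(X_test, y_test)
print(f'Test accuracy: {test_accuracy:.3f}')
```

The whole workflow is now a single object, so one call to fit and one call to score (or predict) cover every step.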

Key features of the Sklearn Pipeline:

  • Sequential steps – the Pipeline runs its steps in the order in which we define them.
  • Consistency – the Pipeline ensures that we apply the same fitted transformations to the training data and to any new data during prediction.
  • Simplified code – a single object replaces the intermediate variables, which makes the code cleaner and easier to manage.
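The first two properties can be checked directly. The sketch below, which assumes a small two-step pipeline fitted on the Iris data purely for illustration, lists the steps in their defined order and confirms that predicting through the pipeline matches applying the fitted steps by hand.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])
pipe.fit(X, y)

# Sequential steps: the order is exactly the order of definition
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['scaler', 'classifier']

# Consistency: the fitted steps are reused as-is at prediction time
fitted_scaler = pipe.named_steps['scaler']
fitted_classifier = pipe.named_steps['classifier']

new_samples = X[:5]
manual_pred = fitted_classifier.predict(fitted_scaler.transform(new_samples))
pipeline_pred = pipe.predict(new_samples)

predictions_match = np.array_equal(manual_pred, pipeline_pred)
print(predictions_match)
```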

To help better understand the Pipeline class, we will present a few examples below.

4. Examples of Scikit-Learn Pipeline Class

As an example, we’ll use the Iris dataset that ships with Sklearn for multi-class classification. We’ll load the data, split it into training and test sets, select the best 2 features based on the ANOVA F-value, standardize the features, and use logistic regression to predict classes.

4.1. Example Without the Pipeline Class

One approach without the Pipeline class would look like this:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load and split dataset
iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature selection
selector = SelectKBest(score_func=f_classif, k=2)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_selected)
X_test_scaled = scaler.transform(X_test_selected)

# Train the model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

4.2. Example With the Pipeline Class

**The same example with the Pipeline class looks like this:**

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load and split dataset
iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline with feature selection, scaling, and model training
pipeline = Pipeline([
    ('feature_selection', SelectKBest(score_func=f_classif, k=2)),
    ('scaler', StandardScaler()),
    ('logistic_regression', LogisticRegression())
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Notice that in the first example, we need to apply feature selection and scaling to the training and test sets separately, calling fit_transform on the training data and transform on the test data. We also need a variable to store the output of every preprocessing step for each set (X_train_selected, X_test_selected, X_train_scaled, and X_test_scaled).
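If the intermediate data is still needed for inspection, it can be recovered from a fitted pipeline rather than stored along the way. The following is a sketch that assumes a Sklearn version supporting Pipeline slicing (0.21 or later); the variable names preprocessing and X_test_preprocessed are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ('feature_selection', SelectKBest(score_func=f_classif, k=2)),
    ('scaler', StandardScaler()),
    ('logistic_regression', LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# Slicing off the last step gives a sub-pipeline with only the
# already fitted preprocessing steps
preprocessing = pipeline[:-1]
X_test_preprocessed = preprocessing.transform(X_test)
print(X_test_preprocessed.shape)  # (30, 2)

# Boolean mask showing which of the four original features were kept
selected_mask = pipeline.named_steps['feature_selection'].get_support()
print(selected_mask)
```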

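In the second example, the Pipeline class handles these calls internally: fit fits and applies each transformer in order before training the model, and predict applies the already fitted transformers before predicting. If we need to change some data preprocessing steps, we only need to modify the Pipeline initialization.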
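Besides editing the initialization, steps can also be adjusted on an existing object with set_params. The sketch below keeps three features instead of two and swaps StandardScaler for MinMaxScaler; both changes are arbitrary choices made for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ('feature_selection', SelectKBest(score_func=f_classif, k=2)),
    ('scaler', StandardScaler()),
    ('logistic_regression', LogisticRegression()),
])

# Change a parameter of one step using the <step name>__<parameter> syntax
pipeline.set_params(feature_selection__k=3)

# Replace a whole step by assigning a new estimator to its name
pipeline.set_params(scaler=MinMaxScaler())

# Training, prediction, and evaluation code stays exactly the same
pipeline.fit(X_train, y_train)
modified_accuracy = pipeline.score(X_test, y_test)
print(f'Accuracy: {modified_accuracy}')
```

The calls to fit and score are unchanged, which is the practical benefit: the preprocessing can vary while the surrounding code does not.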

5. Conclusion

In this article, we explained the Sklearn Pipeline class and walked through an example of using it. The Pipeline class reduces code complexity, ensures consistency, and minimizes the risk of errors, making it useful for both beginners and experienced developers. The examples show how Pipeline can turn a multi-step workflow into a more manageable and elegant solution.