1. Introduction

Feature engineering improves the accuracy of machine-learning models by deriving new variables from the existing data.

In this tutorial, we’ll look into a particular type of feature engineering we call one-hot encoding.

2. Why Do We Encode Data?

We use one-hot encoding when our data contain nominal (categorical) features. By definition, the possible values of such features cannot be ordered. For example, in a gender column, we can't rank male and female as female > male or male > female. For the same reason, we can't simply assign numerical values to the categories, e.g., male → 0 and female → 1, since the numbers would imply an order that isn't there and could bias learning algorithms.

Additionally, some machine-learning algorithms expect the input data to contain only numbers, not strings (categories). However, many datasets have at least one categorical variable.

One-hot encoding addresses those issues.

3. What Is One-Hot Encoding?

In one-hot encoding, we convert categorical data to multidimensional binary vectors.

The number of dimensions equals the number of categories, and each category gets its own dimension. We then encode each category as the vector whose element in that category's dimension is 1 and whose other elements are all 0, hence the name.

For example, let’s suppose the categorical variable denotes weather prediction and has three categories: sunny, rain, and wind. Then, the encoding can be sunny = [1,0,0], rain = [0,1,0], wind = [0,0,1]:

[Figure: one-hot encoding example]
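The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation; the category list and its ordering are our own choice:

```python
# Hand-rolled one-hot encoding for the three weather categories.
categories = ["sunny", "rain", "wind"]

def one_hot(category, categories):
    """Return a binary vector with a single 1 at the category's index."""
    vector = [0] * len(categories)
    vector[categories.index(category)] = 1
    return vector

print(one_hot("sunny", categories))  # [1, 0, 0]
print(one_hot("rain", categories))   # [0, 1, 0]
print(one_hot("wind", categories))   # [0, 0, 1]
```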

3.1. Example

Let’s say we have a Pokemon dataset:

| Name       | Total | HP  | Attack | Defence | Type   |
|------------|-------|-----|--------|---------|--------|
| Beedrill   | 395   | 65  | 90     | 40      | Poison |
| Gastly     | 310   | 30  | 35     | 30      | Poison |
| Pidgey     | 251   | 40  | 45     | 40      | Flying |
| Wigglytuff | 435   | 140 | 70     | 45      | Fairy  |

Two features are categorical: Name and Type. The Name column acts as the id, so we'll disregard it when training machine-learning models. However, the Type column contains information that is relevant to learning tasks.

To use it, we apply one-hot encoding. As there are three categories, the vectors will have three dimensions. In each dataset row, we replace the Type category with an encoded vector that has 1 in the position corresponding to the category and contains zeroes in the other two dimensions:

| Name       | Total | HP  | Attack | Defence | Type-Poison | Type-Flying | Type-Fairy |
|------------|-------|-----|--------|---------|-------------|-------------|------------|
| Beedrill   | 395   | 65  | 90     | 40      | 1           | 0           | 0          |
| Gastly     | 310   | 30  | 35     | 30      | 1           | 0           | 0          |
| Pidgey     | 251   | 40  | 45     | 40      | 0           | 1           | 0          |
| Wigglytuff | 435   | 140 | 70     | 45      | 0           | 0           | 1          |
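In practice, we rarely build these columns by hand. Assuming the data lives in a pandas DataFrame, `pd.get_dummies` is one common way to produce the encoded table (note that pandas joins the prefix with an underscore, e.g. `Type_Poison`, rather than the hyphen used above):

```python
import pandas as pd

# Rebuild the sample Pokemon table.
df = pd.DataFrame({
    "Name":    ["Beedrill", "Gastly", "Pidgey", "Wigglytuff"],
    "Total":   [395, 310, 251, 435],
    "HP":      [65, 30, 40, 140],
    "Attack":  [90, 35, 45, 70],
    "Defence": [40, 30, 40, 45],
    "Type":    ["Poison", "Poison", "Flying", "Fairy"],
})

# Replace the Type column with one binary column per category.
encoded = pd.get_dummies(df, columns=["Type"], prefix="Type", dtype=int)
print(encoded.columns.tolist())
```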

3.2. Dimensionality

If we have a categorical column with a lot of categories, one-hot encoding will add an excessive number of new features. That will take a lot of space and make learning algorithms slow.

To address that issue, we can find the top n most frequent categories and encode only them. For the rest, we create a special column “other” or ignore them. The exact choice of n depends on our processing power. In practice, we usually go with 10 or 20.
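The top-n strategy can be sketched as follows. This is a simplified illustration with a made-up column of colors; the threshold n is a tuning choice, as noted above:

```python
from collections import Counter

def limit_categories(values, n):
    """Keep the n most frequent categories; lump the rest into 'other'."""
    top = {cat for cat, _ in Counter(values).most_common(n)}
    return [v if v in top else "other" for v in values]

colors = ["red", "red", "blue", "green", "blue", "red", "violet", "teal"]
print(limit_categories(colors, 2))
# ['red', 'red', 'blue', 'other', 'blue', 'red', 'other', 'other']
```

After this step, one-hot encoding the column produces at most n + 1 new features instead of one per distinct category.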

This issue can also occur if we have multiple categorical variables that, in total, produce too many new columns.

4. Conclusion

In this article, we explored one-hot encoding and the motivations to apply it. It enables us to use standard machine-learning algorithms on categorical data but may increase dimensionality and slow down the training.

