Java实现的协同过滤推荐系统

1. Introduction

In this tutorial, we’ll learn all about the Slope One algorithm in Java.

We’ll also show the example implementation for the problem of Collaborative Filtering (CF) – a machine learning technique used by recommendation systems.

This can be used, for example, to predict user interests for specific items.

2. Collaborative Filtering

The Slope One algorithm is an item-based collaborative filtering system. It means that it is completely based on the user-item ranking. When we compute the similarity between objects, we only know the history of rankings, not the content itself. This similarity is then used to predict potential user rankings for user-item pairs not present in the dataset.

The image below show the complete process of obtaining and calculating the rating for the specific user:

At first, users rate different items in the system. Next, the algorithm calculates the similarities. After that, the system makes predictions for user-item ratings, which the user hasn’t rated yet.

For more details on the topic of the collaborative filtering, we can refer to the Wikipedia article.

3. The Slope One Algorithm

Slope One was named as the simplest form of non-trivial item-based collaborative filtering based on ratings. It takes into account both information from all users who rated the same item and from the other items rated by the same user to calculate the similarity matrix.

In our simple example, we are going to predict user rankings on the items in the store.

Let’s start with a simple Java model for our problem and domain.

3.1. Java Model

In our model we have two main objects – items and users. The Item class contains the name of the item:

private String itemName;

On the other hand, the User class contains the username:

private String username;

Finally, we have a InputData class that will be used to initialize the data. Let us assume that we’ll create five different products in the store:

List<Item> items = Arrays.asList(
  new Item("Candy"), 
  new Item("Drink"), 
  new Item("Soda"), 
  new Item("Popcorn"), 
  new Item("Snacks")
);

Moreover, we’ll create three users that randomly rated some of the above-mentioned using the scale from 0.0-1.0, where 0 means no interest, 0.5 somehow interested and 1.0 means totally interested. As a result of data initialization we’ll get a Map with user item-ranking data:

Map<User, HashMap<Item, Double>> data;

3.2. Differences and Frequencies Matrices

Based on the available data, we’ll calculate the relationships between the items, as well as the number of items’ occurrences. For each user, we check his/her rating of the items:

for (HashMap<Item, Double> user : data.values()) {
    for (Entry<Item, Double> e : user.entrySet()) {
        // ...
    }
}

In the next step, we check if the item is existing in our matrices. If this is a first occurrence, we create the new entry in the maps:

if (!diff.containsKey(e.getKey())) {
    diff.put(e.getKey(), new HashMap<Item, Double>());
    freq.put(e.getKey(), new HashMap<Item, Integer>());
}

The first matrix is used to calculate the differences between the user ratings. The values of it might be positive or negative (since the difference between ratings might be negative), and are stored as Double. On the other hand, the frequencies are stored as Integer values.

In the next step we are going to compare the ratings of all items:

for (Entry<Item, Double> e2 : user.entrySet()) {
    int oldCount = 0;
    if (freq.get(e.getKey()).containsKey(e2.getKey())){
        oldCount = freq.get(e.getKey()).get(e2.getKey()).intValue();
    }

    double oldDiff = 0.0;
    if (diff.get(e.getKey()).containsKey(e2.getKey())){
        oldDiff = diff.get(e.getKey()).get(e2.getKey()).doubleValue();
    }
    
    double observedDiff = e.getValue() - e2.getValue();
    freq.get(e.getKey()).put(e2.getKey(), oldCount + 1);
    diff.get(e.getKey()).put(e2.getKey(), oldDiff + observedDiff);
}

If somebody rated the item before, we increase the frequency count by one. Moreover, we check the average difference between the item’s ratings and calculate the new observedDiff.

Please note, that we put the sum of oldDiff and observedDiff as a new value of an item.

Finally, we calculate the similarity scores inside the matrices:

for (Item j : diff.keySet()) {
    for (Item i : diff.get(j).keySet()) {
        double oldValue = diff.get(j).get(i).doubleValue();
        int count = freq.get(j).get(i).intValue();
        diff.get(j).put(i, oldValue / count);
    }
}

The main logic is to divide the calculated item rating’s difference by the number of its occurrences. After that step, we can print out our final differences matrix.

3.3. Predictions

As the main part of the Slope One, we are going to predict all missing ratings based on the existing data. In order to do that, we need to compare the user-item ratings with differences matrix calculated in the previous step:

for (Entry<User, HashMap<Item, Double>> e : data.entrySet()) {
    for (Item j : e.getValue().keySet()) {
        for (Item k : diff.keySet()) {
            double predictedValue =
              diff.get(k).get(j).doubleValue() + e.getValue().get(j).doubleValue();
            double finalValue = predictedValue * freq.get(k).get(j).intValue();
            uPred.put(k, uPred.get(k) + finalValue);
            uFreq.put(k, uFreq.get(k) + freq.get(k).get(j).intValue());
        }
    }
    // ...
}

After that, we need to prepare the “clean” predictions using the code below:

HashMap<Item, Double> clean = new HashMap<Item, Double>();
for (Item j : uPred.keySet()) {
    if (uFreq.get(j) > 0) {
        clean.put(j, uPred.get(j).doubleValue() / uFreq.get(j).intValue());
    }
}
for (Item j : InputData.items) {
    if (e.getValue().containsKey(j)) {
        clean.put(j, e.getValue().get(j));
    } else if (!clean.containsKey(j)) {
        clean.put(j, -1.0);
    }
}

The trick to consider with larger data set is to use only the item entries that have a large frequency value (for example > 1). Please note, that if the prediction is not possible, the value of it will be equal to -1.

Finally, very important note. If our algorithm worked correctly, we should receive the predictions for items that user didn’t rate, but also the repeated ratings for the items that he rated. Those repeated rating’s should not change, otherwise it means that there is a bug in your algorithm implementation.

3.4. Tips

There are few major factors that affect Slope One algorithm. Here are the few tips how to increase the accuracy and processing time:

consider obtaining the user-item ratings on the DB side for large data sets
set the time frame for ratings fetching, as people interests might change over the time – it will also reduce the time required to process input data
split large data sets into smaller ones – you don’t need to calculate predictions for all users every day; you can check if the user interacted with the predicted item and then add/remove him/her from processing queue for the next day

4. Conclusion

In this tutorial we were able to learn about the Slope One algorithm. Moreover, we introduced the collaborative filtering problem for item recommendation systems.

The full implementation of this tutorial can be found in the GitHub project.

Persistence

REST

Security