# Recommender Systems Tutorial

In this tutorial, we’ll provide a simple walkthrough of how to use Snorkel to build a recommender system. We consider a setting similar to the Netflix challenge, but with books instead of movies. We have a set of users and books, and for each user we know the set of books they have interacted with (read or marked as to-read). We don’t have the user’s numerical ratings for the books they read, except in a small number of cases. We also have some text reviews written by users.

Our goal is to build a recommender system by training a classifier to predict whether a user will read and like any given book.
We’ll train our model over a user-book pair to predict a `rating`

(a `rating`

of 1 means the user will read and like the book).
To simplify inference, we’ll represent a user by the set of books they interacted with (rather than learning a specific representation for each user).
Once we have this model trained, we can use it to recommend books to a user when they visit the site.
For example, we can just predict the rating for the user paired with a book for a few thousand likely books, then pick the books with the top ten predicted ratings.

Of course, there are many other ways to approach this problem. The field of recommender systems is a very well studied area with a wide variety of settings and approaches, and we just focus on one of them.

We will use the Goodreads dataset, from
“Item Recommendation on Monotonic Behavior Chains”, RecSys’18 (Mengting Wan, Julian McAuley), and “Fine-Grained Spoiler Detection from Large-Scale Review Corpora”, ACL’19 (Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley).
In this dataset, we have user interactions and reviews for Young Adult novels from the Goodreads website, along with metadata (like `title`

and `authors`

) for the novels.

## Loading Data

We start by running the `download_and_process_data`

function.
The function returns the `df_train`

, `df_test`

, `df_dev`

, `df_valid`

dataframes, which correspond to our training, test, development, and validation sets.
Each of those dataframes has the following fields:

`user_idx`

: A unique identifier for a user.`book_idx`

: A unique identifier for a book that is being rated by the user.`book_idxs`

: The set of books that the user has interacted with (read or planned to read).`review_text`

: Optional text review written by the user for the book.`rating`

: Either`0`

(which means the user did not read or did not like the book) or`1`

(which means the user read and liked the book). The`rating`

field is missing for`df_train`

. Our objective is to predict whether a given user (represented by the set of book_idxs the user has interacted with) will read and like any given book. That is, we want to train a model that takes a set of`book_idxs`

(the user) and a single`book_idx`

(the book to rate) and predicts the`rating`

.

In addition, `download_and_process_data`

also returns the `df_books`

dataframe, which contains one row per book, along with metadata for that book (such as `title`

and `first_author`

).

```
from utils import download_and_process_data
(df_train, df_test, df_dev, df_valid), df_books = download_and_process_data()
df_books.head()
```

We look at a sample of the labeled development set.
As an example, we want our final recommendations model to be able to predict that a user who has interacted with `book_idxs`

(25743, 22318, 7662, 6857, 83, 14495, 30664, …) would either not read or not like the book with `book_idx`

22764 (first row), while a user who has interacted with `book_idxs`

(3880, 18078, 9092, 29933, 1511, 8560, …) would read and like the book with `book_idx`

3181 (second row).

```
df_dev.sample(frac=1, random_state=12).head()
```

user_idx | book_idxs | book_idx | rating | review_text | |
---|---|---|---|---|---|

277346 | 10733 | (18515, 590, 4221, 9716, 4965, 2711, 30370, 21... | 31043 | 1 | NaN |

797926 | 31002 | (22472, 29064, 19059, 4739, 18534, 20559, 1277... | 12231 | 1 | NaN |

624489 | 24170 | (16739, 10142, 31677, 22976, 4903, 831, 22079,... | 22589 | 1 | NaN |

565393 | 21830 | (1532, 19248, 1021, 22612, 18773, 13376, 23564... | 31238 | 1 | NaN |

433259 | 16679 | (22472, 19544, 28628, 3137, 24753, 24680, 836,... | 7754 | 1 | NaN |

## Writing Labeling Functions

```
POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1
```

If a user has interacted with several books written by an author, there is a good chance that the user will read and like other books by the same author.
We express this as a labeling function, using the `first_author`

field in the `df_books`

dataframe.
We picked the threshold 15 by plotting histograms and running error analysis using the dev set.

```
from snorkel.labeling.lf import labeling_function
book_to_first_author = dict(zip(df_books.book_idx, df_books.first_author))
first_author_to_books_df = df_books.groupby("first_author")[["book_idx"]].agg(set)
first_author_to_books = dict(
zip(first_author_to_books_df.index, first_author_to_books_df.book_idx)
)
@labeling_function(
resources=dict(
book_to_first_author=book_to_first_author,
first_author_to_books=first_author_to_books,
)
)
def shared_first_author(x, book_to_first_author, first_author_to_books):
author = book_to_first_author[x.book_idx]
same_author_books = first_author_to_books[author]
num_read = len(set(x.book_idxs).intersection(same_author_books))
return POSITIVE if num_read > 15 else ABSTAIN
```

We can also leverage the long text reviews written by users to guess whether they liked or disliked a book.
For example, the third `df_dev`

entry above has a review with the text `'4.5 STARS'`

, which indicates that the user liked the book.
We write a simple LF that looks for similar phrases to guess the user’s rating of a book.
We interpret >= 4 stars to indicate a positive rating, while < 4 stars is negative.

```
low_rating_strs = [
"one star",
"1 star",
"two star",
"2 star",
"3 star",
"three star",
"3.5 star",
"2.5 star",
"1 out of 5",
"2 out of 5",
"3 out of 5",
]
high_rating_strs = ["5 stars", "five stars", "four stars", "4 stars", "4.5 stars"]
@labeling_function(
resources=dict(low_rating_strs=low_rating_strs, high_rating_strs=high_rating_strs)
)
def stars_in_review(x, low_rating_strs, high_rating_strs):
if not isinstance(x.review_text, str):
return ABSTAIN
for low_rating_str in low_rating_strs:
if low_rating_str in x.review_text.lower():
return NEGATIVE
for high_rating_str in high_rating_strs:
if high_rating_str in x.review_text.lower():
return POSITIVE
return ABSTAIN
```

We can also run TextBlob, a tool that provides a pretrained sentiment analyzer, on the reviews, and use its polarity and subjectivity scores to estimate the user’s rating for the book. As usual, these thresholds were picked by analyzing the score distributions and running error analysis.

```
from snorkel.preprocess import preprocessor
from textblob import TextBlob
@preprocessor(memoize=True)
def textblob_polarity(x):
if isinstance(x.review_text, str):
x.blob = TextBlob(x.review_text)
else:
x.blob = None
return x
# Label high polarity reviews as positive.
@labeling_function(pre=[textblob_polarity])
def polarity_positive(x):
if x.blob:
if x.blob.polarity > 0.3:
return POSITIVE
return ABSTAIN
# Label high subjectivity reviews as positive.
@labeling_function(pre=[textblob_polarity])
def subjectivity_positive(x):
if x.blob:
if x.blob.subjectivity > 0.75:
return POSITIVE
return ABSTAIN
# Label low polarity reviews as negative.
@labeling_function(pre=[textblob_polarity])
def polarity_negative(x):
if x.blob:
if x.blob.polarity < 0.0:
return NEGATIVE
return ABSTAIN
```

```
from snorkel.labeling import PandasLFApplier, LFAnalysis
lfs = [
stars_in_review,
shared_first_author,
polarity_positive,
subjectivity_positive,
polarity_negative,
]
applier = PandasLFApplier(lfs)
L_dev = applier.apply(df_dev)
```

```
LFAnalysis(L_dev, lfs).lf_summary(df_dev.rating)
```

j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. | |
---|---|---|---|---|---|---|---|---|

stars_in_review | 0 | [0, 1] | 0.014573 | 0.003582 | 0.001235 | 93 | 25 | 0.788136 |

shared_first_author | 1 | [1] | 0.028653 | 0.000124 | 0.000000 | 151 | 81 | 0.650862 |

polarity_positive | 2 | [1] | 0.043349 | 0.011609 | 0.001112 | 265 | 86 | 0.754986 |

subjectivity_positive | 3 | [1] | 0.016179 | 0.012103 | 0.001976 | 101 | 30 | 0.770992 |

polarity_negative | 4 | [0] | 0.014820 | 0.002964 | 0.001976 | 77 | 43 | 0.641667 |

### Applying labeling functions to the training set

We apply the labeling functions to the training set, and then filter out data points unlabeled by any LF to form our final training set.

```
from snorkel.labeling.model.label_model import LabelModel
L_train = applier.apply(df_train)
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=5000, seed=123, log_freq=20, lr=0.01)
preds_train = label_model.predict(L_train)
```

```
from snorkel.labeling import filter_unlabeled_dataframe
df_train_filtered, preds_train_filtered = filter_unlabeled_dataframe(
df_train, preds_train, L_train
)
df_train_filtered["rating"] = preds_train_filtered
```

### Rating Prediction Model

We write a Keras model for predicting ratings given a user’s book list and a book (which is being rated).
The model represents the list of books the user interacted with, `books_idxs`

, by learning an embedding for each idx, and averaging the embeddings in `book_idxs`

.
It learns another embedding for the `book_idx`

, the book to be rated.
Then it concatenates the two embeddings and uses an MLP to compute the probability of the `rating`

being 1.
This type of model is common in large-scale recommender systems, for example, the YouTube recommender system.

```
import numpy as np
import tensorflow as tf
from utils import precision_batch, recall_batch, f1_batch
n_books = max([max(df.book_idx) for df in [df_train, df_test, df_dev, df_valid]])
# Keras model to predict rating given book_idxs and book_idx.
def get_model(embed_dim=64, hidden_layer_sizes=[32]):
# Compute embedding for book_idxs.
len_book_idxs = tf.keras.layers.Input([])
book_idxs = tf.keras.layers.Input([None])
# book_idxs % n_books is to prevent crashing if a book_idx in book_idxs is > n_books.
book_idxs_emb = tf.keras.layers.Embedding(n_books, embed_dim)(book_idxs % n_books)
book_idxs_emb = tf.math.divide(
tf.keras.backend.sum(book_idxs_emb, axis=1), tf.expand_dims(len_book_idxs, 1)
)
# Compute embedding for book_idx.
book_idx = tf.keras.layers.Input([])
book_idx_emb = tf.keras.layers.Embedding(n_books, embed_dim)(book_idx)
input_layer = tf.keras.layers.concatenate([book_idxs_emb, book_idx_emb], 1)
# Build Multi Layer Perceptron on input layer.
cur_layer = input_layer
for size in hidden_layer_sizes:
tf.keras.layers.Dense(size, activation=tf.nn.relu)(cur_layer)
output_layer = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(cur_layer)
# Create and compile keras model.
model = tf.keras.Model(
inputs=[len_book_idxs, book_idxs, book_idx], outputs=[output_layer]
)
model.compile(
"Adagrad",
"binary_crossentropy",
metrics=["accuracy", f1_batch, precision_batch, recall_batch],
)
return model
```

We use triples of (`book_idxs`

, `book_idx`

, `rating`

) from our dataframes as training data points. In addition, we want to train the model to recognize when a user will not read a book. To create data points for that, we randomly sample a `book_id`

not in `book_idxs`

and use that with a `rating`

of 0 as a *random negative* data point. We create one such *random negative* data point for every positive (`rating`

1) data point in our dataframe so that positive and negative data points are roughly balanced.

```
# Generator to turn dataframe into data points.
def get_data_points_generator(df):
def generator():
for book_idxs, book_idx, rating in zip(df.book_idxs, df.book_idx, df.rating):
# Remove book_idx from book_idxs so the model can't just look it up.
book_idxs = tuple(filter(lambda x: x != book_idx, book_idxs))
yield {
"len_book_idxs": len(book_idxs),
"book_idxs": book_idxs,
"book_idx": book_idx,
"label": rating,
}
if rating == 1:
# Generate a random negative book_id not in book_idxs.
random_negative = np.random.randint(0, n_books)
while random_negative in book_idxs:
random_negative = np.random.randint(0, n_books)
yield {
"len_book_idxs": len(book_idxs),
"book_idxs": book_idxs,
"book_idx": random_negative,
"label": 0,
}
return generator
def get_data_tensors(df):
# Use generator to get data points each epoch, along with shuffling and batching.
padded_shapes = {
"len_book_idxs": [],
"book_idxs": [None],
"book_idx": [],
"label": [],
}
dataset = (
tf.data.Dataset.from_generator(
get_data_points_generator(df), {k: tf.int64 for k in padded_shapes}
)
.shuffle(123)
.repeat(None)
.padded_batch(batch_size=256, padded_shapes=padded_shapes)
)
tensor_dict = tf.compat.v1.data.make_one_shot_iterator(dataset).get_next()
return (
(
tensor_dict["len_book_idxs"],
tensor_dict["book_idxs"],
tensor_dict["book_idx"],
),
tensor_dict["label"],
)
```

We now train the model on our combined training data (data labeled by LFs plus dev data).

```
from utils import get_n_epochs
model = get_model()
X_train, Y_train = get_data_tensors(df_train_filtered)
X_valid, Y_valid = get_data_tensors(df_valid)
model.fit(
X_train,
Y_train,
steps_per_epoch=300,
validation_data=(X_valid, Y_valid),
validation_steps=40,
epochs=get_n_epochs(),
verbose=1,
)
```

Finally, we evaluate the model’s predicted ratings on our test data.

```
X_test, Y_test = get_data_tensors(df_test)
_ = model.evaluate(X_test, Y_test, steps=30)
```

```
30/30 [==============================] - 1s 32ms/step - loss: 0.6717 - acc: 0.6482 - f1_batch: 0.4793 - precision_batch: 0.5636 - recall_batch: 0.4296
```

Our model has generalized quite well to our test set! Note that we should additionally measure ranking metrics, like precision@10, before deploying to production.

## Summary

In this tutorial, we showed one way to use Snorkel for recommendations.
We used book metadata and review text to create LFs that estimate user ratings.
We used Snorkel’s `LabelModel`

to combine the outputs of those LFs.
Finally, we trained a model to predict whether a user will read and like a given book (and therefore what books should be recommended to the user) based only on what books the user has interacted with in the past.

Here we demonstrated one way to use Snorkel for training a recommender system. Note, however, that this approach could easily be adapted to take advantage of additional information as it is available (e.g., user profile data, denser user ratings, and so on.)