
🚀 Snorkel Intro Tutorial: Data Labeling

In this tutorial, we will walk through the process of using Snorkel to build a training set for classifying YouTube comments as spam or not spam. The goal of this tutorial is to illustrate the basic components and concepts of Snorkel in a simple way, but also to dive into the actual process of iteratively developing real applications in Snorkel.

Our goal is to train a classifier over the comment data that can predict whether a comment is spam or not spam. We have access to a large amount of unlabeled data in the form of YouTube comments with some metadata. In order to train a classifier, we need to label our data, but doing so by hand for real world applications can be prohibitively slow and expensive, often taking person-weeks or months.

We therefore turn to weak supervision using labeling functions (LFs): noisy, programmatic rules and heuristics that assign labels to unlabeled training data. We’ll dive into the Snorkel API and how we write labeling functions later in this tutorial, but as an example, we can write an LF that labels data points with "http" in the comment text as spam since many spam comments contain links:

from snorkel.labeling import labeling_function

@labeling_function()
def lf_contains_link(x):
    # Return a label of SPAM if "http" in comment text, otherwise ABSTAIN
    return SPAM if "http" in x.text.lower() else ABSTAIN

The tutorial is divided into five parts:

  1. Loading Data: We load a YouTube comments dataset, originally introduced in “TubeSpam: Comment Spam Filtering on YouTube”, ICMLA’15 (T.C. Alberto, J.V. Lochter, J.V. Almeida).

  2. Writing Labeling Functions: We write Python programs that take as input a data point and assign labels (or abstain) using heuristics, pattern matching, and third-party models.

  3. Writing More Labeling Functions: We expand our initial set of LFs with keyword, pattern-matching, heuristic, and third-party-model LFs, using preprocessors where helpful.

  4. Combining Labeling Function Outputs with the Label Model: We use the outputs of the labeling functions over the training set as input to the label model, which assigns probabilistic labels to the training set.

  5. Training a Classifier: We train a classifier that can predict labels for any YouTube comment (not just the ones labeled by the labeling functions) using the probabilistic training labels from step 4.

Task: Spam Detection

We use a YouTube comments dataset that consists of YouTube comments from 5 videos. The task is to classify each comment as being

  • HAM: comments relevant to the video (even very simple ones), or
  • SPAM: irrelevant (often trying to advertise something) or inappropriate messages

For example, the following comments are SPAM:

    "Subscribe to me for free Android games, apps.."

    "Please check out my vidios"

    "Subscribe to me and I'll subscribe back!!!"

and these are HAM:

    "3:46 so cute!"

    "This looks so fun and it's a good song"

    "This is a weird video."

Data Splits in Snorkel

We split our data into 4 sets:

  • Training Set: The largest split of the dataset, and the one without ground truth (“gold”) labels. We will generate labels for these data points with weak supervision.
  • [Optional] Development Set: A small labeled subset of the training data (e.g. 100 points) to guide LF development. See note below.
  • Validation Set: A small labeled set used to tune hyperparameters while training the classifier.
  • Test Set: A labeled set for final evaluation of our classifier. This set should only be used for final evaluation, not error analysis.

While it is possible to develop labeling functions on the unlabeled training set only, users often find it more time-efficient to label a small dev set to provide a quick approximate signal on the accuracies and failure modes of their LFs (rather than scrolling through training data points and mentally assessing approximate accuracy). Alternatively, users sometimes will have the validation set also serve as the development set. Do the latter only with caution: because the labeling functions will be based on data points from the validation set, the validation set will no longer be an unbiased proxy for the test set.

1. Loading Data

We load the YouTube comments dataset and create Pandas DataFrame objects for the train, validation, and test sets. DataFrames are extremely popular in Python data analysis workloads, and Snorkel provides native support for several DataFrame-like data structures, including Pandas, Dask, and PySpark. For more information on working with Pandas DataFrames, see the Pandas DataFrame guide.

Each DataFrame consists of the following fields:

  • author: Username of the comment author
  • date: Date and time the comment was posted
  • text: Raw text content of the comment
  • label: Whether the comment is SPAM (1), HAM (0), or UNKNOWN/ABSTAIN (-1)
  • video: Video the comment is associated with

We start by loading our data. The load_spam_dataset() method downloads the raw CSV files from the internet, divides them into splits, converts them into DataFrames, and shuffles them. As mentioned above, the dataset contains comments from 5 of the most popular YouTube videos during a period between 2014 and 2015.

  • The first four videos’ comments are combined to form the train set. This set has no gold labels.
  • The dev set is a random sample of 200 data points from the train set with gold labels added.
  • The fifth video is split 50/50 between a validation set (valid) and test set.
from utils import load_spam_dataset

df_train, df_dev, df_valid, df_test = load_spam_dataset()

# We pull out the label vectors for ease of use later
Y_dev = df_dev.label.values
Y_valid = df_valid.label.values
Y_test = df_test.label.values

Let’s view 5 data points from the dev set.

df_dev.sample(5, random_state=3)
author date text label video
128 Pepe The Meme King 2015-05-19T03:49:29.427000 everyday I'm shufflin 0 3
151 Melissa Erhart NaN Check out this playlist on YouTube:chcfcvzfzfb... 1 4
31 Angel 2014-11-02T17:27:09 Hi there~I'm group leader of Angel, a rookie K... 1 1
29 Sandeep Singh 2015-05-23T17:51:58.957000 Charlie from LOST 0 4
237 BigBird Larry 2015-05-24T09:48:00.835000 Every single one of his songs brings me back t... 0 4

The class distribution varies slightly between SPAM and HAM, but they’re approximately class-balanced. You can verify this by looking at the dev set labels.

# For clarity, we define constants to represent the class labels for spam, ham, and abstaining.
ABSTAIN = -1
HAM = 0
SPAM = 1

print(f"Dev SPAM frequency: {100 * (df_dev.label.values == SPAM).mean():.1f}%")
Dev SPAM frequency: 54.0%

2. Writing Labeling Functions (LFs)

A gentle introduction to LFs

Labeling functions (LFs) help users encode domain knowledge and other supervision sources programmatically.

LFs are heuristics that take as input a data point and either assign a label to it (in this case, HAM or SPAM) or abstain (don’t assign any label). Labeling functions can be noisy: they don’t have perfect accuracy and don’t have to label every data point. Moreover, different labeling functions can overlap (label the same data point) and even conflict (assign different labels to the same data point). This is expected, and we demonstrate how we deal with this later.

Because their only requirement is that they map a data point to a label (or abstain), they can wrap a wide variety of forms of supervision. Examples include, but are not limited to:

  • Keyword searches: looking for specific words in a sentence
  • Pattern matching: looking for specific syntactical patterns
  • Third-party models: using a pre-trained model (usually a model for a different task than the one at hand)
  • Distant supervision: using an external knowledge base (see the toy sketch after this list)
  • Crowdworker labels: treating each crowdworker as a black-box function that assigns labels to subsets of the data
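
As a toy illustration of the distant supervision pattern (our own sketch, not part of the tutorial's LF set), an LF might consult an external resource such as a blocklist of link-shortener domains; the blocklist below is made up purely for illustration:

from snorkel.labeling import labeling_function

# Made-up "external knowledge base" for illustration: domains often used in spam links.
SPAMMY_DOMAINS = {"bit.ly", "goo.gl", "tinyurl.com"}


@labeling_function()
def lf_spammy_domain(x):
    # Distant supervision: vote SPAM if the comment mentions a blocklisted domain.
    text = x.text.lower()
    return SPAM if any(domain in text for domain in SPAMMY_DOMAINS) else ABSTAIN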

Typical LF development cycles include multiple iterations of ideation, refining, evaluation, and debugging. A typical cycle consists of the following steps:

  1. Look at examples to generate ideas for LFs
  2. Write an initial version of an LF
  3. Spot check its performance by looking at its output on data points in the training set (or development set if available)
  4. Refine and debug to improve coverage or accuracy as necessary

Our goal for LF development is to create a high quality set of training labels for our unlabeled dataset, not to label everything or directly create a model for inference using the LFs. The training labels are used to train a separate discriminative model (in this case, one which just uses the comment text) in order to generalize to new, unseen data points. Using this model, we can make predictions for data points that our LFs don’t cover.

We’ll walk through the development of two LFs using basic analysis tools in Snorkel, then provide a full set of LFs that we developed for this tutorial.

a) Exploring the development set for initial ideas

We’ll start by looking at 20 random data points from the train set to generate some ideas for LFs.

df_train[["author", "text", "video"]].sample(20, random_state=2)
author text video
4 ambareesh nimkar "eye of the tiger" "i am the champion" seems l... 2
87 pratik patel mindblowing dance.,.,.superbbb song 3
14 RaMpAgE420 Check out Berzerk video on my channel ! :D 4
80 Jason Haddad Hey, check out my new website!! This site is a... 1
104 austin green Eminem is my insperasen and fav 4
305 M.E.S hey guys look im aware im spamming and it piss... 4
22 John Monster Οh my god ... Roar is the most liked video at ... 2
338 Alanoud Alsaleh I started hating Katy Perry after finding out ... 2
336 Leonardo Baptista http://www.avaaz.org/po/petition/Youtube_Corpo... 1
143 UKz DoleSnacher Remove This video its wank 1
163 Monica Parker Check out this video on YouTube: 3
129 b0b1t.48058475 i rekt ur mum last nite. cuz da haterz were 2 ... 2
277 MeSoHornyMeLuvULongTime This video is so racist!!! There are only anim... 2
265 HarveyIsTheBoss You gotta say its funny. well not 2 billion wo... 1
214 janez novak share and like this page to win a hand signed ... 4
76 Bizzle Sperq https://www.facebook.com/nicushorbboy add mee ... 1
123 Gaming and Stuff PRO Hello! Do you like gaming, art videos, scienti... 1
268 Young IncoVEVO Check out my Music Videos! and PLEASE SUBSCRIB... 1
433 Chris Edgar Love the way you lie - Driveshaft 4
40 rap classics check out my channel for rap and hip hop music 4

One dominant pattern in the comments that look like spam is the use of the phrase “check out” (e.g. “check out my channel”). Let’s start with that.

b) Writing an LF to identify spammy comments that use the phrase “check out”

Labeling functions in Snorkel are created with the @labeling_function decorator. The decorator can be applied to any Python function that returns a label for a single data point.

Let’s start developing an LF to catch instances of commenters trying to get people to “check out” their channel, video, or website. We’ll start by just looking for the exact string "check out" in the text, and see how that compares to looking for just "check" in the text. For the two versions of our rule, we’ll write a Python function over a single data point that expresses it, then add the decorator.

from snorkel.labeling import labeling_function


@labeling_function()
def check(x):
    return SPAM if "check" in x.text.lower() else ABSTAIN


@labeling_function()
def check_out(x):
    return SPAM if "check out" in x.text.lower() else ABSTAIN

To apply one or more LFs that we’ve written to a collection of data points, we use an LFApplier. Because our data points are represented with a Pandas DataFrame in this tutorial, we use the PandasLFApplier. Correspondingly, a single data point x that’s passed into our LFs will be a Pandas Series object.

It’s important to note that these LFs will work for any object with an attribute named text, not just Pandas objects. Snorkel has several other appliers for different data point collection types which you can browse in the API documentation.
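
Since each LF is just a callable over a single data point, we can sanity-check one on any object with a text attribute before running a full applier. A quick sketch (the example comments here are made up):

from types import SimpleNamespace

# Any object with a `text` attribute works, not just a Pandas Series.
print(check_out(SimpleNamespace(text="Please check out my channel!")))  # 1 (SPAM)
print(check(SimpleNamespace(text="great song")))  # -1 (ABSTAIN)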

The output of the apply(...) method is a label matrix, a fundamental concept in Snorkel. It’s a NumPy array L with one column for each LF and one row for each data point, where L[i, j] is the label that the jth labeling function output for the ith data point. We’ll create one label matrix for the train set and one for the dev set.

from snorkel.labeling import PandasLFApplier

lfs = [check_out, check]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)
L_dev = applier.apply(df=df_dev)
L_train
array([[-1, -1],
       [-1, -1],
       [-1,  1],
       ...,
       [ 1,  1],
       [-1,  1],
       [ 1,  1]])

c) Evaluate performance on training and development sets

We can easily calculate the coverage of these LFs (i.e., the percentage of the dataset that they label) as follows:

coverage_check_out, coverage_check = (L_train != ABSTAIN).mean(axis=0)
print(f"check_out coverage: {coverage_check_out * 100:.1f}%")
print(f"check coverage: {coverage_check * 100:.1f}%")
check_out coverage: 21.4%
check coverage: 25.8%

Lots of statistics about labeling functions — like coverage — are useful when building any Snorkel application. So Snorkel provides tooling for common LF analyses using the LFAnalysis utility. We report the following summary statistics for multiple LFs at once:

  • Polarity: The set of unique labels this LF outputs (excluding abstains)
  • Coverage: The fraction of the dataset the LF labels
  • Overlaps: The fraction of the dataset where this LF and at least one other LF label
  • Conflicts: The fraction of the dataset where this LF and at least one other LF label and disagree
  • Correct: The number of data points this LF labels correctly (if gold labels are provided)
  • Incorrect: The number of data points this LF labels incorrectly (if gold labels are provided)
  • Empirical Accuracy: The empirical accuracy of this LF (if gold labels are provided)

For Correct, Incorrect, and Empirical Accuracy, we don’t want to penalize the LF for data points where it abstained. We calculate these statistics only over those data points where the LF output a label. Since we have labels for the dev set but not the train set, we’ll compute these statistics for the dev set only by supplying Y_dev.
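
As a concrete illustration of how these statistics are defined (a sketch we add here, not code from the tutorial), coverage and empirical accuracy for a single LF's column of the label matrix can be computed by hand, counting only the data points where the LF did not abstain:

def lf_coverage_and_accuracy(L_col, Y_gold):
    """Coverage and empirical accuracy for one LF, ignoring abstains."""
    labeled = L_col != ABSTAIN
    coverage = labeled.mean()
    accuracy = (L_col[labeled] == Y_gold[labeled]).mean() if labeled.any() else float("nan")
    return coverage, accuracy

# e.g. for the `check` LF (column 1) on the dev set:
# lf_coverage_and_accuracy(L_dev[:, 1], Y_dev)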

from snorkel.labeling import LFAnalysis

LFAnalysis(L=L_train, lfs=lfs).lf_summary()
j Polarity Coverage Overlaps Conflicts
check_out 0 [1] 0.214376 0.214376 0.0
check 1 [1] 0.257881 0.214376 0.0
LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=Y_dev)
j Polarity Coverage Overlaps Conflicts Correct Incorrect Emp. Acc.
check_out 0 [1] 0.22 0.22 0.0 22 0 1.000000
check 1 [1] 0.30 0.22 0.0 29 1 0.966667

So even these very simple rules do quite well! We might want to pick the check rule, since both have high precision and check has higher coverage. But let’s look at our data to be sure.

The helper method get_label_buckets(...) groups data points by their predicted label and true label. For example, we can find the indices of data points that the LF labeled SPAM that actually belong to class HAM. This may give ideas for where the LF could be made more specific.

from snorkel.analysis import get_label_buckets

buckets = get_label_buckets(Y_dev, L_dev[:, 1])
df_dev.iloc[buckets[(HAM, SPAM)]]
author date text label video
260 Eanna Cusack 2014-01-20T22:20:59 Im just to check how much views it has 0 1

There’s only one row here because check produced only one false positive on the dev set. Now let’s take a look at 10 random train set data points where check labeled SPAM to see if it matches our intuition or if we can identify some false positives.

df_train.iloc[L_train[:, 1] == SPAM].sample(10, random_state=1)
author date text label video
305 M.E.S NaN hey guys look im aware im spamming and it piss... -1.0 4
265 Kawiana Lewis 2015-02-27T02:20:40.987000 Check out this video on YouTube:opponents mm <... -1.0 3
89 Stricker Stric NaN eminem new song check out my videos -1.0 4
147 TheGenieBoy NaN check out fantasy music right here -------&... -1.0 4
240 Made2Falter 2014-09-09T23:55:30 Check out our vids, our songs are awesome! And... -1.0 2
273 Artady 2014-08-11T16:27:55 https://soundcloud.com/artady please check my ... -1.0 2
94 Nick McGoldrick 2014-10-27T13:19:06 Check out my drum cover of E.T. here! thanks -... -1.0 2
139 MFkin PRXPHETZ 2014-01-20T09:08:39 if you like raw talent, raw lyrics, straight r... -1.0 1
303 이 정훈 NaN This great Warning will happen soon. ,0\nLneaD... -1.0 4
246 media.uploader NaN Check out my channel to see Rihanna short mix ... -1.0 4

No clear false positives here, but many look like they could be labeled by check_out as well. Let’s see 10 data points where check_out abstained, but check labeled.

buckets = get_label_buckets(L_train[:, 0], L_train[:, 1])
df_train.iloc[buckets[(ABSTAIN, SPAM)]].sample(10, random_state=1)
author date text label video
403 ownpear902 2014-07-22T18:44:36.299000 check it out free stuff for watching videos an... -1.0 3
256 PacKmaN 2014-11-05T21:56:39 check men out i put allot of effort into my mu... -1.0 1
196 Angek95 2014-11-03T22:28:56 Check my channel, please! -1.0 1
282 CronicleFPS 2014-11-06T03:10:26 Check me out I'm all about gaming  -1.0 1
352 MrJtill0317 NaN ┏━━━┓┏┓╋┏┓┏━━━┓┏━━━┓┏┓╋╋┏┓ ┃┏━┓┃┃┃╋┃┃┃┏━┓┃┗┓┏... -1.0 4
161 MarianMusicChannel 2014-08-24T03:57:52 Hello! I'm Marian, I'm a singer from Venezuela... -1.0 2
270 Kyle Jaber 2014-01-19T00:21:29 Check me out! I'm kyle. I rap so yeah  -1.0 1
292 Soundhase 2014-08-19T18:59:38 Hi Guys! check this awesome EDM &amp; House mi... -1.0 2
179 Nerdy Peach 2014-10-29T22:44:41 Hey! I'm NERDY PEACH and I'm a new youtuber an... -1.0 2
16 zhichao wang 2013-11-29T02:13:56 i think about 100 millions of the views come f... -1.0 1

Most of these seem like small modifications of “check out”, like “check me out” or “check it out”. Can we get the best of both worlds?

d) Balance accuracy and coverage

Let’s see if we can use regular expressions to account for modifications of “check out” and get the coverage of check plus the accuracy of check_out.

import re


@labeling_function()
def regex_check_out(x):
    return SPAM if re.search(r"check.*out", x.text, flags=re.I) else ABSTAIN

Again, let’s generate our label matrices and see how we do.

lfs = [check_out, check, regex_check_out]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)
L_dev = applier.apply(df=df_dev)
LFAnalysis(L=L_train, lfs=lfs).lf_summary()
j Polarity Coverage Overlaps Conflicts
check_out 0 [1] 0.214376 0.214376 0.0
check 1 [1] 0.257881 0.233922 0.0
regex_check_out 2 [1] 0.233922 0.233922 0.0
LFAnalysis(L_dev, lfs).lf_summary(Y=Y_dev)
j Polarity Coverage Overlaps Conflicts Correct Incorrect Emp. Acc.
check_out 0 [1] 0.22 0.22 0.0 22 0 1.000000
check 1 [1] 0.30 0.29 0.0 29 1 0.966667
regex_check_out 2 [1] 0.29 0.29 0.0 29 0 1.000000

We’ve split the difference in train set coverage, and increased our accuracy on the dev set to 100%! This looks promising. Let’s verify that we corrected our false positive from before.

buckets = get_label_buckets(L_dev[:, 1], L_dev[:, 2])
df_dev.iloc[buckets[(SPAM, ABSTAIN)]]
author date text label video
260 Eanna Cusack 2014-01-20T22:20:59 Im just to check how much views it has 0 1

To understand the coverage difference between check and regex_check_out, let’s take a look at 10 data points from the train set. Remember: coverage isn’t always good. Adding false positives will increase coverage.

buckets = get_label_buckets(L_train[:, 1], L_train[:, 2])
df_train.iloc[buckets[(SPAM, ABSTAIN)]].sample(10, random_state=1)
author date text label video
16 zhichao wang 2013-11-29T02:13:56 i think about 100 millions of the views come f... -1.0 1
99 Santeri Saariokari 2014-09-03T16:32:59 Hey guys go to check my video name "growtopia ... -1.0 2
21 BeBe Burkey 2013-11-28T16:30:13 and u should.d check my channel and tell me wh... -1.0 1
239 Cony 2013-11-28T16:01:47 You should check my channel for Funny VIDEOS!! -1.0 1
288 Kochos 2014-01-20T17:08:37 i check back often to help reach 2x10^9 views ... -1.0 1
65 by.Ovskiy 2014-10-13T17:09:46 Rap from Belarus, check my channel:) -1.0 2
196 Angek95 2014-11-03T22:28:56 Check my channel, please! -1.0 1
333 FreexGaming 2014-10-18T08:12:26 want to win borderlands the pre-sequel? check ... -1.0 2
167 Brandon Pryor 2014-01-19T00:36:25 I dont even watch it anymore i just come here ... -1.0 1
266 Zielimeek21 2013-11-28T21:49:00 I'm only checking the views -1.0 1

Most of these are SPAM, but a good number are false positives. To keep precision high (while not sacrificing much in terms of coverage), we’d choose our regex-based rule.

e) Writing an LF that uses a third-party model

The LF interface is extremely flexible, and can wrap existing models. A common technique is to use a commodity model trained for other tasks that are related to, but not the same as, the one we care about.

For example, the TextBlob tool provides a pretrained sentiment analyzer. Our spam classification task is not the same as sentiment classification, but it turns out that SPAM and HAM comments have different distributions of sentiment scores. We’ll focus on writing LFs for HAM, since we identified SPAM comments above.

A brief intro to Preprocessors

A Snorkel Preprocessor is constructed from a black-box Python function that maps a data point to a new data point. LabelingFunctions can use Preprocessors, which lets us write LFs over transformed or enhanced data points. We add the @preprocessor(...) decorator to preprocessing functions to create Preprocessors. Preprocessors also have extra functionality, such as memoization (i.e. input/output caching, so it doesn’t re-execute for each LF that uses it).

We’ll start by creating a Preprocessor that runs TextBlob on our comments, then extracts the polarity and subjectivity scores.

from snorkel.preprocess import preprocessor
from textblob import TextBlob


@preprocessor(memoize=True)
def textblob_sentiment(x):
    scores = TextBlob(x.text)
    x.polarity = scores.sentiment.polarity
    x.subjectivity = scores.sentiment.subjectivity
    return x

We’ll have to tune the output of our LFs based on the TextBlob scores. Tuning input parameters or thresholds from model outputs is a common practice in developing LFs.
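
One way to pick these thresholds (a sketch of our own, not part of the original notebook) is to compute the TextBlob scores over the labeled dev set and compare their distributions for HAM vs. SPAM:

import pandas as pd
from textblob import TextBlob

# Compare sentiment score distributions for HAM (0) vs. SPAM (1) on the dev set.
dev_scores = pd.DataFrame(
    {
        "label": df_dev.label.values,
        "polarity": [TextBlob(t).sentiment.polarity for t in df_dev.text],
        "subjectivity": [TextBlob(t).sentiment.subjectivity for t in df_dev.text],
    }
)
print(dev_scores.groupby("label")[["polarity", "subjectivity"]].describe())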

@labeling_function(pre=[textblob_sentiment])
def textblob_polarity(x):
    return HAM if x.polarity > 0.9 else ABSTAIN

It looks like subjectivity scores above 0.5 will work pretty well for identifying HAM comments, though not perfectly. We’ll rely on our label model to learn that this is a lower accuracy rule.

@labeling_function(pre=[textblob_sentiment])
def textblob_subjectivity(x):
    return HAM if x.subjectivity >= 0.5 else ABSTAIN

Let’s apply our LFs so we can analyze their performance.

lfs = [textblob_polarity, textblob_subjectivity]

applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)
L_dev = applier.apply(df_dev)
LFAnalysis(L_train, lfs).lf_summary()
j Polarity Coverage Overlaps Conflicts
textblob_polarity 0 [0] 0.035309 0.013871 0.0
textblob_subjectivity 1 [0] 0.357503 0.013871 0.0
LFAnalysis(L_dev, lfs).lf_summary(Y=Y_dev)
j Polarity Coverage Overlaps Conflicts Correct Incorrect Emp. Acc.
textblob_polarity 0 [0] 0.05 0.02 0.0 4 1 0.800000
textblob_subjectivity 1 [0] 0.41 0.02 0.0 24 17 0.585366

Again, these LFs aren’t perfect, so we’ll rely on our label model to denoise and resolve their outputs.

3. Writing More Labeling Functions

If a single LF had high enough coverage to label our entire test dataset accurately, then we wouldn’t need a classifier at all. We could just use that single simple heuristic to complete the task. But most problems are not that simple. Instead, we usually need to combine multiple LFs to label our dataset, both to increase the size of the generated training set (since we can’t generate training labels for data points that all LFs abstained on) and to improve the overall accuracy of the training labels we generate by factoring in multiple different signals.

In the following sections, we’ll show just a few of the many types of LFs that you could write to generate a training dataset for this problem.

a) Keyword LFs

For text applications, some of the simplest LFs to write are often just keyword lookups. These will often follow the same execution pattern, so we can create a template and use the resources parameter to pass in LF-specific keywords. Similar to the labeling_function decorator, the LabelingFunction class wraps a Python function (the f parameter), and we can use the resources parameter to pass keyword arguments (here, the keywords to look up) to that function.

from snorkel.labeling import LabelingFunction


def keyword_lookup(x, keywords, label):
    if any(word in x.text.lower() for word in keywords):
        return label
    return ABSTAIN


def make_keyword_lf(keywords, label=SPAM):
    return LabelingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
    )


"""Spam comments talk about 'my channel', 'my video', etc."""
keyword_my = make_keyword_lf(keywords=["my"])

"""Spam comments ask users to subscribe to their channels."""
keyword_subscribe = make_keyword_lf(keywords=["subscribe"])

"""Spam comments post links to other channels."""
keyword_link = make_keyword_lf(keywords=["http"])

"""Spam comments make requests rather than commenting."""
keyword_please = make_keyword_lf(keywords=["please", "plz"])

"""Ham comments actually talk about the video's content."""
keyword_song = make_keyword_lf(keywords=["song"], label=HAM)

b) Pattern-matching LFs (regular expressions)

If we want a little more control over a keyword search, we can look for regular expressions instead. The LF we developed above (regex_check_out) is an example of this.

c) Heuristic LFs

There may be other heuristics or “rules of thumb” that you come up with as you look at the data. So long as you can express the rule in a function, it’s a viable LF!

@labeling_function()
def short_comment(x):
    """Ham comments are often short, such as 'cool video!'"""
    return HAM if len(x.text.split()) < 5 else ABSTAIN

d) LFs with Complex Preprocessors

Some LFs rely on fields that aren’t present in the raw data, but can be derived from it. We can enrich our data (providing more fields for the LFs to refer to) using Preprocessors.

For example, we can use the fantastic NLP (natural language processing) tool spaCy to add lemmas, part-of-speech (pos) tags, etc. to each token. Snorkel provides a prebuilt preprocessor for spaCy called SpacyPreprocessor which adds a new field to the data point containing a spaCy Doc object. For more info, see the SpacyPreprocessor documentation.

If you prefer to use a different NLP tool, you can also wrap that as a Preprocessor and use it in the same way. For more info, see the preprocessor documentation.

from snorkel.preprocess.nlp import SpacyPreprocessor

# The SpacyPreprocessor parses the text in text_field and
# stores the new enriched representation in doc_field
spacy = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)


@labeling_function(pre=[spacy])
def has_person(x):
    """Ham comments mention specific people and are short."""
    if len(x.doc) < 20 and any([ent.label_ == "PERSON" for ent in x.doc.ents]):
        return HAM
    else:
        return ABSTAIN

Because spaCy is such a common preprocessor for NLP applications, we also provide a prebuilt labeling_function-like decorator that uses spaCy. The resulting LF is identical to the one defined manually above.

from snorkel.labeling.lf.nlp import nlp_labeling_function


@nlp_labeling_function()
def has_person_nlp(x):
    """Ham comments mention specific people and are short."""
    if len(x.doc) < 20 and any([ent.label_ == "PERSON" for ent in x.doc.ents]):
        return HAM
    else:
        return ABSTAIN

Adding new domain-specific preprocessors and LF types is a great way to contribute to Snorkel! If you have an idea, feel free to reach out to the maintainers or submit a PR!

e) Third-party Model LFs

We can also utilize other models, including ones trained for other tasks that are related to, but not the same as, the one we care about. The TextBlob-based LFs we created above are great examples of this!

4. Combining Labeling Function Outputs with the Label Model

This tutorial demonstrates just a handful of the types of LFs that one might write for this task. Many of these are no doubt suboptimal. The strength of this approach, however, is that the LF abstraction provides a flexible interface for conveying a huge variety of supervision signals, and the LabelModel is able to denoise these signals, reducing the need for painstaking manual fine-tuning.

lfs = [
    keyword_my,
    keyword_subscribe,
    keyword_link,
    keyword_please,
    keyword_song,
    regex_check_out,
    short_comment,
    has_person_nlp,
    textblob_polarity,
    textblob_subjectivity,
]

With our full set of LFs, we can now apply them once again with the PandasLFApplier to get the label matrices for the train and dev splits. We’ll use the train split’s label matrix to generate training labels with the Label Model. The dev split’s label matrix is primarily helpful for looking at summary statistics.

The Pandas format provides an easy interface that many practitioners are familiar with, but it is less optimized for scale. For larger datasets, more compute-intensive LFs, or larger LF sets, you may decide to use one of the other data formats that Snorkel supports natively, such as Dask DataFrames or PySpark DataFrames, and their corresponding applier objects. For more info, check out the Snorkel API documentation.

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)
L_dev = applier.apply(df=df_dev)
L_valid = applier.apply(df=df_valid)
LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=Y_dev)
j Polarity Coverage Overlaps Conflicts Correct Incorrect Emp. Acc.
keyword_my 0 [1] 0.22 0.22 0.14 19 3 0.863636
keyword_subscribe 1 [1] 0.14 0.12 0.06 14 0 1.000000
keyword_http 2 [1] 0.10 0.08 0.07 10 0 1.000000
keyword_please 3 [1] 0.10 0.10 0.06 10 0 1.000000
keyword_song 4 [0] 0.16 0.12 0.06 11 5 0.687500
regex_check_out 5 [1] 0.29 0.22 0.17 29 0 1.000000
short_comment 6 [0] 0.28 0.20 0.07 19 9 0.678571
has_person_nlp 7 [0] 0.15 0.13 0.04 10 5 0.666667
textblob_polarity 8 [0] 0.05 0.05 0.01 4 1 0.800000
textblob_subjectivity 9 [0] 0.41 0.34 0.20 24 17 0.585366

Our goal is now to convert the labels from our LFs into a single noise-aware probabilistic (or confidence-weighted) label per data point. A simple baseline for doing this is to take the majority vote on a per-data point basis: if more LFs voted SPAM than HAM, label it SPAM (and vice versa). We can test this with the MajorityLabelVoter baseline model.

from snorkel.labeling import MajorityLabelVoter

majority_model = MajorityLabelVoter()
preds_train = majority_model.predict(L=L_train)
preds_train
array([1, 1, 0, ..., 1, 1, 1])

However, as we can clearly see by looking at the summary statistics of our LFs in the previous section, they are not all equally accurate, and should not be treated identically. In addition to having varied accuracies and coverages, LFs may be correlated, resulting in certain signals being overrepresented in a majority-vote-based model. To handle these issues appropriately, we will instead use a more sophisticated Snorkel LabelModel to combine the outputs of the LFs.

This model will ultimately produce a single set of noise-aware training labels, which are probabilistic or confidence-weighted labels. We will then use these labels to train a classifier for our task. For more technical details of this overall approach, see our NeurIPS 2016 and AAAI 2019 papers. For more info on the API, see the LabelModel documentation.

Note that no gold labels are used during the training process. The only information we need is the label matrix, which contains the output of the LFs on our training set. The LabelModel is able to learn weights for the labeling functions using only the label matrix as input. We also specify the cardinality, or number of classes. The LabelModel trains much more quickly than typical discriminative models since we only need the label matrix as input.

from snorkel.labeling import LabelModel

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=1000, lr=0.001, log_freq=100, seed=123)
majority_acc = majority_model.score(L=L_valid, Y=Y_valid)["accuracy"]
print(f"{'Majority Vote Accuracy:':<25} {majority_acc * 100:.1f}%")

label_model_acc = label_model.score(L=L_valid, Y=Y_valid)["accuracy"]
print(f"{'Label Model Accuracy:':<25} {label_model_acc * 100:.1f}%")
Majority Vote Accuracy:   84.2%
Label Model Accuracy:     86.7%

So our LabelModel improves over the majority vote baseline! However, it is typically not suitable as an inference-time model to make predictions for unseen data points, due to (among other things) some data points having all abstain labels. In the next section, we will use the output of the label model as training labels to train a discriminative classifier to see if we can improve performance further. This classifier will only need the text of the comment to make predictions, making it much more suitable for inference over unseen comments. For more information on the properties of the label model and when to use it, see the Snorkel guides.
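
One quick way to see why the label model alone makes a poor inference-time model is to check how many training data points received no label from any LF (a small check we add here):

# Fraction of train set data points on which every LF abstained.
all_abstain = (L_train == ABSTAIN).all(axis=1)
print(f"Fraction of train set with no LF labels: {all_abstain.mean() * 100:.1f}%")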

We can also use these models for error analysis. For example, let’s take a look at 5 random false negatives on the dev set under the majority-vote baseline (true SPAM comments that it labeled HAM), which might inspire some more LFs that vote SPAM.

probs_dev = majority_model.predict_proba(L=L_dev)
preds_dev = probs_dev >= 0.5
buckets = get_label_buckets(Y_dev, preds_dev[:, 1])

# True SPAM comments that the majority-vote baseline labeled HAM (false negatives)
df_fn_dev = df_dev[["text", "label"]].iloc[buckets[(SPAM, HAM)]]
df_fn_dev["probability"] = probs_dev[buckets[(SPAM, HAM)], 1]

df_fn_dev.sample(5, random_state=3)
text label probability
195 Check Out The New Hot Video By Dante B Called ... 1 0.0
334 Check out Em&#39;s dope new song monster here:... 1 0.0
431 CHECK OUT Eminem - Rap God LYRIC VIDEO 1 0.0
313 Aslamu Lykum... From Pakistan 1 0.0
189 /watch?v=aImbWbfQbzg watch and subscrible 1 0.0

Filtering out unlabeled data points

As we saw earlier, some of the data points in our train set received no labels from any of our LFs. These data points convey no supervision signal and tend to hurt performance, so we filter them out before training using a built-in utility.

from snorkel.labeling import filter_unlabeled_dataframe

# Get probabilistic training labels from the label model, then drop the
# data points that received no labels from any LF.
probs_train = label_model.predict_proba(L=L_train)

df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(
    X=df_train, y=probs_train, L=L_train
)

5. Training a Classifier

In this final section of the tutorial, we’ll use the noisy training labels we generated in the last section to train a classifier for our task. The output of the Snorkel LabelModel is just a set of labels which can be used with most popular libraries for performing supervised learning, such as TensorFlow, Keras, PyTorch, Scikit-Learn, Ludwig, and XGBoost. In this tutorial, we demonstrate using classifiers from Keras and Scikit-Learn.

Featurization

For simplicity and speed, we use a simple “bag of n-grams” feature representation: each data point is represented by a one-hot vector marking which words or 2-word combinations are present in the comment text.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(df_train_filtered.text.tolist())

X_dev = vectorizer.transform(df_dev.text.tolist())
X_valid = vectorizer.transform(df_valid.text.tolist())
X_test = vectorizer.transform(df_test.text.tolist())

Keras Classifier with Probabilistic Labels

We’ll use Keras, a popular high-level API for building models in TensorFlow, to build a simple logistic regression classifier. We compile it with a categorical_crossentropy loss so that it can handle probabilistic labels instead of integer labels. Using a noise-aware loss — one that uses probabilistic labels — for our discriminative model lets us take full advantage of the label model’s learning procedure (see our NeurIPS 2016 paper). We use the common settings of an Adam optimizer and early stopping (evaluating the model on the validation set after each epoch and reloading the weights from when it achieved the best score). For more information on Keras, see the Keras documentation.
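
The helpers get_keras_logreg and get_keras_early_stopping live in the tutorial's utils module and aren't shown here. A minimal sketch of what the two-class, probabilistic-label versions of such helpers might look like (the layer setup, optimizer settings, and patience below are our assumptions, not the tutorial's actual values):

import tensorflow as tf


def sketch_keras_logreg(input_dim):
    # A single softmax layer trained with categorical cross-entropy can consume
    # probabilistic (noise-aware) labels directly.
    model = tf.keras.Sequential(
        [
            tf.keras.Input(shape=(input_dim,)),
            tf.keras.layers.Dense(2, activation="softmax"),
        ]
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


def sketch_keras_early_stopping(patience=10):
    # Stop training when validation accuracy plateaus and restore the best weights.
    return tf.keras.callbacks.EarlyStopping(
        monitor="val_accuracy", patience=patience, restore_best_weights=True
    )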

from snorkel.analysis import metric_score
from snorkel.utils import preds_to_probs
from utils import get_keras_logreg, get_keras_early_stopping

# Define a vanilla logistic regression model with Keras
keras_model = get_keras_logreg(input_dim=X_train.shape[1])

keras_model.fit(
    x=X_train,
    y=probs_train_filtered,
    validation_data=(X_valid, preds_to_probs(Y_valid, 2)),
    callbacks=[get_keras_early_stopping()],
    epochs=20,
    verbose=0,
)
preds_test = keras_model.predict(x=X_test).argmax(axis=1)
test_acc = metric_score(golds=Y_test, preds=preds_test, metric="accuracy")
print(f"Test Accuracy: {test_acc * 100:.1f}%")
Test Accuracy: 94.4%

We observe an additional boost in accuracy over the LabelModel by multiple points! By using the label model to transfer the domain knowledge encoded in our LFs to the discriminative model, we were able to generalize beyond the noisy labeling heuristics.

We can compare this to the score we could have gotten if we had used our small labeled dev set directly as training data instead of using it to guide the creation of LFs.

import numpy as np

keras_dev_model = get_keras_logreg(input_dim=X_train.shape[1], output_dim=1)

keras_dev_model.fit(
    x=X_dev,
    y=Y_dev,
    validation_data=(X_valid, Y_valid),
    callbacks=[get_keras_early_stopping()],
    epochs=20,
    verbose=0,
)
preds_test_dev = np.round(keras_dev_model.predict(x=X_test))
test_acc = metric_score(golds=Y_test, preds=preds_test_dev, metric="accuracy")
print(f"Test Accuracy: {test_acc * 100:.1f}%")
Test Accuracy: 92.8%

Scikit-Learn with Rounded Labels

If we want to use a library or model that doesn’t accept probabilistic labels, we can replace each label distribution with the label of the class that has the maximum probability. This can easily be done using the probs_to_preds helper method. It’s important to note that this transformation is lossy, as we no longer have values for our confidence in each label.

from snorkel.utils import probs_to_preds

preds_train_filtered = probs_to_preds(probs=probs_train_filtered)

For example, this allows us to use standard models from Scikit-Learn.

from sklearn.linear_model import LogisticRegression

sklearn_model = LogisticRegression(C=0.001, solver="liblinear")
sklearn_model.fit(X=X_train, y=preds_train_filtered)
print(f"Test Accuracy: {sklearn_model.score(X=X_test, y=Y_test) * 100:.1f}%")
Test Accuracy: 93.2%

Summary

In this tutorial, we accomplished the following:

  • We introduced the concept of Labeling Functions (LFs) and demonstrated some of the forms they can take.
  • We used the Snorkel LabelModel to automatically learn how to combine the outputs of our LFs into strong probabilistic labels.
  • We showed that a classifier trained on a weakly supervised dataset can outperform an approach based on the LFs alone as it learns to generalize beyond the noisy heuristics we provide.

Next Steps

If you enjoyed this tutorial and you’ve already checked out the Getting Started tutorial, check out the Tutorials page for other tutorials that you may find interesting, including demonstrations of how to use Snorkel for a variety of other tasks, and more! You can also visit the Snorkel website or Snorkel API documentation for more info!