Vowpal Wabbit, the magic recommender system!

Samuel Guedj
3 min read · May 29, 2020

A tutorial on predicting clicks

This post is an introduction to recommender systems.
The next step will be a Kaggle submission via Jupyter.

Tech stack used in this post:

  1. Jupyter
  2. Vowpal Wabbit
  3. Spark
  4. Parquet

Vowpal Wabbit is a fast, open-source online machine learning library and program, originally developed at Yahoo! Research and currently at Microsoft Research.

We will use data from the Kaggle competition outbrain-click-prediction. It’s an old competition but a good match for our needs.

Let’s import the classical ML libraries:

import tqdm.notebook as tqdm
import numpy as np
import scipy
import sklearn
import matplotlib.pyplot as plt

Let’s also import Spark:

import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
sc = SparkContext("yarn", "My App")
se = SparkSession(sc)

Then register all the tables for SQL queries:

from IPython.display import display
tables = ["clicks_test", "clicks_train",
          "documents_categories", "documents_entities", "documents_meta", "documents_topics",
          "events", "page_views", "page_views_sample", "promoted_content"]
for name in tqdm.tqdm(tables):
    df = se.read.parquet("s3://ydatazian/{}.parquet".format(name))
    df.registerTempTable(name)
    print(name)
    display(df.limit(3).toPandas())

We are loading the data from a public S3 bucket; I will update it later to my own bucket.
For each table, you should see its name printed followed by a preview of its first three rows.

Prepare dataset for VW

We will predict a click based on:

  • ad_id
  • document_id
  • campaign_id
  • advertiser_id

These are basic features; we will optimize them in a later post.

%%time
se.sql("""
select
clicks_train.clicked,
clicks_train.display_id,
clicks_train.ad_id,
promoted_content.document_id,
promoted_content.campaign_id,
promoted_content.advertiser_id
from clicks_train join promoted_content on clicks_train.ad_id = promoted_content.ad_id
""").write.parquet("/train_features.parquet", mode='overwrite')

We just wrote the result of the query to a Parquet file. Let’s now display the first rows:

se.read.parquet("/train_features.parquet").show(5)

This function formats the data for VW:

# Format: [Label] [Importance] [Base] [Tag]|Namespace Features |Namespace Features ... |Namespace Features
# https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Input-format
def vw_row_mapper(row):
    clicked = None
    features = []
    for k, v in row.asDict().items():
        if k == 'clicked':
            clicked = '1' if v == '1' else '-1'
        else:
            features.append(k + "_" + v)
    tag = row.display_id + "_" + row.ad_id
    return "{} {}| {}".format(clicked, tag, " ".join(features))
r = se.read.parquet("/train_features.parquet").take(1)[0]
print(r)
print(vw_row_mapper(r))
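To see the mapper's behavior without a Spark cluster, here is a minimal sketch that applies the same logic to a stand-in row object. `FakeRow` is a hypothetical helper that mimics pyspark's `Row` (an `asDict()` method plus attribute access); the id values are made up for illustration.

```python
# FakeRow is a hypothetical stand-in for a pyspark Row (asDict + attributes).
class FakeRow:
    def __init__(self, **kwargs):
        self._d = kwargs
        for k, v in kwargs.items():
            setattr(self, k, v)

    def asDict(self):
        return dict(self._d)

def vw_row_mapper(row):
    # Same logic as above: 'clicked' becomes the label, everything else a feature.
    clicked = None
    features = []
    for k, v in row.asDict().items():
        if k == 'clicked':
            clicked = '1' if v == '1' else '-1'
        else:
            features.append(k + "_" + v)
    tag = row.display_id + "_" + row.ad_id
    return "{} {}| {}".format(clicked, tag, " ".join(features))

# Illustrative ids, not real data
row = FakeRow(clicked='1', display_id='42', ad_id='7',
              document_id='100', campaign_id='5', advertiser_id='3')
print(vw_row_mapper(row))
# → 1 42_7| display_id_42 ad_id_7 document_id_100 campaign_id_5 advertiser_id_3
```

Note that since only `clicked` is excluded, `display_id` and `ad_id` also end up as features, in addition to serving as the tag.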

Below, we convert the Parquet file to a plain text file:

%%time
! hdfs dfs -rm -r /train_features.txt
(
    se.read.parquet("/train_features.parquet")
    .rdd
    .map(vw_row_mapper)
    .saveAsTextFile("/train_features.txt")
)

We now copy the file from HDFS to the local file system. "getmerge" is a Hadoop command that concatenates the part files into a single local file.

! rm /mnt/train.txt
! hdfs dfs -getmerge /train_features.txt /mnt/train.txt
# preview local file
! head -n 5 /mnt/train.txt
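At this point, each line of /mnt/train.txt is in VW's input format. The ids below are invented for illustration, but the shape of a line should look like:

```
1 923744_156824| display_id_923744 ad_id_156824 document_id_992370 campaign_id_7283 advertiser_id_1919
-1 923745_30609| display_id_923745 ad_id_30609 document_id_938164 campaign_id_5969 advertiser_id_1499
```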

Train VW

More info is available in VW’s getting-started guide and command-line documentation.
The data is ready locally, so let’s train the model. A few flags worth noting: -b 24 uses 24-bit feature hashing, -c/-k build and refresh the example cache, --ftrl selects the FTRL optimizer, -f saves the trained model, and --holdout_off disables the holdout set.

! ./vw -d /mnt/train.txt -b 24 -c -k --ftrl --passes 1 -f model --holdout_off --loss_function logistic --random_seed 42 --progress 8000000
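With --loss_function logistic, VW's raw predictions live in log-odds space. The prediction step below passes --link=logistic, which maps them to click probabilities with the sigmoid function. A minimal sketch of that mapping (the function name is mine, not VW's):

```python
import math

def to_click_prob(raw_score):
    # sigmoid / logistic link: log-odds -> probability of a click
    return 1.0 / (1.0 + math.exp(-raw_score))

print(to_click_prob(0.0))  # → 0.5, a raw score of 0 is a coin flip
```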

And finally, we create a small test input file and get the predictions:

! echo "? tag1| ad_id_144739 document_id_1337362 campaign_id_18488 advertiser_id_2909" > /mnt/test.txt
! echo "? tag2| ad_id_156824 document_id_992370 campaign_id_7283 advertiser_id_1919" >> /mnt/test.txt
! ./vw -d /mnt/test.txt -i model -t -k -p /mnt/predictions.txt --progress 1000000 --link=logistic
# predicted probabilities of "1" class
! cat /mnt/predictions.txt
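Because each test line carries a tag, each line of the predictions file pairs a probability with that tag. A small hypothetical parser for that output (the sample values are illustrative, not real model output):

```python
def parse_vw_predictions(lines):
    # Each line looks like "<probability> <tag>"; build a tag -> probability dict.
    preds = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            preds[parts[1]] = float(parts[0])
    return preds

# Illustrative lines, not actual model output
sample = ["0.1833 tag1", "0.2214 tag2"]
print(parse_vw_predictions(sample))
```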

Above are the results of the two predictions.

Thanks for reading
