Vowpal Wabbit, the magic recommender system!
Tutorial to predict clicks
This post is an introduction to recommender systems.
The next step will be a Kaggle submission via Jupyter.
Tech stack in this post:
Vowpal Wabbit is an open-source, fast, online, interactive machine learning library and program, developed originally at Yahoo! Research and currently at Microsoft Research.
We will use data from the Kaggle competition outbrain-click-prediction. It’s an old competition but a good match for our needs.
Let’s import the classic ML libraries:
import tqdm.notebook as tqdm
import numpy as np
import scipy
import sklearn
import matplotlib.pyplot as plt
Let’s also import Spark:
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
sc = SparkContext("yarn", "My App")
se = SparkSession(sc)
Then register all tables for SQL queries:
from IPython.display import display
tables = ["clicks_test", "clicks_train",
"documents_categories", "documents_entities", "documents_meta", "documents_topics",
"events", "page_views", "page_views_sample", "promoted_content"]
for name in tqdm.tqdm(tables):
    df = se.read.parquet("s3://ydatazian/{}.parquet".format(name))
    df.registerTempTable(name)
    print(name)
    display(df.limit(3).toPandas())
We are loading the data from a public S3 bucket; I will move it to my own bucket later.
Below is a sample of the display:
Prepare dataset for VW
We will predict a click based on:
- ad_id
- document_id
- campaign_id
- advertiser_id
These are basic features; we will optimize them in a different post.
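To see where we are headed, here is a minimal sketch of how one row with these four features becomes a single VW example line (the values are made up for illustration):

```python
# Hypothetical row values, for illustration only.
row = {"clicked": "1", "ad_id": "42337", "document_id": "938164",
       "campaign_id": "5969", "advertiser_id": "1499"}

# VW expects a label (1/-1 for logistic loss) followed by "| feature feature ...".
label = "1" if row["clicked"] == "1" else "-1"
features = " ".join(k + "_" + v for k, v in row.items() if k != "clicked")
line = "{} | {}".format(label, features)
print(line)  # 1 | ad_id_42337 document_id_938164 campaign_id_5969 advertiser_id_1499
```

Concatenating the column name with the value (e.g. `ad_id_42337`) turns each id into a distinct categorical feature.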
%%time
se.sql("""
select
clicks_train.clicked,
clicks_train.display_id,
clicks_train.ad_id,
promoted_content.document_id,
promoted_content.campaign_id,
promoted_content.advertiser_id
from clicks_train join promoted_content on clicks_train.ad_id = promoted_content.ad_id
""").write.parquet("/train_features.parquet", mode='overwrite')
We just wrote the result of the query to a Parquet file. Let’s now display the first rows:
se.read.parquet("/train_features.parquet").show(5)
This function will format the data for VW:
# Format: [Label] [Importance] [Base] [Tag]|Namespace Features |Namespace Features ... |Namespace Features
# https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Input-format
def vw_row_mapper(row):
    clicked = None
    features = []
    for k, v in row.asDict().items():
        if k == 'clicked':
            clicked = '1' if v == '1' else '-1'
        else:
            features.append(k + "_" + v)
    tag = row.display_id + "_" + row.ad_id
    return "{} {}| {}".format(clicked, tag, " ".join(features))

r = se.read.parquet("/train_features.parquet").take(1)[0]
print(r)
print(vw_row_mapper(r))
Below we convert the Parquet file to a plain text file:
%%time
! hdfs dfs -rm -r /train_features.txt
(
se.read.parquet("/train_features.parquet")
.rdd
.map(vw_row_mapper)
.saveAsTextFile("/train_features.txt")
)
We now copy the file from HDFS to the local file system; "getmerge" is a Hadoop command that concatenates the part files into one local file.
! rm /mnt/train.txt
! hdfs dfs -getmerge /train_features.txt /mnt/train.txt
# preview local file
! head -n 5 /mnt/train.txt
Train VW
More info is available in the VW wiki pages on getting started and the command line.
The data is ready locally, so we can train the model:
! ./vw -d /mnt/train.txt -b 24 -c -k --ftrl --passes 1 -f model --holdout_off --loss_function logistic --random_seed 42 --progress 8000000
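With `--loss_function logistic`, the model's raw predictions are log-odds. A small sketch of the sigmoid mapping that `--link=logistic` (used at prediction time below) applies to turn them into probabilities:

```python
import math

def sigmoid(raw):
    # Map a raw VW score (log-odds) to a click probability.
    return 1.0 / (1.0 + math.exp(-raw))

print(sigmoid(0.0))   # 0.5: no evidence either way
print(sigmoid(-2.0))  # ~0.119: unlikely to be clicked
```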
And finally, we create the input file and get the predictions:
! echo "? tag1| ad_id_144739 document_id_1337362 campaign_id_18488 advertiser_id_2909" > /mnt/test.txt
! echo "? tag2| ad_id_156824 document_id_992370 campaign_id_7283 advertiser_id_1919" >> /mnt/test.txt
! ./vw -d /mnt/test.txt -i model -t -k -p /mnt/predictions.txt --progress 1000000 --link=logistic
# predicted probabilities of "1" class
! cat /mnt/predictions.txt
Above are the results of the two predictions.
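A minimal sketch of loading those predictions back into Python; the exact line layout (probability first, then the example's tag when tags were provided) is an assumption based on the output above:

```python
def parse_predictions(lines):
    # Each prediction line holds the probability first,
    # optionally followed by the example's tag.
    probs = []
    for line in lines:
        parts = line.split()
        if parts:
            probs.append(float(parts[0]))
    return probs

# Hypothetical file contents, for illustration only:
sample = ["0.051362 tag1", "0.193202 tag2"]
print(parse_predictions(sample))  # [0.051362, 0.193202]
```

In practice you would pass `open("/mnt/predictions.txt")` instead of the sample list.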