Easy recommendations on Hadoop

The parallel computing power of Hadoop is driving the adoption of Data Science and Machine Learning techniques in many organizations.

People often claim that Machine Learning is hard. It is true that getting Machine Learning just right is an intensive task that requires a talented Data Scientist, but the core techniques behind Machine Learning algorithms are quite easy to understand.

In this article, I’m going to explain how you can write your own super easy recommendation algorithm. Bear in mind that this is by no means something you should use in production (even though you could); it is just an example to show you how these recommendations can work under the hood.


What are we going to recommend?

Suppose we’re starting a brand new and unique music streaming service called Napstify. We already have a logo and everything:


The unique selling point of our music service will be that we will help our users to discover new music by offering them state-of-the-art recommendations.

Unfortunately for us, we don’t have any data yet to build our recommendation engine.

Luckily, we’re not the first ones to get into the music business. One of our competitors-to-be lets its users tweet what they are listening to right now:


@alexa_afpb:#NowPlaying 50/50 – The Strokes http://t.co/BvnZNGS0gd

@_Yoshimii:#NowPlaying Ask - 2011 Remastered Version – The Smiths http://t.co/5rdd1l4QMy

@CassieHFox:#NowPlaying (Everything I Do) I Do It For You – Bryan Adams http://t.co/mZndfkXh6O

@alexa_afpb:#NowPlaying Heads Will Roll – Yeah Yeah Yeahs http://t.co/lcOWG9E9e6

@ChoiceofWeapons:#NowPlaying Lions, Tigers & Bears – Jazmine Sullivan http://t.co/gaurAripL4

@JoseGalRe:#NowPlaying Romani Holiday (Antonius Remix) – Hans Zimmer http://t.co/aw0N2q22sz

@CeciSCV:#NowPlaying Ich Will – Rammstein http://t.co/sFNA49NATC


We can just harvest those tweets and use them to feed our algorithm.

Now that we have our data in place, let’s see what kind of algorithm we’re going to use.


The algorithm

Our algorithm is based on a very simple principle: you probably like what other people who listen to the same music like.

Let’s say we know that Person A listened to The Bee Gees and The Village People, and that Person B also listened to The Bee Gees and The Village People. When Person C comes along and has listened to The Bee Gees, we might want to recommend The Village People to them.



So, how do we do this?

First, we make a list of the artists each user has listened to. We extract the usernames and the artists from our tweets:

@ilizc      Bon Jovi

@ilizc      Gorillaz

@ilizc      Aerosmith

@ilizc      Metallica

@savsa      Gorillaz

@savsa      Placebo

@savsa      Röyksopp

@scuj1      Aerosmith

@scuj1      Pink Floyd

@scuj1      Metallica

@scuj1      Led Zeppelin
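The extraction step above could be sketched like this in Python (used here purely for illustration; it is not part of the original setup). The sketch assumes every tweet follows the layout of the samples shown earlier, with an en dash between the song title and the artist and a short URL at the end. The names `TWEET_RE` and `parse_tweet` are made up for this example.

```python
import re

# Assumed layout, taken from the sample tweets:
#   @user:#NowPlaying <title> – <artist> <short URL>
# The separator before the artist is an en dash (U+2013), not a hyphen.
TWEET_RE = re.compile(
    r"^(@\w+):#NowPlaying\s+(.+)\s\u2013\s(.+?)\s+https?://\S+$"
)


def parse_tweet(line):
    """Return (user, artist), or None if the line does not fit the layout."""
    match = TWEET_RE.match(line.strip())
    if match is None:
        return None
    user, _title, artist = match.groups()
    return user, artist


sample = (
    "@_Yoshimii:#NowPlaying Ask - 2011 Remastered Version "
    "\u2013 The Smiths http://t.co/5rdd1l4QMy"
)
print(parse_tweet(sample))  # ('@_Yoshimii', 'The Smiths')
```

Because the title is matched greedily, a plain hyphen inside the title (as in the sample) does not confuse the parser. Tweets that do not fit the layout return None and can simply be skipped.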


Next, we generate the co-occurrences of artists: every pair of artists that the same user listened to. So from the example above we get:


Bon Jovi – Gorillaz

Bon Jovi – Aerosmith

Bon Jovi – Metallica

Gorillaz – Aerosmith

Gorillaz – Metallica

Aerosmith – Metallica

Gorillaz – Placebo

Gorillaz – Röyksopp

Placebo – Röyksopp

Aerosmith – Pink Floyd

Aerosmith – Metallica

Aerosmith – Led Zeppelin

Pink Floyd – Metallica

Pink Floyd – Led Zeppelin

Metallica – Led Zeppelin
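If you want to try this step on your laptop first, here is a minimal Python sketch that produces the pairs listed above. It is only an illustration under a few assumptions: the data fits in memory, and the function name `artist_pairs` is made up for this example.

```python
from itertools import combinations

# The user-artist list from the example above.
listens = [
    ("@ilizc", "Bon Jovi"), ("@ilizc", "Gorillaz"),
    ("@ilizc", "Aerosmith"), ("@ilizc", "Metallica"),
    ("@savsa", "Gorillaz"), ("@savsa", "Placebo"), ("@savsa", "Röyksopp"),
    ("@scuj1", "Aerosmith"), ("@scuj1", "Pink Floyd"),
    ("@scuj1", "Metallica"), ("@scuj1", "Led Zeppelin"),
]


def artist_pairs(listens):
    """Return every pair of distinct artists that one user listened to."""
    artists_by_user = {}
    for user, artist in listens:
        artists = artists_by_user.setdefault(user, [])
        if artist not in artists:  # ignore repeat listens by the same user
            artists.append(artist)
    pairs = []
    for artists in artists_by_user.values():
        # combinations() yields each unordered pair once, in list order
        pairs.extend(combinations(artists, 2))
    return pairs


pairs = artist_pairs(listens)
print(len(pairs))  # 15
print(pairs[0])    # ('Bon Jovi', 'Gorillaz')
```

Note that Aerosmith and Metallica show up twice in the result, once for @ilizc and once for @scuj1. That repetition is what the counting step picks up.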


And when we count and sort them:


Aerosmith – Metallica – 2

Bon Jovi – Gorillaz – 1

Bon Jovi – Aerosmith – 1

Bon Jovi – Metallica – 1

Gorillaz – Aerosmith – 1

Gorillaz – Metallica – 1

Gorillaz – Placebo – 1

Gorillaz – Röyksopp – 1

Placebo – Röyksopp – 1

Aerosmith – Pink Floyd – 1

Aerosmith – Led Zeppelin – 1

Pink Floyd – Metallica – 1

Pink Floyd – Led Zeppelin – 1

Metallica – Led Zeppelin – 1


And there you have it: the artists that are played together the most appear at the top (in this case Aerosmith and Metallica).
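Counting and sorting is a one-liner in most languages. The Python sketch below is one possible way to do it, not the way the Hadoop job does it. The helper name `count_pairs` is hypothetical, and the reversed pair in the sample input is added on purpose to show why the two artists are put in a fixed order before counting.

```python
from collections import Counter


def count_pairs(pairs):
    """Count each artist pair, treating (A, B) and (B, A) as the same pair."""
    counts = Counter(tuple(sorted(pair)) for pair in pairs)
    # most_common() returns (pair, count) tuples, highest count first
    return counts.most_common()


# A few pairs in the spirit of the example above.
pairs = [
    ("Bon Jovi", "Aerosmith"),
    ("Aerosmith", "Metallica"),
    ("Aerosmith", "Pink Floyd"),
    ("Metallica", "Aerosmith"),  # same pair as above, in reversed order
]

for (artist_a, artist_b), count in count_pairs(pairs):
    print(artist_a, "-", artist_b, "-", count)
```

The first line printed is the Aerosmith and Metallica pair with a count of 2, just like in the list above.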

Now let’s apply this on Hadoop. In the sources at the bottom of the article you can find the full dataset with 45407 user-artist pairs. It’s fairly small, so we don’t really need to do this on a cluster; but if you want to test it, you can generate new records from the original file and create a file of any size you want.
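One simple way to blow up the dataset is to copy every record a number of times under a new username. The Python sketch below shows the idea. It is only an assumption about how you might do it: the function names `inflate` and `write_tsv` and the `_0`, `_1`, ... username suffixes are made up for this example.

```python
def inflate(records, copies):
    """Yield `copies` versions of every (user, artist) record.

    Each copy gets a suffixed username, so the copies behave like new users
    with the same taste and every pair count grows by the same factor.
    """
    for copy in range(copies):
        for user, artist in records:
            yield f"{user}_{copy}", artist


def write_tsv(records, path):
    """Write one tab-separated user/artist record per line."""
    with open(path, "w", encoding="utf-8") as out:
        for user, artist in records:
            out.write(f"{user}\t{artist}\n")


original = [("@ilizc", "Bon Jovi"), ("@ilizc", "Gorillaz")]
bigger = list(inflate(original, copies=3))
print(len(bigger))  # 6
```

Since every pair count is multiplied by the number of copies, the ranking of the artist pairs stays the same; only the volume changes, which is all you need for a load test.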

We’ll use MapReduce to process our file, but since writing the MapReduce code for this simple task is quite cumbersome, I’ll just write a Pig script instead.

The script goes as follows:

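-- Load the tab-separated (user, artist) records.
tweets = LOAD 'path/to/tweets' USING PigStorage('\t') AS (user:chararray, artist:chararray);

-- Create a second alias of the same data so we can join the dataset with itself.
duplicateTweets = FOREACH tweets GENERATE user, artist;

-- Pair up all records that belong to the same user.
joined_tweets = JOIN tweets BY user, duplicateTweets BY user;

-- Keep only pairs whose first artist sorts before the second:
-- this drops (A, A) and one of (A, B) / (B, A).
tweetPairs = FILTER joined_tweets BY tweets::artist < duplicateTweets::artist;

-- Group by artist pair and count the rows per pair ("aantal" is Dutch for "count").
tweetPairGroup = GROUP tweetPairs BY (tweets::artist, duplicateTweets::artist);

tweetPairCount = FOREACH tweetPairGroup GENERATE group.tweets::artist AS artistA, group.duplicateTweets::artist AS artistB, COUNT(tweetPairs) AS aantal;

-- Most frequent pairs first.
ordered = ORDER tweetPairCount BY aantal DESC;

STORE ordered INTO 'path/to/output/folder' USING PigStorage('\t');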

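What we do in the script is join the full dataset against itself on user, which pairs up all the artists a user listened to. The filter keeps only the pairs whose first artist sorts before the second, so an artist is never paired with itself and each pair shows up once instead of twice. We then group the rows by artist pair and count them. Finally, we order the pairs by count (so the artists that co-occur the most come first) and store the result.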
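If you don’t read Pig fluently, the same steps can be mirrored on a tiny in-memory sample. The Python sketch below is only an illustration of the logic, not what Pig runs under the hood; the sample records are a subset of the earlier example and the variable names are chosen to match the script.

```python
from collections import Counter

tweets = [
    ("@ilizc", "Aerosmith"), ("@ilizc", "Metallica"),
    ("@scuj1", "Aerosmith"), ("@scuj1", "Metallica"), ("@scuj1", "Pink Floyd"),
]

# JOIN on user: every record meets every record of the same user,
# including itself.
joined = [
    (artist_a, artist_b)
    for user_a, artist_a in tweets
    for user_b, artist_b in tweets
    if user_a == user_b
]

# FILTER on artist < artist: drops (A, A) and keeps one of (A, B) / (B, A).
tweet_pairs = [(a, b) for a, b in joined if a < b]

# GROUP and COUNT, then ORDER by count, highest first.
ordered = Counter(tweet_pairs).most_common()

print(len(joined))       # 13
print(len(tweet_pairs))  # 4
print(ordered[0])        # (('Aerosmith', 'Metallica'), 2)
```

The drop from 13 joined rows to 4 pairs shows how much of the self-join is redundant, which is why the filter matters on a large dataset.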

This gives us the following output:

Fifth Harmony - Fifth Harmony, Tyga    35

Fifth Harmony - Fifth Harmony, Kid Ink    27

Ed Sheeran - Maroon 5    26

Fifth Harmony - Fifth Harmony, Meghan Trainor    25

Ellie Goulding - Sia    21

Beyoncé - Ellie Goulding    21

Hozier - Sia    19


The first thing we learn from this is that Fifth Harmony is quite popular right now.

And what about the other results? 

Iron Maiden - Judas Priest    12

Coldplay - Maroon 5     12

Nirvana - Pearl Jam     12

Black Sabbath - Metallica     10

David Bowie - The Rolling Stones    9

This looks about right. We can now build custom recommendations for every user by combining these similar artists with that user’s listening history.
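To give an idea of what that last step could look like, here is a minimal Python sketch. It is an assumption on my part, not a finished recommender: the function name `recommend` is hypothetical, and scoring a candidate by summing its pair counts is just one simple choice.

```python
from collections import Counter

# (artistA, artistB, count) rows, copied from the output above.
pair_counts = [
    ("Ed Sheeran", "Maroon 5", 26),
    ("Ellie Goulding", "Sia", 21),
    ("Beyoncé", "Ellie Goulding", 21),
    ("Hozier", "Sia", 19),
    ("Coldplay", "Maroon 5", 12),
]


def recommend(history, pair_counts, top_n=3):
    """Rank artists the user has not heard yet by how often they co-occur
    with artists from the user's listening history."""
    heard = set(history)
    scores = Counter()
    for artist_a, artist_b, count in pair_counts:
        if artist_a in heard and artist_b not in heard:
            scores[artist_b] += count
        elif artist_b in heard and artist_a not in heard:
            scores[artist_a] += count
    return [artist for artist, _ in scores.most_common(top_n)]


print(recommend(["Sia"], pair_counts))  # ['Ellie Goulding', 'Hozier']
```

A user who only listened to Sia gets Ellie Goulding and Hozier, because those are the artists that co-occur with Sia in the output.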

Of course this algorithm isn’t perfect, but it shows the potential of generating recommendations on the Apache Hadoop platform. You can easily build a condensed recommendation model to use in production every day, using your entire dataset.

The hard part is perfecting this model: popular artists that everyone listens to are over-recommended, for example. You’ll need an experienced Data Scientist to do that. If you want to learn more about that, give us a call.


Source files:

Pig Script

Parsed Tweets