Description

For a business, it is important to know what users enjoy buying in order to provide them with a better shopping experience. For example, a person buying a new television may frequently buy a surge protector to go along with it. A recommender system estimates how likely a person consuming one item is to want another, which gives the business an opportunity to sell an additional item.

For this project we apply the same logic to movie recommendations. Given a history of what a user has watched, we try to recommend movies that they would also be interested in seeing. Specifically, we test whether the most viewed movies show up in the top 10 recommendations for viewers.

Methods

Data Set

The data set used for this project is the GroupLens movie data set of 1 million ratings (GroupLens). It contains demographic information and rating data for over 6,000 users and roughly 4,000 movies.
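As a rough illustration of how the rating data could be read in, here is a minimal sketch. It assumes the ratings come as text lines of the form `UserID::MovieID::Rating::Timestamp`; the function name `parse_ratings` and the sample lines are made up for this example and are not taken from the project.

```python
from typing import Iterable, List, Tuple

# (user_id, movie_id, rating); the timestamp is dropped in this sketch.
Rating = Tuple[int, int, float]


def parse_ratings(lines: Iterable[str], sep: str = "::") -> List[Rating]:
    """Parse 'user::movie::rating::timestamp' lines into rating tuples."""
    ratings: List[Rating] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, movie_id, rating, _timestamp = line.split(sep)
        ratings.append((int(user_id), int(movie_id), float(rating)))
    return ratings


# Hypothetical sample lines in the assumed format.
sample = [
    "1::1193::5::978300760",
    "1::661::3::978302109",
    "2::1193::4::978298413",
]
parsed = parse_ratings(sample)
```

In practice the same function could be given an open file handle instead of a list of strings.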

Methodology

Here we explore the use of collaborative filtering (Wikipedia) to make our recommendations. Collaborative filtering can be done with various methods. Our chosen method is item-item collaborative filtering, which in essence says "users who bought x also bought y". This method first computes a similarity between items. There is then an option to add a user-based layer to produce individual recommendations, which we also do.

For this use of item-based collaborative filtering, we find the similarity of the movies to each other by computing the cosine similarity between their rating vectors (the rows of the rating matrix).

[Figure: similarity function]
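For reference, the standard definition of cosine similarity is written out below. The symbols are assumptions introduced here: a and b are the rating vectors of two movies, and u indexes the users. This is the textbook formula, not a reproduction of the original figure.

```latex
\[
\operatorname{sim}(a, b) = \cos\theta
  = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
  = \frac{\sum_{u} a_u b_u}{\sqrt{\sum_{u} a_u^2}\,\sqrt{\sum_{u} b_u^2}}
\]
```

With non-negative ratings the value ranges from 0 (no users in common) to 1 (identical rating patterns up to scale).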

Based on users' ratings, the rated movies are turned into a matrix with one row per movie. These rows are then compared with each other to get a similarity rating for every pair of movies, so that each movie ends up with a list of the movies most similar to it.
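The row-by-row comparison can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the matrix has one row per movie and one column per user, a 0 stands for "not rated", and the name `item_similarity` and the toy matrix are invented for illustration.

```python
import numpy as np


def item_similarity(ratings: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of movie rows.

    ratings has shape (n_movies, n_users); 0 means the user did not rate
    the movie. Returns an (n_movies, n_movies) similarity matrix.
    """
    norms = np.linalg.norm(ratings, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for unrated movies
    unit_rows = ratings / norms  # scale each movie row to unit length
    return unit_rows @ unit_rows.T  # dot products of unit rows = cosines


# Toy matrix: movies 0 and 1 are rated identically, movie 2 by another user.
ratings = np.array(
    [
        [5, 4, 0],
        [5, 4, 0],
        [0, 0, 3],
    ],
    dtype=float,
)
sim = item_similarity(ratings)
```

Normalizing the rows once and taking a single matrix product avoids an explicit loop over movie pairs.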

From there the user layer is applied, which takes the movies a user has seen and finds the movies most similar to them. This list is then sorted by similarity rating and written to a file.
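One way the user layer could look is sketched below. The text does not say how similarities to several watched movies are combined, so summing them is an assumption made for this example. The function name `recommend` and the toy similarity matrix are also hypothetical.

```python
import numpy as np


def recommend(sim: np.ndarray, watched, top_n: int = 10):
    """Rank unseen movies by their summed similarity to the watched ones.

    sim is an (n_movies, n_movies) item-item similarity matrix and
    watched is a list of movie indices. Returns (movie_index, score)
    pairs sorted from most to least similar.
    """
    watched = list(watched)
    # Assumption: a candidate's score is the sum of its similarities
    # to every movie the user has already watched.
    scores = sim[:, watched].sum(axis=1).astype(float)
    scores[watched] = -np.inf  # never recommend an already-seen movie
    order = np.argsort(-scores)  # indices sorted by descending score
    ranked = [(int(i), float(scores[i])) for i in order if np.isfinite(scores[i])]
    return ranked[:top_n]


# Toy similarity matrix for four movies.
sim = np.array(
    [
        [1.0, 0.9, 0.1, 0.4],
        [0.9, 1.0, 0.2, 0.3],
        [0.1, 0.2, 1.0, 0.8],
        [0.4, 0.3, 0.8, 1.0],
    ]
)
recs = recommend(sim, watched=[0], top_n=2)
```

The returned list could then be written to a file, one movie per line, in whatever format is convenient.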

Procedure

Because of data sparsity issues that emerged in testing, the procedure was modified: it is run for the user who has rated the most movies, and the movies withheld are the ones with the most ratings. The reason is that with collaborative filtering each item-item comparison contributes to the total, so removing a single rating from a movie with few reviews would cause a major jump in similarity. With a large enough set, each rating carries less weight, so one can be withheld for testing without distorting the result. Thus, for the user with the most movie ratings, the two most watched movies were withheld from the rating set.
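The hold-out selection described above can be sketched as follows. It assumes ratings are available as `(user_id, movie_id, rating)` tuples and reads "most watched" as "has the most ratings overall"; both the function name `pick_holdout` and the toy data are illustrative only.

```python
from collections import Counter


def pick_holdout(ratings, n_holdout=2):
    """Choose the most active user and withhold their most-rated movies.

    ratings is a list of (user_id, movie_id, rating) tuples. Returns
    (user_id, held_out_movie_ids, remaining_ratings).
    """
    ratings_per_user = Counter(user for user, _, _ in ratings)
    ratings_per_movie = Counter(movie for _, movie, _ in ratings)

    # The user who has rated the most movies.
    top_user = ratings_per_user.most_common(1)[0][0]

    # Of that user's movies, take those with the most ratings overall.
    user_movies = [movie for user, movie, _ in ratings if user == top_user]
    held_out = sorted(user_movies, key=lambda m: -ratings_per_movie[m])[:n_holdout]

    # Remove only this user's ratings of the held-out movies.
    held = set(held_out)
    remaining = [r for r in ratings if not (r[0] == top_user and r[1] in held)]
    return top_user, held_out, remaining


# Toy data: user 1 is the most active; movies 10 and 20 are the most rated.
toy = [
    (1, 10, 5), (1, 20, 4), (1, 30, 3),
    (2, 10, 4), (2, 20, 5),
    (3, 10, 2),
]
user, held_out, remaining = pick_holdout(toy)
```

Other users' ratings of the withheld movies stay in the set, so the item-item similarities for those movies can still be computed.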

Results

For the user with the most ratings, the two most-rated movies, American Beauty and The Matrix, were removed from the set. After the recommender had run, American Beauty was listed as the number 6 movie to recommend, while The Matrix was outside of the top 20.

Discussion

Based on this result, we conclude that while the technique has promise, it needs a very large data set to be accurate. Thus, while collaborative filtering is useful, it may be better to use other approaches such as KNN and neural networks.

[Figure: ratings count]

Future Applications

The beauty of collaborative filtering lies in its ability to generalize to other areas. This project used movies, but the same approach can be applied to restaurants or most other kinds of items. It can also use other similarity measures: we used cosine similarity, but we could just as well have used Euclidean distance between ratings. As always, it's important to know your data set.
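To make the difference between the two measures concrete, here is a small sketch comparing them on the same pair of rating vectors. The function names and the example vectors are invented for illustration.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two rating vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def euclidean_distance(a, b) -> float:
    """Straight-line distance between two rating vectors (0.0 = identical)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))


# Two movies rated in the same proportion, but one consistently higher.
low = [2, 2]
high = [4, 4]
cos = cosine_similarity(low, high)
dist = euclidean_distance(low, high)
```

Cosine similarity treats these two vectors as a perfect match because it ignores magnitude, while Euclidean distance reports a gap. Note also that a larger cosine means "more similar" whereas a larger distance means "less similar", so the sort order flips when switching measures.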