What are the differences between the online and offline testing of recommender algorithms?

Recommendation systems are essential features for any platform aiming to satisfy increasingly demanding users. To test the relevance of these recommendation systems, two methods co-exist: offline evaluation and online evaluation. How do they differ? What advantages do they offer, and what biases do they generate? Let’s talk about it!

What is the purpose of a recommender system?

Recommender systems provide users with new content, from news articles (Google News, Yahoo News) to series and films (Netflix, Disney+, Prime Video), or even products (Amazon, eBay). There are so many products and so much information available on the web today that it is impossible for any one viewer to see everything on offer. That’s where recommendations come into play, allowing products and information to be sorted by their expected relevance to the user’s interests.

All of the platforms we’ve cited use proprietary recommendation algorithms with the goal of proposing products likely to interest the user, improving customer loyalty and generating additional income for the company. Each of these platforms survives essentially on sponsored content, ads, purchases or subscriptions.

How to test and evaluate the performance of recommendation algorithms 

Recommendation algorithms are built on years of research into recommender systems. In order to carry out research and refine algorithms, the community of researchers relies on various datasets to evaluate their algorithms, such as:

  • MovieLens (Harper and Konstan, 2015): a dataset allowing researchers to experiment with movie recommendations; 
  • LastFM (Bertin-Mahieux et al., 2011): dedicated to music recommendations; 
  • CiteULike (Wang et al., 2013): a dataset for recommending bibliographic resources;
  • Delicious (Rossi and Ahmed, 2015): dedicated to recommending websites. 

Systems that are evaluated using these datasets are what we refer to as offline systems. These datasets allow researchers to estimate the performance of an algorithm as if it were used in a live setting. Thus, offline evaluation, which infers user preferences either implicitly or explicitly from historical data, allows researchers in a laboratory environment to predict future preferences.
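To make this concrete, here is a minimal sketch of how an offline evaluation typically works: a user’s history is split into a “past” used by the recommender and a held-out “future” used to score it, for example with precision@k. The item IDs and the fake recommendation list are purely illustrative, not taken from any of the datasets above.

```python
def precision_at_k(recommended, held_out, k=10):
    """Fraction of the top-k recommendations found in the held-out set."""
    hits = sum(1 for item in recommended[:k] if item in held_out)
    return hits / k

# Toy interaction log: one user's chronologically ordered items.
history = {"u1": ["m1", "m2", "m3", "m4", "m5", "m6"]}

# Hold out the last interactions as "future" preferences to predict.
train, test = history["u1"][:4], set(history["u1"][4:])

# A real recommender would rank items using `train`; we fake its output here.
recommended = ["m5", "m9", "m6", "m7", "m8"]

print(precision_at_k(recommended, test, k=5))  # 2 hits out of 5 -> 0.4
```

The key point is that the “ground truth” is whatever the user happened to interact with in the past, which is exactly the source of the biases discussed below.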

This, of course, must be contrasted with online evaluation platforms that allow researchers to test their systems in live, real-time scenarios with actual users. For example, NewsREEL (Hopfgartner et al., 2016) allowed for the testing of systems on news articles found across multiple online news outlets. This will also be the case with the Renewal platform developed by Octopeek and the Laboratoire Interdisciplinaire des Sciences du Numérique (LISN – Interdisciplinary Digital Sciences Laboratory).

What performance biases exist in offline evaluations?

The biggest advantage of offline modeling and evaluation is the speed and ease of testing algorithms against predefined datasets. However, as these results are not validated against actual users, there are a number of biases that need to be taken into account.

Partial view of a user’s interests

A look at a user’s history will only reveal a portion of their actual and potential interests. Given that any one user only has so much time to devote to viewing content, what they have viewed in the past won’t necessarily cover the entirety of their interests. In other words, the datasets used may only give you a partial image of the user.

This, in turn, means that an algorithm based entirely on a user’s history will limit their exposure to new subjects and interests, and will be limited in the diversity it is able to offer.

Echo chambers and filter bubbles

We use “filter bubbles” to describe the kind of unintended isolation that results from an oversimplification and over-specialization of content delivered by search engine results, news feeds, recommendation algorithms, or even social networks. Unfortunately, these often result in echo chambers where the user isn’t being provided with new and divergent information, but is simply having their own ideas echoed back to them by the information being served. The more a particular subject captures the interest of a user, the more the recommender system will serve the user similar content, and the less diversity they will experience.

These filter bubbles result from our tendency as a species to fall into what we call confirmation bias. The more a user believes something to be true, the more they’ll be interested in articles and opinions that confirm their beliefs, and the less they’ll be willing to entertain an opposing opinion or point of view. This is a danger in and of itself, as it can negatively impact the free will and free thinking of the individual or society at large. Disinformation, or “fake news”, spread via personalized media recommendations has been shown to correlate with the ideological segregation of the public (Flaxman et al., 2016).

Diversity, novelty, serendipity 

Recommender systems aim to compensate for these limitations by introducing three key ingredients.

Diversity: when we refer to diversity, we are talking about the intended goal of avoiding these filter bubbles by offering a wide range of content, while still maximizing the likelihood that this content is relevant to the user’s interests. It is also a way for a search engine, for example, to hedge its bets when an ambiguous search is entered. Take “Malay”, for example: depending on the user’s interests, the first results may be information or offers related to travel, anthropology, or the language.

Novelty: proposing similar content that has never before been seen by the user (an “absolute novelty”) or that is sufficiently different from content the user has already seen (a “relative novelty”). Relative novelties can be achieved by promoting similar though less popular content, referred to as long tail recommendation (Park and Tuzhilin, 2008).
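A frequently used proxy for novelty scores each recommended item by its self-information, -log2 of its popularity: the rarer an item, the more novel it is assumed to be. A minimal sketch with invented play counts (the numbers are illustrative only):

```python
import math

def novelty(recommended, popularity, total_users):
    """Mean self-information of the recommended items: rarer items score higher."""
    scores = [-math.log2(popularity[i] / total_users) for i in recommended]
    return sum(scores) / len(scores)

# Hypothetical play counts: how many of 1000 users consumed each item.
popularity = {"hit": 500, "mid": 100, "long_tail": 10}

print(novelty(["hit"], popularity, 1000))        # -log2(0.5)  = 1.0
print(novelty(["long_tail"], popularity, 1000))  # -log2(0.01) ~= 6.64
```

Boosting items with high self-information is one simple way to implement the long tail promotion mentioned above.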

Serendipity: offering interesting or relevant content that is new, unexpected, or surprising, which the user would not have found on their own. This differs from the notion of novelty in that it takes into account the question of relevance to the user’s interests.

What are the advantages of online evaluations?

Online evaluations differ from offline evaluations in that they measure the observed satisfaction of the user in real time, rather than the supposed interaction with a recommended list. This is a real-world test, and there may therefore be significant differences between the interactions a user is assumed to make and those that actually occur in a different context (with a limited number of items, or when the user is experimenting with discovering new interests).

Online evaluations present their own difficulties, beyond longer evaluation times. A major issue is defining the actual satisfaction of the user. Limiting measurement to the mere act of clicking on an item (the click-through rate, or CTR) cannot reveal whether an article was actually read, or whether the user found the recommendation relevant to their interests.

Therefore, online evaluations must rely on indicators that are more difficult to collect, such as dwell time (the time a user spends on an article or result after initially clicking it). This is one of the major metrics we take into account with the Renewal platform, developed at Octopeek. We consider a result to be relevant only after the user has spent a sufficient amount of time, defined in advance of testing, on the article. What’s more, in order to avoid false positives, we don’t simply measure how long the page stays open: it isn’t rare for a user to open a tab, switch to another, and never come back. We measure only active interaction, such as scrolling on the page. We refer to this feedback as the “Click-and-Read” rate.
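The details of the Renewal implementation aren’t shown here, but the idea can be sketched as follows: a click only counts as a “read” when the user’s actively engaged time (scrolling and interaction, not a tab left open) exceeds a threshold fixed in advance. The event schema, field names, and the 30-second threshold are all assumptions made for illustration.

```python
def click_and_read_rate(events, min_active_seconds=30.0):
    """Share of clicked recommendations where the user actively engaged
    for at least `min_active_seconds` (an illustrative threshold)."""
    clicks = [e for e in events if e["clicked"]]
    reads = [e for e in clicks if e["active_seconds"] >= min_active_seconds]
    return len(reads) / len(clicks) if clicks else 0.0

# Hypothetical log: active_seconds counts only scrolling/interaction time,
# not time a background tab merely sat open.
events = [
    {"clicked": True, "active_seconds": 45.0},   # read
    {"clicked": True, "active_seconds": 2.0},    # clicked, then bounced
    {"clicked": True, "active_seconds": 120.0},  # read
    {"clicked": False, "active_seconds": 0.0},   # shown, never clicked
]

print(click_and_read_rate(events))  # 2 reads out of 3 clicks
```

Contrast this with plain CTR, which would count the bounced click as a success.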

In order to truly develop and test a recommender system, both approaches are crucial. Offline models and datasets allow researchers to run numerous iterations of an algorithm, fine-tuning and testing under a limited set of conditions in a very short time frame. However, it is only afterwards, when running online evaluations, that the rubber really meets the road and a recommender system is put through its paces.

To learn more about online evaluation methods, and in particular about the Renewal platform developed by Octopeek, read our article “Promoting participation in machine learning algorithm challenges”.