Content recommendation by means of EEXCESS

Authors: Timo Borst, Nils Witt

Since their beginnings, libraries and related cultural institutions could rely on users visiting them in order to search, find and access their content. With the emergence and massive use of the World Wide Web and associated tools and technologies, this situation has changed drastically: if those institutions still want their content to be found and used, they must adapt to the environments in which users expect digital content to be available. Against this background, the general approach of the EEXCESS project is to ‘inject’ digital content (both metadata and object files) into users' daily environments such as browsers, authoring environments like content management systems or Google Docs, or e-learning environments. Content is not just provided, but recommended by means of an organizational and technical framework of distributed partner recommenders and user profiles. Once a content partner has connected to this framework by establishing an Application Programming Interface (API) that continuously responds to EEXCESS queries, its results are listed and merged with the results of the other partners. Depending on the software component installed either on a user’s local machine or on an application server, the list of recommendations is displayed in different ways, ranging from a classical, text-oriented list to a visualization of metadata records.

The Recommender

The EEXCESS architecture comprises three major components: a privacy-preserving proxy; multiple client-side tools for the Chrome browser, WordPress, Google Docs and more; and the central server-side component responsible for generating recommendations, called the recommender. Covering all of these components in detail is beyond the scope of this blog post. Instead, we want to focus on one component: the federated recommender, as it is the heart of the EEXCESS infrastructure.

The recommender’s task is to generate a list of objects like text documents, images and videos (hereafter called documents, for brevity’s sake) in response to a given query. The list is supposed to contain only documents relevant to the user. Moreover, the list should be ordered by descending relevance. To generate such a list, the recommender can pick documents from the content providers that participate in the EEXCESS infrastructure. Technically speaking, but somewhat oversimplified: the recommender receives a query and forwards it to all content provider systems (like Econbiz, Europeana, Mendeley and others). After receiving results from each content provider, the recommender decides in which order the documents will be recommended and returns the resulting list to the user who submitted the query.
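The oversimplified flow above can be sketched as follows. This is only an illustration, not the EEXCESS implementation: the provider functions are stand-ins with canned answers, and the names federated_search and Provider are assumptions made for this example.

```python
from typing import Callable, Dict, List

# Hypothetical provider type: takes a query string and returns a
# relevance-ordered list of document titles.
Provider = Callable[[str], List[str]]


def federated_search(query: str, providers: Dict[str, Provider]) -> Dict[str, List[str]]:
    """Forward the query to every provider and collect one result list each."""
    results: Dict[str, List[str]] = {}
    for name, search in providers.items():
        try:
            results[name] = search(query)
        except Exception:
            # Assumption: a failing provider contributes an empty list
            # instead of aborting the whole request.
            results[name] = []
    return results


# Stand-in providers (not real API clients).
providers = {
    "provider_a": lambda q: [f"{q} handbook", f"{q} survey"],
    "provider_b": lambda q: [f"{q} dataset"],
}
print(federated_search("inflation", providers))
```

The per-provider lists returned here still have to be merged into a single ranking before they can be shown to the user.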

This raises some questions. How can we find relevant documents? The result lists from the content providers are already sorted by relevance; how can we merge them? Can we deal with ambiguity and duplicates? Can we respond within reasonable time? Can we handle the technical disparities of the different content provider systems? How can we integrate the different document types? In the following, we will describe how we tackled some of these questions, by giving a more detailed explanation on how the recommender compiles the recommendation lists.

Recommendation process

If the user wishes to obtain personalized recommendations, she can create a local profile (i.e. one stored only on her device). She can specify her education, age, field of interest and location. To be clear: this is optional. If the profile is used, the Privacy Proxy [4] takes care of anonymizing the personal information. The overall process of creating personalized recommendations is depicted in the figure and described in the following.

After the user has sent a query as well as her user profile, a process called Source Selection is triggered. Based on the user’s preferences, Source Selection decides which partner systems will be queried. The reason for this is that most content providers cover only a specific discipline (see figure). For instance, a query from a user who is interested only in biology and chemistry will never receive Econbiz recommendations, whereas a query from a user interested only in politics and money will get Econbiz recommendations (at present; this may change when other content providers join). Source Selection thereby lowers the network traffic and the latency of the overall process and increases the precision of the results, at the expense of missing some results and reducing diversity. Optionally, the user can also select the sources manually.
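A minimal sketch of such a selection step is shown below. The topic table is invented for illustration (only the Econbiz behaviour is taken from the example above, and "life_science_provider" is a made-up name); falling back to all providers when no interests are given is likewise an assumption.

```python
from typing import Dict, Iterable, List, Optional, Set

# Illustrative mapping from provider to covered disciplines (assumed values).
PROVIDER_TOPICS: Dict[str, Set[str]] = {
    "Econbiz": {"economics", "politics", "money"},
    "life_science_provider": {"biology", "chemistry"},
}


def select_sources(
    interests: Iterable[str],
    provider_topics: Dict[str, Set[str]] = PROVIDER_TOPICS,
    manual: Optional[Iterable[str]] = None,
) -> List[str]:
    """Return the providers to query for a user with the given interests."""
    if manual is not None:
        # A manual selection by the user overrides the automatic one.
        chosen = set(manual)
        return [name for name in provider_topics if name in chosen]
    wanted = {interest.lower() for interest in interests}
    if not wanted:
        # Assumption: without a profile, every provider is queried.
        return list(provider_topics)
    # Keep providers whose disciplines overlap with the user's interests.
    return [name for name, topics in provider_topics.items() if topics & wanted]


print(select_sources(["biology", "chemistry"]))
print(select_sources(["politics", "money"]))
```

Filtering before querying is what saves traffic and latency: providers that are not selected are never contacted.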

The subsequent Query Processing step alters the query:

  • Short queries are expanded using Wikipedia knowledge.
  • Long queries are split into smaller queries, which are then handled separately (see [1] for more details).
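The two rules can be sketched as follows. The thresholds (min_terms, max_terms) and the small related-terms dictionary are assumptions for this example; the dictionary merely stands in for the Wikipedia-based expansion, whose actual mechanics are described in [1] and not reproduced here.

```python
from typing import Dict, List


def process_query(
    query: str,
    related: Dict[str, List[str]],
    min_terms: int = 2,
    max_terms: int = 4,
) -> List[str]:
    """Expand short queries and split long ones; return the queries to send."""
    terms = query.split()
    if len(terms) < min_terms:
        # Short query: append related terms (stand-in for Wikipedia knowledge).
        extra = [t for term in terms for t in related.get(term.lower(), [])]
        return [" ".join(terms + extra)]
    if len(terms) > max_terms:
        # Long query: split into chunks of at most max_terms terms.
        return [
            " ".join(terms[i:i + max_terms])
            for i in range(0, len(terms), max_terms)
        ]
    return [query]


related_terms = {"inflation": ["prices", "monetary"]}
print(process_query("inflation", related_terms))
print(process_query("effects of monetary policy on youth unemployment", related_terms))
```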

The queries from the Query Processing step are then used to query the content providers selected during the Source Selection step. With the results from the content providers, two post-processing steps are carried out to generate the personalized recommendations:

  • Result Processing: The purpose of Result Processing is to detect duplicates, using a technique called fuzzy hashing. The words that make up an entry of a result list are sorted, counted and truncated, and the outcome is hashed with the MD5 algorithm [2], which allows convenient comparison.
  • Result Ranking: After the duplicates have been removed, the results are re-ranked. To do so, a slightly modified version of the round robin method is used. Where vanilla round robin would just concatenate slices of the result lists (i.e. first two documents from list A + first two documents from list B + …), Weighted Round Robin modifies this behavior by taking the overlap between the query and the result’s metadata into account. That is, before the lists are merged, each individual list is modified: documents whose metadata closely match the query are promoted.
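Both post-processing steps can be illustrated with a short sketch. It is one plausible reading of the description above, not the EEXCESS code: the fingerprint uses sorted, de-duplicated and truncated title words hashed with MD5, promotion is modelled as a stable sort by query overlap, and all function names and the slice size of two are assumptions.

```python
import hashlib
import re
from typing import Dict, List


def words(text: str) -> List[str]:
    """Lower-cased word tokens of a text."""
    return re.findall(r"\w+", text.lower())


def fingerprint(title: str, max_words: int = 8) -> str:
    """Order-insensitive fingerprint: sort and truncate the words, then MD5."""
    key = " ".join(sorted(set(words(title)))[:max_words])
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def remove_duplicates(result_lists: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Drop every document whose fingerprint has already been seen."""
    seen = set()
    deduped: Dict[str, List[str]] = {}
    for source, docs in result_lists.items():
        kept = []
        for doc in docs:
            key = fingerprint(doc)
            if key not in seen:
                seen.add(key)
                kept.append(doc)
        deduped[source] = kept
    return deduped


def overlap(query: str, doc: str) -> float:
    """Share of query words that also occur in the document's metadata."""
    q, d = set(words(query)), set(words(doc))
    return len(q & d) / len(q) if q else 0.0


def weighted_round_robin(
    query: str, result_lists: Dict[str, List[str]], slice_size: int = 2
) -> List[str]:
    """Promote well-matching documents per list, then merge slice by slice."""
    # sorted() is stable, so ties keep the provider's original order.
    ordered = [
        sorted(docs, key=lambda d: -overlap(query, d))
        for docs in result_lists.values()
    ]
    longest = max((len(docs) for docs in ordered), default=0)
    merged: List[str] = []
    for start in range(0, longest, slice_size):
        for docs in ordered:
            merged.extend(docs[start:start + slice_size])
    return merged


lists = {
    "A": ["Trade policy in Europe", "Inflation and monetary policy", "Labour markets"],
    "B": ["Monetary policy and inflation", "Inflation targeting"],
}
print(weighted_round_robin("inflation policy", remove_duplicates(lists)))
```

In this example the second provider's "Monetary policy and inflation" is dropped because it has the same fingerprint as an entry from the first provider, even though the word order differs.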

Partner Wizard

As the quality of the recommended documents increases with the number and diversity of the participating content providers, a component called Partner Wizard was implemented. Its goal is to simplify the integration of new content providers to the point where non-experts can manage this process without any support from the EEXCESS consortium. This is achieved by a semi-automatic process triggered from a web frontend that is provided by the EEXCESS consortium. Given a search API, it is relatively easy to obtain search results, but the main point is to obtain results that are meaningful and relevant to the user. Since every search service behaves differently, there is no point in treating all services equally; some sort of customization is needed. That is where the Partner Wizard comes into play. It allows an employee of the new content provider to specify the search API. Afterwards, the wizard submits pre-assembled pairs of search queries to the new service. The two queries of each pair are similar but not identical, for example:

  • Query 1: <TERM1> OR <TERM2>
  • Query 2: <TERM1> AND <TERM2>.

The resulting lists are presented to the user, who has to decide which list contains the more relevant results and suits the query better (see figure). Finally, based on the previous steps, a configuration file is generated that configures the federated recommender, so that the recommender reproduces the behavior the user preferred. The wizard can be completed within a few minutes and requires only a publicly available search API.
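The idea can be sketched for the single OR/AND pair shown above. Everything here is hypothetical: the function run_wizard, the choose callback that stands in for the user's decision in the web frontend, and the configuration key "default_operator" are assumptions for this example and do not describe the actual configuration format.

```python
import json
from typing import Callable, Dict, List


def run_wizard(
    term1: str,
    term2: str,
    search: Callable[[str], List[str]],
    choose: Callable[[List[str], List[str]], int],
) -> Dict[str, str]:
    """Submit one query pair and record which variant the user prefers."""
    variants = {
        "OR": f"{term1} OR {term2}",
        "AND": f"{term1} AND {term2}",
    }
    results = {op: search(query) for op, query in variants.items()}
    # choose() returns 0 if the first list is preferred, 1 for the second.
    picked = choose(results["OR"], results["AND"])
    return {"default_operator": ["OR", "AND"][picked]}


def fake_search(query: str) -> List[str]:
    """Stand-in search service: AND queries return fewer results."""
    return ["doc 1"] if " AND " in query else ["doc 1", "doc 2", "doc 3"]


# Simulated user who prefers the shorter, more focused list.
config = run_wizard(
    "inflation", "policy", fake_search,
    choose=lambda first, second: 0 if len(first) <= len(second) else 1,
)
print(json.dumps(config, indent=2))
```

A real wizard would run several such pairs and write all recorded preferences into one configuration file.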

The project started with five initial content providers. Now, thanks to the Partner Wizard, there are more than ten content providers, and negotiations with further candidates are ongoing. Since there are almost no technical issues anymore, legal issues dominate the consultations. As all programs developed within the EEXCESS project are published under open source conditions, the Partner Wizard can be found at [3].

Conclusions

The EEXCESS project is about injecting distributed content from different cultural and scientific domains into everyday user environments, so that this content becomes more visible and more accessible. To achieve this goal and to establish a network of distributed content providers, some specification and engineering of software has to be done in addition to the various organizational, conceptual and legal aspects, not only once but also with respect to maintaining the technical components. One of the main goals of the project is to establish a community of networked information systems, with a lightweight approach to joining this network by easily setting up a local partner recommender. Future work will focus on this growing network and the increasing requirements of integrating heterogeneous content via central processing of recommendations.