by Franz Osorio, Timo Borst
Bibliometrics, scientometrics, informetrics and webometrics have long been both research topics and practical guidelines for publishing, reading, citing, measuring and acquiring published research (Hood 2001). Citation databases and measures were introduced in the 1960s and became benchmarks both for the publishing industry and for academic libraries, which manage their holdings and journal acquisitions more selectively as the number of journals grows on the one side and budgets are cut on the other. Due to the Open Access movement triggering a transformation of traditional publishing models (Schimmer 2015), and in the light of both global and distributed information infrastructures for publishing and communicating on the web that have yielded more diverse practices and communities, this situation has changed dramatically: while bibliometrics of research output in its core understanding is still highly relevant to stakeholders and the scientific community, the visibility, influence and impact of scientific results have shifted to locations on the World Wide Web that are commonly shared and quickly accessible not only by peers, but by the general public (Thelwall 2013). This has several implications for the different stakeholders who refer to metrics in dealing with scientific results:
- With the rise of social networks, platforms and their use also by academics and research communities, the term 'metrics' itself has gained a broader meaning: while traditional citation indexes only track citations of literature published in (other) journals, 'mentions', 'reads' and 'tweets', albeit less formal, have become indicators and measures for (scientific) impact.
- Altmetrics has influenced research performance, evaluation and measurement, which formerly had been exclusively associated with traditional bibliometrics. Scientists are becoming aware of alternative publishing channels and both the option and need of 'self-advertising' their output.
- In particular academic libraries are forced to manage their journal subscriptions and holdings in the light of increasing scientific output on the one hand, and stagnating budgets on the other. While editorial products from the publishing industry are exposed to a global competing market requiring a 'brand' strategy, altmetrics may serve as additional scattered indicators for scientific awareness and value.
Against this background, we took the opportunity to collect, process and display some impact or signal data with respect to literature in economics from different sources, such as 'traditional' citation databases, journal rankings and community platforms or altmetrics indicators:
- CitEc. The long-standing citation service maintained by the RePEc community provided a dump of both working papers (as part of series) and journal articles, the latter with significant information on classic impact measures such as the impact factor (2 and 5 years) and the h-index.
- Rankings of journals in economics, including the Scimago Journal Rank (SJR) and two German journal rankings that are regularly released and updated (VHB Jourqual, Handelsblatt Ranking).
- Usage data from Altmetric.com that we collected for those articles that could be identified via their Digital Object Identifier.
- Usage data from the scientific community platform and reference manager Mendeley.com, in particular the number of saves or bookmarks on an individual paper.
A major consideration for this project was finding an open environment in which to implement it. An open platform served several purposes. As a member of the "Leibniz Research Association," ZBW has a commitment to Open Science, and in part that means making use of open technologies to as great an extent as possible (The ZBW - Open Scienc...). This open system should allow direct access to the underlying data so that users are able to use it for their own investigations and purposes. Additionally, if possible, the user should be able to manipulate the data within the system.
The first instance of the project was created in Tableau, which offers a variety of means to express data and create interfaces for the user to filter and manipulate data. It also provides a way to work with the data and create visualizations without programming skills or knowledge. Tableau is one of the most popular tools for creating and delivering data visualizations, in particular within academic libraries (Murphy 2013). However, the software is proprietary and has a monthly fee to use and maintain, and it closes off the data, making only the final visualization available to users. It provided a starting point for how we wanted the data to appear to the user, but it is in no way open.
The first technical challenge was to consolidate the data from the different sources, which had varying formats and organizations. Broadly speaking, the bibliometric data (CitEc and journal rankings) existed as a spreadsheet with multiple pages, while the altmetrics and Mendeley data came from a database dump with multiple tables that were presented as several CSV files. In addition to reconciling these different formats, the data needed to be cleaned and gaps filled in. The sources also had very different scopes: the altmetrics and Mendeley data covered only 30 journals, whereas the bibliometric data covered more than 1,000 journals.
Transitioning from Tableau to an open platform was a big challenge. While there are many ways to create data visualizations and present them to users, the decision was made to use R to work with the data and Shiny to present it. R is used widely to work with data and to present it (Kläre 2017). The language has a lot of support for these kinds of tasks across many libraries. The primary libraries used were the R packages Plotly and Shiny. Plotly is a popular library for creating interactive visualizations. Without too much work, Plotly can provide features including information popups while hovering over a chart and on-the-fly filtering. Shiny provides a framework to create a web application to present the data without requiring a lot of work to create HTML and CSS. The transition required time spent getting to know R and its libraries and learning how to create the kinds of charts and filters that would be useful for users. While Shiny alleviates the need to create HTML and CSS, it does have a specific set of requirements and structures in order to function.
The final challenge was making this project accessible to users such that they would be able to see what we had done, have access to the data, and have an environment in which they could explore the data without needing anything other than what we were providing. In order to achieve this we used Binder as the platform. At its most basic, Binder makes it possible to share a Jupyter Notebook stored in a GitHub repository with a URL by running the Jupyter Notebook remotely and providing access through a browser with no requirements placed on the user. Additionally, Binder is able to run a web application using R and Shiny. To move from a locally running instance of R Shiny to one that can run in Binder, instructions for the runtime environment need to be created and added to the repository. These include information on what version of the language to use, which packages and libraries to install for the language, and any additional requirements there might be to run everything.
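To make the idea of sharing through a URL concrete, the following is a minimal sketch of how such a launch link could be assembled. It assumes the URL layout used by the public mybinder.org service for GitHub repositories and its `urlpath` query parameter; the repository owner, repository name and application folder are hypothetical placeholders and not the actual Journal Map repository.

```python
from urllib.parse import quote, urlencode

# Assumption: the public Binder service; a self-hosted instance would differ.
BINDER_HOST = "https://mybinder.org"


def binder_url(owner, repo, ref="master", urlpath=None):
    """Build a Binder launch URL for a GitHub repository.

    owner, repo and ref identify the repository and branch (or tag/commit).
    urlpath optionally selects what is opened after launch, for example an
    RStudio session or a Shiny application folder inside the repository.
    """
    base = f"{BINDER_HOST}/v2/gh/{quote(owner)}/{quote(repo)}/{quote(ref, safe='')}"
    if urlpath:
        # Keep "/" readable in the query value, e.g. "shiny/app/".
        return base + "?" + urlencode({"urlpath": urlpath}, safe="/")
    return base


if __name__ == "__main__":
    # Hypothetical repository and folder names, for illustration only.
    print(binder_url("example-org", "example-repo", urlpath="shiny/app/"))
    print(binder_url("example-org", "example-repo", urlpath="rstudio"))
```

Under these assumptions, the same repository can be offered through several links, one opening the dashboard and another opening an interactive R environment.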
Given the disparate sources and formats for the data, there was work that needed to be done to prepare it for visualization. The largest dataset, the bibliographic data, had several identifiers for each journal but no journal names. Having the journal names is important because, in general, the names are how users will know the journals. Adding the names to the data would allow users to filter on specific journals or pull up two journals for a comparison. Providing the names of the journals is also a benefit for anyone who may repurpose the data, because it saves them from having to look the names up. In order to fill this gap, we used metadata available through Research Papers in Economics (RePEc). RePEc is an organization that seeks to "enhance the dissemination of research in Economics and related sciences". It contains metadata for more than 3 million papers available in different formats. The bibliographic data contained RePEc Handles, which we used to look up the journal information as XML; we then parsed the XML to find the title of the journal. After we wrote a small Python script to go through the RePEc data and find the missing names, only 6 journals were still missing their names.
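The lookup step can be sketched roughly as follows. This is not the script used in the project: the XML layout (`journal` elements with a `handle` attribute and a `title` child), the sample handles and the `h_index` column are invented for illustration, and the real RePEc metadata is structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout standing in for the journal metadata.
SAMPLE_XML = """
<journals>
  <journal handle="RePEc:xxx:jrnla"><title>Example Journal of Economics</title></journal>
  <journal handle="RePEc:xxx:jrnlb"><title>Example Economic Review</title></journal>
</journals>
"""


def titles_by_handle(xml_text):
    """Parse the XML and return a mapping from handle to journal title."""
    root = ET.fromstring(xml_text.strip())
    titles = {}
    for journal in root.iter("journal"):
        handle = journal.get("handle")
        title = journal.findtext("title")
        if handle and title:
            titles[handle] = title.strip()
    return titles


def add_journal_names(rows, titles):
    """Attach a journal_name to each row; collect handles with no match."""
    filled, missing = [], []
    for row in rows:
        name = titles.get(row["handle"])
        filled.append({**row, "journal_name": name})
        if name is None:
            missing.append(row["handle"])
    return filled, missing


if __name__ == "__main__":
    rows = [
        {"handle": "RePEc:xxx:jrnla", "h_index": 10},
        {"handle": "RePEc:xxx:zzzzz", "h_index": 3},
    ]
    filled, missing = add_journal_names(rows, titles_by_handle(SAMPLE_XML))
    print(filled)
    print("still missing:", missing)
```

Keeping the unmatched handles in a separate list mirrors the situation described above, where a handful of journals remained without names and had to be handled separately.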
For the data that originated in a MySQL database, the major work that needed to be done was to correct the formatting. The data was provided as CSV files, but it was not formatted such that it could be used right away. Some of the fields contained double quotation marks, and when the CSV files were created those quotes were wrapped in further quotation marks, resulting in doubled quotation marks that made machine parsing difficult without intervention directly on the files. The work was to go through the files and remove the doubled quotation marks.
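One possible way to do this kind of cleanup is sketched below. It assumes that the problem looks like `""value""` around whole fields and that no field legitimately contains an empty quoted string; the exact shape of the original files is not documented here, so the sample data is invented.

```python
import csv
import io


def clean_doubled_quotes(raw_text):
    """Collapse doubled quotation marks and parse the result as CSV.

    Assumption: every '""' in the input is an artifact of the export and
    can safely be reduced to a single '"'.
    """
    cleaned = raw_text.replace('""', '"')
    return list(csv.reader(io.StringIO(cleaned)))


if __name__ == "__main__":
    # Invented sample resembling the described problem.
    raw = 'id,title\n1,""Journal of Examples""\n2,""Review, with a comma""\n'
    for row in clean_doubled_quotes(raw):
        print(row)
```

A blanket replacement like this is only safe when the assumption holds; for files that mix export artifacts with legitimately escaped quotes, a field-by-field check would be the more cautious choice.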
In addition to that, it was useful for some visualizations to provide a condensed version of the data. The data from the database was at the article level, which is useful for some things but could be time-consuming for other actions. For example, the altmetrics data covered only 30 journals but had almost 14,000 rows. We could use the Python library pandas to go through all those rows and condense the data down so that there are only 30 rows, one per journal, with each column holding the sum over all of that journal's rows. In this way, there is a dataset that can be used to easily and quickly generate summaries on the journal level.
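The condensing step amounts to a grouped sum. The sketch below shows the same aggregation using only the Python standard library rather than pandas, where a grouped sum over the journal column would express it more compactly. The column names (`journal`, `tweets`, `mentions`) and the sample rows are hypothetical.

```python
from collections import defaultdict


def condense_by_journal(rows, key="journal", metrics=("tweets", "mentions")):
    """Sum article-level metric columns into one row per journal."""
    totals = defaultdict(lambda: dict.fromkeys(metrics, 0))
    for row in rows:
        for metric in metrics:
            # Treat empty cells as zero so gaps do not break the sum.
            totals[row[key]][metric] += int(row[metric] or 0)
    # One output row per journal, sorted by journal name.
    return [{key: journal, **sums} for journal, sums in sorted(totals.items())]


if __name__ == "__main__":
    # Invented article-level rows, as they might come out of a CSV reader.
    articles = [
        {"journal": "Journal A", "tweets": "3", "mentions": "1"},
        {"journal": "Journal B", "tweets": "0", "mentions": "2"},
        {"journal": "Journal A", "tweets": "5", "mentions": ""},
    ]
    for row in condense_by_journal(articles):
        print(row)
```

Precomputing this journal-level table once means the dashboard does not have to aggregate thousands of article rows each time a summary chart is drawn.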
Shiny applications require a specific structure and set of files in order to do the work of creating HTML without needing to write the full HTML and CSS. At its most basic, a Shiny application has two main parts. The first defines the user interface (UI) of the page. It says what goes where, what kind of elements to include, and how things are labeled. This section defines what the user interacts with by creating inputs and also defining the layout of the output. The second part acts as a server that handles the computations and processing of the data that will be passed on to the UI for display. The two pieces work in tandem, passing information back and forth to create a visualization based on user input. Using Shiny allowed almost all of the time spent on creating the project to be concentrated on processing the data and creating the visualizations. The only difficulty in creating the frontend was making sure all the pieces of the UI and server were connected correctly.
Binder provided a solution for hosting the application, making the data available to users, and making it shareable, all in an open environment. Notebooks and applications hosted with Binder are shareable in part because the source is often a repository like GitHub. When a GitHub repository is passed to Binder, say one that has a Jupyter Notebook in it, Binder will build a Docker image to run the notebook and then serve the result to the user without them needing to do anything. Out of the box, the Docker image will contain only the most basic functions. The result is that if a notebook requires a library that isn't standard, it won't be possible to run all of the code in the notebook. In order to address this, Binder allows a repository to include certain files that define what extra elements should be included when building the Docker image. These can be very specific, such as stating what version of the language to use and listing the libraries that should be included to ensure that the notebook can be run smoothly. Binder also has support for more advanced functionality in the Docker images, such as creating a Postgres database and loading it with data. These kinds of activities require using different hooks that Binder looks for during the creation of the Docker image to run scripts.
Results and evaluation
The final product has three main sections that divide the data categorically into altmetrics, bibliometrics, and data from Mendeley. There are additionally some sections that exist as areas where something new could be tried out and refined without potentially causing issues with the three previously mentioned areas. Each section has visualizations that are based on the data available.
Considering the requirements for the project, the result goes a long way toward meeting them. The area in which the Journal Map most clearly succeeds is its goal of presenting the data that we have collected. The application serves as a dashboard for the data that can be explored by changing filters and journal selections. By presenting the data as a dashboard, the barrier to entry for users to explore the data is low. In addition, there is a way to access the data directly and perform new calculations or create new visualizations. This can be done through the application's access to an RStudio environment. Access to RStudio provides two major features. First, it gives direct access to all the underlying code that creates the dashboard and the data used by it. Second, it provides an R terminal so that users can work with the data directly. In RStudio, the user can also modify the existing files and then run them from RStudio to see the results. Using Binder and R as the backend of the application allows us to provide users with different ways to access and work with data without any extra requirements on the part of the user. However, anything changed in RStudio won't affect the dashboard view and won't persist between sessions. Changes exist only in the current session.
All the major pieces of this project could be built using open technologies: Binder to serve the application, R to write the code, and GitHub to host all the code. Using these technologies and leveraging their capabilities allows the project to support the Open Science paradigm that was part of the impetus for the project.
The biggest drawback to the current implementation is that Binder is a third-party host, so there are certain things that are out of our control. For example, Binder can be slow to load: on average it takes more than a minute for the Docker image to load. There's not much, if anything, we can do to speed that up. The other issue is that if an update to the Binder source code breaks something, the application will be inaccessible until the issue is resolved.
Outlook and future work
The application, in its current state, has parts that are not finalized. As we receive feedback, we will make changes to the application to add or change visualizations. As mentioned previously, there are a few sections that were created to test different visualizations independently of the more complete sections; those can be finalized.
In the future it may be possible to move from BinderHub to a locally created and administered version of Binder. There is support and documentation for creating local, self-hosted instances of Binder. Going in that direction would give us more control and may make it possible to get the Docker image to load more quickly.
While the application runs stand-alone, the data that is visualized may also be integrated in other contexts. One option we are already prototyping is integrating the data into our subject portal EconBiz, so users would be able to judge the scientific impact of an article in terms of both bibliometric and altmetric indicators.
William W. Hood, Concepcion S. Wilson. The Literature of Bibliometrics, Scientometrics, and Informetrics. Scientometrics 52, 291–314. Springer Science and Business Media LLC, 2001.
R. Schimmer. Disrupting the subscription journals’ business model for the necessary large-scale transformation to open access. 2015.
Mike Thelwall, Stefanie Haustein, Vincent Larivière, Cassidy R. Sugimoto. Do Altmetrics Work? Twitter and Ten Other Social Web Services. PLoS ONE 8, e64841. Public Library of Science (PLoS), 2013.
The ZBW - Open Science Future.
Sarah Anne Murphy. Data Visualization and Rapid Analytics: Applying Tableau Desktop to Support Library Decision-Making. Journal of Web Librarianship 7, 465–476. Informa UK Limited, 2013.
Christina Kläre, Timo Borst. Statistic packages and their use in research in Economics | EDaWaX - Blog of the project ’European Data Watch Extended’. EDaWaX - European Data Watch Extended, 2017.