Integrating a Research Data Repository with established research practices

Authors: Timo Borst, Konstantin Ott

In recent years, repositories for managing research data have emerged that are meant to help researchers upload, describe, distribute and share their data. To promote the distribution of research data in the light of paradigms like Open Science and Open Access, these repositories are normally implemented and hosted as stand-alone applications: they offer a web interface for manually uploading the data, and a presentation interface for browsing, searching and accessing the data. Sometimes the first component (the upload interface) is replaced or complemented by a submission interface for other applications. For example, in Dataverse or CKAN, data can be submitted from remote third-party applications by means of data deposit APIs [1]. However the upload is organized and embedded into a publishing framework (data either as a supplement to a journal article, or as a stand-alone research output subject to review and release as part of a ‘data journal’), it implies that the data is supposed to be made publicly available, which is often reflected in policies and guidelines for data deposit.

In clear contrast to this publishing model, the vast majority of current research data is not meant to be published, at least not in the sense of scientific publications. Several studies and surveys on research data management indicate that, at least in the social sciences, there is a strong tendency to process and share data among peers in a local and protected environment (often with several local copies on different personal devices) before eventually uploading and disseminating derivatives of this data to a publicly accessible repository. For example, according to a survey among Austrian researchers, 57% of researchers agree to share their data on request and 53% among colleagues, while only 28% agree to share via a disciplinary repository [2]. In another survey among researchers from a local university and cooperation partner, almost 70% preferred an institutional local archive, while only 10% agreed on a national or international archive. Even if data is planned to be published via a publicly accessible repository, it will first be stored and processed in a protected environment, carefully shared with peers (project members, institutional colleagues, sponsors) and often subject to access restrictions. In other words, it is used before being published.

With this situation in mind, we designed and developed a central research data repository as part of a funded project called ‘SowiDataNet’ (SDN - Network of data from Social Sciences and Economics) [3]. The overall goal of the project is to develop and establish a national web infrastructure for archiving and managing research data in the social sciences, particularly quantitative (statistical) data from surveys. It is aimed at smaller institutional research groups or teams, which often lack institutional support or infrastructure for managing their research data. As a front-end application, the repository, which is based on DSpace software, provides a typical web interface for browsing, searching and accessing the content. As a back-end application, it provides typical forms for capturing metadata and bitstreams, with some enhancements regarding the integration of authority control by means of external web services. From the point of view of the participating research institutions, a central requirement is the development of a local view (‘showcase’) on the repository’s data, so that this view can be smoothly integrated into the website of the institution. The web interface of the view is generated by means of the Play Framework in combination with the Bootstrap framework for the layout, while all of the data is requested from the DSpace backend via its Discover interface and REST API.
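How a showcase might request data from the DSpace backend can be sketched as follows. This is a minimal illustration and not the project's code: the base URL is hypothetical, and the endpoint shape (`/collections/{id}/items` with `expand`, `limit` and `offset`) and the `metadata` list of key/value entries are assumptions based on the REST API shipped with DSpace 5 and 6; the text does not state which DSpace version SowiDataNet uses.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the real SowiDataNet endpoint is not given in the text.
REST_BASE = "https://repository.example.org/rest"


def collection_items_url(collection_id, limit=20, offset=0, base=REST_BASE):
    """Build the URL for one page of items in a collection, with metadata expanded.

    Assumes the DSpace 5/6 style REST endpoint and query parameters.
    """
    query = urlencode({"expand": "metadata", "limit": limit, "offset": offset})
    return f"{base}/collections/{collection_id}/items?{query}"


def metadata_to_record(item):
    """Flatten one REST item into a dict that a showcase template could render.

    Assumes the item carries a 'metadata' list of {'key': ..., 'value': ...}
    entries; repeated keys (e.g. several subjects) are collected into lists.
    """
    record = {"name": item.get("name"), "handle": item.get("handle")}
    for entry in item.get("metadata", []):
        record.setdefault(entry["key"], []).append(entry["value"])
    return record
```

The showcase would fetch the JSON behind such a URL with any HTTP client and pass each item through `metadata_to_record` before rendering; the network call is left out here so that the sketch stays self-contained.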

Figure: SDN architecture (diagram of the SowiDataNet software components)

The purpose of the showcase application is to provide an institutional subset and view of the central repository’s data that can easily be integrated into any institutional website. It can be embedded by the institution as an iFrame (an easy rather than a satisfactory technical solution), or offered as a stand-alone subpage linked from the institution’s homepage, optionally using a proxy server to preserve the institutional domain namespace. While both of these solutions rely on the standard hosting of the showcase software, a third approach is to deploy the showcase software on an institution’s own server in order to customize the application. In this case, every institution can modify the layout of its institutional view by customizing its institutional CSS file. Because the CSS file is compiled from LESS sources using Bootstrap, a lightweight option is to modify only some LESS variables, which are then compiled into the institutional CSS file.
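The LESS-based customization could look roughly like the following sketch, which writes a small set of variable overrides to be compiled into the institutional CSS file. The variable names are taken from Bootstrap 3's `variables.less`; that the showcase uses Bootstrap 3 is an assumption, and the colour and font values as well as the helper name `render_less_overrides` are purely illustrative.

```python
def render_less_overrides(variables):
    """Render a mapping of LESS variable names to values as override declarations.

    Names may be given with or without the leading '@'. The output is sorted by
    name so that the generated file is stable across runs.
    """
    lines = []
    for name, value in sorted(variables.items()):
        lines.append(f"@{name.lstrip('@')}: {value};")
    return "\n".join(lines) + "\n"


# Hypothetical institutional values for three Bootstrap 3 variables.
overrides = render_less_overrides({
    "brand-primary": "#005b96",
    "font-family-sans-serif": '"Source Sans Pro", Helvetica, Arial, sans-serif',
    "navbar-default-bg": "#f4f4f4",
})
```

The resulting text would be saved as a small LESS file that is imported after Bootstrap's own variables, so that the institutional values win when the stylesheet is compiled.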

As a result of the requirements analysis conducted with the project partners (two research institutes from the social sciences), and in accordance with the survey results cited, there is a strong demand for managing not only data that is to be published in the central repository, but also data that is protected and circulates only among the members of the institution. Moreover, this data is described by additional specific metadata containing internal notes on availability restrictions and access conditions. Hence, we had to distinguish between the following two basic use cases to be covered by the showcase:

  • To provide a view on the public SDN data (‘data published’)
  • To provide a view on the public SDN data plus the internal institutional data and their corresponding metadata records, the latter only visible and accessible for institutional members (‘data in use’)


From the perspective of a research institution and data provider, the second use case turned out to be the primary one, since it covers institutional practices and workflows better than the publishing model does. As a matter of fact, research data is primarily generated, processed and shared in a protected environment before it may eventually be published and distributed to a wider, potentially abstract and unknown community. This fact must be acknowledged and reflected by a central research data repository that relies on contributions from researchers who are bound to an institution.

If ‘data in use’ is to be integrated into the showcase as an internal view on protected data shared only within an institution, access to this data has to be restricted on several levels. First, for every community (in the sense of an institution), we introduce a DSpace collection for just those internal data and protect it by assigning it to a DSpace user role ‘internal[COMMUNITY_NAME]’. This role is associated with an IP range, so that only requests from that range are assigned to the role ‘internal’ and granted access to the internal collection. In the context of our project, we enter only the IP of the showcase application, so that every user of this application will see the protected items. Depending on where the showcase application or its server is located, further steps are required. If it is located in the institution’s intranet, the protected items are only visible and accessible from the institution’s network. If the application is hosted externally and accessible via the World Wide Web, which is expected to be the default solution for most of the research institutes, then the showcase application needs an authentication procedure. This is preferably realized by means of the central DSpace SowiDataNet repository, so that every user of the showcase application is granted access by becoming a DSpace user.
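The IP-based role assignment described above can be illustrated with a small sketch. It is not DSpace's implementation, only the decision logic under stated assumptions: the role name, the IP address of the showcase server and the `required_role` field are hypothetical.

```python
from ipaddress import ip_address, ip_network

# Hypothetical configuration: role name -> networks whose requests receive that role.
# Only the showcase server's address is entered, as described in the text.
ROLE_NETWORKS = {
    "internal[INSTITUTE_A]": [ip_network("192.0.2.10/32")],
}


def roles_for_request(remote_addr, role_networks=ROLE_NETWORKS):
    """Return the set of roles granted to a request coming from remote_addr."""
    addr = ip_address(remote_addr)
    return {
        role
        for role, networks in role_networks.items()
        if any(addr in network for network in networks)
    }


def can_read(collection, roles):
    """Public collections are readable by anyone; internal ones need the matching role."""
    required = collection.get("required_role")
    return required is None or required in roles


internal = {"name": "Internal data", "required_role": "internal[INSTITUTE_A]"}
public = {"name": "Published data", "required_role": None}
```

A request from the showcase server (`192.0.2.10` in this sketch) obtains the internal role and can read both collections, while a request from any other address sees only the public one. Since every user of the showcase shares the server's IP, an externally hosted showcase still needs the separate authentication step mentioned above.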

In the context of an R&D project in which we partner with research institutes, it turned out that the management of research data is twofold: while repository providers focus on publishing and unrestricted access to research data, researchers are mainly interested in local archiving and sharing of their data. To manage this data, the researchers’ institutional practices need to be reflected and supported. For this purpose, we developed an additional viewing and access component. When it comes to integration with existing institutional research practices and workflows, the implementation of research data repositories requires concepts and actions that go far beyond the original idea of a central publishing platform. Further research and development is planned in order to better understand and support the sharing of data in both institutional and cross-institutional subgroups, so that integration with a public central repository is fostered.

Link to prototype

References

[1] Dataverse Deposit-API. Retrieved 24 May 2016, from http://guides.dataverse.org/en/3.6.2/dataverse-api-main.html#data-deposit-api
[2] Forschende und ihre Daten. Ergebnisse einer österreichweiten Befragung – Report 2015. Version 1.2 - Zenodo. (2015). Retrieved 24 May 2016, from https://zenodo.org/record/32043#.VrhmKEa5pmM
[3] SowiDataNet project homepage. Retrieved 24 May 2016, from https://sowidatanet.de/
[4] Research data management survey: report - Nottingham ePrints. (2013). Retrieved 24 May 2016, from http://eprints.nottingham.ac.uk/1893/
[5] University of Oxford Research Data Management Survey 2012 : The Results | DaMaRO. (2012). Retrieved 24 May 2016, from https://blogs.it.ox.ac.uk/damaro/2013/01/03/university-of-oxford-research-data-management-survey-2012-the-results/