Project Acronym / RDA Europe
Project Title / Research Data Alliance Europe
Project Number / 312424
Deliverable Title / First year report on RDA Europe analysis programme
Deliverable No. / D2.5
Delivery Date
Author / Herman Stehouwer, Peter Wittenburg
ABSTRACT
All detailed analysis results of the data architectures and organizations of the communities studied and a description of possible generalizations towards solutions for integration and interoperability to foster the RDA Europe Forum and RDA discussions.
DOCUMENT INFORMATION
PROJECT
Project Acronym / RDA Europe
Project Title / Research Data Alliance Europe
Project Start / 1st September 2012
Project Duration / 24 months
Funding / FP7-INFRASTRUCTURES-2012-1
Grant Agreement No. / 312424
DOCUMENT
Deliverable No. / D2.5
Deliverable Title / Second year report on RDA Europe analysis programme
Contractual Delivery Date / 10 2014
Actual Delivery Date / 10 2014
Author(s) / Herman Stehouwer, MPG; Peter Wittenburg, MPG
Editor(s) / Herman Stehouwer, MPG; Peter Wittenburg, MPG
Reviewer(s)
Contributor(s) / Rob Baxter, Diana Hendrix, Eleni Petra, Fotis Karagiannis, Daan Broeder, Marina Boulanov, Leif Laaksonen, Francoise Genova, Gavin Pringle, Franco Zoppi, Constantino Thanos, Herman Stehouwer, Peter Wittenburg
Work Package No. & Title / WP2 Access and Interoperability Platform
Work Package Leader / Peter Wittenburg – MPI-PL
Work Package Participants / CSC, Cineca, MPG, EPCC, CNRS, STFC, UM, ACU, ATHENA, CNR
Estimated Person Months / 16
Distribution / public
Nature / Report
Version / Revision / 1.0
Draft / Final / Draft
Total No. Pages (including cover) / 47
Keywords
DISCLAIMER
RDA Europe (312424) is a Research Infrastructures Coordination and Support Action (CSA) co-funded by the European Commission under the Capacities Programme, Framework Programme Seven (FP7).
This document contains information on RDA Europe (Research Data Alliance Europe) core activities, findings and outcomes and it may also contain contributions from distinguished experts who contribute as RDA Europe Forum members. Any reference to content in this document should clearly indicate the authors, source, organization and publication date.
The document has been produced with the funding of the European Commission. The content of this publication is the sole responsibility of the RDA Europe Consortium and its experts, and it cannot be considered to reflect the views of the European Commission. The authors of this document have taken every available measure in order for its content to be accurate, consistent and lawful. However, neither the project consortium as a whole nor the individual partners that implicitly or explicitly participated in the creation and publication of this document hold any sort of responsibility for consequences that might occur as a result of using its content.
The European Union (EU) was established in accordance with the Treaty on the European Union (Maastricht). There are currently 27 member states of the European Union. It is based on the European Communities and the member states’ cooperation in the fields of Common Foreign and Security Policy and Justice and Home Affairs. The five main institutions of the European Union are the European Parliament, the Council of Ministers, the European Commission, the Court of Justice, and the Court of Auditors (http://europa.eu.int/).
Copyright © The RDA Europe Consortium 2012. See https://europe.rd-alliance.org/Content/About.aspx?Cat=0!0!1 for details on the copyright holders.
For more information on the project, its partners and contributors please see https://europe.rd-alliance.org/. You are permitted to copy and distribute verbatim copies of this document containing this copyright notice, but modifying this document is not allowed. You are permitted to copy this document in whole or in part into other documents if you attach the following reference to the copied elements: “Copyright © The RDA Europe Consortium 2012.”
The information contained in this document represents the views of the RDA Europe Consortium as of the date they are published. The RDA Europe Consortium does not guarantee that any information contained herein is error-free, or up to date. THE RDA Europe CONSORTIUM MAKES NO WARRANTIES, EXPRESS, IMPLIED, OR STATUTORY, BY PUBLISHING THIS DOCUMENT.
GLOSSARY
ABBREVIATION / DEFINITION
RDA Europe / Research Data Alliance Europe
OAI-PMH / Open Archives Initiative Protocol for Metadata Harvesting
CSC / Finnish IT Centre for Science
UM / Maastricht University
MPI-PL / Max Planck Institute for Psycholinguistics
CLST / Centre for Language and Speech Technology
RU / Radboud University
CNRS / Centre National de la Recherche Scientifique
ENVRI / Common Operations of Environmental Research infrastructures
TNO / Dutch Organisation for Applied Research
E-IRG / E-Infrastructure Reflection Group
EEF / European E-Infrastructure Forum
ESFRI / European Strategy Forum on Research Infrastructures
ACU / Association of Commonwealth Universities
CERN / European Organization for Nuclear Research
MPG / Max Planck Gesellschaft
TABLE OF CONTENTS
1 Executive Summary (to be done)
2 Introduction
3 Science Workshop Recommendations
3.1 General Observations
3.2 Sharing and Reuse of Data
3.3 Publishing and Citing Data
3.4 Infrastructure and Repositories
3.5 Conclusions
4 Analysis Programme Recommendations
5 New Observations
5.1 Process Model
5.2 Observations made
5.3 Overall Conclusions
5.4 Concurrence of RDA Activities
6 Recommendations
Appendix A. RDA/Europe and Max Planck Society Science Workshop on Data
A.1 Background and Aims of the Workshop
A.2 General Observations
A.3 Sharing and Re-use of Data
A.4 Publishing and Citing Data
A.5 Infrastructures and Repositories
A.6 Spectra of Data
A.7 Conclusions and Recommendations for RDA
A.8 Participants
Appendix B. List of Attended Community events
Appendix C. List of Interviews
Appendix D. Big Data Analytics
1 Executive Summary (to be done)
2 Introduction
One of the major action lines within the European iCORDI project (now called RDA/EU) was the analysis of the current data landscape in the various research communities and disciplines. This was seen as one of the core sources of both motivation and opportunity to kick off concrete activities within the RDA context. We feel that this process was indeed clearly helpful and some urgent issues that stemmed from data analysis are consequently being addressed within RDA groups. The first deliverable from this activity was written at an early stage and was therefore based on a limited number of interviews within iCORDI[1]. This follow-up deliverable is built on:
· 24 Interviews done in iCORDI;
· 17 Interviews obtained from the EUDAT project on understanding communities’ data organization;
· 9 Interviews obtained from the Radieschen[2] project (a German-funded project);
· Interactions at more than 50 community meetings, many attended by the editors[3];
· The results of the first Science Workshop organized by RDA/EU and the Max Planck Society (see Appendix A).
The combination of these five sources of information gives us access to a large amount of information on data practices in many different scientific disciplines, in different organizational contexts, in different initiatives, and at different maturity levels of the data process. We can assume that the results of this analysis will have a substantial impact on the work of RDA, and also on the design and funding of research infrastructures. It should be noted that this report is meant to give insight into data practices as they are used within the research communities and that it is not meant to indicate new concepts and ideas from advanced technology research[4].
However, even though we have achieved broad coverage we cannot claim to be comprehensive in our description of data landscapes and organizations. There are two major limiting factors: 1) there was only a limited amount of time available for each interview and interaction, i.e. not all aspects could be covered in great detail; and 2) the conversion from interview to interview transcript and from interview transcript to extracted observations had to be done manually, i.e. it is influenced by the interviewers’ and editors' biases.
Before continuing, let us briefly outline the method chosen to come to what we call “observations” in Chapter 5:
· A group of people (contributors, editors) interviewed community experts guided by an underlying questionnaire.
o The reports of these interviews are widely available and can be looked at again.
· Additional interview reports were collected from the EUDAT and Radieschen projects.
· The authors and contributors extracted key points from across the interviews and from notes from additional interactions with community experts at various community meetings.
· The key points were aggregated, classified and combined, resulting in the observations.
One interesting point to note is that interviews and interactions with companies did not prove very useful. We conjecture that the main reason is that companies tend to argue that they can do everything, have all the necessary know-how and possess ready-made platforms. What is often ignored is that software technology and expertise can solve many problems of a technological nature, but that there is a price to be paid. It is evident from the interviews that dependence on commercial solutions has significant consequences: the solutions cannot easily be changed, while business interests take precedence over the willingness to innovate.
The structure of this document is as follows. In Chapter 3 we present the conclusions that came out of the Science Workshop that took place in February 2014 in Munich. In Chapter 4 we summarize the results from our first deliverable, which was in many ways preliminary. In Chapter 5 we present all observations that we could extract from the available material at this stage. We summarize them into a number of key observations in Section 5.3 and compare the observations with the current activities in RDA working groups and interest groups in Section 5.4. In Chapter 6 we conclude with recommendations.
3 Science Workshop Recommendations
RDA Europe, together with the Max Planck Society, organized a workshop involving leading European scientists[5] with a broad interest in data. The goal of this workshop was to understand which opportunities, challenges and concerns researchers have in relation to research data while conducting their research, both currently and in the future.
The two-day workshop fostered exchange and interaction on a wide range of topics that included Sharing and Re-use of Data, Publishing and Citing Data, Infrastructures and Repositories, and Data Continua. These discussions enabled the identification of a number of issues viewed as essential in helping to achieve the RDA vision of researchers and innovators openly sharing data across technologies, disciplines, and countries to address the grand challenges of society.
As mentioned above, several topics were covered during the workshop. We group the outcomes into the following four areas:
1. General Observations.
2. Sharing and Reuse of Data.
3. Publishing and Citing Data.
4. Infrastructure and Repositories.
Below we go through each of these areas in order and briefly summarize the outcomes for each. Major recommendations follow at the end of this section.
3.1 General Observations
It is very clear that the many new possibilities in data generation are at the source of a number of major challenges. We need smarter algorithms, processes and automated workflows in order to keep on top of the generated data. At this point our ability to generate data far outstrips our ability to process data.
When dealing with larger volumes of data, we need more systematic solutions to process the data in order to have reproducible science. By that we mean systems that cater for the multidisciplinary approaches inherent in modern scientific practice. We note that when taking in multiple types of data with many differing properties, processing leads to a complex adaptive system in which sociological hurdles play an important role. Currently, considerable effort is spent on reusing and combining different data sources.
It is clear that in order to deal with the increasing complexity and cost of combining data we need automated workflows that can cope with increasing demands. For this to work, we need to stop relying on one-shot solutions for data exchange and interoperability.
3.2 Sharing and Reuse of Data
Reuse and sharing of data are problematic for a number of reasons. One reason has to do with our inability to explore/find/collect the data, i.e. lack of visibility due to insufficient descriptive metadata, or lack of inclusion in catalogues that are used by search engines. Other reasons arise from a lack of cross-discipline methods that scale, and data mapping difficulties – in other words, the lack of a common semantic base. A further complicating issue is that of trust: can you trust the identity, integrity, authenticity and seriousness of all actors involved in the production chain?
3.3 Publishing and Citing Data
From the discussions it is clear that publishing results and citing those results are at the core of the scientific process. Due to the increasing relevance of data, data needs to become a first-class citizen, i.e. data publications need to be impactful.
In order for data citation to work, the mechanisms used to cite data need to be stable and in place, i.e. the worldwide accessible PID systems must work. We note that in order to be able to retrieve data later, a stable infrastructure must be in place that makes not only the identifiers, but also the data and the attributes of the data, available. This comes at a considerable cost.
3.4 Infrastructure and Repositories
We need infrastructures that connect more seamlessly and more efficiently; however it is not clear how to get there. The components of such infrastructures need to be trusted, including persistent repositories, PID systems, etc.
The participants support open access as a general principle.
It is a major advantage to offer services on the data rather than the data as such. However, these services need to provide alternative views on the data and must not restrict usage of the data.
Existing repositories and infrastructures will need to be included in future infrastructures. It is hard to find infrastructures that are available; we will need registries and catalogues in order to find the services that we need.
3.5 Conclusions
Overall, the Science Workshop drew the following conclusions.
· Infrastructures must work and be persistent and sustainable, i.e. the infrastructure must still function in the same manner after a number of years.