Notes from the Dynamic Data Citation Break Out session at the 4th RDA Plenary in Amsterdam

Tuesday, 23rd September 2014; 14.00 – 15.30

Attendance: 30-40 people [forgot to send around attendance list]

Session chair(s):

Andreas Rauber (Vienna University of Technology)

Ari Asmi (University of Helsinki)

Dieter van Uytvanck (Max Planck Institute)

Recorder:

Patricia Herterich (CERN)

Agenda:

  • Intro: Goals, approach and progress so far
  • Reports from pilots
      • Results from the NERC workshop
      • CLARIN presentation
      • MSD presentation
      • XML data
      • National Supercomputing Facility Australia
      • Results from the VAMDC workshop
      • <other?>
  • Open challenges
      • Time stamping and versioning
      • Hash Key calculation / Result set verification
  • Data Citation Zoo and request for examples
  • Discussion & future plans

Notes:

Agenda point “Introduction”:

The session started with a welcome by the co-chairs. Andreas gave a presentation introducing the session’s agenda; the main focus of the session was to give reports from the first pilots that were developed. He also gave a short overview of the WG’s “history” and goals. The WG was endorsed in March 2014 and works on machine-actionable citations of dynamic datasets and their arbitrary subsets, NOT on metadata for these citations, landing pages etc. (these topics are handled by other WGs/IGs). The goal is to develop concepts and recommendations, some of which are currently being evaluated, both conceptually and through implementations.

The approach the WG chose is to work with time-stamped and versioned data in the database and to assign PIDs to time-stamped query/selection expressions, which also allows comparing the cited result set with the current version of the results in the database.
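For illustration, a minimal sketch of what such a citation record behind a PID might contain (the field names, the PID and the query below are hypothetical, not taken from any pilot):

    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass
    class CitedQuery:
        """Illustrative record behind a PID for a dynamic-data citation
        (all field names are made up for this sketch)."""
        pid: str               # persistent identifier assigned to the citation
        query_text: str        # the selection expression defining the subset
        executed_at: datetime  # time stamp fixing which data version is cited
        result_hash: str       # fingerprint of the result set at execution time

    # The landing page behind the PID would expose this information, so the
    # query can later be re-executed against the versioned database and the
    # re-computed result fingerprint compared with the stored one.
    cited = CitedQuery(
        pid="hdl:12345/example-subset",                       # hypothetical PID
        query_text="SELECT * FROM readings WHERE value > 3.5",
        executed_at=datetime(2014, 9, 23, 14, 0, tzinfo=timezone.utc),
        result_hash="sha256:placeholder",                     # computed at citation time
    )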

Question from the audience: Why do you need versioned data if you use a PID that is time-stamped?

Andreas clarified that it is the database that needs to be versioned for the approach to work. He also defined a versioned database as the WG understands it: a database where nothing gets deleted and where every insertion, update or deletion is annotated with a timestamp.
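As a toy sketch of such a versioned database (using SQLite purely for illustration; the table, columns and time stamps are made up), every operation only opens or closes a validity interval, so a cited query can later be re-executed as of its time stamp:

    import sqlite3

    # Versioned table: rows are never physically deleted, every change only
    # opens or closes a validity interval.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE readings (
            id         INTEGER,
            value      REAL,
            valid_from TEXT NOT NULL,  -- time stamp of the insertion/update
            valid_to   TEXT            -- NULL while the row is current
        )
    """)

    # Insertion: add a row that is valid from now on.
    conn.execute("INSERT INTO readings VALUES (1, 4.2, '2014-09-01T10:00:00Z', NULL)")

    # Update: close the old state and insert the new one; nothing is deleted.
    conn.execute("UPDATE readings SET valid_to = '2014-09-10T08:00:00Z' "
                 "WHERE id = 1 AND valid_to IS NULL")
    conn.execute("INSERT INTO readings VALUES (1, 4.5, '2014-09-10T08:00:00Z', NULL)")

    # Deletion: only mark the row as no longer valid.
    conn.execute("UPDATE readings SET valid_to = '2014-09-20T12:00:00Z' "
                 "WHERE id = 1 AND valid_to IS NULL")

    # Re-execute a cited query "as of" its time stamp to reproduce exactly
    # the subset that was cited.
    as_of = "2014-09-15T00:00:00Z"
    rows = conn.execute(
        "SELECT id, value FROM readings "
        "WHERE valid_from <= ? AND (valid_to IS NULL OR valid_to > ?)",
        (as_of, as_of),
    ).fetchall()
    print(rows)  # -> [(1, 4.5)]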

This was followed by a remark from the audience that the approach is based on a strong assumption that there is a time-stamped database running in the background. A different person in the audience mentioned that for a repository to allow versioning there is always some time-stamping needed (at least in the case of SQL databases).

A member of the audience raised not a technical question but a social issue: if you have the query and the data resulting from it, how do you know that this is the data the researchers actually use later on?

Ari answered that this is the researcher’s problem and the WG just tries to help them. Andreas added that the same problem occurs for articles, as no one can prove whether an article was actually read when it gets cited. The co-chairs stated that the WG provides support for reproducibility but cannot prevent fraud. Andreas illustrated this with the example of a music database where users could de-select results from the result list, so that the stored query included these manual deselections. The WG’s results try to state the semantics as clearly as possible; the actual analysis is another issue, though.

Another person in the audience asked what happens in the case of institutions that have a backup policy stating that data needs to be archived, and thus put away, after three years. Dieter agreed that it does happen that data needs to be removed and that the WG has to address such real-life issues. Removals have to be made clear in the database, though, so that others who try to replicate a result know about the changes. Andreas added that data being moved to a backup just means that one might not be able to re-execute the query immediately and that doing so might come with some costs.

Another question was whether the goal is to define a high-level object that encapsulates the query; do researchers really use complicated SQL queries? Andreas clarified that in the approach taken by the WG, researchers can use whatever workbench is available to them; only the process in the background gets frozen.

That brought up the question of whether the WG will try to define a standard way to describe the query. Andreas thought this was an interesting idea, but data centres are too specific for this to be feasible; the WG takes one step at a time.

One person in the audience wanted clarification on whether the query would get a PID or be the PID. Andreas made clear that a PID is assigned which resolves to a landing page that contains the query.

Geoffrey Bilder from CrossRef asked about the alternative approach of taking snapshots of databases and citing the snapshots. It would not be dynamic then, but taking a snapshot of the database together with the query might be easier for data centres, as they would only have to make these available. He wanted to know which models the WG’s decision for its approach is based on. Andreas justified the approach by noting that storing the query is always smaller than storing the entire database, and that taking snapshots does not scale. If a generic solution is wanted that works for small and big databases and datasets alike, storing snapshots is not the answer, as it only works for small databases/datasets.

Adam Farquhar (BL) highlighted that the approach comes with two aspects: the assignment of a PID for a query and a dataset, which goes into researchers’ reference lists, and the actual implementations, which come with design decisions data centres have to make to serve their users.

At this point, Ari stopped the discussion to return to the agenda and not run out of time for the presentations. Andreas quickly summarized that the deliverables coming from the WG will be requirements for functionality on a fundamental level as well as pilots serving as reference implementations for different data centres; based on these, recommendations will be written. At the end of the session, open issues such as time stamping or hash keys should be discussed.

Agenda point “Results from pilots”:

John Watkins from NERC reported on a workshop with NERC members held at the BL in London where they worked through the WG’s approach to data citation [for detailed information see the slides]. As a result of this workshop, three use cases will work on implementations, seeing what they can do with the resources they have. Stefan Pröll will publish a paper, and the ARGO buoy network will propose an implementation.

There was a question from the audience regarding the snapshots SeaDataNet currently takes, which were mentioned in the talk. John stated that they currently just take a snapshot of the data as it is taken and assign a PID to it, as that is as good as they can get at the moment. He was not sure whether that approach can be scaled, though.

The second pilot, presented by Dieter, was the CLARIN example [see slides for more]. Field linguistic data comprise newspaper corpora and video recordings, but mainly transcriptions, i.e. manually crafted data. These are fairly labour-intensive to create, and old and new versions need to be interlinked. Sociological issues connected to the data include researchers being afraid to publish them at an early stage and the default setting being that old versions are not open. Some examples include the possibility to create your own collections and to reference them as queries with plain-text strings, PIDs or URLs.

The audience raised the question of the instability of the queries and their versioning. Dieter answered that the query should still be stored even if one does not know where to access it. In cases where the query is a sophisticated piece of software, there are approaches in which the programme is sent to the data. Andreas added that they are working on a separate project whose goal is to preserve the whole processing environment, maybe even including the hardware.

Adrian Burton from ANDS presented the challenges ANDS faces with the data that comes out of the national computing infrastructure in Australia. They are discussing issues such as granularity and versioning, and are aware that the answers to these questions are not only technical but also social. He admitted that they currently have more questions than answers and would like to be part of the conversation and get help from other WG members, e.g. through a workshop in Australia. Ari added that, based on the co-chairs’ experiences, a face-to-face workshop is definitely the most effective way to come to solutions.

Next, Stefan presented some prototypes for handling CSV data files and XML files (for details see slides). CSV files were the most requested file format to test in previous meetings, as they are commonly used. They can be dynamic in a database since researchers can upload new versions. The CSV pilot works with a query store in which the metadata of a query are stored and queries are normalized, so the system detects when results stay the same and no new PID needs to be assigned. The pilot is built modularly to avoid interfering with other systems. Currently it is a rough prototype which still needs an interface and more applications. Asked whether CSV allows creating a schema, Stefan clarified that only a generic relational table schema is created and no normalization functionality is available, so writing a query needs some expert knowledge.
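To illustrate the idea of query normalization and result verification described above, here is a rough sketch; the normalization rules, the store layout and the function names are invented for this illustration and are far simpler than in the actual prototype:

    import hashlib
    import re

    def normalize_query(sql: str) -> str:
        """Collapse whitespace, drop a trailing semicolon and lower-case the
        text so trivially different spellings of the same query match."""
        return re.sub(r"\s+", " ", sql.strip().rstrip(";")).lower()

    def result_hash(rows) -> str:
        """Fingerprint of the sorted, serialized result set."""
        digest = hashlib.sha256()
        for row in sorted(map(repr, rows)):
            digest.update(row.encode("utf-8"))
        return digest.hexdigest()

    # (normalized query, result fingerprint) -> PID
    query_store = {}

    def cite(sql: str, rows, mint_pid) -> str:
        """Return the existing PID if the same query still yields the same
        results; otherwise mint a new one (mint_pid is a hypothetical callback)."""
        key = (normalize_query(sql), result_hash(rows))
        if key not in query_store:
            query_store[key] = mint_pid()
        return query_store[key]

    # Hypothetical usage:
    # pid = cite("SELECT * FROM t WHERE x > 1;", rows, mint_pid=lambda: "hdl:12345/1")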

The XML pilot works very similarly to the CSV one. Different approaches are currently being tried and, similar to the CSV prototype, it still lacks a user-friendly interface and more use cases. The XML pilot does not use a relational database but native XML databases such as BaseX. Versioning is currently achieved either by copying branches or by introducing parent-child relationships. The pilot uses XPath only, to see whether the basic principles can be applied in a native XML environment.
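As a toy illustration of versioning by copying branches and then pinning a version via a path expression (using Python’s ElementTree as a stand-in for a native XML database such as BaseX; element and attribute names are made up):

    import copy
    import xml.etree.ElementTree as ET

    # Toy document standing in for a record in a native XML database.
    root = ET.fromstring(
        "<corpus>"
        "<text id='t1' version='1' valid_from='2014-09-01'>old transcription</text>"
        "</corpus>"
    )

    # "Versioning by copying branches": instead of overwriting the element,
    # copy it, bump the version, time-stamp it and keep the old branch.
    old = root.find(".//text[@version='1']")
    new = copy.deepcopy(old)
    new.set("version", "2")
    new.set("valid_from", "2014-09-20")
    new.text = "corrected transcription"
    root.append(new)

    # A citation can then pin the version it refers to via the path expression.
    print(root.find(".//text[@version='1']").text)  # -> old transcription
    print(root.find(".//text[@version='2']").text)  # -> corrected transcription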

The last pilot of the session was presented by Andreas on behalf of Carlo Maria Zwölf from the Virtual Atomic and Molecular Data Centre (for more see slides), who could not attend the meeting in person. The pilot uses a federated system with different technologies in the background; thus, no time stamping of queries or query stores are used. However, updates do get time stamps, and a link structure between versions, a “family tree of versions”, exists.
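One possible way to picture such a “family tree of versions” is a simple linked structure in which every update carries a time stamp and a pointer to the version it derives from; the identifiers and the structure below are purely illustrative and not the actual VAMDC implementation:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class DatasetVersion:
        """Node in a version 'family tree': every update carries a time stamp
        and a link to the version it derives from (illustrative only)."""
        pid: str
        updated_at: str                                   # time stamp of the update
        parent: Optional["DatasetVersion"] = None
        children: List["DatasetVersion"] = field(default_factory=list)

        def derive(self, pid: str, updated_at: str) -> "DatasetVersion":
            child = DatasetVersion(pid, updated_at, parent=self)
            self.children.append(child)
            return child

    # Hypothetical identifiers; walking the parent links reconstructs the
    # lineage of a cited version.
    v1 = DatasetVersion("vamdc:record/1", "2014-01-10")
    v2 = v1.derive("vamdc:record/2", "2014-05-02")
    v3 = v2.derive("vamdc:record/3", "2014-09-15")

    lineage, node = [], v3
    while node is not None:
        lineage.append(node.pid)
        node = node.parent
    print(" <- ".join(lineage))  # vamdc:record/3 <- vamdc:record/2 <- vamdc:record/1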

The co-chairs were pleased with the progress: at the last plenary only two or three implementations were shown, whereas now the examples filled a whole session. Andreas asked the audience to share their experiences, too. One person reported a similar problem encountered with their very specialized database. They run a commercial text-mining tool that uses a corpus which is periodically updated. As the corpus needs to be referred to, they independently chose a query-based approach for referencing as well. The retrieval function can be a query as well as a computation over the data. The size of their dataset and the user expectations did not allow snapshotting.

One session attendee wanted to know more about the WG’s strategy, as all the examples that were presented were databases where the approach worked. He inquired about strategies for adapting it to different use cases. Andreas replied that they would like to test the approach as broadly as possible with any kind of data. Ari added that it will be a matter of time and effort to see how many examples they can collect.

Another example shared by the audience came from the seismology community, which has two main use cases. The first is the production of streaming data at a very high frequency, with updates coming every few seconds, so that no intermediate steps are possible to handle or process the data. The second is long-term archiving. They would like to see a homogeneous way to handle both use cases. A time stamp might be a good indicator for different versions, but you run into trouble when you use federated systems. Andreas suggested as a solution to have each federated data centre run on its own time and to use local timestamps.

To end the session, the co-chairs highlighted the call for more examples, especially negative ones where the approach does not work.