An automated workflow to publish accessible scientific papers: integrating Daisy Pipeline within DSpace


Mireia Ribera

Universitat de Barcelona

Department of Library and Information Science

Mehrad Golkhosravi

Universitat de Barcelona

Department of Library and Information Science


Abstract

This paper presents a prototype plug-in, Daisy4DSpace, that uses Daisy Pipeline tools to integrate automated format conversions within DSpace (an open source digital repository for scientific articles). Daisy4DSpace consists of new DSpace mediafilter Java classes which call different Daisy Pipeline transformation scripts depending on the source format of the submitted article. This simple solution generates benefits for all stakeholders: the automatic generation of alternate versions is useful for disabled end users; repository holders have a low-cost conversion tool; and accessibility advocates have a demonstration platform for DAISY publication.

Keywords

Digital accessibility; adaptive content processing; DSpace; Daisy Pipeline; XML transformations; open source repositories; scholarly communication; stakeholder model of accessibility

Introduction

Digital accessibility is a prerequisite for an inclusive information society. Any developed society needs responsible citizens who engage in lifelong learning and are able to make sound decisions, and this is only possible when people are well informed. Everybody deserves access to scientific and technical knowledge, which is mainly disseminated through scientific articles. The open access movement (Budapest Open Access Initiative 2002) has long fought for this right, but has unfortunately overlooked disabled users. As governments become more aware of the needs of the disabled community and enact regulations, there is an increasing demand for all public services to conform to established accessibility standards.

Open access digital repositories are no exception: in many countries they will be required to follow the Web Content Accessibility Guidelines 2.0[1] in the near future. Currently, their contents do not conform to these guidelines, which may result in denial of service by governmental agencies and a lack of accessibility for all people (Kelly, 2006). The tool presented in this paper offers a practical solution to this problem.

Related research

There are similar studies concerning the organizational and technical aspects of accessible content processing:

  • On the organizational side, Brian Kelly and David Sloan, from UKOLN and the University of Dundee respectively, introduced new views in the accessibility field, emphasizing the importance of workflows, processes and the integration of solutions (Kelly, 2007). David Crombie, from EUAIN, led the development of the CEN workshop agreement on content processing accessibility (CWA 15778:2008), which shows many scenarios and contexts where a new direction to promote accessibility was needed.
  • On the technical side, Braillenet, the organization for blind people in France, developed various XSLT transformations in the Helene Server for the automatic creation of accessible content (Guillon 2004). Duarte and Carriço (2006) and other members of LaSIGE, at the University of Lisbon, developed an interesting framework for managing DAISY through XML processing. Lemon8 (Lemon8-XML), a development of the Public Knowledge Project, works with the USA National Library of Medicine to create a specialized repository designed to make it easier for non-technical editors and authors to convert scholarly papers from typical word-processor formats into XML.

Methodology

The plug-in (Daisy4DSpace) has been developed following the spiral model of software development. The tool presented here is only the first prototype of the system; it will be followed by an evaluation, after which the typical development life cycle will start again in the next phase of the spiral.

We chose DSpace as the base for our development for several reasons. Firstly, DSpace is the most popular digital repository; it is used in many prestigious universities (DOAR 2008) and has a very active user community. Secondly, it is functionally complete, follows the OAIS standard (OAIS 2002) and offers a high level of customization. Finally, it is an open source project written in Java, a very common language that works well with XML, a key format for digital documents.

Architecture of the system

The prototype is integrated in DSpace as a batch processing tool. On the one hand, DSpace offers authors a platform to submit originals in different word processing formats, to describe the metadata of those originals and to manage digital rights. On the other hand, Daisy Pipeline offers tools to convert from word processor formats to other formats such as accessible HTML, DAISY with audio, etc. The Daisy4DSpace plug-in builds a bridge between the Daisy Pipeline conversion tools and the DSpace digital repository, showing how integrating both tools improves the accessibility of scientific articles.

Figure 1. Prototype of the Daisy4DSpace plug-in integrated in DSpace

Daisy4DSpace consists of new mediafilter Java classes which inherit from the MediaFilter abstract class in the org.dspace.app.mediafilter package. The new mediafilters call different Daisy Pipeline transformation scripts depending on the source format of the submitted article.
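The format-dependent choice of script can be illustrated with a minimal, self-contained sketch. It does not use the DSpace or Daisy Pipeline APIs: the class name `ScriptSelector`, the use of file extensions to detect the source format, and the script identifiers are all hypothetical placeholders chosen for illustration, not the names used in the prototype.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Chooses a conversion script from the source format of a submitted file. */
final class ScriptSelector {
    // Hypothetical script identifiers; real Daisy Pipeline script names may differ.
    private static final Map<String, String> SCRIPT_BY_EXTENSION = Map.of(
            "doc", "word-to-dtbook",
            "odt", "odt-to-dtbook",
            "xml", "dtbook-to-daisy");

    private ScriptSelector() { }

    /** Returns the script for the file's extension, or empty if the format is unsupported. */
    static Optional<String> scriptFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no extension, so the format is unknown
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(SCRIPT_BY_EXTENSION.get(extension));
    }

    public static void main(String[] args) {
        System.out.println(scriptFor("article.DOC").orElse("unsupported"));
        System.out.println(scriptFor("article.pdf").orElse("unsupported"));
    }
}
```

Returning an empty `Optional` for unknown formats lets the calling filter skip a file instead of failing, which matters when a batch contains originals the pipeline cannot handle.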

Before explaining the operation of the plug-in, we should mention some problems that arose during the development of the tool:

  • Robustness of Daisy Pipeline: Transformations in Daisy Pipeline are not yet robust enough to work in a real-world context. New versions of the software are expected to solve this problem. Validation and feedback tools for authors would also be welcome to improve the quality of originals.
  • Invisibility of alternate media: In standard mediafilters, newly generated documents go to specialized bundles in DSpace which are not visible to the end user. We had to change this standard flow to force publication to the “original” bundle, making it possible to retrieve the many versions of a work in an end-user search.

Concerning the first problem, we should note that at the current stage of Daisy Pipeline development, authors of articles face some extra requirements before using the system. Most importantly, they should submit well-structured articles to the repository: headings have to be marked as such, and there should be alternative text for each image, well-formed tables, correct links for reference footnotes or other material, etc. These requirements are not far from what conferences and scientific journals ask of authors, but they are still not general practice.

In order to minimize the problems caused by poor originals and to guarantee a minimum level of quality in submitted documents, authors are provided with a template (currently only available for MS Word), inspired by the RSC templates [Microsoft…] for journals of the Royal Society of Chemistry, and with a simple guide to good accessibility practices in authoring articles (inspired by the TechDis guides [Accessibility…]).

As the intended audience of the tool is STM authors, we should also mention that the current prototype offers no MathML support. Many technical articles include mathematical formulas, which pose several accessibility barriers for blind people. MathML is a current solution in HTML documents and OpenOffice formats, and it is included in recent versions of the DAISY standard as well. However, this prototype does not cover it, because the Daisy Pipeline tools do not robustly support it either.

The operation of the tool is simple, as illustrated in the following scenario:

  1. Accessibility experts develop templates for article submission.
  2. Authors submit articles in Microsoft Word, Open Office Writer or directly in DTBook XML format.
  3. Repository administrators call batch processing for creating alternate presentations of the original work, and validate the results.
  4. Users retrieve articles of their interest and are able to select which is the most suitable format according to their needs or preferences.
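Step 3 of the scenario, the administrator's batch run, might look like the following sketch. It assumes the conversion itself is hidden behind a function that returns the name of the generated file or throws when the transformation fails; this function is a hypothetical stand-in for a Daisy Pipeline call, and `BatchRun` is an illustrative name rather than part of the prototype.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Runs a conversion over a batch of originals and records what needs manual review. */
final class BatchRun {
    final List<String> converted = new ArrayList<>();
    final List<String> failed = new ArrayList<>();

    /**
     * Applies the converter to every original. The converter is a hypothetical
     * stand-in for a Daisy Pipeline call: it returns the name of the generated
     * file, or throws a RuntimeException if the transformation fails.
     */
    static BatchRun process(List<String> originals, UnaryOperator<String> converter) {
        BatchRun run = new BatchRun();
        for (String original : originals) {
            try {
                run.converted.add(converter.apply(original));
            } catch (RuntimeException e) {
                // Keep going: one bad original should not abort the whole batch.
                run.failed.add(original);
            }
        }
        return run;
    }

    public static void main(String[] args) {
        // Toy converter for demonstration: only ".doc" files succeed.
        UnaryOperator<String> toy = name -> {
            if (!name.endsWith(".doc")) {
                throw new IllegalArgumentException("unsupported: " + name);
            }
            return name.replace(".doc", ".html");
        };
        BatchRun run = process(List.of("a.doc", "b.pdf"), toy);
        System.out.println("converted=" + run.converted + " failed=" + run.failed);
    }
}
```

Collecting failures separately gives the administrator a list of originals to validate by hand, which seems useful while the transformations are not yet fully robust.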

Results

This development offers benefits throughout the document life cycle and for many stakeholders.

  • Alternate versions for disabled end users: The integration produces alternate versions of submitted articles that are better adapted to the needs of numerous users in higher education [Petrie, Weber, Fisher 2005], such as people with dyslexia, people with low vision and blind people.
  • Low-cost conversions for repository holders: With this tool, organizations holding the repositories obtain an accessible collection at almost no cost. They only have to pay for a TTS synthesizer for each collection language and for disk space (as audio and PDF versions are sometimes large files). Furthermore, they do not need to ask for copyright permissions (a hard and long process) from either the authors or the publishers, because the Creative Commons licenses used in open access repositories permit the conversion.
  • Demonstration platform for DAISY: Accessibility advocates have difficulty selling their ideas because there is no real data on the size of the accessibility market. With this tool, which can easily cover 10% of published articles, they benefit from a demonstration platform that makes it possible to disseminate the DAISY format among authors on the one hand, and provides readily available statistics on the use of alternate versions on the other.

Further research

Further research is necessary for this prototype to become a working tool in real repositories and to realize its full benefits.

On the one hand, the academic community should persist in promoting the use of institutional repositories through incentives, compulsory measures and author awareness; on the other hand, the accessibility community should determine the best method to include accessibility requirements in this context.

With the advent of this tool, an opportunity appears to do research on metrics for the real use of alternate documents, a long-perceived need when advocating the usefulness of digital accessibility. Because alternate documents are integrated in DSpace, repository administrators could benefit from the statistical information DSpace provides. However, some improvements are needed to keep track of users' format selections.
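As a rough illustration of what tracking format selections could involve, the sketch below tallies retrievals per format and reports each format's share. It is independent of DSpace's own statistics; the class `FormatSelectionLog` and the format labels are assumptions made for the example.

```java
import java.util.Map;
import java.util.TreeMap;

/** Tallies how often readers pick each format of an article. */
final class FormatSelectionLog {
    private final Map<String, Integer> countsByFormat = new TreeMap<>();

    /** Records one retrieval of the given format, e.g. "DAISY" or "HTML". */
    void record(String format) {
        countsByFormat.merge(format, 1, Integer::sum);
    }

    /** Returns the number of recorded retrievals for a format (0 if never chosen). */
    int count(String format) {
        return countsByFormat.getOrDefault(format, 0);
    }

    /** Share of all recorded retrievals that went to the given format, in [0, 1]. */
    double share(String format) {
        int total = countsByFormat.values().stream().mapToInt(Integer::intValue).sum();
        return total == 0 ? 0.0 : (double) count(format) / total;
    }

    public static void main(String[] args) {
        FormatSelectionLog log = new FormatSelectionLog();
        log.record("DAISY");
        log.record("HTML");
        log.record("DAISY");
        System.out.println("DAISY retrievals: " + log.count("DAISY"));
        System.out.println("DAISY share: " + log.share("DAISY"));
    }
}
```

Per-format counts of this kind are the sort of usage data the accessibility community currently lacks.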

From a more technical point of view, it is important to develop new pipeline transformations aimed at scholars:

  • NLM DTD to DTBook and vice versa
  • DocBook to DTBook and vice versa
  • DTBook to large print PDF; normal print PDF; and tagged PDF

It is also desirable to improve the management of mediafilters in DSpace and to include them in the backend. Furthermore, a new user interface for DSpace would be needed to read DAISY books directly from the platform.

As previously stated, future versions should improve MathML support. It would also be desirable to integrate Intelligent Structure Recognition software to transform the unstructured originals already held in repositories and so start a backwards conversion process.

Finally, some new questions concerning preservation and metadata arose with the integration of multiple manifestations of a single work in a repository:

  • Should we preserve all the different output formats, or should we just keep the original one?
  • Which format is the original one?
  • How can we define the relations between manifestations in metadata?

Bibliography

“Accessibility Essentials 2: Writing Accessible Electronic Documents with Microsoft Word” (retrieved October 24th, 2008)

Budapest Open Access Initiative (BOAI), 14 February 2002 (retrieved October 24th, 2008)

CEN/ISSS CWA 15778:2008, Document processing for accessibility (February 2008) (retrieved October 24th, 2008)

Daisy Pipeline Project (retrieved October 24th, 2008)

DSpace (retrieved October 24th, 2008)

Duarte, Carlos, and Luís Carriço. "A Conceptual Framework for Developing Adaptive Multimodal Applications." 2006 International Conference on Intelligent User Interfaces. Sydney, 29 January-1 February 2006.

Guillon, Benoît, et al. "Towards an Integrated Publishing Chain for Accessible Multimodal Documents in Blind and Visually Impaired People: Access to Documents and Information." 9th International Conference, ICCHP 2004. Paris, July 7-9, 2004.

Kelly, Brian, et al. "Accessibility 2.0: People, Policies and Processes." W4A 2007. Banff, Canada, May 7-8, 2007.

Kelly, Brian. "Accessibility and Institutional Repositories" (retrieved October 24th, 2008)

Lemon8-XML (retrieved October 24th, 2008)

Microsoft Word Templates (retrieved October 24th, 2008)

National Library of Medicine Journal Archiving and Interchange Tag Suite (retrieved October 24th, 2008)

Petrie, Helen, G. Weber, and W. Fisher. "Personalization, Interaction, and Navigation in Rich Multimedia Documents for Print-Disabled Users" IBM Systems Journal 2005: 620.

Reference Model for an Open Archival Information System (OAIS), CCSDS 650.0-B-1, Blue Book, January 2002 (retrieved October 24th, 2008)

“Usage of open access repository software (worldwide)” (retrieved October 24th, 2008)

“Web Content Accessibility Guidelines 2.0”, W3C Candidate Recommendation, 30 April 2008 (retrieved October 24th, 2008)

Paper presented at the Adaptive Content Processing Conference 2008, 6-7 November 2008, Amsterdam.


[1] WCAG 2.0 is currently a candidate recommendation.