Hackathon generates new tools for digital text collections  

The first Distributed Text Services (DTS) Hackathon, co-sponsored by the dhCenter, yielded four award-winning and ready-to-implement ideas for improving uptake of DTS, which defines an API for working with collections of text as machine-actionable data.

The DTS specification defines a type of software interface called an API (application programming interface), which allows digital text corpora to be published in a standardized, uniform way that is more easily accessible and navigable across platforms.

According to the DTS community, which organized the online hackathon held from September 27th to October 8th, the specification enables machine consumption of digital text collections, and can help publishers of such collections make their data Findable, Accessible, Interoperable and Reusable (FAIR).

“The DTS specification allows you to put a corpus online in such a way that users can access metadata about your documents, browse its sub-collections, retrieve pieces of text, etc.,” says dhCenter member Matteo Romanello, a lecturer at the University of Lausanne and co-coordinator of the hackathon. “You can use it for browsing virtual collections, and at the document level, you can discover the table of contents, elements of text, and navigation endpoints.”
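
As a rough illustration of the three kinds of request Romanello describes (browsing collections, discovering a document's structure, and retrieving pieces of text), the Python sketch below builds request URLs for a DTS-style server. The base URL, the endpoint paths ("collections", "navigation", "documents") and the parameter names ("id", "ref") are assumptions modeled on the draft specification; a real server advertises its own endpoint locations, so this should be read as a sketch and not as a reference client.

```python
from urllib.parse import urlencode

# Hypothetical base URL; a real DTS server advertises its own endpoint paths.
BASE_URL = "https://example.org/api/dts"


def dts_url(endpoint, **params):
    """Build a request URL for one endpoint, skipping parameters left unset."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{BASE_URL}/{endpoint}"
    return f"{url}?{query}" if query else url


def collection_url(collection_id=None):
    """Browse a collection (or the root collection if no id is given)."""
    return dts_url("collections", id=collection_id)


def navigation_url(document_id, ref=None):
    """Discover the citable structure (table of contents) of a document."""
    return dts_url("navigation", id=document_id, ref=ref)


def document_url(document_id, ref=None):
    """Retrieve a whole document, or one passage if a reference is given."""
    return dts_url("documents", id=document_id, ref=ref)
```

Under these assumptions, calling document_url("letters-001", ref="2") would yield a URL asking the hypothetical server for passage 2 of the document "letters-001".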

Making digital humanities research easier

Romanello explains that while another API specification, the International Image Interoperability Framework (IIIF), already exists for images, DTS is the first such specification aimed at text corpora, which makes it an important tool for digital humanities researchers. He notes that a key feature of DTS is that it is generic, meaning that it can be used regardless of text language or format.

“The previous standards that inspired DTS were not generalizable. If publishers use DTS, any collection can be accessed in the same way, whether or not the text is canonical, or homogeneous. This means that it can be used to retrieve data about inscriptions or texts on papyrus, for example.”

In addition to improving the ecosystem of DTS data and tools, a main goal of the hackathon was to showcase the utility and versatility of the standard, which is currently implemented by five corpora, and promote its uptake by research institutions, digital heritage collections, and other publishers of digital texts.

Tools, data, and documentation

The hackathon had 26 registered participants who teamed up to develop a total of nine hack ideas. Four were selected for awards following evaluation by an international jury of experts (Thibault Clérice, École nationale des chartes; Berenike Herrmann, University of Bielefeld; Leif Isaksen, University of Exeter; Davide Picca, University of Lausanne; Elena Pierazzo, University of Tours; and Valeria Vitale, Turing Institute).

All contributions were divided into the categories of tool hacks (aimed at developing software tools that enrich or annotate DTS-ready corpora, or that facilitate the publication of DTS-compliant corpora) and data hacks (aimed at exposing existing textual resources via a DTS-compliant API). A new hack category, documentation hacks, emerged during the hackathon.

Judging criteria included the extent to which the hack contributes to the DTS ecosystem of tools and resources, its potential to increase the adoption of DTS, the ease of use of the resulting tool, and its usefulness for the broader community.

🏆  Best tool hack (to consume DTS data): DTS2CSV by Laurent Millet Lacombe (MetaindeX) and Audric Wannaz (University of Basel)

This “extremely useful” hack was selected by the judges for its ability to make DTS more user-friendly. Lacombe and Wannaz developed a Python tool, which can run both as a command-line tool and as a graphical user interface (GUI), to convert content available via the DTS API into the tabular CSV format. The tool’s behaviour is fully configurable by means of a JSON configuration file. Having DTS data in CSV format can be very useful to further analyze and explore such content, and to compute statistics about DTS collections and texts.
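
To give a sense of what such a conversion involves, here is a minimal Python sketch that flattens the members of a collection response into CSV rows. It is not the DTS2CSV code: the sample data, the field names ("@id", "title", "@type", "member") and the function collection_to_csv are hypothetical, chosen only to resemble a JSON collection listing, and the fields argument merely stands in for the kind of choice a configuration file could express.

```python
import csv
import io

# Hypothetical sample shaped like a JSON collection response;
# the field names are assumptions made for this illustration.
SAMPLE_COLLECTION = {
    "@id": "poems",
    "title": "Poems",
    "member": [
        {"@id": "poems-1", "title": "First Poem", "@type": "Resource"},
        {"@id": "poems-2", "title": "Second Poem", "@type": "Resource"},
    ],
}


def collection_to_csv(collection, fields=("@id", "title", "@type")):
    """Write one CSV row per collection member, keeping only the chosen fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields), lineterminator="\n")
    writer.writeheader()
    for member in collection.get("member", []):
        # Missing fields become empty cells instead of raising an error.
        writer.writerow({field: member.get(field, "") for field in fields})
    return buffer.getvalue()


if __name__ == "__main__":
    print(collection_to_csv(SAMPLE_COLLECTION))
```

Once the data is in tabular form, it can be opened in a spreadsheet or loaded into a statistics package, which is the kind of follow-on analysis the judges had in mind.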

🏆  Best tool hack (to produce DTS data): Reusable DTS web server for EpiDoc collection, by James Chartrand and Simona Stoyanova (Crossreads Project, University of Oxford) 

This proof-of-concept was described by the judges as “useful in its own right, as well as helpful for those looking to implement it for other datasets.” The team demonstrated that a DTS API can be created on top of a TEI-XML EpiDoc corpus that is stored on GitHub and accessed through a GitHub API. The server is implemented with the Express framework for Node.js, and a demo instance can already be tested.

🏆  Best data hack: DraCor-API to DTS, by Ingo Börner (University of Potsdam)

This hack has the potential to broaden the DTS audience by focusing on theater corpora. It adds a DTS endpoint to the DraCor-API, which provides access to drama corpora in 11 languages. Although the DraCor-API is a custom API, Börner extended it to support the DTS specification, thus making all DraCor corpus content accessible via a DTS API.

🏆  Best documentation hack: DTS & IIIF integration by Robert Casties (Max Planck Institute for the History of Science)

This concept was lauded by the judges for its novelty and creativity as well as its utility. The goal is to allow the DTS and IIIF standards to talk to each other, and to make the DTS API into an IIIF analog for text use on the internet. Casties explored some use cases for the integration of textual data served by the DTS API with image data served by the IIIF API. These initial use cases include the synchronized display of text and images for entire pages, as well as for individual sections of a text document.