For “Digging Into Data Challenge” homepage, click HERE.
Research on integrating digital library content with computational tools and services has largely been concerned with examining, analyzing, and finding patterns within a data set. Scholars, on the other hand, associate the people, places and events mentioned in texts with descriptions found elsewhere. Thus, while most computational analysis looks inward to the contexts of a particular set of data, scholars tend to look outward, seeking the context for the texts they are studying.
In this 17-month project we go beyond this basic analysis by developing a prototype system that gives scholars expert-system support in their work. This system will integrate large-scale collections, including JSTOR and the books collections of the Internet Archive, stored and managed in a distributed preservation environment. It will also incorporate text mining and Natural Language Processing software capable of generating dynamic links to related resources discussing the same persons, places, and events.
Next-generation digital libraries should go beyond merely providing access and should offer tools and services for exploring content.
Our project aims to:
- Create tools that aid researchers by synthesizing an understanding of unexplained names and terms within e-resources.
- Promote sharing and reuse by providing publication tools for annotating and exposing relevant data and new knowledge.
- Develop advanced discovery tools by including both machine-generated and user-generated metadata.
Our approach is to apply text and data mining algorithms and tools to large data collections, federated across data grid/cloud storage and other digital asset management systems. These tools will be used to identify and discriminate between the people, places and events mentioned in the free text the collections contain. This will involve such techniques as:
- Natural Language Processing
- Named Entity Recognition
- Geo-coding / Geo-retrieval
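To make the techniques above concrete, the following is a minimal sketch of gazetteer-based entity recognition with geo-coding. It is not the project's software. The gazetteer entries, the approximate coordinates, and the names `GAZETTEER`, `Mention` and `find_mentions` are all assumptions made for illustration, and a production system would rely on trained Natural Language Processing models rather than exact string matching.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical toy gazetteer: surface form -> (entity type, approximate lat/lon).
# Coordinates are rounded and included only for illustration.
GAZETTEER = {
    "Georgius Agricola": ("PERSON", None),
    "Chemnitz": ("PLACE", (50.83, 12.92)),
    "Basel": ("PLACE", (47.56, 7.59)),
}


@dataclass(frozen=True)
class Mention:
    text: str                              # matched surface form
    kind: str                              # entity type, e.g. PERSON or PLACE
    start: int                             # character offset where the match begins
    end: int                               # character offset just past the match
    coords: Optional[Tuple[float, float]]  # (lat, lon) for places, else None


def find_mentions(text):
    """Return every gazetteer match in `text`, ordered by position."""
    mentions = []
    for name, (kind, coords) in GAZETTEER.items():
        # Word boundaries stop a name from matching inside a longer word.
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            mentions.append(Mention(name, kind, match.start(), match.end(), coords))
    return sorted(mentions, key=lambda m: m.start)


if __name__ == "__main__":
    for mention in find_mentions("Georgius Agricola worked in Chemnitz."):
        print(mention)
```

Recording character offsets rather than altering the text keeps the annotations separate from the source, so the same document can carry several layers of tags.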
The output from these analyses will be used to automatically enhance data collections through tagging and/or annotation. The enhanced data will then be loaded and indexed by a digital library system to provide advanced resource discovery tools. A prototype interface will be developed to make use of the advanced discovery tools to automatically provide explanatory resources about people, places and events in an intuitive and unobtrusive manner, enabling their use by scholars during their normal research processes.
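As a rough illustration of how indexed annotations could support this kind of discovery, the sketch below builds an inverted index from entity names to documents and uses it to find other documents that mention the same entities. The document identifiers, the entity sets, and the function names are hypothetical; this is not the Cheshire system's API, only a simplified model of the idea.

```python
from collections import defaultdict


def build_entity_index(annotated_docs):
    """Map each entity name to the set of document ids that mention it."""
    index = defaultdict(set)
    for doc_id, entities in annotated_docs.items():
        for entity in entities:
            index[entity].add(doc_id)
    return index


def related_documents(index, annotated_docs, doc_id):
    """Return other documents sharing an entity with `doc_id`.

    The result maps each related document id to the sorted list of
    entities it has in common with `doc_id`.
    """
    related = defaultdict(set)
    for entity in annotated_docs[doc_id]:
        for other in index[entity]:
            if other != doc_id:
                related[other].add(entity)
    return {other: sorted(shared) for other, shared in related.items()}


if __name__ == "__main__":
    # Hypothetical annotation output: document id -> entities found in it.
    docs = {
        "article-1": {"Georgius Agricola", "Chemnitz"},
        "book-1": {"Chemnitz"},
        "book-2": {"Basel"},
    }
    index = build_entity_index(docs)
    print(related_documents(index, docs, "article-1"))
```

An interface built on such an index could show the shared entities alongside each link, so that a scholar sees why a resource was suggested.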
Although it is not part of this project, our vision is for such tools to be embedded within production-quality distributed digital preservation environments, capable of managing multi-terabyte or even petabyte-scale collections. When analysing such massive data sets, the ability to run the text and data-mining software in parallel on compute clusters will be of paramount importance.
The technologies build on a data management infrastructure that combines the integrated Rule-Oriented Data System (iRODS) data grid, with workflows applied to cloud storage and compute technologies (Univ. of North Carolina and Liverpool), and the Cheshire digital library system (UC Berkeley and Liverpool), in order to federate data management and discovery services across clouds and other asset management systems. Our applied outputs will provide new and enriched discovery services for researchers and learners through the development of JSTOR and JISC service exemplars for scaled (terabyte to petabyte) digital repositories. The data management technologies will integrate these with distributed institutional repositories while still providing a scholar-friendly research environment for humanists. Targeted scholarly communities will be engaged to understand how researchers react to the enriched services and what kinds of new discoveries may become possible as a result.
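Per-document analyses of this kind are independent of one another, which is what makes them suitable for parallel execution. The sketch below illustrates the pattern with Python's standard `concurrent.futures` module. The analysis function is a trivial stand-in, and a local thread pool is used only so that the example runs anywhere; on a real compute cluster a different executor or workflow engine would take its place, which is an assumption beyond anything stated here.

```python
from concurrent.futures import ThreadPoolExecutor


def count_capitalised_tokens(text):
    """Stand-in for an expensive per-document analysis step."""
    return sum(1 for token in text.split() if token[:1].isupper())


def analyse_collection(docs, executor):
    """Apply the analysis to each document independently.

    `executor.map` yields results in input order, so they can be
    zipped back onto the document ids.
    """
    ids = list(docs)
    results = executor.map(count_capitalised_tokens, (docs[i] for i in ids))
    return dict(zip(ids, results))


if __name__ == "__main__":
    # Hypothetical two-document collection.
    sample = {"d1": "Agricola wrote in Latin", "d2": "no names here"}
    with ThreadPoolExecutor(max_workers=2) as pool:
        print(analyse_collection(sample, pool))
```

Passing the executor in as a parameter keeps the analysis code unchanged when the execution environment is swapped.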
Banner graphics from Agricola’s De Re Metallica (available in the Internet Archive books collection).