Tomoro's Headlines: Mark Logic Project Report

The following describes the core goals, defined tasks, and progress to date of the Scholars Portal project to replace ScienceServer with a new archival repository and search system for OCUL’s e-journal collections, based on the Mark Logic XML database system. The project has been underway since spring 2007, but development began in earnest in January 2008. The target for completing the project and replacing ScienceServer is September 2008.
Project Goals
Detailed project goals include:
a. migrate all the current or legacy data now present in ScienceServer to a Mark Logic XML database, transforming the content from the ScienceServer DTD to the NLM Journal Publishing and Archiving DTD
b. create a framework for loading new data from publishers into Mark Logic, including the creation of full-text XML versions of articles, transforming the content from proprietary publisher DTDs to the NLM Journal Publishing and Archiving DTD
c. build a set of management tools to support the functions and services of a “trusted digital repository”
d. build a user search interface that replicates the current functionality of ScienceServer and extends it where possible to include the best features of current publisher websites, including RSS feeds, faceted browsing, subject-based browsing, and personalization features
Project Streams
The project is moving forward in four independent streams. Progress has been made in all four, and dedicated staffing is in place for three of them. Staffing for the fourth stream, the development of a user search interface, has not yet been allocated.
1. Hardware and Software Infrastructure
Mark Logic supports sub-second query response times across very large databases by means of a clustered architecture, which distributes indexes and query evaluation across many small servers working in concert. Analysis of current usage and data sizes suggested the cluster configuration shown in the following diagram.

The cluster includes four data nodes connected to four 1.5 TB disks mounted on the new Pillar storage array and served to Mark Logic via the GFS clustered filesystem. Two servers handle XQuery execution in a load-balanced configuration. The user interface is to be implemented as a J2EE Web application connecting to the cluster through the Mark Logic Java API.
Tasks
install Mark Logic hardware cluster consisting of 4 data nodes and 2 evaluator nodes
DONE
connect the Mark Logic cluster to the new Pillar storage system and use GFS to mount four 1.5 TB disks to all 4 data nodes
DONE
implement and test failover features
IN PROCESS
run performance tests and tune indexes and cache settings to support sub-second query response times
IN PROCESS
copy all PDF filesystems now on ScienceServer to Pillar for access from the Mark Logic cluster
SUMMER 08
install front-end application servers with load balancing
SUMMER 08
2. Migrate Legacy Data
Over the last five years, data loaded into ScienceServer has been converted to a proprietary DTD that supports basic article metadata but lacks support for features now common on most publisher web sites (e.g. articles in press, electronic-only journals). Ideally, all of this data (13 million articles) would be reloaded from the original publisher datasets, which would let us pull more data into Mark Logic than we could store in ScienceServer. But there are too many legacy publisher DTDs (a mishmash of SGML and XML schemas) to manage this in a short period of time. Many publishers plan to reissue their complete corpus of datasets in XML format over the coming years, so we anticipate that one day we will be able to reload the content now in ScienceServer from these reissued datasets. In the meantime, we will load the legacy data by transforming it from the internal data format used in ScienceServer.
The ScienceServer content will be converted to the NLM Journal Publishing and Archiving DTD as it is loaded into Mark Logic. All content in Mark Logic must be well-formed XML using UTF-8 encoding. Unfortunately, ScienceServer hides a lot of legacy SGML coding in its XML records by enclosing it in CDATA sections. These sections will be removed as part of the loading, so a lot of invalid XML will be exposed as we load these legacy records. An initial load from ScienceServer shows that at least 1 million records will need some form of correction before they can be loaded into Mark Logic.
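To illustrate why removing CDATA sections exposes invalid XML, here is a minimal Python sketch. It is not the project's loader. It assumes that "removed" means the CDATA wrapper is stripped and its contents kept, and the function names (`unwrap_cdata`, `is_well_formed`, `triage`) and the sample records are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches one CDATA section and captures its contents (assumed unwrapping rule).
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def unwrap_cdata(record: str) -> str:
    """Replace each CDATA section with its raw contents, exposing any markup inside."""
    return CDATA_RE.sub(lambda m: m.group(1), record)


def is_well_formed(record: str) -> bool:
    """Return True if the record parses as XML when encoded as UTF-8."""
    try:
        ET.fromstring(record.encode("utf-8"))
        return True
    except ET.ParseError:
        return False


def triage(records):
    """Split records into those that can be loaded and those needing correction."""
    loadable, rejected = [], []
    for rec in records:
        exposed = unwrap_cdata(rec)
        (loadable if is_well_formed(exposed) else rejected).append(exposed)
    return loadable, rejected


if __name__ == "__main__":
    samples = [
        # Plain text inside CDATA: still well-formed once unwrapped.
        "<article><title><![CDATA[Plain title]]></title></article>",
        # Legacy SGML-style tag with no closing tag: breaks once unwrapped.
        "<article><title><![CDATA[Effect of <I>x on y]]></title></article>",
    ]
    ok, bad = triage(samples)
    print(len(ok), "loadable;", len(bad), "need correction")
```

Both sample records are well-formed while the CDATA wrapper is in place; only the second fails after unwrapping, which is the kind of record that would need correction before loading.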
Tasks
develop data loader to convert records in ScienceServer XML format to NLM format
DONE
migrate all XML metadata records from ScienceServer to Mark Logic
DONE (first run brought over 13.5 million records with 1 million rejected for various encoding errors)
clean up all issues related to badly formed XML records in ScienceServer
MARCH 2008
reload data to correct errors
APRIL 2008
run all content through CrossRef for resolution of cited references
APRIL 2008
generate TOC files for volumes and issues
MAY 2008
3. Develop New Loaders for Current Data
Once all the legacy data has been brought over from ScienceServer into Mark Logic, all newly received data will be converted directly from the publisher datasets to the NLM DTD format for loading into Mark Logic. This will require the development of data loaders (essentially XML transformations) for all the publishers we currently load into ScienceServer. Some publisher data formats are more sophisticated than others, and these will require significant development work (approximately 4 weeks per data loader). Other publisher formats are simpler, essentially metadata-only formats, and data loaders for these will be easier to develop (approximately 1 week of work per data loader).
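As a rough picture of what a loader for a simple, metadata-only format might involve, the following Python sketch maps a flat publisher record onto a small NLM-style structure. The input format (`record`, `journal`, `title`, `volume`, `issue`, `doi`) is invented for illustration, the function name `to_nlm` is hypothetical, and the output uses only a handful of NLM-style element names without being validated against the DTD.

```python
import xml.etree.ElementTree as ET


def to_nlm(publisher_xml: str) -> str:
    """Convert one hypothetical flat publisher record to an NLM-style article."""
    src = ET.fromstring(publisher_xml)

    article = ET.Element("article")
    front = ET.SubElement(article, "front")

    # Journal-level metadata.
    journal_meta = ET.SubElement(front, "journal-meta")
    ET.SubElement(journal_meta, "journal-title").text = src.findtext("journal", default="")

    # Article-level metadata.
    article_meta = ET.SubElement(front, "article-meta")
    doi = src.findtext("doi")
    if doi:
        article_id = ET.SubElement(article_meta, "article-id", {"pub-id-type": "doi"})
        article_id.text = doi
    title_group = ET.SubElement(article_meta, "title-group")
    ET.SubElement(title_group, "article-title").text = src.findtext("title", default="")

    # Copy volume and issue only when the publisher supplied them.
    for name in ("volume", "issue"):
        value = src.findtext(name)
        if value:
            ET.SubElement(article_meta, name).text = value

    return ET.tostring(article, encoding="unicode")


if __name__ == "__main__":
    sample = (
        "<record><journal>Journal of Examples</journal>"
        "<title>A sample article</title><volume>12</volume>"
        "<issue>3</issue><doi>10.1000/example</doi></record>"
    )
    print(to_nlm(sample))
```

A loader for a full-text format would also have to map body markup, references, and figures, which is consistent with the larger effort estimated for the more sophisticated publisher formats.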
Tasks
develop native data loaders for all publishers based on current DTDs
IN PROCESS (Elsevier and Emerald currently under development)
test and validate each one and revise as necessary
IN PROCESS
begin loading current data in parallel with ScienceServer and verify results
IN PROCESS
timing tests for loaders
IN PROCESS
develop management tools for loading, logging, verifying and correcting data
SUMMER 2008

4. User Interface
The intention is to develop a user interface that is geared toward fast searching and browsing of journal content. The interface should incorporate all the existing functionality of ScienceServer, including basic and advanced search, export to RefWorks, entitlement management, saved searches and saved articles. A selection of the best features of current publisher web sites should also be added, based on analysis of user needs and preferences.
Tasks
develop prototype interface in XQuery for reviewing results of loaders, performance testing, and usability review
IN PROCESS
determine feasibility of developing application in XQuery rather than Java
IN PROCESS
settle on functional requirements for production interface
IN PROCESS
design and code production interface
TBD
beta testing
TBD
final release
TBD


Thursday, March 20, 2008
