6 Case study: using NoSQL at the Office of the Historian at the Department of State


CHAPTER 5

Native XML databases

After reading this case study, you’ll understand how annotations are used to solve
business problems and how native XML databases are unique in their ability to query
text with rich annotations. You’ll also become familiar with how open source native
XML databases use XQuery and Lucene full-text search library functions to create
high-quality search tools.
The Office of the Historian at the Department of State is charged by statute with
publishing the official records associated with US foreign relations. A declassified
analysis of specific periods of US diplomatic history is published in a series of volumes
titled Foreign Relations of the United States (FRUS). Through a detailed editing and peer
review process, the Office of the Historian has become the “gold standard” for accuracy in the history of international diplomacy. FRUS documents are used in political
science and diplomacy classes as well as for other training throughout the world.
In 2008, the Office of the Historian embarked on an initiative to convert the
printed FRUS textbooks into an online format that could be easily searched and
viewed in multiple formats. The Office of the Historian chose the Text Encoding Initiative (TEI), a standard XML format widely used for encoding historical documents.
TEI was chosen because it has precise XML elements to encode a digital representation of historical documents and includes elements for indicating the people, organizations, locations, dates, and terms used in the documents.
To convert the FRUS volumes (each over 1,000 pages long) to TEI format, the documents are first sent to an outside service that enters the information into two separate XML documents using an XML editor. The two XML files are compared against
each other to ensure accuracy. The TEI-encoded XML documents are then returned
to the Office of the Historian ready to be indexed and transformed into HTML, PDF,
or other formats. Figure 5.9 outlines this encoding process.
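[Figure 5.9 diagram: printed documents are encoded in TEI XML format with annotations, validated with XML Schema and Schematron, and stored in Subversion; they are then loaded into eXist DB (B+tree and Lucene full-text indexes), which is queried by an XQuery search service behind HTML search forms.]
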
Figure 5.9 The overall document workflow for converting printed historical
documents into an online system using TEI encoding. TEI-encoded documents are
validated using XML schemas and Schematron rules files and saved into a
Subversion revision control system. XML documents are then loaded into the
eXist native XML database. Search forms are used to send keyword queries to a
REST XQuery search service. This service uses the eXist document tree indexes
and Lucene indexes to create search results.


The TEI-encoded FRUS documents are validated using XML validation tools (XML
Schema and Schematron) and uploaded into the eXist DB native XML database, where
each data element is automatically indexed. When XML elements contain text, they’re
also automatically indexed using Apache Lucene libraries, resulting in full-text
indexes of each document. When pages are viewed on the website, XQuery performs
a transformation and converts the TEI XML format into HTML. XQuery programs are
also used to transform the TEI XML into other formats, including RSS/Atom feeds,
PDF, and EPUB. No preconversion of TEI to other formats is required until a page or
document is requested on the website.
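The on-request transformation described above can be illustrated with a minimal Python sketch. This isn't the XQuery code used on the site; it only shows the idea of rendering stored TEI as HTML at request time. The sample fragment, the `TAG_MAP` mapping, and the function names are assumptions made for illustration, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET
from html import escape

# Invented TEI-like fragment; real TEI documents use the TEI namespace.
TEI_SAMPLE = """<div>
  <head>Memorandum of Conversation</head>
  <p>Meeting between <persName corresp="#p_1">the president</persName> and the delegation.</p>
</div>"""

# Hypothetical mapping from TEI element names to HTML element names.
TAG_MAP = {"div": "section", "head": "h2", "p": "p", "persName": "span"}


def to_html(elem):
    """Recursively render one TEI element (and its children) as an HTML string."""
    tag = TAG_MAP.get(elem.tag, "span")
    # Keep the TEI element name as a CSS class so annotations can be styled.
    parts = [f'<{tag} class="tei-{escape(elem.tag)}">', escape(elem.text or "")]
    for child in elem:
        parts.append(to_html(child))
        # Text that follows a child element belongs to the parent's content.
        parts.append(escape(child.tail or ""))
    parts.append(f"</{tag}>")
    return "".join(parts)


def render_page(tei_xml):
    """Convert stored TEI text to HTML only when the page is requested."""
    return to_html(ET.fromstring(tei_xml))


html_page = render_page(TEI_SAMPLE)
print(html_page)
```

Because the HTML is produced from the stored TEI each time, adding another output format would mean adding another rendering function rather than reconverting the stored documents.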
High-quality search was a critical success factor for the Office of the Historian
project. A sample search result for the query
“nixon in china” is shown in figure 5.10.

Figure 5.10 A sample web search result from the Office of the Historian at the Department of State.
The result page uses Apache Lucene full-text indexes to quickly search and rank many documents. The
bold words in the search result use the key-word-in-context (KWIC) function to show the search
keywords found in the documents. The search interface allows users to utilize advanced search options
to limit scope, and includes features such as Boolean, wildcard, and nearness, or proximity search.
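The key-word-in-context idea mentioned in the caption can be sketched in a few lines of Python. This is a simplified stand-in written for illustration, not the function the site uses; the `kwic` name, the `width` parameter, and the sample sentence are all assumptions.

```python
import re


def kwic(text, keyword, width=30):
    """Return one snippet per case-insensitive match of keyword in text.

    Each snippet shows up to `width` characters of context on either side,
    with the matched word wrapped in <b> tags.
    """
    snippets = []
    for m in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start = max(m.start() - width, 0)
        end = min(m.end() + width, len(text))
        before = text[start:m.start()]
        after = text[m.end():end]
        snippets.append(f"{before}<b>{m.group(0)}</b>{after}")
    return snippets


hits = kwic("President Nixon arrived in China today.", "china", width=10)
print(hits)
```

A production search tool would take match positions from the full-text index rather than rescanning the text, but the output shape is the same: a short excerpt with the hit highlighted.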


The TEI documents contain many entities (people, dates, terms) that are annotated
with TEI tags. For example, each person has a tag wrapping the name of the
individual mentioned. A sample of these tags is shown in table 5.2.
Table 5.2 Sample of TEI entity annotations for people, dates, glossary terms, and geolocations. Note
that an XML attribute such as corresp for persons is used to reference a global dictionary of entities.
Annotations are wrappers around text to describe the text. Attributes such as corresp="" are key-value pairs within the annotation elements that add specificity to the annotations.
Entity type      Example
Person           the president
Date             June 9th
Glossary term    Phantom F–4 aircraft
Geolocations     China

XQuery makes it easy to query any XML document for all entities within the document. For example, in figure 5.11 the XPath expression //person will return all person elements found in a document including those found at the beginning, in the
middle, and at the end.
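The same kind of "find it anywhere in the document" query can be tried in Python's standard library, shown here as a rough analogue rather than the XQuery used on the site. The sample document and its element names are invented; `.//person` is how `xml.etree.ElementTree` spells a descendant search equivalent to the XPath `//person`.

```python
import xml.etree.ElementTree as ET

# Invented sample document; element and attribute names are illustrative only.
DOC = """<document>
  <front><person id="p1">Person One</person></front>
  <body>
    <p>A meeting with <person id="p2">Person Two</person> was proposed.</p>
  </body>
  <back><note><person id="p3">Person Three</person></note></back>
</document>"""

root = ET.fromstring(DOC)

# Match person elements at any depth below the root, in document order.
people = [p.text for p in root.findall(".//person")]
print(people)
```

The query doesn't need to know how deeply each `person` element is nested or which section it appears in, which is what makes this style of search convenient for annotated text.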
Figure 5.11 Each page of the FRUS document lists the entities found on that page. For
example, the people and terms referenced in this page are also shown in the right margin
of the page. Users can click on each entity for a full definition of that person or term.

An important note about this project: it was done on a modest budget by nontechnical internal staff and limited outside contractors. The internal staff had no prior
experience with native XML systems or the XQuery language. One member of the
staff, a historian by training, learned XQuery over the course of several months and
created a prototype website using online examples and assistance from other members of the eXist and TEI community.
There are currently hundreds of completed FRUS volumes in the system, with
more being added each month. Search performance has met all the requirements for
the site, with page rendering and web searches both averaging well under 500 ms.

5.7 Case study: managing financial derivatives with MarkLogic
In this case study, we’ll look at how a financial institution implemented a commercial,
native XML database (MarkLogic) to manage a high-stakes financial derivatives system.
This study is an excellent example of how organizations with highly variable data
are moving away from relational databases even if they’re managing high-stakes financial transactions. High-variability data is difficult to store in relational databases, since
each variation may need new columns and tables created in an RDBMS as well as new
reports.
After reading this study, you’ll understand how organizations with high-variability
data can use document stores for transactional data. You’ll also see how these organizations manage ACID transactions and use database triggers to process event streams.

5.7.1 Why financial derivatives are difficult to store in RDBMSs
This section presents an overview of financial derivatives and provides insight as to
why they’re not well suited for storage in tables within an RDBMS.
Let’s start with a quick comparison. If you purchase items from any web retailer,
the information you enter for each item you want to purchase is limited. When you
purchase a dress or shirt, you choose the item name or number, size, color, and perhaps a few other details such as the material type or item length. This information fits
neatly into the rows of an RDBMS.
Now consider purchasing a complex financial instrument like a derivative, where
each item has thousands of parameters and the parameters for every item are different. Most derivatives contain a product ID, but they also contain conditional logic,
mathematical equations, lookup tables, decision trees, and even the full text of legal
contracts. In short, the information doesn’t lend itself to an RDBMS table. Note that
it’s possible to store the item as a binary large object (BLOB) in a traditional RDBMS,
but you wouldn’t be able to access any property inside the BLOB for reporting.

5.7.2 An investment bank switches from 20 RDBMSs to one native XML system
A large investment bank was using 20 different RDBMSs to store complex financial
instruments called over-the-counter derivative contracts, as shown in figure 5.12.


Figure 5.12 A sample data flow of an operational data store (ODS) for a complex
financial derivatives system using multiple RDBMSs to store the data. The trading
systems each stored data into RDBMSs using complex SQL INSERT statements. SQL
SELECT statements were used to extract data. Each new derivative type required custom
software to be written.

Highlights of the bank’s conversion process included these:
• Each system had its own method for ingesting the transactions, converting them
  to row structures, storing the rows in tables, and reporting on the transactions.
• Custom software was required for each new derivative type so key parameters
  could be stored and queried.
• In many instances, a single column stored different types of information based
  on other parameters in the transaction.
• After the data was stored, SQL queries were written to extract information for
  downstream processing when key events occurred.
• Because different data was shoehorned into the same column based on the
  derivative type, reporting was complex and error prone.
• Errors resulted in data quality issues and required extensive auditing of output
  results before the data could be used by downstream systems.
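To make the overloaded-column problem in the list above concrete, here is a small hypothetical sketch using Python's built-in `sqlite3` module. The table, the column names, and the idea that `param1` holds a strike price for options but a notional amount for swaps are all invented for illustration; the bank's actual schemas aren't described in this case study.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: the meaning of "param1" depends on "deriv_type".
conn.execute("CREATE TABLE trades (id INTEGER, deriv_type TEXT, param1 REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [
        (1, "option", 105.0),      # param1 holds a strike price
        (2, "swap", 5_000_000.0),  # param1 holds a notional amount
        (3, "option", 98.5),       # param1 holds a strike price
    ],
)

# A naive report treats the column as one kind of number and mixes meanings.
naive_total = conn.execute("SELECT SUM(param1) FROM trades").fetchone()[0]

# A correct report has to know the per-type meaning and filter on it.
swap_notional = conn.execute(
    "SELECT SUM(param1) FROM trades WHERE deriv_type = 'swap'"
).fetchone()[0]

print(naive_total, swap_notional)
```

Every report author has to carry the per-type rules in their head or in custom code, which is the kind of situation that makes reporting error prone.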

This complex conversion process made it difficult for the bank to get consistent and
timely reports and to efficiently manage document workflow. What they needed was a
flexible way to store the derivative documents in a standard format such as XML, and
to be able to report on the details of the data. If all derivatives were stored as full XML
documents, each derivative could contain its unique parameters, without changes to
the database.
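A minimal sketch of that idea, using Python's standard library in place of a real document store: two derivative documents with different internal structure are stored side by side, and reports reach inside them. The element names, attributes, and the in-memory `store` dictionary are assumptions for illustration only and don't reflect MarkLogic's API or the bank's document format.

```python
import xml.etree.ElementTree as ET

# Two invented derivative documents with different internal structure.
DOCS = [
    """<derivative id="D1" type="option">
         <notional currency="USD">1000000</notional>
         <strike>105.0</strike>
       </derivative>""",
    """<derivative id="D2" type="swap">
         <notional currency="USD">5000000</notional>
         <legs><leg rate="fixed"/><leg rate="floating"/></legs>
       </derivative>""",
]

# "Storing" a new shape needs no schema change: just add the document.
store = {}
for text in DOCS:
    doc = ET.fromstring(text)
    store[doc.get("id")] = doc

# Report on a detail shared by all documents.
total_notional = sum(float(d.findtext("notional")) for d in store.values())

# Report on a detail that only some documents carry.
with_legs = [doc_id for doc_id, d in store.items() if d.find("legs") is not None]

print(total_notional, with_legs)
```

Unlike a BLOB, each document stays queryable, so type-specific fields such as `strike` or `legs` can appear in reports without new columns or tables.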
As a result of this analysis, the bank converted its operational data store (ODS)
to a native XML database (MarkLogic) to store their derivative contracts. Figure 5.13
shows how the MarkLogic database was integrated into the financial organization’s
workflow.
MarkLogic is a commercial document-oriented NoSQL system that has been
around since before the term NoSQL was popular. Like other document stores, MarkLogic excels at storing data with high variability and is compliant with W3C standards
such as XML, XPath, and XQuery.