
Figure 5.12 A sample data flow of an operational data store (ODS) for a complex
financial derivatives system using multiple RDBMSs to store the data. The trading
systems each stored data into RDBMSs using complex SQL INSERT statements. SQL
SELECT statements were used to extract data. Each new derivative type required custom
software to be written.

Highlights of the bank's conversion process included these:
 Each system had its own method for ingesting the transactions, converting them to row structures, storing the rows in tables, and reporting on the transactions.
 Custom software was required for each new derivative type so key parameters could be stored and queried.
 In many instances, a single column stored different types of information based on other parameters in the transaction.
 After the data was stored, SQL queries were written to extract information for downstream processing when key events occurred.
 Because different data was shoehorned into the same column based on the derivative type, reporting was complex and error prone.
 Errors resulted in data quality issues and required extensive auditing of output results before the data could be used by downstream systems.

This complex conversion process made it difficult for the bank to get consistent and
timely reports and to efficiently manage document workflow. What they needed was a
flexible way to store the derivative documents in a standard format such as XML, and
to be able to report on the details of the data. If all derivatives were stored as full XML
documents, each derivative could contain its unique parameters, without changes to
the database.
As a result of this analysis, the bank converted their operational data store (ODS) to a native XML database (MarkLogic) to store their derivative contracts. Figure 5.13 shows how the MarkLogic database was integrated into the financial organization's workflow.
MarkLogic is a commercial document-oriented NoSQL system that has been
around since before the term NoSQL was popular. Like other document stores, MarkLogic excels at storing data with high variability and is compliant with W3C standards
such as XML, XPath, and XQuery.


[Figure 5.13 diagram: three trading systems write to an operational data store built on a native XML database (MarkLogic), which feeds ad hoc querying, event-driven workflow, reconciliation, clearing and settlement, and books and records. Callouts: consumers of the data can fetch complete docs or specific data elements with XPath; trade capture and risk management systems write trade data to the ODS, where it is stored as XML documents; all XML data elements get indexed upon ingestion; post-trade-processing workflows are triggered by events as soon as new data is inserted; and systems can attach additional data, e.g., a confirmation PDF can be attached to a trade document.]

Figure 5.13 Financial derivatives are stored in a native XML database used as an ODS. Trading systems send XML documents for each trade or contract directly to the database, where each element is immediately indexed. Update triggers automatically send event data to a workflow system, and users run simple XPath expressions to perform ad hoc queries.
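To give a flavor of the ad hoc querying shown in figure 5.13, here's a minimal sketch of the kind of XPath expression a data consumer might run against the ODS. The collection name and the trade, currency, and notional element names are hypothetical, not the bank's actual schema:

(: Fetch the notional amounts of all EUR-denominated trades.
   Collection and element names are hypothetical. :)
fn:collection("derivatives")//trade[currency = "EUR"]/notional

Because every element is indexed on ingestion, a query like this can typically be resolved from the indexes rather than by scanning each document.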

The bank’s new system was ideal as a centralized store for the highly variable derivatives contracts. Since MarkLogic supports ACID transactions and replication, the bank
maintained the reliability and availability guarantees it had with its RDBMS. MarkLogic
also supports event triggers on document collections. These are scripts that are executed each time an XML file is inserted, updated, or deleted.
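To make the trigger mechanism concrete, here's a hedged sketch of how such a post-commit trigger can be registered with the trgr XQuery library that ships with MarkLogic (the code runs against the triggers database). The trigger name, the derivatives collection, and the start-workflow.xqy action module are hypothetical; that module would contain the actual workflow logic:

xquery version "1.0-ml";
import module namespace trgr = "http://marklogic.com/xdmp/triggers"
  at "/MarkLogic/triggers.xqy";

trgr:create-trigger(
  "new-trade-workflow",                              (: trigger name :)
  "Start post-trade workflow when a trade arrives",  (: description :)
  trgr:trigger-data-event(
    trgr:collection-scope("derivatives"),            (: watch this collection :)
    trgr:document-content("create"),                 (: fire on new documents :)
    trgr:post-commit()),                             (: run after the insert commits :)
  trgr:trigger-module(
    xdmp:database("Modules"), "/triggers/", "start-workflow.xqy"),
  fn:true(),                                         (: enabled :)
  xdmp:default-permissions())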
Whereas RDBMSs require every record in a database to have the same structure
and data types, document stores are more flexible and allow organizations to capture
the variations in their data in a single database. As we move to our next section, we’ll
take a look at the major benefits of using a native XML document store.

5.7.3 Business benefits of moving to a native XML document store
The move to a document-centric architecture resulted in the following tangible benefits to the organization:
 Faster development—New instrument types added to the system by the front-office traders didn't require additional software development, and therefore could be supported in a matter of hours rather than days, weeks, or even months.
 Higher data quality—As new derivatives were loaded into the system as XML documents, the system was able to append additional XML elements needed to more precisely describe each derivative. Downstream analysis and reporting became easier to manage and less error prone.
 Better risk management—New reporting capabilities aggregated the bank's position in real time and provided an instant, accurate view of exposure to specific risk factors such as counterparties, currencies, or geographies.
 Lower operational costs—The elimination of processing errors associated with multiple operational stores containing conflicting data reduced the cost per trade; cutting the number of database administrators needed from 10 to 1 lowered human resource expense; triggering all post-trade processing workflows from a single source instead of 20 databases increased operational efficiency; and the ability to query the content of each individual derivative lowered reporting costs. With its new infrastructure, the bank didn't need to add resources to meet regulators' escalating demands for more transparency and increased stress-testing frequency.
In addition to the more tangible benefits of the new system, the bank was able to
bring new products to market faster and perform more detailed quality checks on
diverse data. As a result of the new-found confidence in the data quality and accuracy,
the solution was adopted by other parts of the bank.

5.7.4 Project results
The new MarkLogic system allowed the bank to cut the costs of building and maintaining an operational data store for complex derivatives. In addition, the bank
became more responsive to the needs of the organization when new derivatives
needed to be added. Derivative contracts are now kept in a semantically precise and flexible XML format while maintaining high data integrity, even as the data moves into remote reporting and workflow systems. These changes had a positive impact on the entire lifecycle of derivative contract management.

5.8 Summary
If you talk with people who've been using native XML databases for several years, they'll tell you they're happy with these systems and express a reluctance to return to RDBMSs. Their primary reason for liking native XML systems isn't centered on performance, although there are commercial native XML databases such as MarkLogic that store petabytes of information. Their primary reason is increased developer productivity and the ability of nonprogrammers to participate in the development process.
Seasoned software developers have exposure to good training and they frequently
use tools customized to the XML development process. They have large libraries of
XQuery code that can quickly be customized to create new applications in a short
period of time. The ability to quickly create new applications shortens development
cycles and helps new products make tight time-to-market deadlines.
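As a minimal sketch, here's the kind of reusable function such a library might contain; the collection and element names are hypothetical. Note that each iteration of the FLWOR expression's for clause is independent of the others, which is what lets a database engine evaluate them in parallel:

xquery version "1.0";

(: Return the notional exposure to one counterparty, largest trades first.
   Collection and element names are hypothetical. :)
declare function local:exposure-to($party as xs:string) as element(exposure)*
{
  for $trade in fn:collection("derivatives")/trade[counterparty = $party]
  let $notional := xs:decimal($trade/notional)
  order by $notional descending
  return <exposure trade-id="{$trade/@id}">{$notional}</exposure>
};

local:exposure-to("ACME Corp")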


Although XML is frequently associated with slow processing speeds, this often has
more to do with a specific implementation of an XML parser or a slow virtual machine.
It has little to do with how native XML systems work. The generation of XML directly
from a compressed tree storage structure is usually on par with any other format such
as CSV or JSON.
All native XML databases start with the cache-friendly document store pattern and gain from the elimination of middle-tier object-translation layers. They then leverage the power of standards to gain both portability and reuse of XQuery function libraries. The use of standard metaphors, such as data folders to manage document collections and simple path expressions in queries, makes native XML databases easy to set up and administer for nontechnical users. This combination of features has yet to appear in other NoSQL systems, since standardization becomes critical only as third-party software developers look for application portability.
Despite the W3C’s work on extending XQuery for updates and full-text search,
there are still areas that lack standardization. Although native XML databases allow
you to create custom indexes for things like geolocation, RDF data, and graphs, there
are still few standards in these areas, making porting applications between native XML
databases more difficult than it needs to be. New work by the W3C, EXPath developers, and other researchers may mitigate these problems in the future. If these standards continue to be developed, XQuery-based document stores may become a more
robust platform for NoSQL developers.
The cache-friendliness of documents and the parallel nature of the FLWOR statement make native XML databases inherently more scalable than SQL systems. In the next chapter, we'll focus on some of the techniques NoSQL systems use when managing large datasets.

5.9 Further reading
 eXist-db. http://exist-db.org/.
 EXPath. http://expath.org.
 JSONiq. “The JSON Query Language.” http://www.jsoniq.org.
 MarkLogic. http://www.marklogic.com.
 “Metcalfe’s Law.” Wikipedia. http://mng.bz/XqMT.
 “Network effect.” Wikipedia. http://mng.bz/7dIQ.
 Oracle. “Oracle Berkeley DB XML & XQuery.” http://mng.bz/6w3z.
 TEI. “TEI: Text Encoding Initiative.” http://www.tei-c.org.
 W3C. “EXPath Community Group.” http://mng.bz/O3j8.
 W3C. “XML Query Use Cases.” http://mng.bz/h25P.
 W3C. “XML Schema Part 2: Datatypes Second Edition.” http://mng.bz/F8Gx.
 W3C. “XQuery and XPath Full Text 1.0.” http://mng.bz/Bd9E.
 W3C. “XQuery implementations.” http://mng.bz/49rG.
 W3C. “XQuery Update Facility 1.0.” http://mng.bz/SN6T.
 XSPARQL. http://xsparql.deri.org.

Part 3
NoSQL solutions

Part 3 is a tour of how NoSQL solutions solve the real-world business problems of big data, search, high availability, and agility. As you go through each chapter, you'll be presented with a business problem, and then see how one or more NoSQL technologies can be cost-effectively implemented to result in a positive return on investment for an organization.
Chapter 6 tackles the issues of big data and linear scalability. You’ll see how
NoSQL systems leverage large numbers of commodity CPUs to solve large dataset and big data problems. You’ll also get an in-depth review of MapReduce and
the need for parallel processing.
In chapter 7 we identify the key features associated with a strong search
system and show you how NoSQL systems can be used to create better search
applications.
Chapter 8 covers how NoSQL systems are used to address the issues of high
availability and minimal downtime.
Chapter 9 looks at agility and how NoSQL systems can help organizations
quickly respond to changing organizational needs. Many people who are new to
the NoSQL movement underestimate how constraining RDBMSs can be when
market demand or business conditions change. This chapter shows how NoSQL
systems can be more adaptable to changing system and market requirements
and provide a competitive edge to an organization.

6 Using NoSQL to manage big data

This chapter covers
 What is a big data NoSQL solution?
 Classifying big data problems
 The challenges of distributed computing for big data
 How NoSQL handles big data

By improving our ability to extract knowledge and insights from large and complex collections of digital data, the initiative promises to help solve some of the Nation's most pressing challenges.
—US Federal Government, “Big Data Research and Development Initiative”

Have you ever wanted to analyze a large amount of data gathered from log files or
files you’ve found on the web? The need to quickly analyze large volumes of data is
the number-one reason organizations leave the world of single-processor RDBMSs
and move toward NoSQL solutions. You may recall our discussion in chapter 1 on
the key business drivers: volume, velocity, variability, and agility. The first two, volume and velocity, are the most relevant to big data problems.


Twenty years ago, companies managed datasets that contained approximately a
million internal sales transactions, stored on a single processor in a relational database. As organizations generated more data from internal and external sources, datasets expanded to billions and trillions of items. The amount of data made it difficult
for organizations to continue to use a single system to process this data. They had to
learn how to distribute the tasks among many processors. This is what is known as a
big data problem.
Today, using a NoSQL solution to solve your big data problems gives you some
unique ways to handle and manage your big data. By moving queries to the data, using
hash rings to distribute the load, using replication to scale your reads, and allowing
the database to distribute queries evenly to your data nodes, you can manage your
data and keep your systems running fast.
What’s driving the focus on solving big data problems? First, the amount of publicly available information on the web has grown exponentially since the late 1990s
and is expected to continue to increase. In addition, the availability of low-cost sensors
lets organizations collect data from everything; for instance, from farms, wind turbines, manufacturing plants, vehicles, and meters monitoring home energy consumption. These trends make it strategically important for organizations to efficiently and
rapidly process and analyze large datasets.
Now let’s look at how NoSQL systems, with their inherently horizontal scale-out
architectures, are ideal for tackling big data problems. We’ll look at several strategies
that NoSQL systems use to scale horizontally on commodity hardware. We’ll see how
NoSQL systems move queries to the data, not data to the queries. We’ll see how they
use hash rings to evenly distribute the data on a cluster and use replication to scale
reads. All these strategies allow NoSQL systems to distribute the workload evenly and
eliminate performance bottlenecks.
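To preview one of these strategies, here's a toy sketch of a consistent-hashing ("hash ring") lookup. It's written in MarkLogic XQuery to stay in this book's example language and leans on the MarkLogic-specific xdmp:hash64() builtin; the node names and key are hypothetical. Each node is hashed onto a ring, and a key lands on the first node whose hash is at or past the key's hash, wrapping around to the start of the ring:

xquery version "1.0-ml";

declare function local:ring-node(
  $key as xs:string,
  $nodes as xs:string*
) as xs:string
{
  let $key-hash := xdmp:hash64($key)
  (: arrange the nodes around the ring in hash order :)
  let $ring :=
    for $node in $nodes
    order by xdmp:hash64($node)
    return $node
  (: first node at or past the key's position :)
  let $successor :=
    (for $node in $ring
     where xdmp:hash64($node) ge $key-hash
     return $node)[1]
  return ($successor, $ring[1])[1]  (: wrap around if no successor :)
};

local:ring-node("trade-42", ("node-a", "node-b", "node-c"))

The payoff of this scheme is that when a node is added or removed, only the keys nearest that node move; a naive hash-mod-N assignment would reshuffle almost everything.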

6.1 What is a big data NoSQL solution?
So what exactly is a big data problem? A big data class problem is any business problem
that’s so large that it can’t be easily managed using a single processor. Big data problems force you to move away from a single-processor environment toward the more
complex world of distributed computing. Though great for solving big data problems,
distributed computing environments come with their own set of challenges (see figure 6.1).
We want to stress that big data isn’t the same as NoSQL. As we’ve defined NoSQL
in this book, it’s more than dealing with large datasets. NoSQL includes concepts and
use cases that can be managed by a single processor and have a positive impact on
agility and data quality. But we consider big data problems a primary use case for
NoSQL.
Before you assume you have a big data problem, you should consider whether you need all of your data or only a subset to solve your problem. Using a statistical sample allows you to work with a subset of your data and look for patterns in it. The trick is to come up with a process to ensure the sample you choose is a fair representation of the full dataset.
Figure 6.1 One or many databases? Here are some of the challenges you face when you move from a single processor to a distributed computing system. One database is easy to understand, set up, configure, and administer, and gives you a single source of truth, but its scalability is limited. Many databases are scalable if designed correctly, but they bring data partitioning, replication, clustering, query distribution, load balancing, consistency and syncing, latency and concurrency, clock synchronization, network bottlenecks and failures, multiple data centers, distributed backup, node failure, voting algorithms for error detection, and the administration and monitoring of many systems. Moving to a distributed environment is a nontrivial endeavor and should be done only if the business problem really warrants the need to handle large data volumes in a short period of time. This is why platforms like Hadoop are complex and require a complex framework to make things easier for the application developer.

You should also consider how quickly you need your data processed. Many data analysis problems can be handled by a batch-type solution running on a single processor; you may not need an immediate answer. The key is to understand the true time-critical nature of your situation.
Now that you know that distributed databases are more complex than a single-processor system and that there are alternatives to using a full dataset, let's look at why organizations are moving toward these complex systems. Why is the ability to handle big data strategically important to many organizations? Answering this question involves understanding the external factors that are driving the big data marketplace.
Here are some typical big data use cases:
 Bulk image processing—Organizations like NASA regularly receive terabytes of incoming data from satellites or even rovers on Mars. NASA uses a large number of servers to process these images and perform functions like image enhancement and photo stitching. Medical imaging systems like CAT scans and MRIs need to convert raw image data into formats that are useful to doctors and patients. Custom imaging hardware has been found to be more expensive than renting a large number of processors in the cloud when they're needed. For example, the New York Times converted 3.3 million scans of old newspaper articles into web formats using tools like Amazon EC2 and Hadoop for a few hundred dollars.
 Public web page data—Publicly accessible pages are full of information that organizations can use to be more competitive. They contain news stories, RSS feeds, new product information, product reviews, and blog postings. Not all of the information is authentic; there are millions of pages of fake product reviews created by competitors or third parties paid to disparage other sites. Figuring out which product reviews are valid is a topic for careful analysis.
 Remote sensor data—Small, low-power sensors can now track almost any aspect of our world. Devices installed on vehicles track location, speed, acceleration, and fuel consumption, and tell your insurance company about your driving habits. Road sensors can warn about traffic jams in real time and suggest alternate routes. You can even track the moisture in your garden, lawn, and indoor plants to suggest a watering plan for your home.
 Event log data—Computer systems create logs of read-only events from web page hits (also called clickstreams), email messages sent, or login attempts. Each of these events can help organizations understand who's using what resources and when systems may not be performing according to specification. Event log data can be fed into operational intelligence tools that send alerts to users when key indicators fall out of acceptable ranges.
 Mobile phone data—Every time users move to new locations, applications can track these events. You can see when your friends are near you or when customers walk through your retail store. Although there are privacy issues involved in accessing this data, it's forming a new type of event stream that can be used in innovative ways to give companies a competitive advantage.
 Social media data—Social networks such as Twitter, Facebook, and LinkedIn provide continuous real-time data feeds that can be used to see relationships and trends. Each site creates data feeds that you can use to look at trends in customer mood or get feedback on your own as well as your competitors' products.
 Game data—Games that run on PCs, video game consoles, and mobile devices have back-end datasets that need to scale quickly. These games store and share high scores for all users as well as game data for each player. Game site back ends must be able to scale by orders of magnitude if viral marketing campaigns catch on with their users.
 Open linked data—In chapter 4 we looked at how organizations can publish public datasets that can be ingested by your systems. Not only is this data large, but it may require complex tools to reconcile, remove duplication, and find invalid items.

When looking at these use cases, you see that some problems can be described as independent parallel transforms, since the output from one transform isn't used as an input to another. This includes problems like image and signal processing, where the focus is on efficient and reliable data transformation at scale. These use cases don't need the query or transaction support provided by many NoSQL systems. They