8.5 Denormalizing: using redundant data connections

CHAPTER 8

Relations among documents

Figure 8.16 Hierarchical relationship denormalized by copying group information to each event. The group “Denver technology” is repeated next to each of its three events: “Introduction to Elasticsearch,” “Introduction to Hadoop,” and “Logging and Elasticsearch.”

This relationship can be denormalized by adding the group info to all the events, as
shown in figure 8.16.
Next we’ll look at how and when denormalizing helps and how you’d concretely
index and query denormalized data.

8.5.1 Use cases for denormalizing
Let’s start with the disadvantages: denormalized data takes more space and is more
difficult to manage than normalized data. In the example from figure 8.16, if you
change the group’s details, you have to update three documents because those details
appear three times.
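
To make the update cost concrete, here is a minimal Python sketch that uses in-memory dictionaries as stand-ins for the three event documents of figure 8.16. The field names (event, group), the new group name, and the rename_group helper are assumptions made for illustration; they aren't part of the get-together mapping.

```python
# In-memory stand-ins for the three denormalized event documents of figure 8.16.
# Field names are illustrative assumptions, not the book's mapping.
events = [
    {"event": "Introduction to Elasticsearch", "group": "Denver technology"},
    {"event": "Introduction to Hadoop", "group": "Denver technology"},
    {"event": "Logging and Elasticsearch", "group": "Denver technology"},
]


def rename_group(docs, old_name, new_name):
    """Rewrite the copied group name in every event; return how many documents changed."""
    changed = 0
    for doc in docs:
        if doc["group"] == old_name:
            doc["group"] = new_name
            changed += 1
    return changed


# One logical change to the group touches every event that carries a copy of it.
updated = rename_group(events, "Denver technology", "Denver tech meetup")
print(updated)
```

With normalized data the same change would touch a single group document.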
On the positive side, you don’t have to join different documents when you query.
This is particularly important in distributed systems because having to join documents
across the network introduces big latencies, as you can see in figure 8.17.

Figure 8.17 Joining documents across nodes is difficult because of network latency. The group “Denver technology” and its three events are spread across Node 1, Node 2, and Node 3.

Licensed to Thomas Snead


Figure 8.18 Nested/parent-child relations make sure all joins are local. The groups “San Francisco technology” and “Denver technology” are each stored with their own events, across Node 1 and Node 2.

Nested and parent-child documents get around this by making sure a parent and all
its children are stored in the same node, as shown in figure 8.18:

• Nested documents are indexed in Lucene blocks, which are always together in
  the same segment of the same shard.
• Child documents are indexed with the same routing value as their parents,
  making them belong to the same shard.
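
The routing rule for child documents can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the number of shards is made up, CRC32 stands in for whatever hash function Elasticsearch uses internally, and the shard_for and route helpers are hypothetical names.

```python
import zlib

NUM_SHARDS = 3  # hypothetical number of primary shards


def shard_for(routing_value, num_shards=NUM_SHARDS):
    """Pick a shard from a routing value. CRC32 is a stand-in, not Elasticsearch's own hash."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def route(doc_id, parent_id=None):
    """A child is routed by its parent's ID; a document without a parent by its own ID."""
    return shard_for(parent_id if parent_id is not None else doc_id)


group_shard = route("2")                      # a group with ID 2
child_shard = route("10001", parent_id="2")   # a child whose parent is group 2
print(group_shard == child_shard)
```

Because the child's own ID never enters the calculation, every child of the same parent lands on the parent's shard, which is what keeps the join local.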

DENORMALIZING ONE-TO-MANY RELATIONS

Local joins done with nested and parent-child structures are much, much faster than
remote joins could be. Still, they’re more expensive than having no joins at all. This is
where denormalizing can help, but it implies that there’s more data. Your indexing
operations will cause more load because you’ll index more data and queries will run
on larger indices, making them slower.
You can see that there’s a tradeoff when it comes to choosing among nested, parent-child, and denormalizing. Typically, you’ll denormalize for one-to-many relations if
your data is fairly small and static and you have lots of queries. This way, disadvantages
hurt less—index size is acceptable and there aren’t too many indexing operations—
and avoiding joins should make queries faster.
TIP If performance is important to you, take a look at chapter 10, which is all
about indexing and searching fast.


Figure 8.19 Many-to-many relationships can contain a huge amount of data, making local joins impossible. Four groups (“San Francisco technology,” “Denver Clojure,” “Denver Elasticsearch,” and “Bucharest Elasticsearch”) and four members (Igor, Radu, Joe, and Lee) are linked to one another across Node 1 and Node 2.

DENORMALIZING MANY-TO-MANY RELATIONSHIPS

Many-to-many relationships are dealt with differently than one-to-many relationships
in Elasticsearch. For example, a group can contain multiple members, and a person
could be a member of multiple groups.
Here denormalizing is a much better proposition because unlike one-to-many
implementations of nested and parent-child, Elasticsearch can’t promise to contain
many-to-many relationships in a single node. As shown in figure 8.19, a single relationship may expand to your whole dataset. This would make expensive, cross-network
joins inevitable.
Because of how slow cross-network joins would be, as of version 1.5, denormalizing
is the only way to represent many-to-many relationships in Elasticsearch. Figure 8.20
shows how the structure of figure 8.19 looks when members are denormalized as children of each group they belong to. We denormalize one side of the many-to-many
relationship into more one-to-many relationships.
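
As a rough illustration of that transformation, the following Python sketch turns a list of (member, group) pairs into one list of member copies per group. The membership pairs are invented for the example (the figures don't spell out who belongs to which group), and the denormalize helper is a hypothetical name.

```python
from collections import defaultdict

# Invented membership pairs; the real memberships aren't given in the text.
memberships = [
    ("Igor", "Denver Clojure"),
    ("Igor", "Denver Elasticsearch"),
    ("Radu", "Denver Elasticsearch"),
    ("Radu", "Bucharest Elasticsearch"),
]


def denormalize(pairs):
    """Turn (member, group) pairs into one-to-many: each group gets its own member copies."""
    children = defaultdict(list)
    for member, group in pairs:
        children[group].append({"name": member})  # a separate copy for every group
    return dict(children)


by_group = denormalize(memberships)
copies = sum(len(members) for members in by_group.values())
print(copies)  # two people, but four member documents to manage
```

Each group now owns its children outright, so the relationship can be stored with nested or parent-child structures, at the cost of one extra copy per additional membership.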
Next we’ll look at how you can index, update, and query a structure like the one in
figure 8.20.

8.5.2 Indexing, updating, and deleting denormalized data
Before you start indexing, you have to decide how you want to denormalize your
many-to-many into one-to-many, and there are two big decision points: which side of
the relationship you should denormalize and how you want to represent the resulting
one-to-many relationship.


Figure 8.20 Many-to-many relation denormalized into multiple one-to-many relations, allowing local joins. Each of the four groups now has its own copies of its members, so Igor, Radu, and Joe each appear twice and Lee appears once, across Node 1 and Node 2.

WHICH SIDE WILL BE DENORMALIZED?

Will members be multiplied as children of groups or the other way around? To pick
one you have to understand how data is indexed, updated, deleted, and queried. The
part that’s denormalized—the child—will be more difficult to manage in all aspects:





• You index each child document multiple times, once for each of its parents.
• When you update, you have to update all instances of that document.
• When you delete, you have to delete all instances.
• When you query for children separately, you’ll get more hits with the same content, so you have to remove duplicates on the application side.

Based on these assumptions, it looks like it makes more sense to make members children of groups. Member documents are smaller in size, change less often, and are
queried less often than groups are with their events. As a result, managing cloned
member documents should be easier.
HOW DO YOU WANT TO REPRESENT THE ONE-TO-MANY RELATIONSHIP?

Will you have parent-child or nested documents? You’d choose here based on how
often groups and members are searched and retrieved together. Nested queries perform better than has_parent or has_child queries.
Another important aspect is how often membership changes. Parent-child structures perform better here because they can be updated separately.
For this example, let’s assume that searching and retrieving groups and members together is rare and that members often join and leave groups, so we’ll go with
parent-child.


INDEXING

Groups and their events would be indexed as before, but members have to be indexed
once for every group they belong to. The following listing will first define a mapping
for the new member type and then index Mr. Hinman as a member of both the Denver
Clojure and the Denver Elasticsearch groups from the code samples.
Listing 8.10 Indexing denormalized members
# First define the mapping, specifying that the parent type for members is group.
curl -XPUT 'localhost:9200/get-together/_mapping/member' -d '{
  "member": {
    "_parent": { "type": "group"},
    "properties": {
      "first_name": { "type": "string"},
      "last_name": { "type": "string"}
    }
  }
}'
# parent=1 points to the Denver Clojure group.
curl -XPUT 'localhost:9200/get-together/member/10001?parent=1' -d '{
  "first_name": "Matthew",
  "last_name": "Hinman"
}'
# parent=2 points to the Denver Elasticsearch group.
curl -XPUT 'localhost:9200/get-together/member/10001?parent=2' -d '{
  "first_name": "Matthew",
  "last_name": "Hinman"
}'

NOTE Multiple indexing operations can be done in a single HTTP request by
using the bulk API. We’ll discuss the bulk API in chapter 10, which is all about
performance.
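
As a preview, here is a hedged Python sketch that builds a bulk request body indexing one copy of a member per parent group. It assumes the newline-delimited bulk format, in which an action line precedes each document, and it assumes that the action metadata accepts a _parent key in the version you run; check both against your version's documentation. The bulk_index_body helper is a hypothetical name, and nothing is sent to a cluster.

```python
import json


def bulk_index_body(index, doc_type, doc_id, parent_ids, source):
    """Build a newline-delimited bulk body that indexes one copy of `source` per parent."""
    lines = []
    for parent_id in parent_ids:
        # Assumption: the action metadata accepts "_parent" in your version.
        action = {"index": {"_index": index, "_type": doc_type,
                            "_id": doc_id, "_parent": parent_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the body has to end with a newline


body = bulk_index_body("get-together", "member", "10001", ["1", "2"],
                       {"first_name": "Matthew", "last_name": "Hinman"})
print(body)
```

The resulting string would be sent as the body of a single request to the _bulk endpoint instead of issuing one request per copy.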

UPDATING

Once again, groups get lucky and you update them just as you saw in chapter 3, section 3.5. But if a member’s details change, then because the member is denormalized, you’ll first
have to search for all its duplicates and then update each one. In listing 8.11, you’ll
search for all the documents that have an _id of “10001” and update that member’s first name to
Lee because that’s what he likes to be called.
You’re searching for IDs instead of names because IDs tend to be more reliable
than other fields, such as names. You may recall from the parent-child section that
when you’re using the _parent field, multiple documents within the same type within
the same index can have the same _id value. Only the _id and _parent combination
is guaranteed to be unique. When denormalizing, you can use this feature and intentionally use the same _id for the same person, once for each group they belong to.
This allows you to quickly and reliably retrieve all the instances of the same person by
searching for their ID.


Listing 8.11 Updating denormalized members
# Search for all the members with the same ID, which will return all the
# duplicates of this person. You need only the _parent field from each
# document, so you know how to update.
curl 'localhost:9200/get-together/member/_search?pretty' -d '{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "_id": "10001"
        }
      }
    }
  },
  "fields": ["_parent"]
}'
# For each of the returned documents, update the name to “Lee.”
curl -XPOST 'localhost:9200/get-together/member/10001/_update?parent=1' -d '{
  "doc": {
    "first_name": "Lee"
  }
}'
curl -XPOST 'localhost:9200/get-together/member/10001/_update?parent=2' -d '{
  "doc": {
    "first_name": "Lee"
  }
}'
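
In an application you wouldn’t hardcode the two parents; you’d derive one update per hit from the search reply. The following Python sketch shows that step. The shape of the reply (a fields object holding _parent for each hit) is an assumption based on the fields parameter used in the search, the sample reply is handwritten, and the update_requests helper is a hypothetical name. The sketch only builds the request paths and bodies; it doesn’t send them.

```python
def update_requests(search_response, index, doc_type, partial_doc):
    """Return one (path, body) pair per duplicate found by the search."""
    updates = []
    for hit in search_response["hits"]["hits"]:
        parent = hit["fields"]["_parent"]
        # Some versions return field values as lists; accept both shapes.
        if isinstance(parent, list):
            parent = parent[0]
        path = "/%s/%s/%s/_update?parent=%s" % (index, doc_type, hit["_id"], parent)
        updates.append((path, {"doc": partial_doc}))
    return updates


# Handwritten stand-in for the reply to the search in listing 8.11.
response = {"hits": {"hits": [
    {"_id": "10001", "fields": {"_parent": "1"}},
    {"_id": "10001", "fields": {"_parent": "2"}},
]}}

updates = update_requests(response, "get-together", "member", {"first_name": "Lee"})
for path, body in updates:
    print("POST", path, body)
```

The same loop works for deletes: keep the _id and parent from each hit and issue a DELETE instead of an update.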

NOTE Multiple updates can also be done in a single HTTP request over the
bulk API. As with bulk indexing, we’ll discuss bulk updates in chapter 10.

DELETING

Deleting a denormalized member requires you to identify all the copies again. Recall
from the parent-child section that in order to delete a specific document, you have to
specify both the _id and the _parent; that’s because the combination of the two is
unique in the same index and type. You’d have to identify members first through a
term filter like the one in listing 8.11. Then you’d delete each member instance:
% curl -XDELETE 'localhost:9200/get-together/member/10001?parent=1'
% curl -XDELETE 'localhost:9200/get-together/member/10001?parent=2'

Now that you know how to index, update, and delete denormalized members, let’s
look at how you can run queries on them.

8.5.3 Querying denormalized data
If you need to query groups, there’s nothing denormalizing-specific because groups
aren’t denormalized. If you need search criteria from their members, use the has_child
query as you did in section 8.4.2.
Members drew the short straw with queries, too, because they’re denormalized.
You can search for them, even including criteria from the groups they belong to, with
the has_parent query. But there’s a problem: you’ll get back identical members. In
the following listing, you’ll index another member twice, once for each group, and when
you search, you’ll get both copies back.
Listing 8.12 Querying for denormalized data returns duplicate results
# Index a person twice, once for each group.
curl -XPUT 'localhost:9200/get-together/member/10002?parent=1' -d '{
  "first_name": "Radu",
  "last_name": "Gheorghe"
}'
curl -XPUT 'localhost:9200/get-together/member/10002?parent=2' -d '{
  "first_name": "Radu",
  "last_name": "Gheorghe"
}'
curl -XPOST 'localhost:9200/get-together/_refresh'
# Search for the person by name.
curl 'localhost:9200/get-together/member/_search?pretty' -d '{
  "query": {
    "term": {
      "first_name": "radu"
    }
  }
}'
# reply: the same person is returned twice, once for each group
  "hits" : [ {
    "_index" : "get-together",
    "_type" : "member",
    "_id" : "10002",
    "_score" : 2.871802, "_source" : {
      "first_name": "Radu","last_name": "Gheorghe"}
  }, {
    "_index" : "get-together",
    "_type" : "member",
    "_id" : "10002",
    "_score" : 2.5040774, "_source" : {
      "first_name": "Radu","last_name": "Gheorghe"}
  } ]

As of version 1.5, you can only remove those duplicate members from your application. Once again, if the same person always has the same ID, you can use that ID to
make this task easier: two results with the same ID are identical.
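
A minimal Python sketch of that application-side cleanup follows. It assumes, as the text does, that every copy of a person carries the same _id; the hits list mirrors the reply in listing 8.12, and the unique_by_id helper is a hypothetical name.

```python
def unique_by_id(hits):
    """Keep the first hit for each _id, preserving the order of the results."""
    seen = set()
    unique = []
    for hit in hits:
        if hit["_id"] not in seen:
            seen.add(hit["_id"])
            unique.append(hit)
    return unique


# The two hits from the reply in listing 8.12, trimmed to the relevant keys.
hits = [
    {"_id": "10002", "_score": 2.871802,
     "_source": {"first_name": "Radu", "last_name": "Gheorghe"}},
    {"_id": "10002", "_score": 2.5040774,
     "_source": {"first_name": "Radu", "last_name": "Gheorghe"}},
]

members = unique_by_id(hits)
print(len(members))
```

Because hits arrive sorted by score, keeping the first occurrence also keeps the best-scoring copy. Keep in mind that removing duplicates after the fact can leave a page of results shorter than the size you requested.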
The same problem occurs with aggregations: if you want to count some properties
of the members, those counts will be inaccurate because the same member appears in
multiple places.
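
The following Python sketch shows the effect on a simple count by last name. The tuples are in-memory stand-ins for the member copies indexed in listings 8.10 and 8.12; the counting is done in plain Python rather than by an Elasticsearch aggregation, purely to illustrate why the numbers drift.

```python
from collections import Counter

# Stand-ins for the indexed member copies: (_id, parent group ID, last_name).
copies = [
    ("10001", "1", "Hinman"), ("10001", "2", "Hinman"),
    ("10002", "1", "Gheorghe"), ("10002", "2", "Gheorghe"),
]

# Counting documents, which is what happens when every copy is a separate document.
per_document = Counter(last_name for _id, _parent, last_name in copies)

# Counting people: collapse the copies by _id first.
people = dict((_id, last_name) for _id, _parent, last_name in copies)
per_person = Counter(people.values())

print(per_document["Hinman"], per_person["Hinman"])
```

Every person is counted once per group they belong to, so the more memberships someone has, the more they skew the result.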
The workaround for most searches and aggregations is to maintain a copy of all
members in a separate index. Let’s call it “members.” Querying that index will return
just that one copy of each member. The problem with this workaround is that it only
helps when you query members alone, unless you’re doing application-side joins, which
we’ll discuss next.

Using denormalization to define relationships: pros and cons
As we did with the other methods, we provide a quick overview of the strengths and
weaknesses of denormalizing. The plus points:

• It allows you to work with many-to-many relationships.
• No joins are involved, making querying faster if your cluster can handle the extra
  data caused by duplication.

The downsides:

• Your application has to take care of duplicates when indexing, updating, and
  deleting.
• Some searches and aggregations won’t work as expected because data is
  duplicated.

8.6 Application-side joins
Instead of denormalizing, another option for the groups and members relationship is
to keep them in separate indices and do the joins from your application. Much like
Elasticsearch does with parent-child, it requires you to store IDs to indicate which
member belongs to which group, and you have to query both.
For example, if you have a query for groups with “Denver” in the name, where
“Lee” or “Radu” is a member, you can run a bool query on members first to find out
which ones are Lee and Radu. Once you get the IDs, you can run a second query on
groups, where you add the member IDs in a terms filter next to the Denver query. The
whole process is illustrated in figure 8.21.
Figure 8.21 Application-side joins require you to run two queries.
User query: find groups with “Denver” in the name where Lee or Radu are members.
First query, run by the application on the members index: bool with should clauses name:lee and name:radu. Response: id=1; id=4.
Second query, run by the application on the group index: bool with must clauses name:denver and members:1,4. Response: name=Denver technology group; name=Denver search and big data.
Members index: id 1, name Lee; id 2, name Roy; id 3, name Susan; id 4, name Radu.
Group index: Denver technology group, members 1,2,3; Denver search and big data, members 1,4.
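
The two-step flow of figure 8.21 can be sketched in Python with in-memory lists standing in for the two indices. The data comes from the figure; the find_member_ids and find_groups helpers are hypothetical names, and plain list filtering replaces the real bool queries, so treat this as an outline of the control flow rather than of the Elasticsearch calls.

```python
# In-memory stand-ins for the two indices in figure 8.21.
members_index = [
    {"id": 1, "name": "Lee"},
    {"id": 2, "name": "Roy"},
    {"id": 3, "name": "Susan"},
    {"id": 4, "name": "Radu"},
]
groups_index = [
    {"name": "Denver technology group", "members": [1, 2, 3]},
    {"name": "Denver search and big data", "members": [1, 4]},
]


def find_member_ids(names):
    """First query: members matching any of the names (the bool query with should clauses)."""
    wanted = {name.lower() for name in names}
    return [m["id"] for m in members_index if m["name"].lower() in wanted]


def find_groups(name_term, member_ids):
    """Second query: groups with the term in their name that list any of the member IDs."""
    ids = set(member_ids)
    return [g["name"] for g in groups_index
            if name_term.lower() in g["name"].lower().split() and ids & set(g["members"])]


ids = find_member_ids(["Lee", "Radu"])   # first round trip
groups = find_groups("Denver", ids)      # second round trip, fed by the first
print(ids, groups)
```

The second call can’t start until the first one returns, which is why the number of matching members drives the cost of the whole join.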


This works well when there aren’t many matching members. But if you want to include
all members from a city, for example, the second query will have to run a terms filter
with possibly thousands of members, making it expensive. Still, there are some things
you can do:


• When you run the first query, if you need only member IDs, you can disable
  retrieving the _source field to reduce traffic:
    "query": {
      "filtered": {
        [...]
      }
    },
    "_source": false

• In the second query, if you have lots of IDs, it might be faster to execute the
  terms filter on field data:
    "query": {
      "filtered": {
        "filter": {
          "terms": {
            "members": [1, 4],
            "execution": "fielddata"
          }
        }
      }
    }
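
Put together, the two request bodies could be built from your application as in the following Python sketch. The field names name and members follow figure 8.21; the member_ids_query and groups_query helpers are hypothetical names, and the exact query layout is one possible arrangement of the fragments above, not the only one.

```python
import json


def member_ids_query(names):
    """First query body: match any of the names and skip _source, because only IDs are needed."""
    return {
        "query": {"bool": {"should": [{"term": {"name": n.lower()}} for n in names]}},
        "_source": False,
    }


def groups_query(name_term, member_ids):
    """Second query body: the name must match, filtered by member IDs on field data."""
    return {
        "query": {
            "filtered": {
                "query": {"term": {"name": name_term.lower()}},
                "filter": {"terms": {"members": member_ids,
                                     "execution": "fielddata"}},
            }
        }
    }


first = member_ids_query(["Lee", "Radu"])
second = groups_query("Denver", [1, 4])
print(json.dumps(first))
print(json.dumps(second))
```

In a real application, the list passed to groups_query would be the IDs collected from the hits of the first query.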

We’ll cover more about performance in chapter 10, but when you model document
relations, it ultimately comes down to picking your battles.

8.7 Summary
Lots of use cases have to deal with relational data, and in this chapter you saw how you
can deal with these:

• Object mapping, mostly useful for one-to-one relationships
• Nested documents and parent-child structures, which deal with one-to-many
  relationships
• Denormalizing and application-side joins, which are mostly helpful with many-to-many relationships

Joining hurts performance, even when it’s local, so it’s typically a good idea to put as
many properties as you can in a single document. Object mapping helps with this
because it allows hierarchies in your documents. Searches and aggregations work here
as they do with a flat-structured document; you have to refer to fields using their full
path, like location.name.
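
To see what a full path means, the following Python sketch flattens a hierarchical document into dotted field names. It only mimics the naming convention; the sample event, its values, and the flatten helper are made up for illustration and say nothing about how Elasticsearch stores the fields internally.

```python
def flatten(doc, prefix=""):
    """Return a dict mapping full dotted paths to the leaf values of a hierarchical document."""
    flat = {}
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


# A made-up event with an inner location object.
event = {
    "title": "Introduction to Hadoop",
    "location": {"name": "Example Venue", "geolocation": "40.0,-105.0"},
}

flat = flatten(event)
print(sorted(flat))
```

The inner name field is addressed as location.name, which is the path you’d use in a query or an aggregation.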


When you need to avoid cross-object matches, nested and parent-child documents
are available to help:

• Nested documents are basically index-time joins, putting multiple Lucene documents in a single block. To the application, the block looks like a single Elasticsearch document.
• The _parent field allows you to point a document to another document of
  another type in the same index to be its parent. Elasticsearch will use routing to
  make sure a parent and all its children land in the same shard so that it can perform a local join at query time.
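
A cross-object match is easiest to see with a small simulation. In the following Python sketch, object_match treats the inner objects the way a flattened structure would, as independent bags of values per field, while nested_match checks each inner object on its own. Both helpers and the sample data are hypothetical; they model the matching behavior only, not Lucene’s implementation.

```python
members = [
    {"first_name": "Lee", "last_name": "Hinman"},
    {"first_name": "Radu", "last_name": "Gheorghe"},
]


def object_match(objs, first, last):
    """Flattened view: values are pooled per field, losing which object they came from."""
    firsts = {o["first_name"] for o in objs}
    lasts = {o["last_name"] for o in objs}
    return first in firsts and last in lasts


def nested_match(objs, first, last):
    """Nested view: both criteria have to hold within the same inner object."""
    return any(o["first_name"] == first and o["last_name"] == last for o in objs)


# "Lee Gheorghe" doesn't exist, but the flattened view can't tell.
print(object_match(members, "Lee", "Gheorghe"), nested_match(members, "Lee", "Gheorghe"))
```

The flattened check reports a match for a person who isn’t there, which is the kind of false positive that nested and parent-child documents avoid.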

You can search nested and parent-child documents with the following queries and filters:

• nested query and filter
• has_child query and filter
• has_parent query and filter

Aggregations work across the relationship only with nested documents through the
nested and reverse_nested aggregation types.

Objects, nested and parent-child documents, and the generic technique of denormalizing can be combined in any way so you can get a good mix of performance and
functionality.
