D.6 Sematext SPM: the Swiss Army knife

Figure D.8 Autocompletion of REST calls

Sematext SPM, shown in figure D.9, offers performance monitoring, querying capabilities, alerting, and anomaly detection in a cloud or on-premises offering.
SPM goes a step beyond the solutions mentioned previously by offering a rich set
of alerts and notification settings for Elasticsearch and across other infrastructure
you may have deployed, such as Apache Kafka, NGINX, Hadoop, MySQL, and others.
Alerts can be email-based, and they can post the alert data to another web service or
even integrate with other monitoring or collaboration applications, such as Atlassian
HipChat or Nagios.
Still, what appeals to us most about SPM is the all-in-one performance monitoring
dashboard idea, allowing users to see the big picture across every piece of their
deployed architecture or simply drill down into the real-time metrics being gathered

Figure D.9 Website: www.sematext.com License: Commercial

Licensed to Thomas Snead

Figure D.10 Alerts and notifications configuration

on their Elasticsearch cluster (see figure D.10). That said, SPM isn't free like some of the other options we discussed; pricing varies depending on usage (CPU/hour) and can be found at http://sematext.com/spm/index.html.
Sematext SPM is available in the following ways:

- On-premises installation
- As a service online at www.sematext.com

This appendix covered just a small sample of the Elasticsearch monitoring and management solutions available today. The current batch of available and community-supported monitoring plugins can be found at www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html#known-plugins.
Although Elasticsearch offers a complete and thorough REST API, the ability to visualize live and historic data is well worth the few minutes needed to install any of the plugins discussed here.


appendix E
Turning search upside down with the percolator
The Elasticsearch percolator is typically defined as "search upside down" for the following reasons:

- You index queries instead of documents. This registers the query in memory, so it can be quickly run later.
- You send a document to Elasticsearch instead of a query. This is called percolating a document: basically indexing it into a small, in-memory index. Registered queries are run against the small index, so Elasticsearch finds out which queries match.
- You get back a list of queries matching the document, instead of the other way around as in a regular search.

The typical use case for percolation is alerting. As shown in figure E.1, you can notify users when new documents (matching their interests) appear.
As the figure shows, using the get-together site example we've used throughout the book, you could let members define their interests, and you'd save them as percolator queries. When a new event is added, you can percolate it against those queries. Whenever there are matches, you can send emails to the respective users to notify them of new events relevant to their interests.
Next, we'll describe how to implement those alerts using the percolator. After that, we'll explain how it works under the hood, and then we'll move on to performance and functionality tricks.

Figure E.1 Typical use case: percolating a document enables the application to send alerts to users if their stored queries match the document.

E.1 Percolator basics

There are three steps needed for percolation:

1. Make sure there's a mapping in place for all the fields referenced by the registered queries.
2. Register the queries themselves.
3. Percolate documents.

Figure E.2 shows these steps.
We'll take a closer look at these three steps next, and then we'll move on to how the percolator works and what its limitations are.


Figure E.2 You need a mapping and some registered queries in order to percolate documents.

E.1.1 Define a mapping, register queries, then percolate documents
Assume you want to send alerts for any new events about the Elasticsearch percolator. Before registering queries, you need a mapping for all the fields you run queries on. In the case of our get-together example, you might already have mappings for groups and events if you ran populate.sh from the code samples. If you haven't done that already, you can download the code samples from https://github.com/dakrone/elasticsearch-in-action and run populate.sh.
With the data from the code samples in place, you can register a query looking for Elasticsearch Percolator in the title field. You already have the mapping for title in place because you ran populate.sh:
% curl -XPUT 'localhost:9200/get-together/.percolator/1' -d '{
  "query": {
    "match": {
      "title": "elasticsearch percolator"
    }
  }
}'

Note that the body of your request is the match query, but to register it, you have to send it through a PUT request as you would when adding a document. To let Elasticsearch know this isn't your average document but a percolator query, you have to specify the .percolator type.

NOTE As you might expect, you can add as many queries as you want, at any point in time. The percolator is real time, so a new query will be accounted for in percolations right after it's added.


With your mapping and queries in place, you can start percolating documents. To do
that, you’ll hit the _percolate endpoint of the type where the document would go
and put the contents of the document under the doc field:
% curl 'localhost:9200/get-together/event/_percolate?pretty' -d '{
  "doc": {
    "title": "Discussion on Elasticsearch Percolator"
  }
}'

You’ll get back a list of matching queries, identified by the index name and ID:
"total" : 1,
"matches" : [ {
  "_index" : "get-together",
  "_id" : "1"
} ]

TIP If you have lots of queries registered in the same index, you might want
only the IDs to shorten the reply. To do that, add the percolate_format=ids
parameter to the request URI.
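As a sketch of that shortened request (assuming the same local cluster and the query registered above):

```shell
# Request body reused from the earlier percolation example
cat > perc_ids.json <<'EOF'
{
  "doc": {
    "title": "Discussion on Elasticsearch Percolator"
  }
}
EOF
# With percolate_format=ids, "matches" comes back as a flat array of
# query IDs instead of index/ID objects (needs a running cluster)
curl 'localhost:9200/get-together/event/_percolate?percolate_format=ids&pretty' \
  --data-binary @perc_ids.json || true
```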

Next, let’s look at how the percolator works and what kind of limitations you can expect.

E.1.2 Percolator under the hood

In the percolation you just did, Elasticsearch loaded the registered query and ran it against a tiny, in-memory index containing the document you percolated. If you had registered more queries, all of them would have been run on that tiny index.
REGISTERING QUERIES

It's convenient that in Elasticsearch queries are normally expressed in JSON, just as documents are; when you register a query, it's stored in the .percolator type of the index you point it to. This is good for durability, because registered queries are stored like any other documents. In addition to storing the query, Elasticsearch loads it in memory so it can be executed quickly.
WARNING Because registered queries are parsed and kept in memory, you need to make sure you have enough heap on each node to hold those queries. As we'll see in section E.2.2 of this appendix, one way to deal with large numbers of queries is to use a separate index (or more indices) for percolation. This way you can scale out percolation independently of the actual data.
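A minimal sketch of that separate-index approach, assuming a hypothetical index named percolator-queries that carries only the mapping your queries need and no data:

```shell
# Mappings-only index definition: it will hold registered queries, not data
cat > percolator_queries_index.json <<'EOF'
{
  "mappings": {
    "event": {
      "properties": {
        "title": { "type": "string" }
      }
    }
  }
}
EOF
# With a cluster running locally, create the index and register queries
# into it, so query heap usage scales independently of the data indices
curl -XPUT 'localhost:9200/percolator-queries' \
  -d @percolator_queries_index.json || true
curl -XPUT 'localhost:9200/percolator-queries/.percolator/1' \
  -d '{"query": {"match": {"title": "elasticsearch"}}}' || true
```

Percolation requests would then go to the percolator-queries index instead of the data index.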

UNREGISTERING QUERIES

To unregister a query, you have to delete it from the index using the .percolator type
and the ID of the query:
% curl -XDELETE 'localhost:9200/get-together/.percolator/1'


Because queries are also loaded in memory, deleting a query doesn't always unregister it. A delete-by-ID does remove the percolation query from memory, but as of version 1.4, a delete-by-query request doesn't unregister matching queries from memory. For that to happen, you'd need to reopen the index; for example:

% curl -XDELETE 'localhost:9200/get-together/.percolator/_query?q=*:*'
# right now, any deleted queries are still in memory
% curl -XPOST 'localhost:9200/get-together/_close'
% curl -XPOST 'localhost:9200/get-together/_open'
# now they're unregistered from memory, too

PERCOLATING DOCUMENTS

When you percolate a document, that document is first indexed in an in-memory index; then all registered queries are run against that index to see which ones match.
Because you can only percolate one Elasticsearch document at a time, as of version 1.4 the parent-child queries you saw in chapter 8 don't work with the percolator, because they imply multiple documents. Plus, you can always add new children to the same parent, so it's difficult to keep all relevant data in the in-memory index.
By contrast, nested queries work, because nested documents are always indexed together in the same Elasticsearch document. You can see such an example in the following listing, where you'll percolate events with attendee names as nested documents.
Listing E.1 Using percolator with nested attendee names

# Define attendee-name as a nested field
curl -XPUT 'localhost:9200/get-together/_mapping/nested-events' -d '{
  "properties": {
    "title": { "type": "string" },
    "attendee-name": {
      "type": "nested",
      "properties": {
        "first": { "type": "string" },
        "last": { "type": "string" }
      }
    }
  }
}'
# Register a nested query
curl -XPUT 'localhost:9200/get-together/.percolator/1' -d '{
  "query": {
    "nested": {
      "path": "attendee-name",
      "query": {
        "bool": {
          "must": [
            { "match": { "attendee-name.first": "Lee" }},
            { "match": { "attendee-name.last": "Hinman" }}
          ]
        }
      }
    }
  }
}'
# The first nested document will match the registered query
curl 'localhost:9200/get-together/nested-events/_percolate?pretty' -d '{
  "doc": {
    "title": "Percolator works with nested documents",
    "attendee-name": [
      { "first": "Lee", "last": "Hinman" },
      { "first": "Radu", "last": "Gheorghe" },
      { "first": "Roy", "last": "Russo" }
    ]
  }
}'

As the number of queries grows, percolating a single document requires more CPU.
That’s why it’s important to register cheap queries wherever possible; for example, by
using ngrams instead of wildcards or regular expressions. You can look back at chapter 10 for performance tips, and section 10.4.1 describes the tradeoff between ngrams
and wildcards.
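As an illustration of that tradeoff (the title.edge subfield and the edge-ngram analyzer behind it are assumptions here, not part of the book's sample mapping):

```shell
# Expensive to percolate: a wildcard query is checked against every
# term of each percolated document's in-memory index
cat > wildcard_query.json <<'EOF'
{ "query": { "wildcard": { "title": "elast*" } } }
EOF
# Cheaper sketch: a plain match query against a hypothetical
# edge-ngram-analyzed subfield (see section 10.4.1 for the setup)
cat > ngram_query.json <<'EOF'
{ "query": { "match": { "title.edge": "elast" } } }
EOF
# With a cluster running locally, register either one:
curl -XPUT 'localhost:9200/get-together/.percolator/wildcard-example' \
  -d @wildcard_query.json || true
curl -XPUT 'localhost:9200/get-together/.percolator/ngram-example' \
  -d @ngram_query.json || true
```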
Percolation performance may be a concern for you, and in the next section we’ll
show you percolator-specific tips depending on your use case.

E.2 Performance tips

For different percolator use cases, there are different things you can do to improve performance. In this section, we'll look at the most important techniques and divide them into two categories:

- Optimizations to the format of the request or the reply—You can percolate existing documents, percolate multiple documents in one request, and ask for only the number of matching queries, instead of the whole list of IDs.
- Optimizations to the way you organize queries—As we mentioned earlier, you can use one or more separate indices to store registered queries. Here, you'll apply this advice, and we'll also look at how you can use routing and filtering to reduce the number of queries being run for each percolation.

E.2.1 Options for requests and replies

In some use cases, you can get away with fewer requests or less data going through the network. Here, we'll look at three ways to achieve this:

- Percolating existing documents
- Using multi percolate, which is the bulk API of percolation
- Counting the number of matching queries instead of getting the full list

PERCOLATING EXISTING DOCUMENTS

This works well if what you percolate is what you index, especially if documents are big. For example, if you index blogs, it might be slow to send every post twice over HTTP: once for indexing and once for alerting subscribers to posts matching their interests. In such cases, indexing a document and then percolating it by ID, instead of submitting it again, makes sense.
NOTE Percolating existing documents doesn't work well for all use cases. For example, if social media posts have a geo point field, you can register geo queries matching each country's area. This way, you can percolate each post to determine its country of origin and add this information to the post before indexing it. In such use cases, you need to percolate and then index; it doesn't make sense to do it the other way around. The use case of determining the country of origin is described in the following blog post by Elastic: www.elastic.co/blog/using-percolator-geo-tagging/.

In the next listing, you’ll register a query for groups matching elasticsearch. Then
you’ll percolate the group with ID 2 (Elasticsearch Denver), which is already indexed,
instead of sending its content all over again.
Listing E.2 Percolating an existing group document

# Query matching groups about Elasticsearch;
# the .percolator ID 2 is not related to group ID 2
curl -XPUT 'localhost:9200/get-together/.percolator/2' -d '{
  "query": {
    "match": {
      "name": "elasticsearch"
    }
  }
}'
# Percolate the existing Elasticsearch Denver group (ID 2)
curl 'localhost:9200/get-together/group/2/_percolate?pretty'

MULTI PERCOLATE API

Whether you percolate existing documents or not, you can do multiple percolations at once. This works well if you also index in bulks. For example, you can use the percolator for automated tagging of blog posts by having one query for each tag. When a batch of posts arrives, you can do as shown in figure E.3:

1. Percolate them all at once through the multi percolate API. Then, in your application, append the matching tags. Be aware that the percolate API returns only the IDs of the matching queries, so your application has to map the IDs of the percolation queries to the tags: 1 to elasticsearch, 2 to release, and 3 to book. Another approach would be to give each percolation query an ID equal to its tag.
2. Finally, index all posts at once through the bulk API we introduced in chapter 10.

Be aware that sending each document twice, once for percolation and once for indexing, does imply more network traffic. The advantage is that you don't have to re-index documents to add tags through updates. That would be the alternative: index each document first, percolate it by ID, and then use updates to add the tags to the indexed documents.

Figure E.3 Percolator for automated tagging. The multi percolate and bulk APIs reduce the number of requests. Before step 1, the percolation queries have been indexed. In step 1 you use the multi percolate API to find matching percolation queries. The application maps the IDs to the tags and adds them to the documents to index. In step 2 you use the bulk index API to index the documents.

In the following listing you'll apply what's described in figure E.3.

Listing E.3 Using the multi percolate and bulk APIs for automated tagging

# Create the index first, with the mapping for the title field
curl -XPUT 'localhost:9200/blog' -d '{
  "mappings": {
    "posts": {
      "properties": {
        "title": {
          "type": "string"
        }
      }
    }
  }
}'
# You can use the bulk API to register queries,
# just as you've used the index API so far
echo '{"index" : {"_index" : "blog", "_type" : ".percolator", "_id": "1"}}
{"query": {"match": {"title": "elasticsearch"}}}
{"index" : {"_index" : "blog", "_type" : ".percolator", "_id": "2"}}
{"query": {"match": {"title": "release"}}}
{"index" : {"_index" : "blog", "_type" : ".percolator", "_id": "3"}}
{"query": {"match": {"title": "book"}}}
' > bulk_requests_queries
curl 'localhost:9200/_bulk?pretty' --data-binary @bulk_requests_queries
# Multi percolate returns matches for each percolated document
echo '{"percolate" : {"index" : "blog", "type" : "posts"}}
{"doc": {"title": "New Elasticsearch Release"}}
{"percolate" : {"index" : "blog", "type" : "posts"}}
{"doc": {"title": "New Elasticsearch Book"}}
' > perc_requests
curl 'localhost:9200/_mpercolate?pretty' --data-binary @perc_requests
# Knowing which tag corresponds to which post,
# you can index posts with tags, too
echo '{"index" : {"_index" : "blog", "_type" : "posts"}}
{"title": "New Elasticsearch Release", "tags": ["elasticsearch", "release"]}
{"index" : {"_index" : "blog", "_type" : "posts"}}
{"title": "New Elasticsearch Book", "tags": ["elasticsearch", "book"]}
' > bulk_requests
curl 'localhost:9200/_bulk?pretty' --data-binary @bulk_requests

Note how similar the multi percolate API is to the bulk API:

- Every request takes two lines in the request body.
- The first line shows the operation (percolate) and identification information (index, type, and for existing documents, the ID). Note that the bulk API uses underscores, like _index and _type, but multi percolate doesn't (index and type).
- The second line contains metadata. You'd put the document in there under the doc field. When you're percolating existing documents, the metadata JSON would be empty.
- Finally, the body of the request is sent to the _mpercolate endpoint. As with the bulk API, this endpoint can contain the index and the type name, which can then be omitted from the body.
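For instance, a sketch of the empty-metadata case, percolating the already-indexed group with ID 2 from the code samples:

```shell
# First line: the percolate action pointing at an existing document
# (no underscores in index/type/id); second line: empty metadata JSON,
# because the document is fetched by ID
echo '{"percolate" : {"index" : "get-together", "type" : "group", "id" : "2"}}
{}' > mperc_existing
# With a cluster running locally and the sample data loaded:
curl 'localhost:9200/_mpercolate?pretty' --data-binary @mperc_existing || true
```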

GETTING ONLY THE NUMBER OF MATCHING QUERIES

Besides the percolate action, the multi percolate API supports a count action, which will return the same reply as before, with the total number of matching queries for each document, but without the matches array:
echo '{"count" : {"index" : "blog", "type" : "posts"}}
{"doc": {"title": "New Elasticsearch Release"}}
{"count" : {"index" : "blog", "type" : "posts"}}
{"doc": {"title": "New Elasticsearch Book"}}
' > percolate_requests
curl 'localhost:9200/_mpercolate?pretty' --data-binary @percolate_requests
