Chapter 15. Using Intelligent Caching to Avoid the Bot Performance Tax


Figure 15-1. Attack of the holidays

In Project Honey Pot's case, the traffic from these bots had a significant performance impact. Because they did not follow the typical human visitation pattern, they were often triggering pages that weren't hot in our cache. Moreover, since the bots typically didn't fire JavaScript beacons like those used in systems like Google Analytics, their traffic and its impact weren't immediately obvious.

To solve the problem, we implemented two different systems to deal with two different types of bots. Because we had great data on web threats, we were able to leverage it to restrict known malicious crawlers from requesting dynamic pages on the site. Removing the threat traffic alone had an immediate impact and freed up database resources for legitimate visitors.
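The first system might be sketched as a simple request filter. The user-agent patterns and path list below are invented stand-ins for illustration, not Project Honey Pot's actual threat data:

```javascript
// Illustrative sketch of restricting known malicious crawlers from dynamic,
// database-backed pages. The agent patterns and dynamic-path list are
// hypothetical; a real deployment would draw on actual threat data.
const MALICIOUS_AGENTS = [/EmailCollector/i, /GrabberBot/i];
const DYNAMIC_PATHS = /^\/(search|stats|list)/;

function shouldBlock(userAgent, path) {
  // Static assets are cheap to serve, so only dynamic pages are protected.
  if (!DYNAMIC_PATHS.test(path)) return false;
  return MALICIOUS_AGENTS.some(function (re) {
    return re.test(userAgent || "");
  });
}
```

A request that matches both a known-bad agent and a dynamic path is rejected before it ever touches the database; everything else passes through unchanged.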

The same approach didn't make sense for the other type of automated bots: search engine crawlers. We wanted Project Honey Pot's pages to be found through online searches, so we didn't want to block search engine crawlers entirely. However, in spite of removing the threat traffic, Google, Yahoo, and Microsoft's crawlers all accessing the site at the same time would sometimes cause the web server and database to slow to a crawl.

The solution was a modification of our caching strategy. While we wanted to deliver the latest results to human visitors, we began serving search crawlers from a cache with a longer time to live (TTL). We experimented with the right TTLs for pages, but eventually settled on 1 day as optimal for the Project Honey Pot site. If a page is crawled by Google today and then Baidu requests the same page within the next 24 hours, we return the cached version without regenerating the page from the database.
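The two-tier TTL idea can be sketched roughly as follows. The crawler patterns are assumptions for illustration, and the 1-day figure mirrors the text rather than the site's exact rules:

```javascript
// Rough sketch of bot-aware TTL selection: recognized search crawlers are
// served from a long-lived cache, while human visitors get fresh pages.
// The crawler list is illustrative, not exhaustive.
const SEARCH_CRAWLERS = [/Googlebot/i, /bingbot/i, /Slurp/i, /Baiduspider/i];

function cacheTtlSeconds(userAgent) {
  const isCrawler = SEARCH_CRAWLERS.some(function (re) {
    return re.test(userAgent || "");
  });
  // 1 day for crawlers, always-fresh (no caching) for humans.
  return isCrawler ? 24 * 60 * 60 : 0;
}
```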



Search engines, by their nature, see a snapshot of the Internet. While it is important not to serve deceptively different content to their crawlers, modifying your caching strategy to minimize their performance impact on your web application is well within the bounds of good web practices.

Since starting CloudFlare (https://www.cloudflare.com/), we've taken the caching strategy we developed at Project Honey Pot and made it more intelligent and dynamic to optimize performance. We automatically tune the search crawler TTL to the characteristics of the site, and are very good at keeping malicious crawlers from ever hitting your web application. On average, we're able to offload 70% of the requests from a web application, which is stunning given the entire CloudFlare configuration process takes about 5 minutes. While some of this performance benefit comes from traditional CDN-like caching, some of the biggest cache wins actually come from handling bots' deep page views that aren't alleviated by traditional caching strategies.

The results can be dramatic. For example, SXSW's website employs extensive traditional web application and database caching systems, but was able to reduce the load on their web servers and database machines by more than 50% (http://blog.cloudflare.com/cloudflare-powers-the-sxsw-panel-picker), in large part because of CloudFlare's bot-aware caching (Figure 15-2).

Figure 15-2. Bot-aware caching results

When you’re tuning your web application for maximum performance, if you’re only

looking at a beacon-based analytics tool like Google Analytics you may be missing one

of the biggest sources of web application load. This is why CloudFlare’s analytics reports the visits from all visitors to your site. Even without CloudFlare, digging through

your raw server logs, being bot-aware, and building caching strategies that differentiate

between the behaviors of different classes of visitors can be an important aspect of any

site’s web performance strategy.

To comment on this chapter, please visit http://calendar.perfplanet.com/2011/using-intelligent-caching-to-avoid-the-bot-performance-tax/. Originally published on Dec 15, 2011.





Chapter 16. A Practical Guide to the Navigation Timing API

Buddy Brewer

Navigation Timing (http://dvcs.w3.org/hg/webperf/raw-file/tip/specs/NavigationTiming/Overview.html) is an API from the W3C's Web Performance Working Group (http://www.w3.org/2010/webperf/) that exposes data about the performance of your web pages. Navigation Timing is a major new development because it enables you to collect fine-grained performance metrics from real users, including events that happen before JavaScript-based trackers have a chance to load. This gives us the ability to directly measure things like DNS resolution, connection latency, and time to first byte from inside the browsers of real users.

Why You Should Care

I spent the first eight years of my career building synthetic monitoring products, but I now believe real user monitoring should be your preferred source of "The Truth" when it comes to understanding the performance of your site. That doesn't mean you should throw away your synthetic monitoring, but today I view it as a useful complement to real user monitoring rather than a complete performance solution in itself.

Real user monitoring is critical because it provides the most accurate portrayal of the true experience across the browsers, locations, and networks your users are on. It is the only way to realistically measure how your caching decisions impact the user experience. Measuring real people (with real personalities and real credit cards) also gives you an opportunity to collect performance and business metrics in the same context, so you can see what impact load times are having on key business metrics like conversion and bounce rates.

The biggest problem we face with Navigation Timing is that there isn't a good system for collecting and analyzing the raw data. In this chapter, I'll describe a solution to this problem that can be quickly deployed using free tools.



Collecting Navigation Timing Timestamps and Turning Them into Useful Measurements

The window.performance.timing object gives all of its metrics in the form of timestamps relative to the epoch. In order to turn these into useful measurements, we need to settle on a common vocabulary and do some arithmetic. I suggest starting with the following:

function getPerfStats() {
    var timing = window.performance.timing;
    return {
        dns: timing.domainLookupEnd - timing.domainLookupStart,
        connect: timing.connectEnd - timing.connectStart,
        ttfb: timing.responseStart - timing.connectEnd,
        basePage: timing.responseEnd - timing.responseStart,
        frontEnd: timing.loadEventStart - timing.responseEnd
    };
}



This gives you a starting point that is similar to the waterfall components you commonly see in synthetic monitoring tools. It would be interesting to collect this data for a while and compare it to your synthetic data to see how close they are.
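To see the arithmetic in action outside a browser, one can feed the same formulas a mocked timing object. All the timestamp values below are invented for illustration:

```javascript
// Mocked window.performance.timing values (milliseconds since the epoch,
// invented for this example) run through the same arithmetic as getPerfStats.
const timing = {
  domainLookupStart: 1000, domainLookupEnd: 1030,
  connectStart: 1030, connectEnd: 1070,
  responseStart: 1170, responseEnd: 1250,
  loadEventStart: 2050
};

const stats = {
  dns: timing.domainLookupEnd - timing.domainLookupStart,   // 30 ms
  connect: timing.connectEnd - timing.connectStart,         // 40 ms
  ttfb: timing.responseStart - timing.connectEnd,           // 100 ms
  basePage: timing.responseEnd - timing.responseStart,      // 80 ms
  frontEnd: timing.loadEventStart - timing.responseEnd      // 800 ms
};
```

Note how the durations partition the navigation: each measurement starts where the previous one ends, so the components sum to the full page load time.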

Using Google Analytics as a Performance Data Warehouse

Next we need a place to store the data we're collecting. You could write your own beacon service, or simply encode the values on a query string, log them in your web server's access logs, and write a program to parse and analyze the results. However, these are time-consuming approaches. We're looking for something we can get up and running quickly and at minimal cost. Enter Google Analytics (http://www.google.com/).

Google Analytics is the most popular free website analytics system on the Internet. While GA automatically provides basic performance metrics in its Site Speed Analytics Report (http://analytics.blogspot.com/2011/05/measure-page-load-time-with-site-speed.html), it is based on a sample of data and only reports the total page load time. We can improve on this by using GA's event tracking capability to store and analyze our fine-grained Navigation Timing metrics:

window.onload = function() {
    if (window.performance && window.performance.timing) {
        var ntStats = getPerfStats();
        _gaq.push(["_trackEvent", "Navigation Timing", "DNS", undefined, ntStats.dns, true]);
        _gaq.push(["_trackEvent", "Navigation Timing", "Connect", undefined, ntStats.connect, true]);
        _gaq.push(["_trackEvent", "Navigation Timing", "TTFB", undefined, ntStats.ttfb, true]);
        _gaq.push(["_trackEvent", "Navigation Timing", "BasePage", undefined, ntStats.basePage, true]);
        _gaq.push(["_trackEvent", "Navigation Timing", "FrontEnd", undefined, ntStats.frontEnd, true]);
    }
};




