Appendix A. Amazon Web Services Resources and Tools


Amazon Simple Storage Service
This is the service home page for Amazon Simple Storage Service (S3). The site
provides a detailed description of Amazon S3 and pricing information. Amazon S3
is used to store input and output data for Amazon EMR data analysis. Many of the
scripts and applications used for data analysis are stored in S3, and their S3 location
is specified in configuring Amazon EMR Job Flows.
Amazon Glacier
This is the service home page for Amazon Glacier. Amazon Glacier is a low-cost,
long-term storage solution for data in the book that may be needed in the future,
but is not currently being processed by EMR or reviewed by system users. Amazon
Glacier can be used for cost savings compared to online S3 storage.
AWS Data Pipeline
This is the service home page for AWS Data Pipeline. Data Pipeline is used to
automate EMR processing and reduce the administrative burden of maintaining
an EMR application in AWS.

Amazon AWS Cost Estimation Tools
When one transitions from internal systems to cloud-based solutions like AWS, the
discussion almost always comes down to considerations around cost. In Chapter 6, we
covered numerous real-world scenarios and estimation techniques to review project
costs. In running through the scenarios, we used the following online cost estimation
tools to review and compare costs in the scenarios.
Amazon Web Services Simple Monthly Calculator
This online calculator allows you to input the resources you expect to use in AWS
and determine the monthly cost of those services. The tool also allows you to “Save
and Share” your calculations, and produces a URL that can be given to others on
the project team or stakeholders for review.
Amazon Web Services Economics Center
The Economics Center helps you compare the costs of running an application in a
traditional data center and running the same application in AWS. This tool can be
useful in determining cost savings and comparing available resources inside an

AWS Best Practices and Architecture
Amazon provides a number of critical documents that help organizations start building
their applications using best practices. Also, for organizations that use third-party
components like Microsoft Windows, Oracle, Red Hat Linux, and others, Amazon provides
a number of already configured EC2 instances and information on how to build your
own Amazon Machine Images (AMI). The following links at AWS are useful for projects
that need this information:
Amazon Architecture Center
This AWS site helps developers review software reference architectures that were
designed to make best use of AWS services. The site can be useful in building a new
application or transitioning an existing application over to AWS. The information
will help the development team build applications in AWS that minimize downtime
and optimize scalability and performance.
Amazon Security Center
Security is one of the top reasons many organizations have been hesitant to move
their critical systems to cloud service providers like AWS. Amazon provides a great
deal of information on the security of AWS and its AWS data centers on this site.
Information on how AWS meets the compliance regulations for a number of industry
compliance regimes like PCI, HIPAA, and others is also published on this site.
Amazon EC2 Instances
This site demystifies the Amazon EC2 instance sizes of small, medium, large, extra
large, and so on, and maps these sizes to their physical equivalents of CPU, memory,
and disk space allocations.
Create Your Own AMI
Amazon AWS has many of the common software configurations that many organizations
use for applications. However, you may want to build an Amazon Machine
Image of special or in-house software so you can instantly start a preconfigured
image with your software. This guide provides details on how to build a custom
image to run inside EC2 or EMR.

Amazon EMR Distributions
As a developer in Amazon EMR, you must understand what features and APIs are
available. Fortunately, Amazon has extensive documentation of all of its AWS services
including developer documentation of EMR.
Amazon regularly updates the version of Hadoop and applies patches to integrate Hadoop
with AWS infrastructure and services. Table A-1 lists the versions of Hadoop that
are supported in Amazon EMR as of the writing of this book.
Table A-1. Amazon-supported Hadoop versions

Hadoop version    Configuration parameters
1.0.3             --hadoop-version 1.0.3 --ami-version 2.3
0.20.205          --hadoop-version 0.20.205 --ami-version 2.0
0.20              --hadoop-version 0.20 --ami-version 1.0
0.18              --hadoop-version 0.18 --ami-version 1.0

To find out the latest supported versions of Hadoop for EMR, visit the Supported
Hadoop Versions section of the EMR Developer Guide.
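
The parameter pairs in Table A-1 can be kept in a small lookup so that a launch script always passes a matching pair. The following is a minimal Python sketch under stated assumptions: the function name `emr_version_args` is hypothetical, the helper only builds an argument list and does not call any AWS tool, and the values are copied from the table as of the writing of this book, so they may no longer be current.

```python
# Version pairs copied from Table A-1; they reflect the book's writing date.
HADOOP_TO_AMI = {
    "1.0.3": "2.3",
    "0.20.205": "2.0",
    "0.20": "1.0",
    "0.18": "1.0",
}


def emr_version_args(hadoop_version):
    """Return the Table A-1 parameter pair for a Hadoop version as a list.

    Hypothetical helper: it only assembles the arguments shown in the table.
    """
    try:
        ami_version = HADOOP_TO_AMI[hadoop_version]
    except KeyError:
        raise ValueError(
            f"Hadoop version {hadoop_version!r} is not listed in Table A-1"
        )
    return ["--hadoop-version", hadoop_version, "--ami-version", ami_version]


if __name__ == "__main__":
    # Print the parameter string for one of the table rows.
    print(" ".join(emr_version_args("1.0.3")))
```

Rejecting versions that are not in the table is a deliberate choice here: it avoids silently pairing a Hadoop version with an AMI version the table does not list.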




Appendix B. Cloud Computing, Amazon Web Services, and Their Impacts

Though cloud computing was originally conceived in the 1960s by pioneering thinkers
like J.C.R. Licklider, who thought computing resources would become a public utility
like electricity, it is only recently, with the start of AWS in 2006 and Windows Azure
in 2008, that we have seen businesses seriously moving many of their core services
outside of private data centers. There have been many discussions and descriptions
about what cloud computing is and its value to businesses. However, in general we
characterize it as a set of computing resources (CPU, memory, disk, and the like) that
is available to an end user, together with the interactions that user has with these
resources.

AWS Service Delivery Models
There are a number of delivery models for cloud services and how the end user accesses
these resources in the cloud. We will focus on the delivery methods specific to AWS and
the resources used in this book for Elastic MapReduce.

Platform as a Service
Platform as a Service (PaaS) allows the deployment of custom-built applications within
the cloud provider’s infrastructure. Elastic MapReduce is an example of an Amazon
cloud service that is delivered as a PaaS. As a user, you can deploy a number of
preconfigured Amazon EC2 instances with the EMR software preinstalled. You can specify
the compute capacity and memory for these instances, and have access to make
configuration changes to the EMR software. Amazon takes care of much of the customization
needed for the EMR software to work in its data center and with other Amazon services.
As a user of EMR, you can tune the configuration to your application’s needs and install
much of your application through Amazon’s APIs and tools.


Infrastructure as a Service
Infrastructure as a Service (IaaS) is probably the simplest cloud delivery method, and
one that seems most familiar to many professionals that have developed solutions to
run in private data centers. As a consumer of IaaS services, you have access to computing
resources of CPU, memory, disk, network, and other resources. Amazon’s EC2 is an
example of a cloud service delivered in the IaaS model. You can specify the size of an
EC2 instance and the operating system used, but it is up to you as a consumer of an EC2
instance to install OS patches, configure OS settings, and install third-party applications
and software components.

Storage as a Service
Storage as a Service (abbreviated SaaS here, not to be confused with Software as a
Service) allows you to store files or data in the provider’s data center.
Amazon S3 and Amazon Glacier are the storage services we use throughout this book.
Amazon charges on a per-gigabyte basis for these services and has replication and
durability options.
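
Because these services are billed per gigabyte, a first-pass storage estimate is a single multiplication. The sketch below is illustrative Python only: the two rates are made-up placeholders rather than Amazon's prices, and the function name is hypothetical. Actual rates should be taken from the service home pages or the Simple Monthly Calculator listed in Appendix A.

```python
def monthly_storage_cost(gigabytes, rate_per_gb_month):
    """Estimate a monthly storage charge from a flat per-gigabyte rate.

    Simplification: ignores request, retrieval, and transfer charges
    and any tiered pricing.
    """
    if gigabytes < 0 or rate_per_gb_month < 0:
        raise ValueError("size and rate must be non-negative")
    return gigabytes * rate_per_gb_month


# Placeholder rates for illustration only; NOT actual AWS prices.
EXAMPLE_ONLINE_RATE = 0.10   # hypothetical dollars per GB-month, online storage
EXAMPLE_ARCHIVE_RATE = 0.01  # hypothetical dollars per GB-month, archival storage

if __name__ == "__main__":
    size_gb = 500
    online = monthly_storage_cost(size_gb, EXAMPLE_ONLINE_RATE)
    archive = monthly_storage_cost(size_gb, EXAMPLE_ARCHIVE_RATE)
    print(f"online: ${online:.2f}, archive: ${archive:.2f}")
```

Running the same data size through both rates gives a rough sense of the savings from moving rarely used data to archival storage, before the charges that this sketch leaves out.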
We have discussed some of the benefits of AWS throughout the book, but we would be
remiss if we did not cover many of the key issues businesses must consider when moving
critical business data and infrastructure into the cloud.

Performance in cloud computing can vary widely between cloud providers. This variability
can be due to the time of day, applications running, and how many customers
have signed up for service from the cloud provider. It is a result of how the physical
hardware of memory and CPU in a cloud provider is shared among all the customers.
Most cloud providers operate in a multitenancy model where a single physical server
may run many instances of virtual computers. Each virtual instance uses some amount
of memory and CPU from the physical server on which it resides. The sharing and
allocation of the physical resources of a server to each virtual instance is the job of a
piece of software installed by the cloud provider called the hypervisor. Amazon uses a
highly customized version of the Xen hypervisor for AWS. As a user of EC2 and other
AWS services, you may have your EC2 instance running on the same physical hardware
as many other Amazon EC2 customers.
Let’s look at a number of scenarios at a cloud provider to understand why variability in
performance can occur. Let’s assume we have three physical servers, each with four
virtual instances running. Figure B-1 shows a number of virtual instances running in a
cloud provider.
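
To make this setup concrete, the following Python sketch models three physical servers with four virtual instances each, and shows how one instance's share of CPU can shrink when its neighbors get busy. The proportional-sharing rule, the capacity units, the demand numbers, and all names are assumptions chosen for illustration; this is not a description of how the Xen hypervisor or AWS actually schedules resources.

```python
def cpu_shares(capacity, demands):
    """Split a host's CPU capacity among its virtual instances.

    Assumed rule: if total demand fits within capacity, every instance
    gets what it asked for; otherwise capacity is divided in proportion
    to demand.
    """
    total = sum(demands)
    if total <= capacity:
        return list(demands)
    return [capacity * d / total for d in demands]


# Three hypothetical physical servers, each with four virtual instances.
# Values are CPU demand in arbitrary units; each host has capacity 8.
HOST_CAPACITY = 8
hosts = {
    "server-1": [1, 1, 1, 1],  # lightly loaded host
    "server-2": [2, 2, 2, 2],  # demand exactly matches capacity
    "server-3": [2, 6, 4, 4],  # busy neighbors oversubscribe the host
}

if __name__ == "__main__":
    for name, demands in hosts.items():
        shares = cpu_shares(HOST_CAPACITY, demands)
        # The first instance on each host stands in for "our" instance.
        print(f"{name}: wanted {demands[0]}, received {shares[0]:.2f}")
```

In this toy model the first instance on server-3 asks for the same amount as the one on server-2 but receives half as much, purely because of what its neighbors are doing.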



Appendix B: Cloud Computing, Amazon Web Services, and Their Impacts