A.1 Onto your laptop and Hadoopless: Linux or OS X


This Hadoopless option works in any of the following situations:

- Your laptop or desktop has OS X as its base OS.
- You’ve created a custom VM with Linux as the VM’s OS (see the next section).
- You’re using a VM in the cloud—for example, Amazon AWS, Azure, or even a web-hosting company VM.
You can download pre-built versions of Spark from the Apache website, built for various versions of Hadoop (Hadoop 1.x, Hadoop 2.x, MapR, and so forth). Because in this option we won’t use Hadoop at all, it doesn’t matter which one you pick. As long as you don’t try to read or write HDFS files, you’ll be fine.
Download the Spark tgz file, uncompress it, and you’re ready to go. To use only the Spark Shell, you don’t even need Scala installed, just Java. To build Spark programs in Scala, though, you’ll need to install Scala.
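
If you want a quick sanity check of a Hadoop-less setup, you can paste a couple of lines into the Spark Shell (bin/spark-shell). The following is a minimal sketch, not taken from the book’s listings; it uses only the SparkContext (sc) that the shell creates for you and never touches HDFS:

// Inside bin/spark-shell; `sc` is provided automatically by the shell
val rdd = sc.parallelize(1 to 1000)      // distribute a local Scala collection
println(rdd.filter(_ % 2 == 0).count()) // should print 500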

A.1.1 On a custom local virtual machine
Combining the preceding two ideas—being both Hadoopless and on a virtual machine—is another convenient option, with the following advantages:
- Without HDFS and other Hadoop services running, you gain an extra gigabyte or two compared to using the pre-built Cloudera QuickStart VM.
- Compared to installing Spark directly on your computer’s (or cloud VM’s) base OS, a VM has the benefits described at the beginning of this appendix.
Creating a VM from scratch is non-trivial. It involves a lot of steps—selecting the right options, tweaking a lot of settings—that are outside the scope of this book but that you can Google. Alternatively, you can find a pre-built VM for the virtual machine host software of your choice and the Linux flavor of your preference, and download that.

A.2 In the cloud: Amazon Web Services
Amazon Web Services provides dozens of different cloud services, the most well-known of which are S3 for storage and EC2 for elastic compute. For the purposes of Hadoop and Spark, Amazon offers Elastic MapReduce (EMR). EMR allows you to manage S3 and EC2 resources to bring up an entire Hadoop cluster (with or without Spark).
The advantage of AWS EMR over the options described previously is that you can actually run on a cluster, realize the benefit of parallelization, handle large datasets, and become familiar with developing Spark applications for YARN and submitting them to a YARN-powered cluster.
The obvious downside is that AWS isn’t free. The other downside is that, as of the time of writing, there is no way to pause the automatically provisioned AWS Spark cluster to stop billing; it has to be completely destroyed every time you would otherwise want to walk away for a while. That means you have to be conscientious about saving your work to S3. There’s no way, for example, to leave data sitting in the REPL and come back to it later.
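
For example, before tearing down a cluster you can write anything you want to keep out to S3 directly from the REPL. The following is a minimal sketch; the bucket and path are hypothetical, and it assumes the EMR-provided S3 filesystem is available under the s3:// scheme:

// Persist results to S3 before the cluster is destroyed
// (bucket and path below are placeholders)
val results = sc.parallelize(Seq("some", "results", "worth", "keeping"))
results.saveAsTextFile("s3://my-bucket/spark-output/run-001")

// Later, on a fresh cluster, read them back
val restored = sc.textFile("s3://my-bucket/spark-output/run-001")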

appendix B
Gephi visualization software
Chapter 4 contains code to generate .gexf files, the native file format of Gephi.
Downloading and installing Gephi from http://gephi.github.io is straightforward
(it’s available for OS X, Windows, and Linux), but its user interface can be intimidating at first. This appendix points you to the most important UI elements—
enough to get you started—and you can then explore the remaining rich set of features on your own.
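
For reference, a .gexf file is just XML that lists nodes and edges. The following is a minimal sketch of serializing a GraphX graph to GEXF, not the actual chapter 4 listing; it assumes the commonly used GEXF 1.2 draft namespace, does no XML escaping of labels, and collects the graph to the driver, so it’s suitable only for the small graphs you’d visualize in Gephi anyway:

import org.apache.spark.graphx.Graph

// Build a GEXF document: one <node> per vertex, one <edge> per edge.
// collect() pulls everything to the driver, so keep the graph small.
def toGexf[VD, ED](g: Graph[VD, ED]): String = {
  val nodes = g.vertices.collect().map { case (id, attr) =>
    s"""      <node id="$id" label="$attr" />"""
  }.mkString("\n")
  val edges = g.edges.collect().zipWithIndex.map { case (e, i) =>
    s"""      <edge id="$i" source="${e.srcId}" target="${e.dstId}" label="${e.attr}" />"""
  }.mkString("\n")
  s"""<?xml version="1.0" encoding="UTF-8" ?>
     |<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
     |  <graph defaultedgetype="directed">
     |    <nodes>
     |$nodes
     |    </nodes>
     |    <edges>
     |$edges
     |    </edges>
     |  </graph>
     |</gexf>""".stripMargin
}

// Example use: write the file locally, then open it in Gephi
// new java.io.PrintWriter("myGraph.gexf") { write(toGexf(myGraph)); close() }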

B.1 Laying out your environment
Gephi has dockable windows, much like an IDE. Figure B.1 shows how we used Gephi when generating some of the diagrams in this book. The three dockable windows to choose from the Window drop-down menu are shown in figure B.2. Once they’re displayed, drag and drop them into the arrangement shown in figure B.1.

Figure B.1 Gephi’s dockable window layout

You’ll notice the Graph tab, which, given its position and name, seems like an important tab, but it’s for doing processing on the graph. Because we’re typically doing graph processing in GraphX and are interested in Gephi for its visualization capability, you should ignore the Graph tab at first.

B.2 Basic recipe
Here’s the basic loop of steps you’ll typically do to visualize:

1. Adjust something in the Layout or Preview Settings window.
2. Click the Refresh button in the Preview Settings window.
3. Pan the Preview window via right-click-drag and adjust zoom via the buttons at the bottom of the Preview window.

Figure B.2 The three windows to choose from the Window drop-down menu

B.3 Key settings
Gephi has a lot of options. This section covers some of the more useful ones.

B.3.1 Layout window
Here you can choose a layout algorithm and its parameters. Some layout algorithms
are incremental (they tweak what has already been laid out) and some start from
scratch. You’ll want to choose a “start from scratch” algorithm first and then tweak it
with an “incremental” algorithm only if necessary. The available algorithms are shown
in figure B.3. Usually, Force Atlas is a good starting point because it reliably produces
reasonable results.

Figure B.3 Available layout algorithms from the drop-down list inside the Layout window. The ones we haven’t labeled as Incremental are all first-class layout algorithms that perform a complete layout from scratch. The incremental ones nudge around an already-laid-out graph.

For small graphs, you may need to first adjust “Repulsion strength” (or “Optimal distance” in other algorithms) to a much larger number, as highlighted in figure B.4. Gephi is designed to handle very large graphs with hundreds of thousands of vertices, and its default settings produce very short edges. For graphs with a dozen or a few dozen vertices, you’ll want to make the edges longer by increasing “Repulsion strength” or “Optimal distance.”

Figure B.4 Key adjustment so that small graphs don’t end up as a tiny, scrunched-up bunch

After making any setting adjustment in the Layout window, click the Run button (seen
in figure B.4) and then click the Refresh button in the Preview Settings window.

B.3.2 Preview Settings window
Important settings in the Preview Settings window are highlighted in figure B.5.

NOTE Gephi uses the term nodes to mean vertices. In this book, we’ve used nodes to mean computers participating in a cluster for cluster computing.
Figure B.5 Key Preview Settings window settings. The callouts in the figure point out: an option to tick if your vertices have properties; an option to untick to set font size in absolute terms (if your graph is small); an option to untick to set edge thickness in absolute terms (if your graph is small); an edge color setting to set to Custom to nail down the edge color (if your graph is small); a spacing value that may need to be vastly increased for small graphs; an option to tick if your edges have properties; and the Refresh button to click after every change in either the Layout window or the Preview Settings window.

appendix C
Resources: where to go for more
C.1 Spark
The number of books on Spark finally started growing in 2015—six years after
Spark development first began. But Spark development is still moving fast, and the
best resources are online.

Apache mailing lists
As with any open source project, especially one from Apache, the mailing lists are
the best sources of information, and subscribing to them—and asking questions
when you can’t find answers on the web—should be considered the minimum you
have to do. The mailing lists are known as user@spark.apache.org and
dev@spark.apache.org. You can subscribe to them from https://spark.apache.org/community.html.

Databricks forums
Databricks is the commercialization of Spark that offers a commercial product of a
Spark notebook in the cloud. But the forums on www.databricks.com aren’t limited
to only the commercial product. As a large percentage of the commits to Apache
Spark come from Databricks, the Databricks forums also contain a lot of generalpurpose information about Spark, including future plans that pertain to the open
source Apache Spark as well as the commercial Databricks product.

Conference and meetup videos
There are four major sources of Spark videos. None should be overlooked; they are
all outstanding. Spark is moving fast, and watching these videos on your smartphone while on the treadmill or as a bedtime story is sometimes the only way to
keep up:
1. Spark Summit (West, East, and Europe)
2. AMPLab AMPCamp
3. Bay Area Spark Meetup
4. O’Reilly Strata Conference (West and East)

Jira
If staying current with Spark is important to you, there’s no substitute for following the Spark Jira. Create an Apache Jira account if you don’t already have one, list the issues each day in reverse chronological order, and click Watch on the issues that are important to you. That way you’ll know what new features, bug fixes, performance improvements, architectural changes, and support for third-party systems (file systems, cluster managers, database connectors, compression formats, serialization schemes, and so on) are on the way—and, more importantly, which versions they’re targeted for.
There are some long-standing gems of planned features buried within Jira from the early days that are still being worked on or planned for, so, as painful and time-consuming as it may sound, the first time you list Spark Jira tickets, it’s probably worth your while to go through all of those that are still open.

Twitter
If you think Twitter is just about celebrities and that nothing useful could possibly be
expressed in 140 characters, you’re in for a surprise.
There’s a lot on Twitter in terms of Big Data, data science, and machine learning.
You can regard Twitter as a link aggregator to hot or important blog posts, news stories, or Git repositories.

spark-packages.org
Because the developers of Apache Spark are reluctant to overload the official distribution with too many features and sub-packages, they set up the website spark-packages.org. Available add-on packages are broken up into categories such as machine learning, graphs, Python, and so on.

AMPLab
Spark came out of AMPLab, and AMPLab continues to develop new modules that work
with Spark, as well as some other brand-new technologies unrelated to Spark. Modules
that come out of AMPLab have a tendency to either be incorporated directly into the
Apache Spark distribution (such as GraphX, Catalyst, which became Spark SQL, and
SparkR) or at least semi-officially supported, such as Tachyon.

Google Scholar Alerts
You’re likely familiar with Google Alerts, which sends you an email whenever a page is
updated. But there’s something completely different called Google Scholar Alerts,
part of scholar.google.com, which sends an email whenever a new paper is published
that cites a paper you’re tracking.
If you set Google Scholar Alerts on some of the seminal Spark papers, such as Matei
Zaharia’s “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory
Cluster Computing” or Gonzalez et al.’s “GraphX: Graph Processing in a Distributed Dataflow Framework,” you can keep track of the latest advances in academia before
they become commercialized.

Author blogs
If you do all that we’ve suggested so far, you won’t need to read these blogs. But if you
want to save time and read only a distilled version of what’s coming in the future for
Spark, Big Data, data science, and machine learning—at least through Michael
Malak’s personal crystal ball—then his blogs are good resources:
- http://technicaltidbit.com
- http://datascienceassn.org/blogs/michaelmalak

C.2 Scala
The best Scala resources are books. Some Scala books are quite long. But because
Scala has so many tricks, an alternative is to get the ones that are encyclopedias of
tricks:
- Scala Cookbook by Alvin Alexander (O’Reilly, 2013)
- Scala Puzzlers by Andrew Phillips (Artima, 2014)

C.3 Graphs
There are tons of books on graph theory, many of them highly theoretical, either for
use as college textbooks or for use by researchers. Practitioners, however, may find the
following useful:
- Graph-Based Natural Language Processing and Information Retrieval by Rada Mihalcea and Dragomir Radev (Cambridge University Press, 2011)
- Graph Databases by Ian Robinson et al. (O’Reilly, 2015)

appendix D
List of Scala tips in this book
This book is not intended to teach you Scala; rather, it provides Scala tips along the way, under the assumption that Scala may not be your first or most familiar language and that you may not have seen all the requisite Scala tricks before. For books on learning Scala, see appendix C.
Below is a list of the Scala tips sprinkled throughout the book:
CHAPTER 2
Underscores mean different things in different places… 27
CHAPTER 4
The object keyword… 65
The apply() method… 65
Type inference of return values… 68
Type parameter for generics cannot be inferred… 70
The Option[] class… 72
Multiple imports on same line… 76
Backticks to escape reserved words… 78
Pasting blocks of code into the REPL… 79
Multiple parameter lists… 87
Named parameters… 88
CHAPTER 5
Optional parentheses on function invocation… 98
Regex and “raw” strings… 104
For comprehensions… 106
CHAPTER 6
List operator +: for appending… 114
The type keyword… 116
ClassTag… 119
CHAPTER 7
Multiple return values… 131
Dot product idiom using zip(), map(), and reduce()… 132
HashMap initialization using ->… 161
CHAPTER 8
Formatted output using "${myVar}"… 172


index
Symbols

_ + _ idiom 41
:+ operator 114
() function 70
@tailrec annotation 37
# character 26
+: operator 114
++ operator 140
  edges 180
=> character 36
-> operator 161
->() syntax 233
${myVar} 172

Numerics

2-D plane 152
2-item tuples 39
3-D cube 152
3-item tuples 39

A

AbstractFunction1, AbstractFunction2 219
accuracy 128
actions 44
acyclic graphs 53, 200
addition 40
adjacency matrix 12, 56
aggregateMessages() function 67, 69, 72, 83, 111, 161, 184, 209, 229, 233
AggregateMessagesBuilder class 229
AGI (Artificial General Intelligence) 126
AI (artificial intelligence) 125–126
algorithms 90–124
  Connected Components
    overview 100–106
    predicting social circles using 101–106
  convergent 107
  Dijkstra 112
  EM (Expectation-Maximization) 141
  greedy 115–117
  incremental 243
  iterative 7–8, 44
  K-Means 152
  Kruskal’s 111, 117–118
  LabelPropagation 107
  Minimum Spanning Tree
    deriving taxonomies with Word2Vec and 121–124
    general discussion 117–124
  missing from GraphX 167–186
    basic graph operations 168–171
    global clustering coefficient 184–186
    graph isomorphism 179–183
    reading RDF graph files 172–177
  Online Variational Bayes 141
  PageRank 91–95
    invoking in GraphX 92–94
    overview 91
    Personalized 94–95
  R-MAT 81–83
  shortest paths with weights 111–114
  ShortestPaths 99
  strongly connected components 106–107
  SVD++ 128–135
    biases 134
    item-to-item similarity 135
    latent variables 134–135
  Traveling Salesman 115, 117–118