
Figure D.2 Class diagram showing the main classes in the framework and a sample implementation. The framework exposes abstract void map(Object key, Object value) and void reduce(Object key, Iterator values) methods, which the sample join implementation classes SampleMap and SampleReduce implement.



The map method first asks the implementing class to create an OutputValue object, which contains the value that the implementing class wants to use in the join (and presumably include in the final output), and a boolean indicating whether the value originates from the smaller dataset. The map method then asks the implementing class to create the key that will be used for the join, which becomes the output key for the map:

// Abstract method that returns the map output value.
protected abstract OptimizedTaggedMapOutput
    generateTaggedMapOutput(Object value);

// Abstract method that creates the map output key,
// which is used to group data together for the join.
protected abstract String generateGroupKey(Object key,
    OptimizedTaggedMapOutput aRecord);

public void map(Object key, Object value,
                OutputCollector output, Reporter reporter)
    throws IOException {

  // Create the map output value.
  OptimizedTaggedMapOutput aRecord =
      generateTaggedMapOutput(value);

  // If a NULL is returned, discard this record.
  if (aRecord == null) {
    return;
  }

  // Update the output value to indicate whether it
  // originated from the smaller dataset.
  aRecord.setSmaller(smaller);

  // Retrieve the group key, which is the map output key.
  String groupKey = generateGroupKey(key, aRecord);

  // If a NULL is returned, discard this record.
  if (groupKey == null) {
    return;
  }

  // Emit the key/value.
  outputKey.setKey(groupKey);
  output.collect(outputKey, aRecord);
}
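To make the contract concrete, here's a hedged sketch of how an implementing class might fill in the two abstract methods, assuming tab-separated text input with the join key in the first field (TextTaggedMapOutput and its getData/setData accessors are illustrative assumptions, not the book's actual sample code):

@Override
protected OptimizedTaggedMapOutput generateTaggedMapOutput(Object value) {
  // Wrap the raw input line in a tagged output value.
  // TextTaggedMapOutput is an assumed subclass of
  // OptimizedTaggedMapOutput that holds a Text payload.
  TextTaggedMapOutput record = new TextTaggedMapOutput();
  record.setData(new Text(value.toString()));
  return record;
}

@Override
protected String generateGroupKey(Object key,
                                  OptimizedTaggedMapOutput aRecord) {
  // Use the first tab-separated token as the join key.
  // Returning null tells the framework to discard the record.
  String line = ((TextTaggedMapOutput) aRecord).getData().toString();
  String[] tokens = line.split("\t");
  return tokens.length > 0 ? tokens[0] : null;
}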






Figure D.3 Map output key and value. The CompositeKey class contains the join key and a small-file indicator; key sorting uses both fields, while partitioning uses the join key alone. The OutputValue class contains the small-file indicator and the custom value used for joining.



Figure D.3 shows the composite key and value emitted by the map. The secondary sort partitions on the join key, but orders all the keys for a single join key using the whole composite key. Because the composite key contains an integer indicating whether the data originated from the small file, it can be used to ensure that values from the small file are passed to the reducer before values from the large file.
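To make the secondary sort concrete, the partitioner needs to consider only the join-key portion of the composite key. Here's a minimal sketch under the old mapred API (the getKey accessor on CompositeKey is inferred from its use elsewhere in this appendix):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sketch: partition on the join key alone so that all records for a
// join key reach the same reducer, regardless of which dataset they
// came from. Sorting still uses the full composite key, so records
// tagged as coming from the small file sort first within each key.
public class CompositeKeyPartitioner
    implements Partitioner<CompositeKey, OptimizedTaggedMapOutput> {

  @Override
  public void configure(JobConf job) {
    // No configuration needed.
  }

  @Override
  public int getPartition(CompositeKey key,
                          OptimizedTaggedMapOutput value,
                          int numPartitions) {
    // Hash only the join key; ignore the small-file indicator.
    return (key.getKey().hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}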

The next step is to look at the reducer. Because you have a guarantee that the

small file values will arrive at the reducer ahead of the large file values, you can cache

all of the values from the small dataset, and when you start seeing values from the

large dataset, join each one with the cached values:

public void reduce(Object key, Iterator values,
                   OutputCollector output, Reporter reporter)
    throws IOException {
  CompositeKey k = (CompositeKey) key;

  // Create a structure to store the cached values
  // from the small dataset.
  List<OptimizedTaggedMapOutput> smaller =
      new ArrayList<OptimizedTaggedMapOutput>();

  while (values.hasNext()) {
    Object value = values.next();

    // Clone the value because it's reused by the MapReduce
    // code to store subsequent reducer values.
    OptimizedTaggedMapOutput cloned =
        ((OptimizedTaggedMapOutput) value).clone(job);

    if (cloned.isSmaller().get()) {
      // Cache it if it's from the small dataset.
      smaller.add(cloned);
    } else {
      // Perform the join if it's from the large dataset.
      joinAndCollect(k, smaller, cloned, output, reporter);
    }
  }
}






The joinAndCollect method combines values from the two datasets and emits the result:

// An abstract method that must be implemented to perform the
// combination of dataset values and return the value to be
// emitted by the reducer.
protected abstract OptimizedTaggedMapOutput combine(
    String key,
    OptimizedTaggedMapOutput value1,
    OptimizedTaggedMapOutput value2);

private void joinAndCollect(CompositeKey key,
                            List<OptimizedTaggedMapOutput> smaller,
                            OptimizedTaggedMapOutput value,
                            OutputCollector output,
                            Reporter reporter)
    throws IOException {
  if (smaller.size() < 1) {
    // Even if no data was collected from the small dataset,
    // calling the combine method allows implementations to
    // perform an outer join.
    OptimizedTaggedMapOutput combined =
        combine(key.getKey(), null, value);
    // Call the collect method, which emits the combined
    // record if it isn't NULL.
    collect(key, combined, output, reporter);
  } else {
    // For each small dataset value, combine it with the
    // large dataset value.
    for (OptimizedTaggedMapOutput small : smaller) {
      OptimizedTaggedMapOutput combined =
          combine(key.getKey(), small, value);
      collect(key, combined, output, reporter);
    }
  }
}



You've now seen the guts of the framework. Chapter 4 shows how you can put it to use.



D.2 A replicated join framework

A replicated join is a map-side join, and gets its name from the fact that the smallest of

the datasets is replicated to all the map hosts. The implementation of the replicated

join is straightforward and is demonstrated in Chuck Lam’s Hadoop in Action.

The goal in this section is to create a generic replicated join framework that can

work with any datasets. I’ll also provide an optimization that will dynamically determine if the distributed cache contents are larger than the input split, in which case

you’d cache the map input and execute the join in the mapper cleanup method.

The class diagram for this framework is shown in figure D.4. Rather than provide an abstract Join class, I'll provide a concrete implementation of the join (the GenericReplicatedJoin class), which works out of the box with KeyValueTextInputFormat and TextOutputFormat and assumes that the first token in each file is the join key. The join class can be extended to support any input and output formats.
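For instance, a subclass could override readFromInputFormat to pull the join key from somewhere other than the first token (a hedged sketch; the Pair constructor mirrors the default join implementation shown later in this section):

// Hypothetical sketch: join on the second field of a comma-separated
// line rather than on the first token of the line.
public class CsvReplicatedJoin extends GenericReplicatedJoin {
  @Override
  public Pair readFromInputFormat(Object key, Object value) {
    String[] fields = value.toString().split(",");
    // Second CSV field is the join key; the whole line is the value.
    return new Pair(new Text(fields[1]), new Text(value.toString()));
  }
}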






Figure D.4 Class diagram for the replicated join framework. GenericReplicatedJoin extends Mapper, overriding setup(Context), map(Object key, Object value, ...), and cleanup(Context). Its extensible methods, which can support any InputFormat and OutputFormat, are Pair readFromInputFormat(Object key, Object value), DistributedCacheFileReader getDistributedCacheReader(), and Pair join(Pair inputSplitPair, Pair distCachePair). TextDistributedCacheFileReader implements DistributedCacheFileReader, reading records as Pair objects.



Figure D.5 shows the algorithm for the join framework. The mapper setup method

determines if the map’s input split is larger than the distributed cache, in which case it

loads the distributed cache into memory. The map function either performs a join or

caches the key/value pairs, based on whether the setup method loaded the cache. If

the input split is smaller than the distributed cache, the map cleanup method will

read the records in the distributed cache and join them with the cache created in the

map function.



Figure D.5 Algorithm for the optimized replicated join






The setup method in the GenericReplicatedJoin class is called at map initialization time. It determines whether the combined size of the files in the distributed cache is smaller than the input split, and if it is, loads them into a HashMap:

@Override
protected void setup(Context context)
    throws IOException, InterruptedException {
  distributedCacheFiles = DistributedCache.getLocalCacheFiles(
      context.getConfiguration());

  // Tally up the sizes of all the files in the distributed cache.
  int distCacheSizes = 0;
  for (Path distFile : distributedCacheFiles) {
    File distributedCacheFile = new File(distFile.toString());
    distCacheSizes += distributedCacheFile.length();
  }

  if (context.getInputSplit() instanceof FileSplit) {
    // If the input split is from a file, determine whether the
    // distributed cache files are smaller than the length of
    // the input split.
    FileSplit split = (FileSplit) context.getInputSplit();
    long inputSplitSize = split.getLength();
    distributedCacheIsSmaller = (distCacheSizes < inputSplitSize);
  } else {
    // If the input split is not from a file, assume that the
    // distributed cache is smaller, because you have no way of
    // knowing the length of the input split.
    distributedCacheIsSmaller = true;
  }

  if (distributedCacheIsSmaller) {
    for (Path distFile : distributedCacheFiles) {
      File distributedCacheFile = new File(distFile.toString());

      // Call a method to create a DistributedCacheFileReader
      // to read records from the distributed cache file.
      DistributedCacheFileReader reader =
          getDistributedCacheReader();
      reader.init(distributedCacheFile);

      // Add each record to your local HashMap.
      for (Pair p : (Iterable<Pair>) reader) {
        addToCache(p);
      }
      reader.close();
    }
  }
}
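For a sense of what getDistributedCacheReader could return, here's a minimal sketch of a text-based reader (the init, close, and iteration contract is inferred from the setup code above; the book's actual TextDistributedCacheFileReader may differ):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;

// Sketch: each line of the cached file is split on the first tab
// into a (key, value) Pair.
public class TextDistributedCacheFileReader
    implements DistributedCacheFileReader, Iterable<Pair>, Iterator<Pair> {

  private BufferedReader reader;
  private String line;

  @Override
  public void init(File file) throws IOException {
    reader = new BufferedReader(new FileReader(file));
  }

  @Override
  public void close() throws IOException {
    reader.close();
  }

  @Override
  public Iterator<Pair> iterator() {
    return this;
  }

  @Override
  public boolean hasNext() {
    try {
      line = reader.readLine();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    return line != null;
  }

  @Override
  public Pair next() {
    String[] parts = line.split("\t", 2);
    return new Pair(new Text(parts[0]),
                    new Text(parts.length > 1 ? parts[1] : ""));
  }

  @Override
  public void remove() {
    throw new UnsupportedOperationException();
  }
}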



The map method chooses its behavior based on whether the setup method loaded the distributed cache into memory. If it did, the map method joins each tuple supplied to it with the cache. Otherwise, it caches the map tuple for use later in the cleanup method:

@Override
protected void map(Object key, Object value, Context context)
    throws IOException, InterruptedException {
  Pair pair = readFromInputFormat(key, value);
  if (distributedCacheIsSmaller) {
    // Join the map tuple with the distributed cache.
    joinAndCollect(pair, context);
  } else {
    // Cache the map tuple.
    addToCache(pair);
  }
}



public void joinAndCollect(Pair p, Context context)
    throws IOException, InterruptedException {
  List<Pair> cached = cachedRecords.get(p.getKey());
  if (cached != null) {
    for (Pair cp : cached) {
      // Ensure the join method is called with records in a
      // predictable order: the record from the input split first,
      // followed by the record from the distributed cache.
      Pair result;
      if (distributedCacheIsSmaller) {
        result = join(p, cp);
      } else {
        result = join(cp, p);
      }
      // If the result of the join was a non-NULL object,
      // emit the object.
      if (result != null) {
        context.write(result.getKey(), result.getData());
      }
    }
  }
}

// The default implementation of the join, which can be overridden
// to support other InputFormat and OutputFormat classes,
// concatenates the string forms of the values together.
public Pair join(Pair inputSplitPair, Pair distCachePair) {
  StringBuilder sb = new StringBuilder();
  if (inputSplitPair.getData() != null) {
    sb.append(inputSplitPair.getData());
  }
  sb.append("\t");
  if (distCachePair.getData() != null) {
    sb.append(distCachePair.getData());
  }
  return new Pair(
      new Text(inputSplitPair.getKey().toString()),
      new Text(sb.toString()));
}
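Because joinAndCollect only emits non-NULL results, overriding join also lets you filter records. A hypothetical sketch that suppresses output when the distributed cache side carries no data:

@Override
public Pair join(Pair inputSplitPair, Pair distCachePair) {
  // Returning null suppresses the output record, since
  // joinAndCollect only emits non-NULL join results.
  if (distCachePair.getData() == null ||
      distCachePair.getData().toString().isEmpty()) {
    return null;
  }
  return super.join(inputSplitPair, distCachePair);
}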



After all of the records have been fed to the map method, the MapReduce framework calls the cleanup method. If the contents of the distributed cache were larger than the input split, this is where you join the map function's cache of input split tuples with the records contained in the distributed cache:

@Override
protected void cleanup(Context context)
    throws IOException, InterruptedException {
  if (!distributedCacheIsSmaller) {
    for (Path distFile : distributedCacheFiles) {
      File distributedCacheFile = new File(distFile.toString());
      DistributedCacheFileReader reader =
          getDistributedCacheReader();
      reader.init(distributedCacheFile);
      for (Pair p : (Iterable<Pair>) reader) {
        joinAndCollect(p, context);
      }
      reader.close();
    }
  }
}



Finally, the job driver code must specify the files that need to be loaded into the distributed cache. The following code works with a single file, as well as a directory containing the results of a MapReduce job:

Configuration conf = new Configuration();
FileSystem fs = smallFilePath.getFileSystem(conf);
FileStatus smallFilePathStatus = fs.getFileStatus(smallFilePath);
if (smallFilePathStatus.isDir()) {
  // Add all the part files produced by a previous
  // MapReduce job to the distributed cache.
  for (FileStatus f : fs.listStatus(smallFilePath)) {
    if (f.getPath().getName().startsWith("part")) {
      DistributedCache.addCacheFile(f.getPath().toUri(), conf);
    }
  }
} else {
  DistributedCache.addCacheFile(smallFilePath.toUri(), conf);
}
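Putting it together, a minimal driver for a map-only replicated join job might look like the following (a hedged sketch; everything beyond the distributed cache setup above, including the largeFilePath and outputPath variables, is an assumption about how you'd wire the job):

// Sketch: run GenericReplicatedJoin as a map-only job. The distributed
// cache files must be registered on conf before the Job is created.
Job job = new Job(conf);
job.setJarByClass(GenericReplicatedJoin.class);
job.setMapperClass(GenericReplicatedJoin.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setNumReduceTasks(0);  // map-side join, so no reduce phase
FileInputFormat.setInputPaths(job, largeFilePath);  // the big dataset
FileOutputFormat.setOutputPath(job, outputPath);
job.waitForCompletion(true);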



This framework assumes that either the distributed cache or the input split contents can fit in memory. Its advantage is that it caches whichever of the two is smaller.

In the paper “A Comparison of Join Algorithms for Log Processing in MapReduce,”1 you can see a further optimization of this approach for cases where the distributed cache contents are larger than the input split. The authors partition the distributed cache into N partitions and likewise cache the map tuples into N hashtables, which yields a more efficient join in the map cleanup method.
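To illustrate the idea (this is my sketch of the paper's scheme, not the book's code; N and the Pair-based cache layout are assumptions):

// Sketch: split the in-memory cache into N hashtables keyed by a hash
// of the join key. During cleanup, each distributed cache record only
// probes the one partition its key hashes to.
private static final int N = 16;  // assumed partition count

@SuppressWarnings("unchecked")
private final Map<Object, List<Pair>>[] cachePartitions = new HashMap[N];

private void addToPartitionedCache(Pair p) {
  int i = (p.getKey().hashCode() & Integer.MAX_VALUE) % N;
  if (cachePartitions[i] == null) {
    cachePartitions[i] = new HashMap<Object, List<Pair>>();
  }
  List<Pair> records = cachePartitions[i].get(p.getKey());
  if (records == null) {
    records = new ArrayList<Pair>();
    cachePartitions[i].put(p.getKey(), records);
  }
  records.add(p);
}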

A downside to the replicated join is that each map task must read the distributed

cache on startup. A potential optimization suggested by the paper referenced in the

previous paragraph is to override the FileInputFormat splitting such that input splits

that exist on the same host are combined into a single split, thereby cutting down on

the number of map tasks that need to load the distributed cache into memory.

On a final note, Hadoop comes with a built-in map-side join in the org.apache.hadoop.mapred.join package. But it requires that the input files of both datasets be sorted and distributed into identical partitions, which calls for a good amount of preprocessing before the join can be used. A minimal configuration sketch follows.
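The built-in join is driven by a join expression handed to CompositeInputFormat (the paths and the "inner" operation here are illustrative, and MyJoinJob is a hypothetical driver class):

// Sketch: Hadoop's built-in map-side join. Both inputs must already
// be sorted and identically partitioned for this to work.
JobConf job = new JobConf(conf, MyJoinJob.class);
job.setInputFormat(CompositeInputFormat.class);
job.set("mapred.join.expr", CompositeInputFormat.compose(
    "inner", KeyValueTextInputFormat.class,
    new Path("/data/sorted/a"), new Path("/data/sorted/b")));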



1 See http://pages.cs.wisc.edu/~jignesh/publ/hadoopjoin.pdf.






