Chapter 29. The Small Files Problem

• Solution 1: using a custom merge of small files; this solution merges small files into big files on the client side.
• Solution 2: using a custom implementation of CombineFileInputFormat (org.apache.hadoop.mapred.lib.CombineFileInputFormat).

Solution 1: Merging Small Files Client-Side
Let's assume that we have to process 20,000 small files (assuming that each file's size is much smaller than 64 MB) and we want to process them efficiently in the MapReduce/Hadoop environment. If you just send these files as input via FileInputFormat.addInputPath(Job, Path), then each input file will be sent to a mapper and you will end up with 20,000 mappers, which is very inefficient. Let dfs.block.size be 64 MB. Further assume that the size of these files is between 2 and 3 MB (so that on average each small file's size is 2.5 MB), and that we have M (such as 100, 200, or 300) mappers available to us. The following multithreaded algorithm (a POJO, non-MapReduce solution) solves the small files problem. Since our small files occupy 2.5 MB on average, we can put 25 of them (25 × 2.5 ≈ 64 MB) into one HDFS block, which we call a bucket. Now we need just 800 (20,000 ÷ 25 = 800) mappers, which is very efficient compared to 20,000 mappers. Our algorithm puts N files (in our example, 25) into each bucket and then concurrently merges the small files in each bucket into one file whose size is close to dfs.block.size.
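If you want to play with this sizing arithmetic, the following standalone sketch (an illustrative assumption, not part of the book's code) computes the files-per-bucket capacity and the resulting number of buckets/mappers for the example above.

// Hypothetical helper illustrating the bucket-sizing arithmetic above;
// it is not part of the book's SmallFilesConsolidator API.
public class BucketSizing {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;                 // dfs.block.size = 64 MB
        long averageFileSize = (long) (2.5 * 1024 * 1024);  // average small file = 2.5 MB
        int totalFiles = 20000;

        // how many 2.5 MB files fit into one 64 MB bucket
        int filesPerBucket = (int) (blockSize / averageFileSize);          // 25
        // number of buckets = number of mappers, rounded up
        int buckets = (totalFiles + filesPerBucket - 1) / filesPerBucket;  // 800

        System.out.println("files per bucket  = " + filesPerBucket);
        System.out.println("buckets (mappers) = " + buckets);
    }
}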
Before submitting our small files to MapReduce/Hadoop, we merge them into big
ones; we then submit these to the MapReduce driver program. Example 29-1 (taken
from the driver program that submits MapReduce/Hadoop jobs) shows how to merge
small files into one large file.
Example 29-1. Merging small files into a large file
// prepare input
int NUMBER_OF_MAP_SLOTS_AVAILABLE = <number of available map slots>;
Job job = <create a Hadoop job and set its parameters>;
List<Path> smallFiles = <list of all small files to be processed>;
int numberOfSmallFiles = smallFiles.size();
if (NUMBER_OF_MAP_SLOTS_AVAILABLE >= numberOfSmallFiles) {
   // we have enough mappers and there is no need
   // to merge or consolidate small files; each
   // small file will be sent as a block to a mapper
   for (Path path : smallFiles) {
      FileInputFormat.addInputPath(job, path);
   }
}
else {
   // the number of mappers is less than the number of small files:
   // create and fill buckets with merged small files

   // Step 1: create empty buckets (each bucket may hold a set of small files)
   BucketThread[] buckets = SmallFilesConsolidator.createBuckets(
      smallFiles,
      NUMBER_OF_MAP_SLOTS_AVAILABLE);

   // Step 2: fill buckets with small files
   SmallFilesConsolidator.fillBuckets(buckets, smallFiles, job);

   // Step 3: merge small files per bucket;
   // each bucket is a thread (implements the Runnable interface),
   // so merging is done concurrently for each bucket
   SmallFilesConsolidator.mergeEachBucket(buckets, job);
}

The SmallFilesConsolidator class accepts a set of small Hadoop files and merges them into larger Hadoop files whose size is less than or equal to dfs.block.size (i.e., the HDFS block size). The optimal solution creates the smallest possible number of files (recall that there will be one mapper per file), so each merged file should be as close as possible to the HDFS block size. We generate these large files (named with GUIDs) under the /tmp/ directory in HDFS (of course, the directory you use is configurable):
// this directory is configurable
private static String MERGED_HDFS_ROOT_DIR = "/tmp/";
...
private static String getParentDir() {
   String guid = UUID.randomUUID().toString();
   return MERGED_HDFS_ROOT_DIR + guid + "/";
}

The BucketThread class enables us to concatenate small files into one big file whose size is smaller than the HDFS block size. This way, we submit fewer mappers with bigger input files. The BucketThread class implements the Runnable interface and provides the copyMerge() method, which merges the small files into a larger file. Since each BucketThread object implements the Runnable interface, it can run in its own thread, so all BucketThread objects can merge their small files concurrently. BucketThread.copyMerge() is the core method; it merges all the small files in one bucket into a single temporary HDFS file. For example, if a bucket holds the small files {File1, File2, File3, File4}, then the merged file will look like Figure 29-1 (note that MergedFile is the concatenation of all four files).

Figure 29-1. Small files merged into larger file
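To make the threading model concrete, here is a minimal sketch of a Runnable bucket together with a concurrent merge step that starts one thread per bucket and waits for all of them. The class and helper names (BucketThreadSketch, add(), size()) are illustrative assumptions; the book's BucketThread and SmallFilesConsolidator.mergeEachBucket() carry additional bookkeeping (target paths, HDFS handles, job setup) not shown here.

// Illustrative sketch only; not the book's exact BucketThread implementation.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.Path;

public class BucketThreadSketch implements Runnable {

   private final List<Path> bucket = new ArrayList<Path>(); // small files in this bucket

   public void add(Path smallFile) {
      bucket.add(smallFile);
   }

   public int size() {
      return bucket.size();
   }

   @Override
   public void run() {
      try {
         copyMerge();   // merge this bucket's small files into one large file
      }
      catch (IOException e) {
         throw new RuntimeException(e);
      }
   }

   public void copyMerge() throws IOException {
      // see Example 29-2 for the real merging logic
   }

   // One thread per bucket; wait for all merges to finish before
   // adding the merged files to the job's input paths.
   public static void mergeEachBucket(BucketThreadSketch[] buckets)
      throws InterruptedException {
      Thread[] threads = new Thread[buckets.length];
      for (int i = 0; i < buckets.length; i++) {
         threads[i] = new Thread(buckets[i]);
         threads[i].start();
      }
      for (Thread t : threads) {
         t.join();
      }
   }
}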
Example 29-2 shows the implementation of the BucketThread.copyMerge() method.
Example 29-2. The copyMerge() method
/**
 * Copy all files in several directories to one output file (mergedFile).
 *
 * parentDir will be "/tmp/<guid>/"
 * targetDir will be "/tmp/<guid>/id/"
 * targetFile will be "/tmp/<guid>/id/id"
 *
 * Merge all paths in the bucket and return a new directory
 * (targetDir), which holds the merged paths.
 */
public void copyMerge() throws IOException {
   // if there is only one path/dir in the bucket,
   // then there is no need to merge it
   if (size() < 2) {
      return;
   }
   // here bucket.size() >= 2
   Path hdfsTargetFile = new Path(targetFile);
   OutputStream mergedFile = fs.create(hdfsTargetFile);
   try {
      for (int i = 0; i < bucket.size(); i++) {
         FileStatus[] contents = fs.listStatus(bucket.get(i));
         for (int k = 0; k < contents.length; k++) {
            if (!contents[k].isDir()) {
               InputStream smallFile = fs.open(contents[k].getPath());
               try {
                  IOUtils.copyBytes(smallFile, mergedFile, conf, false);
               }
               finally {
                  HadoopUtil.close(smallFile);
               }
            }
         } // for k
      } // for i
   }
   finally {
      HadoopUtil.close(mergedFile);
   }
}

The SmallFilesConsolidator class therefore provides three pieces of functionality:
1. Create the required empty buckets. Each bucket will hold a set of small files. This
will be done by SmallFilesConsolidator.createBuckets().
2. Fill the buckets. We will place enough small files in a bucket so that the total size
of all small files will be about dfs.block.size. This behavior is implemented by
SmallFilesConsolidator.fillBuckets().
3. Merge each bucket. Here we will merge all small files in the bucket to create a
single large file, whose size will be about dfs.block.size. This is accomplished
by SmallFilesConsolidator.mergeEachBucket().
To demonstrate the small files problem, we will run the classic word count program with and without the SmallFilesConsolidator class, using the same 30 small files as input in each case. We will clearly see that the SmallFilesConsolidator version outperforms the original solution: it finishes in 58,235 milliseconds, while the original word count program run directly on the small files finishes in 80,435 milliseconds.

Input Data
We use the following input data (30 small files) for both solutions:
# hadoop fs -ls /small_input_files/input/
Found 30 items
-rw-r--r-- 1 ... /small_input_files/input/Document-1
-rw-r--r-- 1 ... /small_input_files/input/Document-2
...
-rw-r--r-- 1 ... /small_input_files/input/Document-29
-rw-r--r-- 1 ... /small_input_files/input/Document-30

Solution with SmallFilesConsolidator
In this solution we will use the SmallFilesConsolidator class to merge the small
files into a larger file.

Hadoop implementation classes
Table 29-1 shows the Java classes we’ll require in this solution.


Table 29-1. Required Java classes for solution with SmallFilesConsolidator
Class name                         Class description
BucketThread                       Used to merge small files into larger files
HadoopUtil                         Defines some basic Hadoop utilities
SmallFilesConsolidator             Manages consolidation of small files into a larger file
WordCountDriverWithConsolidator    Word count driver with consolidator
WordCountMapper                    Defines map()
WordCountReducer                   Defines reduce() and combine()

The SmallFilesConsolidator class is the driver class that consolidates the small files into larger files whose size is close to the HDFS block size. The main methods are listed next (a hedged sketch of the bucket-count calculation follows the list):

getNumberOfBuckets()
   Determines the number of buckets needed for merging all small files into bigger files.
      public static int getNumberOfBuckets(int totalFiles,
                                           int numberOfMapSlotsAvailable,
                                           int maxFilesPerBucket)

createBuckets()
   Creates the required buckets.
      public static BucketThread[] createBuckets(
         int totalFiles,
         int numberOfMapSlotsAvailable,
         int maxFilesPerBucket)

fillBuckets()
   Fills each bucket with small files.
      public static void fillBuckets(
         BucketThread[] buckets,
         List<Path> smallFiles,  // list of small files
         Job job,
         int maxFilesPerBucket)

mergeEachBucket()
   Merges the small files in each bucket to create a larger file.
      public static void mergeEachBucket(BucketThread[] buckets,
                                         Job job)
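To make the bucket-count idea concrete, here is a minimal, hedged sketch of the kind of calculation getNumberOfBuckets() has to perform. It is an assumption for illustration, not the class's actual source; the book's real method may also use the number of available map slots (unused in this sketch) to adjust the result.

// Hypothetical sketch; the real implementation may differ.
public class BucketCountSketch {

   public static int getNumberOfBuckets(int totalFiles,
                                        int numberOfMapSlotsAvailable,
                                        int maxFilesPerBucket) {
      // ceiling of totalFiles / maxFilesPerBucket
      return (totalFiles + maxFilesPerBucket - 1) / maxFilesPerBucket;
   }

   public static void main(String[] args) {
      // 20,000 small files, 25 files per bucket => 800 buckets (800 mappers)
      System.out.println(getNumberOfBuckets(20000, 300, 25));
   }
}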

Sample run
Here is a sample run for our solution (edited and formatted to fit the page):


# ./run_with_consolidator.sh
...
Deleted hdfs://localhost:9000/small_input_files/output
13/11/05 10:54:04 ...: inputDir=/small_input_files/input
13/11/05 10:54:04 ...: outputDir=/small_input_files/output
13/11/05 10:54:05 ...added path: /tmp/906e6c30-c411-4a70-b68f-114ba7511e63/
...
13/11/05 10:54:05 ...added path: /tmp/906e6c30-c411-4a70-b68f-114ba7511e63/
...
13/11/05 10:54:05 INFO input.FileInputFormat: Total input paths to process : 8
...
13/11/05 10:54:05 INFO mapred.JobClient: Running job: job_201311051023_0002
13/11/05 10:54:06 INFO mapred.JobClient: map 0% reduce 0%
...
13/11/05 10:55:01 INFO mapred.JobClient: map 100% reduce 100%
13/11/05 10:55:02 INFO mapred.JobClient: Job complete: job_201311051023_0002
13/11/05 10:55:02 INFO mapred.JobClient: Launched reduce tasks=10
13/11/05 10:55:02 INFO mapred.JobClient: Launched map tasks=8
13/11/05 10:55:02 INFO mapred.JobClient: Data-local map tasks=8
...
13/11/05 10:55:02 INFO mapred.JobClient: Map input records=48
13/11/05 10:55:02 INFO mapred.JobClient: Reduce input records=48
13/11/05 10:55:02 INFO mapred.JobClient: Reduce input groups=7
13/11/05 10:55:02 INFO mapred.JobClient: Reduce output records=7
13/11/05 10:55:02 INFO mapred.JobClient: Map output records=201
13/11/05 10:55:02 INFO WordCountDriverWithConsolidator: returnStatus=0
13/11/05 10:55:02 INFO WordCountDriverWithConsolidator:
Finished in milliseconds: 58235

As you can see from the log of the sample run, we have consolidated 30 HDFS small
files into 8 large HDFS files.

Solution Without SmallFilesConsolidator
This solution is just a basic word count application that does not use the SmallFilesConsolidator class. As you can see from the following snippet from the sample run, the total number of input paths to process is 30, which is exactly the number of small files we want to process:
...
13/11/05 10:29:13 INFO input.FileInputFormat: Total input paths to process : 30
...

This is not an optimal solution at all, since every small file will be sent to its own mapper. As you know, the ideal case is to send input files whose size is just under or equal to the HDFS block size (because Hadoop is designed to handle large files).

Hadoop implementation classes
Table 29-2 shows the Java classes required for our solution without
SmallFilesConsolidator.

Table 29-2. Java classes required for solution without SmallFilesConsolidator
Class name                            Class description
HadoopUtil                            Defines some basic Hadoop utilities
WordCountDriverWithoutConsolidator    Word count driver without consolidator
WordCountMapper                       Defines map()
WordCountReducer                      Defines reduce() and combine()

Sample run
Here is the output from a sample run (edited and formatted to fit the page) of our
solution without SmallFilesConsolidator:
# ./run_without_consolidator.sh
...
Deleted hdfs://localhost:9000/small_input_files/output
13/11/05 10:29:12 ... inputDir=/small_input_files/input
13/11/05 10:29:12 ... outputDir=/small_input_files/output
...
13/11/05 10:29:13 INFO input.FileInputFormat: Total input paths to process : 30
...
13/11/05 10:29:13 INFO mapred.JobClient: Running job: job_201311051023_0001
13/11/05 10:29:14 INFO mapred.JobClient: map 0% reduce 0%
...
13/11/05 10:30:32 INFO mapred.JobClient: map 100% reduce 100%
13/11/05 10:30:33 INFO mapred.JobClient: Job complete: job_201311051023_0001
...
13/11/05 10:30:33 INFO mapred.JobClient: Map-Reduce Framework
13/11/05 10:30:33 INFO mapred.JobClient: Map input records=48
13/11/05 10:30:33 INFO mapred.JobClient: Reduce input records=153
13/11/05 10:30:33 INFO mapred.JobClient: Reduce input groups=7
13/11/05 10:30:33 INFO mapred.JobClient: Combine output records=153
13/11/05 10:30:33 INFO mapred.JobClient: Reduce output records=7
13/11/05 10:30:33 INFO mapred.JobClient: Map output records=201
13/11/05 10:30:33 INFO WordCountDriverWithoutConsolidator:
run(): status=true
13/11/05 10:30:33 INFO WordCountDriverWithoutConsolidator:
Finished in milliseconds: 80435

Solution 2: Solving the Small Files Problem with CombineFileInputFormat
This section uses the Hadoop API (the abstract class CombineFileInputFormat) to
solve the small files problem. This is how CombineFileInputFormat (as an abstract
class) is defined in Hadoop 2.5.0:
package org.apache.hadoop.mapred.lib;
...
@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class CombineFileInputFormat<K,V>
   extends org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat<K,V>
   implements InputFormat<K,V>

The idea behind the abstract class CombineFileInputFormat is to enable combining small files into Hadoop's splits (or chunks) by using a custom InputFormat. To use the abstract class CombineFileInputFormat, we have to provide/implement three custom classes:
• CustomCFIF extends CombineFileInputFormat (an abstract class with no concrete implementation, so we must create this subclass in order to use it).
• PairOfStringLong is a Writable class that stores the small file's name (as a String) and its offset (as a Long) and overrides the compareTo() method to compare the filename first, then the offset.
• CustomRecordReader is a custom RecordReader (a hedged skeleton of such a reader follows this list):
     public class CustomRecordReader
        extends RecordReader<PairOfStringLong, Text> {
        ...
     }
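For context, here is a minimal sketch of what such a record reader might look like. It is not the book's CustomRecordReader: it assumes that Cloud9's PairOfStringLong offers a no-argument constructor and a set(String, long) method, and it simply emits each line of the current small file keyed by (filename, offset). The one hard requirement is the constructor signature, because CombineFileRecordReader instantiates the plug-in class via a (CombineFileSplit, TaskAttemptContext, Integer) constructor.

// Illustrative sketch only; not the book's exact CustomRecordReader.
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.util.LineReader;
import edu.umd.cloud9.io.pair.PairOfStringLong;

public class CustomRecordReader extends RecordReader<PairOfStringLong, Text> {

   private final Path path;    // the small file this reader is responsible for
   private final long start;   // start offset of this file inside the combined split
   private final long end;     // end offset
   private long pos;           // current position
   private LineReader reader;
   private final PairOfStringLong key = new PairOfStringLong(); // assumes a no-arg constructor
   private final Text value = new Text();

   // CombineFileRecordReader requires exactly this constructor signature:
   // (CombineFileSplit, TaskAttemptContext, Integer index-of-file-in-split)
   public CustomRecordReader(CombineFileSplit split,
                             TaskAttemptContext context,
                             Integer index) throws IOException {
      this.path = split.getPath(index);
      this.start = split.getOffset(index);
      this.end = start + split.getLength(index);
      this.pos = start;
      FileSystem fs = path.getFileSystem(context.getConfiguration());
      FSDataInputStream in = fs.open(path);
      in.seek(start);
      this.reader = new LineReader(in, context.getConfiguration());
   }

   @Override
   public void initialize(InputSplit split, TaskAttemptContext context) {
      // all initialization is done in the constructor
   }

   @Override
   public boolean nextKeyValue() throws IOException {
      if (pos >= end) {
         return false;
      }
      key.set(path.getName(), pos);   // assumes PairOfStringLong.set(String, long)
      int bytesRead = reader.readLine(value);
      if (bytesRead == 0) {
         return false;
      }
      pos += bytesRead;
      return true;
   }

   @Override
   public PairOfStringLong getCurrentKey() { return key; }

   @Override
   public Text getCurrentValue() { return value; }

   @Override
   public float getProgress() {
      return (end == start) ? 1.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
   }

   @Override
   public void close() throws IOException {
      reader.close();
   }
}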

The custom implementation of CombineFileInputFormat is provided in Example 29-3.

Example 29-3. CustomCFIF class
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;

import edu.umd.cloud9.io.pair.PairOfStringLong;
// PairOfStringLong = Tuple2<String, Long> = Tuple2<filename, offset>
// https://github.com/lintool/Cloud9/

/**
 * A custom file input format that combines/merges smaller files
 * into big files controlled by MAX_SPLIT_SIZE.
 *
 * @author Mahmoud Parsian
 *
 */
public class CustomCFIF extends CombineFileInputFormat<PairOfStringLong, Text> {

   final static long MAX_SPLIT_SIZE = 67108864; // 64 MB

   public CustomCFIF() {
      super();
      setMaxSplitSize(MAX_SPLIT_SIZE);
   }

   public RecordReader<PairOfStringLong, Text> createRecordReader
      (InputSplit split, TaskAttemptContext context) throws IOException {
      return new CombineFileRecordReader<PairOfStringLong, Text>(
         (CombineFileSplit) split,
         context,
         CustomRecordReader.class);
   }

   @Override
   protected boolean isSplitable(JobContext context, Path file) {
      return false;
   }
}

You should set MAX_SPLIT_SIZE based on the HDFS block size (which was 64 MB by default in older Hadoop releases). If most of your files are bigger than 64 MB, then you may set the HDFS block size to 128 MB or even 256 MB (in some genomic applications, the HDFS block size is set to 512 MB). In Hadoop 2.5.1, the HDFS block size is set to 128 MB (134,217,728 bytes) by default. You can control the HDFS block size via the dfs.blocksize property in the hdfs-site.xml file (one of the files used to configure a Hadoop cluster). For example:
$ cat $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.blocksize</name>
        <value>268435456</value>
        <description>256MB</description>
    </property>
    ...
</configuration>


Setting the maximum split size (MAX_SPLIT_SIZE) determines the number of mappers needed. For example, consider the following HDFS directory (for testing purposes, you can create a lot of small files using bash; for details, see http://bit.ly/many_small_files):
# hadoop fs -ls /small_input_files | wc -l
10004
# hadoop fs -ls /small_input_files | head -3
-rw-r--r-- 3 ... 9184 2014-10-06 15:20 /small_input_files/file1.txt
-rw-r--r-- 3 ... 27552 2014-10-06 15:20 /small_input_files/file2.txt
-rw-r--r-- 3 ... 27552 2014-10-06 15:20 /small_input_files/file3.txt
# hadoop fs -ls /small_input_files | tail -3
-rw-r--r-- 3 ... 27552 2014-10-06 15:28 /small_input_files/file10002.txt
-rw-r--r-- 3 ... 27552 2014-10-06 15:28 /small_input_files/file10003.txt
-rw-r--r-- 3 ... 27552 2014-10-06 15:28 /small_input_files/file10004.txt
# hadoop fs -dus /small_input_files
275584288
/small_input_files
#

As you can see, the HDFS directory /small_input_files holds 10,004 small files that occupy 275,584,288 bytes in total. If we do not use CustomCFIF as our input format, then a basic MapReduce job will queue 10,004 mappers (which takes over 34 minutes to execute on a three-node cluster). But using CustomCFIF, we need only five mappers (which takes under 2 minutes to execute on the same three-node cluster). Why do we need five mappers? The following calculation answers that question:
HDFS split size = 64 MB = 64 * 1024 * 1024 = 67,108,864 bytes
Total size of the 10,004 small files = 275,584,288 bytes
275,584,288 / 67,108,864 = 4 full splits, with 7,148,832 bytes left over
67,108,864 + 67,108,864 + 67,108,864 + 67,108,864 + 7,148,832 = 275,584,288
Therefore, 5 input splits are required
=> this will launch 5 mappers (one per split)

If you set MAX_SPLIT_SIZE to 128 MB (134,217,728 bytes), then the Hadoop job will
launch only three mappers, like so:
HDFS split size = 128 MB = 128 * 1024 * 1024 = 134,217,728 bytes
Total size of the 10,004 small files = 275,584,288 bytes
275,584,288 / 134,217,728 = 2 full splits, with 7,148,832 bytes left over
134,217,728 + 134,217,728 + 7,148,832 = 275,584,288
Therefore, 3 input splits are required
=> this will launch 3 mappers (one per split)


Custom CombineFileInputFormat
The CustomCFIF class solves the small files problem by combining small files into splits whose size is bounded by MAX_SPLIT_SIZE (this is basically the maximum size of the bigger file into which the small files are merged); a hedged driver sketch showing how to plug it into a job follows the list below. This custom class (which extends the abstract class CombineFileInputFormat) has several functions:
• It sets the maximum split size by invoking setMaxSplitSize(MAX_SPLIT_SIZE) in the constructor. The combined small files will not exceed this size.
• It defines a custom record reader via createRecordReader(), and provides a plug-in class, CustomRecordReader, which reads the small files packed into each large split (whose maximum size is determined by MAX_SPLIT_SIZE).
• It defines the key-value pairs to be fed to the mappers. We use PairOfStringLong as the key and Text (a single line of a text file) as the value. PairOfStringLong represents two pieces of information: the filename (as a String) and the offset (as a Long).
• It indicates that the combined/merged files should not be split; this is set by the isSplitable() method, which returns false.
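To show where CustomCFIF fits into a job, here is a minimal driver sketch that registers it as the job's input format. This is an illustrative assumption, not the book's CombineSmallFilesDriver; the commented-out mapper line stands in for whatever mapper class consumes the PairOfStringLong/Text records.

// Illustrative driver sketch; assumes a mapper keyed by PairOfStringLong exists.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesSketch {
   public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = Job.getInstance(conf, "combine small files");
      job.setJarByClass(CombineSmallFilesSketch.class);

      // the key step: use the custom combining input format
      job.setInputFormatClass(CustomCFIF.class);

      // a hypothetical mapper with input key/value of PairOfStringLong/Text
      // job.setMapperClass(CombineSmallFilesMapper.class);

      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
}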

Sample Run Using CustomCFIF
The following subsections provide the script, run log, and output (edited and format‐
ted to fit the page) for our sample run using CustomCFIF.

The script
# cat run_combine_small_files.sh
#!/bin/bash
BOOK_HOME=/mp/data-algorithms-book
CLASSPATH=.:$BOOK_HOME/dist/data_algorithms_book.jar
APP_JAR=$BOOK_HOME/dist/data_algorithms_book.jar
CLASSPATH=$CLASSPATH:$BOOK_HOME/lib/spark-assembly-1.2.0-hadoop2.6.0.jar
INPUT=/small_input_files
OUTPUT=/output/1
PROG=org.dataalgorithms.chap29.combinesmallfiles.CombineSmallFilesDriver
hadoop jar $APP_JAR $PROG $INPUT $OUTPUT

Log of the sample run
# ./run_combine_small_files.sh
input path = /small_input_files
output path = /output/1
14/10/06 15:51:39 INFO input.FileInputFormat:
..Total input paths to process : 10003
14/10/06 15:51:40 INFO input.CombineFileInputFormat:
