Improving Hadoop efficiency: lessons from WordCount

WordCount, one of the Hadoop example programs, is a well-written MapReduce program that demonstrates an efficient implementation. Below we analyze it piece by piece.

1. Avoid creating new objects unnecessarily

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    // Allocated once and reused for every record, instead of once per token.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());  // overwrite the reused Text in place
        context.write(word, one);
      }
    }
  }

If the map function instead called context.write(new Text(itr.nextToken()), new IntWritable(1)), the loop would create two new objects for every token, leaving a large number of short-lived objects for the garbage collector to clean up.
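The difference can be sketched without Hadoop. In this minimal example, Word is a hypothetical mutable holder standing in for a Writable such as Text (it is not part of any Hadoop API), and a simple counter makes the number of allocations visible.

```java
import java.util.StringTokenizer;

/** Minimal sketch: Word is a hypothetical stand-in for a mutable Writable such as Text. */
class ReuseSketch {
    static int allocations = 0;   // counts how many Word objects were created

    static final class Word {
        private String value = "";
        Word() { allocations++; }
        void set(String v) { value = v; }
        String get() { return value; }
    }

    /** Allocates a fresh Word for every token, like calling new Text(...) inside the loop. */
    static int allocatePerToken(String line) {
        allocations = 0;
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            Word w = new Word();
            w.set(itr.nextToken());
        }
        return allocations;
    }

    /** Creates one Word and overwrites it for every token, like the word field in TokenizerMapper. */
    static int reuseOneObject(String line) {
        allocations = 0;
        Word w = new Word();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            w.set(itr.nextToken());
        }
        return allocations;
    }
}
```

For a line of four tokens, allocatePerToken creates four objects and reuseOneObject creates one. In a real job the gap grows with the number of tokens processed by each task.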

To check whether this is a problem, add the following to the TaskTracker process configuration: -verbose:gc -XX:+PrintGCDetails. While tasks run, the details of each garbage collection are then printed to the log. If the log shows the JVM spending a lot of time on garbage collection, the program is usually creating too many unnecessary objects. The heap size of each task is also configurable. The JVM does not run GC until its memory use reaches a certain threshold, so if the heap is configured too small, GC will take up a large share of the running time and tasks will run into trouble.
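As a complement to reading the GC log, accumulated GC activity can also be sampled from inside a JVM through the standard java.lang.management API. The sketch below is an illustration only: the class name GcStats is hypothetical, and where you would call it (for example, at the end of a task) is an assumption, not something the Hadoop framework does for you.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Minimal sketch: totals GC activity for the current JVM via the standard management API. */
class GcStats {
    /** Sum of collection counts over all collectors; collectors reporting -1 (undefined) are skipped. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) total += count;
        }
        return total;
    }

    /** Approximate accumulated GC time in milliseconds, summed over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) total += millis;
        }
        return total;
    }
}
```

Comparing totalGcMillis() with the wall-clock running time gives a rough idea of the share of time spent in GC, which is the same signal the log is meant to reveal.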

2. Use the right object type

Many people like to use Text objects everywhere, even when the value is a number or a complex data structure. Converting numbers to and from UTF-8 strings is very inefficient, and if it happens often it consumes a lot of CPU time and slows the job down. Whenever you handle non-text data (integers, floating-point numbers, and so on), IntWritable or FloatWritable is much more efficient. Besides avoiding the conversion cost of Text, binary Writable objects usually take up less storage space. Because disk I/O and network transfer are often the bottleneck in MapReduce, this can improve job performance considerably, especially for large jobs. When handling integer data, you can sometimes also use VIntWritable or VLongWritable, which can reduce both CPU usage and the space occupied.
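The space difference can be illustrated with plain Java I/O. This is a rough sketch, not Hadoop code: EncodingSize is a hypothetical helper, textBytes counts only the UTF-8 digits (ignoring any length prefix a real text type adds), and binaryBytes uses DataOutputStream.writeInt as a stand-in for a fixed-width binary encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Minimal sketch comparing the size of an int stored as text versus fixed-width binary. */
class EncodingSize {
    /** Bytes needed to store the number as UTF-8 decimal text (payload only, no length prefix). */
    static int textBytes(int n) {
        return Integer.toString(n).getBytes(StandardCharsets.UTF_8).length;
    }

    /** Bytes written by DataOutputStream.writeInt, a fixed-width 4-byte binary encoding. */
    static int binaryBytes(int n) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(n);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep the sketch simple.
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }
}
```

For 1234567 the text form needs 7 bytes and the binary form 4. For very small values the text form can be shorter than a fixed 4 bytes, which is the case that variable-length types such as VIntWritable are meant to handle.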

3. Use a Combiner to improve efficiency

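    // Excerpt from the WordCount driver; the output key/value class settings are omitted.
    Job job = new Job(conf, "word count");
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // the reducer is reused as the combiner
    job.setReducerClass(IntSumReducer.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);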

A Combiner is worth considering when you see the following signs. The job performs a series of sort operations, and the value of the reduce input groups counter is far smaller than the reduce input records counter. After the mappers finish, the shuffle transfers a large volume of intermediate results (for example, the map output bytes on each slave amount to several GB). In the job's counters in the web UI, the number of spilled records is far greater than the number of map output records. If the job's algorithm involves a lot of sorting, try writing a Combiner to improve performance. Hadoop's MapReduce framework provides the Combiner to reduce both the intermediate results written to disk and the intermediate results transferred between mappers and reducers, and these are usually the two factors that affect job performance most.
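The effect on the volume of intermediate data can be sketched without Hadoop. In this minimal example, CombinerSketch is a hypothetical class that compares how many (word, 1) records a single map task would hand to the shuffle with and without local summing. It only models the idea of a sum combiner and is not the framework's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

/** Minimal sketch of what a sum combiner does to one map task's output. */
class CombinerSketch {
    /** Number of (word, 1) records one map task would emit for this input without a combiner. */
    static int recordsWithoutCombiner(String input) {
        return new StringTokenizer(input).countTokens();
    }

    /** Locally sums the (word, 1) records by key, as a sum combiner would before the shuffle. */
    static Map<String, Integer> combine(String input) {
        Map<String, Integer> partial = new LinkedHashMap<>();
        StringTokenizer itr = new StringTokenizer(input);
        while (itr.hasMoreTokens()) {
            partial.merge(itr.nextToken(), 1, Integer::sum);
        }
        return partial;
    }
}
```

For the input "a b a c a b", six records shrink to three partial sums (a=3, b=2, c=1). The saving depends on how often keys repeat within one map task's output, so it is largest for skewed data such as word counts.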

4. Adjust the block size to change the running time of each map

Signs that this kind of tuning is needed:

  • Each map or reduce task finishes in only 30-40 seconds.
  • A very large job, which would normally need many map and reduce slots, does not fill the cluster's available slots once it is running.
  • After almost all of the maps and reducers have been scheduled and are running, one or two map or reduce tasks are still pending and have not run, which holds up the normal completion of the job.

Setting the number of map and reduce tasks for a job is very important and also very simple. Here are some lessons learned about setting these values:

  • If each map or reduce task runs for only 30-40 seconds, reduce the number of map or reduce tasks in the job. Setting up each task and adding it to the scheduler takes a few seconds, so if every task finishes very quickly, too much time is wasted on starting and ending tasks. JVM reuse can also address this problem.
  • If the input is very large, for example 1TB, consider using a larger HDFS block size, for example 256MB or 512MB, so that the number of map tasks is reduced. Data that already exists in HDFS can be re-blocked with the command: hadoop distcp -Ddfs.block.size=$[256*1024*1024] /path/to/inputdata /path/to/inputdata-with-largeblocks Then delete the original file.
  • As long as each task runs for at least 30-40 seconds, you may consider increasing the number of mappers, but keep the number of map slots in mind. For example, if the cluster has 100 map slots, avoid giving a job 101 mappers: the first 100 run in parallel, and the last one starts only after they finish, so the map phase takes almost twice as long before the reducers can start running.
  • Try not to run too many reduce tasks. For most jobs, the best number of reduce tasks is equal to, or slightly smaller than, the number of reduce slots in the cluster. This is especially important on small clusters.
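The arithmetic behind these rules of thumb can be sketched as follows. TaskMath is a hypothetical helper, and it makes two simplifying assumptions: each map task processes exactly one block, and all tasks in a wave take about the same time.

```java
/** Minimal sketch of the task-count arithmetic, assuming one map task per block. */
class TaskMath {
    /** Number of map tasks if each one reads a single block of blockBytes; rounds up. */
    static long mapTasks(long inputBytes, long blockBytes) {
        return (inputBytes + blockBytes - 1) / blockBytes;
    }

    /** Number of scheduling waves needed to run the given tasks on the given parallel slots. */
    static long waves(long tasks, long slots) {
        return (tasks + slots - 1) / slots;
    }
}
```

Under these assumptions, a 1TB input gives 16384 map tasks with 64MB blocks, 4096 with 256MB blocks, and 2048 with 512MB blocks. On 100 map slots, 100 mappers finish in one wave while 101 mappers need two, which is why the extra mapper nearly doubles the length of the map phase.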
Category: Java  Date: 2010-07-19