From WordCount: learning how to improve Hadoop efficiency

Hadoop's WordCount example is a well-written MapReduce program that demonstrates several efficient implementation techniques. Below we walk through it and analyze those techniques one by one.

1. Reuse objects instead of creating new ones in place

  public static class TokenizerMapper 
       extends Mapper<Object, Text, Text, IntWritable>{

    // The output key and value objects are created once and reused for
    // every record, instead of being allocated inside the map loop.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());   // reuse the same Text instance
        context.write(word, one);
      }
    }
  }


If the map function had instead called context.write(new Text(...), new IntWritable(1)) for every token, the program would have allocated a huge number of short-lived objects inside the loop; reusing the word and one fields avoids that overhead.
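For contrast, a minimal sketch of that allocation-heavy variant (not part of the original example) would look like this inside the same TokenizerMapper:

    // Inefficient: a new Text and a new IntWritable are allocated for every
    // token, creating millions of short-lived objects on large inputs.
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        context.write(new Text(itr.nextToken()), new IntWritable(1));
      }
    }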

Add -verbose:gc -XX:+PrintGCDetails to the tasktracker's child-JVM options; the task JVMs will then print detailed garbage-collection information to their logs. If those logs show the JVM spending a large share of its time in GC, the program is most likely creating too many unnecessary objects. The heap size of each task is also configurable: GC only runs once the JVM's memory usage reaches a certain threshold, so if the heap is configured too small, the task spends a large fraction of its time in garbage collection and runs poorly.
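A minimal sketch of how this could be set in the job configuration, assuming the pre-YARN mapred.child.java.opts property (verify the property name for your Hadoop version):

    Configuration conf = new Configuration();
    // Give each task's child JVM a larger heap and verbose GC logging;
    // the GC details then appear in the task logs on each tasktracker.
    conf.set("mapred.child.java.opts",
             "-Xmx512m -verbose:gc -XX:+PrintGCDetails");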

2. Use the right Writable types

Many people habitually use the Text type for every key and value, even when the value is a number or a more complex data structure. Converting numbers to and from UTF-8 strings is expensive; if it happens on every record it consumes a lot of CPU and lowers throughput. When handling non-text data (integers, floating-point values, and so on), IntWritable, FloatWritable, and the other binary Writable types are much more efficient. Besides saving CPU, binary Writables usually take up less storage space, and since disk I/O and network transfer are often the bottleneck of a MapReduce job, this can noticeably improve performance, especially for large jobs. For integer data, VIntWritable and VLongWritable (variable-length encodings) can sometimes help further, reducing both CPU usage and space.
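As a hedged illustration (this class and its job are hypothetical, not from the original article), a mapper whose value is a number should emit a binary IntWritable rather than formatting the number into a Text string:

  // Hypothetical example: emit the byte length of each input line.
  public static class LengthMapper
       extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text line = new Text();
    private final IntWritable length = new IntWritable();  // binary int, not "42" as text

    public void map(LongWritable key, Text value, Context context
                    ) throws IOException, InterruptedException {
      line.set(value);
      length.set(value.getLength());
      context.write(line, length);
    }
  }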

3. Use a Combiner to improve efficiency

Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The reducer is also registered as a combiner, so partial sums are
    // computed on the map side before the shuffle.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);


Several signs indicate that a Combiner would help. A job performs a long series of sort-and-spill operations; the "reduce input groups" counter is far smaller than the "reduce input records" counter; after the mappers finish, the shuffle transfers a large volume of intermediate data (for example, several GB of map output bytes on each slave); or, in the job's web UI, the "spilled records" counter is far greater than "map output records". If the job's algorithm involves this kind of heavy aggregation, writing a Combiner can markedly improve performance. Hadoop's MapReduce framework provides the Combiner to reduce both the intermediate results written to disk and the intermediate data transferred between mappers and reducers, and both of these factors strongly affect job performance.
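For reference, the IntSumReducer that the driver registers as both combiner and reducer looks roughly like this in the stock WordCount example (note that the result field is reused, just like word and one in the mapper):

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {

    private IntWritable result = new IntWritable();   // reused output value

    public void reduce(Text key, Iterable<IntWritable> values, Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();   // as a combiner, this produces per-map partial sums
      }
      result.set(sum);
      context.write(key, result);
    }
  }

Because integer addition is associative and commutative, the same class can safely serve as both combiner and reducer.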

4. Adjust the block size to control the running time of each map task

The following symptoms suggest the number of map and reduce tasks needs tuning:

  • Each map or reduce task finishes in only 30-40 seconds.
  • A very large job normally needs many map and reduce slots, but while it runs, the map and reduce tasks never fill the cluster's available slots.
  • Almost all map and reduce tasks have been scheduled and are running, but one or two remain pending and never run, so the job can never finish normally.

The number of map and reduce tasks a job uses is simple to set but has a large effect on how the job runs. Here are some lessons learned about choosing these values:

  • If each map or reduce task runs for only 30-40 seconds, reduce the number of tasks. Setting up a task and adding it to the scheduler costs a few seconds of overhead, so when every task finishes almost immediately, too much of the job's time is wasted on task startup and teardown. Enabling JVM reuse can also mitigate this problem.
  • If an input file is very large, say 1 TB, consider increasing its HDFS block size, for example to 256 MB or 512 MB, which lowers the number of map tasks. Existing HDFS data can be re-chunked with a command such as: hadoop distcp -Ddfs.block.size=$[256*1024*1024] /path/to/inputdata /path/to/inputdata-with-largeblocks, deleting the original files afterwards.
  • As long as each task still runs for at least 30-40 seconds, size the mapper count to the cluster. If the cluster has 100 map slots, avoid setting a job to 101 mappers: the first 100 run in parallel and the last one starts only after one of them finishes, so the map phase takes nearly twice as long before the reducers can begin.
  • Do not run too many reduce tasks. For most jobs, the best number of reduce tasks is equal to, or slightly smaller than, the number of reduce slots in the cluster; this matters especially on small clusters (see the sketch after this list).
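As a minimal sketch of these knobs in the driver (the values are illustrative, and mapred.map.tasks is only a hint to the framework, since the real map count follows the input splits):

    Configuration conf = new Configuration();
    conf.setInt("mapred.map.tasks", 100);   // hint only; splits/block size decide the real count
    Job job = new Job(conf, "tuned word count");
    // The reduce count is honored exactly: keep it at or just below the
    // number of reduce slots in the cluster.
    job.setNumReduceTasks(20);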