MapredLoadTest generates a bunch of work that exercises
a Hadoop Map-Reduce system (and DFS, too). It goes through
the following steps:
1) Take inputs 'range' and 'counts'.
2) Generate 'counts' random integers between 0 and range-1.
3) Create a file that lists each integer between 0 and range-1,
and lists the number of times that integer was generated.
4) Emit a (very large) file that contains all the integers
in the order generated.
5) After the file has been generated, read it back and count
how many times each int was generated.
6) Compare this big count-map against the original one. If
they match, then SUCCESS! Otherwise, FAILURE!
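The six steps above can be sketched as a short Python simulation (this is not the actual Hadoop code; `load_test` and its parameters are hypothetical names, and the in-memory list stands in for the big DFS file):

```python
import random
from collections import Counter

def load_test(range_, counts, seed=0):
    # Steps 1-2: generate 'counts' random integers in [0, range_-1].
    rng = random.Random(seed)
    values = [rng.randrange(range_) for _ in range(counts)]
    # Step 3: the answer key maps each integer to how often it was generated.
    answer_key = Counter(values)
    # Steps 4-5: "write" the values out, then read them back and re-count.
    emitted = list(values)          # stands in for the big output file
    recounted = Counter(emitted)
    # Step 6: compare the recomputed count-map against the answer key.
    return answer_key == recounted  # True means SUCCESS

print(load_test(range_=10, counts=1000))  # → True
```

In the real test the emit/read-back round trip goes through DFS and map-reduce, which is exactly what makes the final comparison a meaningful end-to-end check.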
OK, that's how we can think about it. What are the map-reduce
steps that get the job done?
1) In a non-mapred thread, take the inputs 'range' and 'counts'.
2) In a non-mapred thread, generate the answer-key and write to disk.
3) In a mapred job, divide the answer key into K jobs.
4) A mapred 'generator' task consists of K map jobs. Each reads
   an individual "sub-key", and generates integers according
   to it (though with a random ordering).
5) The generator's reduce task agglomerates all of those files
into a single one.
6) A mapred 'reader' task consists of M map jobs. The output
file is cut into M pieces. Each of the M jobs counts the
individual ints in its chunk and creates a map of all seen ints.
7) A mapred job integrates all the count files into a single one.
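The seven map-reduce steps can likewise be sketched in Python, simulating the K generator maps, the M reader maps, and the two reduces as plain functions (the names `split_key`, `generator_map`, and `reader_map` are hypothetical, chosen for illustration, not taken from the Hadoop source):

```python
import random
from collections import Counter

def split_key(answer_key, k):
    """Step 3: divide the answer key into K sub-keys."""
    items = sorted(answer_key.items())
    return [dict(items[i::k]) for i in range(k)]

def generator_map(sub_key, seed):
    """Step 4: one of K generator maps. Emit each int the required
    number of times, in a random order."""
    out = [v for v, n in sub_key.items() for _ in range(n)]
    random.Random(seed).shuffle(out)
    return out

def reader_map(chunk):
    """Step 6: one of M reader maps. Count the ints in its chunk."""
    return Counter(chunk)

def run(range_, counts, k=4, m=3, seed=0):
    # Steps 1-2: generate the data and the answer key.
    rng = random.Random(seed)
    answer_key = Counter(rng.randrange(range_) for _ in range(counts))
    # Steps 3-5: K generator maps, then a reduce that agglomerates
    # their outputs into a single "file".
    big_file = []
    for i, sub in enumerate(split_key(answer_key, k)):
        big_file.extend(generator_map(sub, seed + i))
    # Step 6: cut the file into M pieces and count each piece.
    size = (len(big_file) + m - 1) // m
    chunks = [big_file[i:i + size] for i in range(0, len(big_file), size)]
    # Step 7: a final reduce integrates the M count-maps into one.
    total = Counter()
    for c in chunks:
        total += reader_map(c)
    return total == answer_key

print(run(range_=10, counts=1000))  # → True
```

The simulation collapses DFS I/O and task scheduling into function calls, but it preserves the essential shape: the generator job turns a partitioned answer key into one large shuffled file, and the reader job recovers the histogram from partitions of that file.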