Hadoop – Mapper & Reducer (Mar '16)


Data file & Description

For this project, I used a large enrollment data set from my university. The CSV file has over 660,000 rows, far too many to look through by hand. So I used a Hadoop mapper and reducer to pull out just the information I needed. The columns are:

  • A: Semester ID
  • B: Semester
  • C: Location
  • D: Days of the week
  • E: Time
  • F: Course ID
  • G: Course Name
  • H: Actual Enrollment
  • I: Max Enrollment

[Screenshot: the enrollment data opened in Excel]

From this CSV, I produced output lines of the form [Location(w/o room#)_Semester  ActualEnrolled].

The mapper reads the location (without the room number) and the actual enrollment count from each row, and skips rows with unusable data (some rows have an Unknown or Arr value).
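To make the mapper step concrete, here is a minimal sketch in plain Java, without the Hadoop classes, of how one CSV row could turn into a key/value pair. The class name EnrollmentRecordSketch, the method toKeyValue, and the sample rows are all made up for illustration and are not part of the real data. It also assumes no field contains a comma; a course name with a comma in it would shift the columns.

```java
import java.util.Optional;

public class EnrollmentRecordSketch {

    // Hypothetical helper: returns "Building_Semester<TAB>enrollment",
    // or an empty Optional when the row should be skipped.
    public static Optional<String> toKeyValue(String csvLine) {
        String[] cols = csvLine.split(",");
        if (cols.length < 8) {
            return Optional.empty();             // too few columns
        }
        String semester = cols[1];               // column B
        String building = cols[2].split(" ")[0]; // column C without the room number
        if (semester.equals("Unknown")
                || building.contains("Arr")
                || building.contains("Unknown")) {
            return Optional.empty();             // unusable semester or location
        }
        try {
            int enrolled = Integer.parseInt(cols[7]); // column H
            return Optional.of(building + "_" + semester + "\t" + enrolled);
        } catch (NumberFormatException e) {
            return Optional.empty();             // enrollment is not a number
        }
    }

    public static void main(String[] args) {
        // Hypothetical rows, invented for illustration only.
        String[] rows = {
            "2161,Spring 2016,Smith 101,MWF,9:00,CSE115,Intro to CS,120,150",
            "2161,Spring 2016,Arr Arr,ARR,ARR,CSE499,Independent Study,3,10",
            "2161,Spring 2016,Smith 205,TR,11:00,CSE250,Data Structures,Unknown,90"
        };
        for (String row : rows) {
            System.out.println(toKeyValue(row).orElse("(skipped)"));
        }
    }
}
```

Only the first sample row produces a pair; the other two are skipped because of the Arr location and the non-numeric enrollment.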

The reducer adds up the enrollment counts that share the same building and semester key.
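The summing step can be sketched the same way in plain Java. This is only an illustration of the idea, not how Hadoop runs it internally: the class name EnrollmentSumSketch, the method sumByKey, and the sample pairs are hypothetical. A TreeMap is used so the keys print in sorted order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class EnrollmentSumSketch {

    // Hypothetical helper: takes "key<TAB>value" lines and sums the values per key.
    public static Map<String, Integer> sumByKey(List<String> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String pair : pairs) {
            String[] kv = pair.split("\t");
            // merge() stores the value for a new key, or adds it to the running total.
            totals.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical mapper output, invented for illustration only.
        List<String> pairs = Arrays.asList(
            "Smith_Spring 2016\t120",
            "Jones_Spring 2016\t40",
            "Smith_Spring 2016\t35");
        for (Map.Entry<String, Integer> e : sumByKey(pairs).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

With these sample pairs, the two Smith_Spring 2016 values are combined into a single total of 155, and Jones_Spring 2016 stays at 40.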

Actual code

/*

Chang Min Park

This code filters out rows whose semester name is "Unknown",
whose building name is "Arr" or "Unknown" (the room number is ignored),
or whose enrollment count is not an actual number.

*/

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
   // Returns true only if the string parses as a base-10 integer.
   public static boolean isNumber(String string){
      try{
         Integer.parseInt(string);
      }
      catch(NumberFormatException e){
         return false;
      }
      return true;
   }

   public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable>{

      private Text word = new Text ();
      public void map (Object key, Text value, Context context
         )throws IOException, InterruptedException{
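         String[] str = value.toString().split(",");
         // Skip short or malformed rows so the indexing below cannot go out of bounds.
         if(str.length < 8){
            return;
         }
         // Keep only the building name; the room number after the space is ignored.
         String[] building = str[2].split(" ");

         // Use equals() to compare string contents; == only compares object references in Java.
         if("Unknown".equals(str[1])||building[0].contains("Arr")||building[0].contains("Unknown")
            ||!isNumber(str[7])){
            return;
         } else{
            word.set(building[0]+"_"+str[1]);
            context.write(word, new IntWritable(Integer.parseInt(str[7])));
         }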
      }
   }

   public static class IntSumReducer
      extends Reducer <Text, IntWritable, Text, IntWritable> {
      private IntWritable result = new IntWritable();
      public void reduce (Text key, Iterable<IntWritable> values,
         Context context) throws IOException, InterruptedException {
         int sum = 0;
         for(IntWritable val : values) {
            sum += val.get();
         }
         result.set(sum);
         context.write(key, result);
      }
   }
   public static void main (String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = Job.getInstance(conf, "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args [0]));
      FileOutputFormat.setOutputPath(job, new Path(args [1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1 );
   }
}

Result

[Screenshot: the job output]
The output lists each building and semester pair with its total actual enrollment, summed over every matching row in the file.
Totals like these can be useful to the university staff who assign lecture halls.