Data file & Description
For this project, I used a large enrollment dataset from my university. The CSV file has over 660,000 rows, far too many to look over by hand. So I used a Hadoop mapper and reducer to pull out only the data I needed. The columns are:
- A: Semester ID
- B: Semester
- C: Location
- D: Days of the week
- E: Time
- F: Course ID
- G: Course Name
- H: Actual Enrollment
- I: Max Enrollment
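As a rough sketch of how these column letters line up with the zero-based indices used after splitting a row on commas, the enum below lists the columns in order. The enum and its method names are hypothetical and are not part of the project code. It assumes the file has exactly these nine columns in this order.

```java
/** Hypothetical helper: the nine CSV columns in the order listed above (A to I). */
enum EnrollmentColumn {
    SEMESTER_ID, SEMESTER, LOCATION, DAYS, TIME,
    COURSE, COURSE_NAME, ACTUAL_ENROLLMENT, MAX_ENROLLMENT;

    /** Spreadsheet-style letter for this column: A for the first, I for the last. */
    char letter() {
        return (char) ('A' + ordinal());
    }

    /** Returns this column's field from an already split row, or null if the row is too short. */
    String from(String[] fields) {
        return ordinal() < fields.length ? fields[ordinal()] : null;
    }
}
```

With this layout, Semester is index 1, Location is index 2, and Actual Enrollment is index 7, which are the three fields the mapper reads.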
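From this CSV, I turned each row into a pair of the form [Location (without room number)_Semester, ActualEnrollment].
The mapper reads the location, keeps only the building (dropping the room number), and emits it with the actual enrollment. It skips rows with unusable data: some rows have "Unknown" or "Arr" in place of a real value.
The reducer adds up the enrollment numbers of all rows that share the same key, that is, the same building and semester.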
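The two steps above can be tried out without a Hadoop cluster. The following is a minimal sketch in plain Java that mimics the map and reduce logic in memory. The class `EnrollmentSketch`, its method names, and the sample rows are all made up for illustration, and it assumes a location looks like "Building Room" with a space in between.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for the mapper and reducer (no Hadoop needed). */
class EnrollmentSketch {

    /** Map step: returns "Building_Semester", or null when the row should be skipped. */
    static String toKey(String[] fields) {
        if (fields.length < 8) {
            return null; // row is missing the Actual Enrollment column
        }
        String semester = fields[1];
        String[] location = fields[2].split(" ");
        if (location.length == 0 || semester.equals("Unknown")
                || location[0].contains("Arr") || location[0].contains("Unknown")) {
            return null;
        }
        try {
            Integer.parseInt(fields[7]);
        } catch (NumberFormatException e) {
            return null; // enrollment is not a number
        }
        return location[0] + "_" + semester;
    }

    /** Reduce step: sums the enrollment of all rows that share a key. */
    static Map<String, Integer> totals(List<String> lines) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            String key = toKey(fields);
            if (key != null) {
                sums.merge(key, Integer.parseInt(fields[7]), Integer::sum);
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        // Made-up sample rows in the A..I column order.
        List<String> rows = Arrays.asList(
                "1,Fall,Hall 101,MWF,9:00,X100,Sample A,25,30",
                "1,Fall,Hall 202,TR,10:00,X200,Sample B,40,50",
                "1,Fall,Arr,,,X300,Sample C,10,20",
                "1,Fall,Lab 5,MWF,9:00,X400,Sample D,Unknown,30",
                "2,Spring,Hall 101,MWF,9:00,X100,Sample A,15,30");
        System.out.println(totals(rows));
    }
}
```

In this sample, the two Fall rows in "Hall" are merged into one total, while the "Arr" row and the row with a non-numeric enrollment are dropped.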
Actual code
    /*
     * Chang Min Park
     *
     * This code filters out rows where the semester name is "Unknown",
     * where the building name contains "Arr" or "Unknown" (the room number is ignored),
     * or where the enrolled number is not an actual number.
     */
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Returns true if the string parses as an integer.
        public static boolean isNumber(String string) {
            try {
                Integer.parseInt(string);
            } catch (NumberFormatException e) {
                return false;
            }
            return true;
        }

        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

            private final Text word = new Text();
            private final IntWritable enrolled = new IntWritable();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Simple comma split; assumes no field contains a comma of its own.
                String[] str = value.toString().split(",");
                // Skip short rows that lack the Actual Enrollment column (index 7).
                if (str.length < 8) {
                    return;
                }
                // Location is "building room#"; keep only the building part.
                String[] building = str[2].split(" ");
                // Strings must be compared with equals(); == compares references in Java.
                if (building.length == 0
                        || "Unknown".equals(str[1])
                        || building[0].contains("Arr")
                        || building[0].contains("Unknown")
                        || !isNumber(str[7])) {
                    return;
                }
                // Key: Building_Semester, value: actual enrollment of this row.
                word.set(building[0] + "_" + str[1]);
                enrolled.set(Integer.parseInt(str[7]));
                context.write(word, enrolled);
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                // Add up the enrollment of every row that shares this key.
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // args[0] is the input path, args[1] is the output path.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
Result
The output lists the total actual enrollment for each building and semester. This can be a real help to the university technicians who assign lecture halls.