MapReduce Programming Assignment 5

15-440 Assignments

There will be three programming projects and three written homework assignments.

  • Project 1: Distributed Password Cracker
    • Assigned: 08/28/2010
    • Due: 09/23/2010
  • Homework 1
    • Assigned: 9/8/2010
    • Due: 9/15/2010 11:59pm
    • Solutions: Homework 1 Solutions
  • Project 2: Distributed File System
    • Assigned: 9/30/2010
    • Due (Parts 1-3): 10/15/2010
    • Due (Parts 4-6): 10/28/2010
  • Homework 2
    • Assigned: 11/1/2010
    • Due: 11/10/2010 11:59pm
    • Solutions: Homework 2 Solutions
  • Project 3: Hadoop
    • Assigned (Introduction): 11/9/2010
    • Assigned (Main Project): 11/18/2010
    • Due (Introduction): 11/16/2010 11:59pm
    • Due (Main Project): 12/3/2010 11:59pm
  • Homework 3

Project 3: Hadoop MapReduce Programming

Project 2: A Distributed File System

Project 2 files can be found here:

The project rubric is as follows:

  • Part 1: 15 points (5 for LOSSY)
  • Part 2: 10 points
  • Part 3: 15 points
  • Part 4: 15 points
  • Part 5: 20 points (5 for caching improvement, 5 for LOSSY)
  • Part 6: 10 points
  • Style: 15 points

In Project 2, you will be doing your development in a Virtual Machine, and we will be supporting VirtualBox as the virtualization software. Please download VirtualBox for your system here: http://www.virtualbox.org. A brief introduction to VirtualBox for the purposes of this project can be found here: VirtualBox Intro for 15-440. You can find the Ubuntu image here: VirtualBox Ubuntu Image with FUSE support (Warning: 1.4GB file)

Project 1: A Simple Distributed Password Cracker

Using a trivially parallelizable, easy computation (brute-force cracking a password), this lab introduces students to the communication and coordination challenges involved in harnessing a cluster or wide-area distributed group of machines to accomplish a common goal.

  • Using the provided protocol description, implement the coordinator process and slave processes that run on the worker machines. Your processes must be able to inter-operate with the course-supplied example binaries.
  • Implement a reliable work assignment mechanism, operating on top of UDP, that successfully allocates jobs to slave machines, can deal with the loss or delay of UDP packets, and calculates timeouts to reschedule work that was allocated to failed machines.
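One way to calculate those reschedule timeouts (a sketch only; the course protocol may mandate a different scheme) is to keep an exponentially weighted moving average of observed round-trip times, in the style of TCP's retransmission timer. The smoothing constants and the 3-second default below are assumptions:

```java
/**
 * Sketch of a reschedule-timeout estimator for the work assignment
 * mechanism. EWMA constants mirror TCP's RTO computation; the 3000 ms
 * default used before any sample arrives is an arbitrary assumption.
 */
public class RtoEstimator {
    private double srtt = -1;   // smoothed round-trip time (ms); -1 = no sample yet
    private double rttvar = 0;  // round-trip time variation estimate (ms)
    private static final double ALPHA = 0.125, BETA = 0.25;

    /** Feed one measured round-trip time in milliseconds. */
    public void addSample(double rttMs) {
        if (srtt < 0) {                 // first sample initializes both estimates
            srtt = rttMs;
            rttvar = rttMs / 2;
        } else {
            rttvar = (1 - BETA) * rttvar + BETA * Math.abs(srtt - rttMs);
            srtt = (1 - ALPHA) * srtt + ALPHA * rttMs;
        }
    }

    /** Timeout before declaring a worker failed and rescheduling its job. */
    public double timeoutMs() {
        return srtt < 0 ? 3000 : srtt + 4 * rttvar;
    }

    public static void main(String[] args) {
        RtoEstimator est = new RtoEstimator();
        est.addSample(100);
        System.out.println(est.timeoutMs()); // 100 + 4*50 = 300.0
    }
}
```

Starting conservatively and tightening the timeout as samples accumulate avoids rescheduling work that is merely slow rather than lost.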

Project 1 files can be found here:

All homework and the first project are to be done individually. The second and third programming projects will be done in groups of two students.

The later projects are done in groups for two reasons. The first is the size of the class. The second and more important reason is that this is an opportunity to experience the joys and frustrations of working with others. It's a skill you only get better at with practice.

Since 15-440 fulfills the project-class requirement of the CS degree, you will be expected to learn and practice good software engineering, as well as demonstrate mastery of the networking concepts. Both partners in a project group will need to fully understand the project and your solution in order to do well on those exam questions relating to the projects. For example, a typical question might be: "When you implemented X, you came across a particular situation Y that required some care. Explain why this simple solution Z doesn't work and describe how you solved it." We'll pick questions such that it will take some effort to figure out Y. If you didn't take the time to work the problem yourself and just relied on your partner, you won't have enough time during the test to figure it out. Be careful, the insights you'll need will come only from actually solving the problem as opposed to just seeing the solution.

By their nature, the assignments aren't going to be completely comprehensive of everything you'll encounter in the real world or in class. To assist you, we've compiled a list of suggested study problems that you may want to do in addition to the normal homework. They're not graded, but they'd make great topics to discuss with the course staff during office hours.


Notes on the Programming Projects

A key objective of this course is to provide a significant experience with system programming, where you must write programs that are robust and that must integrate with a large, installed software base. Oftentimes, these programs are the ones that other people will build upon or use as tools. Systems programming is very different from the application program development you have done in earlier courses:

  • It is typically done in a low-level language, such as C, to ensure close control over system resources.
  • Especially with server code, it must be designed to run indefinitely. It must reliably handle every possible error condition, and it must manage resources such as memory with care.
  • It must be secure. Connecting a system to a network makes it vulnerable to malicious attacks initiated anywhere in the world. Poorly designed or implemented network software provides a common entry point for attack. System software must be invulnerable to flaws such as buffer overflows or malformed incoming messages. (This point bears repeating: Any system software must stringently check input it receives from the network or from the user. Do not trust either one! They're often out to get you.)
  • The interfaces to other parts of the system are generally specified by documented protocols.
  • Distributed systems nearly always involve concurrency, both within individual machines (multiple processes or threads) as well as among the different network components.
  • An important part of system programming is to develop comprehensive test methods for the programs. A significant effort should be invested in writing programs that will thoroughly test the system code, including the handling of different error conditions.
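To make the input-checking point concrete, here is a minimal sketch of defensive parsing for an untrusted, line-based request. The "ASSIGN start end" message format is invented for this example, not taken from any course protocol; the pattern is what matters: bound the input size, reject anything malformed, and never let bad input crash the server.

```java
import java.util.Optional;

/** Sketch of defensive parsing for an untrusted line-based request.
 *  The "ASSIGN <start> <end>" format is hypothetical. */
public class RequestParser {
    public static final class Assign {
        public final long start, end;
        Assign(long s, long e) { start = s; end = e; }
    }

    /** Returns empty rather than throwing on any malformed input. */
    public static Optional<Assign> parse(String line) {
        if (line == null || line.length() > 128) return Optional.empty(); // bound input size
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 3 || !parts[0].equals("ASSIGN")) return Optional.empty();
        try {
            long start = Long.parseLong(parts[1]);
            long end = Long.parseLong(parts[2]);
            if (start < 0 || end < start) return Optional.empty();        // semantic checks
            return Optional.of(new Assign(start, end));
        } catch (NumberFormatException e) {
            return Optional.empty();   // reject garbage numbers instead of crashing
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("ASSIGN 0 999").isPresent());  // true
        System.out.println(parse("ASSIGN 5 1").isPresent());    // false
    }
}
```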

Finally, please note that by design, the projects do not always specify every corner-case behavior or every design decision you may have to make. A major challenge in implementing real systems is in making the leap from a specification that is often slightly incomplete to a real-world implementation. Don't get frustrated -- our grading will not dock you for making reasonable design decisions! We suggest three general guidelines to follow:

  • Be conservative in what you do, be liberal in what you accept from others. This is the design guideline underlying many Internet services, stated as a robustness principle by Jon Postel in the TCP specification, RFC 793.
  • Browse the newsgroup and FAQ, ask the course staff!
  • Make a reasonable design decision and document it. In a perfect world, all aspects of a design would be completely specified, but most real-world, large, complex systems do not achieve this goal. You will often hear the course staff reply: You may pick either alternative as long as your server does not crash. This advice applies particularly to error handling, where there are a nearly infinite number of possible errors with partially-specified error responses. The goal of the course is to gain experience with creating large systems; we don't expect students to be psychic, merely to exercise good judgement about creating a robust and usable system.

We'll go into more detail about each of these points during the recitation sections. But keep in mind: The programming assignments are larger and more open-ended than in other courses. Doing a good job on the project requires more than just producing code that runs: it should have a good overall organization, be well implemented and documented, and be thoroughly tested.


Last updated: Wed Dec 01 22:31:05 -0500 2010

Homework Assignment 5

This homework will make you more familiar with the Hadoop File System (HDFS) and the Hadoop MapReduce framework. You will be provided a disk image with everything pre-installed, so hopefully the setup phase for this homework will go smoothly.

If you have one hour to spare, I suggest that you watch this lecture from UC Berkeley. It will give you a great overview of what MapReduce is and what it is good at.

Overview

In this homework you will “do big data” on some weather data from the 18th century. The data is given as many relatively small files (one for each year), so it would be easy for the tech-savvy to write a small Python program that does what we ask (perhaps a nice way to double-check your results before you hand them in ;). But since that is not very “big data”, we will instead use Hadoop MapReduce.

The input data will be one file for each year in the period 1763 to 1799. Each of these files contains one row per measurement. To make things a little tricky, there was only one weather station recording data in 1763 and at least two weather stations in 1799. Furthermore, there are at least two measurements each day, one for the maximum temperature (TMAX) and one for the minimum temperature (TMIN), and sometimes one for the precipitation (PRCP).

The content of each file looks similar to this:

  • Description of the columns:
    1. The weather station id
    2. The date in format yyyymmdd
    3. The type of measurement (for this homework we care about the maximum temperature, TMAX)
    4. The temperature in tens of degrees (e.g. -90 = -9.0 deg. C., -184 = -18.4 deg. C.)
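Based on the column description above, parsing one row might look like the sketch below. The comma separator is an assumption about the file format, so adjust the split if the real files differ:

```java
/**
 * Sketch of parsing one measurement row. The four columns are from the
 * description above; the comma separator is an assumed file format.
 */
public class Record {
    public final String station;  // weather station id
    public final String date;     // yyyymmdd
    public final String type;     // TMAX, TMIN, PRCP, ...
    public final int tenths;      // value in tens of degrees, e.g. -184 = -18.4 C

    Record(String station, String date, String type, int tenths) {
        this.station = station;
        this.date = date;
        this.type = type;
        this.tenths = tenths;
    }

    public static Record parse(String row) {
        String[] f = row.split(",");
        return new Record(f[0].trim(), f[1].trim(), f[2].trim(), Integer.parseInt(f[3].trim()));
    }

    /** Convert tens of degrees to degrees Celsius. */
    public double celsius() {
        return tenths / 10.0;
    }

    public static void main(String[] args) {
        Record r = parse("ITE00100554,17860101,TMAX,-184");
        System.out.println(r.station + " " + r.date + " " + r.celsius()); // ITE00100554 17860101 -18.4
    }
}
```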

The Goal

The goal of this assignment is to create a sorted list of the most common temperatures (where each temperature is rounded to the nearest integer) in the period 1763 to 1799. The list should be sorted from the least common temperature to the most common temperature, and it should also state how many times each temperature occurred.

You will solve this very important problem in two steps: Task 1 and Task 2.
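For intuition (and as a sanity check against your MapReduce output on small data), the whole goal can be sketched in a few lines of plain Java; the toy input in main() is made up, not taken from the real data:

```java
import java.util.*;

/**
 * In-memory sketch of the assignment's goal: count each rounded
 * temperature, then sort from least common to most common.
 */
public class CountAndSort {
    public static List<String> leastToMostCommon(List<Integer> roundedTemps) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int t : roundedTemps) counts.merge(t, 1, Integer::sum);          // occurrences per temperature
        List<Map.Entry<Integer, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(a.getValue(), b.getValue()));  // ascending count
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : entries)
            out.add(e.getValue() + " " + e.getKey());                         // "count temperature"
        return out;
    }

    public static void main(String[] args) {
        System.out.println(leastToMostCommon(Arrays.asList(7, -3, 7, 5, 7, -3)));
        // [1 5, 2 -3, 3 7]
    }
}
```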

Outline

The outline of the homework is as follows:

  1. Download and install the virtual disk image (homework5.ova)
  2. Intro to the Hadoop File System (i.e. how to add/remove directories and upload/download files)
  3. Intro to Hadoop MapReduce (i.e. how to compile, run, and check the log files of a MapReduce program)
  4. Task 1 - Count how many times each temperature occurs
  5. Task 2 - Sort the temperatures to see which is least/most common

Deadline and Deliverables

The deadline for this homework is May 12th. You will hand in your source code and all the result files.

Installing the necessary software

For this assignment everything is provided in the virtual disk image.

  1. Download the image (homework5.ova)
  2. Install it
  3. Start it
  4. Log in with the provided username and password
    • You might have to switch to the correct user first
    • Logging in remotely works fine as well
  5. Look around in the file system and make sure that you have:
    • a directory that contains all you need for the homework
    • a demo program for your inspiration/guide
    • a code skeleton for Task 1
    • a code skeleton for Task 2
    • a directory that contains all the input data you need

Get to know Hadoop File System (HDFS)

It’s now time to start playing around a little bit with HDFS! You will notice that it is almost exactly like navigating in a regular UNIX environment - just a zillion times slower…

First things first, you might have to start up HDFS and YARN:

  • Start HDFS

  • Test to see if HDFS is up and running:

  • Start YARN

Some basic HDFS commands:

Create a text file and add some content to it, then continue

A complete list of all the commands!
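The concrete commands did not survive on this page. For reference, the standard Hadoop commands for the steps above look like the following cheat-sheet (paths and filenames are illustrative, and the start scripts may live elsewhere in your image):

```
start-dfs.sh                              # start the HDFS daemons
jps                                       # check that NameNode/DataNode are running
start-yarn.sh                             # start YARN

hdfs dfs -mkdir -p input                  # create a directory in HDFS
hdfs dfs -put myfile.txt input/           # upload a local file
hdfs dfs -ls input                        # list a directory
hdfs dfs -cat input/myfile.txt            # print a file's content
hdfs dfs -get output/part-r-00000 .       # download a file
hdfs dfs -rm -r output                    # remove a directory recursively
```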

Get to know Mapreduce - our demo program

We will try to introduce MapReduce a bit through a demo, using the same weather data as you will use in Task 1 and Task 2. If you need to brush up on what MapReduce is, I suggest that you take a look at the Wikipedia page as well as the Hadoop MapReduce Tutorial.

Overview

The goal for this task is to take all the original weather data for the period 1763 to 1799 and run your own mapreduce program on it. The output of the mapreduce job should be a file for each weather station. Within these files you should list the days and temperatures that this weather station recorded. An overview can be seen below:

WeatherJob.java

This is the main class of the MapReduce program. In it you specify which mapper and reducer will be used. You will also specify the input and output format for the mapper and reducer.

You will not have to modify any of the Job-files that we’ve given you!

Documentation for the Job class.

WeatherMapper.java

  • Input
    • (key, value) = (input file, row of text)
      • The key identifies the input data file
      • The value is the text data within these files
      • The mapper will be fed this text data row by row
    • The input file might contain multiple weather stations
    • The temperature is given in tens of degrees
      • e.g. -123 = -12.3 degrees Celsius
    • We only care about the maximum temperature for each day (TMAX)
  • Output

    • (key, value) = (weather-station-id, “date,temp”)
    • “date,temp” should be submitted as one string
    • Note that the key and value are of type Text! Documentation for Text
  • Please take a moment to study the mapper-code!
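The per-row logic of the demo mapper can be sketched as a plain function. The real WeatherMapper wraps equivalent logic in Hadoop's Mapper API with Text keys and values; the comma-separated row format is an assumption here:

```java
import java.util.Optional;

/**
 * Sketch of the demo mapper's per-row logic, outside the Hadoop API.
 * Assumes comma-separated rows: station,date,type,value-in-tenths.
 */
public class DemoMapLogic {
    /** Returns the (key, value) pair for TMAX rows, empty otherwise. */
    public static Optional<String[]> map(String row) {
        String[] f = row.split(",");
        if (f.length < 4 || !f[2].equals("TMAX")) return Optional.empty(); // skip TMIN/PRCP rows
        double temp = Integer.parseInt(f[3]) / 10.0;                       // tens of degrees -> Celsius
        return Optional.of(new String[] { f[0], f[1] + "," + temp });      // (station, "date,temp")
    }

    public static void main(String[] args) {
        map("ITE00100554,17860101,TMAX,73")
            .ifPresent(kv -> System.out.println(kv[0] + " -> " + kv[1]));
        // ITE00100554 -> 17860101,7.3
    }
}
```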

Documentation for the Mapper class.

WeatherReducer.java

The specification of the reducer is as follows:

  • Input
    • (key, value) = (weather-station-id, “date,temp”)
    • This is the output from the mapper
  • Output

    • One output file per key (i.e. one output file per weather station)
    • In these files you should list:
      • the date of each measurement
      • the temperature that was recorded
      • For one such output file this would be:

        17860108 -3.1
        17860107 -3.1
        17860102 5.0
        17860101 7.3

    • To write to an output file you can use the following method:

    • where the key, value, and filename might be:
      • key = “17860101”, value = “7.3”, filename = “ITE00100554”
    • The key and value are of type Text - Documentation for Text
  • Please take a moment to study the reducer-code!

Documentation for the Reducer class.

Compile and run the program

Follow this guide to compile and run the mapreduce demo:

  1. Make sure that HDFS and YARN are up and running, otherwise start them
  2. Upload some new input data:

  3. Compile the program (make sure you’re in the correct directory)

  4. Turn it into a runnable jar-file:

  5. Run it!

    • For this demo we have:
    • jar.file =
    • Main-method =
    • Input path (in HDFS) =
    • Output path (in HDFS) =
    • Note that the output path cannot be an existing directory!

  6. Look at what it prints:

    • It should print something like this:

    • Open up the web browser in the virtual machine and enter the URL it gave you.

      • In this example the URL was:
      • http://ubuntu1410adminuser:8088/proxy/application_1430155904361_0001/
    • You can also view the log-files in the file browser:

      • In this example it will be located in:

  7. Download and view the results:

    • Now you should be able to see all the output files from the MapReduce demo!

Task 1

The goal for this task is to take all the original weather data for the period 1763 to 1799 and run your own mapreduce program on it. The output of the mapreduce job should be a file for each degree of temperature that was present within this period. Within these files you should list the days (and for which weather station) this temperature was recorded. Furthermore, at the end of each file there should be a summation of the total number of times that this temperature occurred. An overview can be seen below:

Mapper - specification

The specification of the mapper is as follows:

  • Input
    • (key, value) = (input file, row of text)
      • The key identifies the input data file
      • The value is the text data within these files
      • The mapper will be fed this text data row by row
    • The input file might contain multiple weather stations
    • The temperature is given in tens of degrees
      • e.g. -123 = -12.3 degrees Celsius
    • We only care about the maximum temperature for each day (TMAX)
  • Output
    • (key, value) = (temp, “date,weather-station-id”)
    • temp should be in degrees Celsius and rounded to the nearest integer
      • e.g. -123 => -12.3 => -12, 247 => 24.7 => 25
    • “date,weather-station-id” should be submitted as one string
    • Note that the key and value have to be of type Text! Documentation for Text
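The rounding in the examples above matches Java's Math.round, which rounds halves upward. A tiny sketch of the key computation:

```java
/**
 * Sketch of the Task 1 key computation: tens of degrees -> nearest whole
 * degree Celsius. Math.round rounds halves up, which matches the examples
 * given above; check with the course staff if ties should round differently.
 */
public class RoundTemp {
    public static int roundTenths(int tenths) {
        return (int) Math.round(tenths / 10.0);
    }

    public static void main(String[] args) {
        System.out.println(roundTenths(-123)); // -12
        System.out.println(roundTenths(247));  // 25
    }
}
```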

Reducer - specification

The specification of the reducer is as follows:

  • Input
    • (key, value) = (temp, “date,weather-station-id”)
    • This is the output from the mapper
  • Output

    • One output file per key (i.e. one output file per temperature)
    • In these files you should list:
      • the dates when this temperature occurred
      • the weather station that recorded it
      • a summation of the total number of occurrences
      • For the output file for temperature -16 this would be:

        17840104 EZE00100082
        17881228 EZE00100082
        17881218 EZE00100082
        17890104 EZE00100082
        17760120 EZE00100082
        SUM 5

    • To write to an output file you can use the following method:

    • where the key, value, and filename might be:
      • key = “17840104”, value = “EZE00100082”, filename = “-16”
      • or, key = “SUM”, value = 5, filename = “-16”
    • The key and value are of type Text - Documentation for Text
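Stripped of the Hadoop API, the reducer's per-key work can be sketched as: collect every “date,weather-station-id” value for one temperature, emit one line per occurrence, and finish with the SUM line:

```java
import java.util.*;

/**
 * Sketch of the Task 1 reducer's per-key logic as a plain function. The
 * real reducer does the same with Text values and a per-key output file.
 */
public class Task1ReduceLogic {
    public static List<String> reduce(Iterable<String> dateStationPairs) {
        List<String> lines = new ArrayList<>();
        int count = 0;
        for (String pair : dateStationPairs) {
            lines.add(pair.replace(",", " "));  // "date,station" -> "date station"
            count++;
        }
        lines.add("SUM " + count);              // total occurrences of this temperature
        return lines;
    }

    public static void main(String[] args) {
        reduce(Arrays.asList("17840104,EZE00100082", "17881228,EZE00100082"))
            .forEach(System.out::println);
    }
}
```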

How to run it?

Example of how you can compile and run the program:

  1. Make sure that HDFS and YARN are up and running, otherwise start them
  2. Compile the program (make sure you’re in the correct directory)

  3. Turn it into a runnable jar-file:

  4. Run it!

    • For this demo we have:
    • jar.file =
    • Main-method =
    • Input path (in HDFS) =
    • Output path (in HDFS) =
    • Note that the output path cannot be an existing directory!

Task 2

The goal for this task is to take the output from the previous task and use it as input for this task. The output of this task should be just one file. In this file you should have a sorted list showing which temperature was most common and how many times each temperature occurred. An overview can be seen below:

Mapper - specification

The specification of the mapper is as follows:

  • Input

    • (key, value) = (input file, row of text)
      • The key identifies the input data file produced in Task 1
      • The value is the text data within these files
      • The mapper will be fed this text data row by row
    • To extract the filename of the input file you can use:

  • Output
    • (key, value) = (numOccurrences, “temp”)

Reducer - specification

The specification of the reducer is as follows:

  • Input
    • (key, value) = (numOccurrences, “temp”)
    • This is the output from the mapper
  • Output

    • One output file which contains the temperatures sorted by how common they are
    • Least common temperature first
    • It should also show how many times each temperature occurred
    • Example:

    • When you only want to write to one output file you can use:

    • where the key and value might be:
      • key = 3, value = “35”
    • Note that the key and value have to be of type Text! Documentation for Text
    • To convert between these types you can use:
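Why does Task 2 come out sorted? The framework sorts mapper output by key before it reaches the reducer, so keying on the occurrence count gives least-common-first order for free. A plain-Java simulation of that shuffle-phase behavior:

```java
import java.util.*;

/**
 * Sketch of why Task 2's output is sorted: the shuffle phase sorts mapper
 * output by key, simulated here with a TreeMap keyed on occurrence count.
 */
public class Task2SortLogic {
    /** counts: temperature -> number of occurrences (from Task 1's SUM lines). */
    public static List<String> leastCommonFirst(Map<String, Integer> counts) {
        TreeMap<Integer, List<String>> byCount = new TreeMap<>();   // keys kept in ascending order
        counts.forEach((temp, n) ->
            byCount.computeIfAbsent(n, k -> new ArrayList<>()).add(temp));
        List<String> out = new ArrayList<>();
        byCount.forEach((n, temps) -> {
            Collections.sort(temps);                                // deterministic order for ties
            for (String t : temps) out.add(n + " " + t);
        });
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("-16", 5); counts.put("35", 3); counts.put("7", 12);
        leastCommonFirst(counts).forEach(System.out::println);
        // 3 35
        // 5 -16
        // 12 7
    }
}
```

One caveat worth knowing: if the count is emitted as Text, Hadoop sorts it lexicographically ("10" before "2"); emitting a numeric key type such as IntWritable is the usual way to get true numeric order.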

How to run it?

Example of how you can compile and run the program:

  1. Make sure that HDFS and YARN are up and running, otherwise start them
  2. Compile the program (make sure you’re in the correct directory)

  3. Turn it into a runnable jar-file:

  4. Run it!

    • For this demo we have:
    • jar.file =
    • Main-method =
    • Input path (in HDFS) =
    • Output path (in HDFS) =
    • Note that the output path cannot be an existing directory!


