EECS 349 Problem Set 2

Due 11:59PM Thursday, October 21

v2.3 Updated Mon Oct 18 15:47:43 CDT 2010


Overview

In this assignment you will implement a decision tree learning algorithm, including a pruning strategy, and apply it to a synthetic dataset. You will be given labeled training data from which to build a model, labeled validation data on which to report the model's performance, and individualized unlabeled test data for which to generate predictions.

Submission Instructions

Here is how James Bond would submit the homework. Please adjust for your own name:

  1. Create a single text or PDF file with your answers to the questions below. Name this file PS2-James-Bond.txt or PS2-James-Bond.pdf.
  2. Create a directory (i.e. a folder) named PS2-James-Bond.code that contains your source code.
  3. Create a file named README that explains how to build and run your code.
  4. Get the test file corresponding to your name (see download instructions below). For James Bond this file would be named James-Bond.unlabeled. Run your code on this test file and output a file in the same format, but with your predicted labels in the last column. Name this file PS2-James-Bond.csv.
  5. Create a ZIP file named PS2-James-Bond.zip containing the answers file (PS2-James-Bond.txt or PS2-James-Bond.pdf), the README, the code directory PS2-James-Bond.code, and the predictions file PS2-James-Bond.csv.
  6. Ensure that the ZIP file contains all of your source code. You may have to tell the ZIP utility explicitly to include the contents of the subdirectory containing your code.
  7. Compose an email with "EECS349-PS2-James-Bond" in the subject line. You may leave the body empty. Attach PS2-James-Bond.zip to the email and send the email to both of the following addresses:

Download the Dataset

The dataset files are here:

This dataset is based on:

R. Agrawal, T. Imielinski, and A. Swami (1993). Database Mining: A Performance Perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6):914-925.

The dataset is from a synthesized (and therefore fictitious) people database where each person has the following attributes:

The class label is given by the group attribute. This is a binary classification problem with numeric and nominal attributes. Some attribute values are missing (as might happen in a real-world scenario). These values are indicated by a "?" in the file. In the test files the class labels are missing, and these missing labels are also indicated by a "?". The test sets are all drawn from the same distribution as the training and validation sets.

If you want, you can imagine that the task is to predict whether a loan application by the given person will be approved or denied. However, for this assignment it is not necessary (or even useful) to interpret the task or the attributes.

Implementation

For this assignment you will implement a decision tree algorithm in the language of your choice. In particular, you should not use Weka or any other existing framework for generating decision trees. You are free to choose how your algorithm works. Your program must be able to:

  1. Read the training data file and generate a decision tree model.
  2. Output the generated decision tree in disjunctive normal form (one approach is sketched just after this list).
  3. Read the validation data file and report the accuracy of the model on that data (i.e. the percentage of the validation data that was classified correctly).
  4. Read a test data file with missing labels (question marks) in the last column and output a copy of that file with predicted labels in the last column (replacing the question marks).
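
For requirement 2, one simple approach is to enumerate every root-to-leaf path that ends in a positive leaf: each such path becomes one conjunction, and the formula is the disjunction of those conjunctions. A minimal sketch in Python, assuming a purely illustrative nested tree representation (a leaf is an int class label; an internal node is a list of (test, subtree) pairs):

    # Sketch: one conjunction per path from the root to a positive leaf.
    # Assumed (illustrative) representation: a leaf is a class label 0 or 1;
    # an internal node is a list of (test_string, subtree) pairs, e.g.
    # [("age < 40", left_subtree), ("age >= 40", right_subtree)].
    def dnf_terms(tree, path=()):
        if isinstance(tree, int):                 # leaf
            return [" AND ".join(path)] if tree == 1 else []
        terms = []
        for test, subtree in tree:
            terms.extend(dnf_terms(subtree, path + (test,)))
        return terms

    # Usage: print(" OR\n".join("(" + t + ")" for t in dnf_terms(tree)))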

Note: your algorithm must handle missing attribute values.
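
One minimal strategy for missing values (a sketch, not the required method) is to impute each "?" from the training data: the most common value for a nominal attribute, the mean for a numeric one. C4.5-style fractional instances are a more sophisticated alternative. The function and variable names below are assumptions:

    from collections import Counter

    # Sketch: replace "?" with the training-set mode (nominal) or mean
    # (numeric). `examples` is a list of dicts mapping attribute name to
    # value; `numeric` is a set naming the numeric attributes; numeric
    # fields are assumed to be floats already (see the reader sketch below).
    def impute_missing(examples, numeric):
        for a in examples[0]:
            known = [ex[a] for ex in examples if ex[a] != "?"]
            if a in numeric:
                fill = sum(known) / len(known)              # mean
            else:
                fill = Counter(known).most_common(1)[0][0]  # mode
            for ex in examples:
                if ex[a] == "?":
                    ex[a] = fill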

A Note About Design

The data files are provided to you in CSV format so that it will be easier for you to read them in. One drawback of the CSV format is that it does not contain metadata (as ARFF does, for example). This means that it is not possible from the data alone to know which attributes are nominal and which are numeric. For example, zipcode and car are actually nominal attributes that are represented as integers, as described above. Therefore you need to represent this information somewhere. You can either put this information directly in the code that reads in the input files, or you can generate a metadata file of your own and write code that interprets the input file based on the contents of the metadata file.
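
For instance, one possible metadata scheme (a sketch; the file layout, file names, and function names are assumptions, not part of the assignment) keeps one "name,type" line per attribute and uses it to coerce CSV fields on the way in:

    import csv

    # Sketch: metadata.csv holds one line per attribute, for example
    #     salary,numeric
    #     car,nominal
    # (the attribute names and file name are illustrative).
    def load_metadata(path):
        with open(path) as f:
            return dict(csv.reader(f))   # name -> "numeric" or "nominal"

    # Assumes the data CSV has no header row and that `names` lists the
    # attribute names in column order, ending with the class label.
    def load_examples(path, names, meta):
        examples = []
        with open(path) as f:
            for row in csv.reader(f):
                ex = dict(zip(names, row))
                for a, v in ex.items():
                    if meta.get(a) == "numeric" and v != "?":
                        ex[a] = float(v)
                examples.append(ex)
        return examples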

Regardless of how you translate the input file into an internal representation, write your decision tree algorithm to handle a general binary classification problem; it should work unchanged on another binary classification problem with a different mix of numeric and nominal attributes. For example, the algorithm should not assume that each example contains exactly 12 attributes, nor that there is an attribute named "elevel" with 5 categories.
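
Concretely, one way to keep the tree code dataset-agnostic (a sketch; the class and field names are illustrative) is to store at each internal node only an attribute name plus a split that is interpreted according to the attribute's declared type:

    from dataclasses import dataclass
    from typing import Optional

    # Sketch: a generic node. A numeric split stores a threshold with two
    # children keyed "lt"/"ge"; a nominal split has one child per category.
    # A leaf stores only a class label. No attribute names are hardcoded.
    @dataclass
    class Node:
        attribute: Optional[str] = None     # None at a leaf
        threshold: Optional[float] = None   # set only for numeric splits
        children: Optional[dict] = None     # test outcome -> child Node
        label: Optional[int] = None         # set only at a leaf

    def classify(node, example):
        while node.attribute is not None:
            v = example[node.attribute]
            if node.threshold is not None:               # numeric test
                node = node.children["lt" if v < node.threshold else "ge"]
            else:                                        # nominal test
                node = node.children[v]
        return node.label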

Pruning

Add a pruning strategy to your decision tree algorithm. You are free to choose the pruning strategy.

Be sure you can run your algorithm both with and without pruning.
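
One common choice is reduced-error pruning: walk the tree bottom-up and replace an internal node with a leaf predicting its majority class whenever that does not hurt accuracy on the validation set. A sketch, assuming the illustrative Node above plus a node.majority field (the majority class of the training examples that reached the node) and an accuracy(root, examples) helper:

    # Sketch of reduced-error pruning (one option, not the required one).
    # Assumes Node as sketched earlier, plus:
    #   node.majority        majority class at the node (from training data)
    #   accuracy(root, ex)   fraction of `ex` that `root` classifies correctly
    def prune(node, root, validation):
        if node.children is None:                  # leaf: nothing to prune
            return
        for child in node.children.values():       # prune children first
            prune(child, root, validation)
        before = accuracy(root, validation)
        saved = (node.attribute, node.threshold, node.children)
        node.attribute = node.threshold = node.children = None
        node.label = node.majority                 # try collapsing to a leaf
        if accuracy(root, validation) < before:    # worse: undo the collapse
            node.attribute, node.threshold, node.children = saved
            node.label = None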

Common-Sense Guidelines

  1. Write your program so that you do not have to modify code when switching from one task to another or when turning pruning on or off. For example, you might use command-line parameters to enable or disable pruning and to distinguish between the model generation task, the validation task, etc. (see the sketch after this list). An acceptable alternative is to follow the style of LIBSVM and have separate programs for each task, e.g. model-train, model-validate, model-predict.
  2. Do not hardcode the names of input or output files in your program. It should be possible to run your program on another input file.
  3. Document the usage of your program in the README.
  4. While it is not required for this assignment, you may find it useful to have your program be able to output the generated decision tree in a human-readable format similar to that produced by J48 in Weka.
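
For instance, a command-line interface along the following lines (a sketch only; every flag name here is made up) switches tasks and toggles pruning without code changes:

    import argparse

    # Sketch: one executable, task selected by flags (all names illustrative).
    parser = argparse.ArgumentParser(description="decision tree learner")
    parser.add_argument("--train", metavar="CSV", help="build a model from this training file")
    parser.add_argument("--validate", metavar="CSV", help="report accuracy on this labeled file")
    parser.add_argument("--predict", metavar="CSV", help="predict labels for this test file")
    parser.add_argument("--out", metavar="CSV", help="where to write predictions")
    parser.add_argument("--prune", action="store_true", help="enable pruning")
    args = parser.parse_args()

    # e.g.: python dtree.py --train train.csv --validate validate.csv --prune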

Questions

Put answers to the following questions in a text or PDF file, as described in the submission instructions.

Answer concisely. You may include pseudocode or short fragments of actual code if it helps to answer the question. However, please keep the answer document self-contained. It should not be necessary to look at your source files to understand your answers.
  1. How did you represent the decision tree in your code?
  2. How did you represent examples (instances) in your code?
  3. How did you choose the attribute for each node?
  4. How did you handle missing attribute values in examples?
  5. What is the termination criterion for your learning process?
  6. Apply your algorithm to the training set, without pruning. Print out a Boolean formula in disjunctive normal form that corresponds to the unpruned tree learned from the training set. For the DNF assume that group label "1" refers to the positive examples.
  7. Explain in English one of the rules in this (unpruned) tree.
  8. How did you implement pruning?
  9. Apply your algorithm to the training set, with pruning. Print out a Boolean formula in disjunctive normal form that corresponds to the pruned tree learned from the training set.
  10. What is the difference in size between the pruned and unpruned trees?
  11. Test the unpruned and pruned trees on the validation set. What are the accuracies of each tree? Explain the difference, if any.
  12. Which tree do you think will perform better on the unlabeled test set? Why? Run this tree on the test file and submit your predictions as described in the submission instructions.

Grading Breakdown

This assignment is worth 15 points, broken down as follows:

It is possible to get up to 12 points of credit without implementing pruning. (If you do not implement pruning, Questions 11-12 can still receive full credit based on the output of the algorithm without pruning.)


Version History
1.0 Wed Oct 6 02:40:38 CDT 2010 Initial version.
2.0 Sun Oct 10 20:33:15 CDT 2010 Add individual test files. Add point breakdown for grading. Clarify implementation.
2.1 Tue Oct 12 17:19:36 CDT 2010 Clarify that algorithm only needs to handle binary classification.
2.2 Wed Oct 13 14:16:12 CDT 2010 Change due date to Thu Oct 21.
2.3 Mon Oct 18 15:47:43 CDT 2010 Clarify DNF.