EECS 349 Problem Set 3

Due 11:59PM Thursday, November 4

v1.1 Thu Nov 4 18:14:13 CDT 2010

Instructions

Answer clearly and concisely. Some questions ask you to "describe a data set." These data sets can be completely abstract creations. Often it will help to draw a picture. Your argument doesn't have to be rigorously formal but it should be convincing. To give you an idea of what we're looking for, consider the following sample question:

Sample Question: With continuous attributes, nearest-neighbor sometimes outperforms decision trees. Describe a data set in which nearest neighbor is likely to outperform decision trees.

Sample Answer: Consider a data set with two continuous attributes x1 and x2 that lie between 0 and 1, where the target function is "return 1 if x2 > x1, and 0 otherwise." Decision trees must attempt to approximate the separating line x1 = x2 using axis-parallel lines (a "stair-step" function), which will require many distinct splits. Thus, decision trees will be inefficient at both training and test time, and could be inaccurate if there isn't enough data to generate enough splits to approximate the separating line x1 = x2 well. On the other hand, the lines of the Voronoi diagram in nearest neighbor can be parallel or nearly parallel to the separating line x1 = x2, so with a reasonable number of training examples we would expect NN to approximate the target function well.

This sample answer is not mathematically precise, but it is plausible and demonstrates that the writer knows the key concepts about each approach.
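
The intuition in the sample answer can also be checked empirically. The sketch below is only an illustration, not part of the expected answer: it draws points uniformly from the unit square, labels them with the sample target function, and compares 1-nearest neighbor against the best single axis-parallel split. The single split is a deliberately crude stand-in for a decision tree, and the sample sizes, the random seed, and all function names are assumptions made for this example.

```python
import random

def label(x1, x2):
    # Target function from the sample answer: 1 if x2 > x1, else 0.
    return 1 if x2 > x1 else 0

def make_data(n, rng):
    # n labeled points drawn uniformly from the unit square.
    points = [(rng.random(), rng.random()) for _ in range(n)]
    return [(p, label(*p)) for p in points]

def nn_predict(train, p):
    # 1-nearest neighbor under squared Euclidean distance.
    nearest = min(
        train,
        key=lambda ex: (ex[0][0] - p[0]) ** 2 + (ex[0][1] - p[1]) ** 2,
    )
    return nearest[1]

def fit_stump(train):
    # Exhaustively pick the single axis-parallel split with the best
    # training accuracy. Returns (attribute index, threshold, label
    # predicted at or below the threshold).
    best = None
    for attr in (0, 1):
        for (p, _) in train:
            t = p[attr]
            for below in (0, 1):
                correct = sum(
                    1 for (q, y) in train
                    if (below if q[attr] <= t else 1 - below) == y
                )
                if best is None or correct > best[0]:
                    best = (correct, attr, t, below)
    return best[1:]

def stump_predict(stump, p):
    attr, t, below = stump
    return below if p[attr] <= t else 1 - below

def accuracy(predict, test):
    return sum(1 for (p, y) in test if predict(p) == y) / len(test)

rng = random.Random(0)          # fixed seed so runs are repeatable
train = make_data(200, rng)     # assumed training-set size
test = make_data(1000, rng)     # assumed test-set size
stump = fit_stump(train)
nn_acc = accuracy(lambda p: nn_predict(train, p), test)
stump_acc = accuracy(lambda p: stump_predict(stump, p), test)
print(f"1-NN accuracy: {nn_acc:.3f}, single-split accuracy: {stump_acc:.3f}")
```

Under the uniform distribution assumed here, no single axis-parallel split can be right on more than 75% of the square, which is why a tree needs many splits to trace the diagonal. A full decision tree would do better than this one-split stand-in, at the cost of the extra splits the sample answer describes.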

How to Submit

Create a single text or PDF file with your answers to the questions below. Attach it in an email with the subject "EECS349-PS3-<first-name>-<last-name>" and send the email to both of these addresses:

zhiyaoduan00 at gmail dot com
arefinhuq2013 at u dot northwestern dot edu

Questions

The questions are worth 15 points total, plus 1 point of extra credit.

Experimental Design.
1. Your boss says "I'd like to know why our customers sometimes leave our Web site after visiting only three or four pages. Can you use your fancy machine learning to predict when they'll leave?" Assume the data you have available are logs of several weeks of activity on your Web site. The logs include millions of triples of the form <IP address of user, URL accessed, timestamp>. State precisely the target function you would attempt to learn and three features you would start with. (3 points)
2. If you have a choice between decision trees and nearest-neighbor for this task, which would you use, and why? (1 point)
Give an example of a problem in which 3-nearest neighbors performs better than 1-nearest neighbor. (3 points)
Genetic Algorithms vs. Simulated Annealing.
1. What's the key ingredient in GAs that distinguishes them from simulated annealing with beam search? (0.5 points)
2. Give an example of an optimization problem where simulated annealing with beam search outperforms genetic algorithms, and another where GAs outperform simulated annealing with beam search. (2.5 points)
Give a data set with three examples that fall into two natural clusters, such that both of the following properties hold:
1. hierarchical clustering, at the point when it has two clusters, always has the right clusters, and
2. sequential clustering with a limit of q=2 clusters and Θ=1.1 will output the right clusters for one example ordering, but not for some other example ordering.
Assume both algorithms measure distance from an example x to a cluster C as the distance from x to the nearest example in C. (3 points)
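
To make the distance convention above concrete, here is a minimal sketch of the nearest-example distance together with one common form of sequential clustering. It assumes the usual basic scheme: an example starts a new cluster only if it is farther than Θ from every existing cluster and fewer than q clusters exist; otherwise it joins the nearest cluster. The version covered in class may differ in its details. The function names and the four-point, one-dimensional data set are hypothetical, and the data are chosen so that ordering does not matter, so they are not an answer to the question.

```python
def dist_to_cluster(x, cluster):
    # Distance from x to the nearest example in the cluster
    # (examples are 1-D numbers here for simplicity).
    return min(abs(x - c) for c in cluster)

def sequential_clustering(examples, q, theta):
    # Process examples in the given order. An example starts a new cluster
    # only when it is farther than theta from every existing cluster and
    # fewer than q clusters exist; otherwise it joins the nearest cluster.
    clusters = []
    for x in examples:
        if not clusters:
            clusters.append([x])
            continue
        nearest = min(clusters, key=lambda c: dist_to_cluster(x, c))
        if dist_to_cluster(x, nearest) > theta and len(clusters) < q:
            clusters.append([x])
        else:
            nearest.append(x)
    return clusters

def as_partition(clusters):
    # Order-independent view of a clustering, for easy comparison.
    return sorted(sorted(c) for c in clusters)

# Two orderings of the same hypothetical data set.
a = sequential_clustering([0.0, 0.5, 5.0, 5.5], q=2, theta=1.1)
b = sequential_clustering([5.5, 0.0, 5.0, 0.5], q=2, theta=1.1)
print(as_partition(a))
print(as_partition(b))
```

Both orderings yield the same two clusters for this data because the gap between the groups is far larger than Θ while the gaps within each group are smaller than Θ.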
Clustering Techniques Compared.
1. What is the key difference between k-means clustering and the other clustering techniques we discussed (sequential and greedy hierarchical clustering) that makes k-means less applicable to examples with nominal attributes? (1 point)
2. Can you devise a way to adapt part of k-means to overcome this difficulty? (1 point)
(Extra Credit) Consider a k-means algorithm that breaks ties arbitrarily when reassigning points to clusters. Show that if the choices are particularly bad, this can make k-means run forever. (Hint: You can show this with as few as four examples and two clusters.) What's a simple change to k-means that fixes this problem? (1 point extra credit)

Version History
1.0	Mon Oct 25 21:08:52 CDT 2010	Initial version.
1.1	Thu Nov 4 18:14:13 CDT 2010	Fix small typo.