EECS 349 Problem Set 3
Due 11:59PM Thursday, November 4
v1.1
Thu Nov 4 18:14:13 CDT 2010
Instructions
Answer clearly and concisely. Some questions ask you to "describe a data
set." These data sets can be completely abstract creations. Often it
will help to draw a picture. Your argument doesn't have to be rigorously
formal but it should be convincing. To give you an idea of what we're looking
for, consider the following sample question:
Sample Question: With continuous attributes, nearest-neighbor
sometimes outperforms decision trees. Describe a data set in which nearest
neighbor is likely to outperform decision trees.
Sample Answer: Consider a data set with two continuous attributes x1
and x2 which lie between 0 and 1, where the target function is "return 1
if x2 > x1, and 0 otherwise." Decision trees must attempt to
approximate the separating line x1 = x2 using axis-parallel lines (a
"stair-step" function), which will require many distinct splits.
Thus, decision trees will be inefficient at both training and test time, and
could be inaccurate if there isn't enough data to generate enough splits to
approximate the separating line x1=x2 well. On the other hand, the lines of
the Voronoi diagram in nearest neighbor can be parallel or nearly parallel to
the separating line x1 = x2, so with a reasonable number of training examples
we would expect NN to approximate the target function well.
This sample answer is not mathematically precise,
but it is plausible and demonstrates that the writer knows the key
concepts about each approach.
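The sample answer can be checked empirically with a small experiment. The following is a minimal sketch, not part of the assignment: the sample sizes, the random seed, the Gini-based greedy tree, and all function names are assumptions chosen for illustration. It compares 1-nearest-neighbor against depth-limited trees with axis-parallel splits on the target "return 1 if x2 > x1."

```python
import random

def label(x1, x2):
    # Target function from the sample answer.
    return 1 if x2 > x1 else 0

def make_data(n, rng):
    pts = [(rng.random(), rng.random()) for _ in range(n)]
    return [(p, label(*p)) for p in pts]

def nn_predict(train, p):
    # 1-nearest neighbor under squared Euclidean distance.
    nearest = min(train, key=lambda ex: (ex[0][0] - p[0]) ** 2 + (ex[0][1] - p[1]) ** 2)
    return nearest[1]

def gini(examples):
    if not examples:
        return 0.0
    frac = sum(y for _, y in examples) / len(examples)
    return 2 * frac * (1 - frac)

def majority(examples):
    ones = sum(y for _, y in examples)
    return 1 if 2 * ones >= len(examples) else 0

def build_tree(examples, depth):
    # Greedy tree with axis-parallel threshold splits (illustrative, not ID3/C4.5).
    if depth == 0 or gini(examples) == 0.0:
        return majority(examples)
    best = None
    for axis in (0, 1):
        values = sorted({x[axis] for x, _ in examples})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [e for e in examples if e[0][axis] <= t]
            right = [e for e in examples if e[0][axis] > t]
            if not left or not right:
                continue
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, axis, t, left, right)
    if best is None:
        return majority(examples)
    _, axis, t, left, right = best
    return (axis, t, build_tree(left, depth - 1), build_tree(right, depth - 1))

def tree_predict(tree, p):
    while isinstance(tree, tuple):
        axis, t, left, right = tree
        tree = left if p[axis] <= t else right
    return tree

def accuracy(predict, test):
    return sum(predict(p) == y for p, y in test) / len(test)

rng = random.Random(0)          # fixed seed so the run is repeatable
train = make_data(200, rng)
test = make_data(1000, rng)

nn_acc = accuracy(lambda p: nn_predict(train, p), test)
trees = {d: build_tree(train, d) for d in (1, 2, 4, 8)}
tree_acc = {d: accuracy(lambda p, t=t: tree_predict(t, p), test) for d, t in trees.items()}

print("1-NN accuracy:", nn_acc)
for d in sorted(tree_acc):
    print("tree depth", d, "accuracy:", tree_acc[d])
```

Under these assumptions, shallow trees have only a few "stair steps" with which to approximate the diagonal, so their accuracy should improve as the depth limit grows, which is the effect the sample answer describes.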
How to Submit
Create a single text or PDF file with your answers to the questions below.
Attach it in an email with the subject
"EECS349-PS3-<first-name>-<last-name>" and send the email
to both of these addresses:
zhiyaoduan00 at gmail dot com
arefinhuq2013 at u dot northwestern dot edu
Questions
The questions are worth 15 points total, plus 1 point of extra credit.
- Experimental Design.
- Your boss says "I'd like to know why our customers
sometimes leave our Web site after visiting only three or four pages. Can you
use your fancy machine learning to predict when they'll leave?"
Assume the data you have available are logs of several weeks of activity on
your Web site. The logs include millions of triples of the form
<IP address of user, URL accessed, timestamp>.
State precisely the target function you would attempt to learn and three
features you would start with. (3 points)
- If you have a choice between decision trees and nearest-neighbor for this
task, which would you use, and why? (1 point)
- Give an example of a problem in which 3-nearest neighbors performs
better than 1-nearest neighbor. (3 points)
- Genetic Algorithms vs. Simulated Annealing.
- What's the key ingredient in GAs that distinguishes them from simulated
annealing with beam search? (0.5 points)
- Give an example of an optimization problem where simulated annealing with
beam search outperforms genetic algorithms, and another where GAs outperform
simulated annealing with beam search. (2.5 points)
- Give a data set with three examples that fall into two natural
clusters, such that both of the following properties hold:
- hierarchical clustering, at the point when it has two clusters,
always has the right clusters, and
- sequential clustering with a limit of q=2 clusters and Θ=1.1
will output the right clusters for one example ordering, but not for some
other example ordering.
Assume both algorithms measure distance from an example x to a cluster
C as the distance from x to the nearest example in C.
(3 points)
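For reference, the sequential clustering procedure can be sketched as follows. This is a minimal sketch under stated assumptions, not the course's official definition: it assumes the common scheme in which each example either joins its nearest existing cluster or, if that cluster is farther than Θ and fewer than q clusters exist, starts a new one. The function names and the one-dimensional example points are hypothetical. Distance to a cluster is the distance to its nearest member, as stated in the question.

```python
def dist_to_cluster(x, cluster):
    # Distance from x to the nearest example already in the cluster.
    return min(abs(x - y) for y in cluster)

def sequential_clustering(examples, q, theta):
    # Process examples in the given order; the result can depend on that order.
    clusters = []
    for x in examples:
        if not clusters:
            clusters.append([x])
            continue
        nearest = min(clusters, key=lambda c: dist_to_cluster(x, c))
        if dist_to_cluster(x, nearest) > theta and len(clusters) < q:
            clusters.append([x])      # too far from everything: open a new cluster
        else:
            nearest.append(x)         # otherwise join the nearest cluster
    return clusters

# Hypothetical 1-D points, unrelated to the data set the question asks for.
result = sequential_clustering([0.0, 0.5, 3.0, 3.4], q=2, theta=1.1)
print(result)  # [[0.0, 0.5], [3.0, 3.4]]
```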
- Clustering Techniques Compared.
- What is the key difference between k-means clustering and the
other clustering techniques we discussed (sequential and greedy hierarchical
clustering) that makes k-means less applicable to examples with
nominal attributes? (1 point)
- Can you devise a way to adapt part of k-means to overcome this
difficulty? (1 point)
- (Extra Credit) Consider a k-means algorithm that breaks ties
arbitrarily when reassigning points to clusters. Show that if the
choices are particularly bad this can make k-means run forever.
(Hint: You can show this with as few as four examples and two clusters.)
What's a simple change to k-means that fixes this problem?
(1 point extra credit)
| Version History |
| 1.0 | Mon Oct 25 21:08:52 CDT 2010 | Initial version. |
| 1.1 | Thu Nov 4 18:14:13 CDT 2010 | Fix small typo. |