EECS 349, Machine Learning

  Instructor: Prof. Doug Downey
By Majed Valad Beigi
Northwestern University

Offline HandWritten Character Recognition

You can download the full project report and source code here:
FullReport_Source Code

Support Vector Machine

Training algorithm:

(a) Linearly Separable Binary Classification:

The training data for the SVM are pairs $\{x_i, y_i\}$, where $i = 1, 2, \ldots, 52$ and $y_i \in \{-1, +1\}$. Here $x_i$ is the input vector: each 5×7 (or 6×8) character grid is flattened into a 1×35 (or 1×48) vector and applied to the input.
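The flattening step can be sketched as follows. This is a minimal illustration, not the project's code: the pixel values are made up, and both the row-major ordering and the reading of "5×7" as 5 columns by 7 rows are assumptions.

```python
# Hypothetical 5x7 binary grid for one character (1 = ink, 0 = background).
# Stored as 7 rows of 5 columns; the pixel values are invented for illustration.
grid = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]


def flatten(grid):
    """Concatenate the rows of a 2-D grid into one flat feature vector."""
    return [pixel for row in grid for pixel in row]


x = flatten(grid)  # the 1x35 input vector x_i
y = +1             # its label in one binary problem: +1 or -1
print(len(x))
```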

In an SVM, a hyperplane separates the positive examples ($y_i = +1$) from the negative examples ($y_i = -1$); this is called a separating hyperplane. The points $x$ that lie on the hyperplane satisfy $x \cdot w + b = 0$, where $w$ is normal to the hyperplane. Here, $|b| / \|w\|$ is the perpendicular distance from the hyperplane to the origin, and $\|w\|$ is the Euclidean norm of $w$. Let $d_+$ ($d_-$) be the shortest distance from the separating hyperplane to the closest positive (negative) example. The "margin" of a separating hyperplane is defined to be $d_+ + d_-$. For the linearly separable case, the support vector algorithm simply looks for the separating hyperplane with the largest margin. This can be formulated as follows: suppose that all the training data satisfy the following constraints:

$$x_i \cdot w + b \ge +1 \quad \text{for } y_i = +1 \qquad (1)$$

$$x_i \cdot w + b \le -1 \quad \text{for } y_i = -1 \qquad (2)$$

These can be combined into one set of inequalities:

$$y_i (x_i \cdot w + b) - 1 \ge 0 \quad \text{for all } i \qquad (3)$$

Now consider the points for which equality holds in Equation 1, $x_i \cdot w + b \ge +1$ (requiring that such a point exists is equivalent to choosing a scale for $w$ and $b$). These points lie on the hyperplane $H_1$: $x_i \cdot w + b = 1$, with normal $w$ and perpendicular distance from the origin $|1 - b| / \|w\|$.

Similarly, the points for which equality holds in Equation 2, $x_i \cdot w + b \le -1$, lie on the hyperplane $H_2$: $x_i \cdot w + b = -1$, with normal again $w$ and perpendicular distance from the origin $|-1 - b| / \|w\|$.

Hence $d_+ = d_- = 1 / \|w\|$, and the margin is simply $2 / \|w\|$.

$H_1$ and $H_2$ are parallel (they have the same normal) and no training points fall between them. Thus we can find the pair of hyperplanes that gives the maximum margin by minimizing $\|w\|^2$ subject to the constraints $y_i (x_i \cdot w + b) - 1 \ge 0$ for all $i$.
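To make the formulation concrete, here is a small sketch that checks the constraints $y_i (x_i \cdot w + b) - 1 \ge 0$ and evaluates the margin $2 / \|w\|$ for two hand-picked hyperplanes. The toy 2-D data and the candidate values of $w$ and $b$ are assumptions chosen for illustration; no optimizer is run.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def satisfies_constraints(w, b, X, y):
    """True if y_i * (x_i . w + b) - 1 >= 0 holds for every training pair."""
    return all(yi * (dot(xi, w) + b) - 1 >= 0 for xi, yi in zip(X, y))


def margin(w):
    """Margin 2 / ||w|| between the hyperplanes H1 and H2."""
    return 2.0 / math.sqrt(dot(w, w))


# Toy, linearly separable data (assumed for illustration).
X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
y = [+1, +1, -1, -1]

# Two candidate hyperplanes with the same direction but different scale.
w_a, b_a = [0.5, 0.5], -1.0
w_b, b_b = [1.0, 1.0], -2.0

for name, (w, b) in {"a": (w_a, b_a), "b": (w_b, b_b)}.items():
    print(name, satisfies_constraints(w, b, X, y), round(margin(w), 3))
```

Both candidates satisfy the constraints, but the one with the smaller norm has the larger margin, which is why the problem is posed as minimizing $\|w\|^2$.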

Thus, the solution for a typical two-dimensional case has the form shown in the following figure. Those training points for which equality holds in the constraints $y_i (x_i \cdot w + b) - 1 \ge 0$ (i.e. those which wind up lying on one of the hyperplanes $H_1$, $H_2$), and whose removal would change the solution found, are called support vectors; they are indicated in Figure 7 by the extra circles [14].
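As a rough sketch of the first half of that definition, the snippet below lists the training points that lie on $H_1$ or $H_2$ for a given hyperplane. The data, the hyperplane and the tolerance `tol` are assumptions for illustration; the check does not test whether removing a point would change the solution.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def points_on_margin(w, b, X, y, tol=1e-9):
    """Indices i with y_i * (x_i . w + b) = 1, i.e. points lying on H1 or H2."""
    return [
        i
        for i, (xi, yi) in enumerate(zip(X, y))
        if abs(yi * (dot(xi, w) + b) - 1.0) <= tol
    ]


# Toy data and a hand-picked separating hyperplane (assumed for illustration).
X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
y = [+1, +1, -1, -1]
w, b = [0.5, 0.5], -1.0

sv = points_on_margin(w, b, X, y)
print(sv)  # candidate support vectors: one on H1, one on H2
```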

Figure 7: Linear separating hyperplanes for the separable case.

This constrained minimization problem is solved using Lagrange multipliers.
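For reference, the standard Lagrangian formulation of the maximum-margin problem is sketched below. The multiplier symbols $\alpha_i$ and the factor $\tfrac{1}{2}$ in the objective are conventional choices rather than notation from this page; minimizing $\tfrac{1}{2}\|w\|^2$ gives the same $w$ and $b$ as minimizing $\|w\|^2$.

```latex
% Primal Lagrangian, with one multiplier \alpha_i \ge 0 per training constraint
L_P = \tfrac{1}{2}\,\|w\|^2 - \sum_i \alpha_i \bigl[\, y_i (x_i \cdot w + b) - 1 \,\bigr]

% Setting the derivatives with respect to w and b to zero
w = \sum_i \alpha_i y_i x_i , \qquad \sum_i \alpha_i y_i = 0

% Substituting back gives the dual, maximized over \alpha_i \ge 0
% subject to \sum_i \alpha_i y_i = 0
L_D = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j)
```

In the dual, the training data appear only through the inner products $x_i \cdot x_j$, and only points with $\alpha_i > 0$ contribute to $w$.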

(b) Non-Linearly separable Binary Classification:

When the data set is not linearly separable, a kernel function can be used to map the data into a higher-dimensional space (the feature space), where a separating hyperplane may exist. Several kernel functions are in common use; for this project, I used the linear kernel function.


Figure 8: A Kernel function can be used to map the data point to a higher dimensional space.
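For illustration, three kernel functions that are common in the SVM literature can be written as below. These are textbook examples, not necessarily the ones listed in the original report; the parameter names `degree`, `coef0` and `gamma` and their default values are assumptions.

```python
import math


def linear_kernel(u, v):
    """K(u, v) = u . v (the kernel used in this project)."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, degree=2, coef0=1.0):
    """K(u, v) = (u . v + coef0) ** degree."""
    return (linear_kernel(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=0.5):
    """K(u, v) = exp(-gamma * ||u - v||^2), the Gaussian (RBF) kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


u, v = [1.0, 2.0], [3.0, 0.0]
print(linear_kernel(u, v), polynomial_kernel(u, v), rbf_kernel(u, v))
```

With the linear kernel, the feature space is the input space itself, so the classifier is the separating hyperplane described in part (a).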

(c) Multi-Class Classification:

SVMs are inherently two-class classifiers. To handle more than two classes, the most common technique in practice is to build as many one-versus-rest classifiers as there are classes (commonly referred to as "one-versus-all" or OVA classification), each of which distinguishes one label from all the others. A new instance is classified by a winner-takes-all strategy: the classifier with the highest output assigns the class. Training therefore requires solving K different binary problems (one per class), each separating "class k" from "the remaining classes" for k = 1, 2, ..., K. For this project, I have 52 classifiers (because I have 52 classes).
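The relabeling behind one-versus-all training can be sketched as follows. The helper name `one_vs_rest_labels` is hypothetical, and treating the 52 classes as the upper- and lower-case English letters is an assumption; each resulting label vector would then be used to train one binary SVM.

```python
import string


def one_vs_rest_labels(labels, classes):
    """For each class k, relabel the data as +1 (class k) or -1 (any other class)."""
    return {k: [+1 if label == k else -1 for label in labels] for k in classes}


# Assumed class set: 26 upper-case plus 26 lower-case letters = 52 classes.
classes = list(string.ascii_uppercase + string.ascii_lowercase)

# Hypothetical class labels for four training samples.
labels = ["A", "b", "A", "Z"]

binary_problems = one_vs_rest_labels(labels, classes)
print(len(binary_problems))   # one binary problem per class
print(binary_problems["A"])   # +1 where the sample is an "A", -1 elsewhere
```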

Testing algorithm:

For testing, a test sample $x$ is assigned to the class $c$ that gives the largest (most positive) value of $f_c(x)$, where $f_c(x)$ is the solution of the $c$-th binary problem (classifier).
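A minimal sketch of this winner-takes-all rule is shown below. It assumes linear decision functions $f_c(x) = w_c \cdot x + b_c$, in line with the linear kernel used in this project; the three classifiers and their weights are invented for illustration.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def predict(classifiers, x):
    """Return the class whose decision value f_c(x) = w_c . x + b_c is largest."""
    scores = {c: dot(w, x) + b for c, (w, b) in classifiers.items()}
    return max(scores, key=scores.get)


# Hypothetical trained classifiers: class label -> (w_c, b_c).
classifiers = {
    "A": ([1.0, 0.0], -0.5),
    "B": ([0.0, 1.0], -0.5),
    "C": ([-1.0, -1.0], 0.5),
}

print(predict(classifiers, [0.2, 0.9]))
print(predict(classifiers, [0.1, 0.1]))
```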