Function Word Classification

This is an experiment where we classify congressional speeches based on function word analysis.


EECS 349-0: Machine Learning | Northwestern University | Professor Downey

Al Johri

Seth McCammon

Daniel Thirman


Output of party classifier results on congressional data



Typically, Natural Language Processing (NLP) uses a bag-of-words approach to classify texts. The words in these bags are generally drawn from the content of the texts that the classifier works with. For example, a spam filter might flag emails referring to “Nigerian princes” as spam. For our project, we wanted to build a content-independent classifier that eschews content words in favor of function words. Function words are the filler words of a language, such as pronouns, prepositions, and modifying verbs, that fit around the content of a sentence. We think that a classifier based on these function words would be more generally applicable across genres of text, rather than being content-restricted. To narrow the problem, we chose to classify congressional speeches by party affiliation using only function words.

Our dataset consists of 890,000 congressional speeches from the congressional speech archive, delivered in congresses between 1994 and 2014. Each speech is accompanied by metadata that includes the speaker's name and party affiliation and the congress and session numbers. The feature set is derived from a series of 64 dictionaries created by the Linguistic Inquiry and Word Count (LIWC) program, which comprise both content-based and function-word dictionaries. We used several methods to derive features from these dictionaries. The baseline content-based method uses the percentage of each speech's words that appear in each dictionary. For the subsequent function-word analyses, we counted the number of times each word in the function-word dictionaries occurred, expressed as a percentage of the speech's total word count. These numbers were formed into vectors, which became the feature set for our classifier. To classify each speech, we used a Bernoulli Naive Bayes classifier.
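The function-word pipeline described above can be sketched roughly as follows. This is a minimal illustration, not our project code: the `FUNCTION_WORDS` list is a hypothetical stand-in for the LIWC function-word dictionaries (which are not reproduced here), the party labels and toy speeches are invented, and the choice to binarize each percentage at a threshold of zero before applying the Bernoulli model is an assumption about how such features might be fed to the classifier.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for a function-word dictionary (not the LIWC lists).
FUNCTION_WORDS = ["i", "we", "you", "they", "the", "of", "to", "must", "will", "not"]


def tokenize(text):
    """Lowercase the speech and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def function_word_features(text, vocabulary=FUNCTION_WORDS):
    """Return each vocabulary word's count as a fraction of the speech's total words."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on an empty speech
    return [counts[word] / total for word in vocabulary]


class BernoulliNaiveBayes:
    """Tiny Bernoulli Naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self, alpha=1.0, threshold=0.0):
        self.alpha = alpha          # smoothing strength (assumed value)
        self.threshold = threshold  # a feature is "present" if above this (assumed)

    def _binarize(self, row):
        return [1 if value > self.threshold else 0 for value in row]

    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_features = len(X[0])
        self.log_prior = {}
        self.feature_prob = {}
        for label in self.classes:
            rows = [self._binarize(x) for x, lab in zip(X, y) if lab == label]
            self.log_prior[label] = math.log(len(rows) / len(X))
            # Smoothed probability that each feature is present in this class.
            self.feature_prob[label] = [
                (sum(r[j] for r in rows) + self.alpha) / (len(rows) + 2 * self.alpha)
                for j in range(n_features)
            ]
        return self

    def predict(self, X):
        predictions = []
        for x in X:
            bits = self._binarize(x)
            scores = {}
            for label in self.classes:
                score = self.log_prior[label]
                for bit, p in zip(bits, self.feature_prob[label]):
                    score += math.log(p if bit else 1.0 - p)
                scores[label] = score
            predictions.append(max(scores, key=scores.get))
        return predictions


# Toy usage with invented speeches and labels.
speeches = ["We must not fail the people", "I will speak of the budget"]
labels = ["D", "R"]
model = BernoulliNaiveBayes().fit([function_word_features(s) for s in speeches], labels)
```

Because a Bernoulli model only sees whether a feature is present or absent, the binarization threshold determines how much of the percentage information survives; a threshold of zero reduces each feature to "did this word occur at all".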

Our results are promising. When the content-based dictionaries were incorporated, our classifier achieved 70% accuracy in 5-fold cross-validation on congressional speeches from a 2-year period (the length of one congress). When we restricted our features to function words only, we expected accuracy to drop, since we were using a less descriptive feature set. It did, but only slightly: using only function words as features, our classifier achieved 68% accuracy in classifying speeches by party affiliation within a single congress.
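For readers unfamiliar with the evaluation, 5-fold cross-validation accuracy can be computed along these lines. This is a sketch under stated assumptions rather than our actual evaluation script: the helper names (`k_fold_indices`, `cross_val_accuracy`, `majority_class`) are hypothetical, the shuffling seed is arbitrary, and the majority-class predictor is only a placeholder for whichever classifier is being evaluated.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(train_and_predict, X, y, k=5, seed=0):
    """Average held-out accuracy over k folds.

    train_and_predict(train_X, train_y, test_X) must return one label per test row.
    """
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        predictions = train_and_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in held_out],
        )
        correct = sum(pred == y[i] for pred, i in zip(predictions, held_out))
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def majority_class(train_X, train_y, test_X):
    """Placeholder classifier: always predict the most common training label."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common] * len(test_X)
```

Running the same procedure with a trivial predictor such as `majority_class` gives a reference point for judging how much an accuracy figure reflects the features rather than the class balance of the data.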