Natural-Language-Processing
Table of Contents
1 Language Modeling
1.1 Introduction to N-grams
- Today's goal: assign a probability to a sentence or a sequence of words: \[ P(W) = P(w_1, w_2, w_3, w_4, \ldots, w_n) \]
- Related task: probability of an upcoming word: \[ P(w_5|w_1, w_2, w_3, w_4) \]
- A model that computes either of these: \[ P(W) \mbox{ or } P(w_n|w_1,w_2, \ldots, w_{n-1}) \mbox{ is called a language model.} \]
- A better name would be "the grammar", but "language model" or LM is standard.
1.1.1 How to compute P(W)
- How do we compute the joint probability P(its, water, is, so, transparent, that)?
- Intuition: rely on the Chain Rule of Probability
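Spelled out, the chain rule decomposes the joint probability into a product of conditional probabilities (a worked expansion of the example above):
\[ P(w_1w_2\ldots w_n) = \prod\limits_i P(w_i|w_1w_2\ldots w_{i-1}) \]
\[ P(\mbox{its water is so transparent that}) = P(\mbox{its}) \times P(\mbox{water}|\mbox{its}) \times P(\mbox{is}|\mbox{its water}) \times P(\mbox{so}|\mbox{its water is}) \times P(\mbox{transparent}|\mbox{its water is so}) \times P(\mbox{that}|\mbox{its water is so transparent}) \]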
1.1.2 How to estimate these probabilities
- just count and divide?
- No. Too many sentences.
- Markov Assumption
- Simplifying assumption: \[ P(the|\mbox{its water is so transparent that}) \approx P(the|that) \]
- Or maybe \[ P(the|\mbox{its water is so transparent that}) \approx P(the|\mbox{transparent that}) \]
- Formally \[ P(w_i|w_1w_2\ldots w_{i-1}) \approx P(w_i|w_{i-k}\ldots w_{i-1}) \]
1.1.3 Simplest case: Unigram model
\[ P(w_1w_2\ldots w_n) \approx \prod\limits_i {P(w_i)} \]
1.1.4 Bigram model
- Condition on the previous word:
\[ P(w_i|w_1w_2\ldots w_{i-1}) \approx P(w_i|w_{i-1}) \]
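As a rough illustration of both models, here is a minimal sketch that estimates unigram and bigram probabilities by counting over a tiny, made-up corpus (the corpus and function names are illustrative, not from the notes):

```python
from collections import Counter

def train_counts(sentences):
    """Collect unigram and bigram counts (with <s>/</s> markers) from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def p_unigram(w, unigrams):
    """Unigram estimate: count(w) / total number of tokens."""
    return unigrams[w] / sum(unigrams.values())

def p_bigram(w, prev, unigrams, bigrams):
    """Bigram estimate: count(prev, w) / count(prev), conditioning on the previous word."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

corpus = [["its", "water", "is", "so", "transparent"],
          ["the", "water", "is", "clear"]]
unigrams, bigrams = train_counts(corpus)
print(p_bigram("is", "water", unigrams, bigrams))  # 1.0 in this toy corpus
```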
1.1.5 N-gram models
- We can extend to trigrams, 4-grams, 5-grams
- In general this is an insufficient model of language
- because language has long-distance dependencies:
"The computer which I had just put into the machine room on the fifth floor crashed."
- But we can often get away with N-gram models
1.2 Estimating N-gram Probabilities
1.2.1 Bigram estimation
- MLE (the maximum likelihood estimate; formula given after this list)
- More examples: Berkeley Restaurant Project sentences
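Written out, the maximum likelihood estimate for a bigram probability is just count-and-normalize (standard MLE, stated explicitly here):
\[ P_{MLE}(w_i|w_{i-1}) = \frac{c(w_{i-1},w_i)}{c(w_{i-1})} \]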
1.2.2 Practical Issues
- We do everything in log space
- Avoid underflow
- (also adding is faster than multiplying)
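A quick illustration of why log space matters: multiplying many small probabilities underflows to zero in floating point, while summing their logs stays well within range (the numbers here are purely illustrative):

```python
import math

probs = [1e-5] * 100  # 100 tiny conditional probabilities

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the product underflows

log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # about -1151.3, still perfectly representable
```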
1.2.3 Language Model Toolkit
SRILM; Google N-gram Release, August 2006; Google Books N-grams
1.3 Evaluation and Perplexity
- Does our language model prefer good sentences to bad ones?
1.3.1 Extrinsic evaluation of N-gram models
1.3.2 Difficulty of extrinsic (in-vivo) evaluation of N-gram models
- Extrinsic evaluation
- Time-consuming
- So
- sometimes use intrinsic evaluation: perplexity
- Bad approximation
- unless the test data looks just like the training data
- So generally only useful in pilot experiments
- But it is helpful to think about
1.3.3 Intuition of Perplexity
- The Shannon Game:
- How well can we predict the next word?
- The best LM is one that best predicts an unseen test set
- Gives the highest P(sentence)
- Perplexity is the inverse probability of the test set, normalized by the number of words:
\[ PP(W) = P(w_1w_2\ldots w_N)^{-\frac{1}{N}} \]
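A minimal sketch of that formula computed in log space; `word_prob` here is a hypothetical stand-in for any trained language model's conditional probability:

```python
import math

def perplexity(test_words, word_prob):
    """PP(W) = P(w_1...w_N)^(-1/N), computed via the average negative log probability.

    word_prob(w, history) is assumed to return P(w | history) from some trained LM.
    """
    n = len(test_words)
    log_p = sum(math.log(word_prob(w, test_words[:i]))
                for i, w in enumerate(test_words))
    return math.exp(-log_p / n)

# Sanity check: a model that assigns 1/10 to every word has perplexity 10.
print(perplexity(list("0123456789"), lambda w, history: 0.1))  # ~10.0
```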
1.3.4 Perplexity as branching factor
- Let's suppose a sentence consisting of random digits
- What is the perplexity of this sentence according to a model that assigns P=1/10 to each digit?
\[ PP(W) = P(w_1w_2\ldots w_N)^{-\frac{1}{N}} = \left(\left(\frac{1}{10}\right)^N\right)^{-\frac{1}{N}} = 10 \] The lower the perplexity, the better the model.
1.4 Generalization and zeros
1.4.1 The Shannon Visualization Method
- Choose a random bigram (<s>, w) according to its probability
- Now choose a random bigram (w, x) according to its probability
- And so on until we choose </s>
- Then string the words together
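A sketch of this sampling procedure for a bigram model; the toy probability table below is made up for illustration:

```python
import random

def generate_sentence(bigram_probs, max_len=20):
    """Sample words starting from <s> until </s>, each in proportion to its bigram probability.

    bigram_probs maps a word w to a dict {next_word: P(next_word | w)}.
    """
    word, sentence = "<s>", []
    for _ in range(max_len):
        nexts = bigram_probs[word]
        word = random.choices(list(nexts), weights=list(nexts.values()))[0]
        if word == "</s>":
            break
        sentence.append(word)
    return " ".join(sentence)

toy_model = {
    "<s>":  {"I": 0.6, "the": 0.4},
    "I":    {"want": 1.0},
    "the":  {"food": 1.0},
    "want": {"food": 0.5, "</s>": 0.5},
    "food": {"</s>": 1.0},
}
print(generate_sentence(toy_model))  # e.g. "I want food"
```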
1.4.2 The perils of overfitting
- N-grams only work well for word prediction if the test corpus looks like the training corpus
- In real life, it often doesn't
- We need to train robust models that generalize!
- One kind of generalization: Zeros!
- Things that don't ever occur in the training set
- But occur in the test set
1.4.3 Zeros
- Bigrams with zero probability
- mean that we will assign 0 probability to the test set!
- And hence we cannot compute perplexity (we can't divide by zero)
1.5 Smoothing: Add-One
- Also called Laplace smoothing
- Pretend we saw each word one more time than we did
\[ P_{Add-1}(w_i|w_{i-1}) = \frac{c(w_{i-1},w_i)+1}{c(w_{i-1})+V} \]
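A minimal sketch of that add-1 estimate over simple count tables (the counts and names below are illustrative only):

```python
from collections import Counter

def laplace_bigram_prob(w, prev, bigram_counts, unigram_counts, vocab_size):
    """P_Add-1(w | prev) = (c(prev, w) + 1) / (c(prev) + V)."""
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + vocab_size)

unigram_counts = Counter({"water": 2, "is": 2, "transparent": 1})
bigram_counts = Counter({("water", "is"): 2})
V = len(unigram_counts)

print(laplace_bigram_prob("is", "water", bigram_counts, unigram_counts, V))           # (2+1)/(2+3) = 0.6
print(laplace_bigram_prob("transparent", "water", bigram_counts, unigram_counts, V))  # (0+1)/(2+3) = 0.2
```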
1.5.1 Reconstituted counts
\[ c^*(w_{n-1}w_n) = \frac{[C(w_{n-1}w_n)+1]\times C(w_{n-1})}{C(w_{n-1})+V} \]
1.5.2 Add-1 estimation is a blunt instrument
- So add-1 isn't used for N-grams:
- we'll see better methods
- But add-1 is used to smooth other NLP models
- For text classification
- In domains where the number of zeros isn't so huge
1.6 Interpolation
Backoff and Interpolation
- Sometimes it helps to use less context
- Condition on less context for contexts you haven't learned much about
- Backoff:
- use trigram if you have good evidence
- otherwise bigram, otherwise unigram
- Interpolation:
- mix unigram, bigram, trigram
- Interpolation works better
1.6.1 Linear Interpolation
\[ \hat{P}(w_n|w_{n-2}w_{n-1}) = \lambda_1 P(w_n|w_{n-2}w_{n-1}) + \lambda_2 P(w_n|w_{n-1}) + \lambda_3 P(w_n) \]
- Lambdas conditional on context: \[ \hat{P}(w_n|w_{n-2}w_{n-1}) = \lambda_1(w_{n-2}^{n-1}) P(w_n|w_{n-2}w_{n-1}) + \lambda_2(w_{n-2}^{n-1}) P(w_n|w_{n-1}) + \lambda_3(w_{n-2}^{n-1}) P(w_n) \]
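A sketch of simple linear interpolation with fixed lambdas; the component estimators passed in are hypothetical callables, and in practice the lambdas are tuned on held-out data:

```python
def interpolated_prob(w, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Mix unigram, bigram, and trigram estimates with weights that sum to 1.

    p_uni(w), p_bi(w, w2), p_tri(w, w1, w2) are assumed to come from some trained
    model; w1 w2 is the preceding two-word context.
    """
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, w2) + l3 * p_tri(w, w1, w2)

# Toy usage with constant stand-in models:
print(interpolated_prob("food", "I", "want",
                        p_uni=lambda w: 0.01,
                        p_bi=lambda w, w2: 0.2,
                        p_tri=lambda w, w1, w2: 0.5))  # 0.1*0.01 + 0.3*0.2 + 0.6*0.5 = 0.361
```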
1.7 Good-Turing Smoothing
More general formulations: Add-k \[ P_{Add-k}(w_i|w_{i-1})=\frac{c(w_{i-1}, w_i)+k}{c(w_{i-1})+kV} \] Unigram prior smoothing: \[ P_{UnigramPrior}(w_i|w_{i-1}) = \frac{c(w_{i-1}, w_i)+m\,P(w_i)}{c(w_{i-1})+m} \]
1.7.1 Advanced smoothing algorithms
- Intuition used by many smoothing algorithms
- Good-Turing, Kneser-Ney, Witten-Bell: use the count of things we've seen once to help estimate the count of things we've never seen
- Notation: N_c = frequency of frequency c
- N_c = the count of things we've seen c times
1.7.2 Good-Turing calculations
\[ P^*_{GT}(\mbox{things with zero frequency})=\frac{N_1}{N} \]
- Unseen (bass or catfish)
- c = 0
- MLE p = 0/18 = 0
- P*GT(unseen) = N1/N = 3/18
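For items that have been seen, the standard Good-Turing estimate replaces each raw count c with a discounted count built from the frequency-of-frequency table (general formula, stated here for completeness):
\[ c^* = \frac{(c+1)N_{c+1}}{N_c} \qquad P^*_{GT}(w) = \frac{c^*}{N} \]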
1.8 Kneser-Ney Smoothing
Absolute Discounting Interpolation
- Save ourselves some time and just subtract 0.75 (or some d)
\[ P_{AbsoluteDiscounting}(w_i|w_{i-1})=\frac{c(w_{i-1},w_i)-d}{c(w_{i-1})} + \lambda(w_{i-1})P(w_i) \]
- (Maybe keeping a couple of extra values of d for counts 1 and 2)
1.8.1 KN Smoothing
- Better estimate for probabilities of lower-order unigrams!
- Shannon game: I can't see without my reading ____________?
- "Francisco" is more common than "glasses", but "Francisco" always follows "San"
- The unigram is useful exactly when we haven't seen this bigram!
- Instead of P(w): "How likely is w"
- P_CONTINUATION(w): "How likely is w to appear as a novel continuation?"
- For each word, count the number of bigram types it completes
- Every bigram type was a novel continuation the first time it was seen
\[ P_{CONTINUATION}(w) \propto \left|\{w_{i-1}:c(w_{i-1},w)>0\}\right| \]
1.8.2 Kneser-Ney Smoothing II
- How many times does w appear as a novel continuation?
- Normalized by the total number of word bigram types: \[ P_{CONTINUATION}(w) = \frac{\left|\{w_{i-1}:c(w_{i-1},w)>0\}\right|}{\left|\{(w_{j-1},w_j):c(w_{j-1},w_j)>0\}\right|} \]
1.8.3 Kneser-Ney Smoothing III
- Alternative metaphor: the number of word types seen to precede w
1.8.4 Kneser-Ney Smoothing IV
\[ P_{KN}(w_i|w_{i-1}) = \frac{\max(c(w_{i-1},w_i)-d,\ 0)}{c(w_{i-1})} + \lambda(w_{i-1})P_{CONTINUATION}(w_i) \]
- Where \[ \lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\left|\{w:c(w_{i-1},w)>0\}\right| \] is a normalizing constant: the probability mass we've discounted
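A compact sketch of interpolated Kneser-Ney for bigrams that follows the formula above with a fixed discount d; it is a simplified illustration, not a production implementation:

```python
from collections import Counter, defaultdict

def train_kn_bigram(sentences, d=0.75):
    """Build the count tables needed for interpolated Kneser-Ney bigram probabilities."""
    unigram_counts, bigram_counts = Counter(), Counter()
    continuations = defaultdict(set)  # w -> distinct left contexts (for P_CONTINUATION)
    followers = defaultdict(set)      # prev -> distinct words that follow it (for lambda)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigram_counts.update(tokens)
        for prev, w in zip(tokens, tokens[1:]):
            bigram_counts[(prev, w)] += 1
            continuations[w].add(prev)
            followers[prev].add(w)
    total_bigram_types = len(bigram_counts)

    def p_kn(w, prev):
        # Discounted bigram term: max(c(prev, w) - d, 0) / c(prev)
        first = max(bigram_counts[(prev, w)] - d, 0) / unigram_counts[prev]
        # lambda(prev): discounted mass, spread over the seen continuations of prev
        lam = d * len(followers[prev]) / unigram_counts[prev]
        # Continuation probability: share of bigram types that w completes
        p_cont = len(continuations[w]) / total_bigram_types
        return first + lam * p_cont

    return p_kn

p_kn = train_kn_bigram([["I", "want", "chinese", "food"],
                        ["I", "want", "english", "food"]])
print(p_kn("food", "want"))  # interpolated KN estimate for P(food | want)
```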
1.8.5 Kneser-Ney Smoothing: Recursive formulation
\[ P_{KN}(w_i|w_{i-n+1}^{i-1}) = \frac{\max(c_{KN}(w_{i-n+1}^{i})-d,\ 0)}{c_{KN}(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\,P_{KN}(w_i|w_{i-n+2}^{i-1}) \]
\[ c_{KN}(\cdot) = \begin{cases} \mbox{count}(\cdot) & \mbox{for the highest order} \\ \mbox{continuation count}(\cdot) & \mbox{for lower orders} \end{cases} \]
Continuation count = the number of unique single-word contexts for •