Natural-Language-Processing
Table of Contents
1 Language Modeling
1.1 Introduction to N-grams
- Today's goal: assign a probability to a sentence or a sequence of words: \[ P(W) = P(w_1, w_2, w_3, w_4, \ldots, w_n) \]
- Related task: probability of an upcoming word: \[ P(w_5|w_1, w_2, w_3, w_4) \]
- A model that computes either of these: \[ P(W) \mbox{ or } P(w_n|w_1,w_2, \ldots, w_{n-1}) \mbox{ is called a language model.} \]
- A better name would be "the grammar", but "language model" or LM is standard.
1.1.1 How to compute P(W)
- How do we compute the joint probability P(its, water, is, so, transparent, that)?
- Intuition: rely on the Chain Rule of Probability
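Spelled out, the chain rule decomposes the joint probability into a product of conditional probabilities (a worked expansion of the example above):
\[ P(w_1w_2\ldots w_n) = \prod\limits_i P(w_i|w_1w_2\ldots w_{i-1}) \]
\[ P(\mbox{its water is so transparent that}) = P(\mbox{its}) \times P(\mbox{water}|\mbox{its}) \times P(\mbox{is}|\mbox{its water}) \times P(\mbox{so}|\mbox{its water is}) \times P(\mbox{transparent}|\mbox{its water is so}) \times P(\mbox{that}|\mbox{its water is so transparent}) \]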
1.1.2 How to estimate these probabilities
- just count and divide?
- No. Too many sentences.
- Markov Assumption
- Simplifying assumption: \[ P(the|\mbox{its water is so transparent that}) \approx P(the|that) \]
- Or maybe \[ P(the|\mbox{its water is so transparent that}) \approx P(the|\mbox{transparent that}) \]
- Formally \[ P(w_i|w_1w_2\ldots w_{i-1}) \approx P(w_i|w_{i-k}\ldots w_{i-1}) \]
1.1.3 Simplest case: Unigram model
\[ P(w_1w_2\ldots w_n) \approx \prod\limits_i {P(w_i)} \]
1.1.4 Bigram model
- Condition on the previous word:
\[ P(w_i|w_1w_2\ldots w_{i-1}) \approx P(w_i|w_{i-1}) \]
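As a rough illustration of both models, here is a minimal sketch that estimates unigram and bigram probabilities by counting over a tiny, made-up corpus (the corpus and function names are illustrative, not from the notes):

```python
from collections import Counter

def train_counts(sentences):
    """Collect unigram and bigram counts (with <s>/</s> markers) from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def p_unigram(w, unigrams):
    """Unigram estimate: count(w) / total number of tokens."""
    return unigrams[w] / sum(unigrams.values())

def p_bigram(w, prev, unigrams, bigrams):
    """Bigram estimate: count(prev, w) / count(prev), conditioning on the previous word."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

corpus = [["its", "water", "is", "so", "transparent"],
          ["the", "water", "is", "clear"]]
unigrams, bigrams = train_counts(corpus)
print(p_bigram("is", "water", unigrams, bigrams))  # 1.0 in this toy corpus
```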
1.1.5 N-gram models
- We can extend to trigrams, 4-grams, 5-grams
- In general this is an insufficient model of language
- because language has long-distance dependencies:
"The computer which I had just put into the machine room on the fifth floor crashed."
- But we can often get away with N-gram models
1.2 Estimating N-gram Probabilities
1.2.1 Bigram estimation
- MLE (the maximum likelihood estimate; formula given after this list)
- More examples: Berkeley Restaurant Project sentences
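Written out, the maximum likelihood estimate for a bigram probability is just count-and-normalize (standard MLE, stated explicitly here):
\[ P_{MLE}(w_i|w_{i-1}) = \frac{c(w_{i-1},w_i)}{c(w_{i-1})} \]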
1.2.2 Practical Issues
- We do everything in log space
- Avoid underflow
- (also adding is faster than multiplying)
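A quick illustration of why log space matters: multiplying many small probabilities underflows to zero in floating point, while summing their logs stays well within range (the numbers here are purely illustrative):

```python
import math

probs = [1e-5] * 100  # 100 tiny conditional probabilities

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the product underflows

log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # about -1151.3, still perfectly representable
```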
1.2.3 Language Model Toolkit
SRILM; Google N-gram Release, August 2006; Google Books N-grams
1.3 Evaluation and Perplexity
- Does our language model prefer good sentences to bad ones?
1.3.1 Extrinsic evaluation of N-gram models
1.3.2 Difficulty of extrinsic (in-vivo) evaluation of N-gram models
- Extrinsic evaluation
- Time-consuming
- So
- sometimes use intrinsic evaluation: perplexity
- Bad approximation
- unless the test data looks just like the training data
- So generally only useful in pilot experiments
- But it is helpful to think about
1.3.3 Intuition of Perplexity
- The Shannon Game:
- How well can we predict the next word?
- The best LM is one that best predicts an unseen test set
- Gives the highest P(sentence)
- Perplexity is the inverse probability of the test set, normalized by the number of words:
\[ PP(W) = P(w_1w_2\ldots w_N)^{-\frac{1}{N}} \]
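A minimal sketch of that formula computed in log space; `word_prob` here is a hypothetical stand-in for any trained language model's conditional probability:

```python
import math

def perplexity(test_words, word_prob):
    """PP(W) = P(w_1...w_N)^(-1/N), computed via the average negative log probability.

    word_prob(w, history) is assumed to return P(w | history) from some trained LM.
    """
    n = len(test_words)
    log_p = sum(math.log(word_prob(w, test_words[:i]))
                for i, w in enumerate(test_words))
    return math.exp(-log_p / n)

# Sanity check: a model that assigns 1/10 to every word has perplexity 10.
print(perplexity(list("0123456789"), lambda w, history: 0.1))  # ~10.0
```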
1.3.4 Perplexity as branching factor
- Let's suppose a sentence consisting of random digits
- What is the perplexity of this sentence according to a model that assigns P=1/10 to each digit?
\[ PP(W) = P(w_1w_2\ldots w_N)^{-\frac{1}{N}} = \left(\left(\frac{1}{10}\right)^N\right)^{-\frac{1}{N}} = 10 \] The lower the perplexity, the better the model.
1.4 Generalization and zeros
1.4.1 The Shannon Visualization Method
- Choose a random bigram (<s>, w) according to its probability
- Now choose a random bigram (w, x) according to its probability
- And so on until we choose </s>
- Then string the words together
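A sketch of this sampling procedure for a bigram model; the toy probability table below is made up for illustration:

```python
import random

def generate_sentence(bigram_probs, max_len=20):
    """Sample words starting from <s> until </s>, each in proportion to its bigram probability.

    bigram_probs maps a word w to a dict {next_word: P(next_word | w)}.
    """
    word, sentence = "<s>", []
    for _ in range(max_len):
        nexts = bigram_probs[word]
        word = random.choices(list(nexts), weights=list(nexts.values()))[0]
        if word == "</s>":
            break
        sentence.append(word)
    return " ".join(sentence)

toy_model = {
    "<s>":  {"I": 0.6, "the": 0.4},
    "I":    {"want": 1.0},
    "the":  {"food": 1.0},
    "want": {"food": 0.5, "</s>": 0.5},
    "food": {"</s>": 1.0},
}
print(generate_sentence(toy_model))  # e.g. "I want food"
```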
1.4.2 The perils of overfitting
- N-grams only work well for word prediction if the test corpus looks like the training corpus
- In real life, it often doesn't
- We need to train robust models that generalize!
- One kind of generalization: Zeros!
- Things that don't ever occur in the training set
- But occur in the test set
1.4.3 Zeros
- Bigrams with zero probability
- mean that we will assign 0 probability to the test set!
- And hence we cannot compute perplexity (we can't divide by zero)
1.5 Smoothing: Add-One
- Also called Laplace smoothing
- Pretend we saw each word one more time than we did
\[ P_{Add-1}(w_i|w_{i-1}) = \frac{c(w_{i-1},w_i)+1}{c(w_{i-1})+V} \]
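A minimal sketch of that add-1 estimate over simple count tables (the counts and names below are illustrative only):

```python
from collections import Counter

def laplace_bigram_prob(w, prev, bigram_counts, unigram_counts, vocab_size):
    """P_Add-1(w | prev) = (c(prev, w) + 1) / (c(prev) + V)."""
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + vocab_size)

unigram_counts = Counter({"water": 2, "is": 2, "transparent": 1})
bigram_counts = Counter({("water", "is"): 2})
V = len(unigram_counts)

print(laplace_bigram_prob("is", "water", bigram_counts, unigram_counts, V))           # (2+1)/(2+3) = 0.6
print(laplace_bigram_prob("transparent", "water", bigram_counts, unigram_counts, V))  # (0+1)/(2+3) = 0.2
```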
1.5.1 Reconstituted counts
\[ c^*(w_{n-1}w_n) = \frac{[C(w_{n-1}w_n)+1]\times C(w_{n-1})}{C(w_{n-1})+V} \]
1.5.2 Add-1 estimation is a blunt instrument
- So add-1 isn't used for N-grams:
- we'll see better methods
- But add-1 is used to smooth other NLP models
- For text classification
- In domains where the number of zeros isn't so huge
1.6 Interpolation
Backoff and Interpolation
- Sometimes it helps to use less context
- Condition on less context for contexts you haven't learned much about
- Backoff:
- use trigram if you have good evidence
- otherwise bigram, otherwise unigram
- Interpolation:
- mix unigram, bigram, trigram
- Interpolation works better
1.6.1 Linear Interpolation
\[ \hat{P}(w_n|w_{n-2}w_{n-1}) = \lambda_1 P(w_n|w_{n-2}w_{n-1}) + \lambda_2 P(w_n|w_{n-1}) + \lambda_3 P(w_n) \]
- Lambdas conditional on context: \[ \hat{P}(w_n|w_{n-2}w_{n-1}) = \lambda_1(w_{n-2}^{n-1}) P(w_n|w_{n-2}w_{n-1}) + \lambda_2(w_{n-2}^{n-1}) P(w_n|w_{n-1}) + \lambda_3(w_{n-2}^{n-1}) P(w_n) \]
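A sketch of simple linear interpolation with fixed lambdas; the component estimators passed in are hypothetical callables, and in practice the lambdas are tuned on held-out data:

```python
def interpolated_prob(w, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Mix unigram, bigram, and trigram estimates with weights that sum to 1.

    p_uni(w), p_bi(w, w2), p_tri(w, w1, w2) are assumed to come from some trained
    model; w1 w2 is the preceding two-word context.
    """
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, w2) + l3 * p_tri(w, w1, w2)

# Toy usage with constant stand-in models:
print(interpolated_prob("food", "I", "want",
                        p_uni=lambda w: 0.01,
                        p_bi=lambda w, w2: 0.2,
                        p_tri=lambda w, w1, w2: 0.5))  # 0.1*0.01 + 0.3*0.2 + 0.6*0.5 = 0.361
```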
1.7 Good-Turing Smoothing
More general formulations: Add-k \[ P_{Add-k}(w_i|w_{i-1})=\frac{c(w_{i-1}, w_i)+k}{c(w_{i-1})+kV} \] Unigram prior smoothing: \[ P_{UnigramPrior}(w_i|w_{i-1}) = \frac{c(w_{i-1}, w_i)+m\,P(w_i)}{c(w_{i-1})+m} \]
1.7.1 Advanced smoothing algorithms
- Intuition used by many smoothing algorithms
- Good-Turing, Kneser-Ney, Witten-Bell: use the count of things we've seen once to help estimate the count of things we've never seen
- Notation: N_c = frequency of frequency c
- N_c = the count of things we've seen c times
1.7.2 Good-Turing calculations
\[ P^*_{GT}(\mbox{things with zero frequency})=\frac{N_1}{N} \]
- Unseen (bass or catfish)
- c = 0
- MLE p = 0/18 = 0
- P*GT(unseen) = N1/N = 3/18
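For items that have been seen, the standard Good-Turing estimate replaces each raw count c with a discounted count built from the frequency-of-frequency table (general formula, stated here for completeness):
\[ c^* = \frac{(c+1)N_{c+1}}{N_c} \qquad P^*_{GT}(w) = \frac{c^*}{N} \]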
1.8 Kneser-Ney Smoothing
Absolute Discounting Interpolation
- Save ourselves some time and just subtract 0.75 (or some d)
\[ P_{AbsoluteDiscounting}(w_i|w_{i-1})=\frac{c(w_{i-1},w_i)-d}{c(w_{i-1})} + \lambda(w_{i-1})P(w_i) \]
- (Maybe keeping a couple of extra values of d for counts 1 and 2)
1.8.1 KN Smoothing
- Better estimate for probabilities of lower-order unigrams!
- Shannon game: I can't see without my reading ____________?
- "Francisco" is more common than "glasses", but "Francisco" always follows "San"
- The unigram is useful exactly when we haven't seen this bigram!
- Instead of P(w): "How likely is w"
- P_CONTINUATION(w): "How likely is w to appear as a novel continuation?"
- For each word, count the number of bigram types it completes
- Every bigram type was a novel continuation the first time it was seen
\[ P_{CONTINUATION}(w) \propto \left|\{w_{i-1}:c(w_{i-1},w)>0\}\right| \]
1.8.2 Kneser-Ney Smoothing II
- How many times does w appear as a novel continuation?
- Normalized by the total number of word bigram types: \[ P_{CONTINUATION}(w) = \frac{\left|\{w_{i-1}:c(w_{i-1},w)>0\}\right|}{\left|\{(w_{j-1},w_j):c(w_{j-1},w_j)>0\}\right|} \]
1.8.3 Kneser-Ney Smoothing III
- Alternative metaphor: the number of word types seen to precede w
1.8.4 Kneser-Ney Smoothing IV
\[ P_{KN}(w_i|w_{i-1}) = \frac{\max(c(w_{i-1},w_i)-d,\ 0)}{c(w_{i-1})} + \lambda(w_{i-1})P_{CONTINUATION}(w_i) \]
- Where \[ \lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\left|\{w:c(w_{i-1},w)>0\}\right| \] is a normalizing constant: the probability mass we've discounted
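A compact sketch of interpolated Kneser-Ney for bigrams that follows the formula above with a fixed discount d; it is a simplified illustration, not a production implementation:

```python
from collections import Counter, defaultdict

def train_kn_bigram(sentences, d=0.75):
    """Build the count tables needed for interpolated Kneser-Ney bigram probabilities."""
    unigram_counts, bigram_counts = Counter(), Counter()
    continuations = defaultdict(set)  # w -> distinct left contexts (for P_CONTINUATION)
    followers = defaultdict(set)      # prev -> distinct words that follow it (for lambda)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigram_counts.update(tokens)
        for prev, w in zip(tokens, tokens[1:]):
            bigram_counts[(prev, w)] += 1
            continuations[w].add(prev)
            followers[prev].add(w)
    total_bigram_types = len(bigram_counts)

    def p_kn(w, prev):
        # Discounted bigram term: max(c(prev, w) - d, 0) / c(prev)
        first = max(bigram_counts[(prev, w)] - d, 0) / unigram_counts[prev]
        # lambda(prev): discounted mass, spread over the seen continuations of prev
        lam = d * len(followers[prev]) / unigram_counts[prev]
        # Continuation probability: share of bigram types that w completes
        p_cont = len(continuations[w]) / total_bigram_types
        return first + lam * p_cont

    return p_kn

p_kn = train_kn_bigram([["I", "want", "chinese", "food"],
                        ["I", "want", "english", "food"]])
print(p_kn("food", "want"))  # interpolated KN estimate for P(food | want)
```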
1.8.5 Kneser-Ney Smoothing: Recursive formulation
\[ P_{KN}(w_i|w_{i-n+1}^{i-1}) = \frac{\max(c_{KN}(w_{i-n+1}^{i})-d,\ 0)}{c_{KN}(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\,P_{KN}(w_i|w_{i-n+2}^{i-1}) \]
\[ c_{KN}(\cdot) = \begin{cases} \mbox{count}(\cdot) & \mbox{for the highest order} \\ \mbox{continuation count}(\cdot) & \mbox{for lower orders} \end{cases} \]
Continuation count = the number of unique single-word contexts for •