Mining-Massive-Datasets
Table of Contents
- 1.1. The Affiliation Graph Model
- 1.2. From AGM to BigCLAM
- 1.3. Solving the BigCLAM
- 1.4. Detecting Clusters
- 1.5. The Graph Laplacian Matrix
- 1.6. Spectral Clustering Algorithms
- 1.7. Trawling
- 1.8. Mining Data Streams
- 1.9. Bloom Filters
- 1.10. Counting 1's
- 1.11. Sampling Streams
- 1.12. Counting Distinct Elements
- 5.1. LSH Families of Hash Functions
- 5.1.1. Hash Functions Decide Equality
- 5.1.2. LSH Families Defined
- 5.1.3. E.g.: LS Family
- 5.1.4. Amplifying a LSH-Family
- 5.1.5. AND of Hash Functions
- 5.1.6. OR of Hash Functions
- 5.1.7. Effect of AND and OR Constructions
- 5.1.8. Composing Constructions
- 5.1.9. AND-OR Composition
- 5.1.10. OR-AND Composition
- 5.1.11. Cascading Constructions
- 5.1.12. General Use of S-Curves
- 5.2. More LSH Families
- 5.3. Topic Specific (aka Personalized) PageRank
0.1 Improvements to A-Priori
0.1.1 PCY Algorithm (Park-Chen-Yu Algorithm)
1st pass: count individual items and hash each pair to a bucket, keeping a count per bucket. 2nd pass: count only those pairs of frequent items that hash to a frequent bucket (recorded in a bitmap).
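A minimal PCY sketch of the two passes, assuming the baskets fit in a Python list and using Python's built-in `hash` as the bucket hash; all names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def pcy(baskets, support, num_buckets=10007):
    # Pass 1: count singletons and hash every pair into a bucket counter.
    item_counts = defaultdict(int)
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        for item in basket:
            item_counts[item] += 1
        for pair in combinations(sorted(set(basket)), 2):
            bucket_counts[hash(pair) % num_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= support}
    bitmap = [c >= support for c in bucket_counts]   # frequent-bucket bitmap

    # Pass 2: count only pairs of frequent items that hash to a frequent bucket.
    pair_counts = defaultdict(int)
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for pair in combinations(items, 2):
            if bitmap[hash(pair) % num_buckets]:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= support}
```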
0.1.2 Multistage
Two hash tables, two bitmaps, three passes. Important points: the hash functions have to be independent, and we need to check both hashes on the third pass. Otherwise we would count pairs of frequent items that hash to an infrequent bucket under one hash but to a frequent bucket under the other.
0.1.3 Multihash
Key idea: use multiple independent hash tables in 1st pass.
0.1.4 Single-Pass Approximate Algorithms
0.2 All (Or Most) Frequent Itemsets In ≤ 2 Passes
0.2.1 Simple Algorithm
Take a random sample of the market baskets.
0.2.2 Savasere-Omiecinski-Navathe (SON) Algorithm
Only works if the candidate itemsets can be counted in main memory. Process the file in subsets (chunks). Key idea, "monotonicity": an itemset cannot be frequent in the whole file unless it is frequent in at least one subset.
0.2.3 Toivonen's Algorithm
No false negatives, but in some cases it may not give an answer, so you need to rerun it; there is no guarantee of finishing. Start as in the simple algorithm, but lower the support threshold slightly for the sample. Goal: avoid missing any itemset that is frequent in the whole dataset.
Negative border: itemsets that are not frequent in the sample but all of whose immediate subsets are, e.g. {A, B, C, D} when every size-3 subset of it is frequent in the sample.
1 Week 3
1.1 The Affiliation Graph Model
1.1.1 Community-Affiliation Graph
Communities \(C\), memberships \(M\), nodes \(V\). The AGM generates a graph: \( P(u, v) = 1 - \prod\limits_{c\in M_u \cap M_v}(1-p_c) \)
1.1.2 AGM: Flexibility
Can express a variety of community structures: non-overlapping, overlapping, nested.
1.2 From AGM to BigCLAM
\(F_{uA}\): the membership strength of node \(u\) to community \(A\). Each community \(A\) links nodes independently: \( P_A(u, v) = 1 - \exp(-F_{uA}\cdot F_{vA}) \)
1.2.1 Community Membership Strength: Factor Matrix F
Then the probability that at least one common community \(C\) links them: \( P(u,v) = 1 - \prod\limits_C (1-P_C(u,v)) \)
\begin{align*} P(u,v) &= 1 - \exp(-\sum_C F_{uC}\cdot F_{vC}) \\ &= 1 - \exp(-F_u\cdot F_v^{T}) \end{align*}
1.3 Solving the BigCLAM
1.3.1 How to find F
Given a network \(G(V,E)\), estimate \(F\): \( \arg\max_F \prod\limits_{(u,v)\in E} P(u,v) \prod\limits_{(u,v)\not\in E} (1-P(u,v)) \). Take the log-likelihood \(l(F_u)\):
\begin{equation*} l(F_u) = \sum\limits_{v\in N(u)} \log(1-\exp(-F_u F_v^T)) - \sum\limits_{v\not \in N(u)} F_u F_v^T \end{equation*}
1.3.2 BigCLAM: V1.0
Gradient ascent is slow because evaluating \( \nabla l(F_u) \) takes time linear in the total number of nodes.
1.3.3 BigCLAM: V2.0
\( \sum\limits_{v\not \in N(u)} F_v = \sum\limits_v F_v - F_u - \sum\limits_{v\in N(u)} F_v \). We cache \( \sum\limits_v F_v \), so the update now takes time linear in the degree \(|N(u)|\) of \(u\).
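A minimal sketch of one gradient-ascent step for a single row \(F_u\), using the cached column sum \(\sum_v F_v\); the function names, learning rate, and data layout are illustrative assumptions.

```python
import numpy as np

def gradient_step(F, adj, u, sum_F, lr=0.01):
    """F: (n_nodes, K) nonnegative factor matrix; adj: dict node -> set of neighbors; sum_F: cached F.sum(axis=0)."""
    neighbors = list(adj[u])
    Fv = F[neighbors]                              # (deg(u), K)
    dots = Fv @ F[u]                               # F_u . F_v for each neighbor v
    weights = np.exp(-dots) / (1.0 - np.exp(-dots) + 1e-12)
    grad_pos = (weights[:, None] * Fv).sum(axis=0)        # sum over neighbors
    grad_neg = sum_F - F[u] - Fv.sum(axis=0)               # sum over non-neighbors via the cached total
    F[u] = np.maximum(0.0, F[u] + lr * (grad_pos - grad_neg))  # project onto F_u >= 0
    return F[u]
```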
1.4 Detecting Clusters
Goal: find densely linked clusters (discovering social circles, circles of trust). Graph partitioning criterion: conductance \(\phi(A) = \cfrac{CUT(A)}{VOL(A)}\), where \( vol(A) = \sum\limits_{i\in A} d_i \) is the total weight of the edges with at least one endpoint in \(A\). Why use this criterion? It produces more balanced partitions.
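A minimal sketch of this criterion for an unweighted graph given as adjacency sets; names are illustrative.

```python
def conductance(adj, A):
    """adj: dict node -> set of neighbors; A: iterable of nodes."""
    A = set(A)
    cut = sum(1 for u in A for v in adj[u] if v not in A)   # edges leaving A
    vol = sum(len(adj[u]) for u in A)                        # sum of degrees inside A
    return cut / vol if vol else float("inf")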
1.5 The Graph Laplacian Matrix
Graph partitioning task: partition the graph into two pieces such that the resulting pieces have low conductance. Problem: computing the optimal cut is NP-hard. \(A\): adjacency matrix of undirected \(G\), with \(A_{ij} = 1\) if \((i, j)\) is an edge, else 0. \(x\): a vector of labels/values, one per node of \(G\); spectral methods study the product \(A\cdot x\). The graph Laplacian is \(L = D - A\) (with \(D\) the diagonal degree matrix), and \( \lambda_2 = \min\limits_{x \perp \mathbf{1}} \cfrac{x^T L x}{x^T x} \), where \( x^T L x = \sum\limits_{(i,j)\in E} (x_i - x_j)^2 \). The minimizing \(x\) is the Fiedler vector.
1.6 Spectral Clustering Algorithms
Three basic stages:
- Pre-processing
Construct a matrix representation of the graph
- Decomposition
Compute eigenvalues and eigenvectors of the matrix Map each point to a lower-dimensional representation based on one or more eigenvectors
- Grouping
Assign points to two or more clusters, based on the new representation
K-way spectral clustering: recursive bi-partitioning ('92) has disadvantages (inefficient, unstable); clustering multiple eigenvectors ('00) is preferable.
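A minimal sketch of the three stages for a single bi-partition using the Fiedler vector; dense numpy linear algebra, so only suitable for small graphs.

```python
import numpy as np

def spectral_bipartition(A):
    """A: symmetric 0/1 adjacency matrix as a numpy array."""
    D = np.diag(A.sum(axis=1))
    L = D - A                                # pre-processing: graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)     # decomposition: eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                  # eigenvector of the second-smallest eigenvalue
    return fiedler >= 0                      # grouping: split by sign
```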
1.7 Trawling
Searching for small communities in the Web graph. Frequent itemsets = complete bipartite graphs! View each node \(i\) as a set \(S_i\) of the nodes \(i\) points to. A \(K_{s,t}\) corresponds to a set \(Y\) of size \(t\) that occurs in \(s\) of the sets \(S_i\): \(s\) is the minimum support (\(|S| = s\)) and \(t\) is the itemset size (\(|Y| = t\)).
1.8 Mining Data Streams
1.8.1 The Stream Model
Data management vs. stream management: in a DBMS (e.g., SQL), the input rate is controlled internally (inserts). Stream management is important when the input rate is controlled externally, e.g., Google queries.
1.8.2 Sliding Windows
Queries are about a window of the last \(N\) elements. What if \(N\) is larger than main memory? E.g., averages: for a stream of integers, a standing query asks for the average of the integers in the window of size \(N\). For the first \(N\) inputs, simply sum and count to get the average. Afterwards, when a new input \(i\) arrives and the oldest element \(j\) leaves the window, change the average by adding \((i-j)/N\).
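A minimal sketch of this standing query; it still stores the whole window, so it only covers the case where \(N\) fits in memory.

```python
from collections import deque

class WindowAverage:
    def __init__(self, N):
        self.N, self.window, self.avg = N, deque(), 0.0

    def add(self, i):
        if len(self.window) < self.N:            # first N inputs: build the average directly
            self.window.append(i)
            self.avg = sum(self.window) / len(self.window)
        else:                                    # afterwards: adjust by (i - j) / N
            j = self.window.popleft()
            self.window.append(i)
            self.avg += (i - j) / self.N
        return self.avg
```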
1.8.3 Counting 1's
1.9 Bloom Filters
Bloom filters can have false positives but no false negatives. A Bloom filter is an array of bits together with a number of hash functions; the argument of each hash function is a stream element, and it returns a position in the array. E.g.: use \(N = 11\) bits for our filter, stream elements = integers, and two hash functions: \(h_1(x)\) is computed from the odd-numbered bits of \(x\), \(h_2(x)\) the same but from the even-numbered bits. Insertion sets the bit at \(h(x)\) for every hash function \(h\). Bloom filter lookup of \(y\): compute \(h(y)\) for each hash function \(h\); report \(y\) as present iff all the resulting bits are 1.
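A minimal Bloom-filter sketch; the concrete hash functions here are illustrative stand-ins for the odd/even-bit hashes in the lecture example.

```python
class BloomFilter:
    def __init__(self, n_bits=11, hashes=None):
        self.bits = [0] * n_bits
        # Illustrative hash functions; any independent hashes into [0, n_bits) work.
        self.hashes = hashes or [lambda x: hash(("h1", x)) % n_bits,
                                 lambda x: hash(("h2", x)) % n_bits]

    def add(self, x):
        for h in self.hashes:
            self.bits[h(x)] = 1

    def might_contain(self, y):
        # "Yes" iff every hash lands on a 1 bit: false positives possible, false negatives not.
        return all(self.bits[h(y)] for h in self.hashes)
```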
1.10 Counting 1's
Counting Bits Problem: given a stream of 0's and 1's, be prepared to answer queries of the form "how many 1's in the last k bits?" where k \(\leq\) N
1.10.1 DGIM Algorithm
\(O(\log^2 N)\) bits per stream
- Timestamps
Each bit in the stream has a timestamp, starting 0, 1, … Record timestamps modulo \(N\) (the window size), so we can represent any relevant timestamp in \(O(\log_2 N)\) bits.
- Buckets
A bucket is a segment of the window; it is represented by a record consisting of:
- The timestamp of its end [O(log N) bits]
- The number of 1's between its beginning and end; this count is always a power of 2 [O(log log N) bits].
- Representing a Stream by Buckets
There are either one or two buckets of each allowed (power-of-2) size. Buckets do not overlap. Buckets are sorted by size: older buckets are at least as large as newer ones. A bucket disappears when its end-time is more than \(N\) time units in the past.
- Updating Buckets
When a new bit comes in, drop the oldest bucket if its end-time is more than \(N\) time units before the current time. If the current bit is 0, make no other changes. If the current bit is 1 (see the sketch after this list):
- Create a new bucket of size 1, for just this bit. End timestamp = current time.
- If there are now three buckets of size 1, combine the oldest two into a bucket of size 2
- If there are now three buckets of size 2, combine the oldest two into a bucket of size 4, and so on up the sizes.
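A minimal sketch of the update rule, with buckets stored newest-first as (end timestamp, size) pairs; absolute timestamps are used here instead of timestamps mod N for clarity.

```python
def dgim_update(buckets, bit, t, N):
    """buckets: list of (end_timestamp, size), newest first; bit: the arriving 0/1; t: its timestamp."""
    # Drop the oldest bucket if its end-time falls outside the window of length N.
    if buckets and buckets[-1][0] <= t - N:
        buckets.pop()
    if bit == 1:
        buckets.insert(0, (t, 1))                # new size-1 bucket ending now
        size = 1
        while True:                              # merge whenever three buckets share a size
            same = [i for i, b in enumerate(buckets) if b[1] == size]
            if len(same) < 3:
                break
            i, j = same[-2], same[-1]            # the two oldest buckets of this size
            buckets[i] = (buckets[i][0], 2 * size)   # merged bucket keeps the newer end-timestamp
            del buckets[j]
            size *= 2
    return buckets
```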
1.11 Sampling Streams
1.11.1 When Sampling Doesn't Work
E.g., Google wants to find the number of unique queries over the past month; sampling stream positions at a fixed rate does not answer this correctly, which motivates sampling based on a hash of the key (next section).
1.11.2 Sampling Based on Hash Value
E.G.: Fixed Sample Size Sampling Key-Value Pairs
1.11.3 Counting Distinct Items
1.11.4 Computing Moments
1.12 Counting Distinct Elements
Problem: a data stream consists of elements chosen from a set of size n. Maintain a count of the number of distinct elements seen so far. Applications: How many different words are found among the web pages being crawled at a site? How many unique users visited Facebook in the past month?
1.12.1 Estimating Counts
Flajolet-Martin approach: pick a hash function \(h\) that maps each of the \(n\) elements to at least \(\log_2 n\) bits. For each stream element \(a\), let \(r(a)\) be the number of trailing 0's in \(h(a)\). Record \(R\) = the maximum \(r(a)\) seen. Estimate = \(2^R\).
- Why it works
The probability that a given \(h(a)\) ends in at least \(i\) 0's is \(2^{-i}\). If there are \(m\) different elements, the probability that \(R \geq i\) is \(1-(1-2^{-i})^m\). Since \(2^{-i}\) is small, \(1-(1-2^{-i})^m \approx 1 - e^{-m2^{-i}}\).
- Solution
Partition your samples (many independent estimates \(2^R\)) into small groups; \(\log n\) groups, where \(n\) = size of the universal set, suffices. Take the average within each group, then take the median of the averages.
- Generalization: Moments
AMS (Alon-Matias-Szegedy). The 2nd moment is \( \sum_a m_a^2 \), where \(m_a\) is the number of occurrences of element \(a\). Pick a random time \(t\) and let \(c_t\) be the number of times the element at time \(t\) appears from that time on; define \(X = n(2c_t - 1)\). Then \( E(X) = \frac{1}{n}\sum_{t} n\,(2c_t - 1) = \sum_a (1+3+5+\cdots+(2m_a-1)) = \sum_a m_a^2 \). Problem: streams never end, so \(n\) keeps growing; fixups are needed to choose the sample positions fairly.
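A minimal sketch of the AMS second-moment estimator over a finite stream; the position-sampling fixups for unbounded streams are omitted, and the number of variables is an arbitrary choice.

```python
import random

def ams_second_moment(stream, n_vars=100):
    stream = list(stream)
    n = len(stream)
    positions = random.sample(range(n), min(n_vars, n))   # one variable X per sampled time t
    estimates = []
    for t in positions:
        a = stream[t]
        c = sum(1 for x in stream[t:] if x == a)   # times the element at time t appears from t on
        estimates.append(n * (2 * c - 1))
    return sum(estimates) / len(estimates)
```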
Worked example with \( h(x) = (3x + 7) \bmod 11 \):

| x | h(x) | binary | trailing 0's r(x) |
|---|------|--------|-------------------|
| 1 | 10 | 1010 | 1 |
| 2 | 13 mod 11 = 2 | 0010 | 1 |
| 3 | 16 mod 11 = 5 | 0101 | 0 |
| 4 | 19 mod 11 = 8 | 1000 | 3 |
| 5 | 22 mod 11 = 0 | 0000 | 4 (all bits) |
| 6 | 25 mod 11 = 3 | 0011 | 0 |
| 7 | 28 mod 11 = 6 | 0110 | 1 |
| 8 | 31 mod 11 = 9 | 1001 | 0 |
| 9 | 34 mod 11 = 1 | 0001 | 0 |
| 10 | 37 mod 11 = 4 | 0100 | 2 |
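The table above works through the hash by hand. Below is a minimal Flajolet-Martin sketch that automates the same kind of computation, including the group-average / median combiner from the "Solution" bullet; the hash construction and group sizes are illustrative assumptions.

```python
import hashlib
import statistics

def trailing_zeros(x):
    return (x & -x).bit_length() - 1 if x else 32   # cap for the (unlikely) all-zero hash

def fm_estimate(stream, n_groups=10, hashes_per_group=10):
    # R[g][k] tracks the max number of trailing zeros seen under hash (g, k).
    R = [[0] * hashes_per_group for _ in range(n_groups)]
    for a in stream:
        for g in range(n_groups):
            for k in range(hashes_per_group):
                h = int(hashlib.sha1(f"{g}-{k}-{a}".encode()).hexdigest(), 16)
                R[g][k] = max(R[g][k], trailing_zeros(h))
    group_avgs = [sum(2 ** r for r in group) / hashes_per_group for group in R]
    return statistics.median(group_avgs)          # median of the group averages

# print(fm_estimate(["a", "b", "a", "c", "d", "b"]))   # roughly 4 distinct elements
```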
2 Week 4 Recommender System
2.1 Overview
long tail
2.1.1 Types of Recommendations
Editorial and hand-curated; simple aggregates (top 10, most popular, recent uploads); tailored to individual users.
2.1.2 Formal Model
\(C\) = set of customers, \(S\) = set of items. Utility function \(u: C\times S \to R\), where \(R\) = the set of ratings, a totally ordered set.
2.2 Content-based
Main idea: recommend items to customer x that are similar to previous items rated highly by x. Users like → item profiles → user profile.
2.2.1 Item Profiles
For each item, create an item profile. A profile is a set of features (a vector). For text features, the profile = the set of "important" words in the item (document). How to pick important words? TF-IDF.
2.2.2 User Profiles
The user has rated items with profiles \(i_1, \ldots, i_n\). Simple: take the (weighted) average of the rated item profiles. Variant: normalize the weights using the user's average rating.
- Example 1: Boolean Utility Matrix
Items are movies; the only feature is "actor". Suppose user x has watched 5 movies: 2 featuring actor A and 3 featuring actor B. User profile: weight 0.4 on A, 0.6 on B.
- Example 2: Star Ratings
Same example, with 1-5 star ratings. Actor A's movies are rated 3 and 5; actor B's movies are rated 1, 2, and 4. Useful step: normalize ratings by subtracting the user's mean rating (3).
2.2.3 Making predictions
User profile x, item profile i. Estimate \( u(x, i) = \cos(\theta) = \frac{x \cdot i}{\|x\|\,\|i\|} \), i.e., rank items by cosine similarity (equivalently, by small cosine distance).
2.2.4 Pros: Content-based Approach
No need for data on other users. Able to recommend to users with unique tastes. Able to recommend new & unpopular items: no first-rater problem. Explanations for recommended items: the content features that caused an item to be recommended.
2.2.5 Cons: Content-based Approach
Finding the appropriate features is hard. Overspecialization: never recommends items outside the user's content profile, even though people might have multiple interests; unable to exploit quality judgments of other users. Cold-start problem for new users: how to build a user profile?
2.3 Collaborative Filtering
Consider user x Find set N of other users whose ratings are "similar" to x's ratings Estimate x's ratings based on ratings of users in N
Option 1: Jaccard similarity, \( sim(A,B) = |r_A\cap r_B|/|r_A\cup r_B| \). Problem: ignores rating values. Option 2: cosine similarity, \( sim(A,B) = \cos(r_A, r_B) \). Problem: treats missing ratings as negative. Option 3: centered cosine: normalize ratings by subtracting the row mean; also known as Pearson correlation.
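A minimal sketch of centered cosine (Pearson) similarity between two users' rating dictionaries; after centering, missing ratings contribute nothing, which matches treating them as zero.

```python
import math

def centered_cosine(ratings_a, ratings_b):
    """ratings_a, ratings_b: dicts item -> rating for two users."""
    mean_a = sum(ratings_a.values()) / len(ratings_a)
    mean_b = sum(ratings_b.values()) / len(ratings_b)
    common = set(ratings_a) & set(ratings_b)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den = math.sqrt(sum((r - mean_a) ** 2 for r in ratings_a.values())) * \
          math.sqrt(sum((r - mean_b) ** 2 for r in ratings_b.values()))
    return num / den if den else 0.0
```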
2.3.1 Rating Predictions
Let \(r_x\) be the vector of user \(x\)'s ratings, and let \(N\) be the set of \(k\) users most similar to \(x\) who have also rated item \(i\). Prediction for user \(x\) and item \(i\): Option 1: \( r_{xi} = \frac{1}{k} \sum_{y\in N} r_{yi} \). Option 2: \( r_{xi} = \frac{\sum_{y\in N} s_{xy} r_{yi}}{\sum_{y\in N} s_{xy}} \), where \(s_{xy}\) is the similarity between \(x\) and \(y\).
2.3.2 Item-Item Collaborative Filtering
Estimate the rating for item \(i\) based on ratings for similar items: \( r_{xi} = \cfrac{\sum_{j\in N(i;x)} s_{ij}\, r_{xj}}{\sum_{j\in N(i;x)} s_{ij}} \), where \(N(i;x)\) is the set of items rated by \(x\) that are most similar to \(i\).
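A minimal sketch of this prediction formula; the similarity values \(s_{ij}\) are assumed to come from any of the measures above (e.g., the centered cosine sketch).

```python
def predict_item_item(x_ratings, sims_to_i, k=2):
    """x_ratings: {item: rating by user x}; sims_to_i: {item j: s_ij to the target item i}."""
    neighbors = sorted((j for j in x_ratings if j in sims_to_i),
                       key=lambda j: sims_to_i[j], reverse=True)[:k]   # N(i; x)
    num = sum(sims_to_i[j] * x_ratings[j] for j in neighbors)
    den = sum(sims_to_i[j] for j in neighbors)
    return num / den if den else None
```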
2.3.3 Item-Item v. User-User
In theory, user-user and item-item are dual approaches. In practice, item-item outperforms user-user in many use cases: items are "simpler" than users, items belong to a small set of "genres" while users have varied tastes, so item similarity is more meaningful than user similarity.
2.4 Evaluating Recommender System
Root-mean-square error (RMSE). Problems: a narrow focus on accuracy sometimes misses the point: prediction diversity, prediction context, order of predictions. In practice, we care only about predicting high ratings. Alternative: precision at top k.
2.5 Latent Factor Models
2.5.1 A Modern Recommender System
Multi-scale modeling of the data: global (overall deviations), factorization (addressing "regional" effects), collaborative filtering (extracting local patterns).
2.5.2 Modeling Local & Global Effects
Global: Baseline estimation Local neighborhood (CF/NN): Final estimate
2.5.3 Collaborative filtering (Item-Item)
In practice we get better estimates if we model deviations: \( b_{xi} = \mu + b_x + b_i \) and \( r_{xi} = b_{xi} + \cfrac{\sum_{j\in N(i;x)} s_{ij}(r_{xj}-b_{xj})}{\sum_{j\in N(i;x)} s_{ij}} \), where \(\mu\) = the overall mean rating, \(b_x\) = the rating deviation of user \(x\), and \(b_i\) = (avg. rating of movie \(i\)) − \(\mu\). Problems/issues: similarity measures are "arbitrary"; pairwise similarities neglect interdependencies among users; taking a weighted average can be restricting.
2.6 Latent Factor Recommender System
Recommendations via Optimization
2.6.1 Latent Factor Models
"SVD" on Netflix data: R ≈ Q⋅ PT SVD gives minimum reconstruction error (Sum of squared errors)
2.7 Finding the Latent Factors
2.8 CUR
Pros & cons: + easy interpretation, + sparse basis; − duplicate columns and rows (columns with large norms will be sampled many times).
3 Week 5 Clustering
3.1 Bradley-Fayyad-Reina (BFR) Algorithm
BFR is a variant of k-means for very large (disk-resident) data sets Assumes each cluster is normally distributed around a centroid in Euclidean space
3.1.1 BFR Algorithm
Points are read from disk one main-memory-full at a time. Most points from previous loads are summarized by simple statistics. To begin, we select the initial k centroids from the first load.
3.1.2 Three Classes of Points
Discard set (DS): Points close enough to a centroid to be summarized Compression set (CS): Groups of points that are close together but not close to any existing centroid These points are summarized, but not assigned to a cluster Retained set (RS): isolated points waiting to be assigned to a compression set
3.1.3 Summarizing Sets of Points
For each cluster, DS is summarized by: The number of points, N The vector SUM, whose ith component = sum of the coordinates of the points in ith dimension The vector SUMSQ: ith component = sum of squares of coordinates in ith dimension centroid and variance can be calculated by N, SUM and SUMSQ
3.1.4 Processing a chunk of points
Consider merging compressed sets in the CS If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster
3.1.5 A Few Details…
Q1) How do we decide that a point is "close enough" to a cluster to be added to its DS (and discarded)? BFR approach: the Mahalanobis distance to the centroid is less than a threshold (e.g., within 3σ), i.e., there is a high likelihood the point belongs to the currently nearest centroid. Q2) Should two CS subclusters be combined? Combine if the variance of the combined subcluster is below some threshold. Many alternatives: treat dimensions differently, consider density.
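A minimal sketch of a DS summary (N, SUM, SUMSQ from 3.1.3) together with the Mahalanobis test from Q1; the class name and the exact 3σ-style threshold are illustrative and left to the caller.

```python
import numpy as np

class ClusterSummary:
    def __init__(self, dim):
        self.N, self.SUM, self.SUMSQ = 0, np.zeros(dim), np.zeros(dim)

    def add(self, point):
        self.N += 1
        self.SUM += point
        self.SUMSQ += point ** 2

    def centroid(self):
        return self.SUM / self.N

    def variance(self):                           # per-dimension variance from N, SUM, SUMSQ
        return self.SUMSQ / self.N - (self.SUM / self.N) ** 2

    def mahalanobis(self, point):
        sigma = np.sqrt(np.maximum(self.variance(), 0)) + 1e-12
        return np.sqrt(np.sum(((point - self.centroid()) / sigma) ** 2))

# A point might join the DS if its Mahalanobis distance to the nearest
# centroid is below a threshold, e.g. 3 * sqrt(dim) ("within 3 sigma per dimension").
```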
3.2 CURE Algorithm
Limitations of the BFR algorithm: it makes strong assumptions (normally distributed, centroid-shaped clusters) and does not work on clusters that are not linearly separable.
3.2.1 Clustering Using REpresentatives:
Assumes a Euclidean distance Allows clusters to assume any shape Uses a collection of representative points to represent cluster
3.2.2 Starting CURE
Pass 1 of 2: pick a random sample of points that fits in main memory; cluster the sample points hierarchically to create the initial clusters; pick representative points: for each cluster, pick k representative points, as dispersed as possible, then move each representative point a fixed fraction (e.g., 20%) toward the centroid of the cluster. Pass 2 of 2: rescan the whole dataset, visit each point p, and place it in the "closest cluster" (the cluster with the closest representative point).
3.3 Performance-based Advertising
Matching problem: find a maximum matching for a given bipartite graph (a perfect one if it exists). There is a polynomial-time offline algorithm based on augmenting paths. Online graph matching problem (girls → boys): in each round, one girl's choices are revealed, and at that time we must decide either to pair her with one of her choices or not at all. Example application: assigning tasks to servers.
3.3.1 Greedy Algorithm
Just pick any eligible boy for each new girl. Competitive ratio = \( \min_{\text{all possible inputs } I} \frac{|M_{greedy}|}{|M_{opt}|} \), i.e., the worst-case performance of the greedy algorithm over all possible inputs.
3.3.2 Analyzing the Greedy Algorithm
\( |M_{opt}| \leq 2\,|M_{greedy}| \), so greedy's competitive ratio is at least 1/2.
3.4 Algorithmic Challenges
Performance-based advertising works! Multi-billion-dollar industry What ads to show for a given query?
3.4.1 AdWords Problems
A stream of queries arrives at the search engine: \(q_1, q_2, \ldots\). Several advertisers bid on each query. When query \(q_i\) arrives, the search engine must pick a subset of advertisers whose ads are shown. Goal: maximize the search engine's revenue. Clearly we need an online algorithm!
3.4.2 Expected Revenue
CTR (click through rate) * Bid
3.4.3 Adwords Problem
Given: a set of bids by advertisers for search queries; a click-through rate for each advertiser-query pair; a budget for each advertiser (say, for one day or month); and a limit on the number of ads to be displayed with each search query. Respond to each search query with a set of advertisers such that: the size of the set is no larger than the limit; each advertiser has bid on the search query; and each advertiser has enough budget left to pay for the ad if it is clicked on.
3.4.4 Limitations of Simple Algorithm
The CTR of an ad is unknown. Advertisers have limited budgets and bid on multiple ads (handled by the BALANCE algorithm).
3.4.5 Estimating CTR
CTR for a query-ad pair is measured historically, averaged over a time period. Some complications we won't cover in this lecture: CTR is position dependent; explore vs. exploit: keep showing ads we already know the CTR of, or show new ads to estimate their CTR?
3.5 The BALANCE Algorithms
3.5.1 Dealing with Limited Budgets
Simplest algorithm is greedy.
3.5.2 Bad Scenario for Greedy
Two advertisers A and B. A bids on query x; B bids on x and y. Both have budgets of $4. Query stream: x x x x y y y y. Worst-case greedy choice: B B B B _ _ _ _. Optimal: A A A A B B B B. This is the worst case, and it is deterministic: greedy always gives the same answer in the same situation.
3.5.3 BALANCE Algorithm [MSVV]
For each query, pick the advertiser with the largest unspent budget. Break ties arbitrarily (but in a deterministic way).
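A minimal sketch of BALANCE, run on the bad-for-greedy scenario above (all bids = 1, deterministic tie-breaking by advertiser name); the function name and data layout are illustrative.

```python
def balance(query_stream, bidders, budgets):
    """bidders: {query: [advertisers bidding on it]}; budgets: {advertiser: budget}; all bids = 1."""
    spent = {a: 0 for a in budgets}
    allocation = []
    for q in query_stream:
        eligible = [a for a in bidders.get(q, []) if spent[a] < budgets[a]]
        if not eligible:
            allocation.append(None)
            continue
        # Largest unspent budget wins; ties broken deterministically by name.
        chosen = max(eligible, key=lambda a: (budgets[a] - spent[a], a))
        spent[chosen] += 1
        allocation.append(chosen)
    return allocation

# balance("xxxxyyyy", {"x": ["A", "B"], "y": ["B"]}, {"A": 4, "B": 4})
# A and B alternate on the x's, leaving B budget for two y's: 6 queries served
# (greedy's worst case serves 4; the optimal allocation serves 8).
```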
3.5.4 Analyzing 2-advertiser BALANCE
BALANCE must exhaust at least one advertiser's budget: if not, we can allocate more queries Assume BALANCE exhausts A2's budget
3.5.5 BALANCE: General Result
In the general case, the worst-case competitive ratio of BALANCE is \(1-1/e \approx 0.63\). Interestingly, no online algorithm has a better competitive ratio.
3.6 Worst case for BALANCE
N advertisers: \(A_1, A_2, \ldots, A_N\). Queries: \(N\cdot B\) queries appear in \(N\) rounds of \(B\) queries each. Bidding: round 1 queries: bidders \(A_1, A_2, \ldots, A_N\); round 2 queries: bidders \(A_2, \ldots, A_N\); round \(i\) queries: bidders \(A_i, \ldots, A_N\).
3.6.1 BALANCE Allocation
After k rounds, the allocation to advertiser k is: \( S_k = \sum\limits_{1\leq i \leq k} \frac{B}{N-i+1} \)
3.6.2 BALANCE: Analysis
Fact (due to Euler): \( \sum_{i=1}^{n} \frac{1}{i} \approx \ln n \) for large \(n\). The budget runs out when \( \ln N - 1 = \ln(N - k) \), i.e., \( k = N(1 - 1/e) \). So after the first \( k = N(1-1/e) \) rounds, we cannot allocate a query to any advertiser with remaining budget. Revenue = \( B\cdot N(1-1/e) \), so the competitive ratio = \( 1 - 1/e \).
3.6.3 General Version of the Problem
So far: all bids = 1 and all budgets are equal (= B). In the general setting, BALANCE can be terrible.
3.6.4 Generalized BALANCE
Consider query \(q\) and bidder \(i\), with bid \(x_i\), budget \(b_i\), and amount spent so far \(m_i\). Fraction of budget left over: \( f_i = 1 - m_i/b_i \). Define \( \phi_i(q) = x_i(1-e^{-f_i}) \). Allocate query \(q\) to the bidder \(i\) with the largest value of \( \phi_i(q) \). Same competitive ratio (\(1 - 1/e\)).
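A minimal sketch of one allocation step under this generalized rule; the dictionary layout and function name are illustrative.

```python
import math

def allocate_query(bids, budgets, spent):
    """bids: {bidder: bid x_i on this query}; budgets: {bidder: b_i}; spent: {bidder: m_i}. Mutates spent."""
    best, best_score = None, float("-inf")
    for i, x_i in bids.items():
        if spent[i] + x_i > budgets[i]:
            continue                              # not enough budget left to pay this bid
        f_i = 1 - spent[i] / budgets[i]           # fraction of budget left over
        score = x_i * (1 - math.exp(-f_i))        # phi_i(q)
        if score > best_score:
            best, best_score = i, score
    if best is not None:
        spent[best] += bids[best]
    return best
```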
4 Week 6
4.1 Soft-Margin SVMs
Hinge Loss
5 Week 7
5.1 LSH Families of Hash Functions
5.1.1 Hash Functions Decide Equality
There is a subtlety about what a "hash function" really is in the context of an LSH family. A hash function h really takes two elements x and y and returns a decision whether x and y are candidates for comparison. E.g.: the family of minhash functions computes minhash values and says "yes" iff they are the same. Shorthand: "h(x) = h(y)" means h says "yes" for the pair of elements x and y.
5.1.2 LSH Families Defined
Suppose we have a space S of points with a distance measure d. A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:
- If \( d(x,y) \leq d_1 \), then the probability over all h in H, that h(x) = h(y) is at least p1.
- If \( d(x,y) \geq d_2 \), then the probability over all h in H, that h(x) = h(y) is at most p2.
5.1.3 E.g.: LS Family
Let S = sets, d = Jaccard distance, and let H be formed from the minhash functions for all permutations. Then Prob[h(x)=h(y)] = 1 - d(x,y). This restates the theorem about Jaccard similarity and minhashing in terms of Jaccard distance. Claim: H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d.
5.1.4 Amplifying a LSH-Family
The "bands" technique we learned for signature matrices carries over to this more general setting. Goal: the "S-curve" effect seen here. AND construction like "rows in a band." OR construction like "many bands."
5.1.5 AND of Hash Functions
Given family H, construct family H' whose members each consist of \(r\) functions from H. For \( h = \{h_1, \ldots, h_r\} \) in H', h(x) = h(y) iff \(h_i(x) = h_i(y)\) for all \(i\). Theorem: if H is \((d_1, d_2, p_1, p_2)\)-sensitive, then H' is \((d_1, d_2, (p_1)^r, (p_2)^r)\)-sensitive. Proof: use the fact that the \(h_i\)'s are independent.
5.1.6 OR of Hash Functions
Given family H, construct family H' whose members each consist of \(b\) functions from H. For \( h = \{h_1, \ldots, h_b\} \) in H', h(x) = h(y) iff \(h_i(x) = h_i(y)\) for some \(i\). Theorem: if H is \((d_1, d_2, p_1, p_2)\)-sensitive, then H' is \((d_1, d_2, 1-(1-p_1)^b, 1-(1-p_2)^b)\)-sensitive.
5.1.7 Effect of AND and OR Constructions
AND makes all probabilities shrink, but by choosing r correctly, we can make the lower probability approach 0 while the higher does not. OR makes all probabilities grow, but by choosing b correctly, we can make the higher probability approach 1 while the lower does not.
5.1.8 Composing Constructions
As for the signature matrix, we can use the AND construction followed by the OR construction. Or vice-versa. Or any sequence of AND's and OR's alternating.
5.1.9 AND-OR Composition
Each of the two probabilities \(p\) is transformed into \(1-(1-p^r)^b\): the "S-curve" studied before. E.g.: take H and construct H' by the AND construction with r = 4; then, from H', construct H'' by the OR construction with b = 4, giving \(1-(1-p^4)^4\).
5.1.10 OR-AND Composition
Each of the two probabilities \(p\) is transformed into \((1-(1-p)^b)^r\): the same S-curve, mirrored horizontally and vertically.
5.1.11 Cascading Constructions
E.g.: apply the (4,4) OR-AND construction followed by the (4,4) AND-OR construction. This transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.9999996,.0008715)-sensitive family.
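A quick arithmetic check of these numbers (pure arithmetic, no assumptions beyond the formulas above):

```python
def and_or(p, r, b):
    return 1 - (1 - p ** r) ** b          # AND with r, then OR with b

def or_and(p, b, r):
    return (1 - (1 - p) ** b) ** r        # OR with b, then AND with r

for p in (0.2, 0.8):
    print(p, "->", and_or(or_and(p, 4, 4), 4, 4))
# 0.2 -> ~0.00087 and 0.8 -> ~0.9999996, matching the sensitivity claim above.
```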
5.1.12 General Use of S-Curves
For each S-curve \(1-(1-p^r)^b\), there is a threshold \(t\) for which \(1-(1-t^r)^b = t\). Above \(t\), probabilities are increased; below \(t\), they are decreased. You improve the sensitivity as long as the low probability is less than \(t\) and the high probability is greater than \(t\). Iterate as you like.
5.2 More LSH Families
For cosine distance, there is a technique analogous to minhashing for generating a \((d_1, d_2, (1-d_1/180), (1-d_2/180))\)-sensitive family for any \(d_1\) and \(d_2\): random hyperplanes.
5.2.1 Random Hyperplanes
Each vector v determines a hash function \(h_v\) with two buckets: \(h_v(x) = +1\) if \( v \cdot x > 0 \), and \(h_v(x) = -1\) if \( v \cdot x < 0 \). LS-family H = the set of all functions derived from any vector. Claim: Prob[h(x)=h(y)] = 1 - (angle between x and y, divided by 180).
5.2.2 Signatures for Cosine Distance
Pick some number of vectors and hash your data for each vector. The result is a signature (sketch) of +1's and -1's that can be used for LSH like the minhash signatures for Jaccard distance. But you don't have to think of it this way: the existence of the LSH family is sufficient for amplification by AND/OR.
5.2.3 Simplification
We need not pick from among all possible vectors v to form a component of a sketch. It suffices to consider only vectors v whose components are +1 and -1.
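A minimal sketch of cosine-distance sketches using random ±1 vectors, with the fraction of disagreeing components estimating the angle; the dimensions, number of planes, and random seed are arbitrary choices.

```python
import numpy as np

def sketch(x, planes):
    return np.where(planes @ x > 0, 1, -1)       # one +/-1 per random hyperplane

def estimated_angle(sig_x, sig_y):
    disagree = np.mean(sig_x != sig_y)           # Prob[h(x) != h(y)] ~ angle / 180
    return 180.0 * disagree

rng = np.random.default_rng(0)
planes = rng.choice([-1.0, 1.0], size=(200, 5))  # 200 random +/-1 vectors in 5 dimensions
x, y = rng.normal(size=5), rng.normal(size=5)
print(estimated_angle(sketch(x, planes), sketch(y, planes)))
```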
5.2.4 LSH for Euclidean Distance
Simple idea: hash functions correspond to lines. Partition the line into buckets of size a. Hash each point to the bucket containing its projection onto the line. Nearby points are always close; distant points are rarely in the same bucket.
If points are at distance \( \geq 2a \), the angle between the line and the segment joining them must satisfy \( 60 \leq \theta \leq 90 \) for there to be a chance that the points go in the same bucket, i.e., at most 1/3 probability. If points are at distance \( \leq a/2 \), then there is at least a 1/2 chance they share a bucket. This yields an \((a/2, 2a, 1/2, 1/3)\)-sensitive family of hash functions.
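A minimal sketch of the line-projection hash; the random line, bucket width a, and function names are illustrative assumptions.

```python
import numpy as np

def make_line_hash(dim, a, rng=np.random.default_rng(0)):
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)                       # random unit vector = the "line"
    return lambda x: int(np.floor(v @ x / a))    # bucket index of the projection

h = make_line_hash(dim=3, a=1.0)
print(h(np.array([0.1, 0.2, 0.3])), h(np.array([5.0, 5.0, 5.0])))  # same bucket only if projections are close
```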
5.2.5 Fixup: Euclidean Distance
For previous distance measures, we could start with a \((d,e,p,q)\)-sensitive family for any \(d < e\) and drive p and q to 1 and 0 by AND/OR constructions. Here, we seem to need \( e \geq 4d \). But as long as \(d < e\), the probability of points at distance d falling in the same bucket is greater than the probability of points at distance e doing so. Thus, the hash family formed by projecting onto lines is a \((d,e,p,q)\)-sensitive family for some p > q.
5.3 Topic Specific (aka Personalized) PageRank
Instead of generic popularity, can we measure popularity within a topic? Goal: evaluate Web pages not just by their popularity, but by how close they are to a particular topic, e.g. "sports" or "history"; allow search queries to be answered based on the interests of the user. E.g.: the query "Trojan" wants different pages depending on whether you are interested in sports, history, or computer security. The random walker has a small probability of teleporting at any step. The teleport can go to: standard PageRank: any page with equal probability (to avoid dead-end and spider-trap problems); topic-specific PageRank: a topic-specific set of "relevant" pages (the teleport set). Idea: bias the random walk. When the walker teleports, she picks a page from a set S; S contains only pages that are relevant to the topic, e.g., Open Directory (DMOZ) pages for a given topic/query. For each teleport set S, we get a different vector \(r_S\).
5.3.1 Matrix Formulation
To make this work, all we need is to update the teleportation part of the PageRank formulation:
\begin{equation} A_{ij} = \begin{cases} \beta M_{ij}+(1-\beta)/|S| &\mbox{if}\ i\in S \\ \beta M_{ij} & \mbox{otherwise} \end{cases} \end{equation}
A is stochastic! We weighted all pages in the teleport set S equally, but we could also assign different weights to pages. Random walk with restart: S is a single element. Compute as for regular PageRank: multiply by M, then add a vector; this maintains sparseness.
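A minimal power-iteration sketch with teleports restricted to S; the column-stochastic matrix M, uniform teleport weights, and iteration count are assumptions for illustration.

```python
import numpy as np

def topic_specific_pagerank(M, S, beta=0.85, iters=100):
    """M: column-stochastic link matrix (n x n); S: indices of pages in the teleport set."""
    n = M.shape[0]
    teleport = np.zeros(n)
    teleport[list(S)] = 1.0 / len(S)              # equal weight on the teleport set; could be non-uniform
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = beta * (M @ r) + (1 - beta) * teleport   # multiply by M, then add the teleport vector
    return r
```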