Mining-Massive-Datasets
Table of Contents
- 1.1. The Affiliation Graph Model
- 1.2. From AGM to BigCLAM
- 1.3. Solving the BigCLAM
- 1.4. Detecting Clusters
- 1.5. The Graph Laplacian Matrix
- 1.6. Spectral Clustering Algorithms
- 1.7. Trawling
- 1.8. Mining Data Streams
- 1.9. Bloom Filters
- 1.10. Counting 1's
- 1.11. Sampling Streams
- 1.12. Counting Distinct Elements
- 5.1. LSH Families of Hash Functions
- 5.1.1. Hash Functions Decide Equality
- 5.1.2. LSH Families Defined
- 5.1.3. E.g.: LS Family
- 5.1.4. Amplifying a LSH-Family
- 5.1.5. AND of Hash Functions
- 5.1.6. OR of Hash Functions
- 5.1.7. Effect of AND and OR Constructions
- 5.1.8. Composing Constructions
- 5.1.9. AND-OR Composition
- 5.1.10. OR-AND Composition
- 5.1.11. Cascading Constructions
- 5.1.12. General Use of S-Curves
- 5.2. More LSH Families
- 5.3. Topic Specific (aka Personalized) PageRank
0.1 Improvements to A-Priori
0.1.1 PCY Algorithm (Park-Chen-Yu Algorithm)
1st pass: count individual items and hash each pair to a bucket, keeping a count per bucket. 2nd pass: count only those pairs of frequent items that hash to a frequent bucket (recorded in a bitmap).
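A minimal PCY sketch of the two passes, assuming the baskets fit in a Python list and using Python's built-in `hash` as the bucket hash; all names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def pcy(baskets, support, num_buckets=10007):
    # Pass 1: count singletons and hash every pair into a bucket counter.
    item_counts = defaultdict(int)
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        for item in basket:
            item_counts[item] += 1
        for pair in combinations(sorted(set(basket)), 2):
            bucket_counts[hash(pair) % num_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= support}
    bitmap = [c >= support for c in bucket_counts]   # frequent-bucket bitmap

    # Pass 2: count only pairs of frequent items that hash to a frequent bucket.
    pair_counts = defaultdict(int)
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for pair in combinations(items, 2):
            if bitmap[hash(pair) % num_buckets]:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= support}
```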
0.1.2 Multistage
Two hash tables, two bitmaps, three passes. Important points: the hash functions have to be independent, and we need to check both hashes on the third pass. Otherwise we would count pairs of frequent items that hash to an infrequent bucket under one hash but to a frequent bucket under the other.
0.1.3 Multihash
Key idea: use multiple independent hash tables in 1st pass.
0.1.4 Single-Pass Approximate Algorithms
0.2 All (Or Most) Frequent Itemsets In ≤ 2 Passes
0.2.1 Simple Algorithm
Take a random sample of the market baskets.
0.2.2 Savasere-Omiecinski-Navathe (SON) Algorithm
Only works if the candidate itemsets can be counted in main memory. Process the file in subsets (chunks). Key idea, "monotonicity": an itemset cannot be frequent in the whole file unless it is frequent in at least one subset.
0.2.3 Toivonen's Algorithm
No false negatives, but in some cases it may not give an answer, so you need to rerun it; there is no guarantee of finishing. Start as in the simple algorithm, but lower the support threshold slightly for the sample. Goal: avoid missing any itemset that is frequent in the whole dataset.
Negative border: itemsets that are not frequent in the sample but all of whose immediate subsets are, e.g. {A, B, C, D} when every size-3 subset of it is frequent in the sample.
1 Week 3
1.1 The Affiliation Graph Model
1.1.1 Community-Affiliation Graph
Communities \(C\), memberships \(M\), nodes \(V\). The AGM generates a graph: \( P(u, v) = 1 - \prod\limits_{c\in M_u \cap M_v}(1-p_c) \)
1.1.2 AGM: Flexibility
Can express a variety of community structures: non-overlapping, overlapping, nested.
1.2 From AGM to BigCLAM
\(F_{uA}\): the membership strength of node \(u\) to community \(A\). Each community \(A\) links nodes independently: \( P_A(u, v) = 1 - \exp(-F_{uA}\cdot F_{vA}) \)
1.2.1 Community Membership Strength: Factor Matrix F
Then the probability that at least one common community \(C\) links them: \( P(u,v) = 1 - \prod\limits_C (1-P_C(u,v)) \)
\begin{align*} P(u,v) &= 1 - \exp(-\sum_C F_{uC}\cdot F_{vC}) \\ &= 1 - \exp(-F_u\cdot F_v^{T}) \end{align*}
1.3 Solving the BigCLAM
1.3.1 How to find F
Given a network \(G(V,E)\), estimate \(F\): \( \arg\max_F \prod\limits_{(u,v)\in E} P(u,v) \prod\limits_{(u,v)\not\in E} (1-P(u,v)) \). Take the log-likelihood \(l(F_u)\):
\begin{equation*} l(F_u) = \sum\limits_{v\in N(u)} \log(1-\exp(-F_u F_v^T)) - \sum\limits_{v\not \in N(u)} F_u F_v^T \end{equation*}
1.3.2 BigCLAM: V1.0
Gradient ascent is slow because evaluating \( \nabla l(F_u) \) takes time linear in the total number of nodes.
1.3.3 BigCLAM: V2.0
\( \sum\limits_{v\not \in N(u)} F_v = \sum\limits_v F_v - F_u - \sum\limits_{v\in N(u)} F_v \). We cache \( \sum\limits_v F_v \), so the update now takes time linear in the degree \(|N(u)|\) of \(u\).
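A minimal sketch of one gradient-ascent step for a single row \(F_u\), using the cached column sum \(\sum_v F_v\); the function names, learning rate, and data layout are illustrative assumptions.

```python
import numpy as np

def gradient_step(F, adj, u, sum_F, lr=0.01):
    """F: (n_nodes, K) nonnegative factor matrix; adj: dict node -> set of neighbors; sum_F: cached F.sum(axis=0)."""
    neighbors = list(adj[u])
    Fv = F[neighbors]                              # (deg(u), K)
    dots = Fv @ F[u]                               # F_u . F_v for each neighbor v
    weights = np.exp(-dots) / (1.0 - np.exp(-dots) + 1e-12)
    grad_pos = (weights[:, None] * Fv).sum(axis=0)        # sum over neighbors
    grad_neg = sum_F - F[u] - Fv.sum(axis=0)               # sum over non-neighbors via the cached total
    F[u] = np.maximum(0.0, F[u] + lr * (grad_pos - grad_neg))  # project onto F_u >= 0
    return F[u]
```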
1.4 Detecting Clusters
Goal: find densely linked clusters (discovering social circles, circles of trust). Graph partitioning criterion: conductance \(\phi(A) = \cfrac{CUT(A)}{VOL(A)}\), where \( vol(A) = \sum\limits_{i\in A} d_i \) is the total weight of the edges with at least one endpoint in \(A\). Why use this criterion? It produces more balanced partitions.
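A minimal sketch of this criterion for an unweighted graph given as adjacency sets; names are illustrative.

```python
def conductance(adj, A):
    """adj: dict node -> set of neighbors; A: iterable of nodes."""
    A = set(A)
    cut = sum(1 for u in A for v in adj[u] if v not in A)   # edges leaving A
    vol = sum(len(adj[u]) for u in A)                        # sum of degrees inside A
    return cut / vol if vol else float("inf")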
1.5 The Graph Laplacian Matrix
Graph partitioning task: partition the graph into two pieces such that the resulting pieces have low conductance. Problem: computing the optimal cut is NP-hard. \(A\): adjacency matrix of undirected \(G\), with \(A_{ij} = 1\) if \((i, j)\) is an edge, else 0. \(x\): a vector of labels/values, one per node of \(G\); spectral methods study the product \(A\cdot x\). The graph Laplacian is \(L = D - A\) (with \(D\) the diagonal degree matrix), and \( \lambda_2 = \min\limits_{x \perp \mathbf{1}} \cfrac{x^T L x}{x^T x} \), where \( x^T L x = \sum\limits_{(i,j)\in E} (x_i - x_j)^2 \). The minimizing \(x\) is the Fiedler vector.
1.6 Spectral Clustering Algorithms
Three basic stages:
- Pre-processing
Construct a matrix representation of the graph
- Decomposition
Compute eigenvalues and eigenvectors of the matrix Map each point to a lower-dimensional representation based on one or more eigenvectors
- Grouping
Assign points to two or more clusters, based on the new representation
K-way spectral clustering: recursive bi-partitioning ('92) has disadvantages (inefficient, unstable); clustering multiple eigenvectors ('00) is preferable.
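A minimal sketch of the three stages for a single bi-partition using the Fiedler vector; dense numpy linear algebra, so only suitable for small graphs.

```python
import numpy as np

def spectral_bipartition(A):
    """A: symmetric 0/1 adjacency matrix as a numpy array."""
    D = np.diag(A.sum(axis=1))
    L = D - A                                # pre-processing: graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)     # decomposition: eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                  # eigenvector of the second-smallest eigenvalue
    return fiedler >= 0                      # grouping: split by sign
```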
1.7 Trawling
Searching for small communities in the Web graph. Frequent itemsets = complete bipartite graphs! View each node \(i\) as a set \(S_i\) of the nodes \(i\) points to. A \(K_{s,t}\) corresponds to a set \(Y\) of size \(t\) that occurs in \(s\) of the sets \(S_i\): \(s\) is the minimum support (\(|S| = s\)) and \(t\) is the itemset size (\(|Y| = t\)).
1.8 Mining Data Streams
1.8.1 The Stream Model
Data management vs. stream management: in a DBMS (e.g., SQL), the input rate is controlled internally (inserts). Stream management is important when the input rate is controlled externally, e.g., Google queries.
1.8.2 Sliding Windows
Queries are about a window of the last \(N\) elements. What if \(N\) is larger than main memory? E.g., averages: for a stream of integers, a standing query asks for the average of the integers in the window of size \(N\). For the first \(N\) inputs, simply sum and count to get the average. Afterwards, when a new input \(i\) arrives and the oldest element \(j\) leaves the window, change the average by adding \((i-j)/N\).
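A minimal sketch of this standing query; it still stores the whole window, so it only covers the case where \(N\) fits in memory.

```python
from collections import deque

class WindowAverage:
    def __init__(self, N):
        self.N, self.window, self.avg = N, deque(), 0.0

    def add(self, i):
        if len(self.window) < self.N:            # first N inputs: build the average directly
            self.window.append(i)
            self.avg = sum(self.window) / len(self.window)
        else:                                    # afterwards: adjust by (i - j) / N
            j = self.window.popleft()
            self.window.append(i)
            self.avg += (i - j) / self.N
        return self.avg
```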
1.8.3 Counting 1's
1.9 Bloom Filters
Bloom filters can have false positives but no false negatives. A Bloom filter is an array of bits together with a number of hash functions; the argument of each hash function is a stream element, and it returns a position in the array. E.g.: use \(N = 11\) bits for our filter, stream elements = integers, and two hash functions: \(h_1(x)\) is computed from the odd-numbered bits of \(x\), \(h_2(x)\) the same but from the even-numbered bits. Insertion sets the bit at \(h(x)\) for every hash function \(h\). Bloom filter lookup of \(y\): compute \(h(y)\) for each hash function \(h\); report \(y\) as present iff all the resulting bits are 1.
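A minimal Bloom-filter sketch; the concrete hash functions here are illustrative stand-ins for the odd/even-bit hashes in the lecture example.

```python
class BloomFilter:
    def __init__(self, n_bits=11, hashes=None):
        self.bits = [0] * n_bits
        # Illustrative hash functions; any independent hashes into [0, n_bits) work.
        self.hashes = hashes or [lambda x: hash(("h1", x)) % n_bits,
                                 lambda x: hash(("h2", x)) % n_bits]

    def add(self, x):
        for h in self.hashes:
            self.bits[h(x)] = 1

    def might_contain(self, y):
        # "Yes" iff every hash lands on a 1 bit: false positives possible, false negatives not.
        return all(self.bits[h(y)] for h in self.hashes)
```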
1.10 Counting 1's
Counting Bits Problem: given a stream of 0's and 1's, be prepared to answer queries of the form "how many 1's in the last k bits?" where k \(\leq\) N
1.10.1 DGIM Algorithm
\(O(\log^2 N)\) bits per stream
- Timestamps
Each bit in the stream has a timestamp, starting 0, 1, … Record timestamps modulo \(N\) (the window size), so we can represent any relevant timestamp in \(O(\log_2 N)\) bits.
- Buckets
A bucket is a segment of the window; it is represented by a record consisting of:
- The timestamp of its end [O(log N) bits]
- The number of 1's between its beginning and end; this count is always a power of 2 [O(log log N) bits].
- Representing a Stream by Buckets
There are either one or two buckets of each allowed (power-of-2) size. Buckets do not overlap. Buckets are sorted by size: older buckets are at least as large as newer ones. A bucket disappears when its end-time is more than \(N\) time units in the past.
- Updating Buckets
When a new bit comes in, drop the oldest bucket if its end-time is more than \(N\) time units before the current time. If the current bit is 0, make no other changes. If the current bit is 1 (see the sketch after this list):
- Create a new bucket of size 1, for just this bit. End timestamp = current time.
- If there are now three buckets of size 1, combine the oldest two into a bucket of size 2
- If there are now three buckets of size 2, combine the oldest two into a bucket of size 4, and so on up the sizes.
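A minimal sketch of the update rule, with buckets stored newest-first as (end timestamp, size) pairs; absolute timestamps are used here instead of timestamps mod N for clarity.

```python
def dgim_update(buckets, bit, t, N):
    """buckets: list of (end_timestamp, size), newest first; bit: the arriving 0/1; t: its timestamp."""
    # Drop the oldest bucket if its end-time falls outside the window of length N.
    if buckets and buckets[-1][0] <= t - N:
        buckets.pop()
    if bit == 1:
        buckets.insert(0, (t, 1))                # new size-1 bucket ending now
        size = 1
        while True:                              # merge whenever three buckets share a size
            same = [i for i, b in enumerate(buckets) if b[1] == size]
            if len(same) < 3:
                break
            i, j = same[-2], same[-1]            # the two oldest buckets of this size
            buckets[i] = (buckets[i][0], 2 * size)   # merged bucket keeps the newer end-timestamp
            del buckets[j]
            size *= 2
    return buckets
```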
1.11 Sampling Streams
1.11.1 When Sampling Doesn't Work
E.g., Google wants to find the number of unique queries over the past month; sampling stream positions at a fixed rate does not answer this correctly, which motivates sampling based on a hash of the key (next section).
1.11.2 Sampling Based on Hash Value
E.G.: Fixed Sample Size Sampling Key-Value Pairs
1.11.3 Counting Distinct Items
1.11.4 Computing Moments
1.12 Counting Distinct Elements
Problem: a data stream consists of elements chosen from a set of size n. Maintain a count of the number of distinct elements seen so far. Applications: How many different words are found among the web pages being crawled at a site? How many unique users visited Facebook in the past month?
1.12.1 Estimating Counts
Flajolet-Martin approach: pick a hash function \(h\) that maps each of the \(n\) elements to at least \(\log_2 n\) bits. For each stream element \(a\), let \(r(a)\) be the number of trailing 0's in \(h(a)\). Record \(R\) = the maximum \(r(a)\) seen. Estimate = \(2^R\).
- Why it works
The probability that a given \(h(a)\) ends in at least \(i\) 0's is \(2^{-i}\). If there are \(m\) different elements, the probability that \(R \geq i\) is \(1-(1-2^{-i})^m\). Since \(2^{-i}\) is small, \(1-(1-2^{-i})^m \approx 1 - e^{-m2^{-i}}\).
- Solution
Partition your samples (many independent estimates \(2^R\)) into small groups; \(\log n\) groups, where \(n\) = size of the universal set, suffices. Take the average within each group, then take the median of the averages.
- Generalization: Moments
AMS (Alon-Matias-Szegedy). The 2nd moment is \( \sum_a m_a^2 \), where \(m_a\) is the number of occurrences of element \(a\). Pick a random time \(t\) and let \(c_t\) be the number of times the element at time \(t\) appears from that time on; define \(X = n(2c_t - 1)\). Then \( E(X) = \frac{1}{n}\sum_{t} n\,(2c_t - 1) = \sum_a (1+3+5+\cdots+(2m_a-1)) = \sum_a m_a^2 \). Problem: streams never end, so \(n\) keeps growing; fixups are needed to choose the sample positions fairly.
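A minimal sketch of the AMS second-moment estimator over a finite stream; the position-sampling fixups for unbounded streams are omitted, and the number of variables is an arbitrary choice.

```python
import random

def ams_second_moment(stream, n_vars=100):
    stream = list(stream)
    n = len(stream)
    positions = random.sample(range(n), min(n_vars, n))   # one variable X per sampled time t
    estimates = []
    for t in positions:
        a = stream[t]
        c = sum(1 for x in stream[t:] if x == a)   # times the element at time t appears from t on
        estimates.append(n * (2 * c - 1))
    return sum(estimates) / len(estimates)
```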
Worked example with \( h(x) = (3x + 7) \bmod 11 \):

| x | h(x) | binary | trailing 0's r(x) |
|---|------|--------|-------------------|
| 1 | 10 | 1010 | 1 |
| 2 | 13 mod 11 = 2 | 0010 | 1 |
| 3 | 16 mod 11 = 5 | 0101 | 0 |
| 4 | 19 mod 11 = 8 | 1000 | 3 |
| 5 | 22 mod 11 = 0 | 0000 | 4 (all bits) |
| 6 | 25 mod 11 = 3 | 0011 | 0 |
| 7 | 28 mod 11 = 6 | 0110 | 1 |
| 8 | 31 mod 11 = 9 | 1001 | 0 |
| 9 | 34 mod 11 = 1 | 0001 | 0 |
| 10 | 37 mod 11 = 4 | 0100 | 2 |
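The table above works through the hash by hand. Below is a minimal Flajolet-Martin sketch that automates the same kind of computation, including the group-average / median combiner from the "Solution" bullet; the hash construction and group sizes are illustrative assumptions.

```python
import hashlib
import statistics

def trailing_zeros(x):
    return (x & -x).bit_length() - 1 if x else 32   # cap for the (unlikely) all-zero hash

def fm_estimate(stream, n_groups=10, hashes_per_group=10):
    # R[g][k] tracks the max number of trailing zeros seen under hash (g, k).
    R = [[0] * hashes_per_group for _ in range(n_groups)]
    for a in stream:
        for g in range(n_groups):
            for k in range(hashes_per_group):
                h = int(hashlib.sha1(f"{g}-{k}-{a}".encode()).hexdigest(), 16)
                R[g][k] = max(R[g][k], trailing_zeros(h))
    group_avgs = [sum(2 ** r for r in group) / hashes_per_group for group in R]
    return statistics.median(group_avgs)          # median of the group averages

# print(fm_estimate(["a", "b", "a", "c", "d", "b"]))   # roughly 4 distinct elements
```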
2 Week 4 Recommender System
2.1 Overview
long tail
2.1.1 Types of Recommendations
Editorial and hand-curated; simple aggregates (top 10, most popular, recent uploads); tailored to individual users.
2.1.2 Formal Model
\(C\) = set of customers, \(S\) = set of items. Utility function \(u: C\times S \to R\), where \(R\) = the set of ratings, a totally ordered set.
2.2 Content-based
Main idea: recommend items to customer x that are similar to previous items rated highly by x. Users like → item profiles → user profile.
2.2.1 Item Profiles
For each item, create an item profile. A profile is a set of features (a vector). For text features, the profile = the set of "important" words in the item (document). How to pick important words? TF-IDF.
2.2.2 User Profiles
The user has rated items with profiles \(i_1, \ldots, i_n\). Simple: take the (weighted) average of the rated item profiles. Variant: normalize the weights using the user's average rating.
- Example 1: Boolean Utility Matrix
Items are movies; the only feature is "actor". Suppose user x has watched 5 movies: 2 featuring actor A and 3 featuring actor B. User profile: weight 0.4 on A, 0.6 on B.
- Example 2: Star Ratings
Same example, with 1-5 star ratings. Actor A's movies are rated 3 and 5; actor B's movies are rated 1, 2, and 4. Useful step: normalize ratings by subtracting the user's mean rating (3).
2.2.3 Making predictions
User profile x, item profile i. Estimate \( u(x, i) = \cos(\theta) = \frac{x \cdot i}{\|x\|\,\|i\|} \), i.e., rank items by cosine similarity (equivalently, by small cosine distance).
2.2.4 Pros: Content-based Approach
No need for data on other users. Able to recommend to users with unique tastes. Able to recommend new & unpopular items: no first-rater problem. Explanations for recommended items: the content features that caused an item to be recommended.
2.2.5 Cons: Content-based Approach
Finding the appropriate features is hard. Overspecialization: never recommends items outside the user's content profile, even though people might have multiple interests; unable to exploit quality judgments of other users. Cold-start problem for new users: how to build a user profile?
2.3 Collaborative Filtering
Consider user x Find set N of other users whose ratings are "similar" to x's ratings Estimate x's ratings based on ratings of users in N
Option 1: Jaccard similarity, \( sim(A,B) = |r_A\cap r_B|/|r_A\cup r_B| \). Problem: ignores rating values. Option 2: cosine similarity, \( sim(A,B) = \cos(r_A, r_B) \). Problem: treats missing ratings as negative. Option 3: centered cosine: normalize ratings by subtracting the row mean; also known as Pearson correlation.
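A minimal sketch of centered cosine (Pearson) similarity between two users' rating dictionaries; after centering, missing ratings contribute nothing, which matches treating them as zero.

```python
import math

def centered_cosine(ratings_a, ratings_b):
    """ratings_a, ratings_b: dicts item -> rating for two users."""
    mean_a = sum(ratings_a.values()) / len(ratings_a)
    mean_b = sum(ratings_b.values()) / len(ratings_b)
    common = set(ratings_a) & set(ratings_b)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den = math.sqrt(sum((r - mean_a) ** 2 for r in ratings_a.values())) * \
          math.sqrt(sum((r - mean_b) ** 2 for r in ratings_b.values()))
    return num / den if den else 0.0
```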
2.3.1 Rating Predictions
Let \(r_x\) be the vector of user \(x\)'s ratings, and let \(N\) be the set of \(k\) users most similar to \(x\) who have also rated item \(i\). Prediction for user \(x\) and item \(i\): Option 1: \( r_{xi} = \frac{1}{k} \sum_{y\in N} r_{yi} \). Option 2: \( r_{xi} = \frac{\sum_{y\in N} s_{xy} r_{yi}}{\sum_{y\in N} s_{xy}} \), where \(s_{xy}\) is the similarity between \(x\) and \(y\).
2.3.2 Item-Item Collaborative Filtering
Estimate the rating for item \(i\) based on ratings for similar items: \( r_{xi} = \cfrac{\sum_{j\in N(i;x)} s_{ij}\, r_{xj}}{\sum_{j\in N(i;x)} s_{ij}} \), where \(N(i;x)\) is the set of items rated by \(x\) that are most similar to \(i\).
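A minimal sketch of this prediction formula; the similarity values \(s_{ij}\) are assumed to come from any of the measures above (e.g., the centered cosine sketch).

```python
def predict_item_item(x_ratings, sims_to_i, k=2):
    """x_ratings: {item: rating by user x}; sims_to_i: {item j: s_ij to the target item i}."""
    neighbors = sorted((j for j in x_ratings if j in sims_to_i),
                       key=lambda j: sims_to_i[j], reverse=True)[:k]   # N(i; x)
    num = sum(sims_to_i[j] * x_ratings[j] for j in neighbors)
    den = sum(sims_to_i[j] for j in neighbors)
    return num / den if den else None
```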
2.3.3 Item-Item v. User-User
In theory, user-user and item-item are dual approaches. In practice, item-item outperforms user-user in many use cases: items are "simpler" than users, items belong to a small set of "genres" while users have varied tastes, so item similarity is more meaningful than user similarity.
2.4 Evaluating Recommender System
Root-mean-square error (RMSE). Problems: a narrow focus on accuracy sometimes misses the point: prediction diversity, prediction context, order of predictions. In practice, we care only about predicting high ratings. Alternative: precision at top k.
2.5 Latent Factor Models
2.5.1 A Modern Recommender System
Multi-scale modeling of the data: global (overall deviations), factorization (addressing "regional" effects), collaborative filtering (extracting local patterns).
2.5.2 Modeling Local & Global Effects
Global: Baseline estimation Local neighborhood (CF/NN): Final estimate
2.5.3 Collaborative filtering (Item-Item)
In practice we get better estimates if we model deviations: \( b_{xi} = \mu + b_x + b_i \) and \( r_{xi} = b_{xi} + \cfrac{\sum_{j\in N(i;x)} s_{ij}(r_{xj}-b_{xj})}{\sum_{j\in N(i;x)} s_{ij}} \), where \(\mu\) = the overall mean rating, \(b_x\) = the rating deviation of user \(x\), and \(b_i\) = (avg. rating of movie \(i\)) − \(\mu\). Problems/issues: similarity measures are "arbitrary"; pairwise similarities neglect interdependencies among users; taking a weighted average can be restricting.
2.6 Latent Factor Recommender System
Recommendations via Optimization
2.6.1 Latent Factor Models
"SVD" on Netflix data: R ≈ Q⋅ PT SVD gives minimum reconstruction error (Sum of squared errors)
2.7 Finding the Latent Factors
2.8 CUR
Pros & cons: + easy interpretation, + sparse basis; − duplicate columns and rows (columns with large norms will be sampled many times).
3 Week 5 Clustering
3.1 Bradley-Fayyad-Reina (BFR) Algorithm
BFR is a variant of k-means for very large (disk-resident) data sets Assumes each cluster is normally distributed around a centroid in Euclidean space
3.1.1 BFR Algorithm
Points are read from disk one main-memory-full at a time. Most points from previous loads are summarized by simple statistics. To begin, we select the initial k centroids from the first load.
3.1.2 Three Classes of Points
Discard set (DS): Points close enough to a centroid to be summarized Compression set (CS): Groups of points that are close together but not close to any existing centroid These points are summarized, but not assigned to a cluster Retained set (RS): isolated points waiting to be assigned to a compression set
3.1.3 Summarizing Sets of Points
For each cluster, DS is summarized by: The number of points, N The vector SUM, whose ith component = sum of the coordinates of the points in ith dimension The vector SUMSQ: ith component = sum of squares of coordinates in ith dimension centroid and variance can be calculated by N, SUM and SUMSQ
3.1.4 Processing a chunk of points
Consider merging compressed sets in the CS If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster
3.1.5 A Few Details…
Q1) How do we decide that a point is "close enough" to a cluster to be added to its DS (and discarded)? BFR approach: the Mahalanobis distance to the centroid is less than a threshold (e.g., within 3σ), i.e., there is a high likelihood the point belongs to the currently nearest centroid. Q2) Should two CS subclusters be combined? Combine if the variance of the combined subcluster is below some threshold. Many alternatives: treat dimensions differently, consider density.
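A minimal sketch of a DS summary (N, SUM, SUMSQ from 3.1.3) together with the Mahalanobis test from Q1; the class name and the exact 3σ-style threshold are illustrative and left to the caller.

```python
import numpy as np

class ClusterSummary:
    def __init__(self, dim):
        self.N, self.SUM, self.SUMSQ = 0, np.zeros(dim), np.zeros(dim)

    def add(self, point):
        self.N += 1
        self.SUM += point
        self.SUMSQ += point ** 2

    def centroid(self):
        return self.SUM / self.N

    def variance(self):                           # per-dimension variance from N, SUM, SUMSQ
        return self.SUMSQ / self.N - (self.SUM / self.N) ** 2

    def mahalanobis(self, point):
        sigma = np.sqrt(np.maximum(self.variance(), 0)) + 1e-12
        return np.sqrt(np.sum(((point - self.centroid()) / sigma) ** 2))

# A point might join the DS if its Mahalanobis distance to the nearest
# centroid is below a threshold, e.g. 3 * sqrt(dim) ("within 3 sigma per dimension").
```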
3.2 CURE Algorithm
Limitations of the BFR algorithm: it makes strong assumptions (normally distributed, centroid-shaped clusters) and does not work on clusters that are not linearly separable.
3.2.1 Clustering Using REpresentatives:
Assumes a Euclidean distance Allows clusters to assume any shape Uses a collection of representative points to represent cluster
3.2.2 Starting CURE
Pass 1 of 2: pick a random sample of points that fits in main memory; cluster the sample points hierarchically to create the initial clusters; pick representative points: for each cluster, pick k representative points, as dispersed as possible, then move each representative point a fixed fraction (e.g., 20%) toward the centroid of the cluster. Pass 2 of 2: rescan the whole dataset, visit each point p, and place it in the "closest cluster" (the cluster with the closest representative point).
3.3 Performance-based Advertising
Matching problem: find a maximum matching for a given bipartite graph (a perfect one if it exists). There is a polynomial-time offline algorithm based on augmenting paths. Online graph matching problem (girls → boys): in each round, one girl's choices are revealed, and at that time we must decide either to pair her with one of her choices or not at all. Example application: assigning tasks to servers.
3.3.1 Greedy Algorithm
Just pick any eligible boy for each new girl. Competitive ratio = \( \min_{\text{all possible inputs } I} \frac{|M_{greedy}|}{|M_{opt}|} \), i.e., the worst-case performance of the greedy algorithm over all possible inputs.
3.3.2 Analyzing the Greedy Algorithm
\( |M_{opt}| \leq 2\,|M_{greedy}| \), so greedy's competitive ratio is at least 1/2.
3.4 Algorithmic Challenges
Performance-based advertising works! Multi-billion-dollar industry What ads to show for a given query?
3.4.1 AdWords Problems
A stream of queries arrives at the search engine: \(q_1, q_2, \ldots\). Several advertisers bid on each query. When query \(q_i\) arrives, the search engine must pick a subset of advertisers whose ads are shown. Goal: maximize the search engine's revenue. Clearly we need an online algorithm!
3.4.2 Expected Revenue
CTR (click through rate) * Bid
3.4.3 Adwords Problem
Given: a set of bids by advertisers for search queries; a click-through rate for each advertiser-query pair; a budget for each advertiser (say, for one day or month); and a limit on the number of ads to be displayed with each search query. Respond to each search query with a set of advertisers such that: the size of the set is no larger than the limit; each advertiser has bid on the search query; and each advertiser has enough budget left to pay for the ad if it is clicked on.
3.4.4 Limitations of Simple Algorithm
The CTR of an ad is unknown. Advertisers have limited budgets and bid on multiple ads (handled by the BALANCE algorithm).
3.4.5 Estimating CTR
CTR for a query-ad pair is measured historically, averaged over a time period. Some complications we won't cover in this lecture: CTR is position dependent; explore vs. exploit: keep showing ads we already know the CTR of, or show new ads to estimate their CTR?
3.5 The BALANCE Algorithms
3.5.1 Dealing with Limited Budgets
Simplest algorithm is greedy.
3.5.2 Bad Scenario for Greedy
Two advertisers A and B. A bids on query x; B bids on x and y. Both have budgets of $4. Query stream: x x x x y y y y. Worst-case greedy choice: B B B B _ _ _ _. Optimal: A A A A B B B B. This is the worst case, and it is deterministic: greedy always gives the same answer in the same situation.
3.5.3 BALANCE Algorithm [MSVV]
For each query, pick the advertiser with the largest unspent budget. Break ties arbitrarily (but in a deterministic way).
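A minimal sketch of BALANCE, run on the bad-for-greedy scenario above (all bids = 1, deterministic tie-breaking by advertiser name); the function name and data layout are illustrative.

```python
def balance(query_stream, bidders, budgets):
    """bidders: {query: [advertisers bidding on it]}; budgets: {advertiser: budget}; all bids = 1."""
    spent = {a: 0 for a in budgets}
    allocation = []
    for q in query_stream:
        eligible = [a for a in bidders.get(q, []) if spent[a] < budgets[a]]
        if not eligible:
            allocation.append(None)
            continue
        # Largest unspent budget wins; ties broken deterministically by name.
        chosen = max(eligible, key=lambda a: (budgets[a] - spent[a], a))
        spent[chosen] += 1
        allocation.append(chosen)
    return allocation

# balance("xxxxyyyy", {"x": ["A", "B"], "y": ["B"]}, {"A": 4, "B": 4})
# A and B alternate on the x's, leaving B budget for two y's: 6 queries served
# (greedy's worst case serves 4; the optimal allocation serves 8).
```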
3.5.4 Analyzing 2-advertiser BALANCE
BALANCE must exhaust at least one advertiser's budget: if not, we can allocate more queries Assume BALANCE exhausts A2's budget
3.5.5 BALANCE: General Result
In the general case, the worst-case competitive ratio of BALANCE is \(1-1/e \approx 0.63\). Interestingly, no online algorithm has a better competitive ratio.
3.6 Worst case for BALANCE
N advertisers: \(A_1, A_2, \ldots, A_N\). Queries: \(N\cdot B\) queries appear in \(N\) rounds of \(B\) queries each. Bidding: round 1 queries: bidders \(A_1, A_2, \ldots, A_N\); round 2 queries: bidders \(A_2, \ldots, A_N\); round \(i\) queries: bidders \(A_i, \ldots, A_N\).
3.6.1 BALANCE Allocation
After k rounds, the allocation to advertiser k is: \( S_k = \sum\limits_{1\leq i \leq k} \frac{B}{N-i+1} \)
3.6.2 BALANCE: Analysis
Fact (due to Euler): \( \sum_{i=1}^{n} \frac{1}{i} \approx \ln n \) for large \(n\). The budget runs out when \( \ln N - 1 = \ln(N - k) \), i.e., \( k = N(1 - 1/e) \). So after the first \( k = N(1-1/e) \) rounds, we cannot allocate a query to any advertiser with remaining budget. Revenue = \( B\cdot N(1-1/e) \), so the competitive ratio = \( 1 - 1/e \).
3.6.3 General Version of the Problem
So far: all bids = 1 and all budgets are equal (= B). In the general setting, BALANCE can be terrible.
3.6.4 Generalized BALANCE
Consider query \(q\) and bidder \(i\), with bid \(x_i\), budget \(b_i\), and amount spent so far \(m_i\). Fraction of budget left over: \( f_i = 1 - m_i/b_i \). Define \( \phi_i(q) = x_i(1-e^{-f_i}) \). Allocate query \(q\) to the bidder \(i\) with the largest value of \( \phi_i(q) \). Same competitive ratio (\(1 - 1/e\)).
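A minimal sketch of one allocation step under this generalized rule; the dictionary layout and function name are illustrative.

```python
import math

def allocate_query(bids, budgets, spent):
    """bids: {bidder: bid x_i on this query}; budgets: {bidder: b_i}; spent: {bidder: m_i}. Mutates spent."""
    best, best_score = None, float("-inf")
    for i, x_i in bids.items():
        if spent[i] + x_i > budgets[i]:
            continue                              # not enough budget left to pay this bid
        f_i = 1 - spent[i] / budgets[i]           # fraction of budget left over
        score = x_i * (1 - math.exp(-f_i))        # phi_i(q)
        if score > best_score:
            best, best_score = i, score
    if best is not None:
        spent[best] += bids[best]
    return best
```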
4 Week 6
4.1 Soft-Margin SVMs
Hinge Loss
5 Week 7
5.1 LSH Families of Hash Functions
5.1.1 Hash Functions Decide Equality
There is a subtlety about what a "hash function" really is in the context of an LSH family. A hash function h really takes two elements x and y and returns a decision whether x and y are candidates for comparison. E.g.: the family of minhash functions computes minhash values and says "yes" iff they are the same. Shorthand: "h(x) = h(y)" means h says "yes" for the pair of elements x and y.
5.1.2 LSH Families Defined
Suppose we have a space S of points with a distance measure d. A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:
- If \( d(x,y) \leq d_1 \), then the probability over all h in H, that h(x) = h(y) is at least p1.
- If \( d(x,y) \geq d_2 \), then the probability over all h in H, that h(x) = h(y) is at most p2.
5.1.3 E.g.: LS Family
Let S = sets, d = Jaccard distance, and let H be formed from the minhash functions for all permutations. Then Prob[h(x)=h(y)] = 1 - d(x,y). This restates the theorem about Jaccard similarity and minhashing in terms of Jaccard distance. Claim: H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d.
5.1.4 Amplifying a LSH-Family
The "bands" technique we learned for signature matrices carries over to this more general setting. Goal: the "S-curve" effect seen here. AND construction like "rows in a band." OR construction like "many bands."
5.1.5 AND of Hash Functions
Given family H, construct family H' whose members each consist of \(r\) functions from H. For \( h = \{h_1, \ldots, h_r\} \) in H', h(x) = h(y) iff \(h_i(x) = h_i(y)\) for all \(i\). Theorem: if H is \((d_1, d_2, p_1, p_2)\)-sensitive, then H' is \((d_1, d_2, (p_1)^r, (p_2)^r)\)-sensitive. Proof: use the fact that the \(h_i\)'s are independent.
5.1.6 OR of Hash Functions
Given family H, construct family H' whose members each consist of \(b\) functions from H. For \( h = \{h_1, \ldots, h_b\} \) in H', h(x) = h(y) iff \(h_i(x) = h_i(y)\) for some \(i\). Theorem: if H is \((d_1, d_2, p_1, p_2)\)-sensitive, then H' is \((d_1, d_2, 1-(1-p_1)^b, 1-(1-p_2)^b)\)-sensitive.
5.1.7 Effect of AND and OR Constructions
AND makes all probabilities shrink, but by choosing r correctly, we can make the lower probability approach 0 while the higher does not. OR makes all probabilities grow, but by choosing b correctly, we can make the higher probability approach 1 while the lower does not.
5.1.8 Composing Constructions
As for the signature matrix, we can use the AND construction followed by the OR construction. Or vice-versa. Or any sequence of AND's and OR's alternating.
5.1.9 AND-OR Composition
Each of the two probabilities \(p\) is transformed into \(1-(1-p^r)^b\): the "S-curve" studied before. E.g.: take H and construct H' by the AND construction with r = 4; then, from H', construct H'' by the OR construction with b = 4, giving \(1-(1-p^4)^4\).
5.1.10 OR-AND Composition
Each of the two probabilities \(p\) is transformed into \((1-(1-p)^b)^r\): the same S-curve, mirrored horizontally and vertically.
5.1.11 Cascading Constructions
E.g.: apply the (4,4) OR-AND construction followed by the (4,4) AND-OR construction. This transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.9999996,.0008715)-sensitive family.
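A quick arithmetic check of these numbers (pure arithmetic, no assumptions beyond the formulas above):

```python
def and_or(p, r, b):
    return 1 - (1 - p ** r) ** b          # AND with r, then OR with b

def or_and(p, b, r):
    return (1 - (1 - p) ** b) ** r        # OR with b, then AND with r

for p in (0.2, 0.8):
    print(p, "->", and_or(or_and(p, 4, 4), 4, 4))
# 0.2 -> ~0.00087 and 0.8 -> ~0.9999996, matching the sensitivity claim above.
```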
5.1.12 General Use of S-Curves
For each S-curve \(1-(1-p^r)^b\), there is a threshold \(t\) for which \(1-(1-t^r)^b = t\). Above \(t\), probabilities are increased; below \(t\), they are decreased. You improve the sensitivity as long as the low probability is less than \(t\) and the high probability is greater than \(t\). Iterate as you like.
5.2 More LSH Families
For cosine distance, there is a technique analogous to minhashing for generating a \((d_1, d_2, (1-d_1/180), (1-d_2/180))\)-sensitive family for any \(d_1\) and \(d_2\): random hyperplanes.
5.2.1 Random Hyperplanes
Each vector v determines a hash function \(h_v\) with two buckets: \(h_v(x) = +1\) if \( v \cdot x > 0 \), and \(h_v(x) = -1\) if \( v \cdot x < 0 \). LS-family H = the set of all functions derived from any vector. Claim: Prob[h(x)=h(y)] = 1 - (angle between x and y, divided by 180).
5.2.2 Signatures for Cosine Distance
Pick some number of vectors and hash your data for each vector. The result is a signature (sketch) of +1's and -1's that can be used for LSH like the minhash signatures for Jaccard distance. But you don't have to think of it this way: the existence of the LSH family is sufficient for amplification by AND/OR.
5.2.3 Simplification
We need not pick from among all possible vectors v to form a component of a sketch. It suffices to consider only vectors v whose components are +1 and -1.
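A minimal sketch of cosine-distance sketches using random ±1 vectors, with the fraction of disagreeing components estimating the angle; the dimensions, number of planes, and random seed are arbitrary choices.

```python
import numpy as np

def sketch(x, planes):
    return np.where(planes @ x > 0, 1, -1)       # one +/-1 per random hyperplane

def estimated_angle(sig_x, sig_y):
    disagree = np.mean(sig_x != sig_y)           # Prob[h(x) != h(y)] ~ angle / 180
    return 180.0 * disagree

rng = np.random.default_rng(0)
planes = rng.choice([-1.0, 1.0], size=(200, 5))  # 200 random +/-1 vectors in 5 dimensions
x, y = rng.normal(size=5), rng.normal(size=5)
print(estimated_angle(sketch(x, planes), sketch(y, planes)))
```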
5.2.4 LSH for Euclidean Distance
Simple idea: hash functions correspond to lines. Partition the line into buckets of size a. Hash each point to the bucket containing its projection onto the line. Nearby points are always close; distant points are rarely in the same bucket.
If points are at distance \( \geq 2a \), the angle between the line and the segment joining them must satisfy \( 60 \leq \theta \leq 90 \) for there to be a chance that the points go in the same bucket, i.e., at most 1/3 probability. If points are at distance \( \leq a/2 \), then there is at least a 1/2 chance they share a bucket. This yields an \((a/2, 2a, 1/2, 1/3)\)-sensitive family of hash functions.
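A minimal sketch of the line-projection hash; the random line, bucket width a, and function names are illustrative assumptions.

```python
import numpy as np

def make_line_hash(dim, a, rng=np.random.default_rng(0)):
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)                       # random unit vector = the "line"
    return lambda x: int(np.floor(v @ x / a))    # bucket index of the projection

h = make_line_hash(dim=3, a=1.0)
print(h(np.array([0.1, 0.2, 0.3])), h(np.array([5.0, 5.0, 5.0])))  # same bucket only if projections are close
```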
5.2.5 Fixup: Euclidean Distance
For previous distance measures, we could start with a \((d,e,p,q)\)-sensitive family for any \(d < e\) and drive p and q to 1 and 0 by AND/OR constructions. Here, we seem to need \( e \geq 4d \). But as long as \(d < e\), the probability of points at distance d falling in the same bucket is greater than the probability of points at distance e doing so. Thus, the hash family formed by projecting onto lines is a \((d,e,p,q)\)-sensitive family for some p > q.
5.3 Topic Specific (aka Personalized) PageRank
Instead of generic popularity, can we measure popularity within a topic? Goal: evaluate Web pages not just by their popularity, but by how close they are to a particular topic, e.g. "sports" or "history"; allow search queries to be answered based on the interests of the user. E.g.: the query "Trojan" wants different pages depending on whether you are interested in sports, history, or computer security. The random walker has a small probability of teleporting at any step. The teleport can go to: standard PageRank: any page with equal probability (to avoid dead-end and spider-trap problems); topic-specific PageRank: a topic-specific set of "relevant" pages (the teleport set). Idea: bias the random walk. When the walker teleports, she picks a page from a set S; S contains only pages that are relevant to the topic, e.g., Open Directory (DMOZ) pages for a given topic/query. For each teleport set S, we get a different vector \(r_S\).
5.3.1 Matrix Formulation
To make this work, all we need is to update the teleportation part of the PageRank formulation:
\begin{equation} A_{ij} = \begin{cases} \beta M_{ij}+(1-\beta)/|S| &\mbox{if}\ i\in S \\ \beta M_{ij} & \mbox{otherwise} \end{cases} \end{equation}
A is stochastic! We weighted all pages in the teleport set S equally, but we could also assign different weights to pages. Random walk with restart: S is a single element. Compute as for regular PageRank: multiply by M, then add a vector; this maintains sparseness.
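A minimal power-iteration sketch with teleports restricted to S; the column-stochastic matrix M, uniform teleport weights, and iteration count are assumptions for illustration.

```python
import numpy as np

def topic_specific_pagerank(M, S, beta=0.85, iters=100):
    """M: column-stochastic link matrix (n x n); S: indices of pages in the teleport set."""
    n = M.shape[0]
    teleport = np.zeros(n)
    teleport[list(S)] = 1.0 / len(S)              # equal weight on the teleport set; could be non-uniform
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = beta * (M @ r) + (1 - beta) * teleport   # multiply by M, then add the teleport vector
    return r
```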