Mining-Massive-Datasets

Table of Contents

  • 1. Week 3
  • 2. Week 4 Recommender System
  • 3. Week 5 Clustering
  • 4. Week 6
  • 5. Week 7
  • 0.1 Improvements to A-Priori

    0.1.1 PCY Algorithm (Park-Chen-Yu Algorithm)

    1st pass: count the items; in addition, hash each pair of items in a basket to a bucket and count the buckets. Between the passes, summarize the bucket counts as a bitmap of frequent buckets. 2nd pass: count a pair only if both of its items are frequent and the pair hashes to a frequent bucket.
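
    A minimal Python sketch of the two PCY passes, assuming `baskets` is a list of item lists and that `support` and `num_buckets` are parameters chosen so the counters fit in main memory:

#+BEGIN_SRC python
from itertools import combinations
from collections import defaultdict

def pcy(baskets, support, num_buckets):
    # Pass 1: count singleton items and hash each pair to a bucket counter.
    item_counts = defaultdict(int)
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        for item in basket:
            item_counts[item] += 1
        for pair in combinations(sorted(set(basket)), 2):
            bucket_counts[hash(pair) % num_buckets] += 1

    # Between passes: keep frequent items and a bitmap of frequent buckets.
    freq_items = {i for i, c in item_counts.items() if c >= support}
    bitmap = [c >= support for c in bucket_counts]

    # Pass 2: count a pair only if both items are frequent AND the pair
    # hashes to a frequent bucket.
    pair_counts = defaultdict(int)
    for basket in baskets:
        for pair in combinations(sorted(set(basket) & freq_items), 2):
            if bitmap[hash(pair) % num_buckets]:
                pair_counts[pair] += 1
    return {p for p, c in pair_counts.items() if c >= support}
#+END_SRC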

    0.1.2 Multistage

    Two hash tables, 2 bitmaps, 3 passes. Important points: the hash functions have to be independent, and we need to check both bitmaps on the third pass; otherwise we would count pairs of frequent items that hash to an infrequent bucket under the first hash but a frequent bucket under the second.

    0.1.3 Multihash

    Key idea: use multiple independent hash tables in 1st pass.

    0.1.4 Single-Pass Approximate Algorithms

    0.2 All (Or Most) Frequent Itemsets In ≤ 2 Passes

    0.2.1 Simple Algorithm

    Take a random sample of the market baskets that fits in main memory, and run the in-memory algorithm (e.g., A-Priori) on the sample with a support threshold scaled down in proportion to the sample size.

    0.2.2 Savasere-Omiecinski-Navathe (SON) Algorithm

    Find itemsets that are frequent in each main-memory-sized subset of the baskets; the union of these is the candidate set, which is counted on a second pass. Only works if the number of candidates can be counted in main memory. Key "monotonicity" fact about subsets: an itemset cannot be frequent in the whole file unless it is frequent in at least one subset.

    0.2.3 Toivonen's Algorithm

    No false negatives, but in some cases it may not give an answer, so you need to rerun it; there is no guarantee of finishing. Start as in the simple algorithm, but lower the threshold slightly. Goal is to avoid missing any itemset that is frequent in the whole data set.

    Negative border: an itemset is in the negative border if it is not frequent in the sample, but all of its immediate subsets are. E.g., {A, B, C, D} is in the negative border if it is not frequent in the sample while {A,B,C}, {A,B,D}, {A,C,D} and {B,C,D} all are.

    1 Week 3

    1.1 The Affiliation Graph Model

    1.1.1 Community-Affiliation Graph

    Communities \(C\), memberships \(M\), nodes \(V\). The AGM generates a graph: nodes \(u\) and \(v\) are connected with probability \( P(u, v) = 1 - \prod\limits_{c\in M_u \cap M_v}(1-p_c) \)

    1.1.2 AGM: Flexibility

    Can express a variety of community structures: non-overlapping, overlapping, nested.

    1.2 From AGM to BigCLAM

    \(F_{uA}\): the membership strength of node \(u\) in community \(A\) (nonnegative). Each community \(A\) links nodes independently: \( P_A(u, v) = 1 - \exp(-F_{uA}\cdot F_{vA}) \)

    1.2.1 community membership strength Factor Matrix F

    Probability that at least one common community links \(u\) and \(v\): \( P(u,v) = 1 - \prod\limits_C (1-P_C(u,v)) \). Substituting the per-community probabilities:

    \begin{align*} P(u,v) &= 1 - \exp(-\sum_C F_{uC}\cdot F_{vC}) \\ &= 1 - \exp(-F_u\cdot F_v^{T}) \end{align*}

    1.3 Solving the BigCLAM

    1.3.1 How to find F

    Given a network \(G(V,E)\), estimate \(F\): \( \hat{F} = \arg\max_F \prod\limits_{(u,v)\in E} p(u,v) \prod\limits_{(u,v)\notin E} (1-p(u,v)) \). Take the log-likelihood \(l(F_u)\):

    \begin{equation*} l(F_u) = \sum\limits_{v\in N(u)} \log(1-\exp(-F_u F_v^T)) - \sum\limits_{v\not \in N(u)} F_u F_v^T \end{equation*}

    1.3.2 BigCLAM: V1.0

    Coordinate gradient ascent over the \(F_u\) is slow, because evaluating \( \nabla l(F_u) \) naively takes time linear in the number of nodes (the sum over all non-neighbors of \(u\)).

    1.3.3 BigCLAM: V2.0

    \( \sum\limits_{v\not \in N(u)} F_v = \sum\limits_v F_v - F_u - \sum\limits_{v\in N(u)} F_v \). We cache \( \sum\limits_{v} F_v \), so the non-neighbor sum (and hence the gradient) can be computed in time linear in the degree \(|N(u)|\) of \(u\).
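
    A minimal numpy sketch of one projected gradient-ascent step on \(l(F_u)\), assuming `F` is the nonnegative factor matrix, `neighbors[u]` lists \(N(u)\), and `eta` is an arbitrary step size; the cached column sum makes the non-neighbor term cheap:

#+BEGIN_SRC python
import numpy as np

def update_Fu(F, u, neighbors, eta=0.01):
    """One projected gradient-ascent step on l(F_u) for node u."""
    sum_all = F.sum(axis=0)                      # cached: sum over all v of F_v
    sum_nbr = F[neighbors[u]].sum(axis=0)        # sum over v in N(u)
    sum_non_nbr = sum_all - F[u] - sum_nbr       # sum over v not in N(u)

    grad = -sum_non_nbr                          # gradient of the non-edge term
    for v in neighbors[u]:
        dot = F[u] @ F[v]
        grad += F[v] * np.exp(-dot) / (1.0 - np.exp(-dot) + 1e-12)

    F[u] = np.maximum(F[u] + eta * grad, 0.0)    # project back onto F_u >= 0
    return F
#+END_SRC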

    1.4 Detecting Clusters

    Goal: find densely linked clusters (e.g., discovering social circles, circles of trust). Graph partitioning criterion: conductance \(\phi(A) = \cfrac{cut(A)}{vol(A)}\), where \(vol(A)\) is the total weight of the edges with at least one endpoint in \(A\): \( vol(A) = \sum\limits_{i\in A} d_i \). Why use this criterion? It produces more balanced partitions.

    1.5 The Graph Laplacian Matrix

    Graph partitioning task: partition the graph into two pieces such that the resulting pieces have low conductance. Problem: computing the optimal cut is NP-hard. \(A\): adjacency matrix of undirected \(G\), with \(A_{ij} = 1\) if \((i, j)\) is an edge, else 0. \(x\): a vector with a label/value for each node of \(G\); entry \(i\) of \(A\cdot x\) is the sum of the labels of \(i\)'s neighbors. Graph Laplacian: \(L = D - A\), where \(D\) is the diagonal degree matrix. Its second-smallest eigenvalue satisfies \( \lambda_2 = \min\limits_{x\perp \mathbf{1}} \cfrac{x^T L x}{x^T x} \), and \( x^T L x = \sum\limits_{(i,j)\in E} (x_i - x_j)^2 \), so minimizing it pulls the values of connected nodes together. The eigenvector associated with \(\lambda_2\) is the Fiedler vector, which is used to split the nodes into two groups.

    1.6 Spectral Clustering Algorithms

    Three basic stages:

    1. Pre-processing

    Construct a matrix representation of the graph

    2. Decomposition

    Compute eigenvalues and eigenvectors of the matrix Map each point to a lower-dimensional representation based on one or more eigenvectors

    3. Grouping

    Assign points to two or more clusters, based on the new representation

    K-way spectral clustering: recursive bi-partitioning ('92) is inefficient and unstable; clustering multiple eigenvectors ('00) is preferable.
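
    A minimal numpy sketch of the three stages for a single bi-partition, assuming `A` is the adjacency matrix of an undirected graph:

#+BEGIN_SRC python
import numpy as np

def spectral_bipartition(A):
    # Pre-processing: build the Laplacian L = D - A.
    L = np.diag(A.sum(axis=1)) - A
    # Decomposition: eigenvector of the second-smallest eigenvalue
    # (the Fiedler vector); eigh returns eigenvalues in ascending order.
    _, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]
    # Grouping: split nodes by the sign of their Fiedler-vector entry.
    return fiedler >= 0

# Example: two triangles joined by a single edge.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(spectral_bipartition(A))   # e.g. [False False False  True  True  True]
#+END_SRC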

    1.7 Trawling

    Searching for small communities in the Web graph. Frequent itemsets = complete bipartite graphs! View each node \(i\) as a set \(S_i\) of the nodes that \(i\) points to. A complete bipartite subgraph \(K_{s,t}\) corresponds to a set \(Y\) of size \(t\) that occurs in \(s\) of the sets \(S_i\): \(s\) = minimum support (\(|S| = s\)), \(t\) = itemset size (\(|Y| = t\)).

    1.8 Mining Data Streams

    1.8.1 The Stream Model

    Data management vs. stream management: in a DBMS (e.g., SQL), inserts are controlled internally; stream management is important when the input rate is controlled externally, e.g., Google queries.

    1. Two Forms of Query

      Ad-hoc queries: asked once about the current state of the stream(s). Standing queries: stored and active all the time, e.g., report each new maximum value ever seen in stream S. Streams may be archived, but working storage is limited.

    2. Applications

      Mining query streams, mining click streams; IP packets can be monitored at a switch.

    1.8.2 Sliding Windows

    Queries are about a window of the last N elements. What if N is bigger than main memory? E.g., averages: given a stream of integers, standing query: what is the average of the integers in the window (size N)? For the first N inputs, simply sum and count to get the average. Afterward, when a new input \(i\) arrives and the oldest element \(j\) drops out of the window, change the average by adding \((i-j)/N\).
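
    A tiny sketch of the standing average query, assuming the window of the last N integers itself fits in memory:

#+BEGIN_SRC python
from collections import deque

class WindowAverage:
    def __init__(self, N):
        self.N, self.window, self.avg = N, deque(), 0.0

    def add(self, i):
        if len(self.window) < self.N:        # first N inputs: sum and count
            self.window.append(i)
            self.avg = sum(self.window) / len(self.window)
        else:                                # afterwards: add (i - j) / N
            j = self.window.popleft()        # oldest element leaving the window
            self.window.append(i)
            self.avg += (i - j) / self.N
        return self.avg
#+END_SRC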

    1.8.3 Counting 1's

    1.9 Bloom Filters

    Bloom filters can have false positives, but no false negatives. A Bloom filter is an array of bits, together with a number of hash functions. The argument of each hash function is a stream element, and it returns a position in the array; inserting an element sets all of its hashed positions to 1. E.g.: use N = 11 bits for our filter, stream elements = integers, and two hash functions: h1(x) = the odd-numbered bits of x treated as an integer, modulo 11; h2(x) = the same, but with the even-numbered bits. Bloom filter lookup: for an element y, compute h(y) for each hash function h; y is (possibly) present only if all the resulting positions hold 1.
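
    A small sketch of the example filter (N = 11 bits, h1/h2 built from the odd/even-numbered bits as above); the inserted integers and the bit-extraction helper are arbitrary choices:

#+BEGIN_SRC python
N = 11
bits = [0] * N

def bits_at(x, start):
    """Integer formed from bits start, start+2, start+4, ... of x."""
    out, pos = 0, 0
    x >>= start
    while x:
        out |= (x & 1) << pos
        x >>= 2
        pos += 1
    return out

def h1(x): return bits_at(x, 1) % N     # odd-numbered bits
def h2(x): return bits_at(x, 0) % N     # even-numbered bits

def insert(x):
    for h in (h1, h2):
        bits[h(x)] = 1

def lookup(y):
    # y is reported present only if every hashed position is 1;
    # false positives are possible, false negatives are not.
    return all(bits[h(y)] for h in (h1, h2))

insert(25); insert(159); insert(585)
print(lookup(25), lookup(200))   # True False here
#+END_SRC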

    1.10 Counting 1's

    Counting Bits Problem: given a stream of 0's and 1's, be prepared to answer queries of the form "how many 1's in the last k bits?" where k \(\leq\) N

    1.10.1 DGIM Algorithm

    Uses \(O(\log^2 N)\) bits per stream.

    1. Timestamps

      Each bit in the stream has a timestamp, starting 0, 1, … Record timestamps modulo N (the window size), so we can represent any relevant timestamp in \(O(\log_2 N)\) bits.

    2. Buckets

      A bucket is a segment of the window; it is represented by a record consisting of:

      1. The timestamp of its end [O(log N) bits]
      2. The number of 1's between its beginning and end [O(log log N) bits].
    3. Representing a Stream by Buckets

      Either one or two buckets of each size; buckets do not overlap; buckets are sorted by size (earlier buckets are at least as large as later ones); buckets disappear when their end-time is more than N time units in the past.

    4. Updating Buckets

      When a new bit comes in, drop the oldest bucket if its end-time is prior to N time units before the current time. If the current bit is 0, no other changes. If the current bit is 1 (a sketch of the full update follows this list):

      1. Create a new bucket of size 1, for just this bit. End timestamp = current time.
      2. If there are now three buckets of size 1, combine the oldest two into a bucket of size 2
      3. If there are now three buckets of size 2, …

      so on …
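
    A compact sketch of DGIM bucket maintenance and the approximate count, assuming buckets are stored newest-first as [end_timestamp, size] pairs:

#+BEGIN_SRC python
class DGIM:
    def __init__(self, N):
        self.N = N            # window size
        self.t = 0            # current timestamp
        self.buckets = []     # [end_timestamp, size], newest first

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket if it ends before the window starts.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 0:
            return
        self.buckets.insert(0, [self.t, 1])
        # Whenever three buckets share a size, merge the oldest two,
        # keeping the end-timestamp of the more recent of them.
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 2][1]:
                newer, oldest = self.buckets[i + 1], self.buckets[i + 2]
                self.buckets[i + 1:i + 3] = [[newer[0], newer[1] + oldest[1]]]
            else:
                i += 1

    def count(self, k=None):
        """Estimate the number of 1's among the last k (default N) bits."""
        k = k or self.N
        total, last_size = 0, 0
        for end, size in self.buckets:
            if end > self.t - k:
                total += size
                last_size = size
        return total - last_size // 2   # count only half of the oldest bucket
#+END_SRC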

    1.11 Sampling Streams

    1.11.1 When Sampling Doesn't Work

    Example: Google wants to know what fraction of queries in the past month were unique. Sampling fixed positions of the stream does not answer this correctly; we need to sample based on the query itself, as in the next section.

    1.11.2 Sampling Based on Hash Value

    Hash each element's key and keep the element iff its hash value falls in an accepted range; shrinking the range over time maintains a fixed sample size. E.g.: fixed sample size; sampling key-value pairs.

    1.11.3 Counting Distinct Items

    1.11.4 Computing Moments

    1.12 Counting Distinct Elements

    Problem: a data stream consists of elements chosen from a set of size n. Maintain a count of the number of distinct elements seen so far. Applications: how many different words are found among the web pages being crawled at a site? How many unique users visited Facebook in the past month?

    1.12.1 Estimating Counts

    Flajolet-Martin approach: pick a hash function h that maps each of the n elements to at least \(\log_2 n\) bits. For each stream element a, let r(a) be the number of trailing 0's in h(a). Record R = the maximum r(a) seen. Estimate of the number of distinct elements = \(2^R\).

    1. Why it works

      The probability that a given h(a) ends in at least i 0's is \(2^{-i}\). If there are m different elements, the probability that R ≥ i is \(1-(1-2^{-i})^m\). Since \(2^{-i}\) is small, \(1-(1-2^{-i})^m \approx 1 - e^{-m2^{-i}}\).

    2. Solution

      Partition your samples into small groups (\(\log n\) groups, where n = size of the universal set, suffices). Take the average of the estimates within each group, then take the median of the group averages; see the sketch after the worked example below.

    3. Generalization: Moments

      AMS (Alon-Matias-Szegedy) method for the 2nd moment \( \sum_a (m_a)^2 \), where \(m_a\) is the number of times element a appears. Pick a random time t and let \(X = n\,(2c - 1)\), where c is the number of times the element at time t appears from that time on. Expected value: \( E(X) = \frac{1}{n}\sum_{\text{all times } t} n\,(2c_t - 1) = \sum_a (1+3+5+\cdots+(2m_a-1)) = \sum_a m_a^2 \). Problem: streams never end, so n keeps growing; fixups are needed.

      Worked Flajolet-Martin example with h(x) = (3x + 7) mod 11, listing value, 4-bit binary form, and trailing 0's: h(1)=10, 1010, 1; h(9)=1, 0001, 0; h(8)=9, 1001, 0; h(5)=0, 0000, 4; h(2)=2, 0010, 1; h(6)=3, 0011, 0; h(3)=5, 0101, 0; h(4)=8, 1000, 3; h(7)=6, 0110, 1; h(10)=4, 0100, 2. Here R = 4 (from h(5)), so the estimate is \(2^4 = 16\) for 10 distinct elements.
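
    A minimal sketch of the Flajolet-Martin estimate combined by "average within groups, median of the group averages"; the hash family and group sizes are arbitrary choices:

#+BEGIN_SRC python
import random
import statistics

def trailing_zeros(x, bits=32):
    return (x & -x).bit_length() - 1 if x else bits

def fm_estimate(stream, num_groups=4, hashes_per_group=8, bits=32):
    random.seed(0)
    # Assumed hash family: h(x) = (a * x + b) mod 2^bits with random odd a.
    params = [(random.randrange(1, 2**bits, 2), random.randrange(2**bits))
              for _ in range(num_groups * hashes_per_group)]
    R = [0] * len(params)
    for elem in stream:
        for idx, (a, b) in enumerate(params):
            h = (a * hash(elem) + b) % (2**bits)
            R[idx] = max(R[idx], trailing_zeros(h, bits))
    estimates = [2**r for r in R]
    group_avgs = [statistics.mean(estimates[g*hashes_per_group:(g+1)*hashes_per_group])
                  for g in range(num_groups)]
    return statistics.median(group_avgs)

print(fm_estimate(str(i % 1000) for i in range(10_000)))  # ~1000, within a small factor
#+END_SRC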

    2 Week 4 Recommender System

    2.1 Overview

    The long tail: physical retailers can stock only the most popular items, while online systems can also recommend items from the long tail of less popular ones.

    2.1.1 Types of Recommendations

    Editorial and hand curated; simple aggregates (top 10, most popular, recent uploads); tailored to individual users.

    2.1.2 Formal Model

    C = set of customers, S = set of items. Utility function u: C × S → R, where R = set of ratings (a totally ordered set).

    2.1.3 Key Problems

    1. Gathering Known Ratings

      Implicit: a purchase implies a high rating.

    2. Extrapolating Utilities

      Key problem: the utility matrix U is sparse; most people have not rated most items. Cold start: new items have no ratings, and new users have no history. Three approaches to recommender systems: content-based, collaborative, and latent factor based.

    2.2 Content-based

    Main idea: recommend to customer x items similar to previous items rated highly by x. Items the user likes → item profiles → user profile.

    2.2.1 Item Profiles

    For each item, create an item profile Profile is a set of features (a vector) Text features Profile = set of "important" words in item (document) How to pick important words? TF-IDF

    2.2.2 User Profiles

    The user has rated items with profiles \(i_1, \ldots, i_n\). Simple: (weighted) average of rated item profiles. Variant: normalize weights using the user's average rating.

    1. Example 1: Boolean Utility Matrix

      Items are movies, the only feature is "actor." Suppose user x has watched 5 movies: 2 movies featuring actor A and 3 featuring actor B. User profile: weight for A = 2/5 = 0.4, weight for B = 3/5 = 0.6.

    2. Example 2: Star Ratings

      Same example, but with 1-5 star ratings: actor A's movies rated 3 and 5, actor B's movies rated 1, 2, and 4. Useful step: normalize ratings by subtracting the user's mean rating (3). After normalizing, A's profile weight = (0 + 2)/2 = +1 and B's = (-2 - 1 + 1)/3 = -2/3.

    2.2.3 Making predictions

    Given user profile x and item profile i, estimate \( u(x, i) = \cos(\theta) = \cfrac{x\cdot i}{\|x\|\,\|i\|} \) (cosine similarity; the corresponding cosine distance is \(1-\cos\theta\)).
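
    A minimal sketch of building a user profile from rated item profiles and scoring a new item by cosine similarity; the feature vectors, names, and ratings are made up for illustration:

#+BEGIN_SRC python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical item profiles (e.g., boolean or TF-IDF feature vectors).
item_profiles = {
    "movie1": np.array([1.0, 0.0, 1.0]),
    "movie2": np.array([0.0, 1.0, 1.0]),
    "movie3": np.array([1.0, 1.0, 0.0]),
}
ratings = {"movie1": 5, "movie2": 1, "movie3": 4}   # user x's star ratings

# User profile: average of rated item profiles, weighted by the ratings
# after subtracting the user's mean rating (the normalization step above).
mean = np.mean(list(ratings.values()))
profile = sum((r - mean) * item_profiles[i] for i, r in ratings.items()) / len(ratings)

# Prediction: u(x, i) ~ cos(theta) between the user profile and an item profile.
new_item = np.array([1.0, 0.0, 0.0])
print(cosine(profile, new_item))
#+END_SRC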

    2.2.4 Pros: Content-based Approach

    No need for data on other users; able to recommend to users with unique tastes; able to recommend new & unpopular items (no first-rater problem); explanations for recommended items, via the content features that caused an item to be recommended.

    2.2.5 Cons: Content-based Approach

    Finding the appropriate features is hard. Overspecialization: never recommends items outside the user's content profile, although people might have multiple interests. Unable to exploit quality judgments of other users. Cold-start problem for new users: how do we build a user profile?

    2.3 Collaborative Filtering

    Consider user x Find set N of other users whose ratings are "similar" to x's ratings Estimate x's ratings based on ratings of users in N

    Option 1: Jaccard similarity, \( sim(A,B) = |r_A\cap r_B|\,/\,|r_A\cup r_B| \). Problem: ignores the rating values. Option 2: cosine similarity, \( sim(A,B) = \cos(r_A, r_B) \). Problem: treats missing ratings as negative. Option 3: centered cosine; normalize ratings by subtracting the row mean. Also known as Pearson correlation.

    2.3.1 Rating Predictions

    Let \(r_x\) be the vector of user x's ratings, and let N be the set of the k users most similar to x who have also rated item i. Prediction for user x and item i: Option 1: \( r_{xi} = \frac{1}{k}\sum_{y\in N} r_{yi} \). Option 2: \( r_{xi} = \cfrac{\sum_{y\in N} s_{xy}\, r_{yi}}{\sum_{y\in N} s_{xy}} \), where \(s_{xy}\) is the similarity between x and y.

    2.3.2 Item-Item Collaborative Filtering

    Estimate the rating for item i based on ratings for similar items: \( r_{xi} = \cfrac{\sum_{j\in N(i;x)} s_{ij}\, r_{xj}}{\sum_{j\in N(i;x)} s_{ij}} \), where N(i;x) is the set of items similar to i that user x has rated and \(s_{ij}\) is the similarity of items i and j.
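
    A compact sketch of centered-cosine (Pearson) similarity and the item-item prediction formula, using a small made-up utility matrix in which 0 means "not rated":

#+BEGIN_SRC python
import numpy as np

def centered_cosine(a, b):
    """Subtract each vector's mean over its rated entries, then take cosine."""
    a = np.where(a > 0, a - a[a > 0].mean(), 0.0)
    b = np.where(b > 0, b - b[b > 0].mean(), 0.0)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Rows = items, columns = users; 0 = missing rating.
R = np.array([[5, 0, 4, 4],
              [5, 3, 4, 0],
              [1, 4, 0, 2],
              [4, 3, 5, 5]], dtype=float)

def predict(R, i, x, k=2):
    """Estimate user x's rating of item i from the k most similar items x rated."""
    sims = [(centered_cosine(R[i], R[j]), j)
            for j in range(len(R)) if j != i and R[j, x] > 0]
    top = [(s, j) for s, j in sorted(sims, reverse=True)[:k] if s > 0]
    num = sum(s * R[j, x] for s, j in top)
    den = sum(s for s, _ in top)
    return num / den if den else 0.0

print(predict(R, i=0, x=1))   # 3.0: item 1 is the only positively similar rated item
#+END_SRC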

    2.3.3 Item-Item v. User-User

    In theory, user-user and item-item are dual approaches. In practice, item-item outperforms user-user in many use cases. Items are "simpler" than users: items belong to a small set of "genres," while users have varied tastes, so item similarity is more meaningful than user similarity.

    2.4 Evaluating Recommender System

    Root-mean-square error (RMSE). Problems: a narrow focus on accuracy sometimes misses the point (prediction diversity, prediction context, order of predictions). In practice, we care only about predicting high ratings. Alternative: precision at top k.

    2.5 Latent Factor Models

    2.5.1 A Modern Recommender System

    Multi-scale modeling of the data: global overall deviations; factorization addressing "regional" effects; collaborative filtering extracting local patterns.

    2.5.2 Modeling Local & Global Effects

    Global: baseline estimation. Local neighborhood (CF/NN): final estimate.

    2.5.3 Collaborative filtering (Item-Item)

    In practice we get better estimates if we model deviations: baseline \( b_{xi} = \mu + b_x + b_i \) and \( r_{xi} = b_{xi} + \cfrac{\sum_{j\in N(i;x)} s_{ij}(r_{xj}-b_{xj})}{\sum_{j\in N(i;x)} s_{ij}} \), where μ = overall mean rating, \(b_x\) = rating deviation of user x, and \(b_i\) = (avg. rating of movie i) - μ. Problems/issues: similarity measures are "arbitrary"; pairwise similarities neglect interdependencies among users; taking a weighted average can be restricting.

    2.6 Latent Factor Recommender System

    Recommendations via Optimization

    2.6.1 Latent Factor Models

    "SVD" on Netflix data: \( R \approx Q\cdot P^T \). SVD gives the minimum reconstruction error (sum of squared errors).

    2.7 Finding the Latent Factors
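
    One common way to find the factors Q and P of the previous subsection is stochastic gradient descent on the squared error over the known ratings; a minimal sketch with arbitrary step size, epoch count, and toy ratings (regularization omitted):

#+BEGIN_SRC python
import numpy as np

def factorize(ratings, n_users, n_items, k=2, eta=0.02, epochs=500, seed=0):
    """ratings: list of (user, item, rating). Returns P (users x k), Q (items x k)."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, k))
    Q = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - Q[i] @ P[u]        # error on one known rating
            P[u] += eta * err * Q[i]     # gradient step on the user factors
            Q[i] += eta * err * P[u]     # gradient step on the item factors
    return P, Q

ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
P, Q = factorize(ratings, n_users=3, n_items=3)
print(Q[2] @ P[0])   # predicted (unobserved) rating of item 2 by user 0
#+END_SRC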

    2.8 CUR

    Pros & cons: + easy interpretation, + sparse basis; - duplicate columns and rows (columns with large norms will be sampled many times).

    3 Week 5 Clustering

    3.1 Bradley-Fayyad-Reina (BFR) Algorithm

    BFR is a variant of k-means for very large (disk-resident) data sets Assumes each cluster is normally distributed around a centroid in Euclidean space

    3.1.1 BFR Algorithm

    Points are read from disk one main-memory-full at a time. Most points from previous memory loads are summarized by simple statistics. To begin, from the initial load we select the initial k centroids.

    3.1.2 Three Classes of Points

    Discard set (DS): Points close enough to a centroid to be summarized Compression set (CS): Groups of points that are close together but not close to any existing centroid These points are summarized, but not assigned to a cluster Retained set (RS): isolated points waiting to be assigned to a compression set

    3.1.3 Summarizing Sets of Points

    For each cluster, the DS is summarized by: the number of points, N; the vector SUM, whose ith component is the sum of the point coordinates in the ith dimension; and the vector SUMSQ, whose ith component is the sum of squares of the coordinates in the ith dimension. The centroid (SUM/N per dimension) and the variance (SUMSQ/N - (SUM/N)² per dimension) can be calculated from N, SUM, and SUMSQ.
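
    A small sketch of a DS summary keeping (N, SUM, SUMSQ); the centroid, per-dimension variance, and the Mahalanobis-style distance used in 3.1.5 all follow directly from it:

#+BEGIN_SRC python
import numpy as np

class ClusterSummary:
    def __init__(self, dim):
        self.N = 0
        self.SUM = np.zeros(dim)
        self.SUMSQ = np.zeros(dim)

    def add(self, point):
        self.N += 1
        self.SUM += point
        self.SUMSQ += point ** 2

    def centroid(self):
        return self.SUM / self.N

    def variance(self):                 # per-dimension variance
        return self.SUMSQ / self.N - (self.SUM / self.N) ** 2

    def mahalanobis(self, point):
        # Distance normalized by the per-dimension standard deviation.
        sigma = np.sqrt(self.variance()) + 1e-12
        return float(np.sqrt(np.sum(((point - self.centroid()) / sigma) ** 2)))

ds = ClusterSummary(dim=2)
for p in [np.array([1.0, 2.0]), np.array([2.0, 3.0]), np.array([1.5, 2.5])]:
    ds.add(p)
print(ds.centroid(), ds.variance(), ds.mahalanobis(np.array([1.6, 2.4])))
#+END_SRC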

    3.1.4 Processing a chunk of points

    Find points that are sufficiently close to a cluster centroid and add them to that cluster's DS, updating its statistics. Cluster the remaining points together with the old RS: clusters of more than one point become CS subclusters, singletons go back to the RS. Consider merging compressed sets in the CS. If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster.

    3.1.5 A Few Details…

    Q1) How do we decide whether a point is "close enough" to a cluster to be added to its DS (and discarded)? BFR approach: the Mahalanobis distance to the centroid is less than a threshold, e.g. 3σ per dimension, i.e., a high likelihood that the point belongs to the currently nearest centroid. Q2) Should 2 CS subclusters be combined? Combine if the combined variance is below some threshold. Many alternatives: treat dimensions differently, consider density.

    3.2 CURE Algorithm

    Limitations of the BFR algorithm: it makes strong assumptions (clusters normally distributed around a centroid, axis-aligned), so it does not work for clusters that are not linearly separable or have non-convex shapes.

    3.2.1 Clustering Using REpresentatives:

    Assumes a Euclidean distance. Allows clusters to assume any shape. Uses a collection of representative points to represent clusters.

    3.2.2 Starting CURE

    Pass 1 of 2: pick a random sample of points that fit in main memory; cluster the sample points hierarchically to create the initial clusters; pick representative points: for each cluster, pick k representative points, as dispersed as possible, and move each representative point a fixed fraction (e.g., 20%) toward the centroid of the cluster. Pass 2 of 2: now rescan the whole dataset and visit each point p in the data set; place it in the "closest cluster," i.e., the cluster of the nearest representative point.

    3.3 Performance-based Advertising

    Matching problem: find a maximum matching for a given bipartite graph (a perfect one if it exists). There is a polynomial-time offline algorithm based on augmenting paths. Online graph matching problem (girls → boys): in each round, one girl's choices are revealed, and at that moment we must decide whether to pair her with one of those boys. Example of application: assigning tasks to servers.

    3.3.1 Greedy Algorithm

    Greedy: just pick any eligible boy for each new girl. Competitive ratio = \( \min\limits_{\text{all possible inputs } I} |M_{greedy}|/|M_{opt}| \), the worst-case performance of the greedy algorithm over all possible inputs.

    3.3.2 Analyzing the Greedy Algorithm

    \( |M_{opt}| \leq 2\,|M_{greedy}| \), so the competitive ratio of greedy is at least 1/2.

    3.4 Algorithmic Challenges

    Performance-based advertising works! Multi-billion-dollar industry What ads to show for a given query?

    3.4.1 AdWords Problems

    A stream of queries arrives at the search engine: q1, q2, … Several advertisers bid on each query. When query qi arrives, the search engine must pick a subset of advertisers whose ads are shown. Goal: maximize the search engine's revenue. Clearly we need an online algorithm!

    3.4.2 Expected Revenue

    Expected revenue = CTR (click-through rate) × bid.

    3.4.3 Adwords Problem

    Given: a set of bids by advertisers for search queries, a click-through rate for each advertiser-query pair, a budget for each advertiser (say for 1 day or 1 month), and a limit on the number of ads to be displayed with each search query. Respond to each search query with a set of advertisers such that: the size of the set is no larger than the limit, each advertiser has bid on the search query, and each advertiser has enough budget left to pay.

    3.4.4 Limitations of Simple Algorithm

    The CTR of an ad is unknown. Advertisers have limited budgets and bid on multiple ads (addressed by the BALANCE algorithm).

    3.4.5 Estimating CTR

    The CTR for a query-ad pair is measured historically, averaged over a time period. Some complications we won't cover in this lecture: CTR is position dependent. Explore vs. exploit: keep showing ads we already know the CTR of, or show new ads to estimate their CTR?

    3.5 The BALANCE Algorithms

    3.5.1 Dealing with Limited Budgets

    Simplest algorithm is greedy.

    3.5.2 Bad Scenario for Greedy

    Two advertisers A and B. A bids on query x; B bids on x and y. Both have budgets of $4. Query stream: x x x x y y y y. Worst-case greedy choice: B B B B _ _ _ _ (B's budget is spent on x, and nobody is left to bid on y). Optimal: A A A A B B B B. This is the worst case, and it is unavoidable for greedy: greedy is deterministic, so it always gives the same answer in the same situation.

    3.5.3 BALANCE Algorithm [MSVV]

    For each query, pick the advertiser with the largest unspent budget. Break ties arbitrarily (but in a deterministic way).

    3.5.4 Analyzing 2-advertiser BALANCE

    BALANCE must exhaust at least one advertiser's budget: if not, we can allocate more queries Assume BALANCE exhausts A2's budget

    3.5.5 BALANCE: General Result

    In the general case, the worst competitive ratio of BALANCE is 1 - 1/e ≈ 0.63. Interestingly, no online algorithm has a better competitive ratio.

    3.6 Worst case for BALANCE

    N advertisers: A1, A2, …, AN. Queries: N\( \cdot \)B queries appear in N rounds of B queries each. Bidding: round 1 queries have bidders A1, A2, …, AN; round 2 queries have bidders A2, …, AN; round i queries have bidders Ai, …, AN.

    3.6.1 BALANCE Allocation

    After k rounds, the allocation to advertiser k is: \( S_k = \sum\limits_{1\le i \le k} \cfrac{B}{N-i+1} \)

    3.6.2 BALANCE: Analysis

    Fact: \( H_n = \sum_{i=1}^{n} 1/i \approx \ln n \) for large n (a result due to Euler). The remaining advertisers' budgets are exhausted when \( \ln N - \ln(N-k) = 1 \), i.e., \( \ln(N - k) = \ln N - 1 \), giving k = N(1 - 1/e). So after the first k = N(1 - 1/e) rounds, we cannot allocate a query to any allowed advertiser. Revenue = B⋅N(1 - 1/e). Competitive ratio = 1 - 1/e.
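
    A tiny simulation of the worst-case instance above, comparing BALANCE's revenue against the optimal N·B; the instance sizes are arbitrary:

#+BEGIN_SRC python
import math

def balance_worst_case(N, B):
    budgets = [B] * N                 # advertisers A1..AN
    revenue = 0
    for rnd in range(N):              # round i: only advertisers A_i..A_N bid
        for _ in range(B):
            eligible = [i for i in range(rnd, N) if budgets[i] > 0]
            if not eligible:
                continue              # query cannot be allocated
            # BALANCE: pick the eligible advertiser with the largest unspent budget.
            i = max(eligible, key=lambda a: budgets[a])
            budgets[i] -= 1
            revenue += 1
    return revenue / (N * B)          # optimal revenue is N * B

print(balance_worst_case(N=50, B=100), 1 - 1 / math.e)   # both roughly 0.63
#+END_SRC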

    3.6.3 General Version of the Problem

    So far: all bids = 1, all budgets equal (=B) In a general setting BALANCE can be terrible

    3.6.4 Generalized BALANCE

    Consider query q and bidder i: bid = \(x_i\), budget = \(b_i\), amount spent so far = \(m_i\), fraction of budget left over \( f_i = 1 - m_i/b_i \). Define \( \phi_i(q) = x_i(1-e^{-f_i}) \). Allocate query q to the bidder i with the largest value of \( \phi_i(q) \). Same competitive ratio (1 - 1/e).

    4 Week 6

    4.1 Soft-Margin SVMs

    Hinge loss: \( L(x_i, y_i) = \max(0,\ 1 - y_i(w\cdot x_i + b)) \). The soft-margin SVM minimizes \( \frac{1}{2}\|w\|^2 + C\sum\limits_i \max(0,\ 1 - y_i(w\cdot x_i + b)) \).
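
    A minimal sub-gradient descent sketch of the soft-margin objective above, with arbitrary C, step size, and toy data:

#+BEGIN_SRC python
import numpy as np

def svm_sgd(X, y, C=1.0, eta=0.01, epochs=200, seed=0):
    """y must be in {-1, +1}. Returns the weight vector w and bias b."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            margin = y[i] * (X[i] @ w + b)
            # Sub-gradient of 0.5*||w||^2 + C*hinge for this one example.
            if margin < 1:
                w -= eta * (w - C * y[i] * X[i])
                b += eta * C * y[i]
            else:
                w -= eta * w
    return w, b

X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = svm_sgd(X, y)
print(np.sign(X @ w + b))   # -> [ 1.  1. -1. -1.]
#+END_SRC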

    5 Week 7

    5.1 LSH Families of Hash Functions

    5.1.1 Hash Functions Decide Equality

    There is a subtlety about what a "hash function" really is in the context of an LSH family. A hash function h really takes two elements x and y, and returns a decision whether x and y are candidates for comparison. E.g.: the family of minhash functions computes minhash values and says "yes" iff they are the same. Shorthand: "h(x) = h(y)" means h says "yes" for the pair of elements x and y.

    5.1.2 LSH Families Defined

    Suppose we have a space S of points with a distance measure d. A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:

    1. If \( d(x,y) \leq d_1 \), then the probability over all h in H, that h(x) = h(y) is at least p1.
    2. If \( d(x,y) \geq d_2 \), then the probability over all h in H, that h(x) = h(y) is at most p2.

    5.1.3 E.g.: LS Family

    Let S = sets, d = Jaccard distance, and H the family formed from the minhash functions for all permutations of rows. Then Prob[h(x)=h(y)] = 1 - d(x,y). This restates the theorem about Jaccard similarity and minhashing in terms of Jaccard distance. Claim: H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d.

    5.1.4 Amplifying a LSH-Family

    The "bands" technique we learned for signature matrices carries over to this more general setting. Goal: the "S-curve" effect seen here. AND construction like "rows in a band." OR construction like "many bands."

    5.1.5 AND of Hash Functions

    Given family H, construct family H' whose members each consist of r functions from H. For \( h = \{h_1, \ldots, h_r\} \) in H', h(x) = h(y) iff \(h_i(x) = h_i(y)\) for all i. Theorem: if H is (d1, d2, p1, p2)-sensitive, then H' is \((d_1, d_2, (p_1)^r, (p_2)^r)\)-sensitive. Proof: use the fact that the \(h_i\)'s are independent.

    5.1.6 OR of Hash Functions

    Given family H, construct family H' whose members each consist of b functions from H. For \( h = \{h_1, \ldots, h_b\} \) in H', h(x) = h(y) iff \(h_i(x) = h_i(y)\) for some i. Theorem: if H is (d1, d2, p1, p2)-sensitive, then H' is \((d_1, d_2, 1-(1-p_1)^b, 1-(1-p_2)^b)\)-sensitive.

    5.1.7 Effect of AND and OR Constructions

    AND makes all probabilities shrink, but by choosing r correctly, we can make the lower probability approach 0 while the higher does not. OR makes all probabilities grow, but by choosing b correctly, we can make the upper probability approach 1 while the lower does not.

    5.1.8 Composing Constructions

    As for the signature matrix, we can use the AND construction followed by the OR construction. Or vice-versa. Or any sequence of AND's and OR's alternating.

    5.1.9 AND-OR Composition

    Each of the two probabilities p is transformed into \(1-(1-p^r)^b\), the "S-curve" studied before. E.g.: take H and construct H' by the AND construction with r = 4. Then, from H', construct H'' by the OR construction with b = 4; the overall transformation is \(1-(1-p^4)^4\).

    5.1.10 OR-AND Composition

    Each of the two probabilities p is transformed into \((1-(1-p)^b)^r\). The same S-curve, mirrored horizontally and vertically.

    5.1.11 Cascading Constructions

    E.g.: apply the (4,4) OR-AND construction followed by the (4,4) AND-OR construction. This transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.9999996,.0008715)-sensitive one.
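
    A few lines to check the cascade numbers for the stated (4,4) constructions:

#+BEGIN_SRC python
def and_or(p, r=4, b=4):    # AND then OR: p -> 1 - (1 - p^r)^b
    return 1 - (1 - p**r)**b

def or_and(p, b=4, r=4):    # OR then AND: p -> (1 - (1 - p)^b)^r
    return (1 - (1 - p)**b)**r

for p in (0.8, 0.2):
    print(p, and_or(or_and(p)))   # 0.8 -> ~0.9999996,  0.2 -> ~0.0008715
#+END_SRC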

    5.1.12 General Use of S-Curves

    For each S-curve \(1-(1-p^r)^b\), there is a threshold t for which \(1-(1-t^r)^b = t\). Above t, high probabilities are increased; below t, they are decreased. You improve the sensitivity as long as the low probability is less than t and the high probability is greater than t. Iterate as you like.

    5.2 More LSH Families

    For cosine distance, there is a technique analogous to minhashing for generating a \((d_1, d_2, (1-d_1/180), (1-d_2/180))\)-sensitive family for any d1 and d2. It is called random hyperplanes.

    5.2.1 Random Hyperplanes

    Each vector v determines a hash function \(h_v\) with two buckets: \(h_v(x) = +1\) if \( v \cdot x > 0 \); \(= -1\) if \( v \cdot x < 0 \). LS-family H = the set of all functions derived from any vector. Claim: Prob[h(x)=h(y)] = 1 - (angle between x and y divided by 180).

    5.2.2 Signatures for Cosine Distance

    Pick some number of vectors, and hash your data for each vector. The result is a signature (sketch) of +1's and -1's that can be used for LSH like the minhash signatures for Jaccard distance. But you don't have to think this way: the existence of the LSH-family is sufficient for amplification by AND/OR.

    5.2.3 Simplification

    We need not pick from among all possible vectors v to form a component of a sketch; it suffices to consider only vectors v consisting of +1 and -1 components.
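
    A short sketch of random-hyperplane signatures and the angle estimate they give; Gaussian random vectors are used here because the toy example is only 3-dimensional (the ±1 simplification is intended for high-dimensional data):

#+BEGIN_SRC python
import numpy as np

def sketch(x, planes):
    """Signature of +1/-1 entries: the side of each random hyperplane x falls on."""
    return np.where(planes @ x >= 0, 1, -1)

def estimated_angle(sig_x, sig_y):
    # Prob[h(x) = h(y)] = 1 - angle/180, so angle ~ 180 * fraction of disagreements.
    return 180.0 * np.mean(sig_x != sig_y)

rng = np.random.default_rng(0)
planes = rng.normal(size=(1000, 3))   # 1000 random hyperplane normals
x, y = np.array([1.0, 2.0, 0.0]), np.array([2.0, 1.0, 0.0])
print(estimated_angle(sketch(x, planes), sketch(y, planes)))   # true angle ~36.9 deg
#+END_SRC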

    5.2.4 LSH for Euclidean Distance

    Simple idea: hash functions correspond to lines. Partition the line into buckets of size a. Hash each point to the bucket containing its projection onto the line. Nearby points are always close; distant points are rarely in the same bucket.

    If points are at distance \( \geq 2a \) apart, the angle θ between the line joining them and the random line must satisfy \( 60 \leq \theta \leq 90 \) degrees for there to be a chance that the points go in the same bucket, i.e., at most a 1/3 probability. If points are at distance \( \leq a/2 \), then there is at least a 1/2 chance they share a bucket. This yields an (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions.

    5.2.5 Fixup: Euclidean Distance

    For previous distance measures, we could start with a (d, e, p, q)-sensitive family for any d < e, and drive p and q to 1 and 0 by AND/OR constructions. Here, we seem to need \( e \geq 4d \). But as long as d < e, the probability of points at distance d falling in the same bucket is greater than the probability of points at distance e doing so. Thus, the hash family formed by projecting onto lines is a (d, e, p, q)-sensitive family for some p > q.

    5.3 Topic Specific (aka Personalized) PageRank

    Instead of generic popularity, can we measure popularity within a topic? Goal: evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g. "sports" or "history", and allow search queries to be answered based on the interests of the user. E.g.: the query "Trojan" wants different pages depending on whether you are interested in sports, history, or computer security. The random walker has a small probability of teleporting at any step. Teleports can go to: standard PageRank: any page with equal probability (to avoid dead-end and spider-trap problems); topic-specific PageRank: a topic-specific set of "relevant" pages (the teleport set). Idea: bias the random walk. When the walker teleports, she picks a page from a set S that contains only pages relevant to the topic, e.g., Open Directory (DMOZ) pages for a given topic/query. For each teleport set S, we get a different vector \(r_S\).

    5.3.1 Matrix Formulation

    To make this work, all we need to do is update the teleportation part of the PageRank formulation:

    \begin{equation} A_{ij} = \begin{cases} \beta M_{ij}+(1-\beta)/|S| &\mbox{if}\ i\in S \\ \beta M_{ij} & \mbox{otherwise} \end{cases} \end{equation}

    A is stochastic! We weighted all pages in the teleport set S equally; we could also assign different weights to pages. Random walk with restart: the case where S is a single element. Compute as for regular PageRank: multiply by M, then add a vector. Maintains sparseness.
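
    A minimal power-iteration sketch of topic-specific PageRank with teleports restricted to the set S, using a made-up 4-page column-stochastic graph:

#+BEGIN_SRC python
import numpy as np

def topic_pagerank(M, S, beta=0.8, iters=100):
    """M: column-stochastic transition matrix; S: indices of the teleport set."""
    n = M.shape[0]
    teleport = np.zeros(n)
    teleport[list(S)] = 1.0 / len(S)      # teleport only to pages in S
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = beta * (M @ r) + (1 - beta) * teleport
    return r

# Column j holds the out-link probabilities of page j.
M = np.array([[0.0, 0.5, 0.0, 0.0],
              [1.0, 0.0, 0.5, 1.0],
              [0.0, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.0]])
print(topic_pagerank(M, S={0, 2}))        # approx. [0.26, 0.39, 0.26, 0.10]
#+END_SRC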

    Author: Zhiyuan Wang

    Created: 2014-11-30 Sun 18:14

    Emacs 24.3.1 (Org mode 8.2.10)
