Graph-based Algorithms for Information Retrieval and Natural Language Processing Rada Mihalcea University of North Texas [email protected] RANLP 2005 Introduction Motivation Graph-theory is a well studied discipline So are the fields of Information Retrieval (IR) and Natural Language Processing (NLP) Often perceived as completely different disciplines Goal of the tutorial: provide an overview of methods and applications in IR and NLOP that rely on graph-based algorithms, e.g. Graph-based algorithms: graph traversal, min-cut algorithms, node ranking Applied to: Web search, text understanding, text summarization, keyword extraction, text clustering RANLP 2005 2 Outline Part 1: Elements of graph-theory: Terminology, notations, algorithms on graphs
Part 2: Graph-based algorithms for IR Web search Text clustering and classification Part 3: Graph-based algorithms for NLP Word Sense Disambiguation Clustering of entities (names/numbers) Thesaurus construction Text summarization content style RANLP 2005 3 What is a Graph? A graph G = (V,E) is composed of: V: set of vertices E: set of edges connecting the vertices in V An edge e = (u,v) is a pair of vertices Example:
V= {a,b,c,d,e} a b c d RANLP 2005 E= {(a,b),(a,c), (a,d), (b,e),(c,d),(c,e), (d,e)} e 4 Terminology: Adjacent and Incident If (v0, v1) is an edge in an undirected graph, v0 and v1 are adjacent The edge (v0, v1) is incident on vertices v0 and v1 If
incident to that vertex For directed graph, the in-degree of a vertex v is the number of edges that have v as the head the out-degree of a vertex v is the number of edges that have v as the tail if di is the degree of a vertex i in a graph G with n vertices and e edges, the number of edges is n 1 e ( 0 RANLP 2005 di ) / 2 Why? Since adjacent vertices each count the adjoining edge, it will be counted twice 6 Examples 0 3 2 1 0 3 1 3
2 3 3 directed graph in-degree out-degree 3 5 6 1 0 1 G2 1 in:1, out: 1 1 in: 1, out: 2 2 in: 1, out: 0 4 1 3G1 3 2
G3 RANLP 2005 7 Terminology: Path path: sequence of vertices v1,v2,. . .vk such that consecutive vertices vi and vi+1 are adjacent. 3 2 3 3 3 a b a c c e d d abedc RANLP 2005 b
e bedc 8 More Terminology simple path: no repeated vertices a b bec c e d cycle: simple path, except that the last vertex is the same as the first vertex a b acda c d RANLP 2005 e 9 Even More Terminology connected graph: any two vertices are connected by some path
connected not connected subgraph: subset of vertices and edges forming a graph connected component: maximal connected subgraph. E.g., the graph below has 3 connected components. RANLP 2005 10 Subgraph Examples 0 1 0 0 2 3 G1 0 1 (i) RANLP 2005 G3 2 2 3
0 1 (ii) (iii) (a) Some of the subgraphs of G1 0 1 2 1 (i) 2 3(iv) 0 0 0 1 1 1 2 2 (ii) (iii) (b) Some of the subgraphs of G3
(iv) 11 More tree - connected graph without cycles forest - collection of trees tree tree forest tree tree RANLP 2005 12 Connectivity Let n = #vertices, and m = #edges A complete graph: one in which all pairs of vertices are adjacent How many total edges in a complete graph? Each of the n vertices is incident to n-1 edges, however, we would have counted each edge twice! Therefore, intuitively, m = n(n -1)/2. Therefore, if a graph is not complete, m < n(n -1)/2 n
5 m (5 RANLP 2005 13 More Connectivity n 5 m 4 n = #vertices m = #edges For a tree m = n - 1 If m < n - 1, G is not connected n 5 m 3 RANLP 2005 14 Directed and Undirected Graphs An undirected graph is one in which the pair of vertices in a edge is unordered, (v0, v1) = (v1,v0) A directed graph is one in which each edge is a directed pair of vertices,
Weighted and Unweighted Graphs Edges have / do not have a weight associated with them 2 10 s 5 15 4 3 RANLP 2005 6 t 4 4 weighted 5 7 unweighted 16 Graph Representations: Adjacency
Matrix Let G=(V,E) be a graph with n vertices. The adjacency matrix of G is a two-dimensional n by n array, say adj_mat If the edge (vi, vj) is in E(G), adj_mat[i][j]=1 If there is no such edge in E(G), adj_mat[i][j]=0 The adjacency matrix for an undirected graph is symmetric; the adjacency matrix for a digraph need not be symmetric RANLP 2005 17 Examples of Adjacency Matrices 0 0 1 0 1 1 1 1 0 1 1 1 1 0 1
1 1 1 0 G1 1 2 0 1 0 1 0 1 0 0 0 G2 symmetric undirected: n2/2 directed: n2 RANLP 2005 5 1 2 2 3 4
0 6 3 0 1 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 07 0 0 1 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 1 0 0 0 1 0 G3 18 Advantages of Adjacency Matrix Starting with an adjacency matrix, it is easy to determine the links between vertices n 1 The degree of a vertex is adj _ mat[i][ j ] j 0 For a digraph (= directed graph), the row sum is the out_degree, while the column sum is the in_degree n 1 ind (vi ) A[ j , i ] j 0 RANLP 2005 n 1 outd (vi ) A[i , j ] j 0 19 Graph Representations: Adjacency
Lists Each row in an adjacency matrix is represented as a list An undirected graph with n vertices and e edges => n head nodes and 2e list nodes Pros: More compact representation Memory efficient Cons: Sometimes less efficient computation of inor out- degrees RANLP 2005 20 Examples of Adjacency Lists 0 1 1 0 0 0 2 2 1 1 G1 0 1 2 RANLP 2005
1 0 2 G3 1 2 2 5 6 3 3 0 1 2 3 4 0 7 3 3 3 2 0 1 2 0
1 2 3 4 5 6 7 1 0 0 1 5 4 5 6 2 3 3 2 6 7 21 Operations on Adjacency Lists degree of a vertex in an undirected graph # of nodes in adjacency list # of edges in a graph determined in O(n+e) out-degree of a vertex in a directed graph # of nodes in its adjacency list in-degree of a vertex in a directed graph traverse the whole data structure RANLP 2005
22 Algorithms on Graphs Graph traversal Min-Cut / Max-Flow Ranking algorithms Spectral Clustering Other: Dijkstras Algorithm Minimum Spanning Tree RANLP 2005 23 Graph Traversal Problem: Search for a certain node or traverse all nodes in the graph Depth First Search Once a possible path is found, continue the search until the end of the path Breadth First Search Start several paths at a time, and advance in each one step at a time RANLP 2005 24 Depth-First Search
RANLP 2005 A B C D E F G H I J K L M N O P 25 Depth-First Search Algorithm DFS(v); Input: A vertex v in a graph
Output: A labeling of the edges as discovery edges and backedges for each edge e incident on v do if edge e is unexplored then let w be the other endpoint of e if vertex w is unexplored then label e as a discovery edge recursively call DFS(w) else label e as a backedge RANLP 2005 26 Breadth-First Search 0 a) c) 0 1 A B C D A B C D
E F G H E F G H I J K L I J K L M N O
P M N O P 0 1 2 A B C D E F G H I J K L
M N O P RANLP 2005 b) 0 1 2 3 A B C D E F G H I
J K L M N O P d) 27 Breadth-First Search Algorithm BFS(s): Input: A vertex s in a graph Output: A labeling of the edges as discovery edges and cross edges initialize container L0 to contain vertex s i0 while Li is not empty do create container Li+1 to initially be empty for each vertex v in Li do if edge e incident on v do let w be the other endpoint of e if vertex w is unexplored then label e as a discovery edge insert w into Li+1 else label e as a cross edge ii+1 RANLP 2005 28
Path Finding Find path from source vertex s to destination vertex d Use graph search starting at s and terminating as soon as we reach d Need to remember edges traversed Use depth first search Use breath first search RANLP 2005 29 Path Finding with Depth First Search F B A G D C start E destination B DFS on B A DFS on A A
G D B A Call DFS on G RANLP 2005 C B A DFS on C D B A Call DFS on D Return to call on B found destination - done path is implicitly stored in DFS recursion path is: A, B, D, G 30 Path Finding with Breadth First Search F B A G destination D
C rear front start E rear front rear front B D C Initial call to BFS on A Dequeue A Add A to queue Add B Dequeue B Add C, D A rear front D Dequeue C Nothing to add
front G Dequeue D Add G RANLP 2005 rear found destination - done path must be stored separately 31 Max Flow Network Max flow network: G = (V, E, s, t, u) . (V, E) = directed graph, no parallel arcs. Two distinguished nodes: s = source, t = sink. u(e) = capacity of arc e. 10 s Capacity RANLP 2005 5 15 2 9
5 4 15 15 10 3 8 6 10 4 6 15 10 4 30 7 t 32 Flows
An s-t flow is a function f: E that satisfies: For each e E: 0 f(e) u(e) For each v V {s, t}: f (e ) e in to v f (e ) : e in to v 4 10 s Capacity Flow RANLP 2005 0 5 15 0 f (w, v ) w : ( w ,v ) E f (e ) (conservation) e out of v f (e ) : 2 0 9
4 4 0 15 3 4 8 4 0 0 6 4 (capacity) 30 0 e out of v f (v , w ) w : ( v ,w ) E 5 15 0 0 10 6 4 10
15 0 0 10 t 7 33 Flows An s-t flow is a function f: E that satisfies: For each e E: 0 f(e) u(e) For each v V {s, t}: f (e ) e in to v (capacity) f (e ) (conservation) e out of v MAX FLOW: find s-t flow that maximizes net flow out of the source. 4 10 s Capacity Flow RANLP 2005 0 5
15 0 2 0 9 4 4 0 15 3 4 8 4 0 0 6 4 30 0 f f (e ) e out of s 5 15 0
0 10 6 4 10 15 0 0 10 t Value = 4 7 34 Networks Flow Examples Network Nodes Arcs Flow telephone exchanges, cables, fiber optics, computers, satellites microwave relays voice, video, packets gates, registers, processors wires
current joints rods, beams, springs heat, energy hydraulic reservoirs, pumping stations, lakes pipelines fluid, oil financial stocks, currency transactions money airports, rail yards, street intersections highways, railbeds, airway routes freight, vehicles, passengers sites bonds
energy communication circuits mechanical transportation chemical RANLP 2005 35 Cuts An s-t cut is a node partition (S, T) such that s S, t T. The capacity of an s-t cut (S, T) is: u (e ) : e out of S u (v , w ). (v ,w ) E v S, w T Min s-t cut: find an s-t cut of minimum capacity. 10 s 5 15
2 9 5 4 15 15 10 3 8 6 10 4 6 15 10 4 RANLP 2005 30 7 t
Capacity = 30 36 Cuts An s-t cut is a node partition (S, T) such that s S, t T. The capacity of an s-t cut (S, T) is: u (e ) : e out of S u (v , w ). (v ,w ) E v S, w T Min s-t cut: find an s-t cut of minimum capacity. 10 s 5 15 2 9 5 4
15 15 10 3 8 6 10 4 6 15 10 4 RANLP 2005 30 7 t Capacity = 62 37 Flows and Cuts L1. Let f be a flow, and let (S, T) be a cut. Then, the net
flow sent across the cut is equal to the amount reaching t. f (e ) e out of S 10 10 s 4 5 15 10 Value = 24 RANLP 2005 f (e ) e in to S 2 6 9 4 4 0 15 3 8 8
4 0 0 6 4 30 10 f (e ) f e out of s 5 15 0 6 10 6 8 10 15 0 10 10 t 7
38 Flows and Cuts L1. Let f be a flow, and let (S, T) be a cut. Then, the net flow sent across the cut is equal to the amount reaching t. f (e ) e out of S 10 10 s 4 5 15 10 Value = 24 RANLP 2005 f (e ) e in to S f (e ) f e out of s
2 6 9 4 4 0 15 5 15 0 6 10 3 8 8 6 8 10 4 0 0 6 15 0 10 10 4
30 10 t 7 39 Flows and Cuts Let f be a flow, and let (S, T) be a cut. Then, f (e ) e out of S f (e ) f . e in to S Proof by induction on |S|. Base case: S = { s }. Inductive hypothesis: assume true for |S| = k-1. consider S cut (S, T) with |S| = k = S' { v } for some v s, t, |S' | = k-1 adding cap(S', T') = | f |. f (e )
v to S' increase cut capacity by e out of v e in to v t t v v s s S' RANLP 2005 f (e ) 0 Before S After 40 Max-Flow Min-Cut Theorem Max-Flow Min-Cut Theorem (Ford-Fulkerson, 1956): In any network, the value of the max flow is equal to the value of the min cut. 10 10 s
4 5 15 15 2 9 9 4 0 1 15 3 8 8 4 0 4 6 4 Cut capacity = 28 RANLP 2005 30 15 5 15 0
9 10 6 9 10 15 0 10 10 t 7 Flow value = 28 41 Random Walk Algorithms Algorithms for link analysis Based on random walks Results into a ranking of vertex importance Used in many fields: Social networks co-authorship, RANLP 2005
citations Web search engines Clustering NLP 42 PageRank Intuition (Brin & Page 1998) Imagine a web surfer doing a simple random walk on the entire web for an infinite number of steps. Occasionally, the surfer will get bored and instead of following a link pointing outward from the current page will jump to another random page. At some point, the percentage of time spent at each page will converge to a fixed value. This value is known as the PageRank of the page. RANLP 2005 43 HITS Intuition
Hyperlinked Induced Topic Search (Kleinberg 1999) Defines 2 types of importance: Authority Can be thought of as being an expert on the subject You are important if important hubs point to you Hub Can be thought of as being knowledgeable about the subject You are important if important authorities point to you Hubs and Authorities mutually reinforce each other Note: A node in a graph can be both an Authority and a Hub at the same time, eg. cnbc.com Note: this is equivalent to performing a random walk where: Odd steps you walk from hubs to authorities (follow outlink) Even steps you walk from authorities to hubs (follow inlink) RANLP 2005 44 0.51 A F 1.47 1.35
B D 1.17 E 0.36 0.63 RANLP 2005 C 45 Graph-based Ranking Decide the importance of a vertex within a graph A link between two vertices = a vote Vertex A links to vertex B = vertex A votes for vertex B Iterative voting => Ranking over all vertices Model a random walk on the graph A walker takes random steps Converges to a stationary distribution of probabilities Probability of finding the walker at a certain vertex Guaranteed to converge if the graph is Aperiodic
holds for non-bipartite graphs Irreducible holds for strongly connected graphs RANLP 2005 46 Graph-based Ranking Traditionally applied on directed graphs From a given vertex, the walker selects at random one of the out-edges Applies also to undirected graphs From a given vertex, the walker selects at random one of the incident edges Adapted to weighted graphs From a given vertex, the walker selects at random one of the out (directed graphs) or incident (undirected graphs) edges, with higher probability of selecting edges with higher weight RANLP 2005 47 Algorithms G = (V,E) a graph
In(Vi) = predecessors, Out(Vi) = successors PageRank (Brin & Page 1998) integrates incoming / outgoing links in one formula PR(Vi ) (1 d ) d 1 PR(V j ) jIn (Vi ) | Out (V j ) | where d [0,1] Adapted to weighted graphs w PR (Vi ) (1 d ) d jIn (Vi ) RANLP 2005 w ji w PR w (V j ) jk Vk Out (V j ) 48 Algorithms
HITS Hyperlinked Induced Topic Search (Kleinberg 1999) authorities / hubs HITS H (Vi ) HITS A (V j ) jOut (Vi ) HITS A (Vi ) HITS H (V j ) jIn (Vi ) Positional Function (Positional Power / Weakness (Herings et al. 2001) 1 POS P (Vi ) (1 POSW (V j )) | V | V j Out (Vi ) POSW (Vi ) RANLP 2005 1 (1 POS P (V j )) | V | V j In (Vi ) 49 Convergence Convergence Error below a small threshold value (e.g. 0.0001)
| S (Vi ) k S (Vi ) || S (Vi ) k 1 S (Vi ) k | Text-based graphs convergence usually achieved after 20-40 iterations more iterations for undirected graphs RANLP 2005 50 Convergence Curves Directed Unidrected 0.19 0.17 0.15 0.13 Score 0.11 0.09 0.07 0.05 0.03 0.01 -0.01 1 2 3 4
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 Iteration (250 vertices, 250 edges) RANLP 2005 51 Spectral Clustering Algorithms that cluster data using graph representations Similarity graph: Distance decreases as similarity increases Represent dataset as a weighted graph G(V,E) 0.1 0.8 0.6 6 4 0.8 0.8 0.8
2 5 1 3 0.7 0.2 Clustering can be viewed as partitioning a similarity graph Bi-partitioning: divide vertices into two disjoint groups: (A,B) A 2 B 5 1 4 6 3 RANLP 2005 52 Clustering Objectives
Traditional definition of a good clustering: 1. Points assigned to same cluster should be highly similar. 2. Points assigned to different clusters should be highly dissimilar. Apply these objectives to our graph representation 0.1 0.8 5 1 0.8 0.8 0.6 6 2 4 0.7 0.2 0.8 3 Minimize weight of between-group connections Options:
1. Find minimum cut between two clusters 2. Apply spectral graph theory RANLP 2005 53 Spectral Graph Theory Possible approach Represent a similarity graph as a matrix Apply knowledge from Linear Algebra The eigenvalues and w11 eigenvectors of a matrix provide global information about its structure. wn1 w1n wnn x1
xn x1 xn Spectral Graph Theory Analyse the spectrum of matrix representing a graph Spectrum : The eigenvectors of a graph, ordered by the magnitude (strength) of their corresponding eigenvalues {1 , 2 ,..., n } RANLP 2005 54 Matrix Representations Adjacency matrix (A) n x n matrix A [ wij ]: edge weight between vertex xi and xj 0.1 5 1 0.8 0.8 0.6
2 0.8 0.8 6 4 x1 x2 x3 x4 x5 x6 x1 0 0.8 0.6 0 0.1 0 x2
0.8 0 0.8 0 0 0 x3 0.6 0.8 0 0.2 0 0 x4 0 0 0.2 0 0.8 0.7
x5 0.1 0 0 0.8 0 0.8 x6 0 0 0 0.7 0.8 0 0.7 3 0.2 Important properties: Symmetric matrix RANLP 2005
55 Matrix Representations (continued) Degree matrix (D) n x n diagonal matrix D (i, i ) w ij : total weight of edges incident to vertex xi j 0.1 0.8 5 1 0.8 0.6 2 0.8 RANLP 2005 6 4
0.7 3 0.2 0.8 x1 x2 x3 x4 x5 x6 x1 1.5 0 0 0 0 0 x2 0 1.6
0 0 0 0 x3 0 0 1.6 0 0 0 x4 0 0 0 1.7 0 0 x5 0
0 0 0 1.7 0 x6 0 0 0 0 0 1.5 56 Matrix Representations L=D-A Laplacian matrix (L) n x n symmetric matrix 0.1 0.8 5
1 0.8 0.6 2 0.8 RANLP 2005 0.7 3 0.2 0.8 6 4 (continued) x1 x2 x3 x4 x5 x6 x1 1.5
-0.8 -0.6 0 -0.1 0 x2 -0.8 1.6 -0.8 0 0 0 x3 -0.6 -0.8 1.6 -0.2 0 0
x4 0 0 -0.2 1.7 -0.8 -0.7 x5 -0.1 0 0 0.8- 1.7 -0.8 x6 0 0 0 -0.7 -0.8
1.5 57 Spectral Clustering Algorithms Three basic stages: 1. Pre-processing Construct a matrix representation of the dataset. 2. Decomposition Compute eigenvalues and eigenvectors of the matrix. Map each point to a lower-dimensional representation based on one or more eigenvectors. 3. Grouping RANLP 2005 Assign points to two or more clusters, based on the new representation. 58 Spectral Bi-partitioning Algorithm Pre-processing 1.
Build Laplacian matrix L of the graph Decomposition 2. RANLP 2005 Find eigenvalues and eigenvectors X of the matrix L Map vertices to corresponding components of 2 (= algebraic connectivity of the graph) x1 x2 x3 x4 x5 x6 x1 1.5
-0.8 -0.6 0 -0.1 0 x2 -0.8 1.6 -0.8 0 0 0 x3 -0.6 -0.8 1.6 -0.2 0 0
x4 0 0 -0.2 1.7 -0.8 -0.7 x5 -0.1 0 0 -0.8 1.7 -0.8 x6 0 0 0 -0.7 -0.8
1.5 0.0 0.4 0.2 0.1 0.4 -0.2 -0.9 0.4 0.2 0.1 -0. 0.4 0.3 0.4 0.2 -0.2 0.0 -0.2
0.6 0.4 -0.4 0.9 0.2 -0.4 -0.6 2.5 0.4 -0.7 -0.4 -0.8 -0.6 -0.2 3.0 0.4 -0.7 -0.2 0.5 0.8
0.9 0.4 = 2.2 2.3 x1 0.2 x2 0.2 x3 0.2 x4 -0.4 x5 -0.7 x6 -0.7 X= Fiedler vector 59
Spectral Bi-partitioning (continued) Grouping Sort components of reduced 1-dimensional vector. Identify clusters by splitting the sorted vector in two. How to choose a splitting point? Nave approaches: Split at 0, mean or median value More expensive approaches Attempt to minimise (normalised) cut in 1-dimension RANLP 2005 Split at 0 x1
0.2 x2 0.2 x3 0.2 x4 -0.4 x5 -0.7 x1 0.2 x4 -0.4 x6 -0.7 x2 0.2 x5 -0.7 x3
0.2 x6 -0.7 Cluster A: Positive points Cluster B: Negative points A B 60 3-Clusters Lets assume the next data points RANLP 2005 61 If we use the 2nd eigenvector If we use the 3rd eigenvector RANLP 2005 62 K-Way Spectral Clustering
How do we partition a graph into k clusters? Two basic approaches: 1. Recursive bi-partitioning (Hagen et al.,91) Recursively apply bi-partitioning algorithm in a hierarchical divisive manner. 2. Cluster multiple eigenvectors Build a reduced space from multiple eigenvectors. Use 3rd eigenvector to further partition the clustering output of the 2 nd eigenvector Commonly used in recent work 3 Oloss (n ) Multiple eigenvectors prevent instability due to information RANLP 2005 63 K-Eigenvector Clustering K-eigenvector Algorithm (Ng et al.,01) 1. Pre-processing Construct the scaled adjacency matrix A' D
1/ 2 AD 1/ 2 2. Decomposition Find the eigenvalues and eigenvectors of A'. Build embedded space from the eigenvectors corresponding to the k largest eigenvalues. 3. Grouping Apply k-means to reduced n x k space to produce k clusters. RANLP 2005 Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids. Assign each object to the group that has the closest centroid. When all objects have been assigned, recalculate the positions of the K centroids. Repeat Steps 2 and 3 until the centroids no longer move. 64 Outline Part 1: Elements of graph-theory: Terminology, notations, algorithms on graphs
Part 2: Graph-based algorithms for IR Web search Text clustering and classification Part 3: Graph-based algorithms for NLP Word Sense Disambiguation Clustering of entities (names/numbers) Thesaurus construction Text summarization content style RANLP 2005 65 Web-link Analysis for Improved IR Web: heterogeneous and unstructured Free of quality control on the web Commercial interest to manipulate ranking Use a model of Web page importance (based on link analysis) (Brin and Page 1998), (Kleinberg 1999) Related work
RANLP 2005 Academic citation analysis Clustering methods of link structure Hubs and Authorities Model 66 Back-links Link Structure of the Web Approximates the importance / quality of Web pages PageRank: Pages with lots of back-links are important Back-links coming from important pages convey more importance to a page Problem: rank sink RANLP 2005 67 Rank Sink Page cycles pointed by some incoming link Problem: this loop will accumulate rank but never distribute any rank outside
RANLP 2005 68 Escape Term Solution: Rank Source 1 PR (Vi ) (1 d ) E (u ) d PR (V j ) jIn (Vi ) | Out (V j ) | E(u) is some vector over the web pages uniform, favorite pages, etc. Recall that the rank (PageRank) is the eigenvector corresponding to the largest eigenvalue RANLP 2005 69 Random Surfer Model Page Rank corresponds to the probability distribution of a random walk on the web graphs E(u) can be re-phrased as the random surfer gets bored periodically and jumps to a different page and not kept in a loop forever
RANLP 2005 70 Implementation Computing resources 24 million pages 75 million URLs Unique integer ID for each URL Sort and Remove dangling links Rank initial assignment Iteration until convergence Add back dangling links and Re-compute RANLP 2005 71 Ranking Web-pages Combined PageRank + vectorial model Vectorial model: Finds similarity between query and documents
Uses term weighting schemes (TF.IDF) Ranks documents based on similarity Ranking is query specific PageRank: Ranks documents based on importance Ranking is general (not query specific) Combined model: Weighted combination of vectorial and PageRank ranking RANLP 2005 72 Issues Users are not random walkers Content based methods Starting point distribution Actual usage data as starting vector Reinforcing effects/bias towards main pages No query specific rank Linkage spam PageRank favors pages that managed to get other pages to link to them Linkage not necessarily a sign of relevancy, only of promotion
(advertisement) RANLP 2005 73 Personalized PageRank 1. Initial value for rank source E can change ranking: Uniformly over all pages: e.g. copyright warnings, disclaimers, mailing lists archives 2. 3. may result in overly high ranking Total weight on a few sets of pages based on a-priori known popularity, e.g. Netscape, McCarthy Different weight depending on the query or topic: 1. topic-sensitive PageRank (Haveliwala 2001) 2. query-sensitive PageRank (Richardson 2002) RANLP 2005 74 Link Analysis for IR PageRank is a global ranking based on the web's graph structure PageRank relies on back-links information to bring order to the web Proven in practice to exceed the performance of other search engines, which are exclusively
based on vectorial / boolean models Can be adapted to personalized searches Although personalized search engines are still not very popular RANLP 2005 75 Text Classification Given: A description of an instance, xX, where X is the instance language or instance space. A fixed set of categories: C = {c1, c2,, cn} Determine: The category of x: c(x)C, where c(x) is a categorization function whose domain is X and whose range is C RANLP 2005 76 Text Classification Examples Assign labels to each document or web-page: Labels are most often topics such as Yahoo-categories e.g., "finance," "sports," "news>world>asia>business" Labels may be genres e.g., "editorials" "movie-reviews" "news Labels may be opinion e.g., like, hate, neutral
Labels may be domain-specific binary e.g., "interesting-to-me" : "not-interesting-to-me e.g., spam : not-spam e.g., is a toner cartridge ad :isnt RANLP 2005 77 Traditional Methods Manual classification Used by Yahoo!, Looksmart, about.com, ODP, Medline very accurate when job is done by experts consistent when the problem size and team is small difficult and expensive to scale Automatic document classification Hand-coded rule-based systems Used by CS depts spam filter, Reuters, E.g., assign category if document contains a given boolean combination of words Supervised learning of document-label assignment function Many new systems rely on machine learning k-Nearest Neighbors (simple, powerful) Naive Bayes (simple, common method)
Support-vector machines (new, more powerful) RANLP 2005 78 Text Clustering Goal similar to the task of text classification: classify text into categories No apriori known set of categories Cluster data into sets of similar documents Pros: no annotated data is required Cons: more difficult to assign labels Traditional approaches: All clustering methods have been also applied to text clustering Agglomerative clustering K-means RANLP 2005 79 Spectral Learning / Clustering Use graph representation of text collections for: Text clustering: spectral clustering Text classification: spectral learning
Recall bi-partitioning and k-way clustering: Use 2nd or first K eigenvectors as a reduced space Map vertices to the reduce space Cluster the reduced space E.g. cluster the points in the 2nd eigenvector in two classes: positive vs. negative => bi-partitioning Use k-means to cluster the representation obtained using top K eigenvectors RANLP 2005 80 Spectral Learning / Clustering Spectral representation: (Kamvar et al., 2003) Find similarity graph for the collection Find Laplacian matrix (alternatives: normalized L, LSA, etc.) Find the K largest eigenvectors of the Laplacian => X Normalize the rows in X to be unit length Clustering: Use k-means (or other clustering algorithm) to cluster the rows in X into K clusters Classification: Assign known labels to the training points in the collection, using the representation in X Classify unlabeled rows corresponding to the test points in the collection, using the representation in X, with the help of any
learning algorithm (SVM, Nave Bayes, etc.) RANLP 2005 81 Empirical Results Data sets: 20 newsgroups: 1000 postings from each of 20 usenet newsgroups 3 newsgroups: 3 of the 20 newsgroups (crypt, politics, religion) Lymphoma: gene expression profiles. 2 classes with 42 and 54 samples each Soybean: the soybean-large data set from the UCI repository, with 15 classes RANLP 2005 82 Text Clustering Compare spectral clustering with k-means Use document-to-document similarities to build similarity graph for each collection Pairwise cosine similarity Use k = 20 Spectral Learning K-means 3 news 0.84 0.20 20 news 0.36
0.07 lymphoma 0.50 0.10 soybean 0.41 0.34 RANLP 2005 83 Text Classification Use the same spectral representation, but apply a learning algorithm trained on the labeled examples Reinforce the prior knowledge: set Aij to 1 if i and j are known to belong to the same class, or 0 otherwise Can naturally integrate unlabeled training examples included in the initial matrix RANLP 2005 84 Text Classification RANLP 2005 85 References
Brin, S. and Page, L., The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems, 1998 Herings, P.J. van der Laan, G. and Talman, D., Measuring the power of nodes in digraphs, Technical Report, Tinbergen Institute, 2001 Kamvar, S. and Klein, D. and Manning, C. Spectral Learning, IJCAI 2003 Kleinberg, J., Authoritative sources in a hyperlinked environment, Journal of the ACM, 1999 Ng, A. and Jordan, M. and Weiss, Y. On Spectral Clustering: Analysis and an algorithm, NIPS 14, 2002 RANLP 2005 86 Outline Part 1: Elements of graph-theory: Terminology, notations, algorithms on graphs Part 2: Graph-based algorithms for IR Web search Text clustering and classification Part 3: Graph-based algorithms for NLP
Word Sense Disambiguation Clustering of entities (names/numbers) Thesaurus construction Text summarization content style RANLP 2005 87 Word Sense Disambiguation Word sense disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities. Sense Inventory usually comes from a dictionary or thesaurus. Knowledge intensive methods, supervised learning, and (sometimes) bootstrapping approaches Word polysemy (with respect to a dictionary) Ex: chair furniture or person Ex: child young person or human offspring Determine which sense of a word is used in a specific sentence Sit on a chair Take a seat on this chair The chair of the Math Department The chair of the meeting RANLP 2005 88 Graph-based Solutions for WSD
Use information derived from dictionaries / semantic networks to construct graphs Build graphs using measures of similarity Similarity determined between pairs of concepts, or between a word and its surrounding context Distributional similarity (Lee 1999) (Lin 1999) Dictionary-based similarity (Rada 1989) carnivore fissiped mamal, fissiped canine, canid wolf dingo wild dog dog hyena dog feline, felid hyena hunting dog dachshund RANLP 2005 bear terrier 89
Semantic Similarity Metrics Input: two concepts (same part of speech) Output: similarity measure E.g. (Leacock and Chodorow 1998) Path(C1 , C2 ) Similarity (C1 , C2 ) log 2D where D is the taxonomy depth E.g. Similarity(wolf,dog) = 0.60 Similarity(wolf,bear) = 0.42 Other metrics: Similarity using information content (Resnik 1995) (Lin 1998) Similarity using gloss-based paths across different hierarchies (Mihalcea and Moldovan 1999) Conceptual density measure between noun semantic hierarchies and current context (Agirre and Rigau 1995) RANLP 2005 90 Lexical Chains for WSD Apply measures of semantic similarity in a global context
Lexical chains (Hirst and St-Onge 1988), (Haliday and Hassan 1976) A lexical chain is a sequence of semantically related words, which creates a context and contributes to the continuity of meaning and the coherence of a discourse Algorithm for finding lexical chains: 1. 2. 3. Select the candidate words from the text. These are words for which we can compute similarity measures, and therefore most of the time they have the same part of speech. For each such candidate word, and for each meaning for this word, find a chain to receive the candidate word sense, based on a semantic relatedness measure between the concepts that are already in the chain, and the candidate word meaning. If such a chain is found, insert the word in this chain; otherwise, create a new chain. RANLP 2005 91 Lexical Chains A very long train traveling along the rails with a constant velocity v in a certain direction train #1: public transport #1 change location
# 2: a bar of steel for trains #2: order set of things #3: piece of cloth travel #2: undergo transportation rail #1: a barrier #3: a small bird RANLP 2005 92 Lexical Chains for WSD Identify lexical chains in a text Usually target one part of speech at a time Identify the meaning of words based on their membership to a lexical chain Evaluation: (Galley and McKeown 2003) lexical chains on 74 SemCor texts give 62.09% (Mihalcea and Moldovan 2000) on five SemCor texts give 90% with 60% recall lexical RANLP 2005
chains anchored on monosemous words 93 Graph-based Ranking on Semantic Networks Goal: build a semantic graph that represents the meaning of the text Input: Any open text Output: Graph of meanings (synsets) importance scores attached to each synset relations that connect them Models text cohesion (Halliday and Hasan 1979) From a given concept, follow links to semantically related concepts Graph-based ranking identifies the most recommended concepts RANLP 2005 94 Two U.S. soldiers and an unknown number of civilian contractors are unaccounted for after a fuel convoy was attacked near the Baghdad International Airport today,
a senior Pentagon official said. One U.S. soldier and an Iraqi driver were killed in the incident. RANLP 2005 95 Main Steps Step 1: Preprocessing SGML parsing, text tokenization, part of speech tagging, lemmatization Step 2: Assume any possible meaning of a word in a text is potentially correct Insert all corresponding synsets into the graph Step 3: Draw connections (edges) between vertices Step 4: Apply the graph-based ranking algorithm PageRank, HITS, Positional power RANLP 2005 96 Semantic Relations Main relations provided by WordNet
ISA (hypernym/hyponym) PART-OF (meronym/holonym) causality attribute nominalizations domain links Derived relations coord: synsets with common hypernym Edges (connections) directed (direction?) / undirected Best results with undirected graphs Output: Graph of concepts (synsets) identified in the text importance scores attached to each synset relations that connect them RANLP 2005 97 Word Sense Disambiguation
Rank the synsets/meanings attached to each word Unsupervised method for semantic ambiguity resolution of all words in unrestricted text (Mihalcea et al. 2004) Related algorithms: Lesk Baseline (most frequent sense / random) Hybrid: Graph-based ranking + Lesk Graph-based ranking + Most frequent sense Evaluation Informed (with sense ordering) Uninformed (no sense ordering) Data Senseval-2 all words data (three texts, average size 600) SemCor subset (five texts: law, sports, debates, education, entertainment) RANLP 2005 98 Evaluation Size(words) Random SemCor law sports education debates entertainment
Avg. Senseval-2 d00 d01 d02 Avg. Average (all) Lesk TextRank TextRank+Lesk 825 808 898 799 802 826 37.12% 29.95% 37.63% 40.17% 39.27% 36.82% 39.62% 33.00% 41.33% 42.38% 43.05% 39.87% 46.42% 40.59% 46.88% 47.80% 43.89% 45.11%
49.36% 46.18% 52.00% 50.52% 49.31% 49.47% 471 784 514 590 740 28.97% 45.47% 39.24% 37.89% 37.22% 43.94% 52.65% 49.61% 48.73% 43.19% 43.94% 54.46% 54.28% 50.89% 47.27% 47.77% 57.39% 56.42% 53.86% 51.16% uninformed (no sense order)
RANLP 2005 99 Evaluation SemCor law sports education debates entertainment Avg. Senseval-2 d00 d01 d02 Avg. Average (all) Size(words) MFS Lesk TextRank TextRank+Lesk 825 808 898 799 802 826 69.09% 57.30% 64.03% 66.33%
59.72% 63.24% 72.65% 64.21% 69.33% 70.07% 64.98% 68.24% 73.21% 68.31% 71.65% 71.14% 66.02% 70.06% 73.97% 68.31% 71.53% 71.67% 66.16% 70.32% 471 784 514 590 740 51.70% 60.80% 55.97% 56.15% 60.58% 53.07% 64.28% 62.84%
60.06% 65.17% 58.17% 67.85% 63.81% 63.27% 67.51% 57.74% 68.11% 64.39% 63.41% 67.72% informed (sense order integrated) RANLP 2005 100 Ambiguous Entitites Name ambiguity in research papers: David S. Johnson, David Johnson, D. Johnson David Johnson (Rice), David Johnson (AT & T) Similar problem across entities: Washington (person), Washington (state), Washington (city) Number ambiguity in texts:
quantity: e.g. 100 miles time: e.g. 100 years money: e.g. 100 euro misc.: anything else Can be modeled as a clustering problem RANLP 2005 101 Name Disambiguation Extract attributes for each person e.g. for research papers: Co-authors Paper titles Venue titles Each word in these attribute sets constitutes a binary feature Apply a weighting scheme: E.g. normalized TF, TF/IDF Construct a vector for each occurrence of a person name (Han et al., 2005)
RANLP 2005 102 Spectral Clustering Apply k-way spectral clustering to name data sets Two data sets: DBLP: authors of 400,000 citation records use top 14 ambiguous names Web-based data set: 11 authors named J. Smith and 15 authors named J. Anderson, in a total of 567 citations Clustering evaluated using confusion matrices Disambiguation accuracy = sum of diagonal elements A[i,i] divided by the sum of all elements in the matrix RANLP 2005 103 Name Disambiguation Results DBLP data set: Web-based data set 11 J.Smith: 84.7% (k-means 75.4%) 15 J.Anderson: 71.2% (k-means 67.2%) RANLP 2005 104
Number Disambiguation Classify numbers into 4 classes: quantity, time, money, miscellaneous Two algorithms Spectral clustering on bipartite graphs Tripartite updating Features Surrounding window +/- 2, and the range of the word (w 3, 3 < w 10, 10 < w 30 ) Map all numbers in the data set into these features (Radev, 2004) RANLP 2005 105 Tripartite Updating F set of features, L/U set of labeled / unlabeled examples T1: connectivity matrix between F and L T2: connectivity matrix between F and U F and U are iteratively updated
F (t ) T1T L F (t 1) RANLP 2005 U (t ) F ( t 1) T 2 T F T 2 (t ) T U U (t ) ( t 1) F (t ) 106 Number Disambiguation Results
Use a subset of the numbers from the APW section of the ACQUAINT corpus 5000 unlabeled examples, of which 300 were manually annotated 200 labeled examples for evaluation RANLP 2005 107 Automatic Thesaurus Generation Idea: Use an (online) traditional dictionary to generate a graph structure, and use it to create thesaurus-like entries for all words (stopwords included) (Jannink and Wiederhold, 1999) For instance: Input: the 1912 Websters Dictionary Output: a repository with rank relationships between terms The repository is comparable with handcrafted efforts such as MindNet and WordNet RANLP 2005 108 Algorithm 1. Extract a directed graph from the dictionary
- e.g. relations between head-words and words included in definitions 2. 3. Obtain the relative measure of arc importance Rank the arcs with ArcRank (graph-based algorithm for edge ranking) RANLP 2005 109 Algorithm (cont.) 1. Extract a directed graph from the Dictionary Directed Graph Transport. To carry or bear from one place to another; to carry Transport head word or bear One arc from each headword to all
words in the definition. Potential problems as: syllable and accent markers in head words, misspelled head words, accents, special characters, mistagged fields, common abbreviations in definitions, steaming, multi-word head words, undefined words with common prefixes, undefined hyphenated. Source words: words never used in definitions Sink words: undefined words RANLP 2005 110 Algorithm (cont.) 2. Obtain the relative measure of arc importance re = Transport e edge ps / |as| pt carry t target node Where: re is the rank of the edge s source node
ps is the rank of the source node pt is the rank of the target node |as| is the number of outgoing nodes |as| outgoing arcs m rs,t = e=1 RANLP 2005 ps / |as| pt For more than 1 node (m) between s and t 111 Algorithm (cont.) 3. Rank the arcs using ArcRank Rank the importance of arcs with respect to source and target nodes It promotes arcs that are important in both endpoints ArcRank Input: triples (source s, target t, importance v s,t) 1. given source s and target t nodes 2. at s, sort vs, tj and rank arcs rs(vs, tj ) 3.
at t, sort vsi,t and rank arcs rt(csi,t) 4. compute ArcRank: mean(rs (vs,t), rt(cs,t)) 5. Rank Arcs input: sorted arc importance {0.9, 0.75, 0.75, 0.75, 0.6, 0.5, . , 0.1} sample values {1, 2, 2, 2, } equal values take same rank {1, 2, 2, 2, 3, } number ranks consecutively RANLP 2005 112 Results An automatically built thesaurus starting with Webster Webster WordNet MindNet 96,800 terms 112,897 distinct words 99,642 terms
159,000 head words 173,941 word senses 713,000 relationships between headwords error rates: <1% of original input <0.05% incorrect arcs (hyphenation) <0.05% incorrect terms (spelling) 0% artificial terms RANLP 2005 error rates: ~0.1% inappropriate classifications (not publicly available) ~1-10% artificial & repeated terms 113 Results Analysis The Websters repository PROS It has a very general structure It can also address stopwords It has more relationships than WordNet It allows for any relationships between words (not only within a lexical category) It is more natural: it does not include artificial concepts such as nonexistent words and artificial categorizations CONS The type of relationships is not always evident
The accuracy increases with the amount of data, nevertheless, the dictionary contains sparse definitions It only distinguishes senses based on usage, not grammar It is less precise than other thesauruses (e.g. WordNet) RANLP 2005 114 Extractive Summarization Graph-based ranking algorithms for extractive summarization 1. Build a graph: Sentences in a text = vertices Similarity between sentences = weighted edges Model the cohesion of text using intersentential similarity 2. Run a graph-based ranking algorithm: RANLP 2005 keep top N ranked sentences = sentences most recommended by other sentences (Mihalcea & Tarau, 2004) (Erkan & Radev 2004) 115 A Process of Recommendation Underlying idea: a process of recommendation A sentence that addresses certain concepts in a text
gives the reader a recommendation to refer to other sentences in the text that address the same concepts Text knitting (Hobbs 1974) repetition in text knits the discourse together Text cohesion (Halliday & Hasan 1979) RANLP 2005 116 Sentence Similarity Inter-sentential relationships weighted edges Count number of common concepts Normalize with the length of the sentence Sim( S1 , S 2 ) | {wk | wk Si wk S j } | log(| Si |) log(| S j |) Other similarity metrics are also possible: cosine, string kernels, etc. work in progress RANLP 2005 117 Graph Structure
Undirected No direction established between sentences in the text A sentence can recommend sentences that preceed or follow in the text Directed forward A sentence can recommend only sentences that follow in the text Seems more appropriate for movie reviews, stories, etc. Directed backward A sentence can recommend only sentences that preceed in the text More appropriate for news articles RANLP 2005 118 An Example 3. r i BC-HurricaneGilbert 09-11 0339 4. BC-Hurricane Gilbert , 0348 5. Hurricane Gilbert Heads Toward Dominican Coast 6. By RUDDY GONZALEZ 7. Associated Press Writer 8. SANTO DOMINGO , Dominican Republic ( AP ) 9. Hurricane Gilbert swept toward the Dominican Republic Sunday , and the Civil Defense alerted its heavily populated south coast to prepare for high winds , heavy rains and high seas . 10. The storm was approaching from the southeast with sustained winds of 75 mph gusting to 92 mph . 11. " There is no need for alarm , " Civil Defense Director Eugenio Cabral said in a television alert shortly before midnight Saturday .
12. Cabral said residents of the province of Barahona should closely follow Gilbert 's movement . 13. An estimated 100,000 people live in the province , including 70,000 in the city of Barahona , about 125 miles west of Santo Domingo . 14. Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a hurricane Saturday night 15. The National Hurricane Center in Miami reported its position at 2a.m. Sunday at latitude 16.1 north , longitude 67.5 west , about 140 miles south of Ponce , Puerto Rico , and 200 miles southeast of Santo Domingo . 16. The National Weather Service in San Juan , Puerto Rico , said Gilbert was moving westward at 15 mph with a " broad area of cloudiness and heavy weather " rotating around the center of the storm . 17. The weather service issued a flash flood watch for Puerto Rico and the Virgin Islands until at least 6p.m. Sunday . 18. Strong winds associated with the Gilbert brought coastal flooding , strong southeast winds and up to 12 feet to Puerto Rico 's south coast . 19. There were no reports of casualties . 20. San Juan , on the north coast , had heavy rains and gusts Saturday , but they subsided during A text from DUC 2002 on Hurricane Gilbert 24 sentences RANLP 2005 119 An Example (cont'd) [0.71] 4 [0.50] 24 [0.80] [0.70]
[1.02] 23 5[1.20] 0.15 6 [0.15] 7 0.19 22 [0.15] 0.55 8 [0.70] 0.35 21 0.15 9 [1.83] 0.30 [0.84] 20 10 19
0.59 [0.15] 0.15 0.27 18 [1.58] [0.70] 0.16 12 17 16 [1.65] RANLP 2005 [0.99] 15 [1.36] 14 11 [0.56] [0.93] 13
[0.76] [1.09] 120 Evaluation Task-based evaluation: automatic text summarization Single document summarization 100-word summaries Multiple document summarization 100-word multi-doc summaries clusters of ~10 documents Data from DUC (Document Understanding Conference) DUC 2002 567 single documents 59 clusters of related documents Automatic evaluation with ROUGE (Lin & Hovy 2003) n-gram based evaluations unigrams found to have the highest correlations with human judgment no stopwords, stemming RANLP 2005
121 Results: Single Document Summarization Single-doc summaries for 567 documents (DUC 2002) Graph Algorithm UndirectedDir.forwardDir.backward HITSA W 0.4912 0.4584 0.5023 HITSHW 0.4912 0.5023 0.4584 POSPW 0.4878 0.4538 0.3910 POSW W 0.4878
0.3910 0.4538 PRW 0.4904 0.4202 0.5008 Top 5 systems (DUC 2002) S27 S31 S28 S21 0.5011 0.4914 0.4890 0.4869 RANLP 2005 S29 Baseline 0.4681 0.4799 122 Multiple Document Summarization Cascaded summarization (meta summarizer) Use best single document summarization alorithms PageRank HITS A (Undirected / Directed Backward) (Undirected / Directed Backward) 100-word single document summaries
100-word summary of summaries Avoid sentence redundancy: set max threshold on sentence similarity (0.5) Evaluation: build summaries for 59 clusters of ~10 documents compare with top 5 performing systems at DUC 2002 baseline: first sentence in each document RANLP 2005 123 Results: Multiple Document Summarization Multi-doc summaries for 59 clusters (DUC 2002) PageRank-U PageRank-DB HITSA -U HITSA -DB PageRank-U PageRank-DB 0.3552 0.3499 0.3502 0.3448 0.3368 0.3259 0.3572 0.3520 HITSA -U HITSA -DB 0.3456
0.3519 0.3212 0.3462 0.3465 0.3439 0.3423 0.3473 Top 5 systems (DUC 2002) S26 S19 S29 S25 S20 Baseline 0.3578 0.3447 0.3264 0.3056 0.3047 0.2932 RANLP 2005 124 Portuguese Single-Document Evaluation 100 news articles from Jornal de Brasil, Folha de Sao Paolo Create summaries with the same length as manual summaries Automatic evaluation with ROUGE Same baseline (top sentences in each document) Algorithm HITS-A PageRank Baseline RANLP 2005 Undirected 0.4814 0.4939 Graph Forward
0.4834 0.4574 0.4963 Backward 0.5002 0.5121 125 Graph-based Ranking for Sentence Extraction Unsupervised method for sentence extraction based exclusively on information drawn from the text itself Goes beyond simple sentence connectivity Gives a ranking over all sentences in a text can be adapted to longer / shorter summaries No training data required can be adapted to other languages Already applied on Portuguese RANLP 2005 126 Subjectivity Analysis for Sentiment
Classification The objective is to detect subjective expressions in text (opinions against facts) Use this information to improve the polarity classification (positive vs. negative) E.g. Movie reviews ( see: www.rottentomatoes.com) Sentiment analysis can be considered as a document classification problem, with target classes focusing on the authors sentiments, rather than topic-based categories Standard machine learning classification techniques can be applied RANLP 2005 127 Subjectivity Extraction RANLP 2005 128 Subjectivity Detection/Extraction Detecting the subjective sentences in a text may be useful in filtering out the objective sentences creating a subjective extract Subjective extracts facilitate the polarity analysis of the text (increased accuracy at reduced input size)
Subjectivity detection can use local and contextual features: Local: relies on individual sentence classifications using standard machine learning techniques (SVM, Nave Bayes, etc) trained on an annotated data set Contextual: uses context information, such as e.g. sentences occurring near each other tend to share the same subjectivity status (coherence) (Pang and Lee, 2004) RANLP 2005 129 Cut-based Subjectivity Classification Standard classification techniques usually consider only individual features (classify one sentence at a time). Cut-based classification takes into account both individual and contextual (structural) features Suppose we have n items x1,,xn to divide in two classes: C1 and C2 . Individual scores: indj(xi) - non-negative estimates of each xi being in Cj based on the features of xi alone
Association scores: assoc(xi,xk) - non-negative estimates of how important it is that xi and xk be in the same class RANLP 2005 130 Cut-based Classification Maximize each items assignment score (individual score for the class it is assigned to, minus its individual score for the other class), while penalize the assignment of different classes to highly associated items Formulated as an optimization problem: assign the xi items to classes C1 and C2 so as to minimize the partition cost: ind x ind x assoc x , x 2 xC1 RANLP 2005 1 xC 2 i k xi C1 xk C 2
131 Cut-based Algorithm There are 2n possible binary partitions of the n elements, we need an efficient algorithm to solve the optimization problem Build an undirected graph G with vertices {v1, vn,s,t} and edges: (s,vi) with weights ind1(xi) (vi,t) with weights ind2(xi) (vi,vk) with weights assoc(xi,xk) RANLP 2005 132 Cut-based Algorithm (cont.) Cut: a partition of the vertices in two sets: S {s} S ' and T {t} T ' where s S ' , t T ' The cost is the sum of the weights of all edges crossing from S to T A minimum cut is a cut with the minimal cost A minimum cut can be found using maximum-flow algorithms, with polynomial asymptotic running times Use the min-cut / max-flow algorithm RANLP 2005
133 Cut-based Algorithm (cont.) Notice that without the structural information we would be undecided about the assignment of node M RANLP 2005 134 Subjectivity Extraction Assign every individual sentence a subjectivity score e.g. the probability of a sentence being subjective, as assigned by a Nave Bayes classifier, etc Assign every sentence pair a proximity or similarity score e.g. physical proximity = the inverse of the number of sentences between the two entities Use the min-cut algorithm to classify the sentences into objective/subjective RANLP 2005 135 Subjectivity Extraction with Min-Cut RANLP 2005 136
Results 2000 movie reviews (1000 positive / 1000 negative) The use of subjective extracts improves or maintains the accuracy of the polarity analysis while reducing the input data size RANLP 2005 137 References Galley, M. and McKeown, K.: Improving Word Sense Disambiguation in Lexical Chaining. IJCAI 2003. Hirst, G. and St-Onge, D. Lexical chains as representations of context in the detection and correction of malaproprisms. WordNet: An electronic lexical database, MIT Press, 1998 Halliday, M. and Hasan, R. Cohesion in English. Longman, 1976. Lesk, M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. SIGDOC 1986. Mihalcea, R. and Moldovan, D. An iterative approach to word sense disambiguation. FLAIRS 2000. Mihalcea, R. and Tarau, P and Figa, E. PageRank on Semantic Networks with Application to Word Sense Disambiguation, COLING 2004
RANLP 2005 138 References Mihalcea, R. and Tarau, P. TextRank: Bringing Order Into Texts, EMNLP 2004 Han, H and Zha, H. and Giles, C.L. Name disambiguation in author citations using a K-way spectral clustering method, ACM/IEEE IJCDL 2005 Radev, D., Weakly-Supervised Graph-based Methods for Classification, U. Michigan Technical Report CSE-TR-500-04, 2004 Jannink, J. and Wiederhold, G., Thesaurus Entry Extraction from an On-line Dictionary, Fusion 1999, Sunnyvale CA, 1999. Pang, B. and Lee, L., A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, ACL 2004. Erkan, G. and Radev, D.. "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization", Journal of Artificial Intelligence Research , 2004 RANLP 2005 139 Conclusions Graphs: encode information about nodes and relations between nodes Graph-based algorithms can lead to state-of-theart results in several IR and NLP problems
Graph-based NLP/IR: On the verge between graph-theory and NLP / IR Many opportunities for novel approaches to traditional problems RANLP 2005 140
IMPROVING STUDENTS OUTCOMES BY INCORPORATING GROUP PROJECTS IN
Historically this course has been a challenge for both, the students and the faculty. To improve student's engagement, interest and success rate a group project was incorporated into the course. The project is due at the end of the semester...