Senior Fitness - Exercise and Nutrition for Aging Men and Women
FREE Article Feed for your website.
Bio-Medical Research Article Database
Informative Articles on Life, Love and Happiness
Tutorials on Business to Writing
Famous Quotes from Famous People
Song Lyric Information
New US Patent Information
Comprehensive List of Content by Category
Online Auctions and Shopping Related Articles
Article Search
Most Recent Articles

Method and apparatus for generating summary information for hierarchically related information Number:7,031,970 from the United States Patent and Trademark Office (PTO) owispatent

Home    Author Login    Submit Article    Article Search    Add Your Link    Edit Your Link    Contact Us    Advertising    Disclaimer

   

Google
 

Top Breaking News
     Anger Erupts in Athens as Bailout Demands Spark Outrage by VOA News
     Zuma’s Plan for South Africa Wins Support by Delia Robertson
     Obama, Chinese Vice President to Meet at White House by Dan Robinson

Title: Method and apparatus for generating summary information for hierarchically related information

Abstract: A method is provided for digesting the content of hierarchically related information. The method chooses a set of extracted sentences representing a proportion of the text associated with a subtopic, by a combination of features resting on inherent properties of the sentences, and on the content of a developing summary.

Patent Number: 7,031,970 Issued on 04/18/2006 to Blitzer


Inventors: Blitzer; John C. (Fort Myers, FL)
Assignee: Palo Alto Research Center Incorporated (Palo Alto, CA)
Appl. No.: 321203
Filed: December 16, 2002


Current U.S. Class: 707/100 ; 707/101; 707/102
Current International Class: G06F 17/30 (20060101)
Field of Search: 707/1-104.1 715/532 704/251


References Cited [Referenced By]

U.S. Patent Documents
5619709 April 1997 Caid et al.
5799276 August 1998 Komissarchik et al.

Other References

K Tajima et al. "Cut as a Querying Unit for WWW, Netnews, and E-mail", Proc. 9th Annual International ACM Conf. on Hypertext, pp. 235-244, 1998. cited by other .
T. Hofmann. "Probabilistic latent semantic indexing", In 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, USA, 1999. cited by other .
T. Brants et al. "Topic-Based Document Segmentation with Probabilistic Latent Semantic Analysis", Proc. 11th International Conference on Information and Knowledge Management, McLean, VA, USA, Nov. 2002. cited by other .
H. Ozaku et al., "Topic Search for Intelligent Network News Reader HISHO" Proc. Of the 2000 ACM Symposium on Applied Computing, Como, Italy, 2000, pp. 28-33. cited by other .
M. White et al., "Selecting Sentences for Multidocument Summaries Using Randomized Local Search", Proc. of the ACL 2002 Summarization Workshop, Philadelphia, PA, USA, 2002. cited by other .
M. Sifer et al., "Zooming in one Dimension can be better than two: an Interface for Placing Search Results in Context with a Restricted Sitemap", in Proceedings of the IEEE Symposium on Visual Languages (VL'99) , Tokyo, Japan, Sep. 1999. cited by other .
B. Boguraev et al., "Discourse segmentation in aid of document summarization". In Proceedings of Hawaii International Conference on System Sciences (HICSS-33) , Minitrack on Digital Documents Understanding, Maui, Hawaii, 2000. cited by other .
D. Radev et al., Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation and user studies. In ANLP/NAACL 2000 Workshop, pp. 21-29, Apr. 2000. cited by other .
B.W. Kernighan et al., "An Efficient Heuristic Procedure for Partitioning Graphs", Bell System Technical Journal, vol. 49, No. 2, pp. 291-307, 1970. cited by other .
M. Sifer et al., "Structured Graphs: A Visual Formalism for Scalable Graph Based Case Tools", Australian Computer Journal, vol. 28, No. 1 Feb. 1996, pp. 13-26. cited by other.

Primary Examiner: Rimell; Sam
Assistant Examiner: Abel-Jalil; Neveen
Attorney, Agent or Firm: McBain; Nola Mae

Claims



The invention claimed is:

1. A method of forming a summary for hierarchically related information, where the information can be represented as a set of nodes wherein each node is associated with a portion of the information and that portion of the information contains at least one sentence, and wherein the nodes are connected by directed edges such that each node has at most one incoming edge, a parent node is the source of an incoming edge, and a child node is the target of an outgoing edge, the method comprising: a) determining a sentence vector for each sentence associated with each node, b) determining a centroid vector of the sentence vectors, c) determining an intrinsic score for each sentence using the centroid vector, d) selecting the sentence with the highest intrinsic score as a summary sentence to form a summary, e) determining an extract score for each of the remaining sentences using the intrinsic score for the remaining sentence and at least one summary sentence included in the summary, f) selecting the sentence having the highest extract score as a summary sentence and adding the summary sentence to the summary, and g) repeating steps e) and f) until a desired number of sentences from the remaining sentences are selected as summary sentences for inclusion in the summary.

2. The method of claim 1 wherein the centroid vector of the sentence vectors comprises a vector where each position, p, in the vector contains the average of all values in position p in all of the sentence vectors.

3. The method of claim 1 wherein determining an intrinsic score for each sentence comprises: a) determining a sentence position of the sentence in the information, a hierarchy position of the node associated with the sentence in the hierarchy, and using the centroid vector, determining a lexical centrality of the sentence to the information, and b) determining the intrinsic score for the sentence using the sentence position, the hierarchy position and the lexical centrality determined for the sentence.

4. The method of claim 1 wherein determining an extract score far each of the remaining sentences comprises: a) selecting one of the remaining sentences, b) determining the number of non-quoting sentences in the summary which are adjacent to the selected remaining sentence, the number of sentences in the summary whose associated nodes are either a parent node or a child node of the node associated with the selected remaining sentence, the number of sentences in the summary which are quoted immediately before the selected remaining sentence, and the number of sentences in the summary that appear immediately after a quote of the selected remaining sentence, and c) responsive to the determinations in step b and the intrinsic score of the selected remaining sentence, determining an extract score, and d) repeating steps a c until each of the remaining sentences has an extract score.

5. The method of claim 1 wherein the desired number of sentences selected as summary sentences for inclusion in the summary equals a percentage of a total number of sentences associated with all of the nodes representing the hierarchically-related information.

6. The method of claim 1 wherein the desired number of sentences selected as summary sentences for inclusion in the summary equals a percentage of a total number of the nodes representing the hierarchically-related information.

7. The method of claim 1 further including formatting the summary for display and displaying the summary using a display processor.

8. A method of forming a summary for hierarchically related information represented as a set of nodes wherein each node is associated with a portion of the information containing at least one sentence, and wherein the nodes are connected by directed edges such that each node has at most one incoming edge, a parent node is the source of an incoming edge, and a child node is the target of an outgoing edge, the method comprising: selecting a sentence from all of the sentences associated with all of the nodes as a first summary sentence to form a summary; the first summary sentence being selected by computing an intrinsic score indicating a lexical centrality of the sentence to the information represented by the set of nodes; the first summary sentence having the highest intrinsic score of all of the sentences associated with all of the nodes; and for each additional summary sentence to be added to the summary until a desired summary length is reached, selecting a remaining sentence from the plurality of remaining sentences associated with all of the nodes; the selected remaining sentence having the highest extract score of all extract scores computed for the remaining sentences; computing the extract score including determining the number of non-quoting sentences in the summary which are adjacent to the selected remaining sentence; determining the number of sentences in the summary whose associated nodes are either a parent node or a child node of the node associated with the selected remaining sentence; determining the number of sentences in the summary which are quoted immediately before the selected remaining sentence; and determining the number of sentences in the summary that appear immediately after a quote of the selected remaining sentence.

9. The method of claim 8 wherein computing the extract score further includes combining the extract score with the intrinsic score computed for the selected remaining sentence.

10. The method of claim 8 wherein computing the extract score further includes computing a sentence similarity quantity produced by adding a similarity quantity indicating a similarity of the selected remaining sentence to each summary sentence included in the summary the sentence similarity quantity being subtracted from the extract score and indicating that the selected remaining sentence includes information redundant to the summary.

11. The method of claim 8 further including formatting the summary for display and displaying the summary using a display processor.
Description



CROSS REFERENCE TO RELATED APPLICATIONS

This patent application is related to:

U.S. patent application Ser. No. 10/321,416, titled "A Method and Apparatus for Normalizing Quoting Styles in Electronic Mail", by Newman, filed concurrently herewith on Dec. 16, 2002,

U.S. patent application Ser. No. 10/321,097, titled "A Method and Apparatus for Clustering Hierarchically Related Information", by Newman et al. filed concurrently herewith on Dec. 16, 2002,

U.S. patent application Ser. No. 10/321,420, titled "A Method and Apparatus for Generating Overview Information for Hierarchically Related Information", by Newman et al. filed concurrently herewith on Dec. 16, 2002,

U.S. Patent Application Ser. No. 10/320,911, titled "Method and Apparatus for Displaying Hierarchical Information", by Newman filed concurrently herewith on Dec. 16, 2002, and

U.S. patent application Ser. No. 10/320,930, titled "Method and Apparatus for Segmenting Hierarchical Information for Display Purposes", by Newman filed concurrently herewith on Dec. 16, 2002.

INCORPORATION BY REFERENCE

The following patents and/or patent applications are herein incorporated by reference: U.S. patent application Ser. No. 09/732,024, titled "Method and System for Presenting Email Threads as Semi-connected Text by Removing Redundant Material", by Paula Newman and Michele Baldonado, filed Dec. 8, 2000. U.S. patent application Ser. No. 09/732,029, titled "Method and System for Display of Electronic Mail, by Paula Newman, filed Dec. 8, 2000. U.S. patent application Ser. No. 09/954,388, titled "Method and Apparatus for the Construction and use of Table-like visualizations of Hierarchic Material, by Paula Newman and Stuart Card, filed Sep. 10, 2001. U.S. patent application Ser. No. 09/954,530, titled "Method and Apparatus for the Viewing and Exploration of the Content of Hierarchical Information, by Paula Newman and Smart Card, filed Sep. 10, 2001, issued as U.S. Pat. No. 6,944,818 on Sep. 13, 2005. U.S. patent application Ser. No. 09/717,278, titled "Systems and Methods for Performing Sender-Independent Managing of Electronic Messages, by Michelle Baldonado, Paula Newman, and William Janssen, filed Nov. 22, 2000 U.S. patent application Ser. No. 09/732,028 titled "Method and System for presenting semi-linear hierarchy displays" by Paula Newman, filed Dec. 8, 2000, issued as U.S. Pat. No. 6,683,632 on Jan. 27, 2004. U.S. patent application Ser. No. 09/747,634, titled "System and Method for Browsing Node-Link Structures Based on Estimated Degree of Interest", filed on Dec. 21, 2000 by Stuart Card, issued as U.S. Pat. No. 6,944,830 on Sep. 13, 2005. U.S. patent application Ser. No. 10/103,053, titled "Systems and Methods for Determining the Topic Structure of a Portion of a Text" by loannis Tsochantaridis, Thorsten Brants, and Francine Chen, filed Mar. 2, 2002 U.S. patent application Ser. No. 10/164,587, titled "Authoring Tools, Including Content-Driven Treetables for Fluid Text" by Polle Zellweger, Paula Newman, and Maribeth Back (D/A2017)

BACKGROUND

The present invention relates generally to the field of information analysis and display. More specifically, it provides methods for partitioning tree-structured textual material into topically related clusters of adjacent items, then developing digests of each cluster. The digests include both shorter overviews and arbitrarily long summaries. The tree-structured material involved could be for example, but is not limited to, trees containing the messages and postings of an archived discussion within a newsgroup, discussion list, or on-line forum. This invention also provides methods for partitioning a two-dimensional tree visualization, called a treetable, into conveniently sized segments for detailed exploration. The segments may be grouped into regions corresponding to the topically related clusters.

To establish some terminology, a "tree" or "tree structure" is a standard term denoting an abstract data structure that models information as a set of nodes connected by directed edges such that: (a) there is exactly one element having no incoming edges, called the "root"; and (b) all other nodes have exactly one incoming edge. A leaf node is a node with no outgoing edges. All nodes besides the root node and the leaf nodes can be called "interior nodes". The "parent" of a node is the source of its incoming edge, and the "children" of a node are the targets of its outgoing edges. A "subtree" of a tree is a set of nodes consisting of a "subtree root" node that has no parent in the subtree, and other nodes all having parents in the subtree.

The present invention is intended for use in connection with tree structures whose interior nodes represent substantial amounts of logically related textual information. For example, in the tree-structures formed by archived discussions, the nodes represent individual messages or contributions, and a message represented by a child node is a response to the message represented by its parent. The creation of the parent-child links in archived discussions can be established by a combination of conventional means utilizing header information, and deeper means, as described in U.S. patent application Ser. No. 09/732,024, titled "Method and System for Presenting Email Threads as Semi-connected Text by Removing Redundant Material", by Paula Newman and Michelle Baldonado, incorporated by reference hereinabove.

Tree-structured archived discussions on a particular subject are usually represented for exploration by indented lists. Each contribution is represented by some identification information, such as contributor name and date, indented under the identification information for its parent. The individual contributions may then be accessed for reading by selecting one of the list items. However, archived discussions pay varying amounts of attention to the ostensible subject and initial contribution, and often branch into several subtopics, so the reader cannot assume, based only on the ostensible subject, whether any portion of the discussion is actually of interest, and, if so, what parts of the discussion. A more informative representation of the overall content of an archived discussion is described in U.S. patent application Ser. No. 09/732,029, titled "Method and System for Display of Electronic Mail, by Paula Newman, incorporated by reference hereinabove. In that representation, initial substantive fragments of each contribution, containing actual text of the message rather than quotes or quote introduction, are embedded within a reduced-width linear tree tailored to text embedding. This representation is suitable as a level of presentation of the discussion, and also as the content of an emailed digest summarizing activity in the discussion list, or as a client side digest of such activity. Client-side accumulation of email from discussion lists, involving concatenating, or sampling, messages from all mail received in a particular period, is introduced in U.S. patent application Ser. No. 09/717,278, titled "Systems and Methods for Performing Sender-Independent Managing of Electronic Messages, by Michelle Baldonado, Paula Newman, and William Janssen, incorporated by reference hereinabove.

Yet another method of representing the overall content of an archived discussion is described in U.S. patent application Ser. No. 09/954,388, titled "Method and Apparatus for the Construction and use of Table-like visualizations of Hierarchic Material, by Paula Newman and Stuart Card, and U.S. patent application Ser. No. 09/954,530, titled "Method and Apparatus for the Viewing and Exploration of the Content of Hierarchical Information, by Paula Newman and Stuart Card, both incorporated by reference hereinabove. In this method, the conversation tree is presented in a two dimensional tabular form called a "treetable". In such a treetable, each cell represents a single node and exactly spans the cells representing its children if any, and a substantive initial fragment of the message associated with the node is displayed in the cell, to the extent that space allows. The individual columns and subtrees of the treetable may be selected for expansion (reducing other parts of the tree), to view more of the associated texts, and the full texts of each column may be selected for display in auxiliary windows or frames. (Note that a similar representation is described in an article entitled "Structured Graphs: a visualization for scalable graph-based case tools" by M. Sifer and J. Potter, in the Australian Computer Journal, Volume 28 Number 1, and also in later papers authored by M. Sifer and other colleagues, but in these references the potential of the structure is not exploited for purposes of exploring trees whose nodes have associated significant text.)

While the latter two methods (reduced-width linear trees and treetables) with embedded initial fragments are useful methods of providing overviews for smaller discussions and other tree-structured textual material, they are less useful for larger discussions. For example, for a stored conversation consisting of 93 messages, a reduced-width linear tree containing initial fragments requires over 11 standard-size display windows. Alternatively, if such a conversation is represented in a treetable that can be contained in a single window, the cells are too small to contain any indicative content, and there are too many columns to expand individually to determine if there is content of interest.

Therefore, more accessible digests of such larger discussions are needed. Current approaches to text processing address some related problems. Methods have been developed for segmenting individual documents into extents dealing with different approximate subtopics, and for identifying the topics covered by the most indicative words, as described in U.S. patent application Ser. No. 10/103,053, titled "Systems and Methods for Determining the Topic Structure of Portion of a Text" by loannis Tsochantaridis, Thorsten Brants, and Francine Chen, incorporated by reference hereinabove. Methods have also be developed for summarizing identified topic extents by collections of extracted sentences, and for associating summary elements with the text extents covered, as described, for example, in a paper by Branimir Boguraev and Mary Neff entitled "Discourse Segmentation in Aid of Document Summarization", in the Proceedings of the Hawaii International Conference on System Sciences (2000). Methods have also been developed for summarizing collections of separate documents by grouping them by topic, generally using centroid-based clustering methods, and then extracting sentences dealing with each topic. An example of such an approach is described by Dragomir Radev, Hongyan Jing, and Malgorzata Budzikowska in the paper "Centroid-based summarization of multiple documents: sentence extraction" in the Proceedings of the ANLP/NAACL 2000 Workshop on Automatic Summarization (Seattle, Wash., April 2000) pages 21 29. However, tree-structured discussions are neither single documents nor collections of independent documents, and specialized methods are needed for their segmentation and summarization.

Two limited approaches seem to have been developed, to date, relating to segmenting tree-structured discussions, but none, as far as can be ascertained at this time, to summarizing those discussions. A paper by K. Tajima, Y. Mizuuchi, M. Kitagawa, and K. Tanaka entitled "Cut as a querying unit for WWW, Netnews, and E-mail", in the Proceedings of the 9.sup.th ACM Conference on Hypertext and Hypermedia (1998) descries a method for identifying overlapping subtrees of a discussion as units of information retrieval, to put retrieved messages into a useful context. The clustering method processes the thread tree bottom-up and, at each step, combines a parent with currently open child subtrees, separately or together, if the similarity between the parent word vector and the centroid vector of the child subtree or subtrees exceeds an (unspecified) absolute input threshold. The word vectors used to represent the vectors handle quoted passages by reducing the weights of quoted words, in order to keep inter-message distances from being too small. While no results are given, if the threshold is set relatively high, this method would probably lead to shallow subtrees, suitable as query results. However, it is unlikely that the method would lead to clustering results suitable for subtopic identification or digesting. Based on our experiments, quoted words require more detailed treatment, and some trials of a similar single-link clustering method using distances between a node and the centroid of an adjacent cluster produced unsatisfactory results.

Another approach related to discussion tree segmentation is described in a paper by H. Ozaku, K. Uchimoto, M. Murata, and H. Isahara entitled "Topic Search for Intelligent Network News Reader HISHO", in the Proceedings of the 2000 ACM Symposium on Applied Computing. This paper describes a method for retrieving many discussions relating to a query topic, and then attempting to filter out discussion subtrees irrelevant to the topic. The method uses, for the most part, noun keywords to represent messages, and tries to find "topic changing articles" where the proportion of never-seen-keywords shifts, and "topic branching articles" where a message gives rise to several responses distinguished by their keyword usage and their referenced quotes. This strategy is reported as of limited success in finding topic-changing articles (recall=57%) and larger success in finding topic branching articles.

The present invention incorporates methods of dividing a tree-structured discussion into major subtopics, and of developing digests containing segments for each such subtopic. Two types of digests are developed, that may be inspected in sequence. Shorter digests, which we will call "overviews", choose a set of texts in each subtopic based on topic-relevance and potential for providing coherent sequences, and represent each such text by one or more extracted sentences. Potentially longer digests, which we will call "summaries", choose a set of extracted sentences representing a proportion of the text associated with a subtopic, by a combination of features resting on inherent properties of the sentences, and on the content of a developing summary.

The present invention also provides methods for pre-segmenting a large tree or treetable for purposes of visualization and deeper exploration of individual nodes, with the segments sized so as to allow inclusion of at least some amount of content-indicative text for each node. There have been many approaches developed to allow investigation of detailed areas of large visualizations, usually distinguished as either "fisheye" approaches, that expand part of visualization at the expense of other parts, or "focus plus context" approaches that extract and expand part of visualization into another window. Some examples of these approaches as applied to trees and treetables are: (a) in-situ expansions of nodes in the neighborhood of a selected node within a "Degree of Interest Tree", described in U.S. U.S. patent application Ser. No. 09/747,634, titled "System and Method for Browsing Node-Link Structures Based on Estimated Degree of Interest, by Stuart Card, incorporated by reference hereinabove, (b) in-situ expansion of treetable columns and complete subtrees (all nodes descended from a given node, and extraction of sets of columns and complete subtrees into another window, as described in U.S. patent application Ser. No. 09/954,388, titled "Method and Apparatus for the Construction and use of Table-like visualizations of Hierarchic Material, by Paula Newman and Stuart Card, incorporated by reference hereinabove, and (c) iterative restriction of the display to subtrees or user-defined sets of nodes, is provided in a treetable-like visualization described in the papers "The SGF metadata framework and its support for social awareness on the World Wide Web", by O. Liechti et. al. in World Wide Web (Baltzer), 1999. 2(4., and "M. Sifer and O. Liechti, "Zooming in One Dimension Can Be Better Than Two: An Interface for Placing Search Results in Context with a Restricted Sitemap" by M. Sifer and O. Liechti, In Proceedings of the 1999 IEEE Symposium on Visual Languages (Tokyo, Japan) 72 79.

These methods all have some problems when used in connection with large trees. In-situ expansion of neighborhoods, columns, or subtrees can be disorienting when the expanded nodes are to contain significant amounts of text, because the shape of the tree changes dramatically, and little space is left for unexpanded nodes. Also, for large trees, subtree extraction (or restriction of the display to a subtree) tends to be an iterative process. This may be suitable when the tree represents a generalization hierarchy, so that higher level nodes provide good cues as to the content of lower level ones, but not otherwise, for example when the trees represent discussions, or the network of linked nodes on a website, reduced to a tree by removing cyclic paths. Finally, leaving the specification of sets of nodes to be extracted to users is problematic both because it is laborious, and because successive extractions are still generally needed to make sufficient node-identification information visible to permit an intelligent selection.

For this reason, the methods provided in this invention pre-partition a tree or treetable into segments of related nodes whose approximate maximum dimension permits significant text to be presented for each node. The segments can be visually differentiated in an outline depiction of the tree or treetable as a whole, and individual segments then extracted for deeper exploration. The segments may also be constrained to represent only nodes within the same logical grouping, which may be an identified subtopic, or collection of less-focused material, or other type of grouping. When the segments are so constrained, regions of adjacent segments associated with each such grouping can also be visually differentiated from other such regions.

The method for partitioning is related to, but different from, the large body of work on partitioning graphs (collections of nodes linked by edges, but not necessarily hierarchic) into a roughly equal-size subgraphs, given either the number of subgraphs to be found or a maximum size per subgraph, and possibly some additional constraints on the subgraphs, with the purpose being to minimize the number of edges between subgraphs. Much of the work derives from an algorithm described by B. W. Kernighan and S. Lin in the paper "An efficient heuristic procedure for partitioning graphs", in The Bell System Technical Journal, 49(2) 1970, in which a greedy algorithm obtains an initial partitioning, and then nodes are iteratively moved among subgraphs to improve the quality of the partition. Such methods have applications in VLSI design, distribution of processes and data among processors, and sparse matrix representations. The problem addressed by the methods of the present invention is different, in that the permissible subgraphs are far more constrained; subgraphs must represent either subtrees or sets of subtrees whose roots have a common parent (and sometimes are part of the same logical grouping), and must largely respect a given layout dimensionality. This, in turn, permits the careful initial partitioning algorithm described in this invention to be sufficient to the purpose.

Further advantages of the invention will become apparent as the following description proceeds.

SUMMARY OF THE INVENTION

Briefly stated and in accordance with the present invention, there is provided a method for partitioning a tree-structured discussion or other tree structured collections of texts into clusters dealing with identifiable subtopics, if such subtopics exist, or into manageable partitions if not. The partitioning method uses a variant of a standard method of agglomerative, single link ("nearest neighbor") clustering described by G. Salton in Automatic Text Processing, Addison Wesley (1989). In that method, each document is represented by a vector containing one position for each unique word to be taken into account in the combined texts, and is initially placed in a cluster containing only that document. Then a sequence of cluster combinations is performed, at each step combining the most similar two clusters, where the most similar two clusters are the clusters related by the most similar pair of document vectors, into a new cluster. The process can be halted before all clusters are combined based on application-specific criteria. In the partitioning method of this invention, the vector positions for a particular node are generally also filled in well known ways, namely, by "stemming" the words of the associated text, to lemma (root word) forms, and ignoring words occurring in a "stop list" of very common words, and representing each word by a tf.idf weighted frequency (a weighting of a term frequency tf by the "inverse document frequency" idf, which is basically the quotient of the number of documents in a collection divided by the number of documents in which the word appears, thus giving preference to words occurring less frequently in the collection, and has many variations, also described by Salton in Automatic Text Processing. Addison Wesley (1989).

The word vectors can also optionally be converted into vectors reflecting some semantic relationships, by methods such as Probabilistic Latent Semantic analysis (PLSA), as described by Thomas Hofmann in the paper titled "Probabilistic Latent Semantic Indexing", in Proceedings of SIGIR '99 (Berkeley, Calif., August 1999) 50 57.

However, the clustering method is unique to grouping tree-structured texts, especially message texts, both in processing that occurs before constructing word vectors, and within the clustering process proper. In this invention, before constructing the word vector for a node, a) First, the node-associated message is analyzed into its essential text, plus inessential material such as entire prefixed or suffixed prior messages and their introductory material, and contact information, using methods described in U.S. patent application Ser. No. 09/732,024, titled "Method and System for Presenting Email Threads as Semi-connected Text by Removing Redundant Material", by Paula Newman and Michelle Baldonado, incorporated by reference hereinabove. b) Then, the essential text of the node-associated message is adjusted to avoid vector distance distortions based on differences in quoting styles. Initially, selective quotes (that is, quotes that represent only part of the message associated with the parent) are included, as these are reasonably considered to form a logical part of the message, and entire prefixed or suffixed messages are omitted as playing an indeterminate role. c) Finally, if the message does not contain selective quotes, an analysis is done to determine whether all or part of the parent message constitutes a logical, albeit implicit, references. If so, all or some parts of the essential text of the parent are included in the adjusted message. In the current embodiment, the analysis determines that the message contains an implicit reference to the parent message if both messages are sufficiently short that it is likely that they both discuss a single issue, and, since the message is a response to the parent message, that it is likely to be the same issue. The word vector for the node is then constructed from the adjusted message.

It should be noted that this type of message adjustment is also suitable for use in representing message content for other applications, for example, in finding messages dealing with a particular topic, or finding inter-conversation topic groupings via centroid-based clustering methods.

The other divergences from standard clustering practice are within the clustering process proper. First, the inter-node relationships considered in the single-link clustering are restricted to parent-child and sibling relationships within the tree, so that a cluster consists either of a subtree, or of several subtrees whose roots have a common parent external to the cluster. Second, the final clusters used for summary purposes are isolated in the following tree-specific ways: a) The root of the tree may be held out from the clustering process to avoid relating otherwise unrelated children. After clustering is completed, the root is combined with the cluster containing the most lexically similar children. b) Cluster similarity may take into account additional factors besides linking by the most similar pair of document vectors c) Clustering for purposes of identifying groupings to be represented in the digests halts in one of the following circumstances, whichever occurs first: i. When a cluster is about to be formed whose size exceeds a particular function of the total number of nodes. The combined cluster may or may not then be formed depending on a possibly larger maximum size that is also a function of the total number of nodes. This halting point is adopted based on observing conversation structures within some discussion lists; there is usually one topic that occupies a considerable part of the discussion, whose size tends to be proportionate to the size of the discussion as a whole, and there are also, frequently, one or more digressions of some importance that are smaller in size. ii. When the distance between the two nodes representing the clustering link is large relative to the maximum of the parent-child and sibling distances (e.g., 75% of the maximum distance).

After clustering for digesting purposes is halted, clusters are selected that probably represent subtopics and are closed to further combination. The selection may be based on a function of the number of nodes in the cluster and in the tree as a whole. Not all such clusters may actually be represented in a displayed digest, which may further select among the subtopic-associated clusters based on absolute and tree-relative cluster size and other factors such as average text lengths. Clustering then continues to obtain secondary clusters that form less-closely related "collections", which can also be reflected as regions within a partitioned treetable. The results of such a clustering, as reflected in a segmented treetable, are shown in FIG. 4, discussed hereinbelow.

Two methods are also provided for digesting the content of individual clusters. One method, which obtains relatively short overviews, selects a proportion of representative nodes from the cluster, and then extracts and organizes one or more sentences from the text associated with each selected node. For text trees representing archived discussions, the basic intuition underlying the selection of nodes and sentences is that comment/response sequences drawn from lexically central nodes will capture those aspects of the discussion considered most important to discussion participants.

The nodes are selected based on their centrality to all or part of the cluster, and also for their potential for providing some degree of fluency in reading. They comprise central nodes and auxiliary nodes. Central nodes comprise a) One or more nodes whose word vectors are closest to the lexical centroid of the entire cluster (the average of the word vectors for all nodes in the cluster), and b) One or more nodes each of which has an associated word vector which is central to a subcluster larger than a given minimum size found during the cluster process and later combined with another cluster. We note that central nodes derived their lexical centrality both from a their inherent content and from their incorporated quotes. The total number of nodes used as central nodes is determined by an input parameter.

Auxiliary nodes are generally non-central nodes that are parents of central nodes or parents having more than a given number of children. It is from these nodes that passages quoted in central nodes, or in many nodes, will be drawn; thus auxiliary nodes both fulfill the coverage potential of their more central children, as well as supplying continuity. For some clusters, specifically those containing or descending from the root of the entire conversation, auxiliary nodes may also include the root text and a proportion of its children. This allows the overview to be more informative with respect to what is generally the most important topic of the discussion.

The total number of nodes represented in the overview is determined by the number of central nodes selected, and the number of auxiliary nodes added for any of the reasons given above.

After nodes are selected to represent a cluster, sentences are extracted from their associated texts for use in the overview so as to provide coherent comment/response sequences. The sentences extracted depend on the whether the node is a central node or an auxiliary node. For all selected nodes, if the associated text been has been quoted, sentences are extracted that correspond to the most germane quote (or quotes, depending on an input parameter) in each of the texts associated with each of its child nodes. In the current embodiment, a germane quote is a quote preceding a relatively long non-quoted sequence. Also, a) For an auxiliary node, if it has not been quoted, the final sentence is extracted. b) For a central node, an additional sentence or sentences are extracted. If the node-associated text contains selective quotes, the individual sentences following its most germane quote or quotes (as discussed above) are extracted. However, if the node-associated text does not contain selective quotes, an initial sentence or a sentence that is lexically central to the node-associated text is extracted.

If an extracted sentence is very short it is extended with additional text material, using material to the left if it is a quoted sentence, and to the right if not.

After sentences are extracted to represent a node they are sorted based on their position in the text associated with a node, duplicates are eliminated, and the remaining sentences are concatenated into a string representing the node in the overview, with intervening gap indicators such as ellipsis symbols " . . . " to represent sentences in the text not included in the overview. These strings are then arranged in the tree order of the cluster to form the cluster overview, using either a reduced width tree, such as shown in FIGS. 1 and 2, representing a completed overview for a discussion, or an indented tree, such as shown in FIG. 3, representing the part of the summary (see below) associated with group 3 of the same discussion. In both cases, the cluster overview may be prefaced with a list of the most frequent words occurring in the cluster, that serve to further illustrate the concerns of the cluster. The word lists are obtained by standard methods, e.g., by adding the idf-weighted word-vectors obtained for each text within the cluster and then selecting the words having the largest value in the resultant sum vector.

Another cluster digesting method is also provided that can obtain longer, more readable, and more informative summaries of clusters whose contained texts are well focused and longer. This method selects a percentage of the sentences associated with nodes in the cluster, in the current embodiment 10%, or, or, to accommodate to clusters containing many short texts, a percentage of the nodes in the cluster, in the current embodiment 60%, whichever number is larger, to obtain reasonably sized, informative summaries. The sentences are selected from the cluster as a whole using a staged, feature based method that builds on methods well known to practitioners of the art, such as the methods described by M. White, and C. Cardie, C. in the paper "Selecting sentences for multi-document summaries using randomized local search", in the Proceedings of ACL 2002 Summarization Workshop (Philadelphia, Pa., July 2002). However, many of the features used, and the specifics of the staging, are specific to obtaining coherent summaries of tree-structured discussions.

The method initially assigns an intrinsic score to each sentence based on its lexical centrality in the cluster, its ordinal position in its containing text, and the number of nodes in the tree descended from the node associated with that text in the tree, and chooses the sentence with the best intrinsic score as the first sentence to be included in the summary. Then the process iterates until the desired number of summary sentences is found. In each iteration, each not-yet-chosen sentence is given an extract score that modifies its intrinsic score by features concerned with the relationship of the sentence to sentences already in the extract, and the sentence with the highest extract score is added to the summary. The conventional features used to form the extract score measure the adjacency of the sentence being tested to sentences already in the summary (adjacency being desirable), and the redundancy of the sentence with sentences in the summary (redundancy being undesirable). The discussion specific features used in the process test whether the sentence is from a node that is a parent or child of a node already represented in the summary, and whether the sentence is quoted immediately before a sentence in the summary, or immediately follows a quote of a sentence represented in the summary. After the summary sentences are chosen by this method, they are organized in the same way as sentences in the overview, as described above

A method is also provided for dividing a tree of texts into segments for purposes of visualization within a display window or frame, with each segment either a subtree or a set of subtrees whose roots have the same parent external to the segment, and each segment sized such that extracting a visualization of the segment into another window or frame permits significant text to be associated with each node in the segment. The segments may also be constrained to be associated with the same logical grouping of nodes, which may be a cluster obtained by the method described hereinabove, or some other logical grouping, such as the set of nodes associated with a given time period.

The division of the tree into segments may be obtained by a method that operates bottom up within the tree. A desirable maximum width and height is set for segments. Then for each node in the tree, starting with the leaves and not processing a node until all of its children have been processed, the method proceeds as follows: (a) First, if the current node has any children (i.e., if it is not a leaf), and if nodes within a segment are constrained to belong to the same logical grouping, then close any currently open segments headed by children of the current node that do not belong to the same logical grouping as the current node, while combining segments in the same logical grouping if the result does not exceed the maximum segment width. In combining segments, try to obtain segments that are roughly equal in width.

(b) Then, if the current node has any remaining children heading open segments, compute the current width of a prospective new segment that would result from combining all of the remaining children heading open segments into one segment. This is the width that would be required to allocate a substantially sized space to each node of the portion of the tree or treetable represented by the segment (where it is assumed if a classic "node+edge" tree visualization were used, subtrees would be vertically separated) plus, if desired, some space to represent closed segments containing children of the current node as single cells (to provide more context to the visualization). The current width is computed by adding the widths of the currently open segments headed by its children, plus, if desired, proportion of the count of the closed segments containing its children. If the current width, so computed, exceeds the maximum desired segment width, try to combine and close enough open segments headed by its children to reduce the current width so that it is as large as possible without exceeding the maximum segment width. Note that if counting a proportion of closed segments containing its children itself exceeds the maximum segment width, then exceeding the maximum segment width cannot be avoided. In combining open child segments, try to obtain segments that are roughly equal in width. (c) Finally, create a new open segment by combining the current node with the currently open segments headed by its children, if any. If the current node has no children, the width of the resulting new segment is 1. If the current node has children, the width of the resulting new segment is the current width, as defined above. If the width of the new open segment reaches or exceeds a maximum desired dimension, close it unless doing so would foster some very small segments. Two such cases are (1) where the desired maximum height dimension has been reached and there are very few nodes in the same logical grouping above the current one in the tree, and (2) where the width dimension has been reached, and the current node has siblings in the same logical grouping. In the latter case the decision of whether to close the current segment is left to the processing of the immediate parent node.

Companion methods are also provided to visualize and explore the result of the segmentation. The visualizations place a two dimensional representation of the full tree in either a classic "node plus edge" representation, or a treetable representation, in one window or frame, and a representation of a selected segment in another window or frame. The visualization of the full tree may take one of several forms. It may be an outline treetable in which all cells are represented. In that case, the segments, as well as collections of adjacent segments within the same logical grouping, which we will call regions, can be visually differentiated as shown in FIG. 6, by visually separating regions, and differentiating segments within the same region by heavy outlining and or different cell backgrounds. Alternatively, the visualization of the full tree may be in the form of an abstracted segmented tree or treetable in which entire segments are represented as nodes, visually grouped into regions, and represented as a standard treetable, or a classic "node+edge" tree representation, or as a content-driven treetable as described in U.S. patent application Ser. No. 10/164,587, titled "Authoring Tools Including, Content-Driven Treetables for Fluid Text" by Polle Zellweger, Paula Newman, and Maribeth Back (D/A.sup.2017) incorporated by reference hereinabove.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a the first part of an overview of a 93-message discussion

FIG. 2 is the second part of an overview of a 93-message discussion.

FIG. 3 is the third part of an overview of a 93-message discussion

FIG. 4 is the summary equivalent of the overview of group 3 of FIG. 3.

FIG. 5 illustrates the overall structure of a framed segmented treetable display.

FIG. 6 illustrates a segmented outline treetable

FIG. 7 illustrates a legend popup for the framed segmented treetable display

FIG. 8 illustrates a treetable for a single segment

FIG. 9 is a flow diagram of the major processes associated with this invention.

FIG. 10 is a flow diagram of the clustering process.

FIG. 11 is a flow diagram of the method for normalizing quotes within messages.

FIG. 12 is a flow diagram of the method for clustering nodes

FIG. 13 is a flow diagram of the method for analyzing cluster pairs

FIG. 14 is a flow diagram of the overall process of constructing a digest

FIG. 15 is a flow diagram of the method for forming a cluster overview.

FIG. 16 is a flow diagram of the method for finding the central nodes of a cluster.

FIG. 17 is a flow diagram of the method for analyzing quote interactions within a cluster

FIG. 18 is a flow diagram of the method for forming an overview string for a node.

FIG. 19 illustrates the overview generation process.

FIG. 20 is a flow diagram of the method for formatting a cluster digest

FIG. 21 is a flow diagram of the method for formatting a digest for a subtree.

FIG. 22 is a flow diagram of the method for forming a cluster summary.

FIG. 23 is a flow diagram of the method for formatting a cluster summary

FIG. 24 is a flow diagram of the method for assigning an intrinsic score to a sentence.

FIG. 25 is a flow diagram of the method for finding the best addition to a summary.

FIG. 26 is a flow diagram of the method for computing an extract score for a sentence.

FIG. 27 is a flow diagram of the method for building a segmented treetable display

FIG. 28 is a flow diagram of the method for dividing a tree into segments

FIG. 29 is a flow diagram of the method for building an outline treatable for a tree

FIG. 30 is an illustration of the correspondence between a tree and a basic treetable layout.

FIG. 31 is a flow diagram of the method of building an outline treetable for a cluster, containing nested outline treetables

FIG. 32 is a flow diagram of the method for filling a display cell in an outline treetable

While the present invention will be described in connection with a preferred embodiment and/or method of use, it will be understood that it is not intended to limit the invention to that embodiment and procedure. On the contrary, it is intended to cover all alternatives, modifications and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims.

DETAILED DESCRIPTION OF THE INVENTION

Turning now to the figures, this embodiment of the invention presents a discussion list review tool using overviews, summaries, and segmented treetable visualizations for the exploration of long discussions. FIGS. 1 through 4 illustrate some summary and overview material produced by the methods making up the invention. The texts underlying these figures represent messages of the Usenet rec.motorcycles newsgroup. FIGS. 5 through 9 illustrate aspects of the segmented treetable visualization. The visualizations are obtained by submitting display specifications to a display processor. In this case an Internet Explorer browser was utilized. However, the methods of the invention can be used to create specifications for many kinds of display processors, such as those embedded within graphic user interface (GUI) toolkits associated with many programming languages. FIGS. 9 through 28 illustrate the methods used to develop the overview, summary, and segmented treetable outputs.

It should also be noted that although this particular embodiment is directed to a review tool for a stored discussion, the methods of this invention could be used, in whole or in part, to analyze and present other types of tree structured texts, in particular the content of tree-structured websites, and other tree-structured hypertext documents.

FIGS. 1, 2, and 3 depict portions of an overview obtained by the methods described in this invention, in this case for a tree representing a 93-message discussion whose subject line is "What's a BSA". The full overview occupies, by scrolling, approximately 2 pages. FIG. 1 shows the first part, of the overview, which contains only group overview 130, comprising the part of the overview devoted to the first major subtopic group of the discussion found by the clustering procedure of this invention. The group represented by group overview 130 contains the node associated with the initial message, which inquires as to the meaning of the brand name BSA, the value of the motorcycles, and the availability of parts. The group also contains nodes associated with those messages in the discussion that deal fairly directly with issues raised in the initial message. The overview for this group is relatively long, obtained using an option that gives additional representation in the overviews to the root node of a tree and its children. For stored discussions, this provides more coverage to messages that often contain the most useful information for a casual reader.

Group overview 130 comprises a group header 110, and a collection of node overview strings 120, 121, 122, 123, 124, 125, and 126. Group header 110 identifies the group by: (a) a group number 111, in this example group 0, (b) a keyword list 112, listing frequent words seen in the texts associated with nodes in the group, that augments the extracted sentences with additional group content information. The list is obtained by adding and ranking by tf.idf weighted word counts for the texts, thus giving preference to words occurring less frequently in the discussion as a whole. (c) A group size indication 113, showing the total number of nodes in the group, in this example 31, from which nodes were selected for representation in the overview

The group header 110 is linked to a display of a segmented treetable, an example of which is illustrated in FIG. 5, which shows the tree as a whole in the upper frame, and an expanded view of a group, or the first segment of a group, in the lower frame.

The node overview strings 120, 121, 122, 123, 124, 125 and 126 each represent a different node selected from the group for representation in the overview. It should be noted that overview strings for only a portion of the nodes that might be selected to represent the associated group are shown in FIG. 1. This is for illustrative purposes only and, in practice, more overview strings might be included and viewable, although some scrolling may be necessary. Each node overview string comprises one or more sentences selected from the node-associated text. If the entire text is not included in the node overview string, gap indicators 131, such as ellipsis symbols " . . . ", are used to indicate omitted sentences. In all overviews and summaries associated with this invention, strings representing individual nodes are organized to reflect the tree structure of the subtree or subtrees associated with the group. In this example, the structure is reflected by an adaptation of reduced-width trees, introduced in U.S. Pat. No. 09/732,028 titled "Method and System for presenting semi-linear hierarchy displays", by Paula Newman, incorporated by reference hereinabove. In the adaptation, a string representing a node that has no predecessors in the group is represented unindented, and prefixed by an initial indicator 127, chosen in this example to be a solid bullet. If a selected node n has selected descendants along only one path, the node overview string associated with the nearest such descendant is listed under the node overview string for node n, unindented, and prefixed by a no-sibling indicator 128, chosen in this example to be an outline bullet. All other strings are prefixed by a sibling indicator 129, in this example a solid bullet, and indented under the strings representing their closest selected parent in the group. Thus the nodes associated with node overview strings 121, 122, 123, and 125 all have the initial node as their closest parent represented in the group. The node associated with node overview string 124 is the single selected descendant of the node associated with node overview string 123, and node overview string 126 bears the same relationship to the node associated with node overview string 125.

FIG. 2 shows the second part of the overview for the same discussion as that shown in FIG. 1, and would be visible by scrolling or otherwise navigating from the view shown in FIG. 1. Groups 1 and 2, represented by group overviews 150 and 160, respectively, constitute humorous interludes in the discussion. The overviews for these groups are short, because the groups represented are relatively small, and do not contain children of the initial node. The group overviews 150 and 160 contain headers and text overview strings, such as group header 151, and node overview strings 152 and 153, organized in the same way as in FIG. 1.

FIG. 3 shows the third part of the same discussion as that shown in FIGS. 1 and 2, and again would be visible by scrolling or otherwise navigating from the views shown in FIGS. 1 and 2. Groups 3 and 4, represented by group overviews 170 and 180 respectively, are concerned with the ethics of purchasing something for less than its value as known to the buyer.

FIG. 4 illustrates a longer group summary, also obtained by methods described in this invention. Group summary 300 represents the same subtopic as that of group overview 170 of FIG. 3. It contains, by a parameter setting, approximately 10% of the total number of sentences in the texts of group 3. At the same time, by a series of parameter settings described in conjunction with FIGS. 23 through 27 hereinbelow, group summary string 300 represents fewer nodes than its counterpart group overview 170 of FIG. 3. Instead, it focuses on the texts that are most lexically central, and incorporates a considerable number of sentences from each, yielding a more readable digest that can often serve as a useful substitute for reading the texts in the group.

Group summary 300 comprises group header 310, a set of text summary strings 330, 340, 350, and 360, and a gap indicator 320. The text summary strings are slightly different in structure from the message overview strings of FIGS. 1 through 3, in that they include the text author, for example, text-author 331. However, this is an arbitrary difference; either type of text representation string could include or not include a text author or other text identifier. The text summary strings are also organized slightly differently from the text overview strings of FIGS. 1 through 3, in that a simple indented tree is used to portray the tree structure. In the indented tree, a summary string for a text is indented under the summary string for the nearest parent node that is represented in the summary. Gap indicators, such as gap indicator 320, indicate texts in the group not represented in the summary, and occurring in a tree path between the start of the group and some represented message, or between two represented texts. It should be noted that the two methods of organizing a set of text representation strings, namely via the reduced-width trees of FIGS. 1, 2, and 3, or the indented trees used in FIG. 4, can be used interchangeably for overviews and summaries alike. The reduced-width trees are useful for limiting the width of the digests when the text representation strings are deeply nested, and allow convenient "down the page" reading. The indented trees are a more familiar form that may be desirable in certain applications.

FIGS. 5 through 8 illustrate the results of the methods of this invention concerned with visualizing large text trees, including text trees divided into logical groupings obtained by, but not limited to, the clustering techniques provided by this invention. FIG. 5 shows the structure of a segmented treetable presentation of a text tree. It is divided into two frames. Upper frame 500 contains a legend link 510, which, when selected, obtains a legend popup 570 containing a precis of the tree subdivisions, described in more detail with respect to FIG. 7 hereinbelow, and the remainder of the frame contains a full tree representation 520, described in more detail with respect to FIG. 6 hereinbelow. Lower frame 540 contains an expanded treetable 550 for a single segment, selected from among those presented in outline segmented treetable 520, and is described in more detail with respect to FIG. 8 hereinbelow. Because the content of both the upper and lower frames may be too long to be contained in a half-window, scrollbars 530 and 560 are provided so that the frames may be independently scrolled.

FIG. 6 illustrates a segmented outline treetable 600 which can be used to represent full tree 520 in upper frame 530 of FIG. 5. The outline treetable 600 is divided into segments 610, 620, 630, and 640, each segment containing one or more subtrees, and each segment sized s


Free Web Sudoku Puzzles.
Solve with your browser.
  3   4 5   8    
          7   3  
5       8     6 7
          6   5  
    4       7    
  2   3          
8 1     2       9
  5   7          
    3   1 9   8  
What is it?



Add Your Site · Terms Of Service · Privacy Policy


DISCLAIMER
Linkgrinder is a free service that searches the Internet and indexes all files found so that you may search quickly and easily for shared files. These files are created and made available individually by users whose identity we are not aware of and who we have no control over. In essence we function like a search engine tool; these files ARE NOT STORED OR SERVED BY OUR NETWORK. We are not responsible for any materials obtained by using our service. We do not monitor any of the contents of these files. These files may contain viruses, illegal materials, materials inappropriate for minors, offensive files and the like. BY USING OUR SERVICE, YOU ASSUME FULL RESPONSIBILITY FOR DOWNLOADING THESE MATERIALS AND WILL INDEMNIFY US FOR ANY DAMAGES THAT MAY BE INCURRED.

For More Specific Information VIEW OUR TERMS OF SERVICE.

Thank you and Enjoy!