System and method for compressing a data table using models. Patent Number 7,143,046, United States Patent and Trademark Office (PTO)

Title: System and method for compressing a data table using models

Abstract: A system for, and method of compressing a data table using models and a database management system incorporating the system or the method. In one embodiment, the system includes: (1) a table modeller that discovers data mining models with guaranteed error bounds of at least one attribute in the data table in terms of other attributes in the data table and (2) a model selector, associated with the table modeller, that selects a subset of the at least one model to form a basis upon which to compress the data table.

Patent Number: 7,143,046 Issued on 11/28/2006 to Babu et al.


Inventors: Babu; Shivnath (Menlo Park, CA), Garofalakis; Minos N. (Chatham Township, NJ), Rastogi; Rajeev (New Providence, NJ)
Assignee: Lucent Technologies Inc. (Murray Hill, NJ)
Appl. No.: 10/033,199
Filed: December 28, 2001


Current U.S. Class: 704/500 ; 707/101
Current International Class: G06F 7/00 (20060101); G06F 17/30 (20060101)
Field of Search: 704/245,255,500,501,503 707/100,101,102


References Cited [Referenced By]

U.S. Patent Documents
5537589 July 1996 Dalal
5799311 August 1998 Agrawal et al.
6031995 February 2000 George
6189005 February 2001 Chakrabarti et al.
6247016 June 2001 Rastogi et al.
6308172 October 2001 Agrawal et al.
6473757 October 2002 Garofalakis et al.
6542894 April 2003 Lee et al.
6581058 June 2003 Fayyad et al.
6633882 October 2003 Fayyad et al.
6651048 November 2003 Agrawal et al.
6760724 July 2004 Chakrabarti et al.
6810368 October 2004 Pednault

Other References

Jagadish et al., "Semantic Compression and Pattern Extraction with Fascicles," Proceedings of the 25th VLDB Conference, 1999, pp. 186 to 197. cited by examiner.

Primary Examiner: Lerner; Martin

Claims



What is claimed is:

1. A data table compressor, comprising: a table modeller that discovers at least one model of data mining models with guaranteed error bounds of at least one attribute in a data table in terms of other attributes in different columns of said data table; a model selector, associated with said table modeller, that selects a subset of said at least one model to form a basis upon which to compress said data table to form a compressed data table; and a row aggregator that employs said selected subset from said model selector to improve a compression ratio of said compressed data table via row-wise clustering.

2. The data table compressor as recited in claim 1 wherein said table modeller employs classification and regression tree data mining models to model said at least one attribute.

3. The data table compressor as recited in claim 2 wherein construction of said models uses integrated building and pruning to exploit specified error bounds and decrease model construction time.

4. The data table compressor as recited in claim 2 wherein values for said at least one attribute are represented in said compressed data table by at least one of said classification and regression tree data mining models and are not explicitly stored therein.

5. The data table compressor as recited in claim 1 wherein said model selector employs a Bayesian network built on said at least one attribute to select relevant models for table compression.

6. The data table compressor as recited in claim 1 wherein said table modeller employs a selected one of a constraint-based and a scoring-based method to generate said at least one model.

7. The data table compressor as recited in claim 1 wherein said model selector selects said subset based upon a compression ratio and an error bound specific for each attribute of said data table.

8. The data table compressor as recited in claim 1 wherein said model selector selects said subset using a model built on attributes of said data table by a selected one of: repeated calls to a maximum independent set solution algorithm, and a greedy search algorithm.

9. A method of compressing a data table, comprising: discovering at least one model of data mining models with guaranteed error bounds of at least one attribute in said data table in terms of other attributes in different columns of said data table; selecting a subset of said at least one model to form a basis upon which to compress said data table; and employing said selected subset to improve a compression ratio of said compressed data table via row-wise clustering.

10. The method as recited in claim 9 wherein said discovering comprises employing classification and regression tree data mining models to model said at least one attribute.

11. The method as recited in claim 10 further comprising using integrated building and pruning to exploit specified error bounds and decrease model construction time.

12. The method as recited in claim 9 wherein said discovering comprises employing a Bayesian network built on said at least one attribute to select relevant models for table compression.

13. The method as recited in claim 9 wherein said discovering comprises employing a selected one of a constraint-based and a scoring-based method to generate said at least one model.

14. The method as recited in claim 9 wherein said selecting comprises selecting said subset based upon a compression ratio and an error bound specific for each attribute of said data table.

15. The method as recited in claim 9 wherein said selecting is NP-hard.

16. The method as recited in claim 9 wherein said selecting comprises selecting said subset using a model built on attributes of said data table by a selected one of: repeated calls to a maximum independent set solution algorithm, and a greedy search algorithm.

17. A database management system, comprising: a data structure having at least one data table therein; a database controller for allowing data to be provided to and extracted from said data structure; and a system for compressing said at least one data table, including: a table modeller that discovers at least one model of data mining models with guaranteed error bounds of at least one attribute in said data table in terms of other attributes in different columns of said data table, a model selector, associated with said table modeller, that selects a subset of said at least one model to form a basis upon which to compress said data table to form a compressed data table, and a row aggregator that employs said selected subset from said model selector to improve a compression ratio of said compressed data table via row-wise clustering.

18. The system as recited in claim 17 wherein said table modeller employs classification and regression tree data mining models to model said at least one attribute.

19. The system as recited in claim 18 wherein construction of said models uses integrated building and pruning to exploit specified error bounds and decrease model construction time.

20. The system as recited in claim 17 wherein said model selector employs a Bayesian network built on said at least one attribute to select relevant models for table compression.

21. The system as recited in claim 17 wherein said table modeller employs a selected one of a constraint-based and a scoring-based method to generate said at least one model.

22. The system as recited in claim 17 wherein said model selector selects said subset based upon a compression ratio and an error bound specific for each attribute of said data table.

23. The system as recited in claim 17 wherein said model selector selects said subset using a model built on attributes of said data table by a selected one of: repeated calls to a maximum independent set solution algorithm, and a greedy search algorithm.
Description



TECHNICAL FIELD OF THE INVENTION

The present invention is directed, in general, to data compression and, more specifically, to a system and method for compressing an arbitrary data table using models.

BACKGROUND OF THE INVENTION

Effective analysis and compression of massive, high-dimensional tables of alphanumeric data is a ubiquitous requirement for a variety of application environments. For instance, massive tables of both network-traffic data and "Call-Detail Records" (CDR) of telecommunications data are continuously explored and analyzed to produce certain information that enables various key network-management tasks, including application and user profiling, proactive and reactive resource management, traffic engineering, and capacity planning, as well as providing and verifying Quality-of-Service guarantees for end users. Effective querying of these massive data tables helps to ensure efficient and reliable service.

Traditionally, data compression issues arise naturally in applications dealing with massive data sets, and effective solutions are crucial for optimizing the usage of critical system resources, like storage space and I/O bandwidth (for storing and accessing the data) and network bandwidth (for transferring the data across sites). In mobile-computing applications, for instance, clients are usually disconnected and, therefore, often need to download data for offline use.

Thus, for efficient data transfer and client-side resource conservation, the relevant data needs to be compressed. Several statistical and dictionary-based compression methods have been proposed for text corpora and multimedia data, some of which (e.g., Lempel-Ziv or Huffman) yield provably optimal asymptotic performance in terms of certain ergodic properties of the data source. These methods, however, fail to provide adequate solutions for compressing a massive data table, as they view the table as a large byte string and do not account for the complex dependency patterns in the table.

Existing compression techniques are "syntactic" in the sense that they operate at the level of consecutive bytes of data. As explained above, such syntactic methods typically fail to provide adequate solutions for table-data compression, since they essentially view the data as a large byte string and do not exploit the complex dependency patterns in the data structure. Popular compression programs (e.g., gzip, compress) employ the Lempel-Ziv algorithm which treats the input data as a byte string and performs lossless compression on the input. Thus, these compression routines, when applied to massive tables, do not exploit data semantics or permit guaranteed error lossy compression of data.

Attributes (i.e., "columns") with a discrete, unordered value domain are referred to as "categorical," whereas those with ordered value domains are referred to as "numeric." Lossless compression schemes are primarily used on numeric attributes and do not exploit correlations between attributes. For instance, in certain page-level compression schemes, the minimum value of each numeric attribute occurring in the tuples (i.e., "rows") of a page is stored separately, once for the entire page. Further, instead of storing the original value for the attribute in a tuple, the difference between the original value and the minimum is stored in the tuple. Thus, since storing the difference consumes fewer bits, the storage space overhead of the table is reduced. Tuple Differential Coding (TDC) is a compression method that also achieves space savings by storing differences instead of actual values for attributes. However, for each attribute value in a tuple, the stored difference is relative to the attribute value in the preceding tuple.
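The two difference-based schemes described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the schemes as published: it handles a single numeric column held in a non-empty Python list, and the function names are hypothetical.

```python
def page_min_encode(page):
    """Page-level scheme: store the page minimum once, then each value
    as its (non-negative) offset from that minimum."""
    base = min(page)  # assumes the page holds at least one value
    return base, [value - base for value in page]


def page_min_decode(base, offsets):
    """Recover the original values by adding the minimum back."""
    return [base + offset for offset in offsets]


def differential_encode(column):
    """TDC-style scheme: keep the first value, then store each value as
    its difference from the value in the preceding tuple."""
    deltas = [column[0]]
    for previous, current in zip(column, column[1:]):
        deltas.append(current - previous)
    return deltas


def differential_decode(deltas):
    """Recover the original values with a running sum of the differences."""
    values = []
    running = 0
    for delta in deltas:
        running += delta
        values.append(running)
    return values
```

Both encodings are lossless: decoding returns exactly the input. The savings come from the offsets and differences usually needing fewer bits than the raw values, which a real implementation would exploit with a bit-packed layout that this sketch omits.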

Other lossless compression schemes have been derived that essentially partition the set of attributes of a table T into groups of correlated attributes that compress well (by examining a small amount of training material) and then simply use gzip to compress the projection of T on each group. Another approach for lossless compression first constructs a Bayesian network on the attributes of the table and then rearranges the table's attributes in an order that is consistent with a topological sort of the Bayesian network graph. A key intuition is that reordering the data (using the Bayesian network) results in correlated attributes being stored in close proximity; consequently, tools like gzip yield better compression ratios for the reordered table.

Another instance of a lossless compression algorithm for categorical attributes is one that uses data mining techniques (e.g., classification trees, frequent item sets) to find sets of categorical attribute values that occur frequently in the table. The frequent sets are stored separately (as rules) and occurrences of each frequent set in the table are replaced by the rule identifier for the set.
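The rule-replacement idea can be illustrated with a short sketch. It assumes, for simplicity, that the set of attributes to examine is fixed in advance and that rows are Python dictionaries; an actual frequent-item-set miner would discover the attribute subsets itself. All names here are hypothetical.

```python
from collections import Counter


def build_rules(rows, attrs, min_count):
    """Find value combinations over `attrs` that occur at least
    `min_count` times and assign each one a small integer rule id."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    frequent = sorted(combo for combo, n in counts.items() if n >= min_count)
    return {combo: rule_id for rule_id, combo in enumerate(frequent)}


def encode(rows, attrs, rules):
    """Replace each occurrence of a frequent combination by its rule id;
    infrequent combinations are kept verbatim."""
    encoded = []
    for row in rows:
        combo = tuple(row[a] for a in attrs)
        rest = {k: v for k, v in row.items() if k not in attrs}
        if combo in rules:
            encoded.append(("rule", rules[combo], rest))
        else:
            encoded.append(("raw", combo, rest))
    return encoded


def decode(encoded, attrs, rules):
    """Expand rule ids back into attribute values (lossless)."""
    by_id = {rule_id: combo for combo, rule_id in rules.items()}
    rows = []
    for kind, payload, rest in encoded:
        combo = by_id[payload] if kind == "rule" else payload
        row = dict(zip(attrs, combo))
        row.update(rest)
        rows.append(row)
    return rows
```

The rule table is stored once, so every row that matches a frequent combination pays only for a rule identifier rather than for the full set of values.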

However, compared to these conventional compression methods for text or multimedia data, effectively compressing massive data tables presents a host of novel challenges due to several distinct characteristics of table data sets and their analysis. Due to the exploratory nature of many data-analysis applications, there are several scenarios in which an exact answer may not be required, and analysts may in fact prefer a fast, approximate answer, as long as the system can guarantee an upper bound on the error of the approximation. For example, during a drill-down query sequence in ad-hoc data mining, initial queries in the sequence frequently have the sole purpose of determining the truly interesting queries and regions of the data table. Providing (reasonably accurate) approximate answers to these initial queries gives analysts the ability to focus their explorations quickly and effectively, without consuming inordinate amounts of valuable system resources.

Thus, in contrast to traditional lossless data compression, the compression of massive tables can often afford to be lossy, as long as some (user-defined or application-defined) upper bounds on the compression error are guaranteed by the compression algorithm. This is obviously an important differentiation, as even small error tolerances can help achieve better compression ratios.

Effective table compression mandates, therefore, using compression procedures and techniques that are semantic in nature, in the sense that they account for and exploit both (1) the meanings and dynamic ranges of individual attributes (e.g., by taking advantage of the specified error tolerances); and, (2) existing data dependencies and correlations among attributes in the table.

Accordingly, what is needed in the art is a system and method that takes advantage of attribute semantics and data-mining models to perform guaranteed error lossy compression of massive data tables.

SUMMARY OF THE INVENTION

To address the above-discussed deficiencies of the prior art, the present invention provides a system for, and method of compressing a data table and a database management system incorporating the system or the method. In one embodiment, the system includes: (1) a table modeller that discovers data mining models with guaranteed error bounds of at least one attribute in the data table in terms of other attributes in the data table and (2) a model selector, associated with the table modeller, that selects a subset of the at least one model to form a basis upon which to compress the data table.

The present invention therefore introduces the broad concept of effectively compressing data tables by taking advantage of attribute semantics and data mining models to perform lossy compression of massive data tables with a guaranteed error bound.

In one embodiment of the present invention, the table modeller employs classification and regression tree data mining models to model the at least one attribute. The tree in each such model is called a "CaRT." CaRTs are by themselves conventional, but have not until now been employed for compressing data tables.

In one embodiment of the present invention, the model selector employs a Bayesian network built on the at least one attribute to select relevant models for table compression. Those skilled in the pertinent art are familiar with Bayesian networks. Such networks find their first use in guaranteed error lossy compression of data tables.

In one embodiment of the present invention, the table modeller employs a selected one of a constraint-based and a scoring-based method to generate the at least one model. Such models will be set forth in detail in the Description that follows.

In one embodiment of the present invention, the model selector selects the subset based upon a compression ratio and an error bound. Thus, a compression technique that offers maximum compression, without exceeding error tolerance, is advantageously selected. However, those skilled in the pertinent art will understand that the subset may be selected on other or further bases.

In one embodiment of the present invention, the process by which the model selector selects the subset is NP-hard. Alternatively, two novel algorithms will be hereinafter described that allow subset selection to occur faster, at some cost.

In one embodiment of the present invention, the model selector selects the subset using a model built on attributes of the data table by a selected one of: (1) repeated calls to a maximum independent set solution algorithm and (2) a greedy search algorithm. These techniques will be set forth in the Detailed Description that follows.

The foregoing has outlined, rather broadly, preferred and alternative features of the present invention so that those skilled in the art may better understand the detailed description of the invention that follows. Additional features of the invention will be described hereinafter that form the subject of the claims of the invention. Those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiment as a basis for designing or modifying other structures for carrying out the same purposes of the present invention. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the invention in its broadest form.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

FIG. 1 illustrates an exemplary system for compressing a data table;

FIG. 2 illustrates one embodiment of the procedure greedy ( ), a CaRT-selection algorithm, which may be implemented within the CaRTSelector 120;

FIGS. 3A-3D illustrate a Bayesian network graph defined over attributes $X_1, \ldots, X_4$;

FIG. 4 illustrates one embodiment of a procedure MaxIndependentSet ( ) CaRT-selection algorithm which alleviates some drawbacks of the greedy algorithm;

FIG. 5 illustrates one embodiment of a procedure LowerBound ( ), an algorithm for computing L(N) for each "still to be expanded" leaf N in a partial tree R;

FIG. 6 illustrates diagrams for compression ratios for gzip, fascicles and the table compressor 100 for three data sets; and

FIGS. 7A-7C illustrate diagrams showing the effect of an error threshold and sample size on compression ratio/running time.

DETAILED DESCRIPTION

Referring initially to FIG. 1, illustrated is a table compressor 100, an embodiment of an exemplary system for compressing a data table according to the principles of the present invention. The table compressor 100 takes advantage of attribute semantics and data-mining models to perform lossy compression of massive data tables. The table compressor 100 is based in part upon a novel idea of exploiting data correlations and user-specified "loss"/error tolerances for individual attributes to construct concise and accurate "Classification and Regression Tree (CaRT)" models for entire columns of a table.

More precisely, the table compressor 100 generally selects a certain subset of attributes (referred to as "predicted" attributes) for which no values are explicitly stored in the compressed table; instead, concise CaRTs that predict these values (within the prescribed error bounds) are maintained. Thus, for a predicted attribute X that is strongly correlated with other attributes in the table, the table compressor 100 is typically able to obtain a very succinct CaRT predictor for the values of X, which can then be used to completely eliminate the column for X in the compressed table. Clearly, storing a compact CaRT model in lieu of millions or billions of actual attribute values can result in substantial savings in storage. In addition, allowing for errors in the attribute values predicted by a CaRT model only serves to reduce the size of the model even further and, thus, improve the quality of compression; this is because, as is well known to those skilled in the art, the size of a CaRT model typically grows with the accuracy with which it models a given set of values.
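To make the idea of replacing a column by a predictor concrete, here is a toy sketch. It is not the CaRT construction of the invention: it assumes a single numeric predictor attribute, splits at the median distinct predictor value, and keeps any value the tree cannot predict within the tolerance as an explicitly stored outlier. All function names and the `max_depth` parameter are hypothetical.

```python
def build_tree(points, tol, max_depth):
    """Build a small regression tree over (x, y) pairs. A leaf predicts the
    midpoint of the y-range it covers, so when that range is at most
    2 * tol every covered value is within tol of the prediction."""
    ys = [y for _, y in points]
    lo, hi = min(ys), max(ys)
    xs = sorted({x for x, _ in points})
    if hi - lo <= 2 * tol or max_depth == 0 or len(xs) < 2:
        return ("leaf", (lo + hi) / 2)
    split = xs[len(xs) // 2]  # median distinct x; both sides are non-empty
    left = [p for p in points if p[0] < split]
    right = [p for p in points if p[0] >= split]
    return ("node", split,
            build_tree(left, tol, max_depth - 1),
            build_tree(right, tol, max_depth - 1))


def predict(tree, x):
    """Walk from the root to a leaf and return the leaf's value."""
    while tree[0] == "node":
        _, split, left, right = tree
        tree = left if x < split else right
    return tree[1]


def compress_column(rows, predictor, target, tol, max_depth=8):
    """Replace the target column by a tree plus a map of outliers, i.e.
    row positions whose value the tree misses by more than tol."""
    points = [(row[predictor], row[target]) for row in rows]
    tree = build_tree(points, tol, max_depth)
    outliers = {i: y for i, (x, y) in enumerate(points)
                if abs(predict(tree, x) - y) > tol}
    return tree, outliers


def decompress_column(rows, predictor, tree, outliers):
    """Rebuild approximate target values from the predictor column."""
    return [outliers.get(i, predict(tree, row[predictor]))
            for i, row in enumerate(rows)]
```

A larger tolerance lets leaves cover wider value ranges, so the tree stops splitting earlier and fewer outliers need to be stored, which is the trade-off between model size and accuracy described above.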

Generally, the table compressor 100 focuses upon optimizing the compression ratio, that is, achieving the maximum possible reduction in the size of the data within the acceptable levels of error defined by the user. This choice is mainly driven by the massive, long-lived data sets that are characteristic of target data warehousing applications and the observation that the computational cost of effective compression can be amortized over the numerous physical operations (e.g., transmissions over a low-bandwidth link) that will take place during the lifetime of the data. Also, the table compressor 100 can tune compression throughput performance through control of the size of the data sample used by the table compressor's 100 model-construction algorithms and procedures, as will be described in more detail below. Setting the sample size based on the amount of main memory available in the system can help ensure high compression speeds.

A framework for the semantic compression of tables is based upon two technical ideas. First, the (user-specified or application-specified) error bounds on individual attributes are exploited in conjunction with data mining techniques to efficiently build accurate models of the data. Second, the input table is compressed using a select subset of the models built. This select subset of data-mining models is carefully chosen to capture large portions of the input table within the specified error bounds.

More formally, the model-based, compressed version of the input table T may be defined as a pair $T_c = \langle T', \{M_1, \ldots, M_p\} \rangle$ where (1) T' is a small (possibly empty) subset of the data values in T that are retained accurately in $T_c$; and, (2) $\{M_1, \ldots, M_p\}$ is a select set of data-mining models, carefully built with the purpose of maximizing the degree of compression achieved for T while obeying the specified error-tolerance constraints.

Abstractly, the role of the set T' is to capture values (tuples or sub-tuples) of the original table that cannot be effectively "summarized away" in a compact data-mining model within the specified error tolerances. (Some of these values may in fact be needed as input to the selected models.) The attribute values in T' can either be retained as uncompressed data or be compressed using a conventional lossless algorithm.

A definition of the general model-based semantic compression problem can now be stated as follows.

Model-Based Semantic Compression (MBSC)

Given a massive, multi-attribute table T and a vector of (per-attribute) error tolerances $\bar{e}$, find a collection of models $\{M_1, \ldots, M_m\}$ and a compression scheme for T based on these models $T_c = \langle T', \{M_1, \ldots, M_p\} \rangle$ such that the specified error bounds are not exceeded and the storage requirements $|T_c|$ of the compressed table are minimized.

An input to the table compressor 100 may consist of an n-attribute table T, comprising a large number of tuples (rows). $X = \{X_1, \ldots, X_n\}$ denotes the set of n attributes of T, and $\mathrm{dom}(X_i)$ represents the domain of attribute $X_i$. Attributes with a discrete, unordered value domain are referred to as "categorical," whereas those with ordered value domains are referred to as "numeric." $T_c$ is used to denote the compressed version of table T, and $|T|$ ($|T_c|$) to denote the storage-space requirements for T ($T_c$) in bytes.

One important input parameter to the semantic compression algorithms of the table compressor 100 is a (user-specified or application-specified) n-dimensional vector of error tolerances $\bar{e} = [e_1, \ldots, e_n]$ that defines the per-attribute acceptable degree of information loss when compressing T. (Per-attribute error bounds are also employed in a fascicles framework, to be described in more detail below.) Intuitively, the i-th entry $e_i$ of the tolerance vector specifies an upper bound on the error by which any (approximate) value of $X_i$ in the compressed table $T_c$ can differ from its original value in T. Error tolerance semantics can differ across categorical and numeric attributes, due to the very different nature of the two attribute classes.

1. For a numeric attribute $X_i$, the tolerance $e_i$ defines an upper bound on the absolute difference between the actual value of $X_i$ in T and the corresponding (approximate) value in the compressed table $T_c$. That is, if x, x' denote the accurate and approximate value (respectively) of attribute $X_i$ for any tuple of T, then the compressor guarantees that $x \in [x' - e_i, x' + e_i]$.

2. For a categorical attribute $X_i$, the tolerance $e_i$ defines an upper bound on the probability that the (approximate) value of $X_i$ in $T_c$ is different from the actual value in T. More formally, if x, x' denote the accurate and approximate value (respectively) of attribute $X_i$ for any tuple of T, then the compressor guarantees that $P[x = x'] \geq 1 - e_i$.
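The two tolerance definitions above can be expressed as simple checks on an original column and its approximation. This sketch makes one assumption beyond the text: the probability in the categorical case is estimated as the fraction of matching tuples in the column. The function names are hypothetical.

```python
def numeric_within_tolerance(original, approx, tol):
    """Numeric attribute: every approximate value must lie within
    tol of the corresponding original value."""
    if len(original) != len(approx):
        raise ValueError("columns must have the same length")
    return all(abs(x - xa) <= tol for x, xa in zip(original, approx))


def categorical_within_tolerance(original, approx, tol):
    """Categorical attribute: the fraction of tuples whose approximate
    value differs from the original must not exceed tol."""
    if len(original) != len(approx):
        raise ValueError("columns must have the same length")
    if not original:
        return True  # an empty column trivially satisfies the bound
    mismatches = sum(1 for x, xa in zip(original, approx) if x != xa)
    return mismatches / len(original) <= tol
```

With a tolerance of 0, both checks accept only an exact copy of the column, which corresponds to lossless compression.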

For numeric attributes, the error tolerance could very well be specified in terms of quantiles of the overall range of values rather than absolute, constant values. Similarly, for categorical attributes the probability of error could be specified separately for each individual attribute class (i.e., value) rather than an overall measure. (Note that such an extension would, in a sense, make the error bounds for categorical attributes more "local," similar to the numeric case.) The model-based compression framework and algorithms of the present invention can be readily extended to handle these scenarios, so the specific definitions of error tolerance are not central to the present invention. However, for the sake of clarity, the definitions outlined above are used for the two attribute classes. (Note that the error-tolerance semantics can also easily capture lossless compression as a special case, by setting e.sub.i=0 for all i.)

One algorithmic issue faced by the table compressor 100 is that of computing an optimal set of CaRT models for the input table such that (a) the overall storage requirements of the compressed table are minimized, and (b) all predicted attribute values are within the user-specified error bounds. This can be a very challenging optimization problem since, not only is there an exponential number of possible CaRT-based models to choose from, but also building CaRTs (to estimate their compression benefits) is a computation-intensive task, typically requiring multiple passes over the data. As a consequence, the table compressor 100 employs a number of sophisticated techniques from the areas of knowledge discovery and combinatorial optimization in order to efficiently discover a "good" (sub)set of predicted attributes and construct the corresponding CaRT models.

In some practical cases, the use of fascicles can effectively exploit the specified error tolerances to achieve high compression ratios. As alluded to above, a fascicle basically represents a collection of tuples that have approximately matching values for some (but not necessarily all) attributes, where the degree of approximation is specified by user-provided compactness parameters. Essentially, fascicles can be seen as a specific form of data-mining models, i.e., clusters in subspaces of the full attribute space, where the notion of a cluster is based on the acceptable degree of loss during data compression. As stated above, a key idea of fascicle-based semantic compression is to exploit the given error bounds to allow for aggressive grouping and "summarization" of values by clustering multiple rows of the table along several columns (i.e., the dimensionality of the cluster).

There are, however, several scenarios for which a more general, model-based compression approach is in order. Fascicles only try to detect "row-wise" patterns, where sets of rows have similar values for several attributes. Such "row-wise" patterns within the given error bounds can be impossible to find when strong "column-wise" patterns/dependencies (e.g., functional dependencies) exist across attributes of the table. On the other hand, different classes of data-mining models (like Classification and Regression Trees (CaRTs)) can accurately capture and model such correlations and, thereby, attain much better semantic compression in such scenarios.

There is a need for a semantic compression methodology that is more general than simple fascicle-based row clustering in that it can account for and exploit strong dependencies among the attributes of the input table. Data mining, such as the data mining employed by the table compressor 100, offers models (i.e., CaRTs) that can accurately capture such dependencies with very concise data structures. Thus, in contrast to fascicles, the general model-based semantic compression paradigm can accommodate such scenarios.

As alluded to above, row-wise pattern discovery and clustering for semantic compression have been explored in the context of fascicles. In contrast, the table compressor 100 focuses primarily on the novel problems arising from the need to effectively detect and exploit (column-wise) attribute dependencies for the purposes of semantic table compression. One principle underlying the table compressor 100 is that, in many cases, a small classification (regression) tree structure can be used to accurately predict the values of a categorical (resp., numeric) attribute (based on the values of other attributes) for a very large fraction of table rows. This means that for such cases, compression algorithms can completely eliminate the predicted column in favor of a compact predictor (i.e., a classification or regression tree model) and a small set of outlier column values. More formally, the design and architecture of the table compressor 100 focuses mainly on the following concrete MBSC problem.

MBSC Problem Definition: Given a massive, multi-attribute table T with a set of categorical and/or numeric attributes X, and a vector of (per-attribute) error tolerances e, find a subset {X.sub.1, . . . , X.sub.p} of X and a collection of corresponding CaRT models {M.sub.1, . . . , M.sub.p} such that: (1) model M.sub.i is a predictor for the values of attribute X.sub.i based solely on attributes in X-{X.sub.1, . . . , X.sub.p}, for each i=1, . . . , p; (2) the specified error bounds are not exceeded; and (3) the storage requirements |T.sub.c| of the compressed table T.sub.c=<T.sup.1, {M.sub.1, . . . , M.sub.p}> are minimized.

Abstractly, the semantic compression algorithms seek to partition the set of input attributes X into a set of predicted attributes {X.sub.1, . . . , X.sub.p} and a set of predictor attributes X-{X.sub.1, . . . , X.sub.p} such that the values of each predicted attribute can be obtained within the specified error bounds based on (a subset of) the predictor attributes through a small classification or regression tree (except perhaps for a small set of outlier values). (The notation X.sub.i.fwdarw.X.sub.i is used to denote a CaRT predictor for attribute X.sub.i, where the X.sub.i to the left of the arrow stands for the set of predictor attributes used, a subset of X-{X.sub.1, . . . , X.sub.p}.) Note that a predicted attribute X.sub.i is not allowed to also be a predictor for a different attribute. This restriction is important since predicted values of X.sub.i can contain errors, and these errors can cascade further if the erroneous predicted values are used as predictors, ultimately causing error constraints to be violated. The final goal, of course, is to minimize the overall storage cost of the compressed table. This storage cost |T.sub.c| is the sum of two basic components:

1. Materialization cost, i.e., the cost of storing the values for all predictor attributes X-{X.sub.1, . . . , X.sub.p}. This cost is represented in the T.sup.1 component of the compressed table, which is basically the projection of T onto the set of predictor attributes. (The storage cost of materializing attribute X.sub.i is denoted by a procedure MaterCost (X.sub.i).)

2. Prediction cost, i.e., the cost of storing the CaRT models used for prediction plus (possibly) a small set of outlier values of the predicted attribute for each model. (The storage cost of predicting attribute X.sub.i through the CaRT predictor X.sub.i.fwdarw.X.sub.i is denoted by a procedure PredCost (X.sub.i.fwdarw.X.sub.i); note that this does not include the cost of materializing the predictor attributes in the predictor set.)

Metrics

The basic metric used to compare the performance of different compression algorithms and the table compressor 100 is the well-known compression ratio, defined as the ratio of the size of the compressed data representation produced by the algorithm to the size of the original (uncompressed) input. A secondary performance metric is the compression throughput that, intuitively, corresponds to the rate at which a compression algorithm can process data from its input; this is typically defined as the size of the uncompressed input divided by the total compression time.
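As a small illustration, the two metrics can be written as Python functions. The function and parameter names are hypothetical, and sizes are assumed to be given in bytes and time in seconds.

```python
def compression_ratio(compressed_bytes, original_bytes):
    """Size of the compressed representation divided by the original size."""
    return compressed_bytes / original_bytes


def compression_throughput(original_bytes, seconds):
    """Uncompressed input size divided by total compression time."""
    return original_bytes / seconds


# A 1000-byte table compressed to 250 bytes in 4 seconds.
print(compression_ratio(250, 1000))      # 0.25
print(compression_throughput(1000, 4))   # 250.0
```

Under this definition a smaller ratio indicates better compression.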

Turning once again to FIG. 1, disclosed is one embodiment of the table compressor 100 that includes four major functional blocks: a DependencyFinder 110, a CaRTSelector 120, a CaRTBuilder 130 and a RowAggregator 140. In the following paragraphs, a brief overview of each functional block is provided; a more detailed description of each functional block and the relevant algorithms are discussed more fully below.

Generally, the DependencyFinder 110 produces a mathematical model, also known as an interaction model, using attributes of a data table 112. The output of the interaction model is then used to guide CaRT building algorithms, such as those used by the CaRTSelector 120 and the CaRTBuilder 130. The DependencyFinder 110 builds this interaction model in part because there is an exponential number of possibilities for building CaRT-based attribute predictors, and therefore a concise model that identifies the strongest correlations and "predictive" relationships in input data is needed. To help determine these correlations and "predictive" relationships in input data, an approach used by the DependencyFinder 110 is to construct a Bayesian network 115 model that captures the statistical interaction of an underlying set of attributes X.

A Bayesian network is a directed acyclic graph (DAG) whose edges reflect strong predictive correlations between nodes of the graph. Thus, a Bayesian network 115 on the table's attributes can be used to dramatically reduce the search space of potential CaRT models since, for any attribute, the most promising CaRT predictors are the ones that involve attributes in its "neighborhood" in the network. An implementation employed by the DependencyFinder 110 uses a constraint-based Bayesian network builder based on recently proposed algorithms for efficiently inferring the Bayesian network 115 from data. To control the computational overhead, the Bayesian network 115 may be built using a reasonably small random sample of the input table. Thus, intuitively, a set of nodes in the "neighborhood" of X.sub.i in G (e.g., X.sub.i's parents) captures the attributes that are strongly correlated to X.sub.i and, therefore, show promise as possible predictor attributes for X.sub.i.

After the Bayesian network 115 has been built, the CaRTSelector 120 is executed. Given the input table T 112 and error tolerances e.sub.i 123 (as well as the Bayesian network 115 on the attributes of T built by the DependencyFinder 110), the CaRTSelector 120 is generally responsible for selecting a collection of predicted attributes and the corresponding CaRT-based predictors such that a final overall storage cost is minimized (within the given error bounds). The CaRTSelector 120 employs the Bayesian network 115 built on X to intelligently guide a search through the huge space of possible attribute prediction strategies. Clearly, this search involves repeated interactions with the CaRTBuilder 130, which is responsible for actually building the CaRT models for the predictors.

However, even in the simple case where the set of nodes that is used to predict an attribute node in the Bayesian network 115 is fixed, the problem of selecting a set of predictors by the CaRTSelector 120 that minimizes the combination of materialization and prediction cost naturally maps to the Weighted Maximum Independent Set (WMIS) problem, which is known to be NP-hard and notoriously difficult to approximate.

Based on this observation, a specific CaRT-model selection strategy is therefore employed by the CaRTSelector 120. This selection strategy starts out with an initial solution obtained from a near-optimal heuristic for WMIS and then tries to incrementally improve it by small perturbations based on unique characteristics of the given variables. In an alternative embodiment of the present invention, a "greedy" model-selection procedure, greedy ( ), used by the CaRTSelector 120 chooses its set of predictors using a simple local condition during a single "roots-to-leaves" traversal of the Bayesian network 115, also referred to in the present application as a Bayesian network G. During the execution of the CaRTSelector 120, the CaRTBuilder 130 is repeatedly invoked and executed to build CaRT models for the predictors.

A significant portion of the table compressor 100 execution time is spent in building CaRT models. This is mainly because the table compressor 100 needs to actually construct many promising CaRTs in order to estimate their prediction cost, and CaRT construction is a computationally-intensive process. To reduce CaRT-building times and speed up system performance, the table compressor 100 employs the following three optimizations: (1) CaRTs may be built using random samples instead of the entire data set, (2) leaves may not be expanded if values of tuples in them can be predicted with acceptable accuracy, and (3) pruning is integrated into the tree growing phase using novel algorithms that exploit the prescribed error tolerance for the predicted attribute. These optimizations will be explained in more detail, below.

Given a collection of predicted and (corresponding) predictor attributes X.sub.i.fwdarw.X.sub.i, one goal of the CaRTBuilder 130 is to efficiently construct CaRT-based models for each predicted attribute X.sub.i in terms of its set of predictor attributes for the purposes of semantic compression. Induction of various CaRT-based models by the CaRTBuilder 130 is typically a computation-intensive process that requires multiple passes over the input data. The CaRT-construction algorithms of the CaRTBuilder 130 can, however, take advantage of compression semantics and can exploit the user-defined error tolerances to effectively prune computation. In addition, by building CaRTs using data samples instead of the entire data set, the CaRTBuilder 130 is able to further speed up model construction.

The CaRTBuilder 130 exploits the inferred Bayesian network structure G by using it to intelligently guide the selection of CaRT models that minimize the overall storage requirement, based on the prediction and materialization costs for each attribute. Intuitively, the goal is to minimize the sum of the prediction costs (for predicted attributes) and materialization costs (for attributes used in the CaRTs). This model-selection problem is a strict generalization of the Weighted Maximum Independent Set (WMIS) problem which is known by those skilled in the art to be NP-hard. By employing a novel algorithm (detailed below) that effectively exploits the discovered Bayesian structure in conjunction with efficient, near optimal WMIS heuristics, the CaRTBuilder 130 is able to obtain a good set of CaRT models for compressing the table.

Once the CaRTSelector 120 has finalized a "good" solution to the CaRT-based semantic compression problem based on computations performed by the CaRTBuilder 130, the CaRTSelector 120 then hands off its solution to the RowAggregator 140. The RowAggregator 140 tries to further improve the compression ratio through row-wise clustering. Briefly, the RowAggregator 140 uses a fascicle-based algorithm to compress the predictor attributes while ensuring (based on the CaRT models built) that errors in the predictor attribute values are not propagated through the CaRTs in a way that causes error tolerances (for predicted attributes) to be exceeded.

One important point here is that, since the entries of T.sup.1 are used as inputs to (approximate) CaRT models for other attributes, care must be taken to ensure that errors introduced in the compression of T.sup.1 do not compound over the CaRT models in a way that causes error guarantees to be violated. The issues involved in combining the CaRT-based compression methodology with row-wise clustering techniques are addressed in more detail below.

The CaRT-based compression methodology of the table compressor 100 is essentially orthogonal to techniques based on row-wise clustering, such as fascicles. It is entirely possible to combine the two techniques for an even more effective model-based semantic compression mechanism. As an example, the predictor attribute table T.sup.1 derived by the "column-wise" techniques can be compressed using a fascicle-based algorithm. (In fact, this is exactly the strategy used in the table compressor 100 implementation; however, other methods for combining the two are also possible.)

To reiterate, the essence of the CaRT-based semantic compression problem of the table compressor 100 of FIG. 1 lies in discovering a collection of "strong" predictive correlations among the attributes of an arbitrary table. The search space for this problem is obviously exponential: given any attribute X.sub.i, any subset of X-{X.sub.i} could potentially be used to construct a predictor for X.sub.i. Furthermore, verifying the quality of a predictor for the purposes of semantic compression is typically a computation-intensive task, since it involves actually building the corresponding classification or regression tree on the given subset of attributes. Since building an exponentially large number of CaRTs is clearly impractical, a methodology is disclosed for producing a concise interaction model that identifies the strongest predictive correlations among the input attributes. This model can then be used to restrict the search to interesting regions of the prediction space, limiting CaRT construction to truly promising predictors. Building such an interaction model is one main purpose of the DependencyFinder 110.

As stated previously, the specific class of attribute interaction models used in the table compressor 100 may be that of Bayesian networks. To reiterate, a Bayesian network is a combination of a probability distribution and a structural model in the form of a DAG over the attributes in which edges represent direct probabilistic dependence. In effect, a Bayesian network is a graphical specification of a joint probability distribution that is believed to have generated the observed data. Bayesian networks may be an essential tool for capturing causal and/or predictive correlations in observational data; such interpretations are typically based on the following dependence semantics of the Bayesian network structure:

Parental Markov Condition: Given a Bayesian network G over a set of attributes X, any node X.sub.i .epsilon. X is independent of all its non-descendant nodes given its parent nodes in G (denoted by .PI.(X.sub.i)).

Markov Blanket Condition: Given a Bayesian network G over a set of attributes X, the Markov blanket of X.sub.i .epsilon. X (denoted by .beta.(X.sub.i)) is defined as the union of X.sub.i's parents, X.sub.i's children, and the parents of X.sub.i's children in G. Any node X.sub.i .epsilon. X is independent of all other nodes given its Markov blanket in G.

Based on the above conditions, a Bayesian network over the attributes of the input table can provide definite guidance on the search for promising CaRT predictors for semantic compression. More specifically, it is clear that predictors of the form .PI.(X.sub.i).fwdarw.X.sub.i or .beta.(X.sub.i).fwdarw.X.sub.i should be considered as prime candidates for CaRT-based semantic compression.
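The two candidate predictor sets can be computed directly from the network structure. The following Python sketch assumes the DAG is stored as a mapping from each node to a list of its parents (a hypothetical representation) and uses the standard definition of the Markov blanket: a node's parents, its children, and the other parents of its children.

```python
def parents(dag, node):
    """Parent set of a node; dag maps each node to a list of its parents."""
    return set(dag.get(node, ()))


def children(dag, node):
    """All nodes that list the given node among their parents."""
    return {child for child, pars in dag.items() if node in pars}


def markov_blanket(dag, node):
    """Parents, children, and the children's other parents of a node."""
    blanket = parents(dag, node)
    for child in children(dag, node):
        blanket.add(child)
        blanket |= parents(dag, child)
    blanket.discard(node)  # a node is never part of its own blanket
    return blanket


# Hypothetical four-attribute network: X1 -> X2, X1 -> X3, {X2, X3} -> X4.
dag = {"X1": [], "X2": ["X1"], "X3": ["X1"], "X4": ["X2", "X3"]}
print(sorted(parents(dag, "X2")))         # ['X1']
print(sorted(markov_blanket(dag, "X2")))  # ['X1', 'X3', 'X4']
```

In this example X3 enters the blanket of X2 only because the two share the child X4, which shows how the Markov blanket neighborhood can be wider than the parent neighborhood.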

Construction Algorithm

Learning the structure of Bayesian networks from data is a difficult problem that has seen growing research interest in recent years. There are two general approaches to discovering Bayesian structure: (1) Constraint-based methods try to discover conditional independence properties between data attributes using appropriate statistical measures (e.g., chi-squared or mutual information) and then build a network that exhibits the observed correlations and independencies. (2) Scoring-based (or Bayesian) methods are based on defining a statistically-motivated score function (e.g., Bayesian or MDL-based) that describes the fitness of a probabilistic network structure to the observed data; the goal then is to find a structure that maximizes the score. (In general, this is a hard optimization problem, typically NP-hard.)
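To make the statistical measure used by constraint-based methods concrete, the following Python sketch estimates the (unconditional) mutual information between two categorical columns from their empirical frequencies. It is an illustration only, not the network-construction algorithm itself; the function name is hypothetical, and a real builder would compare such a score against a chosen threshold and would also need conditional variants.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two equal-length columns."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # I(X;Y) = sum over (x, y) of p(x, y) * log2(p(x, y) / (p(x) * p(y)))
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


# Perfectly dependent columns share one full bit of information.
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
# Independent columns share none.
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0
```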

The methods have different advantages. Given the intractability of scoring-based network generation, several heuristic search methods with reasonable time complexities have been proposed. Many of these scoring-based methods, however, assume an ordering for the input attributes and can give drastically different networks for different attribute orders. Further, due to their heuristic nature, such heuristic methods may not find the best structure for the data.

On the other hand, constraint-based methods have been shown to be asymptotically correct under certain assumptions about the data, but, typically, introduce edges in the network based on Conditional Independence (CI) tests that become increasingly expensive and unreliable as the size of the conditioning set increases. Also, several constraint-based methods have very high computational complexity, requiring, in the worst case, an exponential number of CI tests.

The DependencyFinder 110

The DependencyFinder 110 may implement a constraint-based Bayesian network builder, such as one based upon the algorithm of Cheng et al. in "Learning Belief Networks from Data: An Information Theory Based Approach" published in the November 1997 issue of the Proceedings of the Sixth International Conference on Information and Knowledge Management, which is hereby incorporated by reference in its entirety. Unlike earlier constraint-based methods, the algorithm of Cheng et al. explicitly tries to avoid complex CI tests with large conditioning sets and, by using CI tests based on mutual information divergence, eliminates the need for an exponential number of CI tests. In fact, given an n-attribute data set, the Bayesian network builder of the table compressor 100 requires at most O(n.sup.4) CI tests, which, in the present implementation, translates to at most O(n.sup.4) passes over the input tuples. Recall that the DependencyFinder 110 uses only a small random sample of the input table to discover the attribute interactions; the size of this sample can be adjusted according to the amount of main memory available, so that no I/O is incurred (other than that required to produce the sample).

Also, note that the DependencyFinder 110 is, in a sense, out of the "critical path" of the data compression process, since such attribute interactions are an intrinsic characteristic of the data semantics that only needs to be discovered once for each input table. The DependencyFinder 110 adds several enhancements to the basic Cheng et al. algorithm, such as the use of Bayesian-scoring methods for appropriately orienting the edges in the final network.

The CaRTSelector 120

The CaRTSelector 120 is an integral part of the table compressor 100 model-based semantic compression engine. Given the input data table and error tolerances, as well as the Bayesian network capturing the attribute interactions, a goal of the CaRTSelector 120 is to select (a) a subset of attributes to be predicted and (b) the corresponding CaRT-based predictors, such that the overall storage cost is minimized within the specified error bounds. As discussed above, the total storage cost |T.sub.c| is the sum of the materialization costs (of predictor attributes) and prediction costs (of the CaRTs for predicted attributes). In essence, the CaRTSelector 120 implements the core algorithmic strategies for solving the CaRT-based semantic compression problem. Deciding on a storage-optimal set of predicted attributes and corresponding predictors poses a hard combinatorial optimization problem; as the following theorem shows, the problem is NP-hard even in the simple case where the set of predictor attributes to be used for each attribute is fixed.

Theorem 1: Consider a given set of n predictors {X.sub.i.fwdarw.X.sub.i: for all X.sub.i .epsilon. X}, where the predictor set to the left of each arrow is a subset of X. Choosing a storage-optimal subset of attributes X.sub.pred of X to be predicted using attributes in X-X.sub.pred is NP-hard.

Interestingly, the simple instance of the CaRT-based semantic compression problem of the table compressor 100 described in the above theorem can be shown to be equivalent to the Weighted Maximum Independent Set (WMIS) problem, which is known to be NP-hard. The WMIS problem can be stated as follows: "Given a node-weighted, undirected graph G=(V,E), find a subset of nodes V' of V such that no two vertices in V' are joined by an edge in E and the total weight of nodes in V' is maximized." Abstractly, the partitioning of the nodes into V' and V-V' corresponds exactly to the partitioning of attributes into "predicted" and "materialized" with the edges of G capturing the "predicted by" relation. Further, the constraint that no two vertices in V' are adjacent in G ensures that all the (predictor) attributes for a predicted attribute (in V') are materialized, which is a requirement of the compression problem of the table compressor 100. Also, the weight of each node corresponds to the "storage benefit" (materialization cost minus prediction cost) of predicting the corresponding attribute. Thus, maximizing the storage benefit of the predicted attributes has the same effect as minimizing the overall storage cost of the compressed table.
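The WMIS statement above can be made concrete with a small exhaustive solver in Python. This is a sketch for illustration on tiny graphs only: it enumerates all subsets and therefore takes exponential time, and it is not one of the near-optimal heuristics referred to in this description. The function name and the example graph are hypothetical.

```python
from itertools import combinations


def max_weight_independent_set(weights, edges):
    """Exhaustively find an independent set of maximum total weight.

    weights: mapping node -> weight; edges: iterable of node pairs.
    Returns (set_of_nodes, total_weight); the empty set has weight 0.
    """
    nodes = list(weights)
    edge_set = {frozenset(edge) for edge in edges}
    best_set, best_weight = set(), 0
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            # Skip subsets containing two adjacent nodes.
            if any(frozenset(pair) in edge_set for pair in combinations(subset, 2)):
                continue
            weight = sum(weights[v] for v in subset)
            if weight > best_weight:
                best_set, best_weight = set(subset), weight
    return best_set, best_weight


# Path graph A - B - C: taking B alone (110) beats taking A and C (100).
print(max_weight_independent_set({"A": 50, "B": 110, "C": 50},
                                 [("A", "B"), ("B", "C")]))
```

Reading node weights as storage benefits, the returned set plays the role of the predicted attributes and the remaining nodes that of the materialized ones.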

Even though WMIS is known to be NP-hard and notoriously difficult to approximate for general graphs, several recent approximation algorithms have been proposed with guaranteed worst-case performance bounds for bounded-degree graphs. The optimization problem faced by the CaRTSelector 120 is obviously much harder than simple WMIS, since the CaRTSelector 120 is essentially free to decide on the set of predictor attributes for each CaRT. Further, the CaRTSelector 120 also has to invoke the CaRTBuilder 130 to actually build potentially useful CaRTs, and this construction is itself a computation-intensive task.

Given the inherent difficulty of the CaRT-based semantic compression problem, the CaRTSelector 120 implements two distinct heuristic search strategies that employ the Bayesian network model of T built by the DependencyFinder 110 to intelligently guide the search through the huge space of possible attribute prediction alternatives. The first strategy is a simple "greedy" selection algorithm that chooses CaRT predictors greedily based on their storage benefits during a single "roots-to-leaves" traversal of the Bayesian graph. The second, more complex strategy takes a less myopic approach that exploits the similarities between the CaRT-selection problem and WMIS; a key idea here is to determine the set of predicted attributes (and the corresponding CaRTs) by obtaining (approximate) solutions to a number of WMIS instances created based on the Bayesian model of T.

The Greedy CaRT Selector

Turning now to FIG. 2, illustrated is one embodiment of the procedure greedy ( ), a CaRT-selection algorithm, which may be implemented within the CaRTSelector 120. Briefly, the greedy algorithm visits the set of attributes X in the topological-sort order imposed by the constructed Bayesian network model G and tries to build a CaRT predictor for each attribute based on its predecessors. Thus, for each attribute X.sub.i visited, there are two possible scenarios.

1. If X.sub.i has no parent nodes in G (i.e., node X.sub.i is a root of G), then greedy concludes that X.sub.i cannot be predicted and, consequently, places X.sub.i directly in the subset of materialized attributes X.sub.mat (Step 4).

2. Otherwise (i.e., .PI. (X.sub.i) is not empty in G), the CaRTBuilder 130 component is invoked to construct a CaRT-based predictor for X.sub.i (within the specified error tolerance e.sub.i) using the set of attributes chosen for materialization thus far, X.sub.mat (Step 6). (Note that possibly irrelevant attributes in X.sub.mat will be filtered out by the CaRT construction algorithm in CaRTBuilder 130.) Once the CaRT for X.sub.i is built, the relative cost of predicting X.sub.i (rather than materializing it) is estimated and compared against the benefit threshold .theta., and X.sub.i is chosen for prediction or materialization accordingly (Steps 7-8).

The greedy algorithm provides a simple, low-complexity solution to the CaRT-based semantic compression problem of the table compressor 100. Given an n-attribute table and Bayesian network G, it is easy to see that greedy always constructs at most (n-1) CaRT predictors during its traversal of G. This simplicity, however, comes at a price.

More specifically, greedy CaRT selection suffers from two shortcomings. First, selecting an attribute X.sub.i to be predicted based solely on its "localized" prediction benefit (through its predecessors in G) is a very myopic strategy, since it ignores the potential benefits from using X.sub.i itself as a (materialized) predictor attribute for its descendants in G. Such very localized decisions can obviously result in poor overall predictor selections. Second, the value of the "benefit threshold" parameter .theta. can adversely impact the performance of the compression engine and selecting a reasonable value for .theta. is not a simple task. A high .theta. value may mean that very few or no predictors are chosen, whereas a low .theta. value may cause low-benefit predictors to be chosen early in the search thus excluding some high-benefit predictors at lower layers of the Bayesian network.

EXAMPLE 1

Turning now to FIGS. 3A-3D, consider the Bayesian network graph defined over attributes X.sub.1, . . . , X.sub.4. Let the materialization cost of each attribute be 125. Further, let the prediction costs of CaRT predictors be as follows:

PredCost({X.sub.1}.fwdarw.X.sub.2)=75
PredCost({X.sub.1}.fwdarw.X.sub.3)=80
PredCost({X.sub.1}.fwdarw.X.sub.4)=125
PredCost({X.sub.2}.fwdarw.X.sub.3)=15
PredCost({X.sub.2}.fwdarw.X.sub.4)=80
PredCost({X.sub.3}.fwdarw.X.sub.4)=75

Suppose that .theta.=1.5. Since X.sub.1 has no parents, it is initially added to X.sub.mat (Step 4). In the next two iterations, since MaterCost (X.sub.2)/PredCost ({X.sub.1}.fwdarw.X.sub.2)=1.67>1.5 and MaterCost (X.sub.3)/PredCost ({X.sub.1}.fwdarw.X.sub.3)=1.56>1.5, X.sub.2 and X.sub.3 are added to X.sub.pred (Step 7). Finally, X.sub.4 is added to X.sub.mat since MaterCost (X.sub.4)/PredCost ({X.sub.1}.fwdarw.X.sub.4)=1<1.5. Thus, the overall storage cost of materializing X.sub.1 and X.sub.4, and predicting X.sub.2 and X.sub.3 is 125+75+80+125=405.
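The arithmetic of this example can be replayed with a short Python sketch of the greedy traversal. Several details are assumptions made for illustration: the network is taken to be the chain X1, X2, X3, X4 (the figures are not reproduced here), the cost of predicting from the materialized set is taken as the cheapest listed single-attribute predictor, and an attribute is predicted when the ratio of materialization cost to prediction cost reaches the threshold.

```python
MATER_COST = {"X1": 125, "X2": 125, "X3": 125, "X4": 125}
# PredCost({predictor} -> target) values from the example.
PRED_COST = {
    ("X1", "X2"): 75, ("X1", "X3"): 80, ("X1", "X4"): 125,
    ("X2", "X3"): 15, ("X2", "X4"): 80, ("X3", "X4"): 75,
}
# Assumed network structure (a chain); maps each node to its parents.
PARENTS = {"X1": [], "X2": ["X1"], "X3": ["X2"], "X4": ["X3"]}


def pred_cost(materialized, target):
    """Assumption: cheapest listed single-attribute predictor in X_mat."""
    costs = [PRED_COST[(m, target)] for m in materialized
             if (m, target) in PRED_COST]
    return min(costs) if costs else float("inf")


def greedy(order, theta):
    """Single roots-to-leaves pass; returns (X_mat, X_pred costs, total cost)."""
    mat, pred = [], {}
    for x in order:
        if not PARENTS[x]:          # a root cannot be predicted
            mat.append(x)
            continue
        cost = pred_cost(mat, x)
        if MATER_COST[x] / cost >= theta:
            pred[x] = cost          # predict x from materialized attributes
        else:
            mat.append(x)           # benefit too low: materialize x
    total = sum(MATER_COST[x] for x in mat) + sum(pred.values())
    return mat, pred, total


print(greedy(["X1", "X2", "X3", "X4"], theta=1.5))
# (['X1', 'X4'], {'X2': 75, 'X3': 80}, 405)
```

Note that the run never uses the much cheaper predictor {X2} for X3 (cost 15), because X2 has already been placed in the predicted set by the time X3 is visited.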

The MaxIndependentSet CaRT Selector

Turning now to FIG. 4, depicted is one embodiment of a procedure MaxIndependentSet ( ) CaRT-selection algorithm which alleviates some drawbacks of the greedy algorithm mentioned above. Intuitively, the MaxIndependentSet algorithm starts out by assuming all attributes to be materialized, i.e., X.sub.mat=X (Step 1), and then works by iteratively solving WMIS instances that try to improve the overall storage cost by moving the nodes in the (approximate) WMIS solution to the subset of predicted attributes X.sub.pred.

Consider the first iteration of a main while-loop (Steps 3-30). The algorithm MaxIndependentSet starts out by building CaRT-based predictors for each attribute X.sub.i in X based on X.sub.i's "predictive neighborhood" in the constructed Bayesian network G (Steps 5-7); this neighborhood function is an input parameter to the algorithm and can be set to either X.sub.i's parents or its Markov blanket in G. Then, based on the "predicted-by" relations observed in the constructed CaRTs, the algorithm MaxIndependentSet builds a node-weighted, undirected graph G.sub.temp on X with (a) all edges (X.sub.i, Y), where Y is used in the CaRT predictor for X.sub.i, and (b) weights for each node X.sub.i set equal to the storage cost-benefit of predicting X.sub.i (Steps 16-20). Finally, the algorithm MaxIndependentSet finds a (near-optimal) WMIS of G.sub.temp and the corresponding nodes/attributes are moved to the predicted set X.sub.pred with the appropriate CaRT predictors (assuming the total benefit of the WMIS is positive) (Steps 21-29).

Note that in Step 5, it is possible for mater neighbors (X.sub.i) to be .phi.. This could happen, for instance, if X.sub.i is a root of G and X.sub.i's neighborhood consists of its parents. In this case, the model M returned by a subprogram BuildCaRT is empty and it does not make sense for X.sub.i to be in the predicted set X.sub.pred. To ensure that X.sub.i always stays in X.sub.mat, PredCost (PRED(X.sub.i).fwdarw.X.sub.i) is set to .infin. if PRED (X.sub.i)=.phi., which causes X.sub.i's weight to become -.infin. in Step 20.
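A first iteration of this procedure can be sketched in Python on the costs of Example 1. The sketch makes several assumptions that are not stated in the text: the network is taken to be the chain X1, X2, X3, X4, the neighborhood function is the parent set, each CaRT uses exactly its parent as predictor, and the WMIS instance is solved by exhaustive enumeration instead of a near-optimal heuristic. The resulting figures describe this sketch only.

```python
from itertools import combinations

MATER_COST = 125
ATTRS = ["X1", "X2", "X3", "X4"]
# Assumed CaRTs: predicted attribute -> (its single parent predictor, PredCost).
CART = {"X2": ("X1", 75), "X3": ("X2", 15), "X4": ("X3", 75)}


def first_iteration():
    """Build G_temp, solve WMIS exhaustively, return (X_pred, total cost)."""
    weights, edges = {}, set()
    for x in ATTRS:
        if x in CART:
            predictor, cost = CART[x]
            weights[x] = MATER_COST - cost      # storage benefit of predicting x
            edges.add(frozenset((x, predictor)))
        else:
            weights[x] = float("-inf")          # root: empty predictor set
    best, best_weight = (), 0
    for size in range(1, len(ATTRS) + 1):
        for subset in combinations(ATTRS, size):
            if any(frozenset(p) in edges for p in combinations(subset, 2)):
                continue                        # two adjacent nodes: not independent
            weight = sum(weights[v] for v in subset)
            if weight > best_weight:
                best, best_weight = subset, weight
    predicted = set(best)
    total = sum(CART[x][1] if x in predicted else MATER_COST for x in ATTRS)
    return predicted, total


print(first_iteration())  # ({'X3'}, 390)
```

Under these assumptions the root X1 receives weight minus infinity and is never selected, and the single node X3 (benefit 110) outweighs the independent pair X2 and X4 (combined benefit 100).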

The WMIS solution calculated after this first iteration of the Algorithm MaxIndependentSet can be further optimized, since it makes the rather restrictive assumption that an attribute can only be predicted based on its direct neighborhood in G. For example, consider a scenario where G contains the directed chain {X,Y}.fwdarw.Z.fwdarw.W, and the attribute pair {X,Y} provides a very good predictor for Z, which itself is a good predictor for the value of W. Then, the initial WMIS solution can obviously select only one of these predic

