Trainable, extensible, automated data-to-knowledge translator. Patent Number: 7,096,210, from the United States Patent and Trademark Office (PTO)

Title: Trainable, extensible, automated data-to-knowledge translator

Abstract: A trainable, extensible, automated data-to-knowledge translator is described. One aspect of the present invention includes a computerized system having at least one repository to store user-specified rules that govern the processing of data by the computerized system and at least one processing module to process data according to the rules and to generate knowledge from the data. Another aspect of the present invention is a computerized method of translating data to knowledge. The computerized method includes providing user-specified rules to govern the behavior of a computerized system for translating data to knowledge, and processing data according to the rules to generate knowledge. A further aspect of the present invention is a computer readable medium having computer-executable instructions stored thereon for executing a method of translating data to knowledge. The computerized method comprises receiving data in an unstructured form, converting the data to a neutral form, processing data according to user-specified rules to translate the data from the neutral form to knowledge, and exporting the knowledge to a knowledge repository.

Patent Number: 7,096,210 Issued on 08/22/2006 to Kramer, et al.


Inventors: Kramer; Kevin M. (Coon Rapids, MN), Gaetjens; Steven C. (Brooklyn Park, MN), Voges; Harold C. (Shoreview, MN)
Assignee: Honeywell International Inc. (Morristown, NJ)
Appl. No.: 09/522,483
Filed: March 10, 2000


Current U.S. Class: 706/45 ; 706/11; 706/12
Current International Class: G06F 17/00 (20060101); G06N 5/00 (20060101)
Field of Search: 706/45,11,12 705/2


References Cited [Referenced By]

U.S. Patent Documents
5359509 October 1994 Little et al.
6236977 May 2001 Verba et al.
Foreign Patent Documents
62206628 Sep., 1987 JP
WO-9962002 Dec., 1999 WO

Other References

Apte, C., et al., "Towards Language Independent Automated Learning of Text Categorization Models", Proceedings of the Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Berlin, Germany, (Jul. 3-6, 1994), 23-30. cited by other.
Clearwater, S. H., et al., "RL4: A Tool for Knowledge-Based Induction", Proceedings of the IEEE International Conference on Tools for Artificial Intelligence, Los Alamitos, CA, (Nov. 6-9, 1990), 24-30. cited by other.

Primary Examiner: Hirl; Joseph P.
Attorney, Agent or Firm: Fredrick; Kris T.

Claims



What is claimed is:

1. A computerized system comprising: at least one repository to store user-specified rules that govern the processing of data by the computerized system; an import filter to receive source data and to generate, based on the user-specified rules, neutral data, wherein the import filter supports a variety of source data formats; at least one processing module to process the neutral data according to the user-specified rules and to generate knowledge from the neutral data; and a user interface to allow a user to create, modify, and delete the user-specified rules.

2. The computerized system of claim 1 further comprising a system executive to control the at least one repository and the at least one processing module.

3. The computerized system of claim 1 wherein the at least one processing module is a text extraction module.

4. The computerized system of claim 1 wherein the at least one processing module is a packet export module.

5. The computerized system of claim 1, wherein the import filter is to convert data into a neutral format.

6. The computerized system of claim 1 further comprising an export module to export the knowledge into a knowledge repository.

7. The computerized system of claim 1, wherein the at least one processing module is to generate a child packet from the at least one packet, the at least one processing module to assign a context stored in the at least one context field of the at least one packet to an at least one context field of the child packet.

8. A computerized system comprising: at least one repository to store user-specified rules that govern the processing of data by the computerized system; an import filter to receive source data and to generate at least one packet based on import filter rules of the user-specified rules, wherein the at least one packet comprises at least one context field that augments content meaning of the data, and wherein the import filter can support a variety of source data formats; and at least one processing module to process data in the at least one packet according to packet construction rules of the user-specified rules and to generate knowledge from the data in the at least one packet.

9. The computerized system of claim 8 further comprising a system executive to control the at least one repository and the at least one processing module.

10. The computerized system of claim 8, wherein the at least one processing module is configured to generate packets having a neutral data format based on the packet construction rules.

11. The computerized system of claim 8, wherein the data is a hierarchically structured data and wherein a packet construction rule of the packet construction rules is associated with a node of a data hierarchy for the hierarchically structured data.

12. A computerized system comprising: at least one repository to store user-specified rules that govern the processing of data by the computerized system; a filter to receive source data and to convert the source data into a neutral form and to store the converted data into at least one packet based on packet construction rules of the user-specified rules, wherein the filter can support a variety of source data formats; and at least one processing module to process data in the at least one packet according to text extraction rules of the user-specified rules and to generate knowledge from the data.

13. The computerized system of claim 12, wherein the data is an input text and wherein the at least one processing module is to identify user-specified text entities within the input text based on the text extraction rules.

14. The computerized system of claim 13, wherein the at least one processing module is to format the identified user-specified text entities based on the text extraction rules.

15. The computerized system of claim 12 further comprising a system executive to control the at least one repository and the at least one processing module.

16. A computerized system comprising: at least one repository to store user-specified rules that govern the processing of data by the computerized system; an import filter to receive source data and to generate at least one packet based on packet construction rules of the user-specified rules, wherein the at least one packet comprises at least one content field for storage of the source data and at least one context field that augments content meaning of the data, and wherein the import filter can support a variety of source data formats; and at least one processing module to process data in the at least one packet according to packet export rules of the user-specified rules and to generate knowledge from the data in the at least one packet.

17. The computerized system of claim 16 further comprising a system executive to control the at least one repository and the at least one processing module.

18. The computerized system of claim 16, wherein the at least one processing module is configured to map content of a packet to a field of a database table based on the packet export rules.

19. The computerized system of claim 16, wherein the at least one processing module is configured to map context of a packet to a field of a database table based on the packet export rules.

20. A computerized system comprising: at least one repository to store user-specified rules that govern the processing of data by the computerized system; an import filter to receive source data and to generate at least one packet based on packet construction rules of the user-specified rules, wherein the at least one packet comprises at least one context field that augments content meaning of the data, and wherein the import filter can support a variety of source data formats; and at least one processing module to process data in the at least one packet according to packet dispatch rules of the user-specified rules and to generate knowledge from the data in the at least one packet.

21. The computerized system of claim 20 further comprising a system executive to control the at least one repository and the at least one processing module.

22. The computerized system of claim 20, wherein the at least one processing module is configured to route a packet to at least one other processing module based on the packet dispatch rules.

23. The computerized system of claim 22, wherein the at least one processing module is configured to save a packet in a packet repository that is routed to the at least one other processing module.

24. A computerized method of translating data to knowledge, the computerized method comprising: providing user-specified rules to govern the behavior of a computerized system for translating data to knowledge; receiving data; generating at least one packet for storage of the data based on the user-specified rules, wherein the at least one packet comprises at least one context field that augments content meaning of the data; and processing data in the at least one packet according to the user-specified rules to generate knowledge, wherein ones of the user-specified rules are packet construction rules.

25. The computerized method of claim 24 further comprising processing data according to additional rules without modifying the existing processing components.

26. The computerized method of claim 24 wherein processing data according to the rules further comprises extracting text from the data.

27. The computerized method of claim 24 wherein processing data according to the rules further comprises exporting the knowledge to a knowledge repository.

28. The computerized method of claim 24 wherein processing data according to the rules further comprises converting data into a neutral format.

29. A computerized method of translating data to knowledge, the computerized method comprising: providing user-specified rules to govern the behavior of a computerized system for translating data to knowledge; receiving data, wherein the received data can be in one of a variety of supported formats; generating at least one packet for storage of the data based on the user-specified rules, wherein the at least one packet comprises at least one context field that augments content meaning of the data; and processing data in the at least one packet according to the user-specified rules to generate knowledge, wherein ones of the user-specified rules are packet dispatch rules.

30. The computerized method of claim 29, wherein processing the data according to the user-specified rules to generate knowledge comprises routing a packet to at least one processing module based on the packet dispatch rules.

31. The computerized method of claim 30, wherein processing the data according to the user-specified rules comprises saving a packet in a packet repository that is routed to the at least one processing module.

32. A computerized method of translating data to knowledge, the computerized method comprising: providing user-specified rules to govern the behavior of a computerized system for translating data to knowledge; receiving data, wherein the received data can be in one of a variety of supported formats; converting the data into a neutral form; storing the converted data into at least one packet based on the user-specified rules; and processing data according to the user-specified rules to generate knowledge, wherein ones of the user-specified rules are text extraction rules.

33. The computerized method of claim 32, wherein the data is an input text and wherein processing the data according to the user-specified rules to generate knowledge comprises identifying user-specified text entities within the input text based on the text extraction rules.

34. The computerized method of claim 33, wherein processing the data according to the user-specified rules to generate knowledge comprises formatting the identified user-specified text entities based on the text extraction rules.

35. A computerized method of translating data to knowledge, the computerized method comprising: providing user-specified rules to govern the behavior of a computerized system for translating data to knowledge; receiving data, wherein the received data can be in one of a variety of supported formats; generating at least one packet for storage of the data based on the user-specified rules, wherein the at least one packet comprises at least one content field for storage of the data and at least one context field that augments content meaning of the data; and processing data according to the user-specified rules to generate knowledge, wherein ones of the user-specified rules are packet export rules.

36. The computerized method of claim 35, wherein processing the data according to the user-specified rules to generate knowledge comprises mapping content of the packet to a field of a database table based on the packet export rules.

37. The computerized method of claim 35, wherein processing the data according to the user-specified rules to generate knowledge comprises mapping context of the packet to a field of a database table based on the packet export rules.

38. A computer readable medium having computer-executable instructions stored thereon for executing a method of translating data to knowledge, the computerized method comprising: receiving data in an unstructured form, wherein the unstructured form is one of a variety of supported unstructured forms; converting the data to a neutral form; processing data according to user-specified rules to translate the data from the neutral form to knowledge; and exporting the knowledge to a knowledge repository.

39. The computer readable medium of claim 38 wherein converting the data to a neutral form comprises applying user-specified packet construction rules to the data.

40. The computer readable medium of claim 38 wherein processing data according to user-specified rules comprises applying user-specified packet dispatch rules to the data to route the packets to a packet processing module.

41. The computer readable medium of claim 38 wherein processing data according to user-specified rules comprises applying user-specified text extraction rules to the data.

42. The computer readable medium of claim 38 wherein exporting the data to the knowledge repository comprises applying user-specified packet export rules to the data.
Description



FIELD OF THE INVENTION

The present invention is related to computer systems and in particular to computer systems to translate data to knowledge.

BACKGROUND OF THE INVENTION

Many products such as decision support systems require knowledge in order to make intelligent decisions. A decision support system is a computer-based system that combines knowledge, analytical tools, and models to aid a decision maker. A decision support system commonly includes a knowledge database or a knowledge repository. Knowledge is extracted from the knowledge database or repository and analyzed using the analytical tools and models in order to assist with decisions. In order to be useful to the decision support system, data must be analyzed, translated and organized into structured, meaningful knowledge before it is stored in the knowledge database.

Often, data is in the form of human readable documentation, which to the decision support system appears as unstructured, meaningless data. Data refers to information, raw facts, and the like. Data may exist in a variety of forms such as in paper documents or in digital documents. Data on its own has no meaning to a decision support system. For a decision support system to process data, the data must first be translated into a form that the decision support system can process.

As used herein, knowledge refers to information that can be processed by a decision support system. A collection of knowledge is referred to as a knowledge base or a knowledge repository. Even structured data formats such as the Standard Generalized Markup Language (SGML) or the Extensible Markup Language (XML) may be unsuitable for the decision support system, since not all of the needed knowledge may be tagged by markup. Human translation of data to knowledge is laborious, expensive, and error-prone, especially for data sources that are periodically updated. Special purpose knowledge base construction programs are often too inflexible to apply directly, or too costly to modify for new types of data and/or knowledge repositories.

What is needed is a way to convert unstructured, meaningless data, such as human consumable information, into structured, meaningful knowledge, i.e., machine consumable knowledge.

SUMMARY OF THE INVENTION

A trainable, extensible, automated data-to-knowledge translator is described. One aspect of the present invention includes a computerized system having at least one repository to store user-specified rules that govern the processing of data by the computerized system and at least one processing module to process data according to the rules and to generate knowledge from the data. Another aspect of the present invention is a computerized method of translating data to knowledge. The computerized method includes providing user-specified rules to govern the behavior of a computerized system for translating data to knowledge, and processing data according to the rules to generate knowledge. A further aspect of the present invention is a computer readable medium having computer-executable instructions stored thereon for executing a method of translating data to knowledge. The computerized method comprises receiving data in an unstructured form, converting the data to a neutral form, processing data according to user-specified rules to translate the data from the neutral form to knowledge, and exporting the knowledge to a knowledge repository.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A is a high level block diagram of a Data-to-Knowledge translator system according to one embodiment of the invention.

FIGS. 1B, 1C, and 1D are high level block diagrams of alternate embodiments of a D2K translator system.

FIG. 2 is a block diagram of an example embodiment of a physical architecture to implement the Data-to-Knowledge translator (D2K) system shown in FIG. 1.

FIG. 3 is a flow chart of a process of training the D2K translator according to one embodiment of the invention.

FIG. 4 is a flow chart of a process of parsing a data source to construct a knowledge base with the D2K translator according to one embodiment of the invention.

FIG. 5 is a more detailed diagram of one embodiment of a packet data structure for the packets shown in FIG. 2.

FIG. 6 shows example source data to be applied to an import filter of one embodiment of the present invention such as the import filter shown in FIG. 2.

FIG. 7 shows example import filter rules to be applied to the example source data of FIG. 6.

FIG. 8 shows an example embodiment of a first packet produced by applying the example import filter rules shown in FIG. 7 to the sample source data shown in FIG. 6.

FIG. 9 shows an example embodiment of a second packet produced by applying the example import filter rules shown in FIG. 7 to the sample source data shown in FIG. 6.

FIG. 10 shows an alternate example embodiment of the first packet shown in FIG. 8 produced by applying the example import filter rules shown in FIG. 7 to the sample source data shown in FIG. 6.

FIG. 11 illustrates example patent match specification rules for use by the packet dispatcher of FIG. 2.

FIG. 12 is sample input text that is to be processed by a text extraction module of FIG. 2.

FIG. 13 is a hierarchical representation of a collection of text extraction module rules.

FIG. 14 is sample input text in which the text extraction module identified the EquipmentNumber entities as defined by the example text extraction rules of FIG. 13.

FIG. 15 is sample input text in which the text extraction module identified the DocumentReference entities as defined by the example text extraction rules of FIG. 13.

FIG. 16 shows the token values for the DocumentReference production (as defined by the example text extraction rules of FIG. 13) sorted into bins.

FIG. 17 illustrates the sets of token values formed by performing a full expansion on the bins of FIG. 16.

FIG. 18 illustrates the field labels and values resulting from applying the matched production's Volume and Bookmark formats to the sets of values of FIG. 17.

FIG. 19 illustrates example embodiments of packets that are created from the field labels and values of FIG. 18.

FIG. 20 is sample input text in which the text extraction module identified the Fault entities as defined by the example text extraction rules of FIG. 13.

FIG. 21 illustrates the relationships between the TemFaultEntity packet and the TemDocumentReferenceEntity and the TemEquipmentNumberEntity packets.

FIG. 22 is a block diagram illustrating a database schema of a sample knowledge base to which packets are exported.

FIG. 23 is a graphical representation of the packet export rules which map example entities to example tables.

FIG. 24 is an example table after exporting the TemDocumentReferenceEntity packets of FIG. 19.

FIG. 25 shows five example tables after exporting the packets of FIG. 19.

FIG. 26 is an example embodiment of a user interface for the Web Executive.

FIG. 27 is a screen capture of one embodiment of an import filter editor.

FIG. 28 is one embodiment of an import filter editor's search dialog box.

FIG. 29 is one embodiment of an import filter editor's packet construction properties dialog box.

FIG. 30 is one embodiment of an import filter editor's "Next" menu.

FIG. 31 is one embodiment of an import filter editor's "Bookmarks" menu.

FIG. 32 is one embodiment of a modeless "Current Packet Information" window of the import filter user interface.

FIG. 33 is one embodiment of a processor and packet selection panel of a packet dispatcher user interface.

FIG. 34 is one embodiment of a match specification panel of a packet dispatcher user interface.

FIG. 35 is a "Match Specification Properties" dialog box of one embodiment of a packet dispatcher user interface.

FIG. 36 is one embodiment of a dialog bar of a text extraction user interface.

FIG. 37 is a screen capture of one embodiment of a rules panel.

FIG. 38 is a screen capture of one embodiment of an annotation panel.

FIG. 39 is a screen capture of one embodiment of a grid panel.

FIG. 40 is a screen capture of one embodiment of a knowledge repository schema import control.

FIG. 41 is a screen capture of one embodiment of a packet export selection panel.

FIG. 42 is a screen capture of one embodiment of a graphics panel.

FIG. 43 is a block diagram illustrating the relationships between the packets, packet dictionary, processing modules, and packet dispatch rules databases.

FIG. 44 is an entity relationship diagram of the schema for one embodiment of the text extraction rules database.

FIG. 45 illustrates two methods of representing a "one to many" relationship.

FIG. 46 is a block diagram of a computerized system in conjunction with which example embodiments of the invention may be implemented.

DESCRIPTION OF THE EMBODIMENTS

In the following detailed description of the preferred embodiments, reference is made to the accompanying drawings which form a part hereof, and in which are shown by way of illustration specific embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.

System Level Overview

FIG. 1A is a high level block diagram of a Data-to-Knowledge translator system 100 according to one embodiment of the invention. Due to the wide variation in data input formats and knowledge output requirements, the logical architecture of one embodiment of the Data-to-Knowledge (D2K) system of the present invention is a three-tiered system that isolates the data source input and the knowledge repository output as shown in FIG. 1A. The software components in tier one 102 import the source data 101 to a neutral format. Once the data is in a neutral format, the software components in tier two 104 analyze, organize, and process the data. Finally, the software components in tier three 106 export the processed data (knowledge) to a knowledge repository 108, where, if desired, further processing may occur. This three-tiered architecture maximizes the D2K system's extensibility by reducing the amount of development needed to apply the tool to a specific data format or an entirely different domain. Each tier is separated by well-defined interfaces so that internal changes to one tier do not necessitate changes to adjacent tiers. In addition to the three tiers, the D2K system contains a system executive 110, which provides global functions such as coordinating the activity within and between each one of the three tiers as well as interactions with the user and error reporting.
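The three-tier separation described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the names `NeutralRecord`, `Knowledge`, `import_tier`, `processing_tier`, `export_tier` and `run_pipeline` are hypothetical, and the "key: value" source format is an assumption chosen only to keep the example short.

```python
from dataclasses import dataclass

# All names below are hypothetical; the patent does not define this API.

@dataclass
class NeutralRecord:
    """Format-independent unit handed from tier one to tier two."""
    text: str

@dataclass
class Knowledge:
    """Structured result handed from tier two to tier three."""
    fields: dict

def import_tier(source: str) -> list[NeutralRecord]:
    # Tier one: hide the source format (assumed here: newline-separated text).
    return [NeutralRecord(line.strip()) for line in source.splitlines() if line.strip()]

def processing_tier(records: list[NeutralRecord]) -> list[Knowledge]:
    # Tier two: turn "key: value" records into structured fields,
    # skipping records that carry no key.
    out = []
    for rec in records:
        key, sep, value = rec.text.partition(":")
        if sep:
            out.append(Knowledge({key.strip(): value.strip()}))
    return out

def export_tier(items: list[Knowledge], repository: list) -> None:
    # Tier three: hide the repository format (assumed here: a plain list).
    repository.extend(item.fields for item in items)

def run_pipeline(source, repository,
                 importer=import_tier,
                 processor=processing_tier,
                 exporter=export_tier):
    # Plays a coordinating role loosely comparable to the system executive.
    exporter(processor(importer(source)), repository)

repo = []
run_pipeline("part: pump\nnote without a key\nfault: leak\n", repo)
print(repo)  # [{'part': 'pump'}, {'fault': 'leak'}]
```

Because each tier only sees the output type of the tier before it, a different `importer` or `exporter` can be passed to `run_pipeline` without changing `processing_tier`, which is the extensibility property the paragraph describes.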

However, the D2K system is not limited to a three tier system. In an alternate embodiment, one or more additional tiers may be added to the logical architecture of the D2K translator system 100 shown in FIG. 1A. In still another embodiment, the logical architecture can be scaled to operate as two tiers or even as a single tier.

FIG. 1B is a high level block diagram of an alternate embodiment of a D2K translator system 100. As shown in FIG. 1B, the D2K translator system 100 comprises at least one repository 114 to store user-specified rules that govern the processing of data by the computerized system 100. The D2K translator system 100 also comprises at least one processing module 112 to process data according to the rules and to generate knowledge from the data.

FIG. 1C is a high level block diagram of an additional embodiment of a D2K translator system 100. As shown in FIG. 1C, the D2K translator system 100 comprises at least one repository 114 and at least one processing module 112. The D2K system further comprises a system executive 110 to control the at least one repository and the at least one processing module.

FIG. 1D is a high level block diagram of another embodiment of a D2K translator system 100. As shown in FIG. 1D, the D2K translator system 100 comprises at least one repository 114, at least one processing module 112, and a system executive 110. The D2K translator system also comprises a user interface to allow a user to create, modify, and delete the user-specified rules.

Many data formats exist for the source data 101. Unstructured data may take many forms, including, but not limited to, documents, pictures, databases, diagrams, schematics, and the like. Since, in general, it is not always possible (or convenient) to convert data into a single format, the D2K system supports a variety of source data formats. The components in tier one 102 process data in its native format and convert the relevant information into a neutral format. In other words, the import tier (tier one) 102 isolates the details and intricacies of the source data format from the processing components in the processing tier (tier two) 104. The routines in tier two 104 analyze, organize and process the imported data. The processing routines 104 convert unstructured, meaningless data into structured, meaningful knowledge. The processing components use a variety of techniques such as regular expression search engines, natural language processing algorithms and graphics identification algorithms. The components in tier three 106 export knowledge to a knowledge repository 108. Just as data may reside in many source formats, knowledge may also be represented in several repository formats. Hence, the export tier 106 (tier three) isolates the details and intricacies of the knowledge repository format from the processing routines. In summary, the import tier 102 and export tier 106 components allow the processing tier 104 components to perform their task without having to consider the format of the data source or the knowledge repository.

Physical Architecture for an Example Embodiment

FIG. 2 is a block diagram of an example embodiment of a physical architecture to implement the D2K system shown in FIG. 1. The rectangles 202, 206, 212, 210, 216 shown in FIG. 2 represent transformation components. The transformation components receive some type of input and transform it to produce some type of output. The envelopes shown in FIG. 2 represent data structures referred to herein as "packets". Packets are the interface mechanism that the transformation components use to communicate with each other. The cylinders shown in FIG. 2 represent repositories that store a variety of information such as the D2K system's configuration, rules that govern the transformation components' behavior, training data, error messages, and the like. Finally, the people shown in FIG. 2 represent graphic user interfaces (GUIs). The user interacts via the GUIs to customize the D2K system's behavior.

The example physical architecture shown in FIG. 2 maps to the logical architecture shown in FIG. 1 as follows. The import filter 202 along with the import filter rules repository 204 comprise the import tier (tier one) shown in FIG. 1. The text extraction module 206 along with the text extraction rules repository 208 and the custom processing modules 210 comprise the processing tier (tier two) shown in FIG. 1. The packet export module 212 along with the packet export rules repository 214 comprise the export tier (tier three) shown in FIG. 1. The packet dispatcher 216 along with the packet dispatch rules repository 218, the GUIs and configuration modules (not shown) comprise the system executive shown in FIG. 1. Packets are the interface objects that facilitate communication between and within the three tiers shown in FIG. 1.

The data flow of the example embodiment of the D2K system 200 shown in FIG. 2 is as follows. The import filter 202 reads the data source 201 and discretizes it into packets according to the import filter rules 204. The import filter 202 then passes these packets to the packet dispatcher 216, which dispatches the packets to the appropriate processing modules according to the packet dispatch rules 218. The processing modules process packets. Some processing modules, such as the packet export module 212, are terminal processing modules that do not create additional packets. Other processing modules, such as the text extraction module 206, create several packets of finer resolution. These non-terminal processing modules pass the packets they create to the packet dispatcher 216, which, in turn, dispatches the packets to the appropriate processing modules, and the cycle continues. In one embodiment, at least one terminal processing module exports packets to the knowledge repository. This process is illustrated in FIG. 4. The process shown in FIG. 4 continues until the entire data source is parsed.
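The dispatch cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the packet layout (a dictionary with "type" and "content"), the packet types "Sentence" and "Word", and the function names are hypothetical and do not come from the figures.

```python
exported = []  # stands in for the knowledge repository (assumption)


def text_extraction(packet):
    """Non-terminal module: creates child packets of finer resolution."""
    return [{"type": "Word", "content": word} for word in packet["content"].split()]


def packet_export(packet):
    """Terminal module: exports the packet and creates no further packets."""
    exported.append(packet["content"])
    return []


# Hypothetical dispatch rules: packet type -> processing module.
DISPATCH_RULES = {"Sentence": text_extraction, "Word": packet_export}


def dispatch(packet):
    """Route a packet to its module; any children re-enter the dispatcher."""
    module = DISPATCH_RULES[packet["type"]]
    for child in module(packet):
        dispatch(child)


def import_filter(source):
    """Discretize the data source into packets (one packet per line here)."""
    for line in source.splitlines():
        dispatch({"type": "Sentence", "content": line})


import_filter("check valve\nreplace pump")
```

After the call, `exported` holds the four word-level values in source order, which mirrors the cycle continuing until the entire data source is parsed.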

System Components of an Example Embodiment

This section describes in more detail the following system components of the example embodiment shown in FIG. 2: packets, the packet dictionary, the packet factory, the import filter, the packet dispatcher, packet processing modules, the text extraction module, the packet export module, computer graphic metafile (CGM) processing modules and the message log.

Packets. The packets shown in FIG. 2 are generic collections of information. FIG. 5 is a more detailed diagram of the packet data structure shown in FIG. 2. A packet 500 is a data structure containing fields for a type, content, and context as shown in FIG. 5. The packet's type 502 is a name of an entity or object. The type is typically descriptive of the packet's content. The packet's content 504 consists of zero or more labels, each having zero or more associated values. Content labels are relevant attributes of the entity or object. Content values are typically processed by packet processing modules and/or exported to the knowledge repository. Packet context 506, on the other hand, consists of zero or more labels, each having zero or one associated value. Context values contain information that either augments the content meaning or describes how the content was generated. Like content values, context values can also be exported to the knowledge repository.

When a processing module creates a child packet, it usually copies the parent's context and assigns it to the child. In other words, the child inherits the parent's context. Consequently, if it is desirable to identify packets with a unique identifier, then one can store the identifier as context. Since the child packets will inherit their parent's unique identifier, a relationship between the parent and child packets will be created.

As an example, consider how one could represent a recipe as a packet. Since the packet type is descriptive of the packet content and the packet represents a recipe, the word "recipe" is an obvious choice for the packet's type. The relevant attributes of all recipes are a name, a list of ingredients, and preparation/cooking instructions. Consequently, our recipe packet will contain name, ingredients and instructions content. The name content will contain a single value, the recipe's name. The ingredients and instructions content, on the other hand, will contain multiple values. In other words, each ingredient and instruction step will be stored as separate values of the ingredients and instructions content respectively. Finally, information such as the number of servings, the number of calories per serving and nutritional information could be stored as context.
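A minimal Python sketch of the packet structure and of context inheritance follows. The class name `Packet`, the method `new_child`, and all sample values (the recipe name, ingredients, and the "id" context label) are assumptions made for illustration; only the three fields (type, content, context) and their cardinalities come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Packet:
    type: str
    # Content: zero or more labels, each with zero or more values.
    content: Dict[str, List[str]] = field(default_factory=dict)
    # Context: zero or more labels, each with zero or one value.
    context: Dict[str, Optional[str]] = field(default_factory=dict)

    def new_child(self, type: str) -> "Packet":
        """Create a child packet that inherits a copy of this packet's context."""
        return Packet(type=type, context=dict(self.context))


# Hypothetical recipe packet; every value below is invented for illustration.
recipe = Packet(
    type="recipe",
    content={
        "name": ["Pancakes"],
        "ingredients": ["flour", "milk", "egg"],
        "instructions": ["Mix the ingredients.", "Fry the batter."],
    },
    context={"servings": "4", "id": "recipe-001"},
)

# A child packet carries the parent's identifier, relating the two packets.
step = recipe.new_child("instruction")
step.content["text"] = [recipe.content["instructions"][0]]
```

Copying the context dictionary, rather than sharing it, keeps later changes to the child from leaking back into the parent.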

In one embodiment, packets are created by an import filter such as the import filter 202 shown in FIG. 2. The import filter passes the packet to a packet dispatcher such as the packet dispatcher 216 shown in FIG. 2. The packet dispatcher routes packets to packet processing modules, which, in turn, may create additional packets whose information is more atomic. The dispatcher may also store the packet in a packet repository such as the packet repository 220 of FIG. 2. Ultimately, all atomic packets are routed to a terminal packet processing module such as the packet export module 212 of FIG. 2, which exports packets to a knowledge repository, or to a null processor which discards the packets.

Packet Dictionary. In one embodiment, a packet dictionary, such as the packet dictionary 222 shown in FIG. 2, maintains a master list of legal packets. Several D2K system functions use the packet dictionary to mitigate data corruption. For instance, in some embodiments the training user interfaces use the packet dictionary to limit list box selections in order to prohibit the user from generating invalid training rules, while in other embodiments the packet factory uses the packet dictionary to guarantee that only legal packets are created.

The packet dictionary is populated during the registration of the D2K system import filters and packet processing modules. In one embodiment, any D2K system component that generates packets registers a prototype of each type of packet it can create. Conceptually, a packet prototype is a packet without any values. In other words, the packet prototype specifies which content and context labels are legal for a given packet type.
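One way to picture a packet dictionary is as a map from packet type to the sets of legal content and context labels. The sketch below assumes that representation; the class name `PacketDictionary` and its methods are hypothetical, and the label names are borrowed from the Fault Code example used elsewhere in this description.

```python
class PacketDictionary:
    """Master list of legal packets, keyed by packet type (illustrative only)."""

    def __init__(self):
        self._prototypes = {}  # type -> (legal content labels, legal context labels)

    def register(self, packet_type, content_labels, context_labels):
        """Register a prototype: a packet description with labels but no values."""
        self._prototypes[packet_type] = (set(content_labels), set(context_labels))

    def is_legal(self, packet_type, content, context):
        """Return True if the packet uses only registered labels for its type."""
        if packet_type not in self._prototypes:
            return False
        legal_content, legal_context = self._prototypes[packet_type]
        return set(content) <= legal_content and set(context) <= legal_context


dictionary = PacketDictionary()
dictionary.register(
    "Fault Code",
    content_labels=["FIP K12 Reference"],
    context_labels=["Chapter Number", "Section Number", "Unique Key"],
)
```

A packet factory or a training user interface could call `is_legal` before creating a packet or offering a selection.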

Packet Factory. The purpose of the packet factory is to provide a set of packet-related services such as reading a packet from the packet repository and writing a packet to the packet repository. In addition, the packet factory provides the service of instantiating packets that were persisted in the packet repository and passing them to the packet dispatcher so that they can be routed to packet processing modules. Finally, the packet factory provides several services to build packets as well as a mechanism to clone the context of a packet in order to create a new child packet that inherits its parent's context.

Import Filter. In one embodiment, an import filter along with the import filter rules repository comprise tier one of FIG. 1. The purpose of an import filter, such as the import filter 202 shown in FIG. 2, is to handle the intricacies of the data format and hide these details from the processing modules. The import filter discretizes the relevant source data into packets, which are in the neutral data format of the D2K system. The mechanism by which this process occurs depends upon the data source. For example, for hierarchically structured data such as SGML or XML, packet construction rules may be associated with each node of the hierarchy. In another example embodiment, for data stored in Microsoft Word® documents, Visual Basic for Applications® scripts may be written to create packets.

An outline of the data's structure is captured and stored in a database. In the case of SGML documents, the outline is similar to the document type definition (DTD) in that it contains a hierarchy of elements and attributes. In one embodiment, however, the outline only contains the portion of the DTD that is realized in the actual document.

Once the data structure is outlined, a packet construction rule is applied to each node in the hierarchy according to one embodiment of the present invention. Packet construction rules allow the user to do the following with the data that corresponds to the node.
1. Ignore the data.
2. Ignore the data and create a new packet.
3. Create a new packet and insert the data into the packet as content.
4. Insert the data into the current packet as content.
5. Append the data into the current packet as content.
6. Insert the data into the current packet as context.
In one embodiment, depending upon the rule, the user may also specify the packet type, the content label, and/or the context label. Furthermore, the six rules may not be applicable to every node in the hierarchy. For example, it is invalid to insert data into the current packet as content at a node which does not have an ancestor that creates a packet. (Note that the user need not be cognizant of all of the restrictions since, in one embodiment, the import filter training user interface will not allow the user to violate any restrictions.)

As previously mentioned, once a data source's outline is stored in a database, a packet construction rule is associated with each node in the hierarchy. The type of rule is dependent upon the existing information in the import filter rules repository. If a given node already exists in the rules repository, then it is assigned the same rule as the existing node. If the node does not already exist in the rules repository, then it is assigned the "ignore the data" rule. In essence, the user is able to merge the structure of several data sources without losing past training, i.e., the application of rules to nodes. In addition, the user is given the ability to delete any nodes that exist in the rules repository but not in the recently outlined data source. These two mechanisms allow the user to store the packet construction rules for several data sources in one or more rule repositories while minimizing training requirements.
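The rule assignment just described amounts to a lookup with a default. The sketch below assumes that nodes are identified by a path string and that rules are stored as plain strings; both choices, the function name `merge_outline`, and the node "MSTFLTAB/OLDNODE" are hypothetical.

```python
IGNORE = "ignore the data"


def merge_outline(repository, outlined_nodes):
    """Assign a rule to each outlined node and list repository nodes not in the outline.

    Nodes already in the repository keep their trained rule; new nodes
    receive the "ignore the data" rule. Stale nodes are only reported,
    since deleting them is left to the user.
    """
    rules = {node: repository.get(node, IGNORE) for node in outlined_nodes}
    stale = sorted(set(repository) - set(outlined_nodes))
    return rules, stale


# Hypothetical repository contents from earlier training.
repository = {
    "MSTFLTAB/FLTROW": "create packet: Fault Code",
    "MSTFLTAB/OLDNODE": "insert as content",
}
rules, stale = merge_outline(repository, ["MSTFLTAB", "MSTFLTAB/FLTROW"])
```

Here `rules` keeps the trained rule for the known node and assigns the default to the new one, while `stale` names the node the user may choose to delete.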

In one embodiment, after the import filter is trained, the import filter is registered. Registering the import filter populates the packet dictionary with prototypes of packets that the import filter can create while it is parsing the data source. In one embodiment, after the user trains the import rules via the import filter user interface, the GUI automatically registers the import filter. Once the import filter is registered, the import filter may parse a data source by applying the packet construction rules to construct packets.

As an example, consider the sample SGML text shown in FIG. 6. FIG. 6 shows example source data to be applied to an import filter of the present invention. FIG. 7 shows example import filter rules to be applied to the example source data shown in FIG. 6. As shown in FIG. 7, elements are identified by a boxed letter `E`; whereas, attributes by a boxed letter `A`. The element and attribute names immediately follow the boxed letter and the import filter rules follow the ellipsis. The "ignore the data" rule is implied for items without an explicit rule.

The SGML import filter parses the sample text element by element applying the appropriate import filter rules (also referred to as "packet construction rules"). In the sample SGML text shown in FIG. 6, the import filter first parses the master fault table (MSTFLTAB) element. Since the rule in FIG. 7 that corresponds to this element is to ignore the element's data, the import filter proceeds to parse the fault row (FLTROW) element. The rule in FIG. 7 that corresponds to this element instructs the import filter to create a packet of type Fault Code. Next, the import filter parses the FLTROW element's attributes. The rules in FIG. 7 that correspond to the chapter number (CHAPNBR), section number (SECTNBR) and unique key (KEY) attributes direct the import filter to add three context label-value pairs to the Fault Code packet. The values of the CHAPNBR, SECTNBR and KEY attributes, i.e., "29", "24" and "EN29240001-00001001", become the values of the Chapter Number, Section Number and Unique Key context respectively as shown in the packet of FIG. 8. FIG. 8 is a snapshot of the Fault Code packet prior to the import filter parsing the fault description (FLTDESC) element of the SGML text of FIG. 6. Since the rules that correspond to the next three elements instruct the parser to ignore the element's data, the import filter proceeds to parse the fault description (FLTDESC) element.

The rule in FIG. 7 that corresponds to the FLTDESC element instructs the import filter to create a second packet of type Fault Symptom. Next, the import filter inserts the value of the category (CATEG) attribute into the Fault Symptom packet shown in FIG. 9 as Fault Symptom Type content. The value of the fault type (FLTYPE) attribute is then appended to the current value of the Fault Symptom Type content. Had the rule that corresponds to the FLTYPE attribute instructed the import filter to insert, rather than append, the attribute's data to the Fault Symptom Type content, the import filter would have inserted this data as a second value. After parsing the FLTDESC element and its attributes, the import filter parses the fault message (FLTMSG) and ATA ECAM (ATAECAM) elements. The import filter inserts the value of these elements into the Fault Symptom packet as Fault Symptom Text and ECAM ATA content as shown in FIG. 9. Next, the import filter encounters the FLTDESC end tag. Consequently, it clones the context of the Fault Symptom's parent packet, i.e., the Fault Code packet, and passes the Fault Symptom packet to the packet dispatcher. FIG. 9 is a snapshot of the Fault Symptom packet after the import filter encounters the fault description (FLTDESC) end tag in the sample SGML text of FIG. 6.

Upon returning from the packet dispatcher, the import filter parses the task reference (TASKREF) element of the SGML text of FIG. 6. The import filter inserts the value of this element in the Fault Code packet as FIP K12 Reference content as shown in FIG. 10. The import filter ignores the next element and then encounters the FLTROW end tag. Upon encountering this tag, the import filter passes the Fault Code packet to the packet dispatcher. Upon returning from the packet dispatcher, the import filter encounters the MSTFLTAB end tag and terminates. FIG. 10 is a snapshot of the Fault Code packet after the import filter encounters the fault row (FLTROW) end tag in the sample SGML text of FIG. 6. In summary, applying the packet construction rules of FIG. 7 to the sample SGML text of FIG. 6 produces a Fault Code packet as shown in FIG. 10 and a Fault Symptom packet as shown in FIG. 9.
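The walk-through above can be approximated with a short Python sketch that applies packet construction rules while traversing a document tree. Several things here are assumptions rather than a reproduction of FIG. 6 or FIG. 7: the sample is written as XML rather than SGML so that the standard library can parse it, the element text and the CATEG and FLTYPE values are invented, the rule table format is hypothetical, and "append" is assumed to join values with a single space. Only the element and attribute names, the three FLTROW attribute values, and the packet types and labels are taken from the text.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<MSTFLTAB>
  <FLTROW CHAPNBR="29" SECTNBR="24" KEY="EN29240001-00001001">
    <FLTDESC CATEG="ECAM" FLTYPE="WARNING">
      <FLTMSG>HYPOTHETICAL FAULT MESSAGE</FLTMSG>
    </FLTDESC>
    <TASKREF>HYPOTHETICAL TASK REFERENCE</TASKREF>
  </FLTROW>
</MSTFLTAB>
"""

# Hypothetical rule table: "TAG" or "TAG@ATTR" -> (action, packet type or label).
# Anything without an entry is ignored, as in FIG. 7.
RULES = {
    "FLTROW": ("new_packet", "Fault Code"),
    "FLTROW@CHAPNBR": ("context", "Chapter Number"),
    "FLTROW@SECTNBR": ("context", "Section Number"),
    "FLTROW@KEY": ("context", "Unique Key"),
    "FLTDESC": ("new_packet", "Fault Symptom"),
    "FLTDESC@CATEG": ("insert", "Fault Symptom Type"),
    "FLTDESC@FLTYPE": ("append", "Fault Symptom Type"),
    "FLTMSG": ("insert", "Fault Symptom Text"),
    "TASKREF": ("insert", "FIP K12 Reference"),
}


def apply_rule(packet, action, label, data):
    """Apply one construction rule to the current packet."""
    if action == "context":
        packet["context"][label] = data
    elif action == "insert":  # add the data as a new content value
        packet["content"].setdefault(label, []).append(data)
    elif action == "append":  # extend the current content value
        values = packet["content"].setdefault(label, [""])
        values[-1] = (values[-1] + " " + data).strip()


def walk(elem, stack, dispatched):
    """Parse element by element; dispatch each packet at its end tag."""
    action, label = RULES.get(elem.tag, ("ignore", None))
    created = action == "new_packet"
    if created:
        stack.append({"type": label, "content": {}, "context": {}})
    elif action != "ignore" and stack:
        apply_rule(stack[-1], action, label, (elem.text or "").strip())
    for attr, value in elem.attrib.items():
        a_action, a_label = RULES.get(f"{elem.tag}@{attr}", ("ignore", None))
        if a_action != "ignore" and stack:
            apply_rule(stack[-1], a_action, a_label, value)
    for child in elem:
        walk(child, stack, dispatched)
    if created:
        packet = stack.pop()
        if stack:  # clone the parent's context into the finished child packet
            packet["context"] = {**stack[-1]["context"], **packet["context"]}
        dispatched.append(packet)


dispatched = []  # stands in for calls to the packet dispatcher
walk(ET.fromstring(SAMPLE), [], dispatched)
```

With this input the Fault Symptom packet is dispatched first, carrying the cloned Chapter Number, Section Number and Unique Key context, and the Fault Code packet follows at the FLTROW end tag.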

Packet Dispatcher. A packet dispatcher, such as the packet dispatcher 216 of FIG. 2, is the hinge point between the front half (i.e., the import tier) and back half (i.e., processing tier and export tier) of the D2K system and functions as a packet `traffic cop`. In one embodiment, the packet dispatcher operates in three modes. In its normal mode of operation, the packet dispatcher routes packets to packet processing modules according to packet match specification rules, which are stored in a packet dispatch rules database, such as the Packet Dispatch Rules database 218 of FIG. 2. In a second mode, the user can disable packet dispatching and configure the packet dispatcher to merely save packets to a packet repository, such as the packet repository 220 of FIG. 2. In a third mode, the user can configure the packet dispatcher both to dispatch packets to packet processing modules and to store packets in the packet repository. This last mode is useful as a debugging aid and provides the user with a packet audit trail.

In one embodiment, the packet dispatcher supports two modes of sequencing between the import filter, the packet dispatcher, and the packet processing modules: single-threaded and multi-threaded. In single-threaded mode, the import filter generates a packet and passes it to the packet dispatcher, which passes it to an appropriate packet processing module. The packet processing module processes the packet and may, in turn, generate additional packets, which are referred to as child packets. Next, the packet processing module sequentially passes each child packet to the packet dispatcher, which passes it to an appropriate processing module. This cycle continues until all of the relevant information in the original packet has been processed and exported. At this point, the import filter is free to resume parsing the input data in order to generate another packet. In summary, in the single-threaded mode of operation, once the import filter generates a packet, it waits until this packet as well as all of its descendant packets are processed before it can resume its task of parsing the input data. In multi-threaded mode, the import filter does not have to wait for the dispatcher to process the packet before resuming its processing. The raw packets are queued in the packet dispatcher and processed serially by a second execution thread. This allows the import filter to work continuously. Multi-threaded operation is advantageous when the D2K system is hosted on a multi-processor computer system.
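The multi-threaded mode can be illustrated with a queue and a second thread from the Python standard library. This is a sketch of the general producer and consumer arrangement, not the patented implementation: the packets are plain strings, the "processing" is a stand-in, and the `None` sentinel used to stop the worker is an assumption.

```python
import queue
import threading

packet_queue = queue.Queue()  # raw packets queued in the dispatcher
processed = []


def dispatcher_worker():
    """Second execution thread: processes queued packets serially."""
    while True:
        packet = packet_queue.get()
        if packet is None:  # sentinel: the import filter has finished
            break
        processed.append(packet.upper())  # stand-in for real processing


worker = threading.Thread(target=dispatcher_worker)
worker.start()

# The import filter queues each raw packet and resumes parsing immediately,
# without waiting for the packet or its descendants to be processed.
for raw_packet in ["packet a", "packet b", "packet c"]:
    packet_queue.put(raw_packet)

packet_queue.put(None)
worker.join()
```

Because a single worker drains a first-in, first-out queue, packets are still processed in the order in which the import filter produced them.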

As mentioned previously, the packet dispatcher routes packets to packet processing modules according to packet match specification rules stored in the packet dispatch rules database. In one embodiment, the packet match specification rules map packet match specifications (referred to herein as matchspecs) to packet processing modules (referred to as packet processors). A matchspec consists of a packet type, an optional processing argument, and zero or more context label-value pairs. Matchspecs are similar to packets with the following two exceptions. Matchspecs contain a packet processor argument. Matchspecs do not contain content. In one embodiment, packet processing modules, except for the null processor module, are specified by their global unique identifier (GUID). Processing modules without a GUID are assumed to be null processor modules.

FIG. 11 illustrates example packet match specification rules for use by a packet dispatcher. In FIG. 11, three matchspecs 1102, 1104, 1106 are mapped to two packet processors 1108, 1110. The first matchspec 1102 indicates that the Null Module 1108 can process all packets of type Fault Topic. The second matchspec 1104 indicates that the Text Extraction Module 1110 with the argument "Faults" can process packets of type Fault Topic whose Fault Topic Type context is equal to FAULT ISOLATION. The third matchspec 1106 indicates that the Text Extraction Module 1110 with the argument "Possible Causes" can process all packets of type Possible Causes. In one embodiment, the following guidelines apply to matchspecs: A matchspec must map to one and only one packet processor. Several matchspecs may map to the same packet processors. Matchspecs with the same packet type may map to different packet processors as long as they have different context label-value pairs.

In order to determine which packet processors should process a packet, the packet dispatcher first determines which matchspecs match a packet. Then, from this list, the packet dispatcher determines the best matchspecs. In order for a matchspec to match a packet, two requirements must be met. First, the matchspec must be of the same type as the packet. Second, the matchspec's context, if it exists, must be present in the packet. The preceding statement does not imply that a packet has to have all of the same context as the matchspec in order for the matchspec to match. A packet that has context not present in a matchspec will still match the matchspec as long as the packet has the context specified by the matchspec. In other words, the packet's context must be a superset of the matchspec's context in order to match. Once the packet dispatcher determines a list of all of the matchspecs that match a packet, it chooses the matchspecs that have the most context as the best. Once the best matchspecs are determined, the packet dispatcher passes the packet and the corresponding processing arguments to the packet processors that are mapped to the best matchspecs.

For example, consider the illustration shown in FIG. 11. Packets of type Fault Topic with Fault Topic Type context equal to FAULT CONFIRMATION will only be matched by the first matchspec 1102. Subsequently, these packets will be dispatched to the Null Module 1108. Packets of type Fault Topic with Fault Topic Type context equal to FAULT ISOLATION will be matched by both the first matchspec 1102 and second matchspec 1104. The packet dispatcher will dispatch this packet to the Text Extraction Module 1110 with an argument of "Faults" since the second matchspec 1104 has more context label-value pairs than the first matchspec.
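The matching and best-match selection can be sketched as follows. The three matchspecs mirror those described for FIG. 11; representing them as dictionaries, and the function names `matches` and `best_matchspecs`, are assumptions made for illustration.

```python
def matches(matchspec, packet):
    """A matchspec matches when the types agree and the packet's context
    is a superset of the matchspec's context label-value pairs."""
    if matchspec["type"] != packet["type"]:
        return False
    context = packet["context"]
    return all(
        label in context and context[label] == value
        for label, value in matchspec["context"].items()
    )


def best_matchspecs(matchspecs, packet):
    """Among the matching matchspecs, keep those with the most context."""
    candidates = [m for m in matchspecs if matches(m, packet)]
    if not candidates:
        return []
    most = max(len(m["context"]) for m in candidates)
    return [m for m in candidates if len(m["context"]) == most]


MATCHSPECS = [
    {"type": "Fault Topic", "context": {},
     "processor": "Null Module", "argument": None},
    {"type": "Fault Topic", "context": {"Fault Topic Type": "FAULT ISOLATION"},
     "processor": "Text Extraction Module", "argument": "Faults"},
    {"type": "Possible Causes", "context": {},
     "processor": "Text Extraction Module", "argument": "Possible Causes"},
]

isolation = {"type": "Fault Topic",
             "context": {"Fault Topic Type": "FAULT ISOLATION", "Chapter Number": "29"}}
confirmation = {"type": "Fault Topic",
                "context": {"Fault Topic Type": "FAULT CONFIRMATION"}}
```

The isolation packet matches the first two matchspecs and is routed by the second, which has more context; the confirmation packet matches only the first and goes to the Null Module. The extra Chapter Number context on the isolation packet (a hypothetical value) does not prevent a match.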

Packet Processing Modules. The purpose of packet processing modules, or packet processors as they are also referred to, is to analyze, organize and process packets. In one embodiment, packet processors may be classified into two groups: generic packet processors and custom packet processors. Generic packet processors are those that will likely be used regardless of the data source. Custom packet processors, on the other hand, are data source specific. In addition, packet processors may also be categorized as terminal or non-terminal. Terminal packet processors are packet consumers. They process packets but do not generate child packets. Non-terminal packet processors are packet producers. They process packets and generate child packets.

In one embodiment of the invention, there are three generic packet processors: a text extraction module, a packet export module, and a null module. The text extraction module and packet export module will be discussed in detail in the following sections. The null processor is a terminal packet processor. The null processor does not process packets. Its purpose is simply to consume packets. In one embodiment, the null processor is also unique in that it does not have an implementation. The packet dispatcher effectively performs its function. Instead of routing packets to a physical null processor, the packet dispatcher simply destroys them.

In one embodiment, before packet processors can analyze, organize and process packets, they are registered. Packet processor registration accomplishes two things. First, a record, which corresponds to the packet processor, is inserted into the processing module repository if one does not already exist. Second, the prototypes of packets, which the packet processor may produce, are registered in the packet dictionary. The first function makes other components, such as the packet dispatcher, aware of the packet processor itself. The second function makes other components aware of the packets that the packet processor may produce.

Text Extraction Module. A text extraction module (TEM), such as text extraction module 206 of FIG. 2, is a generic, non-terminal packet processor. The purpose of the text extraction module is to identify and format user-specified text entities. In one embodiment related to the directed maintenance domain, examples of text entities include document references, part numbers, observations and faults. As with all packet processors, the TEM is passed a packet and a processor argument. The processor argument specifies which collection of text extraction rules should be applied to the input text. The input text comprises the values of the user specified packet content.

The TEM performs the following acts when processing a packet. First, the TEM identifies the entities specified by the text extraction rules. Second, the TEM formats the entities according to the text extraction formatting rules. Finally, the TEM outputs one or more packets for each entity it has identified and formatted.

The TEM identifies entities as follows. First, the TEM performs a lexical analysis on the input text in order to transform the input text into a list of tokens. Tokens are specified by one or more extended regular expressions or by a previously specified entity. The specification of tokens, however, does not need to be exhaustive. The user does not need to specify regular expressions for text that does not directly contribute to the identification of an entity. Hence, tokenization is performed in two steps. The TEM finds all of the tokens that the user specified and then creates default tokens by applying user specified filters to the text between user specified tokens. Once the input text has been tokenized, the TEM performs a second lexical analysis on the tokenized input text in order to identify entities. Entities are specified by one or more productions. Productions are extended regular expressions whose atomic unit is a token. In summary, the entity identification process is a two pass lexical analysis. The first pass converts the input text to a list of tokens via extended regular expressions of characters. The second pass identifies entities in the tokenized input text via extended regular expressions of tokens.

Consider the sample input text of FIG. 12 and the entities, tokens, and regular expressions of FIG. 13. FIG. 12 is sample input text that is to be processed by a text extraction module. FIG. 13 is a hierarchical representation of a collection of text extraction module rules. The boxed letters in FIG. 13 identify the item in the rule hierarchy. The letter `C` indicates the item is a collection, `E` an entity, `T` a token, `R` a regular expression, `P` a production, and `F` a format. For example, to identify EquipmentNumber entities in the sample text of FIG. 12, the TEM first finds the tokens defined in the EquipmentNumber entity. In this example, the EquipmentNumber entity only defines one token, an EquipNum with a single regular expression [A-Z][0-9]{2,} that matches a letter followed by two or more digits. In the sample text of FIG. 12, the TEM identifies three EquipNum tokens: W121, W122 and W123. Next, the TEM filters the remaining text using the character filter specified on the EquipmentNumber entity item. In this example the character filter contains a space ` `, an open parenthesis `(`, a closed parenthesis `)`, and a comma `,`. The filter is applied by removing the leading and lagging characters, which are in the filter, from the remaining blocks of text. After applying the filter, the remaining blocks of text become default tokens.

For example, in FIG. 12 consider the text between the EquipNum tokens W122 and W123, i.e., a comma and a space. Since both of these characters are in the filter, they are removed and consequently no default token is made between these EquipNum tokens. On the other hand, in FIG. 12 consider the text between the W123 EquipNum token and the end of the input text, i.e., ") (DOC 21-51-11,-22,-33).". The TEM first removes the leading characters that are in the filter. Since the first character is a closed parenthesis (a character that is in the filter) it is removed and the next character is examined. Since it is a space, it is also removed. This continues until the text extraction module encounters a character that is not in the filter. The first such character is the letter `D`. The TEM then removes the lagging characters that are in the filter. Since the last character is a period, a character that is not in the filter, the character remains and the search terminates. If the last character were in the filter, the TEM would remove this character and examine the second to last character. This process would continue until the TEM encountered a character that was not in the filter. Since the text "DOC 21-51-11,-22,-33)." remains after filtering the leading and lagging characters, it becomes a default token. Table 1 below provides a list of EquipNum tokens and default tokens resulting from the text extraction module performing a lexical analysis to identify EquipmentNumber entities in the sample text.

TABLE 1. A list of tokens and their values.

Token     Value
Default   IF THE PROBLEM CONTINUES, REPLACE THE L (R, C) WIDGET
FIN       W121
FIN       W122
FIN       W123
Default   DOC 21-51-11, -22, -33).

At this point, the sample input text is tokenized into five tokens: Default, FIN, FIN, FIN, and Default. Next, the TEM performs a second lexical analysis to find the EquipmentNumber entity's productions. In this example, there is only one production, FIN+, which matches one or more EquipNum tokens. (FIN is the abbreviation of EquipNum.) Consequently, the text extraction module finds one entity, the three EquipNum tokens as shown in FIG. 14.
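The two-pass lexical analysis for the EquipmentNumber entity can be sketched with Python's `re` module. The sample sentence below is reconstructed from the token values in Table 1 and is an assumption, since FIG. 12 is not reproduced here. Encoding each token type as a single letter so that the production FIN+ can be run as an ordinary regular expression is one possible technique, not necessarily the one used by the TEM.

```python
import re

# Assumed reconstruction of the FIG. 12 sample text from Table 1.
TEXT = ("IF THE PROBLEM CONTINUES, REPLACE THE L (R, C) WIDGET "
        "(W121, W122, W123) (DOC 21-51-11, -22, -33).")

EQUIP_NUM = re.compile(r"[A-Z][0-9]{2,}")  # a letter followed by two or more digits
FILTER_CHARS = " (),"                      # space, parentheses and comma


def tokenize(text):
    """Pass 1: find user-specified tokens, then build default tokens
    from the filtered text between them."""
    tokens, pos = [], 0

    def add_default(block):
        block = block.strip(FILTER_CHARS)  # remove leading and lagging filter characters
        if block:
            tokens.append(("Default", block))

    for match in EQUIP_NUM.finditer(text):
        add_default(text[pos:match.start()])
        tokens.append(("FIN", match.group()))
        pos = match.end()
    add_default(text[pos:])
    return tokens


def find_entities(tokens):
    """Pass 2: match the production FIN+ against the token sequence."""
    codes = "".join("F" if kind == "FIN" else "D" for kind, _ in tokens)
    return [
        [value for _, value in tokens[m.start():m.end()]]
        for m in re.finditer(r"F+", codes)
    ]


tokens = tokenize(TEXT)
entities = find_entities(tokens)
```

For this input, `tokens` contains the five tokens listed in Table 1 and `entities` contains a single entity made up of the three EquipNum values.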

Once an entity is identified in the input text, it is formatted into one or more fields. The entity is then packaged as a packet and sent to the packet dispatcher. The TEM formats entities as follows. First, the TEM puts the matched production's tokens into bins according to their type. Second, the TEM performs a full or level expansion on the tokens in the bins. Third, the TEM creates a field for each of the matching production's formats. Finally, the TEM creates a packet and inserts the fields into the packet as content.

Again, let us consider the sample input text and the rules of the DocumentReference entity. Upon applying the two-level lexical analysis, the TEM identifies one ChapterSection token (CS), three Subject tokens (SUB), and two Separator tokens (SEP) as shown in FIG. 15. In addition, the TEM identifies one entity that matches the second production, VOL? CS SS (SEP SS)*, as shown in FIG. 15. The six tokens of the matched production are CS, SUB, SEP, SUB, SEP, and SUB. The TEM now puts the values of the listed tokens into bins according to their type. In one embodiment, prior to placing a value into a bin, the TEM first verifies that the bin has room.

FIG. 16 shows the token values for the DocumentReference production (as defined by the example text extraction rules of FIG. 13) sorted into bins. For example, the value of a first token is "21-51" and its type is ChapterSection (CS). Since the CS token definition did not specify a bin depth, the depth is assumed to be infinite. Consequently, the value "21-51" is inserted into the CS bin 1602. The value of a second token is "-11" and its type is Subject (SUB). Since the Subject's bin depth is also infinite, the value "-11" is inserted into the SUB bin 1604. The value of a third token is "," and its type is Separator (SEP). Since the Separator's bin depth is zero, as indicated in FIG. 13, the TEM does not insert this token's value into the SEP bin 1606. To do so would cause the number of values in the bin to exceed its depth. After the TEM attempts to put all of the token values into bins, it then checks if any bin is empty. If a bin is empty and its corresponding token rule specifies a default value, the default value is inserted into the bin. In the current example, the Volume (VOL) and Separator (SEP) bins 1600, 1606 are empty. Since the VOL token specifies the default value "DOC", "DOC" is inserted into the VOL bin 1600 as shown in FIG. 16.
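The binning step can be sketched in Python as follows. This is a hedged illustration, not the rules of FIG. 13 themselves. The dictionary layout, the function name `fill_bins`, and the infinite depth given to VOL are assumptions. The text names only the first three token values ("21-51", "-11", ","); the remaining values used here ("-22", "-33", and a second ",") are assumed from the default token in Table 1.

```python
import math

# Hypothetical encoding of the token rules: bin depth and optional default.
rules = {
    "VOL": {"depth": math.inf, "default": "DOC"},  # depth is an assumption
    "CS": {"depth": math.inf, "default": None},    # depth unspecified: infinite
    "SUB": {"depth": math.inf, "default": None},
    "SEP": {"depth": 0, "default": None},          # depth zero: values discarded
}

# Tokens of the matched production: CS, SUB, SEP, SUB, SEP, SUB.
matched = [
    ("CS", "21-51"), ("SUB", "-11"), ("SEP", ","),
    ("SUB", "-22"), ("SEP", ","), ("SUB", "-33"),  # assumed values
]


def fill_bins(tokens, rules):
    """Sort token values into bins by type, honoring each bin's depth,
    then put the default value into any bin that is still empty."""
    bins = {name: [] for name in rules}
    for token_type, value in tokens:
        # Verify that the bin has room before inserting the value.
        if len(bins[token_type]) < rules[token_type]["depth"]:
            bins[token_type].append(value)
    for name, rule in rules.items():
        if not bins[name] and rule["default"] is not None:
            bins[name].append(rule["default"])
    return bins


bins = fill_bins(matched, rules)
```

Under these assumptions the SEP bin stays empty because its depth is zero and it has no default, while the empty VOL bin receives "DOC".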

After the token values of the matched production are inserted into bins as shown in FIG. 16, the TEM performs a level or full expansion on the bin contents. FIG. 17 lists the sets of token values formed by performing a full expansion on the bins of FIG. 16. During a level expansion, the TEM groups the ith value of each bin into a set, where index i ranges from 1 to the maximum number of values in any bin. The previous statement only applies to bins that contain multiple values. For bins that contain a single value, the first value is grouped into each set. If the values of multi-valued bins with unequal numbers of values are expanded, then some sets will be missing values. During a full expansion, the TEM groups all combinations of the values into sets. In the curr
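The difference between the two expansions described above can be shown with a short Python sketch. It rests on assumptions not stated in the text: bins are modeled as a dictionary of lists, empty bins are skipped, and the function names `level_expansion` and `full_expansion` are hypothetical. The SUB values "-22" and "-33" are likewise assumed from the default token in Table 1.

```python
from itertools import product


def level_expansion(bins):
    """Group the i-th value of each multi-valued bin into set i. A
    single-valued bin contributes its value to every set. Shorter
    multi-valued bins leave later sets without a value for that bin."""
    filled = {name: values for name, values in bins.items() if values}
    count = max((len(values) for values in filled.values()), default=0)
    sets = []
    for i in range(count):
        group = {}
        for name, values in filled.items():
            if len(values) == 1:
                group[name] = values[0]
            elif i < len(values):
                group[name] = values[i]
        sets.append(group)
    return sets


def full_expansion(bins):
    """Group every combination of bin values into a set."""
    filled = {name: values for name, values in bins.items() if values}
    names = list(filled)
    combos = product(*(filled[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Bins modeled on the DocumentReference example (SUB values partly assumed).
doc_bins = {"VOL": ["DOC"], "CS": ["21-51"],
            "SUB": ["-11", "-22", "-33"], "SEP": []}
doc_sets = full_expansion(doc_bins)

# A made-up pair of multi-valued bins with unequal sizes.
uneven = {"A": ["a1", "a2"], "B": ["b1", "b2", "b3"]}
level_sets = level_expansion(uneven)
full_sets = full_expansion(uneven)
```

For `doc_bins`, only one bin holds multiple values, so both expansions yield the same three sets. For `uneven`, the level expansion yields three sets, the last of which has no value for A, while the full expansion yields all six combinations.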

