Title: Methodology for creating and maintaining a scheme for categorizing electronic communications
Abstract: Supervised learning is used to develop a computer program that can automatically route or respond to electronic communications assuming the existence of an appropriate formal scheme for categorizing incoming electronic communications. A method is described by which such a categorization scheme for electronic communications can be constructed. The method is based on an analysis of factors having an impact on the categorization scheme from both the business domain and the technology domain. The problem solved by this method is a new one that is only now emerging as automated methods of routing communications based on supervised learning are becoming feasible. Among other uses, this method may be practically employed as a disciplined way of carrying out consulting engagements that call for setting up and maintaining categorization schemes for electronic communications.
Patent Number: 6,925,454 Issued on 08/02/2005 to Lam, et al.
Inventors: Lam; Kathryn K. (Ridgefield, CT); Oles; Frank J. (Peekskill, NY)
Assignee: International Business Machines Corporation (Armonk, NY)
Appl. No.: 733946
Filed: December 12, 2000
Current U.S. Class: 706/45; 705/8; 709/206
Intern'l Class: G06F 017/60
Field of Search: 705/8; 706/45; 709/206
References Cited
U.S. Patent Documents
5377354 | Dec., 1994 | Scannell et al.
5666490 | Sep., 1997 | Gillings et al.
5751960 | May, 1998 | Matsunaga
5802253 | Sep., 1998 | Gross et al.
5878230 | Mar., 1999 | Weber et al.
5878398 | Mar., 1999 | Tokuda et al.
6038541 | Mar., 2000 | Tokuda et al.
6401073 | Jun., 2002 | Tokuda et al.
6714967 | Mar., 2004 | Horvitz
6751600 | Jun., 2004 | Wolin
Foreign Patent Documents
409190447 | Jul., 1997 | JP
0200/0113064 | Apr., 2000 | JP
Other References
Youngjoong et al., "Automatic Text Categorization by Unsupervised Learning"; Department
of Computer Science, pp. 453-45.
William et al., "Context-Sensitive Learning Methods for Text Categorization"; Apr.
1999; ACM Transactions on Information Systems, vol. 17, No. 2; pp. 141-173.
Primary Examiner: Jeanty; Romain
Attorney, Agent or Firm: Whitham, Curtis & Christofferson, P.C., Kaufman; Stephen C.
Claims
1. A computer implemented method for categorizing incoming electronic communications
using a supervised machine learning component, and for factoring an organization's
business domain into the technology domain to enable an acceptable automated response
and routing scheme, said method comprising the steps of:
(a) analyzing the business domain;
(b) determining an approach to machine learning using a program or an algorithm
for inducing a categorizer using supervised learning, the categorizer being generated
from training data comprising a set of examples of the type of electronic communications
to be classified;
(c) collecting existing data of representative examples of electronic communications
and inventories of personnel skills, business processes, workflows, and business
missions;
(d) analyzing the collected data for determining one or more attributes of said
electronic communications selected from the group consisting of complexity, vagueness,
and uniqueness to be expected in the type of communications to be categorized,
as well as the relative numbers of electronic communications having a particular
attribute of said one or more attributes, and for determining a technical structure
of the communications relevant to categorization, and factoring the inventories
of personnel skills, business processes, workflows, and business missions collected
to determine what must be done with each electronic communication, and by whom;
(e) defining a categorization scheme;
(f) labeling examples of electronic communications with categories from the categorization
scheme for use both as training data to be used in the supervised learning step
and as test data;
(g) converting, using a computer, the labeled examples into a form suitable for
subsequent processing, both for purposes of machine learning and technical validation;
(h) performing, using said computer, machine-based supervised learning technology
to induce said categorizer for the categorization scheme; and
(i) validating the categorization scheme with respect to technical performance
and business requirements.
2. A method as recited in claim 1, further comprising the step of implementing
the categorization scheme by putting the categorization system into production.
3. A method as recited in claim 1, further comprising the steps of:
reviewing the categorization scheme to consider its adequacy in light of recent
distribution of communications; and
modifying the categorization scheme, as required, to accommodate new business
goals, or to keep in step with changes in the supervised learning technology, wherein
if it is determined to change the categorization scheme, steps (f) through (i)
are repeated.
4. A method as recited in claim 1, wherein the step of analyzing the business
domain further comprises steps:
analyzing anticipated content of relevant electronic communications;
analyzing business missions and goals;
evaluating skills of involved personnel;
analyzing the organization's workflow;
analyzing use of stored responses including determining whether answers have
been developed for frequently occurring questions; and
producing business requirements for use in the validation step using insight
gained by the analysis of the business domain.
5. A method as recited in claim 4, wherein the step of analyzing business missions
and goals further comprises:
reviewing a model of the business domain and determining success criteria and
measurements used to determine when the business is successful;
establishing turnaround times for the electronic communications to support mission
and goals of the business; and
determining a volume of electronic communications received daily and determining
a number of received communications that must be answered to meet the goals of
the business.
6. A method as recited in claim 4, wherein the step of analyzing the organization's
workflow, further comprises:
determining a workflow through the organization and routing performed on a category
by category basis;
determining if subject matter experts (SME) have been established for categories
of information; and
determining whether an automated or manual system for routing electronic communications
is being used.
7. A method as recited in claim 1, wherein the step of defining a categorization
scheme further comprises steps:
combining lists of categories in the group of categories related to business
mission groups, related to routing communications to specific individuals, communications
for which an automated response is feasible and desirable, and those related to
existing stored responses or stored templates for responses;
determining technically feasible categorization from the assembled categorization
scheme;
correlating knowledge of the technical structure of the communications with knowledge
of what kinds of features can actually be identified by the machine learning component;
and
eliminating or combining categories with few examples, if necessary.
8. A computer-implemented method for categorizing incoming electronic communications
using a supervised machine learning component, and for factoring an organization's
business domain into the technology domain to enable an acceptable automated response
and routing scheme, said method comprising the steps of:
selecting a machine learning component for the technology domain;
preparing a set of training data comprising representations of previously categorized
electronic communications, wherein the data in an electronic communication is textual
and each electronic communication has features, where a feature is related to textual
data;
analyzing the organization's business domain with respect to desired routing
and handling of contemplated message categories of electronic communications, the
analysis resulting in identification of tasks to be performed and actions to be
taken in response to a received electronic communication of a contemplated message
category, the analysis also resulting in identification of features relevant to
categorization of electronic communications;
determining skill levels of personnel corresponding to required tasks and actions
identified in the step of analyzing the organization's business domain;
extracting, using a computer, a new representation of each electronic communication
in the training set depending on a frequency of occurrence in the electronic communication
of features identified as relevant to the business domain;
inducing a pattern characterization when an electronic communication belongs
to a category, wherein the patterns are presented as rules or in another format corresponding
to the selected machine learning component; and
developing, using said computer, an initial categorization scheme based on areas
of the business domain receiving a greater quantity of electronic communications
or electronic communications of a relatively higher priority.
9. A method as recited in claim 8, wherein an electronic communication comprises
more than one part and each part of the electronic communication has corresponding
features related to a category and categorized based on each part in the inducing step.
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention generally relates to computer-assisted processing of electronic
communications and, more particularly, to a method for categorization of electronic
mail or other electronic communications based on the defined business processes,
personnel skills and workflows of an organization receiving the communications.
2. Background Description
Much of today's business is transacted by reading electronic mail (e-mail),
reports and other documents to gather the pertinent information to make informed
business decisions. Customers require information on products and services. Many
people are employed solely to read and respond to these customer requests and the
money required to pay for this headcount is a large part of departmental budgets.
As customer expectations are set to receive information ever more quickly as a
result of the ability to help oneself on the Internet, additional strain is placed
on resource-limited companies. Often, the same information is requested repeatedly.
Many companies would like to find a way to address their customers' requests
but at the same time they would like to reduce the time it takes to respond, to
reduce headcount associated with answering these questions and to provide a consistent
set of responses no matter the skill level of the people answering the requests
for information. It is also important to transfer the knowledge and decision points
from the experienced resources to new hires and less experienced personnel. Often,
the method presently used to categorize and respond to the incoming electronic
communications is incompletely defined, and the results of its use cannot be reliably repeated.
The current business practice for handling electronic communications is often
to leave processing decisions to the discretion of the individual answering the
communication. In general, the electronic communication enters a computer system
run by the organization, it is reviewed by a human, possibly the communication
is routed to other individuals, and when the proper respondent is reached, that
person sends a response. Thus, electronic communications are not necessarily simply
answered by the first person who sees them, although they might be, but they may
be routed to a more appropriate respondent. At some point in the process, the electronic
communication has a category attached to it to facilitate routing and response.
Whatever action is taken is largely determined by how an electronic communication
is categorized. The action taken is not entirely dependent on the assigned category
only because of the reliance on human oversight.
Additionally, the attached category can be useful when and if an analysis
of the workflow is undertaken. The categorization process is normally somewhat
unstructured, although there may be informal written guidelines as well as a collection
of response templates residing in a computer system that a person may use in handcrafting
a computer-assisted response. The categories used may be imprecise and ambiguous,
with the proper performance of the system depending on human intervention to resolve
any problems that arise.
Referring now to the drawings, and in particular to FIG. 1, there is shown
a high-level analysis of factors entering into current business practices for determining
how to categorize electronic communications. Current business practice in this
area has previously not been the subject of formal analysis. However, it is observed
by the inventors that the development of the manual categorization scheme 100
is governed by two main elements:
1. the anticipated content 102 of the incoming electronic communication; and
2. a workflow analysis 104 of how the business should deal with
various kinds of electronic communications. (Note: a workflow analysis determines
where and how a particular electronic communication will be handled. It is based
on the business mission 105, the skills of the people responding manually
to the electronic communication 107, as well as any previously developed
and stored responses 109.)
An ad hoc approach is currently used to develop the categorization scheme from
these two elements.
One way to automate the routing and/or response to e-mail is, broadly speaking,
to let a computer learn how to do the job of categorization. A specific means of
letting the computer learn to do this job is to employ techniques in the area of
machine learning called supervised learning. However, even with promising technology,
in the absence of the right categorization scheme, "let a computer learn how to
do the job" is only a slogan and not a solution. The best machine learning methods
in the world cannot work if the categorization scheme does not match both the technology
and the business needs simultaneously.
SUMMARY OF THE INVENTION
It is therefore an object of the invention to provide a methodology for creating
and maintaining a categorization scheme for electronic mail or other electronic
communications based on the defined business processes, personnel skills and workflows
of an organization receiving the communications.
According to the invention, a categorization scheme is to be implemented
using analysis of both the business domain of an organization and the technology
domain of a computer-implemented categorizer. Current practice does not involve
the disciplined integration of the two domains. Current practice also fails to
incorporate the careful, new analysis of the two domains, in relation to the problem
of categorizing electronic communications, that is found in this invention.
A first step is to analyze the business domain. The analysis of the business
domain comprises the steps of:
1. Analyze the anticipated content of relevant electronic communications. Review
the existing electronic communications to determine if the same questions are frequently
asked. If they are, determine if those questions can be answered the same way each time.
2. Analyze business missions and goals as follows: Review the business model
and determine the success criteria and measurements used to determine when the business
is successful. Establish turnaround times for the electronic communications to
support the business mission and goals. Determine the volume of electronic communications
that come in daily and how many have to be answered to meet the business goals.
3. Evaluate the skills of involved personnel. Determine whether the customer
service representatives (CSR) can answer the questions asked or provide the requested information
directly. If not, determine if they forward the questions to a more experienced
person or depend on the answer being provided to them.
4. Analyze the organization's workflow as follows: Determine the flow through
the organization and routing performed on a category by category basis. Determine
if subject matter experts (SME) have been established for the categories of information
the people receiving the questions can't answer. Determine whether there is currently
an automated or manual system for routing electronic communications.
5. Analyze the use of stored responses in the following manner: Determine whether
answers have been developed for frequently occurring questions. Determine whether
the CSRs add additional comments to stored responses before sending them to the customer.
6. Using the insight gained by the analysis of the business domain so far, produce
business requirements to be used in the later validation phase.
The second step is to decide on an approach to machine learning in the form of
a program or an algorithm that will be used to induce a categorizer using supervised
learning. In supervised learning, the categorizer is generated from training data,
which in this case will be a set of examples of the communications of the kind
to be classified. In order to produce a categorizer, the items of training data
will be labeled with categories from the category scheme produced by this invention.
The third step is to gather existing data. Foremost is the assembly of a pool,
as large as possible, of representative examples of electronic communications.
This data set will be needed in subsequent steps. Informal or ad hoc methods of
labeling or classifying electronic communications may already exist, so all of
these existing categorization schemes should be collected. Also, make inventories
of personnel skills, business processes, workflows, and business missions.
The fourth step is to analyze the data. Examples of the electronic communications
should be studied to gain an appreciation of the complexity, vagueness, and uniqueness
to be expected in the communications to be categorized, as well as the relative
numbers of various kinds of communications. The technical structure of the communications
should be ascertained. An example of this kind of structure is the presence of
special fields, e.g., containing a URL or a CGI query string, that may be likely
to be relevant to categorization. The inventories of personnel skills, business
processes, workflows, and business missions collected earlier provide the basis
for obtaining a complete understanding of what must be done with an electronic
communication, and by whom. In particular, both the extent to which different people
necessarily handle clearly different kinds of communications and the extent to
which a single person may handle communications of a variety of types should be
clearly understood. This understanding will determine the level of granularity
of the categorization scheme that is required by the structure of the business.
The fifth step is to define a categorization scheme. The first phase here is
to draw together lists of categories related to business groups, categories related
to routing communications to specific individuals, categories of communications
for which an automated response is feasible and desirable, and categories related
to existing stored responses or stored templates for responses. The assembled categorization
scheme should be then tempered by bringing to bear an analysis of what kinds of
categorization are technically feasible. Knowledge of the technical structure of
the communications should be correlated with knowledge of what kinds of features
can actually be identified by the machine learning component. This will lead to
conjectures about what kinds of categories are reasonable to consider for the categorization
scheme, in light of the fact that the categories for which supervised learning
is likely to be effective are those that are associated with distinctive vocabularies.
If two categories are so similar that it is hard to come up with words that will
frequently distinguish communications in one category from communications in the
other, then those categories should probably be amalgamated. Also, very general
categories are not likely to be good for supervised learning techniques, again
because of the lack of a distinctive vocabulary. Finally, the categorization scheme
should be considered in light of the amount of training data available. Categories
with very few examples (as a guideline, fewer than 30) should be eliminated
or combined with other categories because supervised learning is not likely to
produce a categorizer that performs well on those categories.
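The guideline on underpopulated categories can be checked mechanically once examples have been labeled. The following is a minimal sketch under stated assumptions, not part of the described method: the category names "Billing" and "Warranty", the function names, and the choice of merge target are hypothetical, and the threshold of 30 is only the guideline figure mentioned above.

```python
from collections import Counter

# Guideline figure from the text; treated here as a tunable assumption.
MIN_EXAMPLES = 30


def find_sparse_categories(labels, min_examples=MIN_EXAMPLES):
    """Return the categories that have fewer than min_examples labeled examples."""
    counts = Counter(labels)
    return sorted(cat for cat, n in counts.items() if n < min_examples)


def merge_categories(labels, merge_map):
    """Relabel examples, folding each sparse category into a chosen target."""
    return [merge_map.get(label, label) for label in labels]


# Hypothetical label distribution for illustration only.
labels = ["ThinkPad"] * 120 + ["Billing"] * 45 + ["Warranty"] * 12
sparse = find_sparse_categories(labels)
merged = merge_categories(labels, {"Warranty": "Billing"})
```

Which category a sparse one is merged into remains a business decision; the code only identifies candidates and applies a mapping chosen by the analyst.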
The sixth step is to label examples of electronic communications with categories
from the categorization scheme for use both as training data to be used in the
supervised learning step and as test data. The labeled examples should resemble
as closely as possible the data on which the induced categorizer will eventually
be used. However, if perfectly matching training data is unavailable, it is possible
to use instead other data that bears a close resemblance to the electronic communications
to be ultimately categorized.
The seventh step is to use a computer program to convert the labeled data into
a form suitable for subsequent processing, both for the purpose of machine learning
and technical validation.
The eighth step is to use a computer program based on the supervised learning
technology to induce a categorizer for the categorization scheme. This induction
of the categorizer may involve some experimentation involving tuning parameters
to improve performance. For instance, the particular algorithm used may use only
a small set of values as feature counts, but exactly how many values are used may
be settable. Similarly, specifying exactly what sections of a communication or document
are used in training may be settable. If it is technically feasible and judged desirable
(perhaps to compensate for known deficiencies in the training data), after the
supervised learning algorithm has induced the categorizer, manual modification
of the machine-generated categorizer may be done. For a rule-based categorizer,
such manual modification may be accomplished by adding additional rules to categorize
very important kinds of communications that were not adequately represented in
the training data.
The ninth step is to validate the categorization scheme with respect to technical
performance and business requirements. The technical performance criteria can be
evaluated using a data test set, possibly consisting of data held back from the
training set. This will involve using a computer program that uses the induced
categorizer to predict the categories to which electronic communications belong.
Also, in the context of the level of technical performance attained on various
categories, the performance level of the categorization scheme with respect to
business requirements must be judged. Finally, evaluate the overall performance
of the categorization scheme by exercising, on new data, the entire system for processing
electronic communications of which the categorization system is a part.
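As an illustration of the technical side of validation, per-category performance on a held-back test set might be tabulated as sketched below. The text does not name specific metrics; precision and recall are assumed here as common choices, and the function names and thresholds are hypothetical. The sketch also assumes a single category per communication for simplicity.

```python
def per_category_scores(gold, predicted):
    """Compute (precision, recall) per category from parallel label lists."""
    scores = {}
    for cat in set(gold) | set(predicted):
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == cat and p == cat)
        fp = sum(1 for g, p in pairs if g != cat and p == cat)
        fn = sum(1 for g, p in pairs if g == cat and p != cat)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[cat] = (precision, recall)
    return scores


def failing_categories(scores, min_precision, min_recall):
    """List categories whose scores fall below the assumed business thresholds."""
    return sorted(
        cat for cat, (p, r) in scores.items()
        if p < min_precision or r < min_recall
    )


# Toy test set: gold labels and the categorizer's predictions.
gold = ["A", "A", "B", "B"]
predicted = ["A", "B", "B", "B"]
scores = per_category_scores(gold, predicted)
weak = failing_categories(scores, min_precision=0.6, min_recall=0.6)
```

Categories that fall below the thresholds are candidates for relabeling, merging, or additional training data, subject to the business requirements produced during the business-domain analysis.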
The tenth step is to implement the categorization scheme by putting the categorization
system into production. The system should be monitored, documenting errors made
and successes achieved.
The final step is to review and modify the categorization scheme, as required.
The scheme should be reviewed regularly to consider its adequacy in the light of
the latest distribution of communications. It should be modified to accommodate
new business goals, and it may need to be changed to keep in step with changes
in the supervised learning technology. If the categorization scheme must be changed,
return to the sixth step to consider relabeling the data, and proceed from there.
BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing and other objects, aspects and advantages will be better understood
from the following detailed description of a preferred embodiment of the invention
with reference to the drawings, in which:
FIG. 1 shows a high-level analysis of factors entering into current business
practices for determining how to categorize electronic communications;
FIG. 2 illustrates a high-level analysis of additional factors that must be
considered in constructing a categorization scheme that works well for automated
response and routing of electronic communications using a categorizer constructed
using supervised learning;
FIG. 3 shows a flow chart of the steps to be followed for constructing a
categorization scheme suitable for use with supervised learning technology, according
to the present invention;
FIG. 4 illustrates the steps to be taken for analyzing the business domain;
FIG. 5 is a flow chart of a procedure for converting an electronic communication
to a form suitable for use as data for supervised learning, as well as for categorization
by a categorizer induced by a typical supervised learning algorithm;
FIG. 6 is a flow chart for a procedure for supervised learning of a categorizer
for a categorization scheme S using training data labeled with categories from S;
FIG. 7 is a flow chart for a procedure for supervised learning of a categorizer
for a single category C using training data labeled with categories from a categorization
scheme; and
FIG. 8 is a flow chart for a procedure for using a categorizer, Categorizer(S),
induced by supervised learning, to predict those categories in a categorization
scheme S to which an unlabeled electronic communication (document d in the flow
chart) belongs.
DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION
To develop the procedure for categorizing incoming electronic communications,
supervised learning technology can be based on decision trees or on logical rules
or on other mathematical techniques such as linear discriminant methods (including
perceptrons, support vector machines and related variants), nearest neighbor methods,
Bayesian inference, etc. Feature selection is included as part of supervised learning,
for the following discussion.
Referring again to the drawings, FIG. 2 illustrates a high-level analysis
of additional factors that must be considered in constructing a categorization
scheme that works well for automated response and routing of electronic communications
using a categorizer constructed using supervised learning. The categorization scheme
220 is the key to linking the business domain 230 and the technology domain 200
for electronic communications. The business domain 230 contributes a workflow
analysis 231, carried out in the context of the business mission 232 and an
inventory of the available skilled personnel 233. Existing stored responses 234
are collated and analyzed with respect to content and business function. The
technology domain 200 contributes the supervised learning engine and an analysis
210 (in terms of content, structure, and processing) of the incoming electronic
communication. The categorization scheme 220 brings the two domains together.
One aspect of the present invention is this analysis of the structure of the problem.
Within each domain, there are contributing factors which must be considered
before a practical categorization scheme is developed and implemented.
Supervised learning technology 200 requires a "training set" 211
of representations of previously categorized electronic communications to enable
a computer to induce patterns that allow it to categorize future incoming electronic
messages. Generally, there is also a "test set" 215 that is used to evaluate
whatever specific categorization procedure is developed. In academic exercises,
the test set is usually disjoint from the training set to compensate for the phenomenon
of overfitting. In practice, if the data set is small, the only way to get really
useful results may be to use all the available data in both the training set and
the test set.
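Where enough data exists to keep the two sets disjoint, the held-back test set can be produced with a simple shuffle and split. This is a minimal sketch; the function name, the 20% test fraction, and the fixed seed are assumptions chosen for illustration and are not prescribed by the text.

```python
import random


def split_examples(examples, test_fraction=0.2, seed=0):
    """Shuffle labeled examples and hold back a fraction as a disjoint test set."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in identifiers for ten labeled communications.
train, test = split_examples(range(10))
```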
At the outset, the electronic communications in both the training set and the
test set are represented in terms of numbers derived from counting occurrences
of features 213. The relationship between features for the purposes of supervised
learning and the text of a message has an important impact on the success of the
enterprise, so it has to be addressed, but it is not part of supervised learning
per se.
An example of an electronic communication might look like the following:
- FROM: joe-user@where-ever.com
- SUBJECT: TP 755 CD
- REFERRED-FROM: www.ibm.com
- TEXT: I erased the hard drive on my 755 CD and now I want to reload
windows. Any suggestions?
If a category or categories, such as
- ThinkPad,
were associated with the above communication, then one would have a data item
suitable to be an item of training data. For the purpose of inducing a categorizer,
the FROM field is irrelevant. In the SUBJECT field, after tokenization and stemming,
three tokens would be identified:
- TP 755 CD
and in the TEXT field, after tokenization and stemming, there would be 17 tokens:
- I erase the hard drive on my 755 CD and now want to reload window any suggestion
each occurring once, except for "I" which occurs twice. The REFERRED-FROM field
is different, being a URL, and so should be tokenized differently from the other
fields, yielding, after tokenization and stemming, a single token
- www.ibm.com
Note that the periods in the last case were not regarded as separators. Each
token in each section would be regarded as a distinct feature. Using a transparent
notation, 21 features are identified:
SUBJECT|TP      SUBJECT|755      SUBJECT|CD
TEXT|I          TEXT|erase       TEXT|the
TEXT|hard       TEXT|drive       TEXT|on
TEXT|my         TEXT|755         TEXT|CD
TEXT|and        TEXT|now         TEXT|want
TEXT|to         TEXT|reload      TEXT|window
TEXT|any        TEXT|suggestion
REFERRED-FROM|www.ibm.com
The counts associated with each feature would all be 1 except the count associated
with the feature TEXT|I would be 2. Hence, the vector
(1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
would represent this document,
with the understanding that the order of the counts corresponds to the order of
the features as listed above. In actuality, in processing the training data, the
system for categorizer induction would likely encounter thousands of features,
and each electronic communication would contain only a small number of them, so
that sparse vectors would represent documents. The reader should keep in mind that
the analysis just given is only meant to be an example of a typical case of feature
analysis. For instance, a sophisticated system might recognize "windows" in the context
of this communication as being the name of a family of operating systems, to be
recognized in this instance as an occurrence of a special token, e.g.,
- Microsoft Windows®
At any rate, the details of tokenization, stemming, section recognition, etc.,
could vary while still remaining in the spirit of this method.
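The worked example above can be reproduced with a short script. This is a sketch under stated assumptions rather than the tokenizer contemplated by the method: the NORMALIZE lookup table is a hypothetical stand-in for real stemming and case handling and covers only the words in this message, and the function and constant names are invented for illustration.

```python
import re
from collections import Counter

# Stand-in for a real stemmer and case normalizer; covers this example only.
NORMALIZE = {
    "erased": "erase",
    "windows": "window",
    "suggestions": "suggestion",
    "Any": "any",
}
IGNORED_FIELDS = {"FROM"}        # irrelevant for inducing a categorizer
URL_FIELDS = {"REFERRED-FROM"}   # kept whole: periods are not separators


def extract_features(message):
    """Map a message (field -> text) to counts of 'FIELD|token' features."""
    counts = Counter()
    for field, value in message.items():
        if field in IGNORED_FIELDS:
            continue
        if field in URL_FIELDS:
            tokens = [value.strip()]
        else:
            raw = re.findall(r"[A-Za-z0-9]+", value)
            tokens = [NORMALIZE.get(tok, tok) for tok in raw]
        for token in tokens:
            counts[f"{field}|{token}"] += 1
    return counts


message = {
    "FROM": "joe-user@where-ever.com",
    "SUBJECT": "TP 755 CD",
    "REFERRED-FROM": "www.ibm.com",
    "TEXT": "I erased the hard drive on my 755 CD and now I want "
            "to reload windows. Any suggestions?",
}
features = extract_features(message)
```

Storing only the features that actually occur, as the Counter does here, is one way to realize the sparse representation described above.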
When the data is based on text, as in an electronic communication 240,
the initial representations in terms of features are often too complicated for
a computer to handle. There are usually too many features, and some distillation
is needed. So, after the training set is prepared, a list of those features deemed
particularly relevant to categorization is typically extracted automatically. The
features in this list are referred to as the "selected features", and the process
of building the list is referred to as "feature selection". There is an issue in
regard to whether a single list of features, or a global dictionary, is created
during feature selection, or whether there is a separate list for each category,
referred to as local dictionaries. The resolution of this issue can depend on the
details of the supervised learning technique employed, but in applications related
to text, local dictionaries often give better performance. There are a variety
of criteria for judging relevance during feature selection. A simple one is to
use absolute or normalized frequency to compile a list of a fixed number of the
most frequent features for each category, providing for the fact that small categories
may be so underpopulated that the total number of features in them may be less
than the threshold. More sophisticated techniques involve the use of information-theoretic
measures such as entropy or the use of statistical methods such as principal component
analysis. The premise behind feature selection is that the occurrence of selected
features in incoming electronic communications will suffice for developing a sophisticated
pattern recognition system to assign one or more categories to the communication.
After feature selection, a new representation of each electronic communication
in the training data is then extracted in terms of how frequently each selected
feature occurs in that item. From these new representations, the computer induces
patterns that characterize when an electronic communication belongs to a particular
category. The term "pattern" is meant to be very general. These patterns may be
presented as rules or in other formats. Exactly what constitutes a pattern depends
on the particular machine learning technology employed. To use the patterns to
categorize incoming electronic communications, the newly arriving data must not
only undergo initial processing so as to obtain a representation in conformance
with the format of the training data 211, but it must then undergo further
re-representation based on the list of selected features 213, so that it
is finally represented in a way that permits the presence or absence of the computed
patterns to be determined.
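As a minimal sketch of the simple frequency criterion and the re-representation step described above, the following builds a local dictionary for one category and then re-represents a new communication against it. The function names, the training examples, and the tie-breaking rule are assumptions made for illustration only.

```python
from collections import Counter

def local_dictionary(training, category, k):
    """Return up to k of the most frequent features among items labeled `category`.

    `training` is a list of (features, labels) pairs. A small category may
    yield fewer than k features, as the text anticipates.
    """
    counts = Counter()
    for features, labels in training:
        if category in labels:
            counts.update(features)
    # Sort by descending count, then alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [feature for feature, _ in ranked[:k]]

def re_represent(features, selected):
    """Count how often each selected feature occurs in one communication."""
    counts = Counter(features)
    return [counts[f] for f in selected]

# Hypothetical labeled training items.
training = [
    (["refund", "order", "refund"], {"billing"}),
    (["password", "login", "reset"], {"account"}),
    (["refund", "card"], {"billing"}),
]
billing_dict = local_dictionary(training, "billing", k=2)
vector = re_represent(["refund", "please", "card", "refund"], billing_dict)
```

Features outside the dictionary (here "please") simply drop out of the new representation.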
The assignment of more than one category to an item is called "multiple categorization".
The requirement of support for multiple categorization should enter into the consideration
of the specific machine learning program to be employed for applications involving
categorization of text. Some techniques (for example, some approaches using decision
trees) make the assumption that each item categorized will belong to at most one
category. This is not desirable from the standpoint of categorizing electronic
communications. Some supervised learning systems may return a ranked list of possibilities
instead of a single category, but this is still slightly deficient. Such a system
might assign categories even to documents that should be placed in no category.
A better supervised learning method gives realistic confidence levels with each
assigned category. These methods providing confidence levels are the most flexible
of all, and they provide additional information for a business to use in determining
how to handle each incoming message.
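The difference between a ranked list and per-category confidence levels can be illustrated with a short sketch. The confidence values and the 0.5 threshold below are assumptions chosen for the example; the point is only that a confidence-based method can return several categories or none at all.

```python
def assign_categories(confidences, threshold=0.5):
    """Return every (category, confidence) pair meeting the threshold, highest first.

    The result may be empty, which a pure ranked list cannot express.
    """
    kept = [(c, p) for c, p in confidences.items() if p >= threshold]
    kept.sort(key=lambda cp: (-cp[1], cp[0]))
    return kept

# Hypothetical confidence levels produced by a categorizer.
multi = assign_categories({"billing": 0.91, "shipping": 0.72, "careers": 0.05})
none = assign_categories({"billing": 0.10, "careers": 0.05})
```

A business could apply different thresholds to different uses, for example a high one for automated responses and a lower one for routing to a person.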
Multiple categorization allows classification of an electronic communication
containing several topics into each relevant category. This eliminates the manual
method of choosing only one category and placing the communication there. Two problems
can arise from that manual method:
- 1. Randomness can be introduced when different people categorize the
communication differently depending on the topic they choose.
- 2. Errors could be introduced into the business metrics used to determine,
for example, how many communications were received on a specific product.
By automatically categorizing the communication under all of the appropriate topics,
the business metrics maintain their accuracy and the communication is consistently
linked to the same categories.
The preferred embodiment of the present invention focuses on the steps that must
be taken to create a categorization scheme 220 that is a prerequisite to
using supervised learning technology effectively to assist in handling electronic
communications. For purposes of the following description, electronic communication
means both ordinary e-mail and web-mail (communication via the World Wide Web);
however, it is not intended to exclude any form of electronic communication. A
category scheme includes flat category schemes in which the categories are not
related by subsumption, as well as hierarchical category schemes in which some
categories subsume other categories. Identification of which categories are logically
disjoint from one another is part of the creation of a category scheme. Whether
or not the identification of mutually disjoint categories is of use to the supervised
learning engine, it is likely to be useful in deciding on the proper handling of
electronic communications. The initial categorization scheme is prepared manually.
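One possible way to record such a category scheme is sketched below. The class, its method names, and the banking categories are hypothetical. A flat scheme is simply one in which no category has a parent. In this sketch disjointness is recorded pairwise and is not inherited by subcategories.

```python
from dataclasses import dataclass, field

@dataclass
class CategoryScheme:
    parent: dict = field(default_factory=dict)   # category -> subsuming category or None
    disjoint: set = field(default_factory=set)   # set of frozenset({a, b}) pairs

    def add(self, name, parent=None):
        self.parent[name] = parent

    def mark_disjoint(self, a, b):
        self.disjoint.add(frozenset((a, b)))

    def ancestors(self, name):
        """List all categories that subsume `name`, nearest first."""
        out = []
        p = self.parent.get(name)
        while p is not None:
            out.append(p)
            p = self.parent.get(p)
        return out

    def subsumes(self, general, specific):
        return general in self.ancestors(specific)

    def are_disjoint(self, a, b):
        return frozenset((a, b)) in self.disjoint

# Hypothetical hierarchical scheme.
scheme = CategoryScheme()
scheme.add("accounts")
scheme.add("credit_card", parent="accounts")
scheme.add("careers")
scheme.mark_disjoint("accounts", "careers")
```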
Referring again to the drawings, in particular to FIG. 3, there is shown
a flow chart of the steps to be followed for constructing a categorization scheme
suitable for use with supervised learning technology. The first step 310
is to analyze the business domain. Referring now to FIG. 4, there is shown a flow
diagram further describing the analysis of the business domain. The analysis of
the business domain further comprises the following steps:
1. Analyze the anticipated content of relevant electronic communications 410.
Review the existing electronic communications to determine if the same questions
are frequently asked. If they are, determine if those questions can be answered
the same way each time.
2. Analyze business missions and goals 420 as follows: Review the business
model and determine the success criteria and measurements used to determine when
the business is successful. Establish turnaround times for the electronic communications
to support the business mission and goals. Determine the volume of electronic communications
that come in daily and how many have to be answered to meet the business goals.
3. Evaluate the skills of involved personnel 430. Determine whether the
customer service representatives (CSR) can answer the questions asked or provide
the requested information directly. If not, determine if they forward the questions
to a more experienced person or depend on the answer being provided to them.
4. Analyze the organization's workflow 440 as follows: Determine the flow
through the organization and routing performed on a category by category basis.
Determine if subject matter experts (SME) have been established for the categories
of information the people receiving the questions can't answer. Determine whether
there is currently an automated or manual system for routing electronic communications.
5. Analyze the use of stored responses 450 in the following manner: Determine
whether answers have been developed for frequently occurring questions. Determine
whether the CSRs add additional comments to stored responses before sending them
to the customer.
6. Using the insight gained by the analysis of the business domain so far, produce
business requirements 460 to be used in the later validation phase.
Referring again to FIG. 3, the next step 320 for constructing a
categorization scheme is to decide on an approach to machine learning in the form
of a program or an algorithm that will be used to induce a categorizer using supervised
learning. In supervised learning, the categorizer is generated from training data,
which in this case will be a set of examples of communications of the kind
to be classified. In order to produce a categorizer, the items of training data
will be labeled with categories from the category scheme produced by this invention.
The criteria affecting the suitability of the machine learning component are:
- the ability of the machine learning component to process data effectively
derived from electronic communications containing text, where the data representations
are normally vectors of high dimensionality,
- the potential of the machine learning component to produce a categorizer
whose performance as measured by precision, recall, and/or accuracy indicates likely
utility in a business setting, and
- a determination of whether it is critical to have a capability to extend
or modify a machine-generated categorizer by manual means in order to cover gaps
due to the absence of particular kinds of training data.
In the experience of the inventors, machine learning programs based on symbolic
rule induction are likely to be good candidates according to all of these criteria.
However, machine learning programs based on boosted decision trees, on support
vector machines, or on regularizing approaches to supervised learning such as
least squares fit, logistic regression, or related methods are good according to
the first two criteria, and may be employed if it is determined that the third
criterion is not critical.
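The performance measures named in the second criterion can be computed per category from predicted and actual labels. The sketch below uses the standard definitions of precision, recall, and accuracy. The function name and the sample predictions are illustrative assumptions.

```python
def binary_metrics(predicted, actual):
    """Compute (precision, recall, accuracy) for one category.

    `predicted` and `actual` are parallel lists of booleans saying whether
    each test communication was assigned to, or truly belongs to, the category.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)            # correctly assigned
    fp = sum(p and not a for p, a in pairs)        # assigned but should not be
    fn = sum(a and not p for p, a in pairs)        # missed
    tn = sum(not p and not a for p, a in pairs)    # correctly left out
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy

# Hypothetical results for five test communications.
predicted = [True, True, False, False, True]
actual = [True, False, True, False, True]
precision, recall, accuracy = binary_metrics(predicted, actual)
```

For rare categories, accuracy alone can look high even when most members are missed, which is one reason to examine precision and recall as well.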
The next step 330 is to gather existing data. Foremost is the assembly
of a pool, as large as possible, of representative examples of electronic communications.
This data set will be needed in subsequent steps. Informal or ad hoc methods of
labeling or classifying electronic communications may already exist, so all of
these existing categorization schemes should be collected. Also, make inventories
of personnel skills, business processes, workflows, and business missions.
The next step 340 is to analyze the data. Examples of the electronic communications
should be studied to gain an appreciation of the complexity, vagueness, and uniqueness
to be expected in the communications to be categorized, as well as the relative
numbers of various kinds of communications. The technical structure of the communications
should be ascertained. An example of this kind of structure is the presence of
special fields, e.g., containing a URL or a CGI query string, that are likely
to be relevant to categorization. The inventories of personnel skills, business
processes, workflows, and business missions collected earlier provide the basis
for obtaining a complete understanding of what must be done with electronic communications,
and by whom. In particular, both the extent to which different people necessarily
handle clearly different kinds of communications and the extent to which a single
person may handle communications of a variety of types should be clearly understood.
This understanding will determine the level of granularity of the categorization
scheme that is required by the structure of the business.
The next step 350 is to define a categorization scheme. The first phase
here is to draw together lists of categories related to business mission groups, categories
related to routing communications to specific individuals, categories of communications
for which an automated response is feasible and desirable, and categories related to
existing stored responses or stored templates for responses. The assembled categorization
scheme should then be tempered by bringing to bear an analysis of what kinds of
categorization are technically feasible. Knowledge of the technical structure of
the communications should be correlated with knowledge of what kinds of features
can actually be identified by the machine learning component. This will lead to
conjectures about what kinds of categories are reasonable to consider for the categorization
scheme, in light of the fact that the categories for which supervised learning
is likely to be effective are those that are associated with distinctive vocabularies.
If two categories are so similar that it is hard to come up with words that will
frequently distinguish communications in one category from communications in the
other, then those categories should probably be amalgamated. Also, very general
categories are not likely to be good for supervised learning techniques, again
because of the lack of a distinctive vocabulary. Finally, the categorization scheme
should be considered in light of the amount of training data available. Categories
with very few examples (as a guideline, about 30 or fewer) should be eliminated or
combined with other categories because supervised learning is not likely to produce
a categorizer that performs well on those categories.
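The check on training-data volume described above can be sketched as follows. The cutoff of 30 follows the guideline in the text, read here as "fewer than 30", which is an assumption. The category names and the merge map are hypothetical.

```python
from collections import Counter

def small_categories(labels_per_doc, min_examples=30):
    """List categories with fewer than `min_examples` labeled examples."""
    counts = Counter(c for labels in labels_per_doc for c in labels)
    return sorted(c for c, n in counts.items() if n < min_examples)

def amalgamate(labels_per_doc, merge_map):
    """Relabel documents, replacing each category found in `merge_map`."""
    return [{merge_map.get(c, c) for c in labels} for labels in labels_per_doc]

# Hypothetical label sets for 75 training communications.
data = [{"visa"}] * 40 + [{"amex"}] * 20 + [{"diners"}] * 15
before = small_categories(data)
merged = amalgamate(data, {"amex": "other_cards", "diners": "other_cards"})
after = small_categories(merged)
```

Whether two underpopulated categories should be combined or eliminated remains a business decision. The count only flags the candidates.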
The next step 355 is to label examples of electronic communications with
categories from the categorization scheme for use both as training data to be used
in the supervised learning step and as test data. The labeled examples should resemble
as closely as possible the data on which the induced categorizer will eventually
be used. However, if perfectly matching training data is unavailable, it is possible
to use instead other data that bears a close resemblance to the electronic communications
to be ultimately categorized.
The next step 360 is to use a computer program to convert the labeled
data into a form suitable for subsequent processing, both for the purpose of machine
learning and technical validation. Referring now to FIG. 5, there is shown a procedure
for converting an electronic communication to a form suitable for use as data for
supervised learning. First, a document d is read in block 510. The document
d is segmented into sections 520, if any, whose separate identity is significant
for categorization. Each section containing text is then tokenized in block 530.
Optionally, all tokens are converted to canonical forms, i.e., stemming is performed,
in block 540. Stopwords are optionally deleted in block 550. A stopword
is a common word not useful for categorization. A representation r of the tokenized
document d, from which the list of tokens in each section can be determined, is then
output in block 560.
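The procedure of FIG. 5 might be sketched as below. Everything specific here is an assumption made for illustration: the rule that the first line is a SUBJECT section and the rest is TEXT, the tiny stopword list, and the crude suffix-stripping stemmer, which stands in for a real stemming algorithm.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "to", "my", "i"}  # tiny illustrative list

def segment(document):
    """Block 520: split into sections (assumes the first line is the subject)."""
    subject, _, body = document.partition("\n")
    return {"SUBJECT": subject, "TEXT": body}

def tokenize(text):
    """Block 530: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    """Block 540: crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def represent(document, use_stemming=True, drop_stopwords=True):
    """Produce a representation r: the list of tokens in each section."""
    r = {}
    for name, text in segment(document).items():
        tokens = tokenize(text)
        if use_stemming:
            tokens = [stem(t) for t in tokens]
        if drop_stopwords:                      # block 550
            tokens = [t for t in tokens if t not in STOPWORDS]
        r[name] = tokens
    return r

r = represent("Printer problems\nMy printer stopped printing pages")
```

The two boolean parameters reflect that stemming and stopword deletion are both optional in the procedure.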
Referring again to FIG. 3, the next step 365 is to use a computer
program based on the supervised learning technology to induce a categorizer for
the categorization scheme. This induction of the categorizer may involve some experimentation
involving tuning parameters to improve performance. For instance, the particular
algorithm used may use only a small set of values as feature counts, but exactly
how many values are used may be settable. Similarly, exactly which sections
of a communication or document are used in training may be settable. If it is technically
feasible and judged desirable (perhaps to compensate for known deficiencies in
the training data), after the supervised learning algorithm has induced the categorizer,
manual modification of the machine-generated categorizer may be done. For a rule-based
categorizer, such manual modification may be accomplished by adding additional
rules to categorize very important kinds of communications that were not adequately
represented in the training data.
Referring to FIG. 6, there is shown a procedure for supervised learning
of a categorizer for a categorization scheme S using training data labeled with
categories from S. A list S of categories and the set TR of representations of
the tokenized training documents labeled with categories in S are input in block 610.
For each category C in S, TR is used to induce a categorizer T(C) that
can decide if an unlabeled document is in C, in block 620, as shown in detail
in FIG. 7.
Referring now to FIG. 7, there is shown a procedure for supervised learning
of a categorizer for a single category C using training data labeled with categories
from a categorization scheme. First a category C is specified in block 710.
The set TR of representations of the tokenized training documents is input in block 720.
Feature selection is performed in block 730, creating a list
(i.e., a local dictionary) L(C) of selected features for this data set and this
category C. A set D(C) of category-specific representations of the tokenized training
documents is created in block 740, based on the list L(C) of selected features.
The category-specific representations D(C) are used to induce a categorizer T(C)
for the category C, in block 750. Data specifying the categorizer T(C) is
output in block 760.
Referring again to FIG. 6, data specifying the categorizers T(C) for all
C are assembled into data specifying a categorizer Categorizer(S) that can predict
the categories in S to which an unlabeled document belongs, in block 630.
Data specifying the categorizer Categorizer(S) is output in block 640.
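The control flow of FIGS. 6 and 7 can be sketched as a loop that builds one categorizer per category. The learner used here is deliberately a toy and is not one of the methods the text recommends: it merely picks the minimum total count of selected features that gives the fewest training errors. The function names, the training documents, and the dictionary layout of the output are likewise assumptions.

```python
from collections import Counter

def select_features(TR, C, k=3):
    """Block 730: local dictionary L(C), the k most frequent tokens in C's documents."""
    counts = Counter()
    for tokens, labels in TR:
        if C in labels:
            counts.update(tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]

def category_representations(TR, C, L):
    """Block 740: D(C), a (count vector, is-in-C flag) pair per training document."""
    return [([Counter(tokens)[f] for f in L], C in labels) for tokens, labels in TR]

def induce_for_category(D):
    """Block 750 (toy stand-in learner): choose the hit threshold with fewest errors."""
    best_t, best_err = 1, None
    max_total = max((sum(v) for v, _ in D), default=0)
    for t in range(1, max_total + 2):
        err = sum((sum(v) >= t) != y for v, y in D)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def induce_categorizer(S, TR, k=3):
    """FIG. 6: induce T(C) for each C in S and assemble the results."""
    categorizer = {}
    for C in S:                                   # block 620
        L = select_features(TR, C, k)
        D = category_representations(TR, C, L)
        categorizer[C] = {"features": L, "threshold": induce_for_category(D)}
    return categorizer                            # blocks 630 and 640

# Hypothetical tokenized training documents with their labels.
TR = [
    (["refund", "order", "refund"], {"billing"}),
    (["refund", "card", "charge"], {"billing"}),
    (["password", "login", "reset"], {"account"}),
    (["login", "password", "card"], {"account", "billing"}),
]
model = induce_categorizer(["billing", "account"], TR)
```

Because each category is learned independently, a document labeled with several categories, like the last one above, contributes a positive example to each of them.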
Referring again to FIG. 3, the next step 370 is to validate the
categorization scheme with respect to technical performance and business requirements.
The technical performance criteria can be evaluated using a data test set, possibly
consisting of data held back from the training set. This will involve using a computer
program that follows the flow chart depicted in FIG. 8 to use the induced categorizer
to predict the categories to which electronic communications belong. Referring
to FIG. 8, there is shown a procedure for using a categorizer Categorizer(S), induced
by supervised learning, to predict those categories in S to which an unlabeled
electronic communication (document d) belongs. Data specifying a categorizer
Categorizer(S) is read in block 810. Document d is read in block 820. A representation
r of document d is created in a manner corresponding to that used in the processing
of the training data that induced Categorizer(S), in block 830. Document
d is then categorized by returning all categories to which d is predicted to belong
by Categorizer(S), in block 840. In the context of the level of technical
performance attained on various categories, the performance level of the categorization
scheme with respect to business requirements must be judged. Finally, the overall
performance of the categorization scheme is evaluated by exercising on new data
the entire system for processing electronic communications of which the categorization
system is a part.
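A minimal sketch of the prediction procedure of FIG. 8 follows. The format of the categorizer data (a list of features and a hit threshold per category) is a hypothetical one chosen for this example, and the tokenization shown is assumed to match whatever was used on the training data.

```python
import re

def represent(document):
    """Block 830: tokenize the same way the training data was tokenized (assumed)."""
    return re.findall(r"[a-z0-9]+", document.lower())

def categorize(categorizer, document):
    """Blocks 820 to 840: return all categories to which the document is predicted to belong."""
    tokens = represent(document)
    predicted = []
    for category, spec in categorizer.items():
        hits = sum(tokens.count(f) for f in spec["features"])
        if hits >= spec["threshold"]:
            predicted.append(category)
    return sorted(predicted)

# Hypothetical categorizer data, as it might be read in block 810.
categorizer = {
    "billing": {"features": ["refund", "card", "charge"], "threshold": 1},
    "account": {"features": ["login", "password", "card"], "threshold": 2},
}
both = categorize(
    categorizer,
    "I forgot my password and cannot login. Also, please refund my last order.",
)
neither = categorize(categorizer, "What are your opening hours?")
```

Running this over a held-back test set and comparing the returned categories with the labels gives the technical performance figures that are then weighed against the business requirements.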
Referring again to FIG. 3, the next step 380 is to implement the
categorization scheme by putting the categorization system into production. The
system should be monitored, documenting errors made and successes achieved.
The final step 390 is to review and modify the categorization scheme,
as required. The scheme should be reviewed regularly to consider its adequacy in
the light of the latest distribution of communications. It should be modified to
accommodate new business goals, and it may need to be changed to keep in step with
changes in the supervised learning technology. If the categorization scheme must
be changed, return to the sixth step to consider relabeling the data, and proceed
from there.
Business Domain
The business mission must be well understood to develop a meaningful categorization
scheme. Thus, the first step is to gather data regarding the organization. The
business mission will determine the level of detail and complexity required. For
example, if the business mission requires sending a very specific response to a
very technical question within a short time frame, the categorization scheme must
support a sufficient level of detail for a successful first hit. If only a general
reply is required, the demands on the category determination are much less stringent.
The categorization engine must be robust enough to support more or less detailed categories.
The skill levels of the personnel who execute the day-to-day operations also
have an impact on the development of the categorization scheme. Many businesses
will want to use a tiered approach to answering their electronic communications.
A first-level generalist may act as a filter and answer as many e-mails as possible
before passing to the second or third tier specialists. The category assigned to
an incoming communication should facilitate and not hinder such a transfer. This
type of hand-off reflects both skill sets and workflow. The linkage between the
two must be understood to develop a categorization scheme that is truly functional
in a business environment. Thus, the data gathered is now analyzed in light of
defining a categorization scheme for electronic communications.
Proper execution of the workflow analysis when developing the categorization
scheme involves the following steps:
- 1. Review the personnel assignments in the light of the business mission
and the skills of individuals.
- 2. Understand how to use skilled resources to direct the electronic
communication to a specialist if additional expertise is required.
- 3. Understand how to use skilled resources to develop responses to the
electronic communication that keep customer satisfaction high and advance the business goals.
- 4. Understand the key decision points required to determine whether
additional support is necessary, whether routine answers can be provided, or whether
a custom answer is required.
- 5. Understand when and how to capture the routing information for automating
the responses or forwarding to the specialist.
- 6. Maintain the lowest level of complexity required to respond efficiently
to the communications and still meet the business goals.
- 7. Develop a tentative initial categorization scheme 305 based on areas
receiving the largest number and most important communications so efforts can be
directed toward those areas providing the biggest impact on the business.
- 8. Finally, for realistic use in a business setting, a requirement is
that the categorization scheme be conceived with the realization that many electronic
communications will of necessity be assigned more than one category. Thus, multiple
categorization is common in this domain. The different categories may be of entirely
different kinds (employee inquiry vs. customer inquiry, one product vs. another
product), or they may both be of the same general kind (as when a bank customer
inquires about different services, e.g. a credit card application and how to open
a money market account, in the same message).
In particular, sensitivity to multiple categorization (when it can happen and
what else might happen when it does) is one way in which the workflow analysis must
be done more carefully to support automated routing and response than in the case
of a system based solely or mostly on human response. This increased sensitivity
pays off not only in a higher level of correct response, but also in the potential
for better tracking of electronic communications for report generation purposes.
Technology Domain
There are three issues that should be considered when using supervised learning
technology for text categorization. The first two issues, data set size and distinctivity
of categories, are technical factors affecting performance. The third, technical
validation of results, pulls everything together. The results of validation must
be considered in the context of the technology in order to figure out what corrective
action can bolster results that do not meet expectations.
Data Set Size: The most obvious requirement of supervised learning technology
is that there be enough training data for the task at hand. How much training data is
enough will be discussed later. However, this means that the development of a categorization
scheme should not be undertaken independently of the analysis of the training data,
although there is a great temptation to do so. In practice, a revision of an initial
categorization scheme may be necessary on the basis of the number of examples of
a category in the training data assembled.
Distinctivity of Categories: For any existing data set, labeled with
any categorization scheme, it is desirable to have the supervised learning technology
select the most relevant features and, on that basis, find the patterns, or rules,
that can be used to classify, or categorize, new data. This simply can't always
be done, as in the extreme and hopefully unrealistic case in which the data is labeled
randomly. Eventual success is likely so long as before the supervised learning
engine begins pattern induction, features are found that are likely to be distinctive
of the categories of interest. Such features could be t