Title: System and method for converting text-to-voice
Abstract: A method for converting text to concatenated voice by utilizing a digital voice library and a set of playback rules is provided. Multiple voice recordings correspond to a single speech item and represent various inflections of that single speech item. The method includes determining syllable count and impact value for each speech item in a sequence of speech items. A desired inflection for each speech item is determined based on the syllable count and the impact value and further based on a set of playback rules. A sequence of voice recordings is determined by determining a voice recording for each speech item based on the desired inflection and based on the available voice recordings that correspond to the particular speech item. Voice data are generated based on a sequence of voice recordings by concatenating adjacent recordings in the sequence of voice recordings.
Patent Number: 6,990,450 Issued on 01/24/2006 to Case et al.
Inventors: Case; Eliot M. (Denver, CO); Weirauch; Judith L. (Denver, CO); Phillips; Richard P. (Salt Lake City, UT)
Assignee: Qwest Communications International Inc. (Denver, CO)
Appl. No.: 818331
Filed: March 27, 2001
Current U.S. Class: 704/260; 704/258; 704/261
Current Intern'l Class: G10L 13/08 (20060101)
Field of Search: 704/200,258-270,276
References Cited
U.S. Patent Documents
4692941 | Sep. 1987 | Jacks et al.
4979216 | Dec. 1990 | Malsheen et al.
5278943 | Jan. 1994 | Gasper et al.
5384893 | Jan. 1995 | Hutchins
5668926 | Sep. 1997 | Karaali et al.
5737725 | Apr. 1998 | Case
5758323 | May 1998 | Case
5774854 | Jun. 1998 | Sharman
5832432 | Nov. 1998 | Trader et al.
5850629 | Dec. 1998 | Holm et al.
5878393 | Mar. 1999 | Hata et al.
5913193 | Jun. 1999 | Huang et al.
5949961 | Sep. 1999 | Sharman
5960395 | Sep. 1999 | Tzirkel-Hancock
6076060 | Jun. 2000 | Lin et al.
6101470 | Aug. 2000 | Eide et al.
6115686 | Sep. 2000 | Chung et al.
6144939 | Nov. 2000 | Pearson et al.
6173263 | Jan. 2001 | Conkie
6175821 | Jan. 2001 | Page et al.
6366883 | Apr. 2002 | Campbell et al.
6438522 | Aug. 2002 | Minowa et al.
6499014 | Dec. 2002 | Chihara
6600814 | Jul. 2003 | Carter et al.
6601030 | Jul. 2003 | Syrdal
6665641 | Dec. 2003 | Coorman et al.
Other References
U.S. Appl. No. 09/818,172, pending, Case.
U.S. Appl. No. 09/818,207, pending, Case et al.
U.S. Appl. No. 09/818,968, pending, Case et al.
U.S. Appl. No. 09/818,208, pending, Case et al.
Primary Examiner: McFadden; Susan
Assistant Examiner: Vo; Huyen X
Attorney, Agent or Firm: Brooks Kushman P.C.
Parent Case Text
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of U.S. provisional application Ser. No.
60/241,572 filed Oct. 19, 2000.
Claims
What is claimed is:
1. A method for converting text to concatenated voice by utilizing a digital
voice library and a set of playback rules, the digital voice library including
a plurality of speech items and a corresponding plurality of voice recordings wherein
each speech item corresponds to at least one available voice recording wherein
multiple voice recordings that correspond to a single speech item represent various
inflections of that single speech item, the method including receiving text data,
converting the text data into a sequence of speech items in accordance with the
digital voice library, the method further comprising:
determining a syllable count for each speech item in the sequence of speech items;
determining an impact value for each speech item in the sequence of speech items,
the impact values being determinative of where inflection changes are to take place
within the sequence of speech items;
determining a desired inflection for each speech item in the sequence of speech
items based on the syllable count and the impact value for the particular speech
item and further based on the set of playback rules;
determining a sequence of voice recordings by determining a voice recording for
each speech item based on the desired inflection for the particular speech item
and based on the available voice recordings that correspond to the particular speech item;
generating voice data based on the sequence of voice recordings by concatenating
adjacent recordings in the sequence of voice recordings; and
determining a pitch value for each speech item in the sequence of speech items
by normalizing the impact value for the particular speech item, wherein the desired
inflection for each speech item is further based on the pitch value for the particular
speech item.
2. The method of claim 1 wherein a plurality of the speech items are glue items
and a plurality of the speech items are payload items, the method further comprising:
setting a flag for any speech item in the sequence of speech items that is a
glue item, wherein the playback rules dictate that the desired inflection for a
glue item is based on the desired inflection for surrounding payload items in the
sequence of speech items and that the desired inflection for a payload item is
based on the desired inflection for nearest payload items in the sequence of speech items.
3. The method of claim 2 wherein the plurality of speech items includes a plurality
of phrases.
4. The method of claim 3 wherein the plurality of speech items includes a plurality
of words.
5. The method of claim 4 wherein the plurality of speech items includes a plurality
of syllables.
6. The method of claim 1 wherein multiple voice recordings that correspond to
a single speech item represent various inflections of that single speech item and
wherein the various inflections belong to various inflection groups including
at least one standard inflection group, at least one emphatic inflection group,
and at least one question inflection group.
7. The method of claim 6 wherein the at least one question inflection group includes
a single word question inflection group and a multiple word question inflection group.
8. The method of claim 1 wherein the pitch value for each speech item is between
one and five.
9. The method of claim 8 further comprising:
remodulating the pitch values for the sequence of speech items such that no more
than two consecutive words have the same pitch value except when the particular
consecutive words lead a sentence.
10. The method of claim 8 further comprising:
remodulating the pitch values for the sequence of speech items such that there
are at least two words between any two words having a pitch value of five.
11. The method of claim 8 further comprising:
remodulating the pitch values for the sequence of speech items such that there
is at least one word between any two words having pitch values of four.
12. The method of claim 8 further comprising:
remodulating the pitch values for the sequence of speech items such that any
word that is at the beginning of a sentence has a pitch value of at least three.
13. The method of claim 8 further comprising:
remodulating the pitch values for the sequence of speech items such that any
word that immediately precedes a comma or semi-colon has a pitch value of not more
than three.
14. The method of claim 8 further comprising:
remodulating the pitch values for the sequence of speech items such that any
word that is at the end of a sentence ending in a period or exclamation point has
a pitch value of one.
15. A method for converting text to concatenated voice by utilizing a digital
voice library and a set of playback rules, the digital voice library including
a plurality of speech items, including glue items and payload items, and a corresponding
plurality of voice recordings wherein each speech item corresponds to at least
one available voice recording wherein multiple voice recordings that correspond
to a single speech item represent various inflections of that single speech item,
the method including receiving text data, converting the text data into a sequence
of speech items in accordance with the digital voice library, the method further comprising:
determining a syllable count for each speech item in the sequence of speech items;
determining an impact value for each speech item in the sequence of speech items;
determining a pitch value within a range for each speech item in the sequence
of speech items by normalizing the impact value for the particular speech item;
determining a desired inflection for each speech item in the sequence of speech
items based on the syllable count and the pitch value for the particular speech
item and further based on the set of playback rules wherein the playback rules
dictate that the desired inflection for a glue item is based on the desired inflection
for surrounding payload items and that the desired inflection for a payload item
is based on the desired inflection for nearest payload items with priority being
given to speech items having a greater pitch value such that the desired inflections
are determined first for speech items having the greatest pitch value and, thereafter,
are determined for speech items in order of descending pitch;
determining a sequence of voice recordings by determining a voice recording for
each speech item based on the desired inflection for the particular speech item
and based on the available voice recordings that correspond to the particular speech
item; and
generating voice data based on the sequence of voice recordings by concatenating
adjacent recordings in the sequence of voice recordings.
16. The method of claim 15 wherein the plurality of speech items includes a plurality
of phrases.
17. The method of claim 16 wherein the plurality of speech items includes a plurality
of words.
18. The method of claim 17 wherein the plurality of speech items includes a plurality
of syllables.
19. The method of claim 18 wherein multiple voice recordings that correspond
to a single speech item represent various inflections of that single speech item
and wherein the various inflections belong to various inflection groups including
at least one standard inflection group, at least one emphatic inflection group,
and at least one question inflection group.
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a system and method for converting text-to-voice.
2. Background Art
Systems and methods for converting text-to-speech and text-to-voice are well
known for use in various applications. As used herein, text-to-speech conversion
systems and methods are those that generate synthetic speech output from textual
input, while text-to-voice conversion systems and methods are those that generate
a human voice output from textual input. In text-to-voice conversion, the human
voice output is generated by concatenating human voice recordings. Examples of
applications for text-to-voice conversion systems and methods include automated
telephone information and Interactive Voice Response (IVR) systems.
SUMMARY OF THE INVENTION
It is, therefore, an object of the present invention to provide a method for
converting text to concatenated voice by utilizing a digital voice library and a
set of playback rules.
In carrying out the above object, a method for converting text to concatenated
voice by utilizing a digital voice library and a set of playback rules is provided.
The digital voice library includes a plurality of speech items and a corresponding
plurality of voice recordings. Each speech item corresponds to at least one available
voice recording. Multiple voice recordings that correspond to a single speech item
represent various inflections of that single speech item. The method includes receiving
text data and converting the text data into a sequence of speech items in accordance
with the digital voice library. The method further comprises determining a syllable
count for each speech item in the sequence of speech items, determining an impact
value for each speech item in the sequence of speech items, and determining a desired
inflection for each speech item in the sequence of speech items based on the syllable
count and the impact value for the particular speech item and further based on
the set of playback rules. The method further comprises determining a sequence
of voice recordings by determining a voice recording for each speech item based
on the desired inflection for the particular speech item and based on the available
voice recordings that correspond to the particular speech item. Further, voice
data are generated based on the sequence of voice recordings by concatenating adjacent
recordings in the sequence of voice recordings.
In a preferred embodiment, a plurality of the speech items are glue items and
a plurality of the speech items are payload items. The method further comprises
setting a flag for any speech item in the sequence of speech items that is a glue
item. The playback rules dictate that the desired inflection for a glue item is
based on the desired inflection for surrounding payload items in the sequence of
speech items and that the desired inflection for a payload item is based on the
desired inflection for nearest payload items in the sequence of speech items.
Further, in a preferred embodiment, the plurality of speech items includes
a plurality of phrases, a plurality of words, and a plurality of syllables.
In a suitable implementation, multiple voice recordings that correspond to a
single speech item represent various inflections of that single speech item. The various
inflections belong to various inflection groups including at least one standard
inflection group, at least one emphatic inflection group, and at least one question
inflection group. Preferably, the at least one question inflection group includes
a single word question inflection group and a multiple word question inflection group.
Further, in a preferred implementation, the plurality of speech items includes
a plurality of words. The method further comprises determining a pitch value for
each speech item in the sequence of speech items by normalizing the impact value
for the particular speech item. The desired inflection for each speech item is
further based on the pitch value for the particular speech item. In a suitable
implementation, the pitch value for each speech item is between one and five. A
preferred method further comprises remodulating the pitch values for the sequence
of speech items such that no more than two consecutive words have the same pitch
value except when the particular consecutive words lead a sentence.
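The normalization of impact values into pitch values between one and five can be sketched as follows. The text does not state how the normalization is computed, so the linear min-max scaling used here, the function name normalize_to_pitch, and the mid-range default for equal impacts are all assumptions made for illustration.

```python
def normalize_to_pitch(impacts, low=1, high=5):
    """Map raw impact values onto integer pitch values in [low, high].

    Assumption: linear min-max scaling over the sequence; the source only
    says that pitch values are obtained by normalizing impact values.
    """
    if not impacts:
        return []
    smallest, largest = min(impacts), max(impacts)
    if largest == smallest:
        # Every item carries the same impact; assume a mid-range pitch.
        return [(low + high) // 2 for _ in impacts]
    span = largest - smallest
    return [low + round((v - smallest) * (high - low) / span) for v in impacts]


# Hypothetical impact values for a four-word sequence.
print(normalize_to_pitch([10, 20, 30, 50]))
```

With this scaling the item with the lowest impact always receives pitch one and the item with the highest impact always receives pitch five, which may or may not match the behavior of an actual implementation.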
In addition, embodiments of the present invention contemplate a number of other
remodulation techniques. For example, a method of the present invention may include
remodulating the pitch values for the sequence of speech items such that there
are at least two words between any two words having a pitch value of five. In addition,
the method may include remodulating the pitch values for the sequence of speech
items such that there is at least one word between any two words having pitch values
of four. Further, the method may include remodulating the pitch values for the
sequence of speech items such that any word that is at the beginning of a sentence
has a pitch value of at least three. Further, for example, the method may include
remodulating the pitch values for the sequence of speech items such that any word
that immediately precedes a comma or semicolon has a pitch value of not more than
three. Further, the method may include remodulating the pitch values for the sequence
of speech items such that any word that is at the end of a sentence ending in a
period or exclamation point has a pitch value of one.
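Three of the remodulation rules listed above depend only on sentence boundaries and punctuation, and can be sketched as a single pass over the word sequence. The representation of words as tokens with trailing punctuation attached, the function names, and the choice to let the sentence-final rule win when a one-word sentence triggers two rules are assumptions. The spacing rule for pitch values of five is shown only as a check, since the text does not say how a violation is to be repaired.

```python
def remodulate_boundaries(words, pitches):
    """Apply the sentence-start, comma/semicolon and sentence-end rules.

    words:   tokens with trailing punctuation attached, e.g. "today."
    pitches: one pitch value (1..5) per word; a new list is returned.
    """
    out = list(pitches)
    sentence_start = True
    for i, word in enumerate(words):
        if sentence_start:
            out[i] = max(out[i], 3)   # first word of a sentence: at least three
        sentence_start = False
        if word.endswith((",", ";")):
            out[i] = min(out[i], 3)   # before comma or semicolon: at most three
        elif word.endswith((".", "!")):
            out[i] = 1                # last word before period or exclamation point
            sentence_start = True
        elif word.endswith("?"):
            sentence_start = True     # question ends the sentence; pitch untouched
    return out


def violates_five_spacing(pitches):
    """True if two pitch-five words have fewer than two words between them."""
    fives = [i for i, p in enumerate(pitches) if p == 5]
    return any(b - a < 3 for a, b in zip(fives, fives[1:]))


words = ["Hello,", "your", "order", "shipped", "today."]
adjusted = remodulate_boundaries(words, [2, 5, 4, 5, 4])
print(adjusted, violates_five_spacing(adjusted))
```

In this hypothetical sentence the boundary rules yield 3, 5, 4, 5, 1, and the check reports that the two fives are still too close together, so a further remodulation step would be needed.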
Further, in carrying out the present invention, a method for converting
text to concatenated voice by utilizing a digital voice library and a set of playback
rules is provided. The method includes receiving text data, converting the text
data into a sequence of speech items in accordance with the digital voice library.
The method further comprises determining a syllable count and an impact value for
each speech item in the sequence of speech items. A pitch value within a range
is determined for each speech item in the sequence of speech items by normalizing
the impact value for the particular speech item. The method further comprises determining
a desired inflection for each speech item in the sequence of speech items based
on the syllable count and the pitch value for the particular speech item and further
based on the set of playback rules. The playback rules dictate that the desired
inflection for a glue item is based on the desired inflection for surrounding payload
items and that the desired inflection for a payload item is based on the desired
inflection for nearest payload items with priority being given to speech items
having a greater pitch value such that the desired inflections are first determined
for speech items having the greatest pitch value, and, thereafter, are determined
for speech items in order of descending pitch. The method further includes determining
a sequence of voice recordings by determining a voice recording for each speech
item based on the desired inflection for the particular speech item and based on
the available voice recordings that correspond to the particular speech item. The
method further comprises generating voice data based on the sequence of voice recordings
by concatenating adjacent recordings in the sequence of voice recordings.
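The priority ordering described in this embodiment, in which desired inflections are determined first for the items with the greatest pitch value and then in order of descending pitch, can be sketched as a sort over item positions. The SpeechItem type, the decision to resolve all glue items after the payload items they depend on, and the tie-break by text order are assumptions, as is the classification of "the" and "to" as glue items in the example.

```python
from dataclasses import dataclass


@dataclass
class SpeechItem:
    text: str
    pitch: int             # normalized pitch value, e.g. 1..5
    is_glue: bool = False  # flag set for glue items


def resolution_order(items):
    """Return item positions in the order their inflections are determined.

    Payload items come first, highest pitch first. Glue items follow,
    because their inflection depends on the surrounding payload items
    (resolving them last is an assumption, not stated in the source).
    """
    payload = [i for i, item in enumerate(items) if not item.is_glue]
    glue = [i for i, item in enumerate(items) if item.is_glue]
    # sorted() is stable, so equal-pitch items keep their text order.
    payload = sorted(payload, key=lambda i: -items[i].pitch)
    return payload + glue


sentence = [
    SpeechItem("the", 1, is_glue=True),
    SpeechItem("flight", 4),
    SpeechItem("to", 1, is_glue=True),
    SpeechItem("Denver", 5),
    SpeechItem("departs", 4),
]
print(resolution_order(sentence))
```

Here "Denver" is resolved first, then "flight" and "departs", and the two glue items are resolved last from the inflections of their payload neighbors.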
The advantages associated with embodiments of the present invention are numerous.
For example, methods of the present invention determine desired inflections for
each speech item in a sequence of speech items based on syllable count and impact
value, and further based on a set of playback rules.
The above object and other objects, features, and advantages of the present invention
will be readily appreciated by one of ordinary skill in the art from the following
detailed description of the best mode for carrying out the invention when taken
in connection with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a simplified block diagram of a text-to-voice conversion system and
method of the present invention, such as for use in an automated telephone information
or IVR system;
FIG. 2 is an architectural and flow diagram of the text-to-voice conversion
system and method of FIG. 1;
FIG. 3 is a block diagram illustrating text breakdown;
FIGS. 4A-C are inflection mapping diagrams associated with a digital voice library;
FIG. 5 is a block diagram illustrating inflection selection in accordance with
playback rules and with the diagrams in FIGS. 4A-C;
FIG. 6 illustrates conversion of text as known words or literally spelled by
syllable to spoken output as pre-recorded words or phonetically spelled by syllable;
FIG. 7 broadly illustrates the conversion from input text to concatenated voice output;
FIG. 8 graphically represents a tone sound;
FIG. 9 graphically represents a noise sound;
FIG. 10 graphically represents an impulse sound;
FIG. 11 graphically represents concatenation of an impulse and an impulse;
FIG. 12 graphically represents concatenation of a tone and a tone;
FIG. 13 graphically represents concatenation of a tone and a tone with overlap;
FIG. 14 graphically represents concatenation of noise and noise;
FIG. 15 graphically represents concatenation of a tone and an impulse;
FIG. 16 graphically represents concatenation of a tone and an impulse with overlap;
FIG. 17 graphically represents concatenation of noise and an impulse;
FIG. 18 graphically represents concatenation of noise and a tone;
FIG. 19 graphically represents concatenation of an impulse and a tone;
FIG. 20 graphically represents concatenation of an impulse and a tone with overlap;
FIG. 21 graphically represents concatenation of an impulse and noise;
FIG. 22 graphically represents concatenation of a tone and noise;
FIG. 23 depicts word value assessment during inflection selection in accordance
with playback rules and shows impact values and syllable counts;
FIG. 24 depicts word value assessment during inflection selection in accordance
with playback rules and shows initial pitch/inflection values; and
FIG. 25 depicts example voice sample selections during inflection selection
in accordance with the playback rules.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT(S)
One drawback of computer systems which provide synthetic text-to-speech conversion
is that many times the synthetic speech that is generated sounds unnatural, particularly
in that inflections that are normally employed in human speech are not accurately
approximated in the audible sentences generated. One difficulty in providing a
more natural sounding synthetic speech output is that in some existing systems
and methods, words and inflection changes are based more upon the phoneme structure
of the target sentence, rather than upon the syllable and phrase structure of the
target sentence. Further, inflection and pitch changes are dependent not only on
the syllable structure of the target word, but also the syllable structure of the
surrounding words. Existing systems and methods for text-to-speech conversions
do not include analysis which accounts for such syllable structure concerns.
One problem associated with existing systems and methods for text-to-voice conversion
is that they are not capable of generating voice output for unknown text, such
as words that have not been previously recorded or concatenated and stored. Such
concatenated speech systems and methods have also ignored the type of audio content
at the beginnings and endings of recordings, essentially butting one recording
against another in order to generate the target output. While such a technique
has been relatively successful, it has contributed to the unnatural quality of
its generated output. Further, most systems and methods cannot produce the ligatures
or changes that occur to the beginning or end of words that are spoken closely together.
Finally, existing concatenated speech systems and methods have historically
been limited to outputting numbers and other commonly used and anticipated portions
of an entire speech output. Typically, such systems and methods use a prerecorded
fragment of the desired output up to the point at which a number or other anticipated
piece is reached. The concatenation algorithms then generate only the anticipated
portion of the sentence, followed by another prerecorded fragment used to complete
the output.
Thus, there exists a need for a text-to-voice conversion system and method
which accepts text as an input and provides high quality speech output through
use of multiple recordings of a human voice in a digital voice library. Such a
system and method would include a library of human voice recordings employed for
generating concatenated speech, and would organize target words, word phrases and
syllables such that their use in an audible sentence generated from a computer
system would sound more natural. Such an improved text-to-voice conversion system
and method would further be able to generate voice output for unknown text, and
would manipulate the playback switch points of the beginnings and endings of recordings
used in a concatenated speech application to produce optimal playback output. Such
a system and method would also be capable of playing back various versions of recordings
according to the beginning or ending phonemes of surrounding recordings, thereby
providing more natural sounding speech ligatures when connecting sequential voice
recordings. Still further, such a system and method would work over the entire
length of the required output, without the limitation of only accounting for specific
and anticipated portions of a required output, using inflection shape, contextual
data, and speech parts as factors in controlling voice prosody for a more natural
sounding generated speech output. Such a system and method also would not be limited
to use with any particular audio format, and could be used, for example, with audio
formats such as perceptual encoded audio, Linear Predictive Coding (LPC), Codebook
Excited Linear Prediction (CELP), or other methods that are parametric or model
based, or any other formats that may be used in either text-to-speech or text-to-voice systems.
Referring now to the Figures, the preferred embodiment of a system and
method for converting text-to-voice of the present invention will be described.
In general, the present invention includes a text-to-voice computer system and
method which may accept text as an input and provide high quality speech output
through use of multiple recordings of a human voice. According to the present invention,
a digital voice library of human voice recordings is employed for generating concatenated
speech output, wherein target words, word phrases and syllables are organized such
that their use in an audible sentence generated by a computer may sound more natural.
The present invention can convert text to human voice as a standalone product,
or as a plug-in to existing and future computer applications that may need to convert
text-to-voice. The present invention is also a potential replacement for synthetic
text-to-speech systems, and the digital voice library element can act as a resource
for other text-to-voice systems. It should also be noted that the present invention
is not limited to use with any particular audio format, and may be used, for example,
with audio formats such as perceptual encoded audio, Linear Predictive Coding (LPC),
Codebook Excited Linear Prediction (CELP), or other methods that are parametric
or model based, or any other formats that may be used in either text-to-speech
or text-to-voice systems.
More specifically, referring to FIG. 1, a simplified block diagram of a preferred
system and method for converting text-to-voice of the present invention is shown,
such as for use in an automated telephone information or IVR system, denoted generally
by reference numeral 10. As seen therein, the present invention generally
includes a digital voice library (12), which is an asset database that includes
human voice recordings of syllables, words, phrases, and sentences in a significant
number of voiced inflections as needed to produce a more natural sounding voice
output than the synthetic output generated by existing text-to-speech systems and
methods. In operation, the present invention performs analysis of incoming text
(14), and accesses digital voice library (12) via look-up logic (16)
for voice recordings with the desired prosody or inflection, and pronunciation.
The present invention then employs sentence construction algorithms (18)
to concatenate together spoken sentences or voice output (20) of the text input.
Referring now to FIG. 2, the architecture and flow of a preferred text-to-voice
conversion system and method of the present invention are shown, denoted generally
by reference numeral 80. As seen therein, generally, using the previously
described digital voice library, various look-ups are performed, such as for words
or syllables, to assemble the appropriate corresponding speech output data. Using
playback rules, such speech output data is concatenated in order to generate voice
output. More particularly, input text is received at input/output port interface
(82) in the form of words, abbreviations, numbers and punctuation (84)
and may be in the form of text blocks, a text stream, or any other suitable form.
Such text is then broken down, expanded or segmented into pseudo words (86)
as appropriate. In so doing, the present invention utilizes an abbreviations database
(88). Where the particular abbreviation being analyzed corresponds to only
one expanded word, that expanded word is immediately conveyed by abbreviations
database (88) to look-up control module (90). However, where the
particular abbreviation being analyzed corresponds to multiple expanded words,
abbreviations database (88) conveys the appropriate expanded word to look-up
control module (90) based on analysis by look-up control module (90)
of contextual information pertaining to the use of the abbreviation in the input text.
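The two abbreviation cases described above, one expansion versus several expansions resolved by context, can be sketched as follows. The table contents, the function name expand, and the context heuristic (a following capitalized word suggests a title or name prefix) are hypothetical; the text does not say what contextual information the look-up control module actually examines.

```python
# Hypothetical abbreviations table: abbreviation -> candidate expansions.
ABBREVIATIONS = {
    "CO": ["Colorado"],
    "Dr.": ["Doctor", "Drive"],
    "St.": ["Saint", "Street"],
}


def expand(tokens):
    """Expand abbreviations, using the next token as context when ambiguous."""
    out = []
    for i, token in enumerate(tokens):
        options = ABBREVIATIONS.get(token)
        if options is None:
            out.append(token)            # not an abbreviation
        elif len(options) == 1:
            out.append(options[0])       # single expansion: convey it directly
        else:
            following = tokens[i + 1] if i + 1 < len(tokens) else ""
            # Assumed heuristic: "Dr. Smith" is a title, "Elm Dr. in" is a street.
            name_follows = following[:1].isupper()
            out.append(options[0] if name_follows else options[1])
    return out


print(expand(["Dr.", "Smith", "lives", "on", "Elm", "Dr.", "in", "Denver", "CO"]))
```

A heuristic this simple would misread many real inputs; it is meant only to show where the context-dependent branch sits relative to the single-expansion branch.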
Still referring to FIG. 2, look-up control module (90) is provided in
communication with a phrase database (92), word database (94), a
new word generator module (96), and a playback rules database (98).
After input text (84) is appropriately broken down, expanded and segmented
(86), look-up control module (90) first accesses phrase database
(92). Phrase database (92) performs forward and backward searches
of the input text to locate known phrases. The results of such searches, together
with accompanying context information relating to any known phrases located, are
relayed to look-up control module (90).
Thereafter, look-up control module (90) may access common words
database (94), which searches the remaining input text to locate known words.
The results of such searching, together with accompanying context information relating
to any known words located, are again relayed to look-up control module (90).
In that regard, common words database (94) is also provided in communication
with abbreviations database (88) in order to be appropriately updated, as
well as with a console (100). Console (100) is provided as a user
interface, particularly for defining and/or modifying pronunciations for new words
that are entered into common words database (94) or that may be constructed
by the present invention and entered into common words database (94), as
described below.
Look-up control module (90) may next access new word generator module
(96), in order to generate a pronunciation for unknown words, as previously
described. In that regard, new word generator module (96) includes new word
log (102), a syllable look-up module (104), and a syllable database
(106). Look-up module (104) functions to search the input text for
sub-words and spellings of syllables for construction of new words or words recognized
as containing typographical errors. To do so, look-up module (104) accesses
syllable database (106), which includes a collection of numerous possible
syllables. Once again, the results of such searching are relayed to look-up control
module (90). In addition, in some embodiments of the invention, module (104)
functions to search the input text for multi-syllable components (for example,
words in word database (94)).
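The look-up sequence described over the preceding paragraphs (phrases first, then known words, then syllables for unknown words) can be sketched as a cascade. The database contents, the function names, the forward-only longest-match phrase search, and the greedy syllable split are assumptions for illustration; the text describes forward and backward phrase searches and does not specify a matching algorithm.

```python
# Hypothetical stand-ins for the phrase, word and syllable databases.
PHRASES = {("thank", "you"), ("good", "morning")}
WORDS = {"thank", "you", "for", "calling"}
SYLLABLES = {"qwest", "com", "net"}


def split_syllables(word):
    """Greedy longest-match split of an unknown word; None if it fails."""
    parts, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in SYLLABLES:
                parts.append(word[i:j])
                i = j
                break
        else:
            return None   # no known syllable starts at position i
    return parts


def look_up(tokens, max_phrase_len=3):
    """Classify tokens as phrases, words, or syllable sequences, in that order."""
    result, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first (two or more tokens).
        for n in range(min(max_phrase_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in PHRASES:
                result.append(("phrase", " ".join(tokens[i:i + n])))
                i += n
                break
        else:
            token = tokens[i]
            if token in WORDS:
                result.append(("word", token))
            else:
                result.append(("syllables", split_syllables(token)))
            i += 1
    return result


print(look_up(["thank", "you", "for", "calling", "qwestnet"]))
```

A greedy split can fail on words that a backtracking search would segment successfully, so a fuller version of the syllable step would likely need to explore alternatives.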
Referring still to FIG. 2, using any results and context information provided
by abbreviations database (88), phrase database (92), common words
database (94) and/or new word generator module (96), look-up control
module (90) performs context analysis of the input speech and accesses playback
rules database (98). Using the appropriate rules from playback rules database
(98), including rules concerning prosody, pre-distortions and edit points
as described herein, and based on the context analysis of the input speech, look-up
control module (90) then generates appropriate concatenated voice data (108),
which are output as an audible human voice via input/output port interface (82).
The voice data (108) may be a continuous voice file, a data stream, or may
take any other suitable form including a series of Internet protocol packets.
It is appreciated that the preferred embodiment illustrated in FIGS. 1 and 2 may
be implemented in a variety of ways. The digital voice library may include human
voice recordings of syllables, words, phrases, and even sentences (not shown).
Each item (syllable, word, phrase, or sentence) is recorded in a significant number
of voice inflections so that for a particular item, the correct recording may be
chosen based on the context around the item in the text input. Further, in a preferred
embodiment, the digital voice library includes multiple recordings for an item
in a specific inflection. That is, for example, a specific word may have multiple
inflections, and some of those inflections may require multiple recordings of the
same inflection but having different distortions or ligatures. As such, it is appreciated
that the digital voice library is a broad and scalable concept, and may include
items, for example, as large as a full sentence or as small as a single syllable
or even a phoneme. Further, for any item in the digital voice library, the digital
voice library may include multiple recordings of various inflections. And for a
particular inflection of a particular item, the library may further include multiple
recordings to form different ligatures or distortions as the item meshes with surrounding items.
In addition, it is appreciated that the architecture shown in FIG. 2 may take
many forms. For example, although a phrase database, a word database, and a syllable
database are shown, the architecture may be implemented with more databases on either
end. For example, there could be a small phrase database, a large phrase database,
and even a sentence database. In addition, there could be a syllable database and
even a sub-syllable or sound database. The general operation would still follow
that outlined above. In addition, it is appreciated that each database may be constructed
to interact with the databases above and below it in the hierarchy, for example,
as the new word generator module (96) is shown to interact with word database (94).
For example, word database (94) could be implemented to appropriately
include a new phrase log, word look-up logic, and a word database, with the word
look-up logic being in communication with the phrase database. That is, the architecture
in a preferred embodiment is scalable and recursive in nature to allow broad discretion
in a particular implementation depending on the application. Further, in the example
shown, look-up control module (90) sends text to the intelligent databases,
and the databases return pointers to look-up control module (90). The pointers
point generally to items in the digital voice library (phrases, words, syllables,
etc.). That is, for example, a pointer returned by word database (94) generally
points to a word in the digital voice library but does not specify a particular
recording (specific inflection, specific distortions, etc.).
Once look-up control module (90) gathers a set of general pointers for
the sentence, playback rules database (98) processes the pointer set to
refine the general pointers into specific pointers. Each specific pointer generated
by playback rules database (98) points to a particular recording
within the digital voice library. That is, module (90) interacts with the
databases to generally construct the sentence as a sequence of general pointers
(a general pointer points to an item in the library), and then playback rules database
(98) cooperates with look-up control module (90) to specifically
choose a particular recording of each item to provide for proper inflections, distortions,
and ligatures in the voice output. Thereafter, the sequence of specific pointers
(a specific pointer points to a specific recording of an item in the library) is
used to construct the voice data (108), which is sent to output interface
(82). Construction of the voice data may include manipulation of playback
switch points.
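By way of illustration only, the refinement of general pointers into specific pointers might be sketched in Python as follows. The class names, the LIBRARY contents, the file names, and the use of numeric inflection levels are all hypothetical and are not taken from the described embodiment; the sketch assumes that, when the desired inflection has not been recorded, the closest recorded inflection is chosen.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GeneralPointer:
    item: str            # names an item in the library, not a recording

@dataclass(frozen=True)
class SpecificPointer:
    item: str
    inflection: int      # hypothetical numeric inflection level
    recording: str       # one particular recording of the item

# Hypothetical library: item -> inflection level -> available recordings.
LIBRARY = {
    "hello": {1: ["hello_1a.wav"], 3: ["hello_3a.wav", "hello_3b.wav"]},
    "world": {2: ["world_2a.wav"], 5: ["world_5a.wav"]},
}

def refine(pointers, desired_levels):
    """Turn general pointers into specific pointers, one per item."""
    specific = []
    for ptr, want in zip(pointers, desired_levels):
        available = LIBRARY[ptr.item]
        # Closest recorded inflection; ties go to the lower level (assumption).
        best = min(available, key=lambda level: (abs(level - want), level))
        # Assumption: the first recording is used; choosing among several
        # recordings (different distortions or ligatures) is not modeled.
        specific.append(SpecificPointer(ptr.item, best, available[best][0]))
    return specific
```

A fuller implementation would also choose among the several recordings of one inflection based on how the item meshes with its neighbors.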
The present invention can thus "capture" the dialects and accents of any language
and match the general item pointers returned by the databases with appropriate
specific pointers in accordance with playback rules (98). The present invention
analyzes text input and assembles and generates speech output via a library by
determining which groups of words have stored phrase recordings, which words have
stored complete word recordings, and which words can be assembled from multiple
syllable recordings and, for unknown words, pronouncing the words via syllable
recordings that map to the incoming spellings. The present invention can either
map known common typographical errors to the correct word or can simply pronounce
the words as spelled primarily via syllable recordings and phoneme recordings if needed.
The present invention also calculates which inflection (and preferably, some
words or items may have multiple recordings at the same inflection but with different
distortions) would sound best for each recording that is played back in sequence
to form speech. A console may be provided to manually correct or modify how and
which recordings are played back including speed, prosody algorithms, syllable
construction of words, and the like. The present invention also adjusts pronunciation
of words and abbreviations according to the context in which the words or abbreviations
were used.
FIG. 3 illustrates a suitable text breakdown technique at 30, and FIGS. 4A-C
illustrate a suitable inflection mapping table including groups 120, 130, 140,
and 150. That is, each item in the digital voice
library may be recorded in up to as many inflections as present in the inflection
table. Further, there may be a number of recordings for each inflection. FIG. 5
broadly illustrates the selection of appropriate inflections for each word or item
in a sentence in a suitable implementation at 160. Below, FIGS. 3-5 are
described in detail, but of course, other implementations are possible and FIGS.
3-5 merely describe a suitable implementation. Further, as mentioned previously,
the architecture of FIG. 2 is scalable to handle items of various size, and similarly,
the mapping table of FIG. 4 is suitable for words, but similar approaches may be
taken to map larger items such as phrases or smaller items such as syllables.
Inflection and pitch changes that take place during a spoken sentence
are based upon the syllable structure of the target sentence, not upon the word
structure of the target sentence. Furthermore, inflection and pitch changes are
dependent not only on the syllable structure of the target word, but also on the
syllable structure of the surrounding words. Each sentence can normally be treated
as a stand-alone unit. In other words, it is generally safe to choreograph the
inflection/pitch changes for any given sentence without having concern for what
nearby sentences might contain. Below, an exemplary text breakdown technique is described.
Example Pseudo-Code Breakdown (FIG. 3):
Step #A1:
Grab the next sentence from the input buffer (block 32). A sentence can
be considered to have terminated when any of the following is read in.
A Colon.
This is only considered as a sentence terminator if the byte that follows the
colon is a space character, a tab character or a carriage return.
A Period.
This is only considered as a sentence terminator if the byte that follows the
period is a space character, a tab character or a carriage return.
Exception: note that if it is determined that the word preceding the period
is an abbreviation, then this period will not be considered as a sentence terminator
(exception to the exception: unless the period is followed by one or more tab characters,
three or more space characters and/or two or more carriage returns in which case
the period following the abbreviation is considered a sentence terminator).
An Exclamation Point or Question Mark.
This is only considered as a sentence terminator if the byte that follows the
exclamation point or question mark is a space character, a tab character or a carriage return.
One or More Consecutive Tab Characters.
Three or More Consecutive Space Characters.
Two or More Consecutive Carriage Return Characters.
Of course, this list of sentence terminators is an example, and a different technique
may be used in the alternative.
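By way of illustration only, the terminator rules of Step #A1 might be sketched in Python as follows. The abbreviation list is a hypothetical stand-in for the abbreviations database, and the sketch assumes that each carriage return arrives as a single "\r" or "\n" character.

```python
import re

# Hypothetical abbreviation list; a real system would consult its database.
ABBREVIATIONS = {"dr", "mr", "mrs", "st", "ave"}

# Whitespace runs that end a sentence on their own: one or more tabs,
# three or more spaces, or two or more line-break characters.
HARD_BREAK = re.compile(r"\t+| {3,}|[\r\n]{2,}")

def split_sentences(text):
    sentences, start, i = [], 0, 0
    while i < len(text):
        ch = text[i]
        followed_by_space = i + 1 < len(text) and text[i + 1] in " \t\r\n"
        if ch in ":.!?" and followed_by_space:
            word = re.search(r"[A-Za-z]+$", text[start:i])
            is_abbrev = (ch == "." and word is not None
                         and word.group().lower() in ABBREVIATIONS)
            # A period after an abbreviation only terminates the sentence
            # when a hard whitespace break follows it.
            if is_abbrev and not HARD_BREAK.match(text, i + 1):
                i += 1
                continue
            end, resume = i + 1, i + 1
        else:
            brk = HARD_BREAK.match(text, i)
            if brk is None:
                i += 1
                continue
            end, resume = i, brk.end()
        piece = text[start:end].strip()
        if piece:
            sentences.append(piece)
        start = i = resume
    tail = text[start:].strip()
    if tail:  # assumption: end of input also closes the last sentence
        sentences.append(tail)
    return sentences
```

For instance, "Dr. Smith arrived. He left!" yields two sentences, because the period after "Dr" is followed by only a single space.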
Step #A2:
Search the sentence for abbreviations (block 34). Among the many
abbreviation categories that should be made a part of this process, this search
should probably include the United States Postal Service abbreviation list. Many
abbreviations will conclude with a period, but some will not. The Postal Service,
for example, asks that periods not be used as part of an address, even if
the word in question is an abbreviation, so a trailing period should be only one
of several search criteria. Once abbreviations
are identified, they can be converted into their full word equivalents.
Step #A3:
Search the sentence for digits that end with "ST", "ND", "RD" or "TH" (block 36).
Convert the associated number into instructions for speaking. For example,
"44th" will be spoken as "forty-fourth," and "600th" will
be spoken as "six hundredth."
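By way of illustration only, the ordinal conversion of Step #A3 might be sketched in Python as follows. The function names are hypothetical, and the sketch assumes ordinals of at most three digits; longer numbers are left unchanged.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
IRREGULAR = {"one": "first", "two": "second", "three": "third", "five": "fifth",
             "eight": "eighth", "nine": "ninth", "twelve": "twelfth"}

def cardinal(n):
    """Spell out 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + cardinal(rest) if rest else "")

def ordinal(n):
    """Spell out n as an ordinal by rewriting the last word of its cardinal."""
    words = cardinal(n)
    last = re.search(r"[a-z]+$", words)
    head, tail = words[:last.start()], last.group()
    if tail in IRREGULAR:
        tail = IRREGULAR[tail]
    elif tail.endswith("y"):
        tail = tail[:-1] + "ieth"
    else:
        tail += "th"
    return head + tail

# One to three digits directly followed by ST, ND, RD or TH (any case).
ORDINAL_RE = re.compile(r"\b(\d{1,3})(st|nd|rd|th)\b", re.IGNORECASE)

def expand_ordinals(sentence):
    return ORDINAL_RE.sub(lambda m: ordinal(int(m.group(1))), sentence)
```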
Step #A4:
Search the sentence for monetary values (block 38). In the United States,
this is indicated by a dollar sign ("$") followed directly by one or more digits.
Sometimes this will extend to include a period (decimal point) and two more digits
representing the decimal part of a dollar. This can then be converted into the
instructions that will generate a spoken dollar (and cents) amount.
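By way of illustration only, the monetary search of Step #A4 might be sketched in Python as follows. The function name and the (dollars, cents) return form are assumptions; generating the spoken instructions from each pair is left to later processing.

```python
import re

# A dollar sign, one or more digits, and optionally a decimal point followed
# by exactly two digits. A lone trailing period is treated as punctuation.
MONEY_RE = re.compile(r"\$(\d+)(?:\.(\d{2})(?!\d))?")

def find_money(sentence):
    """Return (dollars, cents) pairs in order of appearance."""
    return [(int(m.group(1)), int(m.group(2) or 0))
            for m in MONEY_RE.finditer(sentence)]
```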
Step #A5:
Search the sentence for telephone numbers (block 40). In the United
States, this will commonly be indicated in one of ten ways: 555-5555, 555 5555,
(000) 555-5555, (000) 555 5555, 000-555-5555, 000 555 5555, 1 (000) 555-5555, 1
(000) 555 5555, 1-000-555-5555, 1 000 555 5555.
Of course, there are telephone numbers that don't fit into one of the above ten
templates, but these patterns should cover the majority of telephone number situations.
Pinning down the existence and location of a phone number in most applications
will probably revolve around first searching for the typical <three digits><separator><four digits>
pattern common to all United States phone numbers.
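By way of illustration only, the ten templates of Step #A5 might be turned into a search pattern in Python as follows. The function names are hypothetical; each "0" or "5" in a template is read as "any digit" and every other character is matched literally.

```python
import re

TEMPLATES = [
    "555-5555", "555 5555",
    "(000) 555-5555", "(000) 555 5555",
    "000-555-5555", "000 555 5555",
    "1 (000) 555-5555", "1 (000) 555 5555",
    "1-000-555-5555", "1 000 555 5555",
]

def template_to_regex(template):
    # "0" and "5" stand for any digit; the leading "1" and separators are literal.
    return "".join(r"\d" if ch in "05" else re.escape(ch) for ch in template)

# Longest templates first, so the fullest form wins at a given position.
# The lookarounds keep a match from starting or ending inside a longer digit run.
PHONE_RE = re.compile(
    r"(?<!\d)(?:"
    + "|".join(template_to_regex(t)
               for t in sorted(TEMPLATES, key=len, reverse=True))
    + r")(?!\d)"
)

def find_phone_numbers(sentence):
    return PHONE_RE.findall(sentence)
```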
Step #A6:
Search the sentence for numbers that contain one or more commas (block 42).
Many times if a writer wishes his/her number to represent "how many" of something,
he/she will place a comma within the number. The parsing routines can use this
information to flag that the number should be read out in expanded form. In other
words, 24,692,901 would be read out as "twenty four million, six hundred ninety
two thousand, nine hundred one." Other numbers may be read out one digit at a time,
as many numbers are expected to be heard (for example, account numbers).
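By way of illustration only, the comma rule of Step #A6 might be sketched in Python as follows. The function names are hypothetical, and the sketch assumes the token contains only digits and commas and is below one trillion.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = ["", "thousand", "million", "billion"]

def under_thousand(n):
    """Spell out 1..999 without hyphens, as in the example wording."""
    parts = []
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n >= 20:
        parts.append(TENS[n // 10] + (" " + ONES[n % 10] if n % 10 else ""))
    elif n > 0:
        parts.append(ONES[n])
    return " ".join(parts)

def expand(n):
    """Spell out a whole number in expanded form, one group per scale."""
    if n == 0:
        return "zero"
    groups, scale = [], 0
    while n > 0:
        n, chunk = divmod(n, 1000)
        if chunk:
            words = under_thousand(chunk)
            if SCALES[scale]:
                words += " " + SCALES[scale]
            groups.append(words)
        scale += 1
    return ", ".join(reversed(groups))

def speak_number(token):
    # A comma flags a "how many" number; anything else is read digit by digit.
    if "," in token:
        return expand(int(token.replace(",", "")))
    return " ".join(ONES[int(d)] for d in token)
```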
Step #A7:
Search the sentence for internet mail addresses (block 44). These will
contain the at symbol ("@") somewhere within a consecutive group of characters.
There are a limited number of different characters that can be made a part of an
email address. Therefore, any byte that is not a legal address character (such
as a space character) can be used to locate the beginning and end of the address.
The period is pronounced as "dot."
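By way of illustration only, the address search of Step #A7 might be sketched in Python as follows. The set of legal address characters used here is a deliberately simplified assumption, as is the choice to speak the at symbol as "at"; a trailing period is assumed to be sentence punctuation.

```python
import re

# Simplified assumption about which characters may appear in an address.
LEGAL = r"[A-Za-z0-9._%+\-]"
EMAIL_RE = re.compile(rf"{LEGAL}+@{LEGAL}+")

def find_emails(sentence):
    # Any character outside LEGAL marks the beginning or end of the address.
    return [m.group().rstrip(".") for m in EMAIL_RE.finditer(sentence)]

def speak_email(address):
    # The period is pronounced "dot"; "at" for the at symbol is an assumption.
    return address.replace("@", " at ").replace(".", " dot ")
```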
Step #A8:
Search the sentence for Internet Uniform Resource Locator (URL) addresses
(block 46). Unlike email addresses, these will be a bit more difficult to
pin down.
Oftentimes they contain "www." but not always. Sometimes they begin with
"http://" or "ftp://" but not always. Sometimes they end with ".com", ".net" or
".org" but not always (especially when including international addresses). A suitable
implementation obtains the current list of all acceptable URL suffixes, and searches
each group of consecutive characters in the target sentence to see if any of these
groups end with one of the valid suffixes. In most cases where a valid suffix is
found (".com" for example), it is probably safe to assume that, if the byte immediately
preceding the period is acceptable for use in a URL address, the search routine
has actually located part of a valid URL.
Also note that many URLs are listed in some form of their 32-bit address. It
is also common for these numerical URL addresses to contain additional information
designed to fine tune the target location of the URL. The location of a period
in a URL address is spoken aloud and it is pronounced "dot."
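By way of illustration only, the suffix search of Step #A8 might be sketched in Python as follows. The suffix tuple is a hypothetical three-entry subset of the full list, and the sketch only recognizes character groups that end with a suffix, so addresses with a trailing path and numerical addresses are not covered.

```python
import string

# Hypothetical subset; a real system would load the current suffix list.
VALID_SUFFIXES = (".com", ".net", ".org")
URL_CHARS = set(string.ascii_letters + string.digits + "-._~:/?#@!$&'()*+,;=%")

def find_urls(sentence):
    hits = []
    for group in sentence.split():
        candidate = group.rstrip(".,;:!?")   # drop sentence punctuation
        lowered = candidate.lower()
        for suffix in VALID_SUFFIXES:
            if (lowered.endswith(suffix)
                    and len(candidate) > len(suffix)
                    # byte immediately preceding the suffix's period
                    and candidate[-len(suffix) - 1] in URL_CHARS):
                hits.append(candidate)
                break
    return hits

def speak_url(url):
    # Each period in the address is spoken aloud as "dot".
    return url.replace(".", " dot ")
```

Because an email address can also end with a valid suffix, a fuller implementation would run the email search first.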
Step #A9:
If words are discovered that are not a part of the words library, then a syllable
based re-creation of the word will have to be generated as explained elsewhere herein.
Of course, it is appreciated that the example text breakdown steps given herein
do not limit the invention and many modifications may be made to arrive at other
suitable text breakdown techniques. Below, an exemplary inflection selection technique
is described.
Example Inflection Selection (FIG. 5):
Step #B1:
Each and every word in the target sentence is analyzed to obtain three chunks
of information (blocks 162, 164, and 166 of FIG. 5).
First, the syllable count of each word in the target sentence is obtained
(block 162). In FIG. 23 this syllable count is displayed in parentheses
below each word. In a suitable implementation, the syllable count for each word is
determined as the list of to-be-recorded words is created.
Second, the impact value of each word in the target sentence is obtained
(block 164). In FIG. 23 the value that has been assigned to each word is
displayed just above the syllable count. The impact value for each word may be
determined as the list of to-be-recorded words is created.
Determining the impact value (from zero up through two hundred fifty-five
in the example) for each word will be a complex process. In short, the more descriptive
and/or important a word is, the higher will be its assigned impact value. These
values will be used to determine where in a spoken sentence the inflection changes
will take place. The overall objective of this impact value concept is to ensure
that each spoken sentence will have its own unique pattern of natural sounding
inflections, without any need to reference those sentences that precede and follow
the current sentence.
As impact values and syllable counts are obtained while parsing a sentence during
this step, many words will be discovered that do not exist in the current words
library. This means that in addition to having to generate a syllable based representation
of an unknown word, an impact value and syllable count number must also be created
for the newly generated word. Because a valid impact value runs from zero (0) at
the low end to two hundred fifty-five (255) at the upper end, the impact value
for an unknown word can be set to any number in this range, possibly based on the
number of syllables.
For example, an unknown single syllable word might be given an impact value of
one hundred eight (108). An unknown two syllable word might be given an impact
value of one hundred eighteen (118). An unknown three syllable word might be given
an impact value of one hundred twenty-eight (128). An unknown four syllable word
might be given an impact value of one hundred thirty-eight (138).
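By way of illustration only, the example values above might be produced by a simple rule in Python as follows. The function name is hypothetical, and extending the pattern linearly beyond four syllables, capped at 255, is an assumption not stated in the text.

```python
def default_impact(syllables):
    """Default impact value for a word that is not in the words library."""
    # 108 for one syllable, plus 10 per additional syllable (108, 118, 128, 138).
    value = 108 + 10 * (max(1, syllables) - 1)
    # Valid impact values run from 0 through 255.
    return min(255, value)
```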
Third, each word must have a flag set (block 166) if its purpose is
not normally to carry information but rather to serve the needs of a sentence's
structure. Words that serve the needs of a sentence's structure are called glue
words or connective words. For example, "a," "at," "the" and "of" are all examples
of glue or connective words. When the software must determine which audio samples
to use to voice the current sentence, the inflection/pitch values for words flagged
as glue words can freely be adjusted to meet the needs of the surrounding payload
words. Of course, it is appreciated that this step and the remaining steps in the
inflection selection example given herein do not limit the invention and many modifications
may be made to arrive at other suitable inflection mapping techniques. Further,
the inflection maps of FIGS. 4A-C and method of FIG. 5 illustrate the mapping of
words from word database 94 to specific word inflections. However, similar
techniques may be utilized for mapping phrases, syllables, or other items in accordance
with the scalable architecture of embodiments of the present invention. A more
detailed description of glue words is given later herein.
Step #B2:
If the target sentence is only one word in length, then the punctuation the original
writer chose to end the one word sentence will determine how the sentence
is spoken (block 168). In the remaining Step #Bx steps, inflections are
selected for each word from the tables of FIGS. 4A-C. It is appreciated that some
words may be recorded in each and every inflection, while others are recorded in
a limited number of inflections (the closest match would then be chosen). Further,
some embodiments may have several records for a single inflection, with a different
distortion for each record.
For example, if the one word sentence ends with an exclamation point, then a
digitized word from the "Emphatic Inflection Group" (130, FIG. 4B) will
be spoken. If the word contains only one syllable, then "—!H3"
should be used. On the other hand, if the word contains more than one syllable,
then "—!L3" should be used.
If the one word sentence ends with a question mark, then a digitized word from
either the "Single Word Question Inflection Group" (140, FIG. 4C) or the
"Multiple Word Question Inflection Group" (150, FIG. 4C) will be spoken.
If the one word question is anything except "why" then "—?Q3"
should be used. On the other hand, if the word is "why," then "—?S3"
should be used.
If the one word sentence ends with anything else (including a period), then a
digitized word from the "Standard Inflection Group" (120, FIG. 4A) will
be spoken. If the word contains only one syllable, then "—&H3"
should be used. On the other hand, if the word contains more than one syllable,
then "—&L3" should be used.
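By way of illustration only, the one word sentence rules of Step #B2 might be sketched in Python as follows. The function name and argument order are hypothetical, and the inflection codes are written without the leading dash shown in the text.

```python
def one_word_inflection(word, syllables, terminator):
    """Pick the inflection code for a sentence consisting of a single word."""
    if terminator == "!":
        # Emphatic Inflection Group
        return "!H3" if syllables == 1 else "!L3"
    if terminator == "?":
        # Question groups: "why" is the only word treated differently.
        return "?S3" if word.lower() == "why" else "?Q3"
    # Anything else, including a period: Standard Inflection Group
    return "&H3" if syllables == 1 else "&L3"
```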
Step #B3:
For the remainder of this breakdown, the following example sentence will be used:
"A woman in her early twenties sits alone in a small, windowless room at the University
of Hope's LifeFeelings Research Institute in Argentina." (FIG. 23) Please note
that the impact values assigned to the words in FIG. 23 are only examples (as the
sentence is also but an example).
Because each sentence should stand on its own, the sentence is normalized
(block 170). Normalizing is accomplished as follows:
- 1) Evaluate the current sentence to discover the word (or words, if
there is a tie between two or more words) with the largest impact value. In this
example, the word with the largest impact value is "Hope's" with a value of two
hundred twenty-three (223).
- 2) Divide the largest impact value by four (4). In this example, the
result would be fifty-five and seventy-five hundredths (55.75).
- 3) Work through the entire current sentence a word at a time and perform
this calculation: divide the impact value of the current word by the value that
was obtained in item 2). For example, if the word in question is "windowless" (which
in our example has been assigned an impact value of one hundred twenty-one (121)),
then the formula is "121/55.75=2.17".
- 4) This number is then rounded up or down to the closest integer value,
and then it is incremented by one (1). This will leave an integer ranging from
one (1) up through five (5). This final integer is loosely associated with the
five inflection/pitches of FIGS. 4A-C.
FIG. 24 gives a good idea of where each word's inflection/pitch will fall after
this part of the process has been performed.
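By way of illustration only, the four normalizing steps might be sketched in Python as follows. The function name is hypothetical; rounding exact halves upward and mapping an all-zero sentence to level one are assumptions, since the text does not address either case.

```python
def normalize(impacts):
    """Map raw impact values (0..255) to inflection/pitch levels 1..5."""
    largest = max(impacts)          # 1) largest impact value in the sentence
    if largest == 0:
        return [1] * len(impacts)   # assumption: nothing stands out
    unit = largest / 4              # 2) divide the largest value by four
    levels = []
    for value in impacts:
        ratio = value / unit        # 3) divide each word's value by the unit
        # 4) round to the closest integer (halves round up), then add one
        levels.append(int(ratio + 0.5) + 1)
    return levels
```

With the example values, 223 for "Hope's" maps to level five and 121 for "windowless" maps to level three (2.17 rounds to 2, plus one).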
Step #B4:
At this point things become somewhat more complex (block 172). A target
sentence can sound odd if within the sentence, three or more consecutive words
have the same inflection/pitch value. As an exception to this, however, three consecutive
words can sound just fine if the inflection/pitch value in question is a one (1)
or a two (2). Another exception is that in some situations as many as three or
four consecutive (inflection/pitch one [1], two [2] and three [3]) words can sound
acceptable if they lead the sentence.
Furthermore, there should be at least two or three words between any
two words that have an inflection/pitch value of five (5). There should also be
at least one or two words between any two words that have an inflection/pitch value
of four (4).
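By way of illustration only, a check for the spacing rules above might be sketched in Python as follows. The function name and the tuple form of the results are hypothetical. The sketch assumes the lower bound of each stated range (two words between fives, one word between fours), assumes that four or more consecutive ones or twos are flagged, and does not model the exception for words that lead the sentence.

```python
# Minimum number of words required between two words of the same high level.
MIN_GAP = {5: 2, 4: 1}

def find_violations(levels):
    """Return ("run", first, last) and ("gap", first, second) index tuples."""
    problems = []
    # Runs of equal levels: up to three allowed for levels 1 and 2, else two.
    run_start = 0
    for i in range(1, len(levels) + 1):
        if i == len(levels) or levels[i] != levels[run_start]:
            limit = 3 if levels[run_start] in (1, 2) else 2
            if i - run_start > limit:
                problems.append(("run", run_start, i - 1))
            run_start = i
    # Gaps between repeated occurrences of level four or level five.
    last_seen = {}
    for i, level in enumerate(levels):
        if level in MIN_GAP:
            if level in last_seen and i - last_seen[level] - 1 < MIN_GAP[level]:
                problems.append(("gap", last_seen[level], i))
            last_seen[level] = i
    return problems
```

Each reported index pair identifies where a sentence would need to be remodulated.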
This is where the original impact values assigned to each word can again become
useful. Because Step #B3 causes a kind of loss of resolution regarding the
impact values, these original values can be helpful when trying to jam an inflection/pitch
wedge between two words.
In order to make certain that these rules are not broken, it will oftentimes become
necessary to remodulate a sentence using the original impact values as a guide.
If a word's inflection/pitch value must be changed, it will usually require that
changes be made no