Title: Speech processing system
Abstract: A speech processing system is provided which is operable to receive sets of signal values representative of a speech signal generated by a speech source. The system is operable to determine a measure of the quality of the speech signal by performing a statistical analysis of the received sets of signal values. The system stores data defining a predetermined function derived from a signal model which models the speech source and which defines a probability density function which gives, for a given set of model parameters, the probability that the signal model has those model parameters given that the signal model is assumed to have generated the received set of signal values. The system applies a current set of received signal values to the stored probability density function and then draws samples from it using a Gibbs sampler. The system then analyses the samples to determine a measure of the variance of some of the samples and then outputs a signal indicative of the quality of the received speech signal values in dependence upon the determined variance.
Patent Number: 7,010,483 Issued on 03/07/2006 to Rajan
Inventors: Rajan; Jebu Jacob (Bracknell, GB)
Assignee: Canon Kabushiki Kaisha (Tokyo, JP)
Appl. No.: 866854
Filed: May 30, 2001
Foreign Application Priority Data
Jun 02, 2000 [GB] 0013541
Aug 17, 2000 [GB] 0020314
Current U.S. Class: 704/228; 704/233; 704/240
Current Intern'l Class: G10L 15/20 (20060101); G10L 21/02 (20060101)
Field of Search: 704/226,227,228,233,234,240; 714/746
References Cited
U.S. Patent Documents
| 4386237 | May., 1983 | Virupaksha et al.
| 4811399 | Mar., 1989 | Landell et al.
| 4860360 | Aug., 1989 | Boggs.
| 4905286 | Feb., 1990 | Sedgwick et al.
| 5012518 | Apr., 1991 | Liu et al.
| 5315538 | May., 1994 | Borrell et al.
| 5325397 | Jun., 1994 | Scholz et al.
| 5432859 | Jul., 1995 | Yang et al.
| 5432884 | Jul., 1995 | Kapanen et al.
| 5507037 | Apr., 1996 | Bartkowiak et al.
| 5611019 | Mar., 1997 | Nakatoh et al.
| 5742694 | Apr., 1998 | Eatwell.
| 5784297 | Jul., 1998 | O'Brien, Jr. et al.
| 5799276 | Aug., 1998 | Komissarchik et al.
| 5873076 | Feb., 1999 | Barr et al.
| 5884255 | Mar., 1999 | Cox.
| 5884269 | Mar., 1999 | Cellier et al.
| 5963901 | Oct., 1999 | Vähätalo.
| 6018317 | Jan., 2000 | Dogan et al.
| 6044336 | Mar., 2000 | Marmarelis et al.
| 6134518 | Oct., 2000 | Cohen et al.
| 6157909 | Dec., 2000 | Mauuary et al.
| 6215831 | Apr., 2001 | Nowack et al.
| 6226613 | May., 2001 | Turin.
| 6266633 | Jul., 2001 | Higgins et al.
| 6324502 | Nov., 2001 | Handel et al.
| 6374221 | Apr., 2002 | Haimi-Cohen.
| 6377919 | Apr., 2002 | Burnett et al.
| 6397181 | May., 2002 | Li et al.
| 6438513 | Aug., 2002 | Pastor et al.
| 6516090 | Feb., 2003 | Lennon et al.
| 6546515 | Apr., 2003 | Vary et al.
| 6549854 | Apr., 2003 | Malinverno et al.
| 6708146 | Mar., 2004 | Sewall et al.
| 6760699 | Jul., 2004 | Weerackody et al.
| 6879952 | Apr., 2005 | Acero et al.
Foreign Patent Documents
| 0 554 083 | Aug., 1993 | EP.
| 0 631 402 | Dec., 1994 | EP.
| 0 674 306 | Sep., 1995 | EP.
| 0 952 589 | Oct., 1999 | EP.
| 0 996 112 | Apr., 2000 | EP.
| 1 022 583 | Jul., 2000 | EP.
| 1 160 768 | Dec., 2001 | EP.
| 1 034 441 | Apr., 2003 | EP.
| 2 137 052 | Sep., 1984 | GB.
| 2 332 054 | Jun., 1999 | GB.
| 2 332 055 | Jun., 1999 | GB.
| 2 345 967 | Jul., 2000 | GB.
| 2 349 717 | Nov., 2000 | GB.
| 2 356 106 | May., 2001 | GB.
| 2 356 107 | May., 2001 | GB.
| 2 356 313 | May., 2001 | GB.
| 2 356 314 | May., 2001 | GB.
| 2 360 670 | Sep., 2001 | GB.
| 2 361 339 | Oct., 2001 | GB.
| 2 363 557 | Dec., 2001 | GB.
| 2001-44926 | Feb., 2001 | JP.
| WO 92/2289/1 | Dec., 1992 | WO.
| WO 98/3863/1 | Sep., 1998 | WO.
| WO 99/2876/0 | Jun., 1999 | WO.
| WO 99/2876/1 | Jun., 1999 | WO.
| WO 99/6488/7 | Dec., 1999 | WO.
| WO 00/1165/0 | Mar., 2000 | WO.
| WO 00/3817/9 | Jun., 2000 | WO.
| WO 00/4537/5 | Aug., 2000 | WO.
| WO 00/5416/8 | Sep., 2000 | WO.
Other References
Quatieri et al., "Magnitude-only estimation of handset nonlinearity with application
to speaker recognition," Proceedings of the 1998 IEEE International Conference
on Acoustics, Speech, and Signal Processing, May 12-15, 1998, vol. 2, pp. 745 to 748.
Numerical Recipes in C by W. Press, et al., Chapter 7, Cambridge University Press (1992).
"Reversible jump Markov chain Monte Carlo Computation and Bayesian model determination"
by Peter Green, Biometrika, vol. 82, pp. 711-732 (1995).
"The Simulation Smoother For Time Series Models", Biometrika, vol. 82, 2, pp.
339-350 (1995).
"Probabilistic inference using Markov chain Monte Carlo methods" by R. Neal.
Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto (1993).
"Fundamentals of Speech Recognition," Rabiner, et al., Prentice Hall, Englewood
Cliffs, New Jersey, pp. 115 and 116, 1993.
"Bayesian Separation and Recovery of Convolutively Mixed Autoregressive Sources",
Godsill, et al., ICASSP, Mar. 1999.
"Statistical Properties of STFT Ratios for Two Channel Systems and Application
to Blind Source Separation", Balan, et al., Siemens Corporate Research, Princeton,
N, pp. 429-434.
"Bayesian Approach to Parameter Estimation and Interpolation of Time-Varying Autoregressive
Processes Using the Gibbs Sampler", Rajan, et al., IEE Proc.-Vis. Image Signal Process.,
vol. 44, No. 4, Aug. 1997, pp. 249-256.
"An Introduction to the Kalman Filter", Welch, et al., Dept. of Computer Science,
University of North Carolina at Chapel Hill, NC, Sep. 1997.
"Query Expansion for Imperfect Speech: Applications In Distributed Learning",
Srinivasan, et al., Proc. IEEE Workshop on Content-based Access of Image and Video
Libraries, 2000, pp. 50-54.
Couvreur, et al., "Wavelet-based Non-Parametric HMM's: Theory and Applications,"
Proc. International Conference Acoustics, Speech and Signal Processing, Istanbul,
vol. 1, Jun. 5-9, 2000, pp. 604-607.
Hopgood, et al., "Bayesian Single Channel Blind Deconvolution Using Parametric
Signal and Channel Models," Proc. IEEE Workshop on Applications of Signal Processing
to Audio and Acoustics, New Paltz, NY, Oct. 17-20, 1999, pp. 151-154.
Andrieu, et al., "Bayesian Blind Marginal Separation of Convolutively Mixed Discrete
Sources," IEEE Proc., 1998, pp. 43-52.
Primary Examiner: Lerner; Martin
Attorney, Agent or Firm: Fitzpatrick Cella Harper & Scinto
Claims
The invention claimed is:
1. An apparatus for determining a quality measure indicative of the quality of
a speech signal, the apparatus comprising:
a receiver operable to receive a set of speech signal values representative of
a speech signal generated by a speech source as distorted by a transmission channel
between the speech source and the receiver;
a memory operable to store a predetermined function which includes a first part
having first parameters which models said source and a second part having second
parameters which models said channel and which gives, for a given set of speech
signal values, a probability density for parameters of a predetermined speech model
which is assumed to have generated the set of speech signal values, the probability
density defining, for a given set of model parameter values, the probability that
the predetermined speech model has those parameter values, given that the model
is assumed to have generated the set of speech signal values;
an applicator operable to apply the set of received speech signal values to said
stored function to give the probability density for said model parameters for the
set of received speech signal values;
a processor operable to process said function with said set of received speech
signal values applied, to derive samples of at least said first parameters from
said probability density;
an analyser operable to analyse at least some of said derived samples of said
at least first parameters to determine a quality measure indicative of the quality
of the received speech signal values; and
an output operable to output values of said first parameters that are representative
of said speech signal generated by said speech source before it was distorted by
said transmission channel.
2. An apparatus according to claim 1, wherein said analyser is operable to determine
a measure of the variance of said at least some of said derived samples of said
at least first parameters to determine said quality measure.
3. An apparatus according to claim 2, wherein said probability density function
is in terms of said variance measure and wherein said processor is operable to
draw samples of said variance measure from said probability density function.
4. An apparatus according to claim 3, wherein said processor comprises a Gibbs sampler.
5. An apparatus according to claim 3, wherein said analyser is operable to determine
a histogram of said drawn samples and wherein said quality measure is determined
using said histogram.
6. An apparatus according to claim 5, wherein said analyser is operable to determine
said quality measure using a weighted sum of said drawn samples, and wherein the
weighting for each sample is determined from said histogram.
7. An apparatus according to claim 1, wherein said processor is operable to draw
samples iteratively from said probability density function.
8. An apparatus according to claim 1, wherein said receiver is operable to receive
a sequence of sets of speech signal values representative of an input speech signal
and wherein said applicator, processor and analyser are operable to perform their
respective functions with respect to each set of received speech signal values
to determine a quality measure for each set of received signal values.
9. An apparatus according to claim 8, wherein said processor is operable to use
the values of parameters obtained during the processing of a preceding set of signal
values as initial estimates for the values of the corresponding parameters for
a current set of signal values being processed.
10. An apparatus according to claim 8, wherein said sets of signal values in
said sequence are non-overlapping.
11. An apparatus according to claim 1, wherein said speech model comprises an
auto-regressive process model and wherein said parameters include auto-regressive
model coefficients.
12. An apparatus according to claim 1, wherein said speech signal model includes
a noise model having a noise parameter and wherein said quality measure is determined
using said noise parameter.
13. An apparatus according to claim 1, wherein said processor is operable to
determine a histogram of said derived samples and wherein said values of said first
parameters are determined from said histogram.
14. An apparatus according to claim 13, wherein said processor is operable to
determine said values of said first parameters using a weighted sum of said derived
samples, and wherein the weighting for each sample is determined from said histogram.
15. An apparatus according to claim 1, wherein said processor is operable to
derive samples of said second parameters and wherein said analyser is operable
to determine said quality measure using the derived samples of said second parameters.
16. An apparatus according to claim 1, wherein said function is in terms of a
set of raw speech signal values representative of speech generated by said source
before being distorted by said transmission channel, wherein the apparatus further
comprises a second processor operable to process the received set of signal values
with initial estimates of said first and second parameters, to generate an estimate
of the raw speech signal values corresponding to the received set of signal values
and wherein said applicator is operable to apply said estimated set of raw speech
signal values to said function in addition to said set of received signal values.
17. An apparatus according to claim 16, wherein said second processor comprises
a simulation smoother.
18. An apparatus according to claim 16, wherein said second processor comprises
a Kalman filter.
19. An apparatus according to claim 1, wherein said second part is a moving average
model and said second parameters comprise moving average model coefficients.
20. An apparatus according to claim 1, further comprising a comparator responsive
to said quality measure and operable to compare signals representative of the received
speech signal with prestored models, to generate a comparison result.
21. An apparatus according to claim 20, wherein said signals representative of
the speech signal are derived from said stored function.
22. An apparatus according to claim 1, further comprising an encoder operable
to encode signals representative of the speech signal in dependence upon the output
quality measure.
23. An apparatus for generating annotation data for use in annotating a data
file, the apparatus comprising:
a receiver operable to receive a speech annotation;
an apparatus according to claim 1 for generating a quality measure indicative
of the quality of the received speech annotation; and
a generator operable to generate annotation data using data representative of
the received speech annotation and said quality measure.
24. An apparatus according to claim 23, further comprising a speech recogniser
operable to process the speech annotation to identify words and/or phonemes within
the speech annotation, wherein said annotation data comprises data identifying
said words and/or phonemes.
25. An apparatus according to claim 1.
26. An apparatus according to claim 25, wherein said annotation data defines
a phoneme and word lattice.
27. An apparatus for searching a database comprising a plurality of information
entries to identify information to be retrieved therefrom, each of said plurality
of information entries having an associated annotation and a quality measure indicative
of the quality of the annotation;
a receiver operable to receive an input speech query;
an apparatus according to claim 1 for processing said input speech query to generate
a quality measure therefor; and
a comparator operable to compare data representative of the input speech query
with said annotations in dependence upon the quality measure of said input speech
query and the corresponding quality measures of said annotations.
28. An apparatus for searching a database comprising a plurality of annotations
which include annotation data and a quality measure indicative of the quality of
an annotation used to generate the annotation data, the apparatus comprising:
means for receiving an input audio query;
means for determining a quality measure for the input audio query; and
means for comparing data representative of said input query with the annotation
data of one or more of said annotations in dependence upon the quality measure
for said input query and the corresponding quality measure for the annotation.
29. An apparatus according to claim 28, wherein said data representative of said
input query and said annotation data comprise word and/or phoneme data.
30. An apparatus according to claim 28, wherein said comparing means is operable
to compare said query data with said annotation data using a first comparison technique
if both said quality measures exceed a predetermined threshold and is operable
to compare said query data with said annotation data using a second comparison
technique if either or both of said quality measures are below said predetermined threshold.
31. A method of determining a quality measure indicative of the quality of a
speech signal, the method comprising the steps of:
receiving, at a receiver, a set of speech signal values representative of a speech
signal generated by a speech source as distorted by a transmission channel between
the speech source and the receiver;
storing a predetermined function which includes a first part having first parameters
which models said source and a second part having second parameters which models
said channel and which gives, for a given set of speech signal values, a probability
density for parameters of a predetermined speech model which is assumed to have
generated the set of speech signal values, the probability density defining, for
a given set of model parameter values, the probability that the predetermined speech
model has those parameter values, given that the model is assumed to have generated
the set of speech signal values;
applying the set of received speech signal values to said stored function to
give the probability density for said model parameters for the set of received
speech signal values;
processing said function with said set of received speech signal values applied,
to derive samples of at least said first parameters from said probability density;
analysing at least some of said derived samples of said at least first parameters
to determine a quality measure indicative of the quality of the received speech
signal values; and
outputting values of said first parameters that are representative of said speech
signal generated by said speech source before it was distorted by said transmission channel.
32. A method according to claim 31, wherein said analysing step determines a
measure of the variance of said at least some of said derived samples of said at
least first parameters in determining said quality measure.
33. A method according to claim 32, wherein said probability density function
is in terms of said variance measure and wherein said processing step draws samples
of said variance measure from said probability density function.
34. A method according to claim 33, wherein said processing step uses a Gibbs sampler.
35. A method according to claim 33, wherein said analysing step determines a
histogram of said drawn samples and wherein said quality measure is determined
using said histogram.
36. A method according to claim 35, wherein said analysing step determines said
quality measure using a weighted sum of said drawn samples, and wherein the weighting
for each sample is determined from said histogram.
37. A method according to claim 31, wherein said processing step draws samples
iteratively from said probability density function.
38. A method according to claim 31, wherein said receiving step receives a sequence
of sets of speech signal values representative of an input speech signal and wherein
said applying step, processing step, and analysing step are performed with respect
to each set of received speech signal values to determine a quality measure for
each set of received signal values.
39. A method according to claim 38, wherein said processing step uses the values
of parameters obtained during the processing of a preceding set of signal values
as initial estimates for the values of the corresponding parameters for a current
set of signal values being processed.
40. A method according to claim 38, wherein said sets of signal values in said
sequence are non-overlapping.
41. A method according to claim 31, wherein said speech model comprises an auto-regressive
process model and wherein said parameters include auto-regressive model coefficients.
42. A method according to claim 31, wherein said speech signal model includes
a noise model having a noise parameter and wherein said quality measure is determined
using said noise parameter.
43. A method according to claim 31, wherein said processing step determines a
histogram of said derived samples and wherein said values of said first parameters
are determined from said histogram.
44. A method according to claim 43, wherein said processing step determines said
values of said first parameters using a weighted sum of said derived samples, and
wherein the weighting for each sample is determined from said histogram.
45. A method according to claim 31, wherein said processing step derives samples
of said second parameters and wherein said analysing step determines said quality
measure using the derived samples of said second parameters.
46. A method according to claim 31, wherein said function is in terms of a set
of raw speech signal values representative of speech generated by said source before
being distorted by said transmission channel, wherein the method further comprises
a second processing step of processing the received set of signal values with initial
estimates of said first and second parameters, to generate an estimate of the raw
speech signal values corresponding to the received set of signal values and wherein
said applying step applies said estimated set of raw speech signal values to said
function in addition to said set of received signal values.
47. A method according to claim 46, wherein said second processing step uses
a simulation smoother.
48. A method according to claim 46, wherein said second processing step uses
a Kalman filter.
49. A method according to claim 31, wherein said second part is a moving average
model and said second parameters comprise moving average model coefficients.
50. A method according to claim 31, further comprising a step of comparing signals
representative of the received speech signal with prestored models to generate
a comparison result and wherein said comparing step is responsive to said quality measure.
51. A method according to claim 50, wherein said signals representative of the
speech signal are derived from said stored function.
52. A method according to claim 31, further comprising a step of encoding signals
representative of the speech signal in dependence upon the output quality measure.
53. A method of generating annotation data for use in annotating a data file,
the method comprising the steps of:
receiving a speech annotation;
performing the method according to claim 31 to generate a quality measure indicative
of the quality of the received speech annotation; and
generating annotation data using data representative of the received speech annotation
and said quality measure.
54. A method according to claim 53, further comprising a step of using a speech
recognition unit to process the speech annotation to identify words and/or phonemes
within the speech annotation, wherein said annotation data comprises said words
and/or phonemes.
55. A method according to claim 31.
56. A method according to claim 55, wherein said annotation data defines a phoneme
and word lattice.
57. A method of searching a database comprising a plurality of information entries
to identify information to be retrieved therefrom, each of said plurality of information
entries having an associated annotation and a quality measure indicative of the
quality of the annotation, the method comprising the steps of:
receiving an input speech query;
using the method according to claim 31 to process said input speech query to
generate a quality measure therefor; and
comparing data representative of the input speech query with said annotations
in dependence upon the quality measure of said input speech query and the corresponding
quality measures of said annotations.
58. A computer readable medium storing computer executable process steps to cause
a programmable computer apparatus to perform the method according to claim 31.
59. Processor implementable process steps for causing a programmable computing
device to perform the method according to claim 31.
60. A method of searching a database comprising a plurality of annotations which
include annotation data and a quality measure indicative of the quality of an annotation
used to generate the annotation data, the method comprising the steps of:
receiving an input audio query;
determining a quality measure for the input audio query; and
comparing data representative of said input query with the annotation data of
one or more of said annotations in dependence upon the quality measure for said
input query and the corresponding quality measure for the annotation.
61. A method according to claim 60, wherein said data representative of said
input query and said annotation data comprise word and/or phoneme data.
62. A method according to claim 60, wherein said comparing step compares said
query data with said annotation data using a first comparison technique if both
said quality measures exceed a predetermined threshold and compares said query
data with said annotation data using a second comparison technique if either or
both of said quality measures are below said predetermined threshold.
63. An apparatus for determining a quality measure indicative of the quality
of a speech signal, the apparatus comprising:
means for receiving a set of speech signal values representative of a speech
signal generated by a speech source as distorted by a transmission channel between
the speech source and the receiving means;
a memory for storing a predetermined function which includes a first part having
first parameters which models said source and a second part having second parameters
which models said channel and which gives, for a given set of speech signal values,
a probability density for parameters of a predetermined speech model which is assumed
to have generated the set of speech signal values, the probability density defining,
for a given set of model parameter values, the probability that the predetermined
speech model has those parameter values, given that the model is assumed to have
generated the set of speech signal values;
means for applying the set of received speech signal values to said stored function
to give the probability density for said model parameters for the set of received
speech signal values;
means for processing said function with said set of received speech signal values
applied, to derive samples of at least said first parameters from said probability density;
means for analysing at least some of said derived samples of said at least first
parameters to determine a quality measure indicative of the quality of the received
speech signal values; and
means for outputting values of said first parameters that are representative
of said speech signal generated by said speech source before it was distorted by
said transmission channel.
64. An apparatus for generating annotation data for use in annotating a data
file, the apparatus comprising:
means for receiving a speech annotation;
an apparatus according to claim 63 for generating a quality measure indicative
of the quality of the received speech annotation; and
means for generating annotation data using data representative of the received
speech annotation and said quality measure.
65. An apparatus for searching a database comprising a plurality of information
entries to identify information to be retrieved therefrom, each of said plurality
of information entries having an associated annotation and a quality measure indicative
of the quality of the annotation;
means for receiving an input speech query;
an apparatus according to claim 63 for processing said input speech query to
generate a quality measure therefor; and
means for comparing data representative of the input speech query with said annotations
in dependence upon the quality measure of said input speech query and the corresponding
quality measures of said annotations.
Description
The present invention relates to an apparatus for and method of determining a
quality measure indicative of the quality of an audio signal. The invention particularly
relates to a statistical processing of an input speech signal to derive this quality measure.
Being able to provide a measure of the quality of an input speech signal is
beneficial in a number of systems. For example, it can be used to control the way
in which data files may be retrieved from a database or the way in which the speech
signal may be encoded for onward transmission. The speech quality measure may also
be used to control the recognition processing operation in, for example, a speech recognition system.
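As an illustration of how a quality measure might control retrieval, the selection rule recited in claims 30 and 62 can be sketched as follows. This is a minimal sketch only: the function name `choose_comparison`, the string labels and the treatment of a quality measure exactly equal to the threshold are assumptions made for the example and are not taken from the specification.

```python
def choose_comparison(query_quality: float,
                      annotation_quality: float,
                      threshold: float) -> str:
    """Pick a comparison technique from two quality measures.

    The first technique is used only when BOTH quality measures exceed
    the threshold; otherwise the second technique is used.
    Assumption: a measure exactly equal to the threshold is treated as
    not exceeding it.
    """
    if query_quality > threshold and annotation_quality > threshold:
        return "first"
    return "second"


if __name__ == "__main__":
    # Hypothetical quality values, for illustration only.
    print(choose_comparison(0.9, 0.8, threshold=0.5))
    print(choose_comparison(0.9, 0.2, threshold=0.5))
```

Keeping the rule in a single function means the threshold can be tuned without touching either comparison technique.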
The prior art techniques for determining a quality measure of a speech signal
rely on comparing the speech signal with a "clean" reference signal. These techniques
are also done off-line and are not suited to real-time speech quality determination.
One aim of the present invention is to provide an alternative technique for determining
a measure of the quality of an input speech signal. In one embodiment, the determined
quality measure is indicative of the signal to noise ratio for the input speech signal.
According to one aspect, the present invention provides an apparatus for
determining a quality measure indicative of the quality of an audio signal, the
apparatus comprising: a memory for storing a predetermined function which gives
a probability density for parameters of a predetermined audio model which is assumed
to have generated a set of received audio signal values; means for receiving a
set of audio signal values representative of an input audio signal; means for applying
a set of received audio signal values to the stored function to give the probability
density for the model parameters; means for processing the function with said set
of received audio signal values applied to derive samples of parameter values from
said probability density; and means for analysing at least some of said derived
samples of parameter values to determine a signal indicative of the quality of
the received audio signal values.
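The sequence of applying received values to a stored density, drawing samples of the parameters and analysing those samples can be illustrated with a deliberately simplified sketch. The code below assumes a first-order auto-regressive model with no channel, a flat prior on the coefficient and a 1/sigma2 prior on the noise variance. These choices, and the names `gibbs_ar1`, `coefficient_spread` and `histogram_weighted_mean`, are assumptions made for illustration and are not the model or procedure of the embodiment. The histogram weighting (each sample weighted by the count of its bin) is likewise only one possible reading of claims 5 and 6.

```python
import random
import statistics


def gibbs_ar1(signal, n_iter=500, burn_in=100, seed=0):
    """Gibbs sampler for s[t] = a*s[t-1] + e[t], with e[t] ~ N(0, sigma2).

    Returns a list of (a, sigma2) samples drawn after the burn-in period.
    """
    rng = random.Random(seed)
    prev, curr = signal[:-1], signal[1:]
    sum_pp = sum(p * p for p in prev)
    sum_pc = sum(p * c for p, c in zip(prev, curr))
    m = len(curr)
    a, sigma2 = 0.0, 1.0  # initial estimates (assumed values)
    samples = []
    for it in range(n_iter):
        # a | sigma2, data ~ Normal(sum_pc/sum_pp, sigma2/sum_pp)
        a = rng.gauss(sum_pc / sum_pp, (sigma2 / sum_pp) ** 0.5)
        # sigma2 | a, data ~ Inverse-Gamma(m/2, residual_ss/2),
        # drawn as the reciprocal of a Gamma(shape=m/2, scale=2/residual_ss)
        residual_ss = sum((c - a * p) ** 2 for p, c in zip(prev, curr))
        sigma2 = 1.0 / rng.gammavariate(m / 2.0, 2.0 / residual_ss)
        if it >= burn_in:
            samples.append((a, sigma2))
    return samples


def coefficient_spread(samples):
    """Sample variance of the drawn AR coefficient values."""
    return statistics.variance([a for a, _ in samples])


def histogram_weighted_mean(values, n_bins=20):
    """Weighted mean where each sample's weight is its histogram bin count."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all values equal
    counts = [0] * n_bins
    bins = []
    for v in values:
        b = min(int((v - lo) / width), n_bins - 1)
        bins.append(b)
        counts[b] += 1
    weights = [counts[b] for b in bins]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


if __name__ == "__main__":
    # Synthetic test signal; the coefficient 0.8 is an arbitrary choice.
    rng = random.Random(1)
    sig = [0.0]
    for _ in range(400):
        sig.append(0.8 * sig[-1] + rng.gauss(0.0, 1.0))
    draws = gibbs_ar1(sig)
    print("spread of coefficient samples:", coefficient_spread(draws))
    print("weighted estimate:", histogram_weighted_mean([a for a, _ in draws]))
```

In this sketch the spread of the drawn samples is the quantity that would feed a quality measure; how that spread is mapped to a quality value is left open here.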
In one embodiment the audio model comprises an auto-regressive (AR) part which
models speech and a moving average (MA) part which models the channel between the
speech source and the receiver; and wherein the speech quality measure is derived
from parameters of at least one of those parts. For example, the speech quality
measure may be derived from the AR parameter values or from the MA parameter values.
Alternatively, it may be determined from the variance of some of these parameter values.
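The two-part model can be pictured with a short simulation, given here as a sketch under stated assumptions: the source is an AR process driven by Gaussian process noise, the channel is an MA filter applied to the source output, and Gaussian measurement noise is added at the receiver. The function name `simulate_observation`, the Gaussian noise assumption and the default parameter values are illustrative choices, not details taken from this passage.

```python
import random


def simulate_observation(ar_coeffs, ma_coeffs, n,
                         process_std=1.0, noise_std=0.1, seed=0):
    """Generate raw speech s and received signal y of length n.

    AR part (source):  s[t] = sum_i ar_coeffs[i-1] * s[t-i] + e[t]
    MA part (channel): y[t] = sum_j ma_coeffs[j] * s[t-j] + eps[t]
    Samples before t = 0 are taken to be zero.
    """
    rng = random.Random(seed)
    s = []
    for t in range(n):
        value = rng.gauss(0.0, process_std)  # process noise e[t]
        for i, a in enumerate(ar_coeffs, start=1):
            if t - i >= 0:
                value += a * s[t - i]
        s.append(value)
    y = []
    for t in range(n):
        value = rng.gauss(0.0, noise_std)  # measurement noise eps[t]
        for j, h in enumerate(ma_coeffs):
            if t - j >= 0:
                value += h * s[t - j]
        y.append(value)
    return s, y


if __name__ == "__main__":
    # Arbitrary example coefficients: a 2nd-order source, a 2-tap channel.
    raw, received = simulate_observation([0.6, -0.2], [1.0, 0.4], n=8)
    for r, o in zip(raw, received):
        print(f"{r:+.3f} -> {o:+.3f}")
```

Estimating the AR and MA coefficients from `y` alone is the inverse of this simulation, which is why the parameters of either part can carry information about signal quality.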
Exemplary embodiments of the present invention will now be described with
reference to the accompanying drawings in which:
FIG. 1 is a schematic view of a computer which may be programmed to operate
in accordance with an embodiment of the present invention;
FIG. 2 is a block diagram illustrating the principal components of a data file
annotation system;
FIG. 3 is a schematic diagram of a word and phoneme lattice for an example audio
string input by a user;
FIG. 4 is a block diagram illustrating the principal components of a data file
retrieval system;
FIG. 5a is a flow diagram illustrating part of the flow control during
a retrieval operation using the system shown in FIG. 4;
FIG. 5b is a flow diagram illustrating the remaining part of the flow
control of the retrieval system shown in FIG. 4;
FIG. 6 is a block diagram representing a model employed by a statistical analysis
unit which forms part of the data file annotation system shown in FIG. 2 and the
data file retrieval system shown in FIG. 4;
FIG. 7 is a flow chart illustrating the processing steps performed by a model
order selection unit forming part of the statistical analysis unit shown in FIGS.
2 and 4;
FIG. 8 is a flow chart illustrating the main processing steps employed by a
Simulation Smoother which forms part of the statistical analysis unit shown in
FIGS. 2 and 4;
FIG. 9 is a block diagram illustrating the main processing components of the
statistical analysis unit shown in FIGS. 2 and 4;
FIG. 10 is a memory map illustrating the data that is stored in a memory which
forms part of the statistical analysis unit shown in FIGS. 2 and 4;
FIG. 11 is a flow chart illustrating the main processing steps performed by
the statistical analysis unit shown in FIG. 9;
FIG. 12a is a histogram for a model order of an auto regressive filter
model which forms part of the model shown in FIG. 6;
FIG. 12b is a histogram for the variance of process noise modelled by
the model shown in FIG. 6;
FIG. 12c is a histogram for a third coefficient of the AR filter model;
FIG. 13 is a block diagram illustrating the main components of an alternative
data annotation system; and
FIG. 14 is a schematic block diagram illustrating the form of a user terminal
which is operable to retrieve a data file from a database located within a remote
server in response to an input voice query.
Embodiments of the present invention can be implemented on computer hardware,
but the embodiment to be described is implemented in software which is run in conjunction
with processing hardware such as a personal computer, workstation, photocopier,
facsimile machine or the like.
FIG. 1 shows a personal computer (PC) 1 which may be programmed to operate
an embodiment of the present invention. A keyboard 3, a pointing device 5,
a microphone 7 and a telephone line 9 are connected to the PC 1 via an
interface 11. The keyboard 3 and pointing device 5 allow the system to be
controlled by a user. The microphone 7 converts the acoustic speech signal
of the user into an equivalent electrical signal and supplies this to the
PC 1 for processing. An internal modem and speech receiving circuit (not
shown) may be connected to the telephone line 9 so that the PC 1 can
communicate with, for example, a remote computer or with a remote user.
The program instructions which make the PC 1 operate in accordance with
the present invention may be supplied for use with an existing PC 1 on,
for example, a storage device such as a magnetic disc 13, or by downloading
the software from the Internet (not shown) via the internal modem and telephone
line 9.
Data File Annotation
The operation of a data file annotation system embodying the present invention
will now be described with reference to FIG. 2. The system shown in FIG. 2 allows
a user to add a voice annotation to a data file 91 for use in subsequent
voice retrieval operations. In use, the user selects a data file to be annotated
(which can be any kind of data file such as a video file, an audio file, a multi-media
file or the like). The user then speaks the voice annotation towards microphone 7.
Corresponding electrical signals output from the microphone 7 are then filtered
by a filter 15 which removes unwanted frequencies (in this embodiment frequencies
above 8 kHz) from the input signal. The filtered signal is then sampled (at a rate
of 16 kHz) and digitised by an analogue to digital converter 17. The digitised
speech samples are then stored in a buffer 19.
Sequential blocks (or frames) of speech samples are then passed from the buffer 19
to a statistical analysis unit 21 which performs a statistical
analysis of each frame of speech samples in sequence to determine a set of auto
regressive (AR) coefficients representative of the speech within the frame and
a measure of the quality of the input speech. In this embodiment, the quality measure
is the variance of the AR coefficients.
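By way of illustration only, the variance-based quality measure may be sketched as follows, assuming that the statistical analysis unit supplies a number of sampled AR coefficient vectors of equal length for a frame. The function names, the use of the population variance and the averaging of the per-coefficient variances into a single figure are assumptions made for this sketch and are not taken from the embodiment.

```python
from statistics import pvariance


def ar_coefficient_variances(samples):
    """Return the variance of each AR coefficient across the drawn samples.

    `samples` is a list of coefficient vectors, one per draw, all assumed
    to have the same length (i.e. a fixed AR model order).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    # Transpose so that each column holds every draw of one coefficient.
    return [pvariance(column) for column in zip(*samples)]


def quality_measure(samples):
    """Summarise the per-coefficient variances as a single number.

    Assumption: the mean of the variances is used as the summary.
    """
    variances = ar_coefficient_variances(samples)
    return sum(variances) / len(variances)
```

A smaller value returned by `quality_measure` would then correspond to less spread in the sampled coefficients for that frame.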
The quality measure is output to a speech quality assessor 93 and the
AR coefficients are output to a speech recognition unit 97. The speech recognition
unit 97 compares the AR coefficients for successive frames of speech with
a set of stored speech models (not shown), which may be template based or Hidden
Markov model based, to generate a recognition result. In this embodiment, the speech
recognition unit 97 outputs words and phonemes corresponding to the spoken
annotation input by the user. As shown in FIG. 2, the output words and phonemes
are input to a data file annotation unit 99 which also receives an assessment
of the speech quality output by the speech quality assessor 93. In this
embodiment, the speech quality assessor 93 determines whether or not the
input speech is of a high quality (i.e. not disturbed by high levels of background
noise) based on the variance data received from the statistical analysis unit 21.
In particular, the variance of the AR coefficients should be smaller when the speech
input is of a high quality than when there are high levels of noise. The data file
annotation unit 99 then generates an annotation for the data file 91
from the words and phonemes output by the speech recognition unit 97 and
the speech quality assessment output by the speech quality assessor 93.
The data file 91 is then stored in the data file database 101 and
the corresponding annotation data is stored in the annotation database 103.
As those skilled in the art will appreciate, the speech quality assessment which
is stored with the annotation data is useful for subsequent retrieval operations.
In particular, when the user wishes to retrieve a data file 91 from the
database 101 (using a voice query), it is useful to know the quality of
the speech that was used to annotate the data file and/or the quality of the voice
query used to retrieve the data file, since this will affect the retrieval performance.
More specifically, if the voice annotation is of a high quality and the user's
voice query is also of a high quality, then a stringent search of the annotation
database 103 should be performed, in order to reduce the number of false
identifications. In contrast, if the original voice annotation is of a low quality
or if the user's voice query is of a low quality, then a less stringent search
of the annotation database 103 should be performed so that there is a greater
chance of retrieving the correct data file 91. The way in which this search
is carried out will be described in more detail below.
In this embodiment, the phoneme and word annotation data for a data file is stored
in the annotation database 103 as a phoneme and word lattice. FIG. 3 schematically
illustrates the form of the word and phoneme lattice generated for the spoken annotation
"picture of the Taj Mahal". As shown, the word and phoneme lattice identifies a
number of different phoneme and word strings which correspond to this spoken utterance.
The phoneme and word lattice is an acyclic directed graph with a single entry point
and a single exit point. It represents different parses of the spoken annotation.
It is not simply a sequence of words with alternatives, since each word does not
have to be replaced by a single alternative: one word can be substituted for two
or more words or phonemes, and the whole structure can form a substitution for one
or more words or phonemes. As those skilled in the art of speech recognition will
realise, the use of phoneme data in addition to word data is more robust, because
phonemes are dictionary independent and allow the system to cope with out of vocabulary
words, such as names, places, foreign words etc. The use of phoneme data is also
capable of making the system future proof, since it allows data files which are
placed into the database to be retrieved even when the words are not understood
by the original automatic speech recognition system.
In this embodiment, the annotation data stored in the annotation database 103
has the following general form:
- Header
  - time of start
  - flag indicating whether the data is word, phoneme or mixed
  - time index associating the location of blocks of annotation data
    within memory to a given time point
  - word set used (i.e. the dictionary)
  - phoneme set used
  - the language to which the vocabulary pertains
  - speech quality assessment
- block(i) i=0, 1, 2, . . .
  - node Nj j=0, 1, 2, . . .
    - time offset of node from start of block
    - phoneme links(k) k=0, 1, 2, . . .
      - offset to node Nj=Nk-Nj
        (Nk is node to which link k extends) or, if Nk is in block(i+1),
        offset to node Nj=Nk+Nb-Nj (where Nb
        is the number of nodes in block(i))
      - phoneme associated with link(k)
    - word links(l) l=0, 1, 2, . . .
      - offset to node Nj=Nk-Nj
        (Nk is node to which link l extends) or, if Nk is in block(i+1),
        offset to node Nj=Nk+Nb-Nj (where Nb
        is the number of nodes in block(i))
      - word associated with link(l)
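The general form set out above may be pictured, purely as a sketch, with the following Python data classes. All class and field names are hypothetical, as are the field types (for example, the time of start is held as a string and the time index as a list of time offset and memory location pairs); the embodiment does not prescribe any particular in-memory representation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Link:
    offset: int  # offset from the current node to the node the link extends to
    label: str   # phoneme or word associated with the link


@dataclass
class Node:
    time_offset: float  # time offset of the node from the start of its block
    phoneme_links: List[Link] = field(default_factory=list)
    word_links: List[Link] = field(default_factory=list)


@dataclass
class Block:
    nodes: List[Node] = field(default_factory=list)


@dataclass
class Header:
    time_of_start: str
    content_flag: str  # "word", "phoneme" or "mixed"
    time_index: List[Tuple[float, int]]  # (time offset, block location) pairs
    word_set: str      # the dictionary used
    phoneme_set: str
    language: str
    high_quality: bool  # speech quality assessment


@dataclass
class Annotation:
    header: Header
    blocks: List[Block] = field(default_factory=list)
```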
The time of start data in the header can identify the time and date of transmission
of the data. For example the time of start may include the exact time of the spoken
annotation and the date on which it was spoken.
The flag identifying if the annotation data is word annotation data, phoneme
annotation data or if it is mixed is provided since not all of the annotation data
in the annotation database 103 will include the combined phoneme and word
lattice annotation data discussed above, and in this case, a different search strategy
may be used to search this annotation data.
In this embodiment, the annotation data is divided into blocks in order to allow
the search to jump into the middle of the annotation for a given audio data stream.
The header therefore includes a time index which associates the location of the
blocks of annotation data within the memory to a given time offset between the
time of start and the time corresponding to the beginning of the block.
The header also includes data defining the word set used (i.e. the dictionary),
the phoneme set used and the language to which the vocabulary pertains. The header
may also include details of the automatic speech recognition system used to generate
the annotation data and the appropriate settings thereof which are used during
the generation of the annotation. Finally, as discussed above, the header also
includes the speech quality assessment which identifies whether or not the spoken
annotation is of a high quality.
The blocks of annotation data then follow the header and identify, for each node
in the block, the time offset of the node from the start of the block, the phoneme
links which connect that node to other nodes by phonemes and word links which connect
that node to other nodes by words. Each phoneme link and word link identifies the
phoneme or word which is associated with the link and the offset to the current
node. For example, if node N50 is linked to node N55 by a
phoneme link, then the offset to node N50 for that link is 5. As those
skilled in the art will appreciate, using an offset indication like this allows
the division of the continuous annotation data into separate blocks.
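The offset calculation may be illustrated by the following short sketch. The function and parameter names are hypothetical, and it is assumed that node numbers are counted from the start of the block in which the node lies, so that a link into block(i+1) adds Nb, the number of nodes in block(i).

```python
def link_offset(source_index, target_index, nodes_in_block,
                target_in_next_block=False):
    """Offset stored for a link from node Nj to node Nk.

    source_index: Nj, the node from which the link starts.
    target_index: Nk, the node to which the link extends.
    nodes_in_block: Nb, the number of nodes in the current block.
    """
    if target_in_next_block:
        # Nk is numbered within block(i+1), so Nb is added: Nk + Nb - Nj.
        return target_index + nodes_in_block - source_index
    # Both nodes lie in the same block: Nk - Nj.
    return target_index - source_index
```

With these assumptions, `link_offset(50, 55, 100)` gives 5, matching the example of nodes N50 and N55 above.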
Data File Retrieval
FIG. 4 is a block diagram illustrating the form of a data file retrieval system
which can be used to retrieve the annotated data files from the database 101.
This system may be, for example, a personal computer, a hand held device or the
like. As shown, in this embodiment, the retrieval system is similar to the speech
annotation system shown in FIG. 2 except that the data file annotation unit 99
is replaced with a data file retrieval unit 102, and a display 105
is provided for displaying the search results. In operation, an input voice query
is processed in the same way as the spoken annotation described above. The phoneme
and word data corresponding to the user's input query is output from the speech
recognition unit 97 to the data file retrieval unit 102. The data
file retrieval unit 102 then searches the annotation database 103
using the generated phoneme and word data and a speech quality assessment output
by the speech quality assessor 93 for the input query. The results of the
search are then output to the user on the display 105.
FIGS. 5a and 5b are flow charts illustrating the flow
control of the retrieval system shown in FIG. 4. As shown, initially in step s101,
the system awaits an input query by the user. Upon receipt of the query, the system
generates in step s103, phoneme and word data and a quality assessment for
the input query. Processing then proceeds to step s105 where the data file
retrieval unit 102 performs a word search in the annotation database 103
using the words in the query. The processing then proceeds to step s107
where the data file retrieval unit 102 determines whether or not a match
has been found. If it has, then the data file retrieval unit 102 displays
the results to the user on the display 105.
In this embodiment, the system then allows the user to consider the search results
and awaits the user's confirmation as to whether or not the results correspond
to the data file the user wishes to retrieve. If they do, then the processing proceeds
from step s111 to the end of the processing and the system returns to its idle
state and awaits the next input query. If, however, the user indicates (by, for
example, inputting an appropriate voice command) that the search results do not
correspond to the desired data file, then the processing proceeds from step s111
to step s112, where the data file retrieval unit 102 determines whether
or not the user's input query is of a high quality. If it is not, then the processing
proceeds to step s113 where the data file retrieval unit 102 uses
the results of the word search to select a number of annotations and then performs
a "relaxed" phoneme search of the selected annotations. The phoneme search is "relaxed"
in the sense that the data file retrieval unit 102 does not discard annotations
unless the phonemes of the annotation are very different to the phonemes for the
input query.
If, on the other hand, the system determines at step s112 that the input
query is of a high quality, then the processing proceeds to step s114 where
the data file retrieval unit 102 again uses the results of the word search
to select annotations and then uses a relaxed phoneme search for the selected annotations
having a low quality assessment and a "stringent" phoneme search for annotations
having a high quality assessment. The phoneme search is "stringent" in the sense
that the data file retrieval unit 102 discards annotations quickly in the
searching operation if there are significant differences between the annotation
phonemes and the query phonemes.
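The choice between the relaxed and stringent phoneme searches in steps s112 to s114 may be sketched as follows. The function name, the use of a dictionary keyed by annotation identifier and the mode strings are all hypothetical; the sketch only selects the search mode and does not perform the phoneme search itself.

```python
from typing import Dict


def phoneme_search_modes(query_high_quality: bool,
                         annotation_high_quality: Dict[str, bool]) -> Dict[str, str]:
    """Select a phoneme search mode for each annotation chosen by the word search.

    annotation_high_quality maps an annotation identifier to its stored
    speech quality assessment (True when the annotation is of a high quality).
    """
    modes = {}
    for annotation_id, is_high in annotation_high_quality.items():
        if query_high_quality and is_high:
            # Step s114: high quality query and high quality annotation.
            modes[annotation_id] = "stringent"
        else:
            # Step s113, or a low quality annotation in step s114.
            modes[annotation_id] = "relaxed"
    return modes
```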
After the phoneme searches have been performed, the processing proceeds to
step s115 where the data file retrieval unit 102 determines whether
or not a match has been found. If a match has been found then the processing proceeds
to step s117 where the results are displayed to the user on the display 105.
If the search results are correct, then processing proceeds from step
s119 to the end of the processing and the system returns to its idle state
and awaits the next input query. If, on the other hand, the user indicates that
the search results still do not correspond to the desired data file, then the processing
passes to step s121 where the data file retrieval unit 102 queries
the user, via the display 105, whether or not a phoneme search should be
performed of the whole annotation database 103. If in response to this query,
the user indicates that such a search should be performed, then the processing
proceeds to step s<