Noise-robust feature extraction using multi-layer principal component analysis. Number: 7,082,394, from the United States Patent and Trademark Office (PTO)

Title: Noise-robust feature extraction using multi-layer principal component analysis

Abstract: Extracting features from signals for use in classification, retrieval, or identification of data represented by those signals uses a "Distortion Discriminant Analysis" (DDA) of a set of training signals to define parameters of a signal feature extractor. The signal feature extractor takes signals having one or more dimensions with a temporal or spatial structure, applies an oriented principal component analysis (OPCA) to limited regions of the signal, aggregates the output of multiple OPCAs that are spatially or temporally adjacent, and applies OPCA to the aggregate. The steps of aggregating adjacent OPCA outputs and applying OPCA to the aggregated values are performed one or more times for extracting low-dimensional noise-robust features from signals, including audio signals, images, video data, or any other time or frequency domain signal. Such extracted features are useful for many tasks, including automatic authentication or identification of particular signals, or particular elements within such signals.

Patent Number: 7,082,394 Issued on 07/25/2006 to Burges,   et al.


Inventors: Burges; Chris (Bellevue, WA); Platt; John (Bellevue, WA)
Assignee: Microsoft Corporation (Redmond, WA)
Appl. No.: 10/180,271
Filed: June 25, 2002


Current U.S. Class: 704/243 ; 382/190; 704/205; 704/228; 704/235
Current International Class: G10L 15/02 (20060101); G06K 9/46 (20060101)
Field of Search: 704/205,228,235,243 382/190


References Cited

U.S. Patent Documents
4069393 January 1978 Martin et al.
4843562 June 1989 Kenyon et al.
5210820 May 1993 Kenyon
5377305 December 1994 Russo
5754681 May 1998 Watanabe et al.
6061680 May 2000 Scherf et al.
6154773 November 2000 Roberts et al.
6230207 May 2001 Roberts et al.
6751354 June 2004 Foote et al.
6947892 September 2005 Bauer et al.
2003/0028796 February 2003 Roberts et al.
2003/0086341 May 2003 Wells et al.
2003/0095660 May 2003 Lee et al.
2003/0236661 December 2003 Burges et al.
2004/0260727 December 2004 Goldstein et al.

Other References

Balachander et al. Oriented Soft Localized Subspace Classification, IEEE ICASSP 1999. cited by examiner.
Jonathan Foote. "Content-based retrieval of music and audio." In Multimedia Storage and Archiving Systems II, Proceedings of SPIE, pp. 138-147, 1997. cited by other.
Lie Lu, Hao Jiang, and HongJiang Zhang. "A robust audio classification and segmentation method." Technical report, Microsoft Research, 2001. cited by other.
H.S. Malvar. "A modulated complex lapped transform and its applications to audio processing." In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Phoenix, 1999. cited by other.
Tong Zhang and C.-C. Jay Kuo. "Hierarchical classification of audio data for archiving and retrieving." In IEEE International Conference on Acoustics, Speech, and Signal Processing, Phoenix, 1999. cited by other.
Narendranath Malayath, Hynek Hermansky, and Alex Kain. "Towards Decomposing the Sources of Variability in Speech." Proceedings of Eurospeech'97, Rhodes, Greece 1997. cited by other.
H. Hermansky and N. Malayath, "Speaker verification using speaker-specific mapping." Speaker Recognition and its Commercial and Forensic Applications, France, May 1998. cited by other.

Primary Examiner: Šmits; Talivaldis Ivars
Assistant Examiner: Saint-Cyr; Leonard
Attorney, Agent or Firm: Lyon & Harr, LLP Watson; Mark A.

Claims



What is claimed is:

1. A system for training a feature extractor for extracting features from an input signal comprising: receiving at least one training signal; receiving at least one distorted copy of the at least one training signal; transforming each training signal and each distorted copy of the at least one training signal into a suitable representation for taking projections; performing a multi-layer oriented principal component analysis (OPCA) of the at least one transformed training signal and the at least one transformed distorted copy of the at least one training signal to compute a set of training projections for each layer; and constructing a signal feature extractor from two or more layers of said projections.

2. The system of claim 1 wherein performing a multi-layer OPCA of the at least one transformed training signal and the at least one transformed distorted copy of the training signal to compute the set of training projections for each layer comprises: computing a first OPCA layer directly from the at least one transformed training signal and the at least one transformed distorted copy of the at least one training signal; and computing at least one subsequent OPCA layer from an aggregate of the projections from an immediately preceding OPCA layer, beginning with an aggregate of the training projections from the first OPCA layer.

3. The system of claim 1 further comprising pre-processing the at least one training signal, and the at least one distorted copy of the at least one training signal, to remove known distortions from the at least one training signal and the at least one distorted copy of the training signal.

4. The system of claim 1 further comprising normalizing the training projections output by each OPCA layer.

5. The system of claim 1 wherein the set of training projections computed for each layer is populated by a predetermined number of highest generalized eigenvalue OPCA projections computed for each layer.

6. The system of claim 1 further comprising applying a suitable normalization to each projection at each layer.

7. The system of claim 1 further comprising transforming each input signal into a representation suitable for projection.

8. The system of claim 1 wherein the at least one training signal and each distorted copy of the at least one training signal comprise audio signals and wherein transforming each training signal and each distorted copy of the at least one training signal into a suitable representation for taking projections comprises transforming the audio signals into a time-frequency representation.

9. The system of claim 8 wherein transforming the audio signals into a time-frequency representation comprises applying Fourier transforms to windowed subsets of the audio signals.

10. The system of claim 7 wherein said at least one input signal comprises an audio signal and said transforming comprises transforming the audio signal into a time-frequency representation.

11. The system of claim 7 further comprising extracting at least one feature from the at least one input signal by passing at least one transformed input signal through each layer of the feature extractor in the order in which the layers were originally computed.

12. The system of claim 2 further comprising: receiving at least one input signal and transforming each input signal into a representation suitable for projection; and passing at least one transformed input signal through each layer of the feature extractor in the order in which the layers were originally computed.

13. The system of claim 12 wherein passing the at least one transformed input signal through each layer of the feature extractor comprises: computing a first set of output projections by applying the training projections of the first OPCA layer to the at least one transformed input signal; computing at least one subsequent set of output projections by applying the training projections of each layer of the feature extractor to previous aggregate layers of output projections, wherein each aggregate layer of output projections is generated by collating output projections from adjacent positions in a layer.

14. The system of claim 13 wherein a final set of output projections produced by a last layer of the feature extractor represents features extracted from the input signal.

15. The system of claim 14 wherein at least one of the input signals represents a known data signal.

16. The system of claim 15 wherein at least one of the input signals represents an unknown data signal.

17. The system of claim 16 further comprising comparing the features extracted from the known data signal to the features extracted from the unknown data signal, and wherein one or more portions of the unknown data signal are identified by the comparison of the extracted features.

18. The system of claim 1 wherein transforming each training signal and each distorted copy of the training signal into a representation suitable for projection is performed on sequential frames of the training signal, and wherein performing a multi-layer oriented principal component analysis (OPCA) of the transformed training signal and the at least one transformed distorted copy of the at least one training signal to compute a set of training projections for each layer is performed on each sequential frame of the at least one training signal.

19. The system of claim 7 wherein transforming each input signal into a representation suitable for projection is performed on sequential frames of the input signal, and wherein extracting at least one feature from the at least one input signal by passing at least one transformed input signal through each layer of the feature extractor in the order in which the layers were originally computed is performed on each sequential frame of the input signal.

20. The system of claim 1 wherein the at least one training signal and the input signal are of the same signal type, and wherein the signal type represents any of audio signals, images, and video data.

21. The system of claim 1 further comprising normalizing the training projections for each layer by computing scores on a validation signal such that a mean distance between each training projection and projections computed for the validation signal is one.

22. A method for training a feature extractor for extracting features from an input signal comprising using a computing device to: divide at least one training signal into a set of adjacent frames, each frame having a same size; apply a first oriented principal component analysis (OPCA) to the adjacent frames to produce a first set of generalized eigenvectors for each frame; choose a number N of highest value eigenvectors for each frame; project each frame along the eigenvectors computed for each frame to produce a first set of N projections for each frame; aggregate the projections for adjacent frames to produce at least one aggregate; apply a second OPCA to each aggregate, with the second OPCA producing a second set of generalized eigenvectors for each aggregate frame; choose N highest value eigenvectors produced by the second OPCA for each aggregate frame; project each aggregate frame along the eigenvectors computed for the each aggregate frame to produce a second set of N projections for each aggregate frame; and train a feature extractor by assigning the first set of N projections to a first feature extractor layer, and assigning the second set of N projections to a second feature extractor layer.

23. The method of claim 22 wherein the at least one training signal is transformed prior to performing the OPCA.

24. The method of claim 22 further comprising normalizing the projections.

25. The method of claim 24 wherein normalizing the projections comprises normalizing the projections for the last layer by computing scores on a validation signal such that a mean distance between each projection computed from the at least one training signal and projections computed for the validation signal is one.

26. The method of claim 22 further comprising: computing at least one subsequent layer of projections by aggregating a number of adjacent projections of an immediately preceding layer, beginning with the second set of projections to produce a subsequent aggregate frame; applying a subsequent OPCA to this aggregate, with the OPCA outputting a new set of generalized eigenvectors; choosing N highest value eigenvectors produced by the subsequent OPCA for each subsequent aggregate frame; project each subsequent aggregate frame along the eigenvectors computed for the each subsequent aggregate frame to produce a subsequent set of N projections for each subsequent aggregate frame; and further training the feature extractor by assigning each new subsequent set of N projections to a subsequent feature extractor layer.

27. A computer-readable medium having computer executable instructions for extracting features from an input signal, said computer executable instructions comprising: applying a multi-layer oriented principal component analysis (OPCA) to a set of at least one training signals for producing a set of training projections for each OPCA layer, wherein each subsequent layer of the OPCA is performed on an aggregate of outputs from an immediately preceding OPCA layer; training a feature extractor by assigning the set of training projections for each OPCA layer to a corresponding layer of the feature extractor; and extracting features from at least one input signal by passing each input signal through each layer of the feature extractor in the order in which the layers were originally computed.

28. The computer-readable medium of claim 27 wherein applying a multi-layer OPCA to the set of training signals for producing a set of training projections for each OPCA layer comprises: computing a first OPCA layer by: transforming each training signal; computing generalized eigenvectors over the transformed training signals, projecting each training signal over a number of highest value eigenvectors to produce a number of projections from the training signal; and computing a second OPCA layer by: collating a number of adjacent projections from the first OPCA layer into an aggregate of projections, computing generalized eigenvectors over the aggregate of projections, and projecting the aggregate of projections over a number of highest value eigenvectors computed from the projections to produce a number of projections from the aggregate of projections.

29. The computer-readable medium of claim 28 further comprising computing at least one additional OPCA layer by applying an OPCA to an aggregate of the projections from an immediately preceding OPCA layer, beginning with the second OPCA layer.

30. A computer-implemented process for training an audio signal feature extractor, comprising using a computing device to: receive an audio input comprising representative audio data; transform the audio input into a time-frequency representation; compute generalized eigenvalues over the transformed audio data; compute at least one eigenvector corresponding to at least one highest value eigenvalue and assign those eigenvectors to a first layer of an audio signal feature extractor; collate a number of adjacent eigenvectors into an aggregate; compute generalized eigenvalues over the aggregate; compute at least one eigenvector corresponding to at least one highest value eigenvalue of the aggregate and assign those eigenvectors to a second layer of the audio feature extractor.

31. The computer-implemented process of claim 30 further comprising extracting features from at least one first audio signal by passing a time-frequency transformation of the first audio signal through each layer of the audio feature extractor.

32. The computer-implemented process of claim 30 wherein the audio input is distorted prior to transforming the audio input into a time-frequency representation.

33. The computer-implemented process of claim 30 wherein at least one copy of the audio input is distorted prior to transforming the audio data.

34. The computer-implemented process of claim 30 wherein at least one copy of the audio input is pre-processed prior to transforming the audio input by combining any multi-channel audio information into a single audio channel.

35. The computer-implemented process of claim 30 wherein the audio input is pre-processed prior to transforming the audio input by downsampling the audio input.

36. The computer-implemented process of claim 30 wherein the audio input is pre-processed prior to transforming the audio input by using a human psychoacoustic masking model for removing audio frequency components from the audio input which can not be heard by a typical human listener.

37. The computer-implemented process of claim 30 wherein the audio input is randomly shifted forward and backwards in time, up to a predefined maximum time offset, to provide at least one temporally misaligned copy of the audio input, and wherein the feature extractor trained using the time-shifted audio data is robust against temporal misalignment.

38. The computer-implemented process of claim 30 wherein the audio input is transformed using a complex modulated lapped transform to produce the transformed audio data.

39. The computer-implemented process of claim 30 wherein the audio input is transformed using a windowed FFT to produce the transformed audio data.

40. The computer-implemented process of claim 31 wherein the first audio signal represents a known audio signal, and wherein each extracted audio feature is stored in an exemplary feature database.

41. The computer-implemented process of claim 40 further comprising extracting at least one second audio feature from at least one second audio signal.

42. The computer-implemented process of claim 41 further comprising comparing the audio features extracted from the first audio signal to the audio features extracted from the second audio signal.
Description



BACKGROUND

1. Technical Field

The invention is related to a signal feature extractor, and in particular, to a system and method for using a "distortion discriminant analysis" of a set of training signals to define parameters of a feature extractor for extracting distortion-robust features from signals having one or more dimensions, such as audio signals, images, or video data.

2. Related Art

There are many existing schemes for extracting features from signals having one or more dimensions, such as audio signals, images, or video data. For example, with respect to a one-dimensional signal such as an audio signal or audio file, audio feature extraction has been used as a necessary step for classification, retrieval, and identification tasks involving the audio signal. For identification, the extracted features are compared to a portion of an audio signal for identifying either elements within the audio signal, or the entire audio signal. Such identification schemes are conventionally known as "audio fingerprinting."

Conventional schemes for producing features for pattern matching in signals having one or more dimensions typically approach the problem of feature design by handcrafting features that it is hoped will be well-suited for a particular identification task. For example, current audio classification, segmentation and retrieval methods use heuristic features such as the mel cepstra, the zero crossing rate, energy measures, spectral component measures, and derivatives of these quantities. Clearly, other signal types make use of other heuristic features that are specific to the particular type of signal being analyzed.
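As an illustration of what such handcrafted features look like in practice, the following is a minimal Python sketch of two of the heuristic measures named above: the zero crossing rate and a short-time energy measure. The function names, the framing parameters `frame_len` and `hop`, and the exact definitions (sign changes per adjacent sample pair, mean squared amplitude) are assumptions chosen for illustration, not definitions taken from this patent.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (assumed definition)."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def short_time_energy(frame):
    """Mean squared amplitude of the frame (assumed definition)."""
    return sum(x * x for x in frame) / len(frame) if frame else 0.0


def frame_features(signal, frame_len, hop):
    """Slide a window over the signal and return one (zcr, energy) pair per frame."""
    return [
        (zero_crossing_rate(signal[i:i + frame_len]),
         short_time_energy(signal[i:i + frame_len]))
        for i in range(0, len(signal) - frame_len + 1, hop)
    ]
```

Features of this kind are fixed by hand in advance; nothing in them is fitted to the distortions the signal is likely to undergo.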

For example, one conventional audio classification scheme provides a hierarchical scheme for audio classification and retrieval based on audio content analysis. The scheme consists of three stages. The first stage is called a coarse-level audio segmentation and classification, where audio recordings are segmented and classified into speech, music, several types of environmental sounds, and silence, based on morphological and statistical analysis of temporal curves of short-time features of audio signals. In the second stage, environmental sounds are further classified into finer classes such as applause, rain, birds' sound, etc. This fine-level classification is based on time-frequency analysis of audio signals and use of the hidden Markov model (HMM) for classification. In the third stage, the query-by-example audio retrieval is implemented where similar sounds can be found according to an input sample audio.

Another conventional scheme approaches audio content analysis in the context of video structure parsing. This scheme involves a two-stage audio segmentation and classification scheme that segments and classifies an audio stream into speech, music, environmental sounds, and silence. These basic classes are the basic data set for video structure extraction. A two-stage algorithm is then used to identify and extract audio features. In particular, the first stage of the classification is to separate speech from non-speech, based on simple features such as high zero-crossing rate ratio, low short-time energy ratio, spectrum flux and Linear Spectral Pairs (LSP) distance. The second stage of the classification further segments non-speech class into music, environmental sounds and silence with a rule based classification scheme.

Still another conventional scheme provides an audio search engine that can retrieve sound files from a large corpus based on similarity to a query sound. With this scheme, sounds are characterized by "templates" derived from a tree-based vector quantizer trained to maximize mutual information (MMI). Audio similarity is measured by simply comparing templates. The basic operation of the retrieval system involves first accumulating and parameterizing a suitable corpus of audio examples into feature vectors. The corpus must contain examples of the kinds (classes) of audio to be discriminated between, e.g., speech and music, or male and female talkers. Next, a tree-based quantizer is constructed using a manually "supervised" operation which requires the training data to be labeled, i.e., each training example must be associated with a class. The tree automatically partitions the feature space into regions ("cells") which have maximally different class populations. To generate an audio template for subsequent retrieval, parameterized data is quantized using the tree. To retrieve audio by similarity, a template is constructed for the query audio. Comparing the query template with corpus templates yields a similarity measure for each audio file in the corpus. These similarity measures can then be sorted by similarity and the results presented as a ranked list.

Another approach to feature extraction has been applied in the area of speech recognition and speech processing. For example, one conventional scheme provides a method for decomposing a conventional LPC-cepstrum feature space into subspaces which carry information about linguistic and speaker variability. In particular, this scheme uses oriented principal component analysis (OPCA) to estimate a subspace which is relatively speaker independent.

A related OPCA technique builds on the previous scheme by using OPCA for generating speaker identification or verification models using speaker information carried in the speech signal. This scheme is based on a three step modeling approach. In particular, this scheme first extracts a number of speaker-independent feature vectors which include linguistic information from a target speaker. Next, a set of speaker-dependent feature vectors which include both linguistic and speaker information are extracted from the target speaker. Finally, a functional mapping between the speaker-independent and the speaker-dependent features is computed for transforming the speaker-independent features into speaker-dependent features to be used for speaker identification.

However, while the aforementioned schemes are useful, they do have limitations. For example, a feature extractor system designed with heuristic features such as those discussed above is not typically optimal across multiple types of distortion or noise in a signal. In fact, different features than those selected or extracted often give better performance, or are more robust to particular types of noise or distortion. Further, with respect to the OPCA based schemes, these schemes do not effectively address noise or distortions in the signal being analyzed over wide temporal or spatial windows.

Therefore, what is needed is a system and method for extracting features from a set of representative training data such that the features extracted will be robust to both distortion and noise when used for feature classification, retrieval, or identification tasks involving an input signal.

SUMMARY

A system and method for extracting features from signals having one or more dimensions for use in classification, retrieval, or identification of the data represented by those signals uses a "Distortion Discriminant Analysis" (DDA) of a set of training signals to define parameters of a signal feature extractor. Note that in the context of this description, a "signal" is defined to be any set of data that has a low-dimensional index set. In general, the signal feature extractor is capable of extracting features from any time, space, or frequency domain signal of one or more dimensions. For example, such signals include an audio signal, which is considered to be a one-dimensional signal; an image, which is considered to be a two-dimensional signal; and video data, which is considered to be a three-dimensional signal. Thus, the term signal, as used throughout this description, will be understood to mean a signal of any dimensionality, except where particular signal types are explicitly referred to.

The signal feature extractor described herein takes any signal with a temporal or spatial structure, applies an oriented principal component analysis (OPCA) to limited regions of the signal, aggregates the output of multiple OPCAs that are spatially or temporally adjacent, and then applies OPCA to the aggregate. The steps of aggregating adjacent OPCA outputs and applying OPCA to the aggregated values can be performed one or more times. Consequently, the use of two or more OPCA layers allows for the extraction of low-dimensional noise-robust features from a signal, such as, for example, audio signals, images, video data, or any other time, space, or frequency domain signal. Such extracted features are useful for many tasks, including, for example, automatic authentication or identification of particular signals, or particular elements within such signals. For example, with respect to an audio signal, the DDA system described herein is capable of identifying particular songs or audio clips, either individually, or as a part of a continuous or semi-continuous audio stream. Other examples using audio data include, for example, speaker identification or differentiation, speech recognition, etc.
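The layered structure described above (OPCA on limited regions, aggregation of adjacent outputs, then OPCA on the aggregate) can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the patented implementation: it assumes OPCA is solved as a generalized eigenproblem between the covariance of the clean training vectors and the correlation matrix of the distortion (distorted minus clean), that one distorted copy is row-aligned with each clean frame, and that a small regularizer `eps` keeps the noise matrix invertible. All function names and parameters are hypothetical.

```python
import numpy as np


def opca_directions(clean, distorted, n_keep, eps=1e-6):
    """Top n_keep generalized eigenvectors of (C, R).

    clean:     (n_samples, dim) undistorted training vectors
    distorted: (n_samples, dim) distorted copies, row-aligned with clean
    C is the covariance of the clean vectors; R is the correlation matrix
    of the distortion (distorted - clean). eps is an assumed regularizer.
    """
    centered = clean - clean.mean(axis=0)
    noise = distorted - clean
    C = centered.T @ centered / len(clean)
    R = noise.T @ noise / len(noise) + eps * np.eye(clean.shape[1])
    # Whiten with the Cholesky factor of R so an ordinary symmetric
    # eigensolver can be used: R = L L^T, M = L^-1 C L^-T.
    L = np.linalg.cholesky(R)
    Linv = np.linalg.inv(L)
    M = Linv @ C @ Linv.T
    vals, vecs = np.linalg.eigh(M)            # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_keep]   # largest signal-to-noise first
    return Linv.T @ vecs[:, order]            # shape (dim, n_keep)


def aggregate(projections, width):
    """Concatenate `width` adjacent frames of projections into one vector."""
    n = projections.shape[0] - width + 1
    return np.stack([projections[i:i + width].ravel() for i in range(n)])


def train_two_layers(clean_frames, distorted_frames, n_keep, width):
    """Layer 1 on the raw frames, layer 2 on aggregated layer-1 outputs."""
    mean1 = clean_frames.mean(axis=0)
    W1 = opca_directions(clean_frames, distorted_frames, n_keep)
    p_clean = (clean_frames - mean1) @ W1
    p_dist = (distorted_frames - mean1) @ W1
    a_clean, a_dist = aggregate(p_clean, width), aggregate(p_dist, width)
    W2 = opca_directions(a_clean, a_dist, n_keep)
    return W1, W2
```

With frames of dimension 8, `n_keep=3` and `width=4`, the first layer maps each frame to 3 values and the second layer maps each group of 4 adjacent frames (12 values) to 3 values, so the second layer sees a wider temporal window at low dimension. Further layers would repeat the aggregate-then-OPCA step.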

"Distortion Discriminant Analysis," (DDA), is a novel concept which addresses several primary concerns, as detailed below. In general, DDA can be viewed as a multi-layer linear convolutional neural network, where the weights are trained using a modified Oriented Principal Components Analysis (OPCA) rather than by other well-known techniques such as back-propagation. Each DDA layer applies OPCA to maximize a signal-to-noise ratio of its output, with a corresponding dimensional reduction of the input. Two or more DDA layers are aggregated in order to enforce shift invariance, to reduce computation time, and to build in robustness to noise and distortion at different temporal or spatial scales. Note that in an alternate embodiment, the DDA system and method described herein operates to construct a non-linear convolutional neural network rather than a linear convolutional neural network.

Further, while the DDA system and method is described herein with respect to extraction of features from audio signals, the general concepts described with respect to extraction of audio features are applicable to any signal having one or more dimensions, as noted above. Thus, a simple working example of the DDA system and method described herein is implemented in an audio signal feature extractor which provides distortion-robust audio feature vectors for classification, retrieval, or identification tasks while addressing several primary concerns.

First, computational speed and efficiency of the signal feature extractor is enhanced by using multiple layers of OPCA. Second, the features resulting from the signal feature extractor are robust to likely distortions of the input, thereby reducing potential errors in classification, retrieval, or identification tasks using those features. In particular, the feature vectors produced as a result of the DDA are robust to likely distortions of the input, including, in many cases, distortions for which the system has not been explicitly trained. For example, with respect to a broadcast audio signal, most radio stations introduce nonlinear distortions and time compression into the audio signal before broadcasting. Other audio signal type distortions include noise from any of a number of sources, such as, for example, interference or microphone noise.

It should be noted that, as described in detail below, the DDA-based convolutional neural network can be trained on any desired distortion or noise, or any combination of distortions or noise and distortions. Third, the features are informative for the task at hand, i.e., they work well for classification, retrieval, or identification tasks with respect to a given audio input. For example, in the case of audio identification, different audio clips should map to features that are distant, in some suitable metric, so as to reduce potential false positive identifications. Again, it should be noted that the general approach, as described with respect to the extraction of features from an audio signal, is fully applicable to other signal types.

Finally, in one embodiment, the feature extraction operation is designed to be computationally efficient. For example, in one embodiment, the feature extraction operation is designed such that it uses only a small fraction of the computational resources available on a typical PC.

To begin the DDA, in one embodiment, prior knowledge of distortions and noise in the signal is used to design a pre-processor to DDA. This pre-processor then uses any of a number of conventional techniques to remove those distortions or noise that can be removed using conventional algorithms. For example, in an audio signal, where equalization is a known distortion of the signal, de-equalization is performed by the pre-processor.

The DDA then sets the parameters of the feature extractor using layered OPCA. In particular, as noted above, a system and method for noise-robust feature extraction for use in classification, retrieval, or identification of data uses a Distortion Discriminant Analysis (DDA) of a set of training signals and one or more distorted versions of that training set to define parameters of a feature extractor. The distortions applied to the training signals can be any desired distortion, or combination of distortions or noise, either natural or artificial. Note that using distorted sample input signals is less stringent and more general than requiring that a real noise model is known. Further, it should be noted that DDA does not assume that the distortion is additive: non-linear distortions are also handled. In addition, as noted above, DDA can generalize beyond the given set of distorted training signals to be robust against distortions that are not in the training set.

The feature extractor described herein then uses two or more OPCA layers for extracting low-dimensional noise-robust features from audio data. As noted above, DDA can be viewed as a multi-layer linear convolutional neural network, where the weights are trained using a modified Oriented Principal Components Analysis (OPCA) to reduce the dimensionality of the audio input and maximize a signal-to-noise ratio of its output. Two or more DDA layers are aggregated in order to enforce shift invariance, to reduce computation time, and to build in robustness to noise and distortion at different time or space scales. Feature extractors learned with DDA address each of the concerns listed above. Namely, the learned feature extractor reduces the dimensionality of the input signal; the resulting features are robust to likely distortions of the input; the features are informative for the task at hand; and finally, the feature extraction operation is computationally efficient.

Finally, in a tested embodiment of the present invention, the robustness of the DDA feature extractor is demonstrated by applying extracted features to identify known audio segments in an audio stream. Such identification is called "stream audio fingerprinting." In stream audio fingerprinting, a fixed-length segment of the incoming audio stream is converted into a low-dimensional trace (a vector). This input trace is then compared against a large set of stored, pre-computed traces, i.e., the extracted audio features, where each stored trace has previously been extracted from a particular audio segment (for example, a song). In addition, the input traces are computed at repeated intervals and compared with the database. The stored pre-computed traces are called "fingerprints," because they are used to uniquely identify particular audio segments.

Note that in one embodiment, the audio fingerprinting system described herein uses only a single fingerprint per audio clip for identification. However, in an alternate embodiment, two fingerprints are used: the initial one, and a "confirmatory" fingerprint, right after the initial one. The second fingerprint is useful for several reasons. First, it allows the threshold for acceptance to be lowered. For example, with a lower threshold for comparison between traces, more traces are accepted against the first fingerprint; the second fingerprint then provides a more robust identification, and fewer patterns are incorrectly rejected because the comparison threshold for the first fingerprint was set too high. In other words, the use of two fingerprints serves to reduce the false negative rate. Clearly, this embodiment is extensible to the use of further fingerprints for trace identification, thereby further reducing identification error rates.
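The two-fingerprint idea can be illustrated with a minimal sketch. It assumes traces are compared by Euclidean distance, so "lowering the acceptance threshold" corresponds here to allowing a larger distance on the first comparison. The function names, clip names, thresholds, and data are all hypothetical and are not taken from the described system.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length traces."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify_with_confirmation(first_trace, second_trace, database, loose, confirm):
    """Return the first clip whose initial fingerprint is within `loose` of
    first_trace AND whose confirmatory fingerprint is within `confirm` of
    second_trace; return None if no clip passes both checks.

    database maps clip name -> (initial fingerprint, confirmatory fingerprint).
    """
    for name, (fp_initial, fp_confirm) in database.items():
        if distance(first_trace, fp_initial) > loose:
            continue  # not even a loose match on the first fingerprint
        if distance(second_trace, fp_confirm) <= confirm:
            return name  # confirmed by the second fingerprint
    return None

# Hypothetical two-dimensional fingerprints for two clips.
database = {
    "clip_y": ([0.2, 0.0], [5.0, 5.0]),
    "clip_x": ([0.0, 0.0], [1.0, 1.0]),
}
first_trace = [0.15, 0.0]   # loosely matches both clips' initial fingerprints
second_trace = [1.0, 0.9]   # only matches clip_x's confirmatory fingerprint
result = identify_with_confirmation(first_trace, second_trace, database,
                                    loose=0.3, confirm=0.2)
print(result)
```

In this toy data a strict single threshold of 0.1 on the first fingerprint would reject the true clip (distance 0.15) and accept the wrong one (distance 0.05). The loose first check followed by confirmation recovers the true clip.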

In addition to the just described benefits, other advantages of the signal feature extractor will become apparent from the detailed description which follows hereinafter when taken in conjunction with the accompanying drawing figures.

DESCRIPTION OF THE DRAWINGS

The specific features, aspects, and advantages of the signal feature extractor will become better understood with regard to the following description, appended claims, and accompanying drawings where:

FIG. 1 is a general system diagram depicting a general-purpose computing device constituting an exemplary system for implementing a signal feature extractor.

FIG. 2A illustrates an exemplary architectural diagram showing exemplary program modules for training a feature extractor for extracting features from signals having one or more dimensions.

FIG. 2B illustrates an exemplary architectural diagram showing exemplary program modules for using the feature extractor of FIG. 2A for identification of signals, including creation of a feature or "fingerprint" database and comparison of fingerprints.

FIG. 3 illustrates an exemplary flow diagram for training a signal feature extractor to extract noise and distortion robust signal feature vectors.

FIG. 4 illustrates an exemplary flow diagram for using extracted noise and distortion robust signal feature vectors for evaluating a signal input.

FIG. 5 is a diagram which illustrates the architecture of the DDA system, showing use of layered OPCA projections in a tested embodiment of an audio identification system employing the signal feature extractor.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

In the following description of the preferred embodiments of the present invention, reference is made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.

1.0 Exemplary Operating Environment

FIG. 1 illustrates an example of a suitable computing system environment 100 on which the invention may be implemented. The computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 100.

The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held, laptop or mobile computer or communications devices such as cell phones and PDAs, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices. With reference to FIG. 1, an exemplary system for implementing the invention includes a general-purpose computing device in the form of a computer 110.

Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components including the system memory to the processing unit 120. The system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.

Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.

Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110.

Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.

The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation, FIG. 1 illustrates operating system 134, application programs 135, other program modules 136, and program data 137.

The computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.

The drives and their associated computer storage media discussed above and illustrated in FIG. 1, provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In FIG. 1, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies.

A user may enter commands and information into the computer 110 through input devices such as a keyboard 162 and pointing device 161, commonly referred to as a mouse, trackball or touch pad. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus 121, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.

The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110, although only a memory storage device 181 has been illustrated in FIG. 1. The logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.

When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 1 illustrates remote application programs 185 as residing on memory device 181. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

The exemplary operating environment having now been discussed, the remaining part of this description will be devoted to a discussion of the program modules and processes embodying a signal feature extractor for providing feature vectors for use in classification, retrieval, or identification of data in signals having one or more dimensions.

2.0 Introduction

Feature extraction is a necessary step for classification, retrieval, and identification tasks with respect to portions of an input signal. A system and method for extracting features from signals of one or more dimensions, such as, for example, audio signals, images, video data, or any other time or frequency domain signal, uses a "Distortion Discriminant Analysis" (DDA) of a set of training signals to define parameters of a signal feature extractor. Note that in the context of this description, a "signal" is defined to be any set of data that has a low-dimensional index set. In general, the signal feature extractor is capable of extracting features from any time, space, or frequency domain signal of one or more dimensions. For example, such signals include an audio signal, which is considered to be a one-dimensional signal; an image, which is considered to be a two-dimensional signal; and video data, which is considered to be a three-dimensional signal. Thus, the term "signal," as used throughout this description, will be understood to mean a signal of any dimensionality, except where particular signal types are explicitly referred to.

In particular, the signal feature extractor described herein takes any signal with a temporal or spatial structure, applies an oriented principal component analysis (OPCA) to limited regions of the signal, aggregates the output of multiple OPCAs that are spatially or temporally adjacent, and then applies OPCA to the aggregate. The steps of aggregating adjacent OPCA outputs and applying OPCA to the aggregated values can be performed one or more times.
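The window, project, aggregate, project structure described above can be sketched as follows. This is a minimal illustration, not the described implementation: it assumes a one-dimensional signal, non-overlapping windows, and projection matrices that have already been learned (random matrices stand in for them here). All names and sizes are hypothetical.

```python
import numpy as np

def apply_layer(frames, projection):
    """Project each row of `frames` onto the layer's (assumed learned) directions."""
    return frames @ projection

def aggregate_adjacent(outputs, group):
    """Concatenate every `group` adjacent output vectors into one longer vector."""
    n_groups = outputs.shape[0] // group
    trimmed = outputs[: n_groups * group]
    return trimmed.reshape(n_groups, group * outputs.shape[1])

def extract_features(signal, window, projections, group):
    """Layered projection: window -> project -> (aggregate -> project) per extra layer."""
    signal = np.asarray(signal, dtype=float)
    n_windows = len(signal) // window
    frames = signal[: n_windows * window].reshape(n_windows, window)
    out = apply_layer(frames, projections[0])
    for projection in projections[1:]:
        out = apply_layer(aggregate_adjacent(out, group), projection)
    return out

rng = np.random.default_rng(0)
signal = rng.standard_normal(64 * 8)       # placeholder signal: 8 windows of 64 samples
layer1 = rng.standard_normal((64, 4))      # stand-in for first-layer directions
layer2 = rng.standard_normal((4 * 4, 2))   # stand-in for second-layer directions
features = extract_features(signal, window=64, projections=[layer1, layer2], group=4)
print(features.shape)
```

Here 512 input samples are reduced to two 2-dimensional feature vectors. Adding more entries to `projections` repeats the aggregate-and-project step.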

The use of two or more OPCA layers allows for the extraction of low-dimensional noise-robust features from a signal. Such extracted features are useful for many tasks, including, for example, automatic authentication or identification of particular signals, or particular elements within such signals. For example, with respect to an audio signal, a tested embodiment of the DDA system described herein is capable of identifying particular songs or audio clips, either individually, or as a part of a continuous or semi-continuous audio stream. Other examples regarding DDA analysis of audio data include, for example, speaker identification or differentiation, speech recognition, etc. As noted above, DDA analysis can also be performed on multi-dimensional signals, such as images or video, or any other time, space, or frequency domain signal.

In general, the signal feature extractor described herein uses DDA to provide distortion-robust feature vectors. DDA, as described below, constructs a multi-layer linear, convolutional neural network, with each layer performing an Oriented Principal Components Analysis (OPCA) for dimensional reduction of the input while also maximizing a signal-to-noise ratio of its output. In particular, two or more DDA layers are aggregated in order to enforce shift invariance, to reduce computation time, and to build in robustness to noise and distortion at different temporal or spatial scales. Note that in an alternate embodiment, as described in further detail below, the DDA system and method described herein operates to construct a non-linear convolutional neural network rather than a linear convolutional neural network.
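As a hedged illustration of what maximizing a signal-to-noise ratio can mean for a single layer, OPCA is commonly stated as maximizing a generalized Rayleigh quotient. The symbols below are notation assumed here for illustration and do not appear in the text above: C is the covariance matrix of the undistorted signal, R is the correlation matrix of the noise, n is a projection direction, and q is the resulting ratio.

```latex
q(\mathbf{n}) = \frac{\mathbf{n}^{\top} C\, \mathbf{n}}{\mathbf{n}^{\top} R\, \mathbf{n}},
\qquad
C\, \mathbf{n}_i = q_i\, R\, \mathbf{n}_i
```

Under these assumptions, the directions that maximize q are the generalized eigenvectors with the largest values of q_i. Keeping only the leading few directions yields the dimensional reduction.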

Further, while the DDA system and method is described herein with respect to extraction of features from audio signals, the general concepts described with respect to extraction of audio features are applicable to any signal, as noted above. Thus, a simple tested embodiment of the DDA system and method described herein is implemented in an audio signal feature extractor which provides distortion-robust audio feature vectors for classification, retrieval, or identification tasks while addressing several primary concerns.

First, computational speed and efficiency of the signal feature extractor are enhanced by using multiple layers of Oriented Principal Component Analysis (OPCA), as described in detail below. The use of multiple layers allows for a significant reduction in the dimensionality of the input signal. For example, in the case of audio fingerprinting for an audio stream, a working example of the signal feature extractor was used to reduce the input dimensionality of the audio signal by a factor of 8000. Such a reduction using a single step of OPCA would be computationally prohibitive, both for training and for real-time feature extraction.

Second, the features resulting from the signal feature extractor are robust to likely distortions of the input, thereby reducing potential errors in classification, retrieval, or identification tasks using those features. In particular, the feature vectors produced as a result of the DDA are robust to likely distortions of the input, including, in many cases, distortions for which the system has not been explicitly trained. For example, using an audio signal for illustrative purposes, distortions can affect the audio signal for many reasons, including the fact that most radio stations introduce nonlinear distortions and time compression into the audio signal before broadcasting. Other audio-type distortions include noise from any of a number of sources, such as, for example, interference or microphone noise. It should be noted that, as described in detail below, the DDA-based convolutional neural network can be trained on any desired distortion or noise, or any combination of distortions or noise and distortions. Further, different distortions, or different combinations of distortions or noise, can be trained at each layer of the DDA-based convolutional neural network. Again, it should be noted that the general approach, as described with respect to the extraction of features from an audio signal, is fully applicable to other signal types.

Third, the features are informative for the task at hand, i.e., they work well for classification, retrieval, or identification tasks with respect to a given audio input. For example, in the case of audio identification, different audio clips should map to features that are distant, in some suitable metric, so as to reduce potential false positive identifications. The use of OPCA in the layers of the DDA serves to maximize the signal variance, thereby driving the features to be as informative as possible.

Finally, in one embodiment, the feature extraction operation is designed to be computationally efficient. For example, in one embodiment, the feature extraction operation is designed such that it uses only a small fraction of the computational resources available on a typical PC.

2.1 System Overview

In general, a system and method for signal feature extraction for use in classification, retrieval, or identification of elements or segments of a data signal uses a Distortion Discriminant Analysis (DDA) of a set of training signals to define parameters of a signal feature extractor. The signal feature extractor described herein then uses two or more OPCA layers for extracting low-dimensional noise-robust features from the data. As noted above, DDA can be viewed as a linear convolutional neural network, where the weights are trained using Oriented Principal Components Analysis (OPCA) to reduce the dimensionality of the signal input. Further, each DDA layer applies OPCA to maximize a signal-to-noise ratio of its output. Two or more OPCA layers are used in order to enforce shift invariance, to reduce computation time, and to build in robustness to noise and distortion at different time scales.

To begin, in one embodiment, prior knowledge of the distortions and noise in the signal is used to design a pre-processor to DDA. This pre-processor serves to remove those distortions or noise from the signal by using any of a number of well-known conventional signal processing techniques. For example, given an audio signal, if equalization is a known distortion of the signal, then de-equalization is performed by this embodiment. Similarly, given an image input, if contrast and brightness variations are known distortions of the signal, then histogram equalization is performed by this embodiment.
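For the image case, a pre-processing step of this kind might look like the following sketch of conventional histogram equalization. It assumes an 8-bit grayscale image held in a NumPy array. The function name and the sample data are hypothetical, and the text does not specify which variant of histogram equalization is used.

```python
import numpy as np

def equalize_histogram(image, levels=256):
    """Spread the gray levels of an 8-bit image over the full range via its CDF."""
    image = np.asarray(image, dtype=np.uint8)
    hist = np.bincount(image.ravel(), minlength=levels)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # CDF value at the darkest level present
    span = image.size - cdf_min
    if span == 0:                      # constant image: nothing to equalize
        return image.copy()
    lut = np.round((cdf - cdf_min) / span * (levels - 1))
    lut = np.clip(lut, 0, levels - 1).astype(np.uint8)
    return lut[image]                  # map every pixel through the lookup table

low_contrast = np.array([[100, 110], [120, 130]], dtype=np.uint8)
equalized = equalize_histogram(low_contrast)
print(equalized.min(), equalized.max())
```

The narrow input range of 100 to 130 is stretched so that the darkest pixel maps to 0 and the brightest to 255, which reduces sensitivity to overall contrast and brightness changes before features are extracted.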

Distortion Discriminant Analysis (DDA) then sets the parameters of the feature extractor using layered OPCA as described in further detail below. Feature extractors learned with DDA address each of the concerns noted above. Namely, the learned feature extractor reduces the dimensionality of the input signal; the resulting features are robust to likely distortions of the input; the features are informative for the task at hand; and finally, the feature extraction operation is computationally efficient.

DDA is trained using a set of representative training signals and one or more distorted versions of those training signals. The set of representative training signals is simply a set of data which is chosen because it is typical or generally representative of the type of data which is to be analyzed. Note that the data used for training does not have to be the same as the data that is to be analyzed. For example, there is no need to train the feature extractor using segments of the same songs which are to be passed to the feature extractor for extracting features. Furthermore, the type of training data does not have to even match the type of data expected in test phase; for example, a system trained using pop music can be used to extract features from classical music.

The distortions applied to the training signals can be any desired distortion, or combination of distortions or noise. Using distorted samples of the input signals is less stringent and more general than requiring that a real noise model is known. Further, it should be noted that DDA does not assume that the distortion is additive: non-linear distortions are also handled. As discussed below in Section 3, DDA can generalize beyond the given set of distorted training signals to be robust against distortions that are not in the training set.
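Training a single layer from clean and distorted samples might be sketched as below. Several choices here are assumptions of the sketch rather than statements from the text: the noise sample is taken to be the difference between a distorted frame and its clean original, a small regularizer is added to the noise matrix, and the generalized eigenproblem is solved by Cholesky whitening. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def train_opca_layer(clean, distorted_sets, n_components, reg=1e-6):
    """Learn projection directions with a high signal-to-noise ratio.

    clean:          array (n_samples, dim) of training frames
    distorted_sets: list of arrays shaped like `clean`, row-aligned with it
    Returns (directions of shape (dim, n_components), their ratios).
    """
    clean = np.asarray(clean, dtype=float)
    centered = clean - clean.mean(axis=0)
    signal_cov = centered.T @ centered / len(clean)

    # Assumption: noise = distorted frame minus its clean original.
    noise = np.vstack([np.asarray(d, dtype=float) - clean for d in distorted_sets])
    noise_corr = noise.T @ noise / len(noise)
    noise_corr += reg * np.eye(clean.shape[1])   # assumed regularizer for invertibility

    # Solve signal_cov v = q noise_corr v by whitening with the Cholesky factor.
    chol_inv = np.linalg.inv(np.linalg.cholesky(noise_corr))
    whitened = chol_inv @ signal_cov @ chol_inv.T
    eigvals, eigvecs = np.linalg.eigh(whitened)
    order = np.argsort(eigvals)[::-1][:n_components]
    return chol_inv.T @ eigvecs[:, order], eigvals[order]

# Synthetic data: axis 0 has large signal variance but equally large noise;
# axis 1 has small signal variance but very little noise.
rng = np.random.default_rng(0)
clean = rng.standard_normal((500, 2)) * np.array([3.0, 1.0])
distorted = clean + rng.standard_normal((500, 2)) * np.array([3.0, 0.1])
directions, ratios = train_opca_layer(clean, [distorted], n_components=1)
print(directions[:, 0], ratios)
```

On this data the learned direction points mostly along axis 1, the low-noise axis, whereas ordinary PCA would favor axis 0 because of its larger variance. Passing several arrays in `distorted_sets` is one way to train on a combination of distortions.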

Finally, in a tested embodiment the robustness of the DDA feature extractor was examined by applying extracted features to identify known audio segments in an audio stream. Audio identification enabled by this audio feature extractor is termed "stream audio fingerprinting." In stream audio fingerprinting, a fixed-length segment of the incoming audio stream is converted into a low-dimensional trace (a vector). This input trace is then compared against a large set of stored, pre-computed traces, i.e., the extracted audio features, where each stored trace has previously been extracted from a particular audio segment, such as a song, after initial training of the feature extractor using a set of training signals representative of the audio to be examined. In addition, the input traces are computed at repeated intervals and compared with the database. The pre-computed traces are called "fingerprints," because they are used to uniquely identify particular audio segments. Note that while the audio fingerprinting system described herein uses only a single fingerprint per audio clip for identification, identification error rates are further reduced in alternate embodiments by using several fingerprints per audio clip for identification.
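The lookup step of stream audio fingerprinting can be illustrated with a small sketch. It assumes Euclidean distance as the comparison metric and a fixed acceptance threshold, neither of which is specified above. The function names, song names, and three-dimensional traces are hypothetical placeholders.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length traces."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(trace, fingerprints, threshold):
    """Return (name, distance) of the closest stored fingerprint,
    or None if even the closest one is farther than `threshold`."""
    best_name, best_dist = None, float("inf")
    for name, fingerprint in fingerprints.items():
        d = distance(trace, fingerprint)
        if d < best_dist:
            best_name, best_dist = name, d
    if best_dist <= threshold:
        return best_name, best_dist
    return None

def scan_stream(traces, fingerprints, threshold):
    """Compare traces computed at repeated intervals against the database
    and return a list of (trace index, matched name)."""
    hits = []
    for index, trace in enumerate(traces):
        match = identify(trace, fingerprints, threshold)
        if match is not None:
            hits.append((index, match[0]))
    return hits

fingerprints = {"song_a": [0.0, 1.0, 0.0], "song_b": [1.0, 0.0, 1.0]}
stream_traces = [[0.5, 0.5, 0.5], [0.05, 0.95, 0.0], [0.9, 0.1, 1.0]]
hits = scan_stream(stream_traces, fingerprints, threshold=0.2)
print(hits)
```

The first trace is equally far from both stored fingerprints and is rejected; the second and third fall within the threshold of `song_a` and `song_b` respectively. A linear scan is used only for clarity; a large database would likely need a faster search structure.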

2.2 System Architecture

The process summarized above is illustrated by the general system diagrams of FIG. 2A and FIG. 2B. In particular, the system diagram of FIG. 2A illustrates the interrelationships between program modules for implementing a DDA-based feature extractor. Further, FIG. 2B illustrates alternate embodiments of the feature extractor as used in a feature analysis system. It should be noted that the boxes and interconnections between boxes that are represented by broken or dashed lines in FIG. 2A and FIG. 2B represent alternate embodiments of the invention, and that any or all of these alternate embodiments, as described below, may be used in combination with other alternate embodiments that are described throughout this document.

In particular, as illustrated by FIG. 2A, a system and method for DDA-based feature extraction begins, in one embodiment, by providing one or more training signal inputs 200 from a computer file or input device to a pre-processor module 205 for removing known distortions or noise from the training signal input 200 by using any of a number of well-known conventional signal processing techniques. For e

