Is it useful to submit the same phrase several times?
User: ralfherzog
Date: 3/3/2008 9:57 am
Views: 303
Rating: 33
Hello Johan Lingen,

2.b.)  "is it useful if one submits the same phrases multiple times?"

In my opinion, the answer to this question can be found in the real world (speech in the family, speech in the business, speech in TV series).  How often do you say exactly the same phrase multiple times in the real world?  There might be some sentences that occur very often.  Here are some examples:

- "You're welcome".
- "Thank you".
- "I will see you later".

Those are short sentences that occur in everyday situations.  Those sentences could be submitted several times by the same person.  But when it comes to more complex sentences, they may occur only once in the real world.  So those sentences shouldn't be submitted several times.

My approach is to look at the real world, and try to transform the real world of speech into the acoustic/language model.

In my opinion, it would be useful to have, let's say, one million different sentences for every language (English, Dutch, German) in the database of the VoxForge speech submission application.  This is a high number of sentences.  Most of those sentences should occur only once.  Some of them could occur more than once, because they occur more than once in the real world.

We should try to cover the nodes of the real world.  And the real world has far more than just 1200 English prompts or 200 Dutch prompts.  We should try to create a much higher number of prompts for each language.  The VoxForge speech submission application could then select a set of ten prompts at random, so one person could submit several times with a high probability of getting different prompts each time.  The real world would be boring with just 1200 prompts.
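To make the "high probability of getting different prompts" concrete, here is a minimal Python sketch. It assumes prompts are drawn uniformly at random without replacement within one session; the function names `pick_prompts` and `prob_no_repeat` are hypothetical and not part of the VoxForge application.

```python
import math
import random

def pick_prompts(pool, k=10, rng=random):
    """Pick k distinct prompts uniformly at random from the pool."""
    return rng.sample(pool, k)

def prob_no_repeat(pool_size, k=10):
    """Probability that two independent sessions of k prompts share no prompt.

    The second session must avoid all k prompts of the first one:
    C(pool_size - k, k) / C(pool_size, k).
    """
    return math.comb(pool_size - k, k) / math.comb(pool_size, k)

# Hypothetical pool standing in for the current 1200 English prompts.
pool = [f"prompt {i}" for i in range(1200)]
session = pick_prompts(pool, 10, random.Random(0))
print(session)

# Compare the chance of two fully distinct sessions for both pool sizes.
print(prob_no_repeat(1200))
print(prob_no_repeat(1_000_000))
```

With a pool of 1200 prompts, two sessions already rarely overlap; a larger pool mainly matters for people who submit many sessions.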

Greetings, Ralf

--- (Edited on 2008-03-03 9:57 am [GMT-0600] by ralfherzog) ---

Re: Is it useful to submit the same phrase several times?
User: JohanLingen
Date: 3/3/2008 10:18 am
Views: 414
Rating: 32

Hi Ralf,

Thank you for your answer. I am not technically up to date with speech recognition, though I do find it interesting. The only thing I can contribute to this project is to get many people to donate. I'll leave it to you folks to make sure they donate with the right variety. :)

But since I am interested: it doesn't make sense to me why the phrase is the building block. I can imagine words being pronounced differently in different phrases, but isn't it overdone to model every possible phrase? That's impossible! (And it would require a hard disk the size of a mammoth tanker to store it.)

Johan

--- (Edited on 3/3/2008 10:18 am [GMT-0600] by JohanLingen) ---

Phoneme/dialect/words/sentence coverage prompts
User: ralfherzog
Date: 3/3/2008 12:00 pm
Views: 444
Rating: 31
Hello Johan,

"isn't it overdone to model every possible phrase?"


I can't prove that my opinion is true.  Maybe I am wrong.  I am just trying to push the VoxForge project in the direction that I am convinced is the right one.  I have the feeling that some participants of the VoxForge project may have a different opinion.  For example, users like Ken or "nsh" have specific knowledge of how to use the different toolkits (HTK, Julius, CMU Sphinx).  And because of their superior knowledge they may have a different point of view.  Maybe I would have a different opinion if I were more familiar with those speech development toolkits.  They know more than I do.  We may have different opinions, but we have the same goal.  And I don't want to waste any precious time.  I just want to make sure that we reach our goal as fast as possible.  But what is the shortest way to develop a good acoustic/language model?  That's the question.  Who knows the answer?  I am just trying to propose one possible answer.

The phoneme coverage prompts are an excellent starting point.  Maybe it would be good to have phoneme coverage prompts for the Dutch and German language, too?

The development of an acoustic/language model can be thought of as the concept of abstraction layers (you can choose a different order if you like).

Layer one: phoneme coverage prompts (e.g. Dutch phonology).
Layer two: dialect coverage prompts (e.g. Dutch dialects).
Layer three: word coverage prompts (e.g. Dutch vocabulary).
Layer four: sentence coverage prompts.

One million sentences (layer four) could cover the pronunciation of 400,000 words (layer three).  The sentence layer (layer four) could be used to build word-level n-grams.  And those 400,000 words could cover the phoneme/dialect prompts (layers one and two).
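The way the sentence layer feeds the lower layers can be sketched in a few lines of Python. This is only an illustration: the `LEXICON` entries, the phone symbols, and the `coverage` helper are made-up assumptions, not VoxForge data or tooling.

```python
# Toy pronunciation dictionary (hypothetical entries, ARPAbet-like symbols).
LEXICON = {
    "thank": ["th", "ae", "ng", "k"],
    "you": ["y", "uw"],
    "see": ["s", "iy"],
    "later": ["l", "ey", "t", "er"],
}

def coverage(sentences, lexicon):
    """Return (all words, words found in the lexicon, phones those words cover)."""
    words = set()
    for sentence in sentences:
        words.update(sentence.lower().split())   # layer four -> layer three
    known = {w for w in words if w in lexicon}
    phones = {p for w in known for p in lexicon[w]}  # layer three -> layer one
    return words, known, phones

sentences = ["Thank you", "See you later", "Thank you Johan"]
words, known, phones = coverage(sentences, LEXICON)
print(len(words), "distinct words,", len(known), "with a known pronunciation")
print(len(phones), "distinct phones covered:", sorted(phones))
```

Words missing from the dictionary (here "johan") show where a pronunciation entry would have to be added before the sentence can contribute to the lower layers.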

Take a look at Moore's Law.  The cost of hard disk storage per unit of information has been dropping exponentially in a similar way.  So there should be enough storage space available to establish layer four (sentence coverage prompts) as a target.

Greetings, Ralf

--- (Edited on 2008-03-03 12:00 pm [GMT-0600] by ralfherzog) ---

Re: Is it useful to submit the same phrase several times?
User: kmaclean
Date: 3/4/2008 1:02 pm
Views: 360
Rating: 35

Hi Johan/Ralf,

I think we need to look at how a *dictation* system works to get to the bottom of this discussion. 

From Step 1 of the Tutorial:

All Speech Recognition Engines ("SRE"s) are made up of the following components:

  • Language Model or Grammar - Language Models contain a very large list of words and their probability of occurrence in a given sequence.  They are used in dictation applications.  Grammars are a much smaller file containing sets of predefined combinations of words.  Grammars are used in IVR or desktop Command and Control applications.   Each word in a Language Model or Grammar has an associated list of phonemes (which correspond to the distinct sounds that make up a word).
  • Acoustic Model - Contains a statistical representation of the distinct sounds that make up each word in the Language Model or Grammar.  Each distinct sound corresponds to a phoneme.
  • Decoder - Software program that takes the sounds spoken by a user and searches the Acoustic Model for the equivalent sounds.  When a match is made, the Decoder determines the phoneme corresponding to the sound.  It keeps track of the matching phonemes until it reaches a pause in the user's speech.  It then searches the Language Model or Grammar file for the equivalent series of phonemes.  If a match is made it returns the text of the corresponding word or phrase to the calling program.

So the Language Model and Acoustic Model work together.  I've always been under the assumption that, from an acoustic model creation perspective, we need as many samples of phones in as many different contexts as possible.  Sphinx and HTK/Julius use triphones to provide this context (i.e. one phone and its associated left and right phone).  Nsh, in a previous post (where we had a similar discussion to this one, and which would be good for you to review too), corrected me and essentially said that the key is to get recordings of words that contain the most common triphones, and to use "tied-state triphone" models to cover the rare triphones.  Tied-state triphones are a shortcut to group similar triphones together, reducing the need to have samples for every possible triphone in a language.
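As a rough illustration of what "a phone with its left and right context" means, the Python sketch below expands a phone sequence into word-internal triphones written in the HTK-style `left-phone+right` form and counts how often each one occurs. The `LEXICON` entries and function names are hypothetical, and real toolkits also handle cross-word context and state tying, which is not shown here.

```python
from collections import Counter

def to_triphones(phones):
    """Expand a phone sequence into word-internal triphone labels (l-p+r)."""
    out = []
    for i, p in enumerate(phones):
        left = f"{phones[i - 1]}-" if i > 0 else ""                 # no left context at word start
        right = f"+{phones[i + 1]}" if i < len(phones) - 1 else ""  # no right context at word end
        out.append(f"{left}{p}{right}")
    return out

# Toy pronunciation dictionary (hypothetical entries).
LEXICON = {
    "cat": ["k", "ae", "t"],
    "cap": ["k", "ae", "p"],
    "bat": ["b", "ae", "t"],
}

def triphone_counts(words, lexicon):
    """Count triphone occurrences over a list of words."""
    counts = Counter()
    for w in words:
        counts.update(to_triphones(lexicon[w]))
    return counts

print(to_triphones(LEXICON["cat"]))  # ['k+ae', 'k-ae+t', 'ae-t']
counts = triphone_counts(["cat", "cap", "bat"], LEXICON)
print(counts.most_common(3))
```

Counting like this over a prompt list is one simple way to see which common triphones already have recordings and which are still missing.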

So from the perspective of recognizing the elemental sounds that make up a word (i.e. phones) with an acoustic model, good coverage of the most common triphones is what we should be shooting for.

However, from a language model perspective, we should be focusing on the probabilities of occurrence of 400,000 different words and millions of different sentence combinations, which is what Ralf was referring to.

Ken

P.S. Ralf - thanks for really challenging my thinking on this ... it will help make our objective clearer in the future.

--- (Edited on 3/4/2008 2:02 pm [GMT-0500] by kmaclean) ---

Are one million sentences necessary to get a good language model?
User: ralfherzog
Date: 3/4/2008 5:35 pm
Views: 406
Rating: 34
Hello Ken,

Maybe it would be sufficient just to create one million sentences as text files (and *not* record them with Audacity as wav speech files) to get a good language model.  So it could be enough to create one million German prompts (or one million Dutch prompts) as text files, without creating the corresponding wav files.  If we had good coverage of the most common triphones, the computer could then find the correct transcription with the help of a good language model.
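The idea that text alone is enough for the language model can be sketched with a tiny bigram model in Python. This is a simplified illustration with plain relative-frequency estimates and no smoothing; the function name `train_bigram` and the `<s>`/`</s>` sentence markers are assumptions of this sketch, not the format of any particular toolkit.

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate P(next word | word) from plain text, without any audio."""
    contexts = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return {pair: count / contexts[pair[0]] for pair, count in bigrams.items()}

# Text-only prompts; no wav files are involved at this stage.
prompts = ["thank you", "thank you very much", "see you later"]
model = train_bigram(prompts)

for (prev, word), p in sorted(model.items()):
    print(f"P({word} | {prev}) = {p:.2f}")
```

A real dictation model would be trained on far more text and would need smoothing for unseen word pairs, but the input is the same: sentences as text.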

Well, I have to think about it.  It takes time to find an answer.

Greetings, Ralf

--- (Edited on 2008-03-04 5:35 pm [GMT-0600] by ralfherzog) ---

Re: Are one million sentences necessary to get a good language model?
User: JohanLingen
Date: 3/5/2008 5:24 pm
Views: 4870
Rating: 29

Well, I don't know if the phrases contain the right triphones, but I do know they contain words I have never used in my life yet and hopefully will not use ever again. And I am almost sure that most phrases are impossible for children to pronounce.

If the words are too difficult, there will be a great diversity in the pronunciation!

--- (Edited on 3/5/2008 5:24 pm [GMT-0600] by JohanLingen) ---
