Audio and Prompts Discussions

Re: Automatic Segmentation of LibriVox Audio
User: kmaclean
Date: 3/28/2007 11:41 am
Views: 514
Rating: 30

Hi Brian,

Tony is correct.  The problems I was having with forced alignment in my previous post were the result of a "user error".  I was trying to force-align 44.1 kHz, 32-bit float speech audio using the VoxForge Acoustic Models, which were trained on audio sampled at 16 kHz with 16 bits per sample.  Once I downsampled my speech audio to 16 kHz, 16-bit, the forced alignment worked *much* better.
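
A minimal sketch of a pre-check that would catch this kind of mismatch before alignment is attempted. It uses Python's standard `wave` module; the function name and the target constants are illustrative assumptions, not part of any VoxForge tooling. The `wave` module only handles PCM data, so a file it cannot parse is reported as a problem rather than inspected.

```python
import wave

# Assumed target format: the 16 kHz, 16-bit audio the acoustic models were trained on.
TARGET_RATE_HZ = 16000
TARGET_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample


def format_problems(wav_file):
    """Return a list of mismatches between a WAV file and the target format.

    `wav_file` may be a path or a binary file-like object.
    An empty list means the sample rate and sample width both match.
    """
    problems = []
    try:
        with wave.open(wav_file, "rb") as wav:
            rate = wav.getframerate()
            width = wav.getsampwidth()
    except wave.Error as err:
        # Raised for files the wave module cannot parse (for example non-PCM data).
        return [f"not a PCM WAV file the wave module can read: {err}"]

    if rate != TARGET_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {TARGET_RATE_HZ} Hz")
    if width != TARGET_SAMPLE_WIDTH_BYTES:
        problems.append(
            f"sample width is {8 * width} bits, expected {8 * TARGET_SAMPLE_WIDTH_BYTES} bits"
        )
    return problems
```

Any file that returns a non-empty list would need to be resampled or converted with an audio tool before being passed to the aligner.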

I hope to have a how-to posted shortly on the process.

If the VoxForge Acoustic Models don't provide satisfactory results for you, you might look at Keith Vertanen's HTK Wall Street Journal Acoustic Models.

All the best,

Ken 

--- (Edited on 3/28/2007 12:41 pm [GMT-0400] by kmaclean) ---

Re: Automatic Segmentation of LibriVox Audio
User: Visitor
Date: 3/30/2007 11:36 am
Views: 285
Rating: 23

Thanks Tony and Ken for your advice. I'll email Tony directly about the nuts and bolts...

 Brian

--- (Edited on 3/30/2007 11:36 am [GMT-0500] by Visitor) ---

Re: Automatic Segmentation of LibriVox Audio
User: kmaclean
Date: 4/6/2007 1:01 pm
Views: 331
Rating: 26

Hi,

I've got a draft of an Automated Audio Segmentation How-to.  The process still needs some streamlining, but it works ... the jimmowatt LibriVox submission (historyofengland01ch04_01_macaulay.wav 04-Mar-2007 14:32 165M ) was segmented using this process:

[   ] jimmowatt-20070308-hoe.tgz   06-Apr-2007 11:00  67.4M  

Please let me know if you have any comments,

thanks,

Ken 

 

--- (Edited on 4/ 6/2007 2:01 pm [GMT-0400] by kmaclean) ---

Re: Automatic Segmentation of LibriVox Audio
User: mofei
Date: 4/20/2007 1:02 pm
Views: 294
Rating: 24

No comments, except to say that it worked very well, and I learned a lot about alignment in the process. I followed your instructions up to point 6, and it performed well. Tony Robinson (thanks again) had earlier sent me an alignment he did, and the results were largely identical, though Tony's alignment did seem to identify the onset of words more accurately, by about 20 to 30 ms. Not sure what the reason might be.

I was thinking that it might be fun to try automating the process a little further by writing a script that helps you compose the phonetic entry for new words. Something like string-edit distance might find good candidates of similar words to prompt the user with (using something like the CPAN packages String::Approx or Text::Brew).

Incidentally, what is the source of the VoxForge dictionary (I guess from looking at it that it's US English), and what is the standard used to represent phones? I have training in linguistics, so a table of conversions to IPA symbols would be really useful.

Thanks, Brian

--- (Edited on 4/20/2007 1:02 pm [GMT-0500] by mofei) ---

Re: Automatic Segmentation of LibriVox Audio
User: kmaclean
Date: 4/20/2007 1:57 pm
Views: 646
Rating: 19

Hi Brian,

Thanks for testing the Forced Alignment How-to.

My answers to your questions follow: 

>though Tony's alignment did seem to identify the onset of words more accurately, by about 20 to 30 ms. Not sure what the reason might be

I am assuming you used the VoxForge Acoustic Model ... Tony likely has a more accurate AM trained with much more speech data than ours.
This is not a big issue for VoxForge, since we are really only looking for pauses in the speech and segmenting the audio on that basis, so we don't need that much accuracy for word onsets.
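
A minimal sketch of pause-based segmentation, assuming the aligner writes an HTK-style label file with one `start end label` entry per line and times in units of 100 ns. The pause labels (`sil`, `sp`) and the 0.3 s threshold are illustrative assumptions, not values taken from the How-to.

```python
HTK_UNITS_PER_SECOND = 10_000_000  # HTK label times are expressed in 100 ns units


def parse_labels(text):
    """Parse 'start end label' lines into (start_s, end_s, label) tuples."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        start = int(parts[0]) / HTK_UNITS_PER_SECOND
        end = int(parts[1]) / HTK_UNITS_PER_SECOND
        entries.append((start, end, parts[2]))
    return entries


def split_at_pauses(entries, pause_labels=("sil", "sp"), min_pause_s=0.3):
    """Group words into segments, cutting wherever a pause lasts at least min_pause_s.

    Returns a list of (segment_start_s, segment_end_s, [words]).
    Pauses shorter than the threshold are dropped and do not split a segment.
    """
    segments = []
    current = []
    for start, end, label in entries:
        if label in pause_labels:
            if end - start >= min_pause_s and current:
                segments.append((current[0][0], current[-1][1], [w for _, _, w in current]))
                current = []
        else:
            current.append((start, end, label))
    if current:
        segments.append((current[0][0], current[-1][1], [w for _, _, w in current]))
    return segments
```

Because only the pause boundaries are used as cut points, an error of a few tens of milliseconds in word onsets does not change which words end up in which segment.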

> I was thinking that it might be fun to try automating the process a little further by writing a script that helps you compose the phonetic entry for new words. Something like string-edit distance might find good candidates of similar words to prompt the user with (using something like the CPAN packages String::Approx or Text::Brew).

I was thinking the same thing (more out of necessity than fun ...), though I was not sure how to approach it.  As a possible approach, I was thinking that I might be able to break up a word into segments that might correspond to a word (or a segment of a word) in the dictionary, look them up in the dictionary, and then offer a list of possibilities for final approval.  If the algorithm worked well enough, have it combine the word segment pronunciations into an actual word.
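
A minimal sketch of that segment-and-combine idea, assuming the dictionary has been loaded into a Python dict that maps upper-case words to phone lists. The function name and the toy lexicon are hypothetical. It picks the split that uses the fewest dictionary pieces and simply concatenates their pronunciations, which will often be wrong for real words, so the result is only a candidate for final approval.

```python
def segment_pronunciation(word, lexicon):
    """Split `word` into the fewest pieces found in `lexicon` and join their phones.

    Returns (pieces, phones), or None if the word cannot be covered by dictionary pieces.
    """
    word = word.upper()
    # best[i] holds the best (pieces, phones) found so far for the prefix word[:i].
    best = {0: ([], [])}
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if start in best and piece in lexicon:
                pieces, phones = best[start]
                candidate = (pieces + [piece], phones + lexicon[piece])
                if end not in best or len(candidate[0]) < len(best[end][0]):
                    best[end] = candidate
    return best.get(len(word))


# Toy lexicon for illustration only; a real run would load the full dictionary.
TOY_LEXICON = {
    "BOOK": ["B", "UH", "K"],
    "BOOKS": ["B", "UH", "K", "S"],
    "SHELF": ["SH", "EH", "L", "F"],
}
```

With the toy lexicon, `segment_pronunciation("bookshelf", TOY_LEXICON)` splits the word into BOOK and SHELF and proposes `B UH K SH EH L F`.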
My understanding of the documentation on CPAN is that String::Approx lets you match and substitute strings approximately, using regex type logic. Whereas Text::Brew looks for the 'edit distance' between two words - i.e. the degree of proximity between two strings (this distance is the number of substitutions, deletions or insertions ("edits") needed to transform one string into the other one). 
So rather than segmenting a word and looking for matching words in the dictionary (as in my initial approach), a script using String::Approx and Text::Brew could take the whole word, and look for similar words in the pronunciation dictionary, and use these similar words for determining pronunciations ... very interesting, thanks!
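
A minimal sketch of the whole-word lookup, written in Python rather than with the CPAN modules. It assumes the pronunciation dictionary is a dict mapping upper-case words to phone lists; the helper names and the toy lexicon are hypothetical. The edit distance here is the plain Levenshtein count of substitutions, deletions and insertions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum substitutions, deletions or insertions to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def similar_words(word, lexicon, max_results=3):
    """Return the dictionary entries closest to `word`, nearest first."""
    word = word.upper()
    ranked = sorted(lexicon, key=lambda w: (edit_distance(word, w), w))
    return [(w, lexicon[w]) for w in ranked[:max_results]]


# Toy lexicon for illustration only.
TOY_LEXICON = {
    "LIGHT": ["L", "AY", "T"],
    "NIGHT": ["N", "AY", "T"],
    "KNIGHT": ["N", "AY", "T"],
    "CHEESE": ["CH", "IY", "Z"],
}
```

For an unknown word such as "blight", the nearest toy entry is LIGHT (one edit away), whose pronunciation `L AY T` could be offered as a starting point for the user to correct. Scanning a full-size dictionary this way is slow, so a real script would probably want to prefilter candidates, for example by word length.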
Do you know if anyone has created a program that can help compose phonetic entries for new words? This seems like something the Speech Recognition community would have already addressed.

>Incidentally what is the source of the VoxForge dictionary

The VoxForge dictionary originates from the unstressed version of the CMU Pronouncing Dictionary.
The phone set is as follows:
The current phoneme set has 39 phonemes, not counting variations for lexical stress.
Phoneme  Example  Translation
-------  -------  -----------
AA       odd      AA D
AE       at       AE T
AH       hut      HH AH T
AO       ought    AO T
AW       cow      K AW
AY       hide     HH AY D
B        be       B IY
CH       cheese   CH IY Z
D        dee      D IY
DH       thee     DH IY
EH       Ed       EH D
ER       hurt     HH ER T
EY       ate      EY T
F        fee      F IY
G        green    G R IY N
HH       he       HH IY
IH       it       IH T
IY       eat      IY T
JH       gee      JH IY
K        key      K IY
L        lee      L IY
M        me       M IY
N        knee     N IY
NG       ping     P IH NG
OW       oat      OW T
OY       toy      T OY
P        pee      P IY
R        read     R IY D
S        sea      S IY
SH       she      SH IY
T        tea      T IY
TH       theta    TH EY T AH
UH       hood     HH UH D
UW       two      T UW
V        vee      V IY
W        we       W IY
Y        yield    Y IY L D
Z        zee      Z IY
ZH       seizure  S IY ZH ER
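
On the request for a table of conversions to IPA: a minimal sketch of a lookup from the 39 symbols above to IPA. The IPA values are the conventional correspondences for this phone set as commonly given for General American English; they are an assumption here and are not taken from the VoxForge or CMU files. Because the dictionary is unstressed, AH stands for both the stressed vowel and schwa, and ER for both stressed and unstressed r-coloured vowels, so the mapping is approximate for those two.

```python
# Conventional IPA correspondences for the 39-phone set (an assumption, see above).
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AH": "ʌ", "AO": "ɔ", "AW": "aʊ", "AY": "aɪ",
    "B": "b", "CH": "tʃ", "D": "d", "DH": "ð", "EH": "ɛ", "ER": "ɝ",
    "EY": "eɪ", "F": "f", "G": "ɡ", "HH": "h", "IH": "ɪ", "IY": "i",
    "JH": "dʒ", "K": "k", "L": "l", "M": "m", "N": "n", "NG": "ŋ",
    "OW": "oʊ", "OY": "ɔɪ", "P": "p", "R": "ɹ", "S": "s", "SH": "ʃ",
    "T": "t", "TH": "θ", "UH": "ʊ", "UW": "u", "V": "v", "W": "w",
    "Y": "j", "Z": "z", "ZH": "ʒ",
}


def to_ipa(pronunciation):
    """Convert a space-separated phone string such as 'CH IY Z' to an IPA string."""
    return "".join(ARPABET_TO_IPA[phone] for phone in pronunciation.split())
```

For example, `to_ipa("CH IY Z")` converts the "cheese" entry from the table; an unknown symbol raises `KeyError`, which makes typos in a dictionary entry easy to spot.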

Ken

--- (Edited on 4/20/2007 2:57 pm [GMT-0400] by kmaclean) ---

Re: Automatic Segmentation of LibriVox Audio
User: Tony Robinson
Date: 4/20/2007 4:08 pm
Views: 402
Rating: 36

> I am assuming you used the VoxForge Acoustic Model ... Tony likely has a more accurate AM trained with much more speech data than ours.

Yes indeed.

> This is not a big issue for VoxForge, since we are really only looking for pauses in the speech

Agreed, the pauses are relatively easy to find.

> Do you know if anyone has created a program that can help compose phonetic entries for new words? This seems like something the Speech Recognition community would have already addressed.

The alignments I did for Brian were completely automatic, and that included generation of the pronunciations.  As with all pattern processing, you can do this by rule or by statistical modelling; I have a very big statistical model which works very well (and produces multiple pronunciations per word when needed).  There are many other publications on this, mostly done for use in speech synthesis systems; "grapheme to phoneme conversion" are the magic words to google for.  A good recent publication on this is by Paul Taylor, "Grapheme-to-Phoneme conversion using Hidden Markov models", in Proc. Interspeech 2005, available here: http://mi.eng.cam.ac.uk/~pat40/eurospeech05_form_04.pdf

 

Tony

P.S. I used to run the comp.speech FTP site (the main collection of free speech software in the 80's and early 90's) and on it is a free implementation of the Naval Research Lab's rule based English text to phoneme system.   It is a rule based, not statistical model, so the phoneme codes are hard wired, but it does do some text normalisation and (from a quick scan) the phoneme symbols look very close to CMUdict.   If you want a reference implementation, this might be a quick way to get going - see    ftp://svr-ftp.eng.cam.ac.uk/comp.speech/synthesis/english2phoneme.tar.gz

Another thought is festival.

-- 

Dr Tony Robinson, CEO Cantab Research Ltd
Phone:  +44 845 009 7530, Fax: +44 845 009 7532

--- (Edited on 20-April-2007 10:08 pm [GMT+0100] by Tony Robinson) ---

--- (Edited on 21-April-2007 2:38 am [GMT+0100] by Tony Robinson) ---

Re: Automatic Segmentation of LibriVox Audio
User: kmaclean
Date: 4/21/2007 12:56 pm
Views: 250
Rating: 18

Hi Tony,

Thanks for the link to the script and the article.

Also thanks for the 'magic word' - much easier to find things!

I was able to find a Perl script by Kevin Lenzo called:

t2p: Text-to-Phoneme Converter Builder

(the site says he's with Carnegie Mellon University, but his personal site says he's the CTO with Cepstral LLC - the commercial version of the Festival Text-to-Speech System) 

It uses the CMU Pronouncing Dictionary, so I think it might be a better starting point for VoxForge.

Ken  

--- (Edited on 4/21/2007 1:56 pm [GMT-0400] by kmaclean) ---

Re: Automatic Segmentation of LibriVox Audio
User: Tony Robinson
Date: 4/23/2007 12:09 pm
Views: 3067
Rating: 27

> I was able to find a Perl script by Kevin Lenzo called:
>
> t2p: Text-to-Phoneme Converter Builder

Well found, that option looks best to work with at this stage.

 

Tony 

-- 

Dr Tony Robinson, CEO Cantab Research Ltd
Phone:  +44 845 009 7530, Fax: +44 845 009 7532


--- (Edited on 23-April-2007 6:09 pm [GMT+0100] by Tony Robinson) ---

Re: Automatic Segmentation of LibriVox Audio
User: kmaclean
Date: 6/26/2007 10:22 am
Views: 231
Rating: 25

Just an update ...

I did not have much luck with t2p.  I was able to create a tree with a depth of 3 (which generated a 587k file) and a depth of 5 (8.1m), but these did not seem to provide good results.  I created one with a depth of 7 (11.9m), but it came back with errors when I tried to use it with the t2p_dt.pl script.  It might be that the VoxForge Dictionary is too large at 130K.

I've turned my attention to Festival/FestVox using the process described here:

Building letter-to-sound rules automatically,

and with some help from section "6.2.4 LTS Rules" of Nigel Rochford's Developing a New Voice for Hiberno-English in The Festival Speech Synthesis System project.

Ken 

 

--- (Edited on 6/26/2007 11:22 am [GMT-0400] by kmaclean) ---

Re: Automatic Segmentation of LibriVox Audio
User: nsh
Date: 6/26/2007 10:42 am
Views: 275
Rating: 32
What's the problem with using festival directly btw? Why do you need to train your own English rules?

--- (Edited on 6/26/2007 10:42 am [GMT-0500] by nsh) ---
