Acoustic Model Discussions

Flat
New English Zamia-Speech Models released
User: guenter
Date: 6/23/2018 5:38 pm
Views: 8535
Rating: 1

The latest 20180611 builds of the English models were trained on over 800 hours of training material (including material with background noise and phone-codec effects added).

You can find download links to all our models and dicts here:

https://github.com/gooofy/zamia-speech#download

WER results for these models are not comparable to previous releases: from now on we measure WERs on speakers not in the training set, and we have also tried to make the language model more neutral (i.e. not over-represent prompts from the training material). The WER results should therefore give a more realistic assessment of the performance one can expect from our models without adaptation.

WERs for the Kaldi models are 7.02% for the large model and 7.84% for the embedded model.

WER for the continuous CMU Sphinx model is 25.4%.
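For reference, WER is the word-level edit distance between the reference transcript and the decoder output, divided by the number of reference words. A minimal sketch in Python (the example transcripts are made up, not from the test set):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution out of six reference words -> 1/6
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

Real scoring pipelines (Kaldi's compute-wer, sclite) also normalize text and report the insertion/deletion/substitution breakdown, but the core metric is the same.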

We have also been quite busy cleaning up our scripts and documentation, so it should be easier to understand what we are doing here. The models come complete with example scripts and pre-compiled binary packages for various platforms; more information on that can be found in our getting-started guide here:

https://github.com/gooofy/zamia-speech#get-started-with-our-pre-trained-models

Please note that we have changed the tarball format of our models significantly, so you will have to use the latest 0.3.1 py-kaldi-asr wrappers with these models. The new tarball format allows for model adaptation:

https://github.com/gooofy/zamia-speech#model-adaptation

as well as automatic segmentation and transcript alignment of long audio recordings (e.g. librivox audiobooks):

https://github.com/gooofy/zamia-speech#audiobook-segmentation-and-transcription-kaldi

Comments, suggestions and contributions are very welcome. For more information about the Zamia-Speech project, please visit http://zamia-speech.org/

--- (Edited on 6/23/2018 5:38 pm [GMT-0500] by guenter) ---

Re: New English Zamia-Speech Models released
User: guenter
Date: 7/2/2018 11:33 am
Views: 391
Rating: 1

I have just released an updated version of the Kaldi models, which comes with improved noise resistance and tokenizer bug fixes, resulting in slightly better WERs:

https://github.com/gooofy/zamia-speech#download

%WER 6.97 [ 53104 / 761856, 3598 ins, 14296 del, 35210 sub ] exp/nnet3_chain/tdnn_sp/decode_test/wer_9_0.0
%WER 7.78 [ 59271 / 761856, 4323 ins, 14974 del, 39974 sub ] exp/nnet3_chain/tdnn_250/decode_test/wer_8_0.0
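For readers unfamiliar with Kaldi's scoring output: the bracketed part reads total errors / reference words, followed by the insertion, deletion and substitution counts, where errors = ins + del + sub and WER = errors / words. A small parser for lines in this format (a sketch for illustration, not part of Kaldi):

```python
import re

# Parses Kaldi scoring lines such as:
# %WER 6.97 [ 53104 / 761856, 3598 ins, 14296 del, 35210 sub ] exp/.../wer_9_0.0
WER_RE = re.compile(
    r"%WER\s+([\d.]+)\s+\[\s*(\d+)\s*/\s*(\d+),"
    r"\s*(\d+)\s+ins,\s*(\d+)\s+del,\s*(\d+)\s+sub\s*\]"
)

def parse_wer(line: str) -> dict:
    m = WER_RE.search(line)
    if not m:
        raise ValueError("not a Kaldi %WER line: " + line)
    wer, errs, words, ins, dels, subs = m.groups()
    stats = {"wer": float(wer), "errors": int(errs), "words": int(words),
             "ins": int(ins), "del": int(dels), "sub": int(subs)}
    # Sanity checks: errors decompose, and WER = 100 * errors / words.
    assert stats["errors"] == stats["ins"] + stats["del"] + stats["sub"]
    assert abs(100.0 * stats["errors"] / stats["words"] - stats["wer"]) < 0.01
    return stats

line = ("%WER 6.97 [ 53104 / 761856, 3598 ins, 14296 del, 35210 sub ] "
        "exp/nnet3_chain/tdnn_sp/decode_test/wer_9_0.0")
print(parse_wer(line))
```

The `wer_9_0.0` suffix encodes the language-model weight and word-insertion penalty that scored best on this test set.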

--- (Edited on 7/2/2018 11:34 am [GMT-0500] by guenter) ---

Re: New English Zamia-Speech Models released
User: guenter
Date: 8/17/2018 7:34 am
Views: 227
Rating: 0

The latest 20180815 Kaldi models are trained on 1200 hours of recordings, now that we have added the Mozilla Common Voice v1 corpus material. They are available for download in the usual place:

https://github.com/gooofy/zamia-speech#download

WERs are still good: 

%WER 8.03 [ 65993 / 821583, 4460 ins, 18032 del, 43501 sub ] exp/nnet3_chain/tdnn_sp/decode_test/wer_9_0.0
%WER 9.03 [ 74192 / 821583, 5394 ins, 19016 del, 49782 sub ] exp/nnet3_chain/tdnn_250/decode_test/wer_8_0.0

A slight increase was to be expected, as the new training material has more diverse speakers and more noisy content, which should improve real-world unknown-speaker performance as well as noise resistance.

 

--- (Edited on 8/17/2018 7:34 am [GMT-0500] by guenter) ---

Re: New English Zamia-Speech Models released
User: guenter
Date: 9/1/2018 10:16 am
Views: 574
Rating: 2

A new Zamia-Speech Kaldi nnet3-chain model based on a factorized TDNN is now available for download here:

https://github.com/gooofy/zamia-speech#download

The new model is trained on the same dataset as the models from the 20180815 release but offers slightly better performance:

%WER 8.03 [ 65993 / 821583, 4460 ins, 18032 del, 43501 sub ] exp/nnet3_chain/tdnn_sp/decode_test/wer_9_0.0
%WER 9.03 [ 74192 / 821583, 5394 ins, 19016 del, 49782 sub ] exp/nnet3_chain/tdnn_250/decode_test/wer_8_0.0
%WER 7.54 [ 61946 / 821583, 3834 ins, 17569 del, 40543 sub ] exp/nnet3_chain/tdnn_f/decode_test/wer_8_0.0

--- (Edited on 9/1/2018 10:16 am [GMT-0500] by guenter) ---

Re: New English Zamia-Speech Models released
User: guenter
Date: 3/3/2019 4:45 am
Views: 365
Rating: 1

The latest r20190227 release of the English Zamia-Speech models for Kaldi has been trained on two additional corpora:

     zamia_en       0:05:38
     voxforge_en   72:16:35
     cv_corpus_v1 247:32:39
     librispeech  427:13:56
NEW: ljspeech      20:34:33
NEW: m_ailabs_en   43:31:54

Additionally, 400 hours of noise-augmented audio derived from the above corpora were used (background noise and phone codecs).

Stats:

%WER 7.98 [ 65664 / 822605, 5716 ins, 13364 del, 46584 sub ] exp/nnet3_chain/tdnn_sp/decode_test/wer_8_0.0
%WER 9.06 [ 74542 / 822605, 6206 ins, 15879 del, 52457 sub ] exp/nnet3_chain/tdnn_250/decode_test/wer_8_0.0
%WER 7.55 [ 62115 / 822605, 4768 ins, 13736 del, 43611 sub ] exp/nnet3_chain/tdnn_f/decode_test/wer_8_0.0

Downloads:

https://github.com/gooofy/zamia-speech#asr-models

--- (Edited on 3/3/2019 4:45 am [GMT-0600] by guenter) ---

Re: New English Zamia-Speech Models released
User: guenter
Date: 6/20/2019 3:36 am
Views: 2289
Rating: 1

With the addition of the TED-LIUM 3 corpus and positive results from the auto-review process, the r20190609 release of the English Zamia-Speech models for Kaldi has been trained on the largest amount of audio material yet (over 1100 hours):

         zamia_en            0:05:38
         voxforge_en       102:07:05
         cv_corpus_v1      252:31:11
         librispeech       450:49:09
         ljspeech           23:13:54
         m_ailabs_en       106:28:20
         tedlium3          210:13:30

Additionally, 400 hours of noise-augmented audio derived from the above corpora were used (background noise and phone codecs):

        voxforge_en_noisy   22:01:40
        librispeech_noisy  119:03:26
        cv_corpus_v1_noisy  78:57:16
        cv_corpus_v1_phone  61:38:33
        zamia_en_noisy       0:02:08
        voxforge_en_phone   18:02:35
        librispeech_phone  106:35:33
        zamia_en_phone       0:01:11

So in total, this release has been trained on over 1500 hours of audio material (training took over six weeks on a GeForce GTX 1080 Ti GPU).
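The totals are easy to verify: summing the H:MM:SS durations listed above gives roughly 1145 hours of clean and 406 hours of noise-augmented audio, i.e. just over 1550 hours combined. A short sketch (durations copied from the lists above):

```python
def to_hours(hms: str) -> float:
    """Convert an H:MM:SS duration string to fractional hours."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h + m / 60 + s / 3600

clean = ["0:05:38", "102:07:05", "252:31:11", "450:49:09",
         "23:13:54", "106:28:20", "210:13:30"]
noisy = ["22:01:40", "119:03:26", "78:57:16", "61:38:33",
         "0:02:08", "18:02:35", "106:35:33", "0:01:11"]

clean_h = sum(to_hours(d) for d in clean)
noisy_h = sum(to_hours(d) for d in noisy)
# Roughly 1145.5, 406.4 and 1551.9 hours -- consistent with "over 1500 hours".
print(round(clean_h, 1), round(noisy_h, 1), round(clean_h + noisy_h, 1))
```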

Stats:

    %WER 10.64 exp/nnet3_chain/tdnn_250/decode_test/wer_8_0.0
    %WER  8.84 exp/nnet3_chain/tdnn_f/decode_test/wer_8_0.0
    %WER  5.80 exp/nnet3_chain/tdnn_fl/decode_test/wer_9_0.0

The tdnn_250 model is the smallest one, meant for use in embedded applications (i.e. RPi-3-class hardware); tdnn_f is our regular model; tdnn_fl is the tdnn_f model adapted to a larger language model (the results illustrate the importance of language-model domain adaptation, btw).

Downloads:

https://github.com/gooofy/zamia-speech#asr-models

 

--- (Edited on 6/20/2019 3:37 am [GMT-0500] by guenter) ---
