How to get the time of each Phoneme/Viseme in audio file for lipsync?
User: yaoyansi
Date: 9/1/2009 7:19 am
Views: 12192
Rating: 1

Hi all,
I want to do lipsync for my CG animation. I think I have to get the time of each phoneme/viseme in the audio file. My questions are:

1. How do I get the time of each phoneme/viseme from an audio file (e.g. wav/ogg/...)? For example, I need information like this:

time(ms)            Phoneme or Viseme
0.005~0.008         W
0.008~0.1           A
0.1~0.15            ZH
........

2. Can I use VoxForge/Julius or anything else to get this information?

3. I know that the Microsoft Agent 2.0 tool "Linguistic Information Sound Editing Tool" can generate a *.lwv file from a *.wav file, and that the *.lwv file contains the phoneme/viseme information.
But it seems that the format of the lwv file is not open. So, is there any way to get the phoneme/viseme times from an lwv file?

4. Or, is there any free software that I can use to get <time, phoneme/viseme> pairs from an audio file?

Sorry for my poor English, and thank you in advance.

--- (Edited on 9/1/2009 7:19 am [GMT-0500] by yaoyansi) ---

Re: How to get the time of each Phoneme/Viseme in audio file for lipsync?
User: kmaclean
Date: 9/1/2009 8:21 am
Views: 54
Rating: 1

Hi yaoyansi,

>I want to do the lipsync for my cg animation.

The Galatea Project does this using Julius.  From their site:

Providing an open-source toolkit for anthropomorphic spoken dialogue agent with which one can develope a life-like animated agent that talks with the user and can be easily customized with the face, voice, and dialog grammar.

I don't think it has been active for a while...

Ken

--- (Edited on 9/1/2009 9:21 am [GMT-0400] by kmaclean) ---

Re: How to get the time of each Phoneme/Viseme in audio file for lipsync?
User: Visitor
Date: 9/1/2009 10:54 pm
Views: 88
Rating: 2

Thanks for your reply, Ken.

I downloaded Galatea and read 00readme-julius.txt.
I think the option -palign is exactly what I need,
but I don't know how to set up a simple test.

For example, I think I need a text-based recognition test.
tst.wav and tst.txt are the input, and a sequence of <time stamp, phoneme> pairs should be generated.

For example, I need this test to generate information like this:

<time(ms)>          <Phoneme>
0.00~0.03           EH
0.03~0.09           n
0.09~0.12           t
0.12~0.21           ER
.......

Or generate the information at a constant time interval, like this
(using the option -proginterval, right? Let's say -proginterval 30):

<interval(30ms)>    <Phoneme>
0.00~0.03           EH
0.03~0.06           n
0.06~0.09           n
0.09~0.12           t
0.12~0.15           ER
0.15~0.18           ER
0.18~0.21           ER
.......
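
For reference, the second layout can be derived from the first by resampling, whatever tool produces the variable-length segments. Below is a minimal Python sketch of that step. It assumes the segments are sorted, contiguous, and given in integer milliseconds (to avoid floating-point drift); the function name to_fixed_frames is made up for illustration and is not part of Julius or Galatea.

```python
def to_fixed_frames(segments, step_ms):
    """Expand (start_ms, end_ms, phoneme) segments into fixed-length frames.

    Each frame is labelled with the phoneme active at the frame's start time.
    Assumes segments are sorted and contiguous (an assumption of this sketch).
    """
    frames = []
    if not segments:
        return frames
    t = segments[0][0]            # start of the first segment
    total_end = segments[-1][1]   # end of the last segment
    i = 0
    while t < total_end:
        # Advance to the segment that contains time t.
        while segments[i][1] <= t:
            i += 1
        frames.append((t, min(t + step_ms, total_end), segments[i][2]))
        t += step_ms
    return frames


if __name__ == "__main__":
    # The example segments from this post, written as milliseconds.
    segs = [(0, 30, "EH"), (30, 90, "n"), (90, 120, "t"), (120, 210, "ER")]
    for start, end, phoneme in to_fixed_frames(segs, 30):
        print(f"{start:4d}~{end:<4d} {phoneme}")
```

With a 30 ms step, the four example segments expand into the seven rows listed above.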


My question is: how do I set up the config file and the other options?
Can you show me a step-by-step guide?

Thank you.

--- (Edited on 9/1/2009 10:54 pm [GMT-0500] by Visitor) ---

Re: How to get the time of each Phoneme/Viseme in audio file for lipsync?
User: paradocs
Date: 9/2/2009 4:07 am
Views: 120
Rating: 1

Hi yaoyansi,

The above is just what you need.

However, it might be useful to test with text-to-speech first, and later adjust this to match recorded real speech.

festival -i

festival> (set! utt1 (SayText "Hello world"))

festival> (utt.save utt1 "-")

#partial output

9 id _17 ; name pau ; dur_factor 0 ; end 0.22 ; source_end 0.081826 ;
10 id _7 ; name hh ; dur_factor -0.296956 ; end 0.277954 ; source_end 0.188655 ;
11 id _8 ; name ax ; dur_factor -0.317324 ; end 0.320176 ; source_end 0.289519 ;
12 id _9 ; name l ; dur_factor 0.240634 ; end 0.399659 ; source_end 0.378457 ;

The "-" can be changed to an output file name, and a script could transform this into the form you desire.

This gives a better output:

festival> (utt.save.segs utt1 "-")
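
A script of the kind mentioned above could look roughly like the following Python sketch, which reads the utt.save item lines shown in the partial output and turns them into (start, end, phoneme) rows. It rests on a few assumptions: only lines carrying both a "name" and an "end" field are treated as segments, each segment starts where the previous one ends (the dump stores only end times), and the first segment found starts at 0.0. The names FIELD and parse_segments are made up for this example.

```python
import re

# One "key value ;" pair from an item line such as
# "10 id _7 ; name hh ; dur_factor -0.296956 ; end 0.277954 ; ..."
FIELD = re.compile(r"(\w+)\s+(\S+)\s*;")


def parse_segments(dump_text):
    """Return (start_s, end_s, phoneme) tuples from utt.save item lines."""
    rows = []
    prev_end = 0.0  # assumption: the first segment starts at time 0
    for line in dump_text.splitlines():
        fields = dict(FIELD.findall(line))
        if "name" not in fields or "end" not in fields:
            continue  # not a segment line (header, other relation, ...)
        end = float(fields["end"])
        rows.append((prev_end, end, fields["name"]))
        prev_end = end
    return rows


if __name__ == "__main__":
    sample = """\
9 id _17 ; name pau ; dur_factor 0 ; end 0.22 ; source_end 0.081826 ;
10 id _7 ; name hh ; dur_factor -0.296956 ; end 0.277954 ; source_end 0.188655 ;
11 id _8 ; name ax ; dur_factor -0.317324 ; end 0.320176 ; source_end 0.289519 ;
12 id _9 ; name l ; dur_factor 0.240634 ; end 0.399659 ; source_end 0.378457 ;
"""
    for start, end, phoneme in parse_segments(sample):
        print(f"{start:.3f}~{end:.3f}  {phoneme}")
```

Since the sample is only a partial dump, the start time of its first row is a guess; on a full dump the first segment is normally the leading pause.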

Best Wishes

paradocs

  

--- (Edited on 9/2/2009 4:21 am [GMT-0500] by paradocs) ---

Re: How to get the time of each Phoneme/Viseme in audio file for lipsync?
User: kmaclean
Date: 9/6/2009 2:26 pm
Views: 77
Rating: 2

Hi yaoyansi,

I need to make a correction... I may have misled you when I said that "Galatea uses Julius" in answer to your original question: "I want to do the lipsync for my cg animation."

Galatea is a spoken dialog agent (using a picture of an actual person as the agent) and uses Julius for speech recognition to take a question, and Text-to-Speech (Japanese TTS engine) to reply to the question. 

The dialog agent does *not* use speech recognition to determine phoneme timings of a speech audio recording.  It uses a process similar to the one described by Paradocs to synch up TTS utterances to lip movement.

Ken

--- (Edited on 9/6/2009 3:26 pm [GMT-0400] by kmaclean) ---

Re: How to get the time of each Phoneme/Viseme in audio file for lipsync?
User: kmaclean
Date: 9/6/2009 2:31 pm
Views: 167
Rating: 2

Hi yaoyansi,

>My question is how to set the config file and other options?

Take a look at the acoustic model testing section of the VoxForge Tutorial.  Look at the section entitled Test Acoustic Model Using HTK... the file recout.mlf contains the information that I think you are looking for.
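
To pull timestamps out of such a file, something like the Python sketch below might do. It assumes recout.mlf follows the usual HTK master label file layout: a "#!MLF!#" header, a quoted file pattern per utterance, label lines of the form "start end label [score]" with times in 100 ns units, and a lone "." closing each utterance. Whether the labels are words or phones depends on how the recognizer was run. The function name parse_mlf and the sample contents are made up for illustration.

```python
HTK_UNITS_PER_SECOND = 10_000_000  # HTK label times are in 100 ns units


def parse_mlf(text):
    """Return {file_pattern: [(start_s, end_s, label), ...]} from MLF text."""
    result = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "#!MLF!#":
            continue
        if line.startswith('"'):
            # Start of a new utterance, e.g. "*/sample1.rec"
            current = result.setdefault(line.strip('"'), [])
        elif line == ".":
            current = None  # end of this utterance
        elif current is not None:
            parts = line.split()
            # Keep only lines that carry start and end times plus a label.
            if len(parts) >= 3 and parts[0].isdigit() and parts[1].isdigit():
                start = int(parts[0]) / HTK_UNITS_PER_SECOND
                end = int(parts[1]) / HTK_UNITS_PER_SECOND
                current.append((start, end, parts[2]))
    return result


if __name__ == "__main__":
    # Hypothetical file contents, not taken from the tutorial.
    sample = '#!MLF!#\n"*/sample1.rec"\n0 2500000 sil -1500.0\n2500000 3100000 hh -420.5\n.\n'
    for name, labels in parse_mlf(sample).items():
        for start, end, label in labels:
            print(f"{name}  {start:.3f}~{end:.3f}  {label}")
```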

Ken

--- (Edited on 9/6/2009 3:31 pm [GMT-0400] by kmaclean) ---

Re: How to get the time of each Phoneme/Viseme in audio file for lipsync?
User: yaoyansi
Date: 9/7/2009 9:38 pm
Views: 82
Rating: 1

Thanks for your reply again, Ken.

Yes, I have to make the lip movement driven by an audio recording in my CG animation. But Galatea seems to make the lip movement driven by text, right?

So Galatea may not be suitable for my lipsync, right? What about Julius, Sphinx and VoxForge?

And is there any other freeware or open-source project that can do lipsync for CG animation?

 

--- (Edited on 9/7/2009 9:38 pm [GMT-0500] by yaoyansi) ---

Re: How to get the time of each Phoneme/Viseme in audio file for lipsync?
User: kmaclean
Date: 9/9/2009 8:57 pm
Views: 3094
Rating: 2

Hi yaoyansi,

>But Galatea seems to make the lip movement driven by text, right?

correct

>Galatea may not be suitable for my lipsync

correct

>What about Julius, Sphinx and Voxforge?

Julius and Sphinx are speech recognition engines, and theoretically you should be able to use them to give you monophone timestamps...

But for any real accuracy, you would likely have to train your acoustic models with the recorded speech that you would use in your CG app.

If you are going to go to all the trouble to get this to work, it might be just as easy to use the Festival TTS approach rather than trying to get time stamps from recorded speech using speech recognition.

If a particular voice is important to you, a hybrid approach might be to train a TTS engine with your recorded speech (if you have enough), and then use it to generate speech and timestamps for your CG app.
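
Whichever route produces the timestamps, the CG side usually wants mouth shapes rather than phonemes. The Python sketch below shows one way timed phonemes could be collapsed into a viseme track. The phoneme-to-viseme table is entirely hypothetical (a real character rig defines its own classes), and the names VISEME_OF and to_viseme_track are made up for this example.

```python
# Hypothetical phoneme-to-viseme table; replace with the classes of your rig.
VISEME_OF = {
    "p": "closed", "b": "closed", "m": "closed",
    "f": "lip_teeth", "v": "lip_teeth",
    "aa": "open", "ax": "open",
    "pau": "rest",
}


def to_viseme_track(segments, table=VISEME_OF, default="neutral"):
    """Map (start, end, phoneme) rows to visemes, merging adjacent repeats."""
    track = []
    for start, end, phoneme in segments:
        viseme = table.get(phoneme.lower(), default)
        if track and track[-1][2] == viseme and track[-1][1] == start:
            # Same mouth shape continues: extend the previous entry.
            track[-1] = (track[-1][0], end, viseme)
        else:
            track.append((start, end, viseme))
    return track


if __name__ == "__main__":
    phones = [(0.0, 0.22, "pau"), (0.22, 0.28, "m"),
              (0.28, 0.33, "b"), (0.33, 0.40, "ax")]
    for start, end, viseme in to_viseme_track(phones):
        print(f"{start:.2f}~{end:.2f}  {viseme}")
```

Merging adjacent repeats keeps the number of animation keys down; phonemes missing from the table fall back to the default shape.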

VoxForge is just a speech corpus repository, not a speech recognition engine.

>Is there any other freeware or opensource project can do the lipsync for the cg animation

Not that I know of...

Ken

--- (Edited on 9/9/2009 9:57 pm [GMT-0400] by kmaclean) ---
