Speech-graphics

Hi guys,

Have you seen the latest results in voice recognition/speech analysis software?

Today I read some news about this little company in Scotland:

http://www.speech-graphics.com/

The results are promising. But I’m really interested in how many blendshapes are used to get this kind of result, and whether the same result can be achieved with bones as well.

Furthermore, it’d be interesting to see some in-game footage using this approach.

This is the most convincing one I’ve seen to date. Looks like the game market is where they want to be as well:
http://www.speech-graphics.com/leading-edge-lip-sync-service-set-to-transform-in-game-speech-development-4/

The problem with this, as is always the case, is there’s not enough info about the assets used in the videos.

If this is aimed at runtime, I want to know some basic metrics: if it was done with bones, how many were used; if it was morphs, how many; and so on.

I could get similar results to that with FaceFX currently, but what I want to see is whether I could get these results with (a) fewer bones etc. than I currently need and (b) whether it’s easier to use.

@MattRennie:
I totally agree. I really want to know what the source assets are for getting these results.
It’s kind of strange that you actually work for Rockstar and don’t have the right answer (link!):

http://www.rockstarspy.com/apps/blog/show/12266628-gta-v-to-use-new-revolutionary-speech-graphics-

Maybe you know even more? :wink:

@mattanimation:
Everything I’ve seen here is much better than what I’ve seen from FaceFX so far. The tongue movement seems very good. The pronunciation is solid, and so is the flow of it.

I’m a native German speaker, and the German example clip feels a bit over-acted but still believable.

What do you guys think about the other language examples?

The other thing about these demos is that they are lip sync demos, and whilst getting good lip sync is great, it’s the upper half of the face - the brows, the eyes and the cheeks - that carries all the emotional content.

If you watch people speaking in real life the lip movement is quite subdued and it’s the rest of the face that carries the meaning of the words.

There was a nice BioWare paper from a few years back that showed how they inserted emotional triggers into the text controls for FaceFX, and that looked great.

I speak Korean (lived there for 2 years), which is the last example, and it was pretty off on the vowel shapes for the sounds. That’s a hard one to do as it is. Part of the issue I saw is that they are translating vowel sounds to English, but Korean and English don’t share all the same vowels and consonants; in fact, reading a Korean word as if it were English sounds like poop. Anyway, yeah, I’m curious about the setup behind the scenes as well.

[QUOTE=Rick Stirling;14346]
There was a nice Bioware paper from a few years back that showed how they inserted emotional triggers into the text controls for FaceFX, and that looked great.[/QUOTE]
Was this using text tagging or something else? If you happen to have a link, I’d be interested in seeing the paper. In my experiments text tagging technically works, but it is such a linear interpolation (even with their weighted triplets system) that it is almost not worth it.

Cloward gave a talk about it at GDC 2010, IIRC. We had triggers like the ones you describe. I think the result was nice but not great. There was nothing magical going on; it was really just what you’d expect: inserting symbols into the text to indicate emotion. When you have 250k lines of dialogue, though, good enough is good enough. Quantity over quality, I always say!
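For anyone who hasn’t seen text tagging before, here is a minimal sketch of the general idea of inserting emotion symbols into dialogue text. The bracket syntax, the `split_emotion_tags` name and the intensity value are all made up for illustration; this is not FaceFX’s or BioWare’s actual format.

```python
import re

# Hypothetical tag syntax: [emotion] or [emotion:intensity].
# This is an assumption for illustration, not any real tool's format.
TAG = re.compile(r"\[(\w+)(?::([0-9.]+))?\]")

def split_emotion_tags(line):
    """Return (clean_text, events); each event is (word_index, emotion, intensity)."""
    events = []
    words = []
    for token in line.split():
        match = TAG.fullmatch(token)
        if match:
            # A tag applies from the next spoken word onward.
            intensity = float(match.group(2)) if match.group(2) else 1.0
            events.append((len(words), match.group(1), intensity))
        else:
            words.append(token)
    return " ".join(words), events

text, events = split_emotion_tags("[angry:0.8] I told you to leave. [sad] Please go.")
```

The clean text would go to the lip sync analysis, and the events would drive the brows, eyes and cheeks separately. How those events get blended over time is where the "too linear" complaint above comes in.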

I can’t see what this has to do with joint counts or whether it’s blend-shape driven… Almost any type of rig can handle the quality of output from the voice analysis software I’ve seen before. It’s been a temporal issue, not a spatial/pose issue.
So, voice analysis should be on a totally different abstraction level than bone counts; otherwise it’s just weird.
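To make the abstraction point concrete, here is a rough sketch, assuming the analysis outputs a rig-independent viseme track and each rig brings its own pose table. Every name and number below (the visemes, control names, values and the `bake` function) is hypothetical, not taken from Speech Graphics or FaceFX.

```python
# All names and numbers are hypothetical, for illustration only.
# Output of speech analysis: (time_seconds, viseme, weight) keys, independent of any rig.
viseme_track = [(0.00, "rest", 1.0), (0.12, "AA", 0.9), (0.30, "MBP", 1.0)]

# Each rig supplies its own pose table: viseme -> {control_name: value}.
blendshape_rig = {
    "rest": {},
    "AA": {"jawOpen": 0.7, "lipsWide": 0.2},
    "MBP": {"lipsPressed": 1.0},
}
bone_rig = {
    "rest": {},
    "AA": {"jaw.rotateX": 18.0, "lipCorner_L.translateX": 0.3},
    "MBP": {"lipUpper.translateY": -0.4, "lipLower.translateY": 0.4},
}

def bake(track, pose_table):
    """Turn a viseme track into per-control keyframes for one specific rig."""
    keys = []
    for time, viseme, weight in track:
        pose = pose_table[viseme]
        # Scale the rig's pose for this viseme by the analysed weight.
        keys.append((time, {ctrl: value * weight for ctrl, value in pose.items()}))
    return keys

shape_keys = bake(viseme_track, blendshape_rig)
bone_keys = bake(viseme_track, bone_rig)
```

Both rigs get keys at the same times from the same track; only the pose tables differ. In this sketch the timing comes entirely from the analysis, while the bone or blendshape count only affects how good each individual pose can look.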

Anyway… I think that from a voice-analysis perspective it’s the best I’ve seen. But as many others have said, a good performance is not in the mouth.
