[FoRK] Google Talk

Reza B'Far reza
Wed Aug 31 08:15:14 PDT 2005

Hi Sebastian:

I probably should have added to my original post that I'm glad one of the
big guys is semi-successfully putting money into this (IBM).  There have
been and are other big players who have put a lot of money into this...but
have mostly seen miserable failures (I don't want to name the company names,
but they're all billion dollar companies).  The problem is mostly in that
everyone but IBM has been unable/afraid/whatever to do BASIC SCIENTIFIC
RESEARCH.

The problem of user-independent speech recognition is unsolvable by brute
force with the current state of the technology.  Basically, the current state
is that the recognition engines try to map some audio segment to words, using
wavelets, linear predictive algorithms, etc.  This is just dumb.  Beyond
that, people at IBM have tried to create more sophisticated models based on
phrases.  But the problem is that no one, so far as I know, has come up with
a good DSP/Math model for conversation context.  Take these two phrases:

Wheel Go Round
We'll Go Around

Sure there is a little bit of difference in pronouncing the two, but if you
had the context previous to this phrase (understood what the conversation is
about), you could do a lot better guessing what the user is talking about.
It gets worse when you're talking about words like "Whole" and "Hole", where
the pronunciation of the two words is identical (at least for most people).
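
To make the point about context concrete, here is a toy sketch in Python. It
assumes the engine has already produced two candidate phrases that it cannot
separate acoustically, and it picks the one whose topic words show up in the
earlier conversation. The topic word lists and the function name are invented
for illustration; this is not how any particular recognition engine works.

```python
# Toy illustration only: the topic vocabularies below are made up, and no
# real recognition engine is claimed to work this way.
CONTEXT_WORDS = {
    "wheel go round": {"car", "bike", "tire", "axle", "spin"},
    "we'll go around": {"traffic", "detour", "road", "route", "drive"},
}

def pick_by_context(candidates, previous_utterances):
    """Return (best_candidate, scores) using overlap with earlier conversation."""
    seen = set()
    for utterance in previous_utterances:
        seen.update(utterance.lower().split())
    # Score = number of a candidate's topic words already heard in the conversation.
    scores = {c: len(CONTEXT_WORDS[c] & seen) for c in candidates}
    best = max(candidates, key=lambda c: scores[c])
    return best, scores

history = ["the bike has a flat tire", "does the axle still spin"]
best, scores = pick_by_context(list(CONTEXT_WORDS), history)
print(best, scores)
```

With a conversation about a bike, the overlap favors "wheel go round"; with a
conversation about traffic it would favor the other phrase. The hard part,
which this sketch ignores entirely, is coming up with a model of context that
is not a hand-written word list.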

One funny thing is that most recognition engines actually work MUCH better
on speech that has been spoken by a non-native speaker precisely because of
this: non-native speakers don't link context to pronunciation too well
(hence one of the aspects of "accent").  Also, this problem gets MUCH worse
in some languages (I've heard Hindi and Chinese are two of these, although I
know nothing about those languages) where context is more important than it
is in English.  So, technology that doesn't work for 2/3 of the world's
population probably has a basic scientific flaw.

Most companies today are trying to basically fine-tune what is already
there... make it better using brute-force.  My personal opinion is that this
is just moronic (the old square peg in a round hole)... short-term thinking
on the part of the people who run them (and stockholders... looking at
profits today, not 10 years from now...)... and that's another thread :)

Of course, there are other problems too... like the fact that language is
changing all the time so it's pretty hard to keep up with it... people make
mistakes when they speak (without knowing it)... any wireless connection has
at least some reliability issues...

I think for the near future, the thing to do is to deliver voice (part of
what we do at Voice Genesis) and multi-modality (allow people to respond
with Voice to text, vice-versa, etc.) as opposed to trying to force one
mode: text.  It's actually much easier to search for an audio clip through a
bunch of audio files than it is to turn the audio clip into text and then
try to search for that text in the audio.  There is nothing set in stone
that says text is better than voice in all use-cases.  For most use-cases,
IMHO, the best way to solve this problem is not to solve it at all! Simply
come up with a clever way to get around the problem.  But, admittedly, there
are some cases where text is much preferred to voice... and for those, we'll
have to wait for some basic research to be done... maybe with our tax
dollars... if we have any left ;-)
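
As a rough sketch of what searching audio with audio can look like, the Python
below slides a short clip over a longer recording and reports the offset with
the highest normalized cross-correlation. The function name and the synthetic
sine-burst data are assumptions made for this example. It only finds a
near-identical waveform; matching a spoken query against a different speaker
would need feature-based matching (for example dynamic time warping over
spectral features), which is not shown here.

```python
import math

def find_clip(recording, clip):
    """Return (offset, score) of the best normalized-correlation match of clip."""
    n = len(clip)
    clip_mean = sum(clip) / n
    clip_centered = [x - clip_mean for x in clip]
    clip_norm = math.sqrt(sum(x * x for x in clip_centered))
    best_offset, best_score = -1, -1.0
    for offset in range(len(recording) - n + 1):
        window = recording[offset:offset + n]
        w_mean = sum(window) / n
        w_centered = [x - w_mean for x in window]
        w_norm = math.sqrt(sum(x * x for x in w_centered))
        if clip_norm == 0 or w_norm == 0:
            continue  # flat segment: correlation is undefined, skip it
        dot = sum(a * b for a, b in zip(clip_centered, w_centered))
        score = dot / (clip_norm * w_norm)
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score

# Synthetic data: a short sine burst hidden between two stretches of silence.
clip = [math.sin(2 * math.pi * 5 * t / 40) for t in range(40)]
recording = [0.0] * 100 + clip + [0.0] * 60
offset, score = find_clip(recording, clip)
print(offset, round(score, 3))
```

No text is produced at any point: the result is just a position in the
recording that the user can jump to and listen to.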

Finally, don't get me started on VoiceXML.  The idea is great: let's make a
technology more usable for more developers by providing a standard
interface... but the fact is that, when the technology that you're trying to
make usable for developers sucks, then you can bet your standard is not
going to improve it.  Sort of like HTML without HTTP (or with some other
hypothetical protocol that was not as well designed as HTTP).

I'm sure there are other folks on FoRK who have done this type of work...
would like to hear opinions :-)


-----Original Message-----
From: Sebastian Hassinger [mailto:shassinger at gmail.com]
Sent: Tuesday, August 30, 2005 3:10 PM
To: reza at voicegenesis.com
Cc: Strata R. Chalup; fork-noarchive at xent.com
Subject: Re: [FoRK] Google Talk

Thanks for the reply, Reza - I'm glad to hear someone's working on
this - as I've said I've often wondered why such a service didn't
exist.

BTW, I work at IBM and have in the past had dealings with the speech
reco teams (marketing, development and research) around IVR and call
center offerings, as well as the IBM team currently working on the
Voice XML browser. I have asked this question (i.e. converting
voicemails & conf calls to text transcripts) of many people I've run
into within these groups. You are the first person to admit the
technology is not up to the task.

The thing I've never understood - if your source is not "real time" -
e.g. a voicemail - can't you devote orders of magnitude more cycles to
transcribing it to text than in "real time" applications (e.g.
dictation)? Wouldn't that allow for more elaborate fitness testing of
recognition based on linguistic and grammatical rules, and therefore
drive the accuracy rate up?
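
One way to picture the "more cycles" idea is a second, offline pass that
re-ranks the engine's N-best guesses with a language score. The Python below
is a minimal sketch under stated assumptions: the bigram probabilities, the
floor for unseen word pairs, and the acoustic scores are all made-up numbers,
and `rescore` is a hypothetical helper, not part of any real engine's API.
Whether this kind of pass raises accuracy enough is exactly the open question
in this thread.

```python
import math

# Hypothetical bigram log-probabilities; a real system would estimate these
# from a large text corpus.
BIGRAM_LOGPROB = {
    ("we'll", "go"): math.log(0.2),
    ("go", "around"): math.log(0.1),
    ("wheel", "go"): math.log(0.001),
    ("go", "round"): math.log(0.02),
}
UNSEEN_LOGPROB = math.log(1e-6)  # assumed floor for word pairs not in the table

def lm_score(words):
    """Sum of bigram log-probabilities over adjacent word pairs."""
    pairs = zip(words, words[1:])
    return sum(BIGRAM_LOGPROB.get(pair, UNSEEN_LOGPROB) for pair in pairs)

def rescore(nbest, lm_weight=1.0):
    """nbest: list of (text, acoustic_logprob). Returns texts sorted best-first."""
    scored = [
        (acoustic + lm_weight * lm_score(text.split()), text)
        for text, acoustic in nbest
    ]
    return [text for _, text in sorted(scored, reverse=True)]

# The acoustic scores slightly prefer the first guess; the language score flips it.
nbest = [("wheel go round", -10.0), ("we'll go around", -10.5)]
ranked = rescore(nbest)
print(ranked[0])
```

Setting `lm_weight=0.0` falls back to the acoustic ranking alone, which makes
it easy to compare the two orderings on the same N-best list.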

On 30 Aug 2005 14:42:48 -0700, Reza B'Far <reza at voicegenesis.com> wrote:
> Well, being that we've looked at this problem at Voice Genesis for 3-4
> years now, I can tell you one thing:
> User-independent Speech Recognition technology that exceeds 70% accuracy
> with large vocabulary just doesn't exist.  IBM has put more money into
> this than any other company... and, well, so far, there is nothing that can do
> this...
> Now, even after you train the engine (Sphinx is the top of the line
> open-source engine...CMU stuff... as good in quality as any commercial
> package)... for speech recognition with large
> vocabulary ( > 1000 words with all possible phrase combinations) there is
> still (as of mid 2005) nothing out there that provides a technology that
> users will accept...
> We're working on getting something similar to what is being discussed
> here..., but not exactly... since there are limitations in the base
> theory/technology.
> Please include comments in reply-to fork-noarchive only.
> Reza B'Far, CTO
> Voice Genesis, Inc.
> -----Original Message-----
> From: fork-bounces at xent.com [mailto:fork-bounces at xent.com]On Behalf Of
> Sebastian Hassinger
> Sent: Tuesday, August 30, 2005 5:32 AM
> To: Strata R. Chalup
> Cc: forkit!Now
> Subject: Re: [FoRK] Google Talk
> If it's lower than about 80% it'd be useless, I'd bet. Only one way to
> find out - I'll set up a little test. Any good open source speech reco
> engines out there? If not - to the torrents!
> On 8/29/05, Strata R. Chalup <strata at virtual.net> wrote:
> >
> > This sounds like a really good idea.  My W.A.G. would be that the real accuracy
> > of the speech recognition is more like 25 - 40%, which wouldn't be very useful.
> >    In controlled conditions, eg at a cubicle desk, the folks leaving messages
> > might be as much as 80 - 90% intelligible.  But in my experience, even when I
> > call my cellphone voicemail on a landline for the clearest listening, I have
> > trouble making out calls that were sent from cars, planes, and other important
> > but noisy places.  Which unfortunately seems to be the majority of my voicemail
> > traffic!
> >
> > SRC
> >
> >
> > Sebastian Hassinger wrote:
> > > Why on earth hasn't someone strapped an industrial-grade speech
> > > recognizing transcription service onto recorded voice message storage
> > > and allowed you to read/search the indexed transcript? Even with the
> > > seeming ceiling of 98% accuracy for speech reco it'd produce something
> > > usable, surely. Plus the text transcript could be cross-linked to the
> > > recording, so that if something was unintelligible in the transcript,
> > > click, listen to it yourself. Duh.
> >
> > --
> > ========================================================================
> > Strata Rose Chalup [KF6NBZ]                      strata "@" virtual.net
> > VirtualNet Consulting                            http://www.virtual.net/
> >               ** Strategic IT for the Growing Enterprise **
> >
> > _______________________________________________
> > FoRK mailing list
> > http://xent.com/mailman/listinfo/fork
> >
> --
> Sebastian Hassinger
> shassinger at gmail.com
> +1.845.893.1377

Sebastian Hassinger
shassinger at gmail.com