The Pentagon’s blue-sky researchers are
funding a project that uses crowdsourcing to improve how machines
analyze our speech. Even more radical: Darpa wants to make systems so
accurate, you’ll be able to easily record, transcribe and recall all the
conversations you ever have.
Analyzing speech and improving speech-to-text machines has been a hobby horse for Darpa in recent years. But this project takes it a step further, exploring how crowdsourcing could make it possible for our speech to be recorded and stored forever. And it's not just about better recordings of what you say: it could lead to more recorded conversations, quickly transcribed and then stored in perpetuity, like a Twitter feed or e-mail archive for everyday speech. Imagine living in a world where every errant utterance you make is preserved forever.
University of Texas computer scientist
Matt Lease has studied crowdsourcing for years, including for an earlier
Darpa project called Effective Affordable Reusable Speech-to-Text, or EARS,
which sought to boost the accuracy of automated transcription machines.
His work has also attracted enough attention for Darpa to award him a $300,000 grant over two years to study the new project,
called “Blending Crowdsourcing with Automation for Fast, Cheap, and
Accurate Analysis of Spontaneous Speech.” The project envisions a world
that is both radically transparent and a little freaky.
The idea is that business meetings or
even conversations with your friends and family could be stored in
archives and easily searched. The stored recordings could be held in
servers, owned either by individuals or their employers. Lease is still
playing with the idea — one with huge implications for how we interact.
“In their call, what [Darpa] really
talked about were different areas of science where they would like to
see advancements in certain problems that they see,” Lease told Danger
Room at his Austin office. “So I responded talking about what I saw as
this very big both need and opportunity to really make conversational
speech more accessible, more part of our permanent record instead of
being so ephemeral, and really trying to imagine what this world would
look like if we really could capture all these conversations and make
use of them effectively going forward.”
How? The answer, Lease says, is in
widespread use of recording technologies like smartphones, cameras and
audio recorders — a kind of “democratizing force of everyday people
recording and sharing their daily lives and experiences through their
conversations.” But the trick to making the concept functional and
searchable, says Lease, is blending automated voice analysis machines
with large numbers of human analysts through crowdsourcing. That could mean involving people “strategically” to clean up transcripts where the machines make mistakes. Darpa's older EARS project relied entirely on automation, which has its drawbacks.
“Like other AI, it can only go so far,
which is based on what the state-of-the-art methodology can do,” Lease
says. “So what was exciting to me is thinking about going back to some
of that work and now taking advantage of crowdsourcing and applying that
into the mix.”
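As a rough illustration of what that blend might look like, here is a minimal sketch that routes low-confidence machine transcriptions to human reviewers. Everything in it is hypothetical: the segment-level confidence scores, the 0.85 threshold, and the function names are illustrative assumptions, not part of Darpa's or Lease's actual design.

    # Sketch: route low-confidence ASR output to human reviewers.
    # The confidence scores and 0.85 threshold are hypothetical placeholders.
    from dataclasses import dataclass

    @dataclass
    class Segment:
        text: str          # machine transcript for one stretch of audio
        confidence: float  # ASR's own confidence estimate, 0.0 to 1.0

    CONFIDENCE_THRESHOLD = 0.85  # below this, ask a person to check

    def triage(segments: list[Segment]) -> tuple[list[str], list[Segment]]:
        """Split a transcript into machine-trusted text and segments that
        should be posted as crowdsourcing tasks (e.g., on a Mechanical
        Turk-style platform) for human correction."""
        accepted, needs_review = [], []
        for seg in segments:
            if seg.confidence >= CONFIDENCE_THRESHOLD:
                accepted.append(seg.text)
            else:
                needs_review.append(seg)
        return accepted, needs_review

    if __name__ == "__main__":
        transcript = [
            Segment("so the quarterly numbers look good", 0.96),
            Segment("um we should uh revisit the the budget", 0.41),
        ]
        ok, review = triage(transcript)
        print("auto-accepted:", ok)
        print("queued for crowd workers:", [s.text for s in review])

The design point is the “strategic” part: human labor, the expensive resource, is spent only where the machine admits it is unsure.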
Crowdsourcing is all about harnessing
distributed networks of people — crowds — to do tasks better and more
efficiently than individuals or machines alone. Recently, that's meant harnessing large numbers of people to build digital maps, raise funds for a film project on Kickstarter, or do odd jobs on Amazon Mechanical Turk, one of the systems being studied as part of the project. Darpa has also taken an interest in crowdsourcing as a way to analyze vast volumes of intelligence data, and Darpa's sibling in the intelligence community, Iarpa, has researched crowdsourcing as a way to find the best intelligence predictions.
But a few problems have to be overcome
before crowdsourcing can be used to analyze speech. According to Lease,
both crowdsourcing and automated systems for analyzing and transcribing
speech are — by themselves — pretty weak. Audio transcripts written by
humans are very accurate, but they take time to produce, and the labor becomes too expensive at scale.
Meanwhile, automated systems are not very accurate, and require humans to copy-edit the result: adding missing punctuation and capitalization, and correcting for verbal disfluencies, those little noises we make to fill gaps in our speech, like “um” or “ah.” We don't always finish our words when we
talk. (But our brains are really good at not noticing it.) We change
phrases mid-thought, or mistakenly begin a sentence by saying one word,
only to quickly correct ourselves by switching to another word.
Background noise — which has plagued voice recognition machines — can
also interfere with the quality. All in all, this kind of
conversational, casual speech plays havoc with our automated machines,
the result being a sort of unintelligible word salad.
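To give a flavor of the cleanup involved, here is a toy pass, assuming simple pattern matching, that strips common fillers and collapses immediate word repetitions. Real transcription systems handle disfluencies with far more sophisticated statistical models; this only shows the class of edits humans are asked to make on machine output.

    import re

    # Toy disfluency cleanup: the filler list is an assumption, not a
    # standard; production systems model disfluencies statistically.
    FILLERS = r"\b(um+|uh+|ah+|er+)\b"

    def clean(transcript: str) -> str:
        text = re.sub(FILLERS, "", transcript, flags=re.IGNORECASE)
        # Collapse immediate repeats ("the the budget" -> "the budget").
        text = re.sub(r"\b(\w+)( \1\b)+", r"\1", text, flags=re.IGNORECASE)
        # Normalize the whitespace left behind by the deletions.
        return re.sub(r"\s+", " ", text).strip()

    print(clean("um we should uh revisit the the budget"))
    # -> "we should revisit the budget"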
“There’s a linguistic sense in that
conversational speech is quite different than text,” Lease says. “So we
really need to think about how we make this form of our language, which
is so natural to us in speech, something that is accessible to us when
it’s written down, in a way that it may not naturally be.”
It also raises some thorny legal and
social questions about privacy. For one, there is an issue with
“respecting the privacy rights of multiple people involved,” Lease says.
One solution, for a business conference that’s storing and transcribing
everything said by the participants, could be a mutual agreement
between all parties. He adds that the technical questions around archiving recorded speech are still open, but people could potentially hold their cell phone conversations on remote servers, or on individual, privately held servers.
The other problem is figuring out how to search massive amounts of transcribed speech, much as search engines like Google use complex algorithms to match search queries with results that are likely to be relevant. Fast and cheap web analytics, judging what people type and matching it up to what they click, is one way to evaluate relevance. Focus-group studies are more precise, but expensive. A third way, Lease suggests, is using crowdsourcing as a sort of “middle-ground” between the two methods.
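For a sense of what searching an archive of transcripts involves at its simplest, here is a bare-bones inverted index, the basic data structure underlying engines like Google, minus the ranking, crawling, and scale that make real search hard. The sample archive is invented for illustration.

    from collections import defaultdict

    # Bare-bones inverted index over transcripts: maps each word to the
    # set of conversation IDs containing it. Real engines layer ranking,
    # stemming, phrase queries, and distributed storage on top of this.
    def build_index(transcripts: dict[str, str]) -> dict[str, set[str]]:
        index = defaultdict(set)
        for conv_id, text in transcripts.items():
            for word in text.lower().split():
                index[word].add(conv_id)
        return index

    def search(index: dict[str, set[str]], query: str) -> set[str]:
        """Return conversations containing every word in the query."""
        words = query.lower().split()
        if not words:
            return set()
        results = index.get(words[0], set()).copy()
        for word in words[1:]:
            results &= index.get(word, set())
        return results

    archive = {
        "meeting-2012-03-01": "we should revisit the budget next quarter",
        "call-2012-03-04": "the budget looks fine to me",
    }
    idx = build_index(archive)
    print(search(idx, "budget quarter"))  # -> {'meeting-2012-03-01'}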
But it's unknown how the research will be applied by the military. Lease wouldn't speculate, and it's still very much a basic research project. If it's anything like EARS, though, it may not be too difficult to figure out. A 2003 memorandum from the Congressional Research Service described EARS as focusing on speech picked up from broadcasts and telephone conversations, “as well as extract[ing] clues about the identity of speakers” for “the military, intelligence and law enforcement communities.” Lease, for his part, didn't mention automatically recognizing voices. But the research may not have to go that far, if we're going to be recording ourselves.
Source:
www.wired.com