KIT Department of Informatics

AI outperforms humans in speech recognition

KIT researchers have developed the world's first speech recognition system that works better than humans and is faster than other AIs.

Following and accurately reproducing an everyday conversation is one of the greatest challenges in artificial intelligence (AI) research. Researchers at the Karlsruhe Institute of Technology (KIT) have now, for the first time, built a computer system that recognizes such spontaneously spoken language more accurately than humans, and does so with only minimal delay behind the speaker. They report their results on the online platform arXiv.org.

Thanks to a superior speech recognition system, KIT's "Lecture Translator" will deliver better results with minimal delay in the future. (Photo: KIT)

"When people talk to each other, there are interruptions, stutters, hesitations like 'uh' or 'hm', laughs and coughs," says Alex Waibel, Professor of Computer Science at KIT. "Moreover, words are often pronounced unclearly." Even for humans, producing an accurate transcript of an informal dialogue is therefore difficult. "This was even more difficult for an AI," says the speech recognition expert. A team of KIT scientists and employees of KITES, a KIT spin-off company, has now built the first computer system worldwide that performs this task better than humans and faster than other systems.

Waibel had previously developed an automatic live translator that translates university lectures from German or English into the languages of foreign students in step with the lecture. The "Lecture Translator" has been in use in KIT lecture halls since 2012. "The recognition of spontaneous speech is the most important component in this system," explains Waibel, "since errors and delays in recognition make the translation incomprehensible. The human error rate here is around 5.5 percent. Our system is now at 5.0 percent." Accuracy is not the only thing that matters, however: the system must also output its result quickly enough for students to follow the lecture live. For the first time, the researchers were able to reduce this delay, known as latency, to one second. This is the lowest latency ever achieved by a speech recognition system of this quality, Waibel emphasizes.

Error rate and delay are measured with the standardized, internationally recognized "Switchboard Benchmark" test. It is regarded as the reference benchmark in the international AI research community's competition to build a machine that approaches or exceeds the human ability to recognize spontaneous speech.
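
The article does not spell out how the error rate is calculated. Speech recognition benchmarks such as Switchboard conventionally report the word error rate: the number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The following Python function is a minimal sketch of that calculation, not the scoring tool used by the researchers. The function name `word_error_rate` and the example sentences are made up for illustration, and real evaluations usually normalize the text (for example case and hesitation markers) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word inserted in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted and one missing word in six.
rate = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{rate:.1%}")
```

Under this measure, an error rate of 5.0 percent corresponds to roughly one wrong, missing or extra word per twenty words of reference transcript.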

However, a recognition system alone cannot understand content or context, says Waibel. "This is exclusively about acoustic recognition under scientifically comparable conditions." Dialogue, translation and other AI modules can now enable linguistic interaction faster and with greater accuracy.

Details on the KIT Center Information - Systems - Technologies (in English): http://www.kcist.kit.edu

Further materials: Link to the paper: https://arxiv.org/abs/2010.03449