After a year of significant developments, research finds Artificial Intelligence (AI) speech recognition tools are homing in on differentiation, but human-in-the-loop workflows remain critical for ASR captioning and transcription use cases.

After a year of profound accuracy gains, ASR providers are doubling down on further improving their solutions and sharpening their differentiation, according to the latest State of ASR report, released today by 3Play Media, the leading media accessibility provider in North America.

“The ASR market continues to evolve and is fiercely competitive. It is clearly reaching a maturation stage in its evolution,” Josh Miller, co-CEO and co-Founder, 3Play Media, said. “After a year of revolutionary changes in the accuracy of the technology, the 2024 report finds vendors working on their differentiation based on specific use cases and fine-tuning their technologies accordingly.

“This year, it has become clear that not all errors are equal, challenging the standalone metric of accuracy rate. Ultimately, ASR alone is still insufficient for the captioning use case, especially regarding formatting and hallucinations. Human-in-the-loop captioning and transcription workflows remain critical for accuracy, quality, and accessibility.”

The annual study analyzes the general state of speech-to-text technology as it applies to the tasks of captioning and transcription. In addition to a surge of new advancements, 2023 brought several new players, such as Assembly and Whisper, whose ASR engines rivaled top competitors such as Speechmatics.

The new report investigates errors like hallucinations, where the engine generates incorrect words not present in the input. Whisper, a fast gainer in last year’s study, continues to be a competitive engine, but its hallucinations remain a cause for concern. These hallucinations appear more common than initially believed, and the consequences for accessibility – and ultimately a brand – are profound.

This year’s State of ASR report additionally highlights the need for a more nuanced evaluation framework that considers multiple measures, such as Word Error Rate (WER), Formatted Error Rate (FER), and the Canadian NER Model. The top engines were found to have different strengths and weaknesses, and each prioritizes different types of content or styles of transcription.
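For readers unfamiliar with the first of these metrics, the following is a minimal sketch of how Word Error Rate is conventionally computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the engine output, divided by the number of words in the reference. The function name `word_error_rate` and the simple lowercase, whitespace-based tokenization are assumptions made for illustration; this is not the methodology used in the report, and it does not cover FER or the NER Model.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Illustrative WER: word-level edit distance / reference word count.

    Assumption: words are compared case-insensitively after splitting on
    whitespace. Punctuation is left attached to words.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("the quick brown fox", "the quick brown box"))
```

Because this sketch lowercases its input, it treats "Fox" and "fox" as identical, which illustrates why a single accuracy figure can hide the formatting errors that matter for captioning.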

To obtain a free copy of The 2024 State of ASR report, please visit: https://go.3playmedia.com/rs-2024-asr.

About 3Play Media

3Play Media is an integrated media accessibility platform with patented solutions for closed captioning, transcription, live captioning, audio description, and subtitling. 3Play Media combines machine learning (ML), artificial intelligence (AI), and automatic speech recognition (ASR) with human review to provide innovative, highly accurate services. Customers span multiple industries, including media & entertainment, corporate, e-commerce, fitness, higher education, government, and eLearning.

Media Contact

Phil LeClare
phil.leclare@3playmedia.com
617-209-9406
www.3playmedia.com
@3playmedia