How to pick the best speech recognition service for your medical app

AscendHIT – a client of ours – needed to choose a speech recognition solution to replace the old Dragon speech recognition software, which runs only on desktop computers. The goal of the research project we undertook for them was to find a robust web-based SR tool which would – among other things – support medical vocabulary and be speaker independent (i.e., the system would not need to be trained to recognize specific individuals).

Classics from major market players

We started the research by exploring the most popular tools available for the purpose. Unsurprisingly, we first checked the ML-based Google Cloud Speech API, i.e. Google's speech-to-text conversion tool. Together with the client, we decided to postpone further investigation when it turned out that the tool did not support medical vocabulary and offered no way to teach the system new words (as of July 2017).
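For context, a synchronous recognition call to the Google Cloud Speech API boils down to posting a JSON body with an audio configuration and base64-encoded audio content. The sketch below only builds that payload (a simplification – authentication and the exact schema should be checked against Google's current documentation):

```ruby
require "json"
require "base64"

# Build the JSON payload for a synchronous speech:recognize request
# (v1 REST interface, as it looked around 2017). Simplified sketch –
# no API key handling, and the encoding parameters are assumptions
# for raw 16-bit PCM audio.
def recognition_payload(audio_bytes, language: "en-US")
  {
    config: {
      encoding: "LINEAR16",      # raw 16-bit linear PCM
      sampleRateHertz: 16_000,
      languageCode: language
    },
    audio: { content: Base64.strict_encode64(audio_bytes) }
  }.to_json
end
```

Note that the `config` object has no field for supplying a custom vocabulary, which is what stopped us here.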

Another major player that has created and markets a customizable speech-to-text converter is IBM, with its IBM Watson. We chose to test it, hoping that the plethora of features it comes with would cover our requirements; the free trial period on offer would also help limit the cost of the research project for the client. Having deployed an instance on an IBM Watson server, we were impressed by the speed of speech processing and by the fact that the solution was speaker independent.

It turned out, though, that Watson does not support medical vocabulary out of the box; we did not give up immediately but tried to work around the problem. Watson instances are very easy to teach – you do not need to generate audio recordings of new words. You just prepare and deliver text documents and – since Watson “knows” the phonetics of the language – it learns the new words from the text files alone, without accompanying audio recordings. We imported some medical reports containing specialized vocabulary and launched the Watson teaching process. The results improved, but we faced another issue: medical jargon is rich in words which can only be understood in specific contexts, and we had the impression we could not feed the system enough input to teach Watson the medical vocabulary well. We would have needed a much richer body of medical texts to implement this scenario. Having weighed the pros and cons, we decided to drop Watson and check some other solutions.
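The teaching workflow described above maps onto Watson's language-model customization REST interface: create a custom model, upload a plain-text corpus, then trigger training. The sketch below only builds the first two requests without sending them; the host URL and credentials are placeholders, and the endpoint paths reflect the customization API as it looked around 2017, so verify them against IBM's current documentation:

```ruby
require "net/http"
require "json"
require "uri"

# Placeholder service URL – the real one comes with your Watson credentials.
WATSON_HOST = "https://stream.watsonplatform.net/speech-to-text/api"

# Build (but do not send) the request that creates a custom language model
# on top of a base model.
def create_model_request(name:, base_model: "en-US_BroadbandModel")
  uri = URI("#{WATSON_HOST}/v1/customizations")
  req = Net::HTTP::Post.new(uri)
  req["Content-Type"] = "application/json"
  req.body = { name: name, base_model_name: base_model }.to_json
  req
end

# Build the request that uploads a text corpus (e.g. medical reports).
# Watson derives pronunciations for the new words from the text alone –
# no audio recordings are needed.
def add_corpus_request(customization_id, corpus_name, corpus_text)
  uri = URI("#{WATSON_HOST}/v1/customizations/#{customization_id}/corpora/#{corpus_name}")
  req = Net::HTTP::Post.new(uri)
  req["Content-Type"] = "text/plain"
  req.body = corpus_text
  req
end
```

The quality of the resulting model depends on how many distinct contexts the corpus shows each new word in – which is exactly where our limited body of medical texts fell short.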

The next solution we decided to check was the Microsoft Bing Speech Recognition Interface. It also comes with a free trial period, and its speech processing speed is very impressive too. Unfortunately, the tool does not support medical vocabulary and – more importantly – it is very hard to teach, as you need to generate both sound files and corresponding text files to do so. This would make teaching the system medical vocabulary much harder to achieve.

The best is not always the fastest

In the next phase of the research project we moved on to scrutinize the Nvoq SayIt speech processing tool. It is rather hard to find on the net. Interestingly enough, a review of its documentation revealed that the solution had actually been designed specifically for use in medical contexts. It recognizes a large number of medical vocabulary items and – on top of that – it is also very easy to teach, just like IBM Watson. What is more, the provider supports using smartphones as external microphones.

Though we were not able to find a free trial version, when we contacted their support team, they willingly provided us with credentials to access their servers for testing purposes. With some help from the Nvoq support team, we created two prototypes based on their API. The first was a web application built in Ruby on Rails which streams audio data from the computer microphone to the Nvoq server and returns the dictation text to the user. The other prototype was a web application integrated with the Nvoq Mobile Microphone app, which fetches audio data from a mobile phone, processes it and returns the output to the web app to be displayed as text to the user.
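The server side of the first prototype can be summarized as: accept the audio captured in the browser, forward it to the dictation service, and hand the transcript back. The sketch below illustrates that flow only; the `NVOQ_URL`, the request format and the JSON response field are placeholders – Nvoq's real API differs and requires the credentials their support team issues:

```ruby
require "net/http"
require "uri"
require "json"

# Placeholder endpoint – NOT Nvoq's actual API URL.
NVOQ_URL = URI("https://example-nvoq-server.test/dictation")

class DictationClient
  # Build the HTTP request carrying one chunk of recorded audio.
  # The content type is an assumption for uncompressed WAV input.
  def build_request(audio_bytes, content_type: "audio/wav")
    req = Net::HTTP::Post.new(NVOQ_URL)
    req["Content-Type"] = content_type
    req.body = audio_bytes
    req
  end

  # Send the audio and return the recognized text. Assumes a JSON
  # response with a "text" field – an illustrative schema, not Nvoq's.
  def transcribe(audio_bytes)
    res = Net::HTTP.start(NVOQ_URL.host, NVOQ_URL.port, use_ssl: true) do |http|
      http.request(build_request(audio_bytes))
    end
    JSON.parse(res.body)["text"]
  end
end
```

In the real prototype this sat behind a Rails controller action, with the browser streaming microphone chunks to it.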

On the whole, the prototypes helped us achieve our project objectives; the only remaining issue was processing speed, but that was not of vital importance for this project. There is hope for some positive change in the near future though – the Nvoq team plans to add WebSocket support (as of July 2017), so we should be able to get the output much faster once they do. All in all, the client decided to use Nvoq SayIt, in particular because of its specialized support for medical vocabulary.

All is relative

The key lesson we learnt from the research project is that the best way to discover and choose a suitable solution is to find, explore and compare many different alternatives, assessing them with a very specific context in mind. Initially, the solutions often look very similar, and it is only after a more detailed analysis that the differences start to emerge, along with a better understanding of each tool's suitability for a specific application. The solutions you discard in the research process are not necessarily worse; they simply lend themselves better to other contexts. If – for example – you need to support multiple speakers (without training the system to recognize their individual speech patterns) but do not need custom words, the Google solution is probably the best in the set of tools we researched. The IBM Watson tool may in turn be the best choice when you need to support custom words and can provide learning materials which include those words in many different contexts. To sum up: identify a number of alternative solutions, then analyze and compare them keeping the peculiarities of your specific application context in mind.

 

RESEARCH SUMMARY GRID

Google Cloud Speech API
- medical vocabulary support: no
- ease of teaching new vocabulary: not supported
- speech processing speed: very fast
- free trial period: yes
- pricing: $0.024 per audio minute transmitted

IBM Watson
- medical vocabulary support: no
- ease of teaching new vocabulary: easy
- speech processing speed: very fast
- free trial period: yes
- pricing: $0.015 per audio minute transmitted

Nvoq SayIt
- medical vocabulary support: yes
- ease of teaching new vocabulary: easy
- speech processing speed: slow
- free trial period: available only for big companies
- pricing: calculated individually

Microsoft Bing Speech Recognition
- medical vocabulary support: no
- ease of teaching new vocabulary: very hard
- speech processing speed: fast
- free trial period: yes
- pricing: $4.00 per 1,000 transactions (1 transaction = any request sent to the server which contains an audio recording)

* tool parameters and characteristics described as of July 2017


Łukasz Gaweł is an experienced back-end developer with strong Ruby on Rails capabilities. Venturing boldly into the territories of emerging technologies, Łukasz demonstrates a solid understanding of the newest web technologies and never shuns an opportunity to share his insights with others.