Community Wishlist Survey 2022/Reading/IPA audio renderer/TTS investigation

Tracked in Phabricator:
Task T307624

Community Tech needs to select a text-to-speech (TTS) engine to drive the IPA audio renderer wish. There are a few good options available to us.

Overview

TTS Engine                     | Type    | Licence             | Languages | Cost (USD/character) | SSML | Voices
phoneme-synthesis + meSpeak.js | Library | GPLv3 (open source) | 24        | N/A                  | Y    | 29
larynx                         | CLI/API | MIT (open source)   | 9         | N/A                  | Y    | 50
espeak-ng                      | CLI/API | GPLv3 (open source) | 127       | N/A                  | Y    | 127[nb 1]
Google Cloud                   | API     | Closed source       | 40        | 0.000004             | Y    | 100
IBM Cloud                      | API     | Closed source       | 13        | 0.00002              | Y    | 26
Microsoft Azure                | API     | Closed source       | 129       | 0.000016             | Y    | 270
Amazon AWS                     | API     | Closed source       | 22        | 0.000004             | Y    | 66

Requirements

The TTS engine we pick should:

Audio samples

phoneme-synthesis + meSpeak.js

Notes

meSpeak.js (modulary enhanced speak.js) is a client-side JavaScript text-to-speech library based on the speak.js project.[2] Because it runs client-side, it could possibly be included directly in an extension.

Licence

Languages & voices

24 languages (29 voices) are supported, with varying completeness[3]

  • Catalan
  • Czech
  • German
  • Greek
  • English
  • Esperanto
  • Spanish
  • Finnish
  • French
  • Hungarian
  • Italian
  • Kannada
  • Latin
  • Latvian
  • Dutch
  • Polish
  • Portuguese
  • Romanian
  • Slovak
  • Swedish
  • Turkish
  • Mandarin Chinese
  • Cantonese Chinese

Quality

Better than larynx out of the box, and could likely be improved further with some tweaking.

Costs

N/A

SSML

SSML support can be enabled via a flag.[2]

Notes

Has some issues with the schwa (ə).

Links

larynx

Notes

larynx would need to be run as an API on the production cluster, with an extension converting IPA into SSML.
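As a rough illustration of the IPA to SSML step, the sketch below wraps an IPA string in the standard SSML `phoneme` element. The function name `ipa_to_ssml` and its parameters are hypothetical and not taken from larynx or any extension; which SSML attributes a given engine honours would still need to be checked.

```python
from xml.sax.saxutils import escape, quoteattr


def ipa_to_ssml(ipa: str, fallback_text: str = "", lang: str = "en-US") -> str:
    """Wrap an IPA transcription in a minimal SSML document.

    Hypothetical helper for illustration only; not part of any engine's API.
    """
    # quoteattr() escapes &, < and > and chooses safe quote characters.
    ph = quoteattr(ipa)
    return (
        '<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        # The element text is a plain-text fallback for engines that ignore `ph`.
        f'<phoneme alphabet="ipa" ph={ph}>{escape(fallback_text)}</phoneme>'
        "</speak>"
    )


print(ipa_to_ssml("təˈmɑːtəʊ", "tomato", "en-GB"))
```

Escaping the IPA string matters because transcriptions come from user-edited wikitext and may contain characters that would otherwise break the XML.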

Licence

Languages & voices

9 languages (50 voices) are supported.[4] The voices are primarily based on Glow-TTS, a voice model trained using Monotonic Alignment Search.[5]

  • English
  • German
  • French
  • Spanish
  • Dutch
  • Italian
  • Swedish
  • Swahili
  • Russian

Quality

Tested: fairly poor with default settings, and will require a lot of tweaking.

Costs

N/A

SSML

Only a subset of SSML is supported; however, the elements most useful for this purpose (i.e. phonemes) are available.[6]

Notes

Links

espeak-ng

Notes

meSpeak.js, mentioned above, is based on eSpeak, and eSpeak NG is a CLI application that is backwards compatible with eSpeak.[7] We would also need to run this as an API.
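One way to expose the CLI behind an API is to shell out to it from a small service. The sketch below is only an assumption about how that could look: `build_espeak_command` and `synthesize` are hypothetical names, and the handling of errors, timeouts and caching that a production service would need is left out.

```python
import subprocess


def build_espeak_command(voice: str = "en") -> list[str]:
    """Build an espeak-ng invocation that reads stdin and writes WAV to stdout."""
    return [
        "espeak-ng",
        "-v", voice,  # voice/language to use
        "-m",         # interpret SSML markup in the input
        "--stdin",    # read the input text from stdin
        "--stdout",   # write WAV audio to stdout instead of playing it
    ]


def synthesize(ssml: str, voice: str = "en") -> bytes:
    """Run espeak-ng (assumed to be installed and on PATH) and return WAV bytes."""
    result = subprocess.run(
        build_espeak_command(voice),
        input=ssml.encode("utf-8"),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

An HTTP handler would then call `synthesize()` with the SSML built from the IPA string and return the bytes with an audio content type.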

Licence

Languages & voices

127[nb 1] languages[8]

Quality

Untested

Costs

N/A

SSML

Similar to meSpeak.js, a subset of SSML is supported.

Notes

Links

Google Cloud

Notes

API

Licence

  • Proprietary

Languages & voices

40 languages (100+ voices)

Quality

As expected from a commercial service, very good with default settings. No tweaking necessary.

Costs

All costs exclude "WaveNet" (DeepMind's neural generative model[9]) voices, and are based on publicly available pricing.

Free quota

  • 4 million characters per month

Then

  • $0.000004 USD per character

SSML

Fully supported
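For orientation, a request to the Cloud Text-to-Speech REST endpoint carries the SSML in a JSON body. The sketch below only builds that body and decodes a response; the helper names are hypothetical, authentication is omitted, and the voice and encoding choices are placeholder assumptions rather than recommendations.

```python
import base64
import json

# Public REST endpoint for Cloud Text-to-Speech (authentication not shown).
SYNTHESIZE_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_request_body(ssml: str, language_code: str = "en-US") -> str:
    """Serialise a synthesis request that sends SSML rather than plain text."""
    body = {
        "input": {"ssml": ssml},
        "voice": {"languageCode": language_code},
        "audioConfig": {"audioEncoding": "MP3"},
    }
    # ensure_ascii=False keeps IPA characters readable in the payload.
    return json.dumps(body, ensure_ascii=False)


def decode_audio(response_json: str) -> bytes:
    """Extract the audio, which is returned base64-encoded in `audioContent`."""
    return base64.b64decode(json.loads(response_json)["audioContent"])
```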

Notes

Links

IBM Cloud

Notes

API

Licence

  • Proprietary

Languages & voices

13 languages (26 voices) are supported[10]

  • Arabic
  • Chinese
  • Czech
  • Dutch
  • English
  • French
  • German
  • Italian
  • Japanese
  • Korean
  • Portuguese
  • Spanish
  • Swedish

Quality

Untested

Costs

All costs are based on publicly available pricing.

Free quota

  • 10,000 characters per month

Then

  • $0.00002 USD per character

SSML

Fully supported

Notes

Links

Microsoft Azure

Notes

API

Licence

  • Proprietary

Languages & voices

129 languages (270 voices) are supported

Quality

As expected from a commercial service, very good with default settings. No tweaking necessary.

Costs

All costs exclude "Custom Neural" voices, and are based on publicly available pricing.

Free quota

  • 0.5 million characters per month

Then

  • $0.000016 USD per character

SSML

Fully supported

Notes

Links

Amazon AWS

Notes

API

Licence

  • Proprietary

Languages & voices

22 languages (66 voices) are supported

Quality

As expected from a commercial service, very good with default settings. No tweaking necessary.

Costs

All costs are based on publicly available pricing.

Free quota

  • 5 million characters per month (for 12 months)

Then

  • $0.000004 USD per character
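To compare the four cloud services at a given volume, the free quotas and per-character rates listed on this page can be combined as in the sketch below. It assumes the free quota is simply subtracted from each month's usage before the rate applies, which is a simplification; the 10 million character volume is an arbitrary example, and the AWS quota only applies for the first 12 months.

```python
# (free characters per month, USD per character), as listed on this page.
PRICING = {
    "Google Cloud": (4_000_000, 0.000004),
    "IBM Cloud": (10_000, 0.00002),
    "Microsoft Azure": (500_000, 0.000016),
    "Amazon AWS": (5_000_000, 0.000004),  # free quota for the first 12 months only
}


def monthly_cost(characters: int, free_quota: int, rate: float) -> float:
    """Cost in USD for one month, assuming the free quota is deducted first."""
    billable = max(0, characters - free_quota)
    return billable * rate


if __name__ == "__main__":
    volume = 10_000_000  # arbitrary example volume, not a usage estimate
    for provider, (quota, rate) in PRICING.items():
        print(f"{provider}: ${monthly_cost(volume, quota, rate):.2f}")
```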

SSML

Fully supported

Notes

Links

See also

Footnotes

  1. Voice count uncertain; likely at least one per language.

References

  1. "Speech Synthesis Markup Language (SSML) Version 1.1". www.w3.org. Archived from the original on 2022-03-16. Retrieved 2022-05-18. 
  2. "meSpeak.js: Text-to-Speech on the Web". www.masswerk.at. Archived from the original on 2022-04-27. Retrieved 2022-05-18. 
  3. "meSpeak – Voices & Languages". www.masswerk.at. Archived from the original on 2022-05-18. Retrieved 2022-05-18. 
  4. "Larynx: End to end text to speech system using gruut and onnx". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18. 
  5. Kim, Jaehyeon; Kim, Sungwon; Kong, Jungil; Yoon, Sungroh (2020-10-22). "Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search". Archived from the original on 2022-05-16. Retrieved 2022-05-18. 
  6. "Larynx: SSML". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18. 
  7. "eSpeak NG Text-to-Speech". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18. 
  8. "eSpeak NG languages". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18. 
  9. "Introducing Cloud Text-to-Speech powered by DeepMind WaveNet technology". Google Cloud Blog. Archived from the original on 2022-05-16. Retrieved 2022-05-18. 
  10. "IBM Cloud Text-to-speech languages". cloud.ibm.com. Archived from the original on 2021-04-15. Retrieved 2022-05-18.