Community Wishlist Survey 2022/Reading/IPA audio renderer/TTS investigation
Community Tech needs to select a text-to-speech (TTS) engine to drive the IPA audio renderer wish. Several good options are available to us.
Overview
TTS Engine | Type | Licence | Languages | Costs (USD/character) | SSML | Voices |
---|---|---|---|---|---|---|
phoneme-synthesis + meSpeak.js | Library | GPLv3 (open source) | 24 | N/A | Via a flag | 29 |
larynx | CLI/API | MIT (open source) | 9 | N/A | Subset | 50 |
espeak-ng | CLI/API | GPLv3 (open source) | 127 | N/A | Subset | 127[nb 1] |
Google Cloud | API | Closed source | 40 | 0.000004 | Full | 100+ |
IBM Cloud | API | Closed source | 13 | 0.00002 | Full | 26 |
Microsoft Azure | API | Closed source | 129 | 0.000016 | Full | 270 |
Amazon AWS | API | Closed source | 22 | 0.000004 | Full | 66 |
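As a rough illustration of how the per-character prices translate into a monthly bill, the following Python sketch subtracts each provider's free quota (taken from the per-engine Costs sections of this page) before charging the per-character rate. The monthly volume of 10 million characters is a made-up figure rather than a Community Tech estimate, and the names `Pricing`, `PRICING` and `monthly_cost_usd` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    free_chars_per_month: int  # free quota, in characters
    usd_per_char: float        # price per character beyond the quota


# Figures copied from this page; they exclude premium voices such as
# "WaveNet" and "Custom Neural" and may not reflect current pricing.
PRICING = {
    "Google Cloud": Pricing(4_000_000, 0.000004),
    "IBM Cloud": Pricing(10_000, 0.00002),
    "Microsoft Azure": Pricing(500_000, 0.000016),
    # The AWS free quota only applies for the first 12 months.
    "Amazon AWS": Pricing(5_000_000, 0.000004),
}


def monthly_cost_usd(engine: str, chars: int) -> float:
    """Estimate one month's cost for `chars` synthesised characters."""
    pricing = PRICING[engine]
    billable = max(0, chars - pricing.free_chars_per_month)
    return billable * pricing.usd_per_char


if __name__ == "__main__":
    volume = 10_000_000  # hypothetical monthly character count
    for engine in PRICING:
        print(f"{engine}: ${monthly_cost_usd(engine, volume):,.2f}")
```

The estimate ignores request overheads and any tiered discounts, so it is only useful for comparing the engines against each other.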
Requirements
The TTS engine we pick should:
- accept SSML (Speech Synthesis Markup Language), a W3C Recommendation[1]
- produce speech synthesis of acceptable quality
- support as wide a range of languages as possible
Audio samples
- https://tnt-dev.toolforge.org/projects/tts (work in progress)
phoneme-synthesis + meSpeak.js
Notes
meSpeak.js (modulary enhanced speak.js) is a client-side JavaScript text-to-speech library based on the speak.js project.[2] Because it runs client-side, it could possibly be included directly in an extension.
Licence
- GPLv3 (open source)
Languages & voices
24 languages (29 voices) are supported, with varying completeness[3]
- Catalan
- Czech
- German
- Greek
- English
- Esperanto
- Spanish
- Finnish
- French
- Hungarian
- Italian
- Kannada
- Latin
- Latvian
- Dutch
- Polish
- Portuguese
- Romanian
- Slovak
- Swedish
- Turkish
- Mandarin Chinese
- Cantonese Chinese
Quality
Better than larynx out of the box, and could probably be improved further with some tweaking.
Costs
N/A
SSML
SSML support can be enabled via a flag.[2]
Notes
Has some issues with (ə).
Links
larynx
Notes
larynx would need to be run as an API on the production cluster, with an extension converting IPA into SSML.
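To make the IPA to SSML step concrete, here is a minimal Python sketch that wraps an IPA string in the `<phoneme>` element defined by SSML 1.1. The function name `ipa_to_ssml`, its parameters and the default language are assumptions made for illustration. An actual extension would probably be written in PHP, and each engine may expect slightly different attributes on the root element.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespace defined by the W3C SSML 1.1 specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def ipa_to_ssml(ipa: str, fallback_text: str = "", lang: str = "en-US") -> str:
    """Wrap an IPA transcription in a minimal SSML document.

    Surrounding slashes or square brackets (e.g. "/ˈwɪki/") are stripped,
    and all values are XML-escaped so the result stays well-formed.
    """
    phonemes = ipa.strip().strip("/[]")
    return (
        f'<speak version="1.1" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f'<phoneme alphabet="ipa" ph={quoteattr(phonemes)}>'
        f"{escape(fallback_text)}"
        "</phoneme></speak>"
    )


if __name__ == "__main__":
    print(ipa_to_ssml("/ˈwɪki/", fallback_text="wiki"))
```

The fallback text inside the element is what an engine may speak if it cannot handle the phoneme string, so it is worth passing the written form of the word where one is available.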
Licence
- MIT (open source)
Languages & voices
9 languages (50 voices) are supported,[4] with voices based primarily on Glow-TTS, a voice model trained using Monotonic Alignment Search.[5]
- English
- German
- French
- Spanish
- Dutch
- Italian
- Swedish
- Swahili
- Russian
Quality
Tested: fairly poor with default settings, and will require a lot of tweaking.
Costs
N/A
SSML
Only a subset of SSML is supported; however, the most useful elements for this purpose (i.e. phonemes) are available.[6]
Notes
Links
- GitHub
- TheresNoTime's fork
- Languages/Voices
- SSML support
- CommTech's test installation: https://larynx-tts.wmcloud.org/openapi/
espeak-ng
Notes
meSpeak.js, mentioned above, is based on eSpeak, and eSpeak NG is a CLI application that is backwards compatible with eSpeak.[7] We would also need to run this as an API.
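Running the CLI behind an API could look roughly like the Python sketch below, which builds an `espeak-ng` command line and hands it to `subprocess`. It assumes `espeak-ng` is installed and on the `PATH`, and uses its `-v` (voice), `-m` (interpret SSML markup) and `-w` (write a WAV file) options. The function names and default values are hypothetical, and since only a subset of SSML is supported, whether a particular element is honoured would still need testing.

```python
import subprocess


def espeak_ng_command(ssml: str, voice: str = "en", wav_path: str = "out.wav") -> list[str]:
    """Build the argument list for an espeak-ng invocation.

    Passing a list (rather than a shell string) avoids shell quoting
    problems with the angle brackets and quotes found in SSML.
    """
    return ["espeak-ng", "-v", voice, "-m", "-w", wav_path, ssml]


def synthesize(ssml: str, voice: str = "en", wav_path: str = "out.wav") -> str:
    """Run espeak-ng and return the path of the WAV file it wrote.

    Raises subprocess.CalledProcessError if espeak-ng exits with an error,
    and FileNotFoundError if the binary is not installed.
    """
    subprocess.run(espeak_ng_command(ssml, voice, wav_path), check=True)
    return wav_path
```

An API wrapper would call `synthesize()` per request and stream the resulting file back, ideally writing to a temporary path rather than a fixed filename.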
Licence
- GPLv3 (open source)
Languages & voices
127 languages (127 voices[nb 1]) are supported[8]
Quality
Untested
Costs
N/A
SSML
Similar to meSpeak.js, a subset of SSML is supported.
Notes
Links
Google Cloud
Notes
API
Licence
- Proprietary
Languages & voices
40 languages (100+ voices)
Quality
As expected from a commercial service, very good with default settings. No tweaking necessary.
Costs
All costs exclude "WaveNet" voices (WaveNet is a DeepMind generative neural network model[9]), and are based on publicly available pricing.
Free quota
- 4 million characters per month
Then
- $0.000004 USD per character
SSML
Fully supported
Notes
Links
IBM Cloud
Notes
API
Licence
- Proprietary
Languages & voices
13 languages (26 voices) are supported[10]
- Arabic
- Chinese
- Czech
- Dutch
- English
- French
- German
- Italian
- Japanese
- Korean
- Portuguese
- Spanish
- Swedish
Quality
Untested
Costs
All costs are based on publicly available pricing.
Free quota
- 10,000 characters per month
Then
- $0.00002 USD per character
SSML
Fully supported
Notes
Links
Microsoft Azure
Notes
API
Licence
- Proprietary
Languages & voices
129 languages (270 voices) are supported
Quality
As expected from a commercial service, very good with default settings. No tweaking necessary.
Costs
All costs exclude "Custom Neural" voices, and are based on publicly available pricing.
Free quota
- 0.5 million characters per month
Then
- $0.000016 USD per character
SSML
Fully supported
Notes
Links
Amazon AWS
Notes
API
Licence
- Proprietary
Languages & voices
22 languages (66 voices) are supported
Quality
As expected from a commercial service, very good with default settings. No tweaking necessary.
Costs
All costs are based on publicly available pricing.
Free quota
- 5 million characters per month (for 12 months)
Then
- $0.000004 USD per character
SSML
Fully supported
Notes
Links
See also
Footnotes
References
- [1] "Speech Synthesis Markup Language (SSML) Version 1.1". www.w3.org. Archived from the original on 2022-03-16. Retrieved 2022-05-18.
- [2] "meSpeak.js: Text-to-Speech on the Web". www.masswerk.at. Archived from the original on 2022-04-27. Retrieved 2022-05-18.
- [3] "meSpeak – Voices & Languages". www.masswerk.at. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [4] "Larynx: End to end text to speech system using gruut and onnx". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [5] Kim, Jaehyeon; Kim, Sungwon; Kong, Jungil; Yoon, Sungroh (2020-10-22). "Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search". Archived from the original on 2022-05-16. Retrieved 2022-05-18.
- [6] "Larynx: SSML". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [7] "eSpeak NG Text-to-Speech". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [8] "eSpeak NG languages". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
- [9] "Introducing Cloud Text-to-Speech powered by DeepMind WaveNet technology". Google Cloud Blog. Archived from the original on 2022-05-16. Retrieved 2022-05-18.
- [10] "IBM Cloud Text-to-speech languages". cloud.ibm.com. Archived from the original on 2021-04-15. Retrieved 2022-05-18.