Community Wishlist Survey 2022/Reading/IPA audio renderer/TTS investigation

Tracked in Phabricator:
Task T307624

Community Tech need to select a text-to-speech (TTS) engine to drive the IPA audio renderer wish. There are a few good options available to us.

Overview

TTS Engine	Type	Licence	Languages	Costs (USD/character)	SSML	Voices
phoneme-synthesis + meSpeak.js	Library	GPLv3 (open source)	24	N/A	Y	29
larynx	CLI/API	MIT (open source)	9	N/A	Y	50
espeak-ng	CLI/API	GPLv3 (open source)	127	N/A	Y	127^{[nb 1]}
Google Cloud	API	Closed source	40	0.000004	Y	100
IBM Cloud	API	Closed source	13	0.00002	Y	26
Microsoft Azure	API	Closed source	129	0.000016	Y	270
Amazon AWS	API	Closed source	22	0.000004	Y	66
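The per-character prices in the table are small enough to be hard to compare at a glance. As a rough illustration, the sketch below converts them to a price per million characters. It uses only the list prices from the table, ignores free quotas, and the function name is purely illustrative.

```python
from decimal import Decimal

# Per-character list prices (USD), copied from the overview table.
# Decimal avoids floating-point rounding noise on such small values.
PRICE_PER_CHARACTER = {
    "Google Cloud": Decimal("0.000004"),
    "IBM Cloud": Decimal("0.00002"),
    "Microsoft Azure": Decimal("0.000016"),
    "Amazon AWS": Decimal("0.000004"),
}


def price_per_million(engine: str) -> Decimal:
    """Return the list price in USD for one million characters."""
    return PRICE_PER_CHARACTER[engine] * 1_000_000


if __name__ == "__main__":
    for engine in PRICE_PER_CHARACTER:
        print(f"{engine}: ${price_per_million(engine):.2f} per million characters")
```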

Requirements

The TTS engine we pick should:

accept SSML (Speech Synthesis Markup Language), a W3C standard^[1]
produce acceptable quality speech synthesis
support as wide a range of languages as possible
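To make the SSML requirement concrete, here is a minimal sketch of how an IPA transcription might be wrapped in the standard SSML `phoneme` element. The function name, the example word, and the choice to include fallback text are assumptions for illustration only; individual engines may accept only a subset of this markup.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C SSML 1.1 specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def ipa_to_ssml(ipa: str, fallback_text: str = "", lang: str = "en") -> str:
    """Wrap an IPA string in an SSML document (hypothetical helper).

    fallback_text is the plain spelling an engine may read out if it
    cannot interpret the phoneme string.
    """
    speak = ET.Element(
        "speak",
        {"version": "1.1", "xmlns": SSML_NS, "xml:lang": lang},
    )
    phoneme = ET.SubElement(speak, "phoneme", {"alphabet": "ipa", "ph": ipa})
    phoneme.text = fallback_text
    # encoding="unicode" returns a str and keeps IPA characters unescaped.
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(ipa_to_ssml("təˈmeɪtoʊ", fallback_text="tomato"))
```

Building the document with an XML library rather than string concatenation means that any quotes or angle brackets in the input are escaped automatically.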

Audio samples

https://tnt-dev.toolforge.org/projects/tts (work in progress)

phoneme-synthesis + meSpeak.js

Notes

meSpeak.js (modulary enhanced speak.js) is a client-side JavaScript text-to-speech library based on the speak.js project^[2]. As it runs client-side, it could possibly be included directly in an extension.

Licence

GPL v3

Languages & voices

24 languages (29 voices) are supported, with varying completeness^[3]

Catalan
Czech
German
Greek
English
Esperanto
Spanish
Finnish
French
Hungarian
Italian
Kannada
Latin
Latvian
Dutch
Polish
Portuguese
Romanian
Slovak
Swedish
Turkish
Mandarin Chinese
Cantonese Chinese

Quality

Better than larynx out of the box, and could be improved further with some tweaking.

Costs

N/A

SSML

SSML support can be enabled via a flag.^[2]

Notes

Has some issues with the schwa (ə)

Links

larynx

Notes

larynx would need to be run as an API on the production cluster, with an extension packaging IPA -> SSML

Licence

MIT

Languages & voices

9 languages (50 voices) are supported^[4]. The voices are primarily based on Glow-TTS, a voice model trained using Monotonic Alignment Search^[5]

English
German
French
Spanish
Dutch
Italian
Swedish
Swahili
Russian

Quality

Tested: fairly poor with default settings, and will require a lot of tweaking.

Costs

N/A

SSML

Only a subset of SSML is supported; however, the most useful elements for our purposes (i.e. phonemes) are available^[6]

Notes

Links

GitHub
TheresNoTime's fork
Languages/Voices
SSML support
CommTech's test installation: https://larynx-tts.wmcloud.org/openapi/

espeak-ng

Notes

meSpeak.js, mentioned above, is based on eSpeak. eSpeak NG is a CLI application that is backwards compatible with eSpeak^[7]. We would also need to run this as an API.
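As a rough sketch of what an API wrapper around the CLI might do internally, the snippet below assembles an `espeak-ng` command line. It relies on the `-m` (interpret SSML markup), `-v` (voice) and `-w` (write WAV file) options; the function name, default voice and output path are illustrative assumptions, and actually running the command requires espeak-ng to be installed.

```python
import subprocess


def build_espeak_ng_command(
    ssml: str, voice: str = "en", wav_path: str = "out.wav"
) -> list[str]:
    """Return an argv list that asks espeak-ng to render SSML to a WAV file."""
    return [
        "espeak-ng",
        "-m",            # interpret the input as SSML markup
        "-v", voice,     # voice (language) to use
        "-w", wav_path,  # write audio to this file instead of playing it
        ssml,
    ]


if __name__ == "__main__":
    command = build_espeak_ng_command("<speak>Hello <break/> world</speak>")
    print(command)
    # Uncomment to actually synthesise (assumes espeak-ng is on PATH):
    # subprocess.run(command, check=True)
```

Passing the arguments as a list, rather than a single shell string, avoids shell quoting problems with user-supplied markup.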

Licence

GPL v3

Languages & voices

127^{[nb 1]} languages^[8]

See list

Quality

Untested

Costs

N/A

SSML

Similar to meSpeak.js, a subset of SSML is supported.

Notes

Links

Google Cloud

Notes

API

Licence

Proprietary

Languages & voices

40 languages (100+ voices)

See list

Quality

As expected from a commercial service, very good with default settings. No tweaking necessary.

Costs

All costs exclude "WaveNet" voices (a DeepMind neural network model^[9]), and are based on publicly available pricing.

Free quota

4 million characters per month

Then

$0.000004 USD per character

SSML

Fully supported

Notes

Links

IBM Cloud

Notes

API

Licence

Proprietary

Languages & voices

13 languages (26 voices) are supported^[10]

Arabic
Chinese
Czech
Dutch
English
French
German
Italian
Japanese
Korean
Portuguese
Spanish
Swedish

Quality

Untested

Costs

All costs are based on publicly available pricing.

Free quota

10,000 characters per month

Then

$0.00002 USD per character

SSML

Fully supported

Notes

Links

Microsoft Azure

Notes

API

Licence

Proprietary

Languages & voices

129 languages (270 voices) are supported

See list

Quality

As expected from a commercial service, very good with default settings. No tweaking necessary.

Costs

All costs exclude "Custom Neural" voices, and are based on publicly available pricing.

Free quota

0.5 million characters per month

Then

$0.000016 USD per character

SSML

Fully supported

Notes

Links

Amazon AWS

Notes

API

Licence

Proprietary

Languages & voices

22 languages (66 voices) are supported

See list

Quality

As expected from a commercial service, very good with default settings. No tweaking necessary.

Costs

All costs are based on publicly available pricing.

Free quota

5 million characters per month (for 12 months)

Then

$0.000004 USD per character
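Combining the free quotas and per-character rates listed for the four commercial services gives a rough way to estimate a monthly bill. The sketch below assumes the free quota is simply deducted each month and a flat rate applies afterwards, which may not match each provider's actual billing rules. The 10 million character volume is a made-up figure for illustration, not a usage estimate.

```python
from decimal import Decimal
from typing import NamedTuple


class Pricing(NamedTuple):
    free_characters: int        # free quota per month
    usd_per_character: Decimal  # rate once the quota is used up


# Figures copied from the "Costs" sections of this page.
PRICING = {
    "Google Cloud": Pricing(4_000_000, Decimal("0.000004")),
    "IBM Cloud": Pricing(10_000, Decimal("0.00002")),
    "Microsoft Azure": Pricing(500_000, Decimal("0.000016")),
    # AWS free quota only applies for the first 12 months.
    "Amazon AWS": Pricing(5_000_000, Decimal("0.000004")),
}


def monthly_cost(engine: str, characters: int) -> Decimal:
    """Estimate the monthly cost in USD under the simple model described above."""
    pricing = PRICING[engine]
    billable = max(0, characters - pricing.free_characters)
    return billable * pricing.usd_per_character


if __name__ == "__main__":
    volume = 10_000_000  # hypothetical monthly character count
    for engine in PRICING:
        print(f"{engine}: ${monthly_cost(engine, volume):.2f}")
```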

SSML

Fully supported

Notes

Links

Footnotes

[nb 1] Voice count uncertain; likely at least one per language.

References

[1] "Speech Synthesis Markup Language (SSML) Version 1.1". www.w3.org. Archived from the original on 2022-03-16. Retrieved 2022-05-18.
[2] "meSpeak.js: Text-to-Speech on the Web". www.masswerk.at. Archived from the original on 2022-04-27. Retrieved 2022-05-18.
[3] "meSpeak – Voices & Languages". www.masswerk.at. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
[4] "Larynx: End to end text to speech system using gruut and onnx". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
[5] Kim, Jaehyeon; Kim, Sungwon; Kong, Jungil; Yoon, Sungroh (2020-10-22). "Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search". Archived from the original on 2022-05-16. Retrieved 2022-05-18.
[6] "Larynx: SSML". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
[7] "eSpeak NG Text-to-Speech". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
[8] "eSpeak NG languages". github.com. 2022-05-18. Archived from the original on 2022-05-18. Retrieved 2022-05-18.
[9] "Introducing Cloud Text-to-Speech powered by DeepMind WaveNet technology". Google Cloud Blog. Archived from the original on 2022-05-16. Retrieved 2022-05-18.
[10] "IBM Cloud Text-to-speech languages". cloud.ibm.com. Archived from the original on 2021-04-15. Retrieved 2022-05-18.

