Group: AutomaticTranscription
m (Fix link for Universal Subtitles / Amara and clean up old page) |
|||
(3 intermediate revisions by the same user not shown) | |||
Line 3: | Line 3: | ||
If you are interested, please join this group (start by leaving your name here). We should start by surveying any existing free software that is in or close to this area, and making a list of features that are needed. | If you are interested, please join this group (start by leaving your name here). We should start by surveying any existing free software that is in or close to this area, and making a list of features that are needed. | ||
− | + | Automatic Transcription of recorded media is mostly solved by Vosk's test_srt.py script. We still need to figure out how to provide live transcription services using free software. This should be possible with Vosk as well. | |
− | + | == Vosk == | |
− | |||
− | + | [https://alphacephei.com/vosk/ Website with documentation] - [https://github.com/alphacep/vosk-space Website GitHub] | |
− | [ | + | [https://github.com/alphacep/vosk-api Vosk API GitHub] - [https://directory.fsf.org/wiki/Vosk-API FSD] |
− | == | + | [https://directory.fsf.org/wiki/Vosk_Server Vosk Server GitHub] - [https://directory.fsf.org/wiki/Vosk_Server FSD] |
+ | |||
+ | [https://forum.members.fsf.org/t/speech-to-text-stt-for-live-transcription/2934 FSF Member Forum discussion] | ||
+ | |||
+ | Vosk is an offline Speech-to-Text (STT) framework that can transcribe audio into text much faster than real-time even on an X200. | ||
+ | |||
+ | === Vosk Setup === | ||
+ | |||
+ | These setup steps need to be done once unless you are updating. | ||
+ | |||
+ | The easiest way to install Vosk is with pip. Install pip first. | ||
+ | |||
+ | <code>pip3 install vosk</code> | ||
+ | |||
+ | Download the scripts from the project. | ||
+ | |||
+ | <code>git clone https://github.com/alphacep/vosk-api</code> | ||
+ | |||
+ | Change to the python example directory. | ||
+ | |||
+ | <code>cd vosk-api/python/example</code> | ||
+ | |||
+ | Download a model. | ||
+ | |||
+ | The full en US model should be used if you have the RAM to support it. As a rule of thumb, you need 2x the size of the model of free RAM available. | ||
+ | |||
+ | <nowiki>wget https://alphacephei.com/vosk/models/vosk-model-en-us-0.22.zip | ||
+ | unzip vosk-model-en-us-0.22.zip | ||
+ | mv vosk-model-en-us-0.22 model</nowiki> | ||
+ | |||
+ | Alternatively, if you have a low amount of free RAM use the less accurate small model. | ||
+ | |||
+ | <nowiki>wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.15.zip | ||
+ | unzip vosk-model-small-en-us-0.15.zip | ||
+ | mv vosk-model-small-en-us-0.15 model</nowiki> | ||
+ | |||
+ | You are now ready to use Vosk. | ||
+ | |||
+ | === Vosk Subtitle generation === | ||
+ | |||
+ | This command will produce a subtitle file in the `srt` format. | ||
+ | |||
+ | Complete the setup steps above and navigate to the `vosk-api/python/example` directory. | ||
+ | |||
+ | <code>python3 test_srt.py test.webm > test.srt</code> | ||
+ | |||
+ | The output is not perfect, but readable for the most part. Human verification is then necessary for cleaning up the result. | ||
+ | |||
+ | Note: [https://github.com/KDE/kdenlive/search?q=vosk Kdenlive has Vosk built in now for this purpose.] | ||
+ | |||
+ | == CMU Sphinx == | ||
+ | |||
+ | [http://cmusphinx.sourceforge.net CMU Sphinx] might be usable for transcription. There was some effort to look into video transcription for PyCon using CMU Sphinx. | ||
+ | |||
+ | == Simon == | ||
+ | |||
+ | [http://simon.kde.org Simon] could be a good, entirely FOSS starting point that provides an end-user interface for speech recognition systems like Julius and CMU SPHINX. Every speech recognition system is only as good as it's model so I'd also like to point to [http://voxforge.org Voxforge] - a project trying to build GPL acoustic models. | ||
+ | |||
+ | == Phonetic symbols Discussion == | ||
What we needed is Automatic Voice Recognition.<br> | What we needed is Automatic Voice Recognition.<br> | ||
− | + | We need word to phonetic symbols converter. | |
− | + | ||
+ | In Japanese, for example, the correspondence between the writing and the speech sounds is definitely not one-to-one, even when only dealing with the syllabary. For example, ん can represent [n], [m], [ŋ], [ɴ] or nasalization of the preceding vowel depending on context. I believe Greek also has relatively deep orthography due to retention of Ancient Greek spellings, even after recent reforms. This will be the case for every language to some extent. The rules for these things are systematic, but immensely complex. Furthermore, variation within one language is going to a huge issue. If you have such a converter based on General American or on Received Pronunciation, a Glaswegian may as well be speaking Faroese. That's not to say that it's impossible, but this is a bigger, more complicated problem than it seems. | ||
With [http://www.eca.cx/ecasound/ Ecasound] and LADSPA filters we do preprocessing of the sound with scripts, we use the filters: noise gate, compressor, filters and reverb.<br> | With [http://www.eca.cx/ecasound/ Ecasound] and LADSPA filters we do preprocessing of the sound with scripts, we use the filters: noise gate, compressor, filters and reverb.<br> | ||
With [http://www.aegisub.org/ Aegisub] we can create karaoke of speeches with scripts, so we can use karaoke for the training of Automatic Voice Recognition. | With [http://www.aegisub.org/ Aegisub] we can create karaoke of speeches with scripts, so we can use karaoke for the training of Automatic Voice Recognition. | ||
− | Sphinx | + | Sphinx needs a GUI for easy configuration and train. Plus we need to split lexicon and grammar data from the voice recognition and make them interface. Every language works different.<br> |
Alternative is [http://www.fon.hum.uva.nl/praat/ PRAAT] where we can do voice analysis with small scripts, even with noise. But we need a pattern match algorithm for that, probably with [http://www.r-project.org/ R] or [http://www.scipy.org/ Python-scipy].<br> | Alternative is [http://www.fon.hum.uva.nl/praat/ PRAAT] where we can do voice analysis with small scripts, even with noise. But we need a pattern match algorithm for that, probably with [http://www.r-project.org/ R] or [http://www.scipy.org/ Python-scipy].<br> | ||
The problem with Sphinx is the pitch recognition, with PRAAT we can do something like sing recognition(voice to musical notes) or recognize the same "A" from a kid or someone who mimics voices.<br> | The problem with Sphinx is the pitch recognition, with PRAAT we can do something like sing recognition(voice to musical notes) or recognize the same "A" from a kid or someone who mimics voices.<br> | ||
− | Other problems are the vowels, "A" from a | + | Other problems are the vowels, "A" from a Chinese and "A" in English aren't the same. If the say half minute "A" they sound the same, but when they used in words they are different. Other letters like "P","T" are the same. |
General automatic transcription works little different for each language. | General automatic transcription works little different for each language. | ||
− | == "Assisted" vs "Automatic" Transcription | + | == "Assisted" vs "Automatic" Transcription == |
− | Machine Learning can handle | + | Machine Learning can handle 95% of the job, but a real person is always needed for the final proof. |
− | |||
− | |||
− | |||
− | |||
− | |||
We can make a tool to make transcription '''easier''' for content generators - turning it into mostly a GUI problem. I know it's splitting hairs, but should this project be "transcription assistance" instead of "automatic transcription"? | We can make a tool to make transcription '''easier''' for content generators - turning it into mostly a GUI problem. I know it's splitting hairs, but should this project be "transcription assistance" instead of "automatic transcription"? | ||
− | : If so, why don't you just use [https:// | + | : If so, why don't you just use [https://github.com/appsembler/unisubs Universal Subtitles or Amara]? It's already AGPL, the GUI problem is solved. The automatic transcription problem is not. |
− | :: Or for that matter, [http://www.aegisub.org/ Aegisub] which has an excellent GUI... which is still not good enough or "smart" enough to help out the transcriber as much as would be possible with some AI. So, I wouldn't call the GUI problem "solved", far from it. Instead of (or at least in addition to) focusing our efforts on the much harder problem of automatic transcription, it would be more productive to try and apply machine learning to solve this relatively easier problem of helping out the human transcriber who can do the job much more reliably than software, and giving him/her the right tools to do it easily. | + | :: Or for that matter, [http://www.aegisub.org/ Aegisub] which has an excellent GUI... which is still not good enough or "smart" enough to help out the transcriber as much as would be possible with some AI. So, I wouldn't call the GUI problem "solved", far from it. Instead of (or at least in addition to) focusing our efforts on the much harder problem of automatic transcription, it would be more productive to try and apply machine learning to solve this relatively easier problem of helping out the human transcriber who can do the job much more reliably than software, and giving him/her the right tools to do it easily. |
− | |||
− | |||
[[is entity::group| ]] | [[is entity::group| ]] |
Latest revision as of 15:38, 2 May 2022
Automatic transcription is an FSF Priority Project. We need free software that is capable of transcribing recordings. YouTube is starting to offer this service, but this is a kind of computing we should be doing on our systems with free software.
If you are interested, please join this group (start by leaving your name here). We should start by surveying any existing free software that is in or close to this area, and making a list of features that are needed.
Automatic Transcription of recorded media is mostly solved by Vosk's test_srt.py script. We still need to figure out how to provide live transcription services using free software. This should be possible with Vosk as well.
Vosk
Website with documentation - Website GitHub
Vosk is an offline Speech-to-Text (STT) framework that can transcribe audio into text much faster than real-time even on an X200.
Vosk Setup
These setup steps only need to be done once, unless you are updating.
The easiest way to install Vosk is with pip. If pip is not already installed, install it first.
pip3 install vosk
Download the scripts from the project.
git clone https://github.com/alphacep/vosk-api
Change to the python example directory.
cd vosk-api/python/example
Download a model.
The full en-US model should be used if you have the RAM to support it. As a rule of thumb, you need free RAM equal to about twice the size of the model.
wget https://alphacephei.com/vosk/models/vosk-model-en-us-0.22.zip
unzip vosk-model-en-us-0.22.zip
mv vosk-model-en-us-0.22 model
Alternatively, if you have little free RAM, use the less accurate small model.
wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip
mv vosk-model-small-en-us-0.15 model
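The rule of thumb above can be written as a small check. This is only a sketch: `pick_model` and its parameters are hypothetical names, and the byte counts in the usage line are placeholders, not measured model sizes.

```python
def pick_model(free_ram_bytes: int, full_model_bytes: int, factor: float = 2.0) -> str:
    """Return "full" if free RAM covers `factor` times the model size, else "small"."""
    if free_ram_bytes < 0 or full_model_bytes <= 0:
        raise ValueError("free RAM must be >= 0 and model size must be > 0")
    needed = factor * full_model_bytes
    return "full" if free_ram_bytes >= needed else "small"


# Placeholder numbers: a 3 GiB model on a machine with 4 GiB of free RAM.
GIB = 1024 ** 3
print(pick_model(free_ram_bytes=4 * GIB, full_model_bytes=3 * GIB))  # small
```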
You are now ready to use Vosk.
Vosk Subtitle generation
The command below produces a subtitle file in the `srt` format, where `test.webm` is the media file to transcribe.
First complete the setup steps above and change to the `vosk-api/python/example` directory.
python3 test_srt.py test.webm > test.srt
The output is not perfect, but it is readable for the most part. A human still needs to review and clean up the result.
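To make the cleanup step easier to reason about, here is a rough sketch of how timed words can be turned into `srt` cues. It assumes the recognizer returns a list of words with `word`, `start`, and `end` fields (the shape Vosk produces when word output is enabled). The function names and the seven-words-per-cue grouping are illustrative choices and not necessarily what `test_srt.py` does.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in srt files."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, words_per_cue=7):
    """Group timed words into numbered srt cues separated by blank lines."""
    cues = []
    for i in range(0, len(words), words_per_cue):
        group = words[i:i + words_per_cue]
        start = format_timestamp(group[0]["start"])
        end = format_timestamp(group[-1]["end"])
        text = " ".join(w["word"] for w in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Hypothetical recognizer output for a two-word utterance.
sample = [
    {"word": "hello", "start": 0.0, "end": 0.5},
    {"word": "world", "start": 0.6, "end": 1.2},
]
print(words_to_srt(sample))
```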
Note: Kdenlive has Vosk built in now for this purpose.
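For the live transcription case raised in the introduction, a streaming loop might look roughly like the sketch below. It relies only on the `AcceptWaveform`, `Result`, `PartialResult`, and `FinalResult` methods of Vosk's `KaldiRecognizer`. The names `transcribe_stream` and `FakeRecognizer` are hypothetical and exist only so the loop can be tried without a model or a microphone.

```python
import json


def transcribe_stream(chunks, recognizer):
    """Yield ("partial" | "final", text) pairs as audio chunks arrive.

    `recognizer` is expected to behave like vosk.KaldiRecognizer.
    """
    for chunk in chunks:
        if recognizer.AcceptWaveform(chunk):
            # An utterance ended: Result() holds its final text.
            text = json.loads(recognizer.Result()).get("text", "")
            if text:
                yield ("final", text)
        else:
            # Still inside an utterance: report the running hypothesis.
            text = json.loads(recognizer.PartialResult()).get("partial", "")
            if text:
                yield ("partial", text)
    # Flush whatever is left once the stream ends.
    text = json.loads(recognizer.FinalResult()).get("text", "")
    if text:
        yield ("final", text)


class FakeRecognizer:
    """Stand-in for KaldiRecognizer: treats each chunk as one word (illustration only)."""

    def __init__(self):
        self.words = []

    def AcceptWaveform(self, chunk):
        if chunk == b"<pause>":
            return True
        self.words.append(chunk.decode())
        return False

    def PartialResult(self):
        return json.dumps({"partial": " ".join(self.words)})

    def Result(self):
        text, self.words = " ".join(self.words), []
        return json.dumps({"text": text})

    FinalResult = Result


demo_chunks = [b"free", b"software", b"<pause>", b"now"]
for kind, text in transcribe_stream(demo_chunks, FakeRecognizer()):
    print(kind, text)
```

With the real library, the recognizer would presumably be created as `KaldiRecognizer(Model("model"), 16000)` and fed 16-bit mono PCM chunks from a microphone or network stream. How the audio is captured is left open here.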
CMU Sphinx
CMU Sphinx might be usable for transcription. There was some effort to look into video transcription for PyCon using CMU Sphinx.
Simon
Simon could be a good, entirely FOSS starting point that provides an end-user interface for speech recognition systems like Julius and CMU Sphinx. Every speech recognition system is only as good as its model, so I'd also like to point to Voxforge - a project trying to build GPL acoustic models.
Phonetic symbols Discussion
What we need is automatic voice recognition.
We need a word-to-phonetic-symbol converter.
In Japanese, for example, the correspondence between the writing and the speech sounds is definitely not one-to-one, even when only dealing with the syllabary. For example, ん can represent [n], [m], [ŋ], [ɴ] or nasalization of the preceding vowel depending on context. I believe Greek also has relatively deep orthography due to retention of Ancient Greek spellings, even after recent reforms. This will be the case for every language to some extent. The rules for these things are systematic, but immensely complex. Furthermore, variation within one language is going to be a huge issue. If you have such a converter based on General American or on Received Pronunciation, a Glaswegian may as well be speaking Faroese. That's not to say that it's impossible, but this is a bigger, more complicated problem than it seems.
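The ん example can be made concrete with a deliberately incomplete rule table. This is a sketch under simplifying assumptions: the input is the romanized sound that follows ん, only a few consonant classes are handled, and `realize_moraic_nasal` is a hypothetical name. A real converter would need far more context than this.

```python
# Very simplified allophone rules for ん, keyed by the first romanized
# letter of the following sound ("" means the end of the utterance).
BILABIALS = {"p", "b", "m"}
ALVEOLARS = {"t", "d", "n"}
VELARS = {"k", "g"}


def realize_moraic_nasal(next_sound: str) -> str:
    """Return a label for how ん is realized before `next_sound`."""
    if next_sound == "":
        return "ɴ"  # utterance-final, as in ほん (hon)
    first = next_sound[0]
    if first in BILABIALS:
        return "m"  # さんぽ (sanpo)
    if first in ALVEOLARS:
        return "n"  # ほんとう (hontou)
    if first in VELARS:
        return "ŋ"  # まんが (manga)
    # Everything else (vowels, glides, fricatives) is lumped together here.
    return "nasalized vowel"  # れんあい (ren'ai)


for following in ["po", "tou", "ga", "", "ai"]:
    print(repr(following), "->", realize_moraic_nasal(following))
```

Even this toy table needs to know what comes next, which is the point made above: the spelling alone does not determine the sound.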
With Ecasound and LADSPA plugins we preprocess the sound with scripts, using these effects: noise gate, compressor, filters and reverb.
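As an illustration of what the noise gate step does, here is a minimal hard gate over a list of samples. It is not how Ecasound or any LADSPA plugin implements gating (real gates track a signal envelope with attack and release times). The function name and the threshold value are assumptions for the example.

```python
def noise_gate(samples, threshold):
    """Silence every sample whose absolute amplitude is below `threshold`."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


# Quiet background noise (0.01, 0.02) is zeroed, louder speech samples pass through.
print(noise_gate([0.01, -0.5, 0.02, 0.8], threshold=0.1))
```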
With Aegisub we can create karaoke-style timed transcripts of speeches with scripts, so we can use them for training automatic voice recognition.
Sphinx needs a GUI for easy configuration and training. We also need to separate the lexicon and grammar data from the voice recognition engine and define an interface between them, because every language works differently.
An alternative is PRAAT, where we can do voice analysis with small scripts, even on noisy audio. But we need a pattern-matching algorithm for that, probably written in R or Python with SciPy.
The problem with Sphinx is pitch recognition. With PRAAT we can do something like singing recognition (voice to musical notes), or recognize the same "A" whether it comes from a child or from someone who mimics voices.
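The "voice to musical notes" idea has a simple second half once a fundamental frequency has been measured: mapping it to the nearest equal-tempered note. The sketch below covers only that mapping and assumes the pitch detection itself (with PRAAT or anything else) has already been done. `frequency_to_note` is a hypothetical helper, and A4 = 440 Hz is the usual tuning convention.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def frequency_to_note(freq_hz: float, a4_hz: float = 440.0) -> str:
    """Return the nearest equal-tempered note name, e.g. "A4" for 440 Hz."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    # MIDI note numbers: A4 is 69, and each semitone is a factor of 2**(1/12).
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    octave = midi // 12 - 1
    return f"{NOTE_NAMES[midi % 12]}{octave}"


for f in [220.0, 261.63, 440.0, 880.0]:
    print(f, "Hz ->", frequency_to_note(f))
```

Because the mapping depends only on frequency, the same sung note gets the same label whoever sings it. Recognizing the same vowel across speakers is a different problem that this sketch does not address.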
Vowels are another problem: the "A" of a Chinese speaker and the "A" of an English speaker aren't the same. If each holds an "A" for half a minute they sound the same, but when used in words they are different. Other sounds, like "P" and "T", are the same.
In general, automatic transcription works a little differently for each language.
"Assisted" vs "Automatic" Transcription
Machine learning can handle roughly 95% of the job, but a real person is always needed for the final proofreading.
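One way software could assist the human proofreader is by pointing at the words the recognizer was least sure about. The sketch below assumes each word comes with a `conf` score between 0 and 1, as in Vosk's per-word output. `flag_for_review`, the 0.8 cutoff, and the `[word?]` marking are arbitrary choices for illustration.

```python
def flag_for_review(words, min_conf=0.8):
    """Return the transcript with low-confidence words marked as [word?]."""
    marked = []
    for w in words:
        token = w["word"]
        # Words without a score are treated as uncertain.
        if w.get("conf", 0.0) < min_conf:
            token = f"[{token}?]"
        marked.append(token)
    return " ".join(marked)


# Hypothetical recognizer output with one doubtful word.
recognized = [
    {"word": "free", "conf": 0.99},
    {"word": "softwear", "conf": 0.41},
    {"word": "matters", "conf": 0.90},
]
print(flag_for_review(recognized))
```

The proofreader then only has to listen closely around the marked words instead of re-checking the whole recording.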
We can make a tool to make transcription easier for content generators - turning it into mostly a GUI problem. I know it's splitting hairs, but should this project be "transcription assistance" instead of "automatic transcription"?
- If so, why don't you just use Universal Subtitles or Amara? It's already AGPL, the GUI problem is solved. The automatic transcription problem is not.
- Or for that matter, Aegisub which has an excellent GUI... which is still not good enough or "smart" enough to help out the transcriber as much as would be possible with some AI. So, I wouldn't call the GUI problem "solved", far from it. Instead of (or at least in addition to) focusing our efforts on the much harder problem of automatic transcription, it would be more productive to try and apply machine learning to solve this relatively easier problem of helping out the human transcriber who can do the job much more reliably than software, and giving him/her the right tools to do it easily.
--Hey! My name is Adam. I'm excited to help work on this transcription project. I would be best helping with programming (this is my main practice), but would love to help in any way possible (such as researching existing free/non-free transcription projects). Please send any additional details regarding the status of this project to my email (abbergie@gmail.com).
--Hi, I'm Ulises, and I would like to contribute to this project. Please contact me at ulises3.14@gmail.com
--Hi. This is Rajul. I wish to work on this project. Please contact me at rajul.iitkgp@gmail.com
--Also interested in collaborating in this project, email me at enricgarcia@uoc.edu.
--Matthieu Vergne (matthieu.vergne@gmail.com), also interested to participate and program (strong interest in automatic translation, transcription being part of it, and already learning it).
-Hi, my name is Addison and I am interested in working on this project. I have no direct experience with automatic transcription, but I'm interested in learning about natural language processing and I'm familiar with basic machine learning techniques. Please contact me at: addison.mink@gmail.com
-Hi, this is beyondpy. I am quite interested in this project. I am familiar with common machine learning tools, Bayesian statistics, and so on, but I lack experience with NLP. Please contact me at zusongpeng@gmail.com. Thanks a lot.
- Hello, I am Raman Gupta. I am interested in contributing to this project. Please contact me at ramanatnsit@yahoo.com