Automatic transcription is an FSF Priority Project. We need free software that is capable of transcribing recordings. YouTube is starting to offer this service, but this is a kind of computing we should be doing on our systems with free software.
If you are interested, please join this group (start by leaving your name here). We should start by surveying any existing free software that is in or close to this area, and making a list of features that are needed.
[UniversalSubtitles.org] is not automated, but provides a great user interface for creating transcriptions & subtitles for online video (and audio). Ideal for public content. It is a collaborative platform (one could call it a "wiki with a UI dedicated to subtitling").
CMU Sphinx might be usable for transcription. If I (cwebber) remember correctly, there was some effort to look into video transcription for PyCon using CMU Sphinx, but it didn't become production ready.
- Seems like mostly a "GUI problem": shouldn't it be possible to clone Universal Subtitles and run Sphinx on the audio track of whatever video is uploaded to create a first-pass transcription? After that, the main problem is creating language models (LMs) for new languages, but that's an upstream issue …
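The first-pass idea above can be sketched as follows: take whatever word timings a recognizer produces and group them into subtitle cues that a human can then correct in a subtitling UI. This is only a sketch. It assumes the recognizer output has already been reduced to (word, start, end) tuples in seconds; the function names, the grouping rule and its thresholds are made up for illustration, and no actual Sphinx call is shown.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7, max_gap=0.8):
    """Group (word, start, end) tuples into numbered SRT cues.

    A new cue starts when the current one already holds max_words words,
    or when the silence before the next word exceeds max_gap seconds.
    Both thresholds are arbitrary illustration values.
    """
    cues = []
    current = []
    for word in words:
        if current and (len(current) >= max_words
                        or word[1] - current[-1][2] > max_gap):
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = format_timestamp(cue[0][1])
        end = format_timestamp(cue[-1][2])
        text = " ".join(w[0] for w in cue)
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)
```

The resulting text could be loaded into a subtitle editor as a draft, so the human only fixes recognition errors instead of typing and timing everything from scratch.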
(Disclaimer: I am a developer of Simon) I think Simon can be a good, entirely FOSS starting point that provides an end-user interface for speech recognition systems like Julius and CMU Sphinx. However, every speech recognition system is only as good as its model, so I'd also like to point to Voxforge, a project trying to build GPL acoustic models.
long comment from dimid
What we need is automatic voice recognition.
For English we need a word-to-phonetic-symbols converter. Other languages like Japanese, German or Greek don't need that.
- Why do you say other languages don't need that? In Japanese, for example, the correspondence between the writing and the speech sounds is definitely not one-to-one, even when only dealing with the syllabary. For example, ん can represent [n], [m], [ŋ], [ɴ] or nasalization of the preceding vowel depending on context. I believe Greek also has relatively deep orthography due to retention of Ancient Greek spellings, even after recent reforms. This will be the case for every language to some extent. The rules for these things are systematic, but immensely complex. Furthermore, variation within one language is going to be a huge issue. If you have such a converter based on General American or on Received Pronunciation, a Glaswegian may as well be speaking Faroese. That's not to say that it's impossible, but this is a bigger, more complicated problem than it seems.
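To make the ん example concrete, here is a deliberately coarse sketch of a context-dependent rule. It assumes romanized input in which ん is written as a capital "N" (a convention invented for this example), and it only looks at the first letter of the following mora. The sound classes are a simplified textbook approximation, not a complete description of Japanese phonology.

```python
# Simplified sound classes, keyed on the first romaji letter of the next mora.
BILABIAL = set("pbm")        # ん is realized as [m] before these
ALVEOLAR = set("tdnrzcj")    # roughly [n]; ch/j are closer to [ɲ], folded in here
VELAR = set("kg")            # ん is realized as [ŋ] before these


def moraic_n_allophone(next_sound):
    """Return a rough realization of ん given the following sound.

    next_sound is the first romaji letter of the next mora, or "" when
    ん is the last sound of the utterance.
    """
    if next_sound == "":
        return "ɴ"
    if next_sound in BILABIAL:
        return "m"
    if next_sound in ALVEOLAR:
        return "n"
    if next_sound in VELAR:
        return "ŋ"
    # Before vowels, glides and fricatives such as s or h.
    return "nasalized vowel"


def moraic_n_allophones(romaji):
    """List the realization chosen for each 'N' (ん) in a romanized word."""
    result = []
    for i, ch in enumerate(romaji):
        if ch == "N":
            following = romaji[i + 1] if i + 1 < len(romaji) else ""
            result.append(moraic_n_allophone(following))
    return result
```

For instance, `moraic_n_allophones("shiNbuN")` picks [m] for the first ん and [ɴ] for the last. Even this single character needs a lookup on its context, which is the point of the reply above: a real converter needs many such rules per language, plus per-dialect variants.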
With Ecasound and LADSPA filters we preprocess the sound with scripts, using a noise gate, a compressor, filters and reverb.
With Aegisub we can create karaoke-timed transcripts of speeches with scripts, so we can use the karaoke timing as training data for automatic voice recognition.
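As a rough illustration of how karaoke timing could become training data, the sketch below turns one karaoke-tagged subtitle line into (text, start, end) segments. It assumes the ASS karaoke tags `\k`, `\K`, `\kf` and `\ko` with durations in centiseconds, and that the start time of the line is already known in seconds; the function name is hypothetical and other override tags are not handled.

```python
import re

# Matches a karaoke tag such as {\k50} followed by the text it times.
KARAOKE_TAG = re.compile(r"\{\\(?:kf|ko|k|K)(\d+)\}([^{]*)")


def karaoke_to_segments(line_start, text):
    """Split a karaoke-tagged line into (text, start, end) tuples in seconds.

    line_start is the start time of the subtitle line; each tag gives the
    duration of the syllable that follows it, in centiseconds.
    """
    segments = []
    cursor = line_start
    for centiseconds, syllable in KARAOKE_TAG.findall(text):
        duration = int(centiseconds) / 100.0
        segments.append((syllable.strip(), cursor, cursor + duration))
        cursor += duration
    return segments
```

Called as `karaoke_to_segments(12.0, r"{\k50}free {\k25}soft{\k75}ware")`, it yields three segments covering 12.0 to 13.5 seconds. Each segment pairs a piece of text with the stretch of audio where it is spoken, which is the kind of alignment an acoustic model trainer needs.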
Sphinx needs a GUI for easy configuration and training. We also need to split the lexicon and grammar data from the voice recognition engine and connect them through an interface. Every language works differently.
An alternative is PRAAT, where we can do voice analysis with small scripts, even with noise. But we need a pattern-matching algorithm for that, probably with R or Python and SciPy.
The problem with Sphinx is pitch recognition; with PRAAT we can do something like singing recognition (voice to musical notes), or recognize the same "A" from a kid or from someone who mimics voices.
Another problem is the vowels: an "A" from a Chinese speaker and an "A" in English aren't the same. If they hold an "A" for half a minute they sound the same, but when used in words they are different. Other letters like "P" and "T" are the same.
In general, automatic transcription works a little differently for each language.
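The "voice to musical notes" step can be sketched independently of the pitch tracker. The example below assumes a fundamental frequency in Hz has already been measured (for instance by a PRAAT script; no PRAAT call is shown) and maps it to the nearest equal-tempered note using the standard MIDI relation with A4 = 440 Hz. The function name is made up for illustration.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def frequency_to_note(freq_hz, a4_hz=440.0):
    """Return the nearest equal-tempered note name, e.g. 'A4', for a pitch."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    # MIDI note 69 is A4; each semitone is a factor of 2**(1/12).
    midi = int(round(69 + 12 * math.log2(freq_hz / a4_hz)))
    octave = midi // 12 - 1
    return f"{NOTE_NAMES[midi % 12]}{octave}"
```

A sung melody would then be a sequence of such note names, one per stable pitch segment. Deciding where one note ends and the next begins is the harder part and is not covered here.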
"Assisted" vs "Automatic" Transcription from YarDYar
Machine learning can handle 80-90% of the job, but a real person is always needed for the final proof. If YouTube is offering this service, they are almost certainly using Amazon Mechanical Turk, or some other source of slave labour.
- http://googleblog.blogspot.com/2009/11/automatic-captions-in-youtube.html doesn't exactly indicate slave labour. Considering the quality of the automatic CCs, I doubt they're using Turk (if so, they're not getting their penny's worth). (Note: a lot of the CCs _are_ post-edited, but that's done by the uploader herself).
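One way to put a number on claims like "80-90% of the job" is word error rate (WER): the word-level edit distance between a human reference transcript and the machine output, divided by the length of the reference. The sketch below is a plain dynamic-programming implementation; it splits on whitespace only and does no normalization of case or punctuation, which a real evaluation would need.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("we need free software", "we need tree software")` counts one substitution in four words. Measuring this on a few hand-corrected recordings would show how much proofreading a given engine actually leaves to the human.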
Nuance's Dragon NaturallySpeaking works pretty well after a training sample for each speaker, but the user is responsible for cleaning up leftover errors. Nuance marketeers might disagree, but I would not call their program "automatic."
- Not FOSS.
We can make a tool to make transcription easier for content generators - turning it into mostly a GUI problem. I know it's splitting hairs, but should this project be "transcription assistance" instead of "automatic transcription"?
- If so, why don't you just use Universal Subtitles? It's already AGPL, the GUI problem is solved. The automatic transcription problem is not.
- Or for that matter, Aegisub which has an excellent GUI... which is still not good enough or "smart" enough to help out the transcriber as much as would be possible with some AI. So, I wouldn't call the GUI problem "solved", far from it. Instead of (or at least in addition to) focusing our efforts on the much harder problem of automatic transcription, it would be more productive to try and apply machine learning to solve this relatively easier problem of helping out the human transcriber who can do the job much more reliably than software, and giving him/her the right tools to do it easily.
- If the GUI problem is 'solved', then isn't it a matter of adding it to a speech recognition engine that needs one, as suggested above?
--Hi. This is Rajul. I wish to work on this project. Please contact me at firstname.lastname@example.org
--Also interested in collaborating in this project, email me at email@example.com.
--Matthieu Vergne (firstname.lastname@example.org), also interested to participate and program (strong interest in automatic translation, transcription being part of it, and already learning it).