Speech recognition practitioners have long lacked a large open corpus of spoken Russian, so only large companies could afford to work on this task, and they were in no hurry to share their results.
We set out to fix this long-standing problem.
So, we present a dataset of 4,000 hours of annotated speech, collected from various online sources.
Details below the cut.
Here is the data for the current version, 0.3:
And here is a link to our corpus site.
Will we develop the project further?
Our work is not finished: we want to reach at least 10 thousand hours of annotated speech.
After that, we plan to build open and commercial speech recognition models on this dataset. We invite you to join in: help us improve the data, and use it in your own tasks.
Why is our goal 10 thousand hours?
There are various studies of how neural networks generalize in speech recognition, and it is known that good generalization is not achieved on datasets of less than 1,000 hours. A figure on the order of 10 thousand hours is considered acceptable in most cases; beyond that, it depends on the specific task.
What else can be done to improve the quality of recognition if there is still not enough data?
Often you can adapt the neural network to your speakers, for example by fine-tuning it on recordings of their speech.
You can also adapt the neural network to the vocabulary of your subject area (the language model).
How did we make this dataset?
- Found YouTube channels with high-quality subtitles, downloaded the audio and subtitles
- Ran the audio through other speech recognition systems
- Had robotic voices read out addresses
- Found audiobooks and the corresponding book texts on the Internet, split the audio at pauses, and matched the resulting chunks to the text (the so-called “alignment” task)
- Added the small Russian datasets available on the Internet.
- After that, the files were converted to a single format (16-bit wav, 16 kHz, mono, with a hierarchical file layout on disk).
- Metadata was saved in a separate file, manifest.csv.
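The conversion step above can be sketched in Python. This is not the authors' actual pipeline, just a minimal illustration of bringing audio to 16-bit, 16 kHz, mono using scipy (which the article already uses elsewhere); the function name is hypothetical, and it assumes int16-range input samples:

```python
from math import gcd

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

TARGET_RATE = 16000  # the corpus standard: 16 kHz


def to_dataset_format(in_path, out_path):
    """Sketch: convert a wav file to 16 kHz, 16-bit, mono.

    Assumes the source samples are already in int16 range.
    """
    rate, sound = wavfile.read(in_path)
    sound = sound.astype(np.float32)
    if sound.ndim == 2:
        # stereo (or multi-channel) -> mono by averaging channels
        sound = sound.mean(axis=1)
    if rate != TARGET_RATE:
        # polyphase resampling to the target rate
        g = gcd(TARGET_RATE, rate)
        sound = resample_poly(sound, TARGET_RATE // g, rate // g)
    wavfile.write(out_path, TARGET_RATE, sound.astype(np.int16))
```

A real pipeline would also need to handle other container formats (mp3, ogg) and float-encoded wavs, which is commonly done with ffmpeg instead.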
How to use it:
The location of files is determined by their hashes, like this:
import hashlib
from pathlib import Path

target_format = 'wav'

# `wav` is a numpy array of samples; its raw bytes determine the storage path
wavb = wav.tobytes()
f_hash = hashlib.sha1(wavb).hexdigest()
store_path = Path(root_folder,
                  f_hash[1:3],
                  f_hash[3:15] + '.' + target_format)
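The hashing scheme above can be wrapped in a small helper to see its main property: the path depends only on the audio content, so identical recordings always land in the same place (which also deduplicates them). The helper name here is ours, not part of the corpus tooling:

```python
import hashlib
from pathlib import Path

import numpy as np


def store_path_for(wav, root_folder='data', target_format='wav'):
    """Map raw samples to their on-disk location via SHA-1, as in the scheme above."""
    f_hash = hashlib.sha1(wav.tobytes()).hexdigest()
    return Path(root_folder, f_hash[1:3], f_hash[3:15] + '.' + target_format)


a = np.zeros(16000, dtype=np.int16)
b = np.zeros(16000, dtype=np.int16)
# identical audio content -> identical storage path
same = store_path_for(a) == store_path_for(b)
```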
from pathlib import Path

from scipy.io import wavfile

from utils.open_stt_utils import read_manifest

manifest_df = read_manifest('path/to/manifest.csv')
for info in manifest_df.itertuples():
    sample_rate, sound = wavfile.read(info.wav_path)
    text = Path(info.text_path).read_text()
    duration = info.duration
The manifest files contain triples: the name of the audio file, the name of the file with the text transcript, and the phrase duration in seconds.
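The `read_manifest` helper ships with the corpus utilities, which are not shown here. Given the triples just described, a minimal sketch of what it might do (assuming a headerless CSV, with pandas, which the DataFrame usage above implies) could look like:

```python
import pandas as pd


def read_manifest(manifest_path):
    """Sketch of read_manifest: load the (wav path, text path, duration)
    triples from a headerless manifest.csv into a DataFrame."""
    return pd.read_csv(manifest_path,
                       names=['wav_path', 'text_path', 'duration'])
```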
To keep only files of a certain duration:
from utils.open_stt_utils import plain_merge_manifests

train_manifests = [
    # paths to the manifest files you want to merge (placeholders)
    'path/to/manifest1.csv',
    'path/to/manifest2.csv',
]
train_manifest = plain_merge_manifests(train_manifests,
                                       MIN_DURATION=0.1,
                                       MAX_DURATION=100)
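The actual `plain_merge_manifests` lives in the corpus utilities; judging by the call above, it merges several manifests and drops utterances outside the duration bounds. A hypothetical sketch of that behavior, under the same headerless-CSV assumption as before:

```python
import pandas as pd


def plain_merge_manifests(manifest_paths, MIN_DURATION, MAX_DURATION):
    """Sketch: concatenate several manifests and keep only utterances whose
    duration (in seconds) falls inside [MIN_DURATION, MAX_DURATION]."""
    frames = [pd.read_csv(p, names=['wav_path', 'text_path', 'duration'])
              for p in manifest_paths]
    merged = pd.concat(frames, ignore_index=True)
    mask = merged.duration.between(MIN_DURATION, MAX_DURATION)
    return merged[mask].reset_index(drop=True)
```

Filtering out very short clips removes near-empty fragments, and capping the maximum keeps batch padding (and memory use) bounded during training.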
What can you read or watch in Russian to get better acquainted with speech recognition?
Recently, as part of the Deep Learning on Fingers course, we recorded a lecture on speech recognition (and a little on synthesis). Perhaps it will be useful to you!
- We release the dataset under a dual license: cc-by-nc 4.0 for non-commercial purposes; for commercial purposes, use by agreement with us.
- As usual in such cases, all rights to the data included in the dataset remain with their owners. Our rights apply to the dataset itself. Separate rules apply for scientific and educational purposes; see the legislation of your country.
Once again, the project site, for those who did not see the link above.