A huge open dataset of Russian speech

Speech recognition specialists have long lacked a large open corpus of spoken Russian, so only large companies could afford to work on this problem, and they were in no hurry to share their results.

We, on the other hand, are in a hurry to correct this years-long oversight.

So, we present a dataset of 4,000 hours of annotated speech collected from various online sources.

Details below the cut.

Here are the figures for the current version, 0.3:
Data Type                        Annotation                  Quality           Phrases   Hours   GB
Books                            alignment                   95% / clean       1.1M      1,511   166
Calls                            ASR                         70% / noisy       837K      812     89
Generated (Russian Addresses)    TTS                         100% / 4 votes    1.7M      754     81
Speech from YouTube videos       subtitles                   95% / noisy       786K      724     78
Books                            ASR                         70% / noisy       124K      116     13
Other datasets                   recitation and alignment    99% / clean       17K       43      5
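
As a quick sanity check on the headline figure, the per-source hours and sizes from the table can be summed. This is only arithmetic on the numbers listed above; the variable names are ours.

```python
# (data type, annotation, hours, size in GB), copied from the table above
rows = [
    ("Books", "alignment", 1511, 166),
    ("Calls", "ASR", 812, 89),
    ("Generated (Russian Addresses)", "TTS", 754, 81),
    ("Speech from YouTube videos", "subtitles", 724, 78),
    ("Books", "ASR", 116, 13),
    ("Other datasets", "recitation and alignment", 43, 5),
]

total_hours = sum(hours for _, _, hours, _ in rows)
total_gb = sum(gb for _, _, _, gb in rows)
print(f"{total_hours} hours, {total_gb} GB")
```

The rows add up to 3,960 hours and 432 GB, which is where the rounded figure of 4,000 hours comes from.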

And here is the link to our corpus site .

Will we develop the project further?

Our work on this is not finished: we want to collect at least 10 thousand hours of annotated speech.

After that, we plan to build open and commercial speech recognition models using this dataset. We invite you to join us: help us improve the data and use it in your own tasks.

Why is our goal 10 thousand hours?

There are various studies of how neural networks generalize in speech recognition, and the common finding is that good generalization is not achieved on datasets smaller than about 1,000 hours. A figure on the order of 10 thousand hours is considered acceptable in most cases; beyond that, it depends on the specific task.

What else can be done to improve recognition quality if there is still not enough data?

Often, you can adapt the neural network to your speakers by having those speakers record readings of texts.
You can also adapt the neural network to the vocabulary of your subject area (through a language model).

How did we make this dataset?

  • Found channels with high-quality subtitles on YouTube and downloaded the audio and subtitles.
  • Sent audio to other speech recognition systems for transcription.
  • Synthesized spoken addresses with robo-voices (TTS).
  • Found audiobooks and the texts of the books on the Internet, split them into fragments at pauses, and matched the audio against the text (the so-called "alignment" task).
  • Added small Russian datasets available on the Internet.
  • After that, converted the files to a single format (16-bit wav, 16 kHz, mono, hierarchical layout of files on disk).
  • Saved the metadata in a separate file, manifest.csv.
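
If you add your own recordings to the corpus, it can help to check that they match the single format named above (16-bit, 16 kHz, mono). The following is a minimal sketch using only Python's standard `wave` module; the function names `matches_corpus_format` and `write_silence` are hypothetical helpers of ours, not part of the dataset's utilities.

```python
import wave


def matches_corpus_format(path):
    """Return True if the WAV file is 16-bit, 16 kHz, mono."""
    with wave.open(str(path), "rb") as wf:
        return (wf.getsampwidth() == 2          # 2 bytes per sample = 16-bit
                and wf.getframerate() == 16000  # 16 kHz sample rate
                and wf.getnchannels() == 1)     # mono


def write_silence(path, seconds=1, rate=16000):
    """Write a silent 16-bit mono WAV file (demo input only)."""
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(rate * seconds))
```

For example, a file written with `write_silence("demo.wav")` passes the check, while one written with `rate=8000` does not.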

How to use it:

File DB

The location of files is determined by their hashes, like this:

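  import hashlib
  from pathlib import Path

  # wav: array of audio samples; root_folder: root directory of the dataset
  target_format = 'wav'
  wavb = wav.tobytes()
  f_hash = hashlib.sha1(wavb).hexdigest()
  # the hash is split into two nested folder names and a file name
  store_path = Path(root_folder,
                    f_hash[0],
                    f_hash[1:3],
                    f_hash[3:15] + '.' + target_format)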
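
To make the layout concrete, here is a self-contained sketch of the same scheme wrapped in a function. The name `store_path_for` and the `"data"` root folder are our own illustrative choices, and a standard-library `array` of zeros stands in for real decoded audio; anything exposing `.tobytes()` would work the same way.

```python
import hashlib
from array import array
from pathlib import Path


def store_path_for(samples, root_folder, target_format="wav"):
    """Build the hash-based path for any object exposing .tobytes()."""
    f_hash = hashlib.sha1(samples.tobytes()).hexdigest()
    return Path(root_folder,
                f_hash[0],      # first hex digit: top-level folder
                f_hash[1:3],    # next two digits: second-level folder
                f_hash[3:15] + "." + target_format)  # next 12 digits: file name


# Stand-in for decoded audio: one second of 16-bit silence at 16 kHz.
samples = array("h", [0] * 16000)
path = store_path_for(samples, "data")
print(path)
```

Because the path is derived from the audio bytes alone, identical audio always maps to the same location, and the first hex digit spreads files across 16 top-level folders.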

Read Files

  from utils.open_stt_utils import read_manifest
  from scipy.io import wavfile
  from pathlib import Path

  manifest_df = read_manifest('path/to/manifest.csv')

  for info in manifest_df.itertuples():
      # wavfile.read returns the sample rate and the array of samples
      sample_rate, sound = wavfile.read(info.wav_path)
      text = Path(info.text_path).read_text()
      duration = info.duration

The manifest files contain triples: the name of the audio file, the name of the file with the text description, and the phrase duration in seconds.
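
If you would rather not depend on the project's utilities, the triples can be parsed with the standard library alone. This is a sketch under a loud assumption: that the manifest is a headerless, comma-separated file with the columns in exactly the order described above. Check this against the actual file; `ManifestRow` and `parse_manifest` are hypothetical names of ours, and the demo rows are made up.

```python
import csv
import io
from typing import NamedTuple


class ManifestRow(NamedTuple):
    wav_path: str
    text_path: str
    duration: float  # phrase duration in seconds


def parse_manifest(lines):
    """Parse manifest lines, assuming headerless 'wav,text,duration' rows."""
    rows = []
    for record in csv.reader(lines):
        if not record:  # skip blank lines
            continue
        wav_path, text_path, duration = record
        rows.append(ManifestRow(wav_path, text_path, float(duration)))
    return rows


# Made-up demo content; a real manifest would be opened with open(...).
demo = io.StringIO("a/bc/def.wav,a/bc/def.txt,2.5\n\n0/12/345.wav,0/12/345.txt,7.25\n")
rows = parse_manifest(demo)
print(rows[0].wav_path, rows[0].duration)
```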

Filter only files of a certain length

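  from utils.open_stt_utils import (plain_merge_manifests,
                                    check_files,
                                    save_manifest)

  # `...` marks the manifest paths and the remaining arguments (not shown)
  train_manifests = [...]
  train_manifest = plain_merge_manifests(train_manifests, ...)
  check_files(train_manifest)
  save_manifest(train_manifest, ...)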
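
The idea behind duration filtering can be sketched in plain Python. This is not the implementation of `plain_merge_manifests`; it is an illustration that assumes each manifest entry is a (wav path, text path, duration in seconds) triple, and the function name, thresholds, and demo rows are all hypothetical.

```python
def filter_by_duration(rows, min_duration, max_duration):
    """Keep only entries whose duration lies within [min_duration, max_duration]."""
    return [row for row in rows if min_duration <= row[2] <= max_duration]


# Made-up demo entries: (wav path, text path, duration in seconds)
demo_rows = [
    ("a.wav", "a.txt", 0.05),   # too short
    ("b.wav", "b.txt", 3.2),
    ("c.wav", "c.txt", 12.0),
    ("d.wav", "d.txt", 140.0),  # too long
]

kept = filter_by_duration(demo_rows, min_duration=0.1, max_duration=100)
print([row[0] for row in kept])
```

Dropping very short and very long phrases is a common way to keep training batches uniform, but the right thresholds depend on your model and task.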

What can you read or watch in Russian to get better acquainted with the speech recognition task?

Recently, as part of the "Deep Learning on Fingers" course , we recorded a lecture on speech recognition (and a little on synthesis). You may find it useful!

License Issues

  • We publish the dataset under a dual license: for non-commercial purposes we offer the CC BY-NC 4.0 license; commercial use requires an agreement with us.
  • As usual in such cases, all rights to the data included in the dataset remain with their owners. Our rights apply to the dataset itself. Separate rules apply to scientific and educational use; see the legislation of your country.

Once again, the project site, for those who missed the link above .
