Jeff A. Bilmes's Publications

SVitchboard-II and FiSVer-I: Crafting high quality and low complexity conversational English speech corpora using submodular function optimization

Yuzong Liu, Rishabh Iyer, Katrin Kirchhoff, and Jeff Bilmes. SVitchboard-II and FiSVer-I: Crafting high quality and low complexity conversational English speech corpora using submodular function optimization. Computer Speech & Language, 42:122–142, 2017.

Abstract

We introduce a set of benchmark corpora of conversational English speech derived from the Switchboard-I and Fisher datasets. Traditional automatic speech recognition (ASR) research requires considerable computational resources and has slow experimental turnaround times. Our goal is to introduce these new datasets to researchers in the ASR and machine learning communities in order to facilitate the development of novel speech recognition techniques on smaller but still acoustically rich, diverse, and hence interesting corpora. We select these corpora to maximize an acoustic quality criterion while limiting the vocabulary size (from 10 words up to 10,000 words), where both “acoustic quality” and vocabulary size are adeptly measured via various submodular functions. We also survey numerous submodular functions that could be useful to measure both “acoustic quality” and “corpus complexity” and offer guidelines on when and why a scientist may wish to use one vs. another. The corpora selection process itself is naturally performed using various state-of-the-art submodular function optimization procedures, including submodular level-set constrained submodular optimization (SCSC/SCSK), difference-of-submodular (DS) optimization, and unconstrained submodular minimization (SFM), all of which are fully defined herein. While the focus of this paper is on the resultant speech corpora, and the survey of possible objectives, a consequence of the paper is a thorough empirical comparison of the relative merits of these modern submodular optimization procedures. We provide baseline word recognition results on all of the resultant speech corpora for both Gaussian mixture model (GMM) and deep neural network (DNN)-based systems, and we have released all of the corpora definitions and Kaldi training recipes for free in the public domain.
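
To give a feel for the kind of selection problem the abstract describes, the following is a minimal sketch of a greedy ratio heuristic that picks utterances to raise a quality score while keeping the number of distinct words under a budget. It is not the SCSC/SCSK, DS, or SFM procedure from the paper. The function names (vocab_size, greedy_select), the per-utterance quality scores, and the toy data are all hypothetical. Here the quality objective is a simple sum of per-utterance scores (a modular function, which is a special case of a submodular one), and vocabulary size is a set-cover function, which is submodular.

```python
from typing import Dict, List, Set


def vocab_size(selected: List[str], words: Dict[str, Set[str]]) -> int:
    """Count distinct words covered by the selected utterances (a set-cover function)."""
    covered: Set[str] = set()
    for utt in selected:
        covered |= words[utt]
    return len(covered)


def greedy_select(
    words: Dict[str, Set[str]],
    quality: Dict[str, float],
    max_vocab: int,
) -> List[str]:
    """Greedily add the utterance with the best quality gain per newly added word,
    skipping any utterance that would push the vocabulary past max_vocab."""
    selected: List[str] = []
    covered: Set[str] = set()
    remaining = set(words)
    while remaining:
        best, best_ratio = None, -1.0
        for utt in sorted(remaining):  # sorted for deterministic tie-breaking
            new_words = words[utt] - covered
            if len(covered) + len(new_words) > max_vocab:
                continue  # would violate the vocabulary budget
            cost = len(new_words)
            # An utterance that adds no new words is free with respect to the budget.
            ratio = float("inf") if cost == 0 else quality[utt] / cost
            if ratio > best_ratio:
                best, best_ratio = utt, ratio
        if best is None:
            break  # nothing else fits within the budget
        selected.append(best)
        covered |= words[best]
        remaining.remove(best)
    return selected


# Toy data (hypothetical): utterance id -> words it contains, and a quality score.
toy_words = {
    "u1": {"yes", "no"},
    "u2": {"yes"},
    "u3": {"maybe", "so", "yes"},
    "u4": {"no", "way"},
}
toy_quality = {"u1": 3.0, "u2": 1.0, "u3": 3.0, "u4": 1.5}

chosen = greedy_select(toy_words, toy_quality, max_vocab=3)
print(chosen, vocab_size(chosen, toy_words))
```

A ratio-based greedy rule like this is only a common baseline for budgeted selection. The procedures named in the abstract handle general submodular objectives and constraints.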

BibTeX

@article{liu-svitchboard-fisver-submodular-2017,
title = {SVitchboard-II and FiSVer-I: Crafting high quality and low complexity conversational {E}nglish speech corpora using submodular function optimization},
journal = {Computer Speech \& Language},
volume = {42},
pages = {122--142},
year = {2017},
issn = {0885-2308},
doi = {10.1016/j.csl.2016.10.002},
url = {https://www.sciencedirect.com/science/article/pii/S0885230816301942},
author = {Yuzong Liu and Rishabh Iyer and Katrin Kirchhoff and Jeff Bilmes},
keywords = {Submodular function optimization, Automatic speech recognition, Speech corpus},
abstract = {We introduce a set of benchmark corpora of conversational English speech derived from the Switchboard-I and Fisher datasets. Traditional automatic speech recognition (ASR) research requires considerable computational resources and has slow experimental turnaround times. Our goal is to introduce these new datasets to researchers in the ASR and machine learning communities in order to facilitate the development of novel speech recognition techniques on smaller but still acoustically rich, diverse, and hence interesting corpora. We select these corpora to maximize an acoustic quality criterion while limiting the vocabulary size (from 10 words up to 10,000 words), where both “acoustic quality” and vocabulary size are adeptly measured via various submodular functions. We also survey numerous submodular functions that could be useful to measure both “acoustic quality” and “corpus complexity” and offer guidelines on when and why a scientist may wish to use one vs. another. The corpora selection process itself is naturally performed using various state-of-the-art submodular function optimization procedures, including submodular level-set constrained submodular optimization (SCSC/SCSK), difference-of-submodular (DS) optimization, and unconstrained submodular minimization (SFM), all of which are fully defined herein. While the focus of this paper is on the resultant speech corpora, and the survey of possible objectives, a consequence of the paper is a thorough empirical comparison of the relative merits of these modern submodular optimization procedures. We provide baseline word recognition results on all of the resultant speech corpora for both Gaussian mixture model (GMM) and deep neural network (DNN)-based systems, and we have released all of the corpora definitions and Kaldi training recipes for free in the public domain.},
}
