INTERSPEECH 2014 includes the following 6 special sessions covering interdisciplinary topics and/or important new emerging areas of interest related to the main conference topics:
- Deep Neural Networks for Speech Generation and Synthesis
- Interspeech 2014 Computational Paralinguistics ChallengE (ComParE)
- Open Domain Situated Conversational Interaction
- Phase Importance in Speech Processing Applications
- Text-Dependent Speaker Verification with Short Utterances
- Speech technologies for Ambient Assisted Living
A description of each special session is given below.
For more information, feel free to contact the Special Session Chair, Dr. Tomi H. Kinnunen, at email firstname.lastname@example.org
Special Session Descriptions
Deep Neural Networks for Speech Generation and Synthesis
This special session aims to bring together researchers who work actively on deep neural networks (DNNs) for speech research, particularly in generation and synthesis, to promote and understand state-of-the-art DNN research in statistical learning and to compare results with parametric HMM-GMM based TTS synthesis, generation, and conversion. A DNN, with its neuron-like structure, can simulate the human speech production system in a layered, hierarchical, nonlinear and self-organized network.
It can transform linguistic text information into intermediate semantic, phonetic and prosodic content and finally generate speech waveforms. Many neural network architectures or topologies exist, e.g., feed-forward NNs with multiple hidden layers, stacked RBMs or CRBMs, and recurrent neural nets (RNNs), which have been used for speech/image recognition and other applications.
We would like to use this special session as a forum to present up-to-date results at the research frontier, in algorithm development and in application scenarios. Papers are welcome on, but not limited to, neural nets for:
- Parametric TTS synthesis
- Voice conversion, e.g., neural nets structure with speaker, environment or noise adaptation capabilities
- Neural nets as core modules for speech analysis-by-synthesis, compression and enhancement
- Different neural net structures, such as feed-forward DNNs and recurrent neural nets (RNNs), and their advantages or shortcomings in speech synthesis/generation.
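As a rough illustration of the feed-forward architecture mentioned above, the following minimal sketch maps one vector of linguistic features to one frame of acoustic parameters through several hidden layers. Everything in it is an assumption made for illustration: the layer sizes, the tanh activation, the feature values and the function names are hypothetical, and the weights are random and untrained, so the output is not meaningful speech data.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Create one fully connected layer with small random weights and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def affine(layer, x):
    """Compute W x + b for one layer."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def forward(layers, x):
    """Apply tanh on every hidden layer and a linear output layer (regression)."""
    for layer in layers[:-1]:
        x = [math.tanh(v) for v in affine(layer, x)]
    return affine(layers[-1], x)


rng = random.Random(0)
# Hypothetical sizes: 10 linguistic input features, two hidden layers, 5 acoustic outputs.
sizes = [10, 16, 16, 5]
layers = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes[:-1], sizes[1:])]

# Hypothetical input: binary phonetic context flags plus two numeric position features.
linguistic_features = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.25, 0.5]
acoustic_frame = forward(layers, linguistic_features)
print(len(acoustic_frame))
```

In a trained system the weights would be estimated from paired text and speech data; a recurrent network would additionally carry state from one frame to the next.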
Frank K. Soong, Microsoft Research, frankkps [at]microsoft.com
Interspeech 2014 Computational Paralinguistics ChallengE (ComParE)
The Interspeech 2014 Computational Paralinguistics ChallengE (ComParE) is an open Challenge dealing with speaker characteristics as manifested in the acoustic properties of the speech signal. This year, it introduces new tasks: the Cognitive Load Sub-Challenge, the Physical Load Sub-Challenge, and a Multitask Sub-Challenge. For these tasks, the organizers provide the COGNITIVE-LOAD WITH SPEECH AND EGG database (CLSE), the MUNICH BIOVOICE CORPUS (MBC), and the ANXIETY-DEPRESSION-EMOTION-SLEEPINESS audio corpus (ADES), which cover a high diversity of speakers and different languages (Australian English and German). All corpora provide fully realistic data in challenging acoustic conditions and feature rich annotation such as speaker meta-data. They come with distinct definitions of test, development, and training partitions, incorporating speaker independence as needed in most real-life settings. As in previous years, benchmark results of the most popular approaches are provided. Transcriptions of the training and development sets will be available.
All Sub-Challenges allow contributors to use their own features with their own machine learning algorithms. However, a standard feature set that may be used will be provided per corpus. Participants must adhere to the definition of training, development, and test sets. They may report results obtained on the development set, but have only five trials to upload their results on the test sets, whose labels are unknown to them. Each participation must be accompanied by a paper presenting the results; the paper undergoes peer review and has to be accepted for the conference in order to participate in the Challenge. The results of the Challenge will be presented in a Special Session at INTERSPEECH 2014 in Singapore. Contributions that use the Challenge data or are related to the Challenge but do not compete in it are also welcome.
More information is also given on the Challenge homepage:
Stefan Steidl, Friedrich-Alexander-University, stefan.steidl [at]fau.de
Anton Batliner, Technische Universität München / Friedrich-Alexander-University, batliner [at]cs.fau.de
Jarek Krajewski, Bergische Universität Wuppertal, krajewsk [at]uni-wuppertal.de
Julien Epps, The University of New South Wales / National ICT Australia, j.epps [at]unsw.edu.au
Open Domain Situated Conversational Interaction
Robust conversational systems have the potential to revolutionize our interactions with computers. Building on decades of academic and industrial research, we now talk to our computers, phones, and entertainment systems on a daily basis. However, current technology typically limits conversational interactions to a few narrow domains/topics (e.g., weather, traffic, restaurants). Users increasingly want the ability to converse with their devices over broad web-scale content. Finding something on your PC or the web should be as simple as having a conversation. A promising approach to address this problem is situated conversational interaction. The approach leverages the situation and/or context of the conversation to improve system accuracy and effectiveness. Sources of context include visual content being displayed to the user, geo-location, prior interactions, multi-modal interactions (e.g., gesture, eye gaze), and the conversation itself. For example, while a user is reading a news article on their tablet PC, they initiate a conversation to dig deeper on a particular topic. Or a user is reading a map and wants to learn more about the history of events at mile marker 121. Or a gamer wants to interact with a game’s characters to find the next clue in their quest. All of these interactions are situated – rich context is available to the system as a source of priors/constraints on what the user is likely to say. This special session will provide a forum to discuss research progress in open domain situated conversational interactions.
Topics of the session will include:
- Situated context in spoken dialog systems
- Visual/dialog/personal/geo situated context
- Inferred context through interpretation and reasoning
- Open domain spoken dialog systems
- Open domain spoken/natural language understanding and generation
- Open domain semantic interpretation
- Open domain dialog management (large-scale belief state/policy)
- Conversational Interactions
- Multi-modal inputs in situated open domains (speech/text + gesture, touch, eye gaze)
- Multi-human situated interactions
Dilek Hakkani-Tür, Microsoft Research, dilek [at]ieee.org
Gokhan Tur, Microsoft Research, gokhan [at]ieee.org
Steve Young, Cambridge University, sjy [at]eng.cam.ac.uk
Phase Importance in Speech Processing Applications
In past decades, the amplitude of the speech spectrum has been considered the most important feature in speech processing applications, and the phase of the speech signal has received less attention. Recently, several findings in the speech and audio processing communities have supported the importance of phase. The most prominent reported works, scattered across different communities, concern phase estimation alongside amplitude estimation in speech enhancement, complementary phase-based features in speech and speaker recognition, and phase-aware acoustic modelling of the environment. These examples suggest that incorporating phase information can push the limits of the state-of-the-art phase-independent solutions long employed in audio and speech signal processing.
This Special Session aims to explore recent advances and methodologies that exploit signal phase information in different aspects of speech processing. Without a dedicated effort to bring together researchers from different communities, rapid progress in investigating the usefulness of phase in speech processing applications is difficult to achieve. Therefore, as a first step in this direction, we aim to promote "phase-aware speech and audio signal processing" and to form a community of researchers to organize the next steps. Our initiative is to unify these efforts to better understand the pros and cons of using phase and the feasibility of phase estimation/enhancement in different areas of speech processing, including speech enhancement, speech separation, speech quality estimation, speech and speaker recognition, voice transformation, and speech analysis and synthesis. The goal is to promote phase-based signal processing, study its importance, and share interesting findings from different speech processing applications.
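To make the distinction between spectral amplitude and phase concrete, the sketch below splits the discrete Fourier transform of a short frame into magnitude and phase, then reconstructs the frame once with the original phase and once with the phase set to zero. It is only an illustration under stated assumptions: the frame is a synthetic sum of two sinusoids rather than real speech, and all function names are hypothetical.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of a real-valued frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def split_polar(spectrum):
    """Separate a complex spectrum into magnitude and phase."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def combine_polar(magnitude, phase):
    """Rebuild a complex spectrum from magnitude and phase."""
    return [cmath.rect(m, p) for m, p in zip(magnitude, phase)]


def max_abs_error(a, b):
    """Largest sample-wise difference between two signals."""
    return max(abs(x - y) for x, y in zip(a, b))


# Synthetic toy frame (an assumption, not speech): two sinusoids with different phases.
n = 64
frame = [math.sin(2 * math.pi * 3 * t / n + 0.7)
         + 0.5 * math.sin(2 * math.pi * 7 * t / n + 2.0)
         for t in range(n)]

magnitude, phase = split_polar(dft(frame))
with_phase = idft(combine_polar(magnitude, phase))
zero_phase = idft(combine_polar(magnitude, [0.0] * n))

error_with_phase = max_abs_error(frame, with_phase)
error_zero_phase = max_abs_error(frame, zero_phase)
print(error_with_phase, error_zero_phase)
```

With the original phase the frame is recovered up to rounding error, whereas the zero-phase reconstruction has the same magnitude spectrum but a visibly different waveform.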
Rahim Saeidi, University of Eastern Finland, rahim.saeidi [at]uef.fi
Yannis Stylianou, Toshiba Labs Cambridge UK / University of Crete, yannis [at]csd.uoc.gr
Text-Dependent Speaker Verification with Short Utterances
In recent years, speaker verification engines have reached maturity and have been deployed in commercial applications. The ergonomics of such applications are especially demanding and impose a drastic limit on speech duration during authentication. A well-known tactic to address the lack of data caused by short durations is text dependency. However, recent breakthroughs in accuracy and robustness achieved in text-independent speaker verification do not benefit text-dependent applications, because the large development data required by the recent approaches is not available in the text-dependent context.
The purpose of this special session is to gather research efforts from both academia and industry toward the common goal of establishing a new baseline and exploring new directions for text-dependent speaker verification. The focus of the session is on robustness with respect to duration and on modeling of lexical information. To support the development and evaluation of text-dependent speaker verification technologies, the Institute for Infocomm Research (I2R) has recently released the RSR2015 database, which includes 150 hours of data recorded from 300 speakers. Submitted papers are encouraged, but not required, to report results on the RSR2015 database in order to enable comparison of algorithms and methods. For this purpose, the organizers strongly encourage participants to report performance on the protocol delivered with the database in terms of EER and minimum cost (in the sense of the NIST 2008 Speaker Recognition Evaluation). To get the database, please contact the organizers.
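For readers unfamiliar with the two metrics, the following is a minimal sketch of how EER and minimum detection cost can be computed from lists of target and impostor scores. It is not the RSR2015 protocol or an official scoring tool. The function names and toy scores are hypothetical, and the default cost parameters (C_miss = 10, C_fa = 1, P_target = 0.01) are assumed to match the NIST 2008 evaluation plan and should be checked against it; the cost is left unnormalized.

```python
def error_rates(target_scores, impostor_scores):
    """Return (threshold, p_miss, p_fa) for every candidate threshold.

    A trial is accepted when its score is greater than or equal to the threshold.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    # One extra threshold above every score covers the "reject everything" case.
    thresholds.append(thresholds[-1] + 1.0)
    points = []
    for t in thresholds:
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        points.append((t, p_miss, p_fa))
    return points


def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER at the threshold where miss and false-alarm rates are closest."""
    points = error_rates(target_scores, impostor_scores)
    _, p_miss, p_fa = min(points, key=lambda p: abs(p[1] - p[2]))
    return (p_miss + p_fa) / 2.0


def min_detection_cost(target_scores, impostor_scores,
                       c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Minimum over thresholds of C_miss*P_miss*P_target + C_fa*P_fa*(1 - P_target).

    The default parameters are an assumption; verify them against the evaluation plan.
    """
    points = error_rates(target_scores, impostor_scores)
    return min(c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
               for _, p_miss, p_fa in points)


# Toy scores for illustration only (not taken from any database).
targets = [2.0, 1.5, 0.5, 3.0]
impostors = [0.0, 1.0, -1.0, 0.2]
eer = equal_error_rate(targets, impostors)
min_cost = min_detection_cost(targets, impostors)
print(eer, min_cost)
```

The threshold sweep is quadratic in the number of trials, which is acceptable for a toy example; a sorted cumulative count would be preferable for large trial lists.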
Further details are available at:
Hagai Aronowitz, IBM Research – Haifa, hagaia [at]il.ibm.com
Kong Aik Lee, Institute for Infocomm Research, kalee [at]i2r.a-star.edu.sg
Patrick Kenny, CRIM – Montréal, patrick.kenny [at]crim.ca
Speech technologies for Ambient Assisted Living
This special session focuses on the use of speech technologies for ambient assisted living: the creation of smart spaces and intelligent companions that can preserve the independence, executive function, social communication and security of people with special needs.
Currently, speech interfaces for assistive technologies remain underutilized despite their potential to replace or augment obtrusive and sometimes outright inaccessible conventional computer interfaces. Moreover, in a smart home context, the efficiency of speech interfaces can be supported by a number of concurrent information sources (e.g., wearable sensors, home automation sensors), enabling multimodal communication. In practice, daily hands-free usage of speech interfaces remains limited due to challenging real-world conditions, and because conventional speech interfaces can have difficulty with the atypical speech of many users. This, in turn, can be attributed to the lack of abundant speech material and the limited adaptation of these systems to the user.
Taking up the challenges of this domain requires a multidisciplinary approach to define the user's needs, record corpora in realistic usage conditions, and develop speech interfaces that are robust to both environment and user characteristics and are able to adapt to specific users.
This special session will bring together researchers in speech and audio technologies with people from the ambient assisted living and assistive technologies communities to meet and foster awareness between members of both communities, discuss problems, techniques and datasets, and perhaps initiate common projects.
Topics of the session will include:
- Assistive speech technology
- Applications of speech technology (ASR, dialogue, synthesis) for ambient assisted living
- Understanding, modelling, or recognition of aged and atypical speech
- Multimodal speech recognition (context-aware ASR)
- Multimodal emotion recognition
- Audio scene and smart space context analysis
- Assessment of speech and language processing within the context of assistive technology
- Speech synthesis and speech recognition for physical or cognitive impairments
- Symbol languages, sign languages, nonverbal communication
- Speech and NLP applied to typing interface applications
- Language modelling for Augmentative and Alternative Communication text entry and speech generating devices
- Deployment of speech and NLP tools in the clinic or in the field
- Linguistic resources; corpora and annotation schemes
- Evaluation of systems and components.
Jort F. Gemmeke, KU Leuven, jgemmeke [at] amadana.nl
François Portet, Laboratoire d'Informatique de Grenoble, francois.portet [at] imag.fr
Frank Rudzicz, University of Toronto, frank [at] cs.toronto.edu
Michel Vacher, Laboratoire d'Informatique de Grenoble, michel.vacher [at] imag.fr