
Research Focus
The SALT (Speech and Language Technologies) research group conducts both fundamental and applied research in speech and language technologies. Our main areas of focus include:
- Automatic Speech Recognition (ASR) – We focus on systems that convert spoken language into text, with applications such as voice-to-text document creation, dictation software, automatic captioning of live broadcasts, and information retrieval in audiovisual archives.
- Natural Language Processing (NLP) – We work on technologies that enable the automatic processing of textual data – whether created directly in written form or produced by automatic transcription using ASR. Specifically, these include techniques for automatic text correction (diacritics and punctuation restoration), named entity recognition, semantic search, and many others (a small NER sketch follows this list).
- Text-to-Speech Synthesis (TTS) – We develop technologies that enable the generation of natural-sounding speech from written text, with applications such as voice assistants, navigation systems, document/news readers, and voice cloning systems.
- Spoken Dialogue Systems – We focus on the design of intelligent systems for effective two-way spoken communication between humans and machines, used in customer contact centers, information hotlines, and assistive technologies. In addition to speech recognition and synthesis modules, these systems include components for speech understanding and dialogue management, which are crucial for achieving smooth and meaningful interaction (a minimal pipeline skeleton follows this list).
- Voice Biometrics – Our research covers methods for identifying and authenticating individuals based on unique voice characteristics, which are especially useful in security applications (e.g., securing access to sensitive data or locations) and in the prevention of crime and terrorism (a toy verification example follows this list).
- Audiovisual Recognition and Synthesis – We concentrate on processing spontaneous speech in combination with visual information, which enables the development of realistic avatars capable of natural communication, including facial expressions and articulation.
- Automatic Processing of Audiovisual Archives – We develop methods for automatic indexing and rapid information retrieval in large video archives, including the search for place names, personal names, and other entities without the need for prior manual annotation (a toy indexing example follows this list).
- Assistive Technologies – We work on technologies supporting the integration of people with disabilities, such as voice cloning for individuals at risk of losing their voice, automatic readers for the visually impaired, speech captioning, and automatic translation between spoken language and sign language for the deaf. These technologies facilitate communication and the inclusion of such individuals in society, and some also improve the productivity of healthcare staff.
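As a toy illustration of the NLP techniques mentioned above, the sketch below runs named entity recognition with the open-source spaCy library; the model name and the example transcript are placeholder assumptions, not the group's actual tools or data.

```python
# Minimal named entity recognition sketch using spaCy (illustrative only;
# the model and the transcript below are placeholders, not SALT's tools).
import spacy

# A small general-purpose English model; in practice, raw ASR output would
# first pass through punctuation and diacritics restoration.
nlp = spacy.load("en_core_web_sm")

transcript = "The broadcast from Prague covered a lecture at Charles University."
doc = nlp(transcript)

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g., Prague GPE, Charles University ORG
```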
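To make the dialogue-system architecture concrete, here is a minimal pipeline skeleton showing how the four modules hand off to each other; every component is a trivial stand-in invented for this sketch, not one of the group's actual modules.

```python
# Toy spoken dialogue pipeline (illustrative only; the components below are
# trivial stand-ins for real ASR, understanding, dialogue-management, and
# TTS modules, not SALT's actual software).

def recognize(audio: str) -> str:
    """ASR stand-in: here the 'audio' is already text for demonstration."""
    return audio

def understand(text: str) -> dict:
    """Speech-understanding stand-in: keyword-based intent detection."""
    if "weather" in text.lower():
        return {"intent": "ask_weather"}
    return {"intent": "unknown"}

def manage_dialogue(state: dict, frame: dict) -> tuple[dict, str]:
    """Dialogue-management stand-in: pick a response, track turn count."""
    state = {**state, "turns": state.get("turns", 0) + 1}
    if frame["intent"] == "ask_weather":
        return state, "It is sunny today."
    return state, "Sorry, could you rephrase that?"

def synthesize(text: str) -> str:
    """TTS stand-in: a real system would return audio, not text."""
    return f"[spoken] {text}"

# One full turn: ASR -> understanding -> dialogue management -> TTS.
state: dict = {}
state, reply = manage_dialogue(state, understand(recognize("What's the weather?")))
print(synthesize(reply))  # [spoken] It is sunny today.
```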
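A standard decision rule in voice biometrics compares a fixed-length voice embedding of a new utterance against an enrolled voiceprint. The sketch below shows cosine-similarity scoring; the vectors and threshold are invented, and a real system would extract the embeddings with a trained speaker encoder.

```python
# Speaker verification via cosine similarity of voice embeddings
# (illustrative only; the vectors and threshold below are made up, and a
# real system would obtain embeddings from a trained speaker encoder).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

enrolled = np.array([0.12, -0.80, 0.55, 0.10])   # stored voiceprint
attempt  = np.array([0.10, -0.75, 0.60, 0.05])   # new utterance embedding

THRESHOLD = 0.85  # tuned on held-out data in practice
score = cosine_similarity(enrolled, attempt)
print(f"score={score:.3f}", "accept" if score >= THRESHOLD else "reject")
```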
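For archive search, one minimal picture is an inverted index built from time-stamped ASR transcripts, so that entity queries map straight to playback positions without any manual annotation. The sketch below is illustrative; the transcript segments are invented.

```python
# Toy inverted index over time-stamped ASR transcripts, enabling entity
# search in a video archive without manual annotation (illustrative only;
# the transcript data below is invented).
from collections import defaultdict

# (video_id, start_seconds, recognized text) as an ASR system might emit.
segments = [
    ("news_01", 12.4, "the president visited Prague on Monday"),
    ("news_01", 95.0, "floods were reported near Pilsen"),
    ("doc_07",  33.2, "Prague castle dates back to the ninth century"),
]

index: dict[str, list[tuple[str, float]]] = defaultdict(list)
for video, start, text in segments:
    for word in text.lower().split():
        index[word].append((video, start))

# Query: where is "prague" mentioned?
print(index["prague"])  # [('news_01', 12.4), ('doc_07', 33.2)]
```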
We place particular emphasis on the robustness and efficiency of systems capable of reliable operation even in challenging linguistic environments, especially those involving Slavic languages.
Technologies Used
Our research utilizes a broad range of advanced methods and technologies, in particular:
- Machine learning and deep neural networks for speech and language processing and analysis
- Artificial intelligence, including large language models (LLMs) and speech language models (SLMs)
- Language and acoustic modeling of speech (a worked fusion example follows this list)
- Advanced algorithms for processing spontaneous and multimodal speech
- Biometric methods for identifying individuals based on voice
- Multimodal integration of speech with visual data for the creation of realistic avatars
- Modern web and cloud technologies for the deployment of speech and language systems
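As a small worked example of how acoustic and language modeling interact, the sketch below ranks ASR hypotheses by shallow fusion, a standard technique for combining the two scores; all log-probabilities and the fusion weight are invented for illustration.

```python
# Shallow fusion of acoustic and language-model scores when ranking ASR
# hypotheses (illustrative only; the log-probabilities and weight below
# are invented, not measurements from a real system).

# Candidate transcriptions with (acoustic log-prob, LM log-prob).
hypotheses = {
    "recognize speech":   (-12.1, -4.2),
    "wreck a nice beach": (-11.8, -9.7),
}

LM_WEIGHT = 0.8  # tuned on development data in practice

def fused_score(acoustic: float, lm: float) -> float:
    return acoustic + LM_WEIGHT * lm

best = max(hypotheses, key=lambda h: fused_score(*hypotheses[h]))
print(best)  # "recognize speech": the LM rescues the acoustically weaker hypothesis
```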