Under-Resourced Languages
SPECIAL SESSION ON SPEECH PROCESSING AND SPEECH TECHNOLOGY FOR UNDER-RESOURCED LANGUAGES
Session chairs
About the session
Are you interested in speech processing for under-resourced languages? Do you want to contribute to the development of speech-based applications that can benefit speakers of under-resourced languages? If yes, then consider submitting a paper to the Special Session on Speech Processing and Speech Technology for Under-Resourced Languages at SPECOM2023!
This session will focus on the challenges and opportunities of developing speech and NLP-based applications for languages with few or almost no resources, such as annotated corpora and lexicons. India has thousands of spoken languages used daily by millions of people; however, only a few of them, such as Hindi, Tamil, and Bengali, have adequate resources for building speech and language technology. Most of the remaining spoken languages, such as Konkani from Goa, Tulu from Karnataka, Chhattisgarhi from Chhattisgarh, Mising from Assam, and Ao from Nagaland, are under-resourced: they do not have any properly managed or open-sourced speech and linguistic resources for research and development. This lack of resources poses significant barriers to developing and evaluating speech and language technology for their speakers and to connecting those speakers globally. With this in mind, the proposed session addresses the underlying challenges and possibilities in speech and language technologies.
This session aims to showcase the latest research and innovations in speech processing for under-resourced languages and to foster collaboration and the exchange of ideas among researchers working on this topic. Authors are welcome to submit original work that addresses any aspect of speech processing for under-resourced languages, such as:
Speech and linguistic data collection and annotation methods for under-resourced languages
Speech recognition and synthesis for under-resourced languages
Technology development with under-resourced linguistic resources
Adaptation and transfer learning techniques
Applications of self-supervised learning-based pretraining for under-resourced languages
Cross-lingual training techniques
Data augmentation approaches for under-resourced languages
Low-resource speech translation and spoken language technology
Ethical and social issues in speech processing for under-resourced languages
To promote building speech technologies for under-resourced languages, we will be sharing new TTS and ASR corpora in Chhattisgarhi, the official language of the state of Chhattisgarh in Central India. Work done with these corpora should be submitted to this special session only.
Chhattisgarhi ASR and TTS datasets details
The Chhattisgarhi ASR and TTS datasets were collected and curated at IISc Bangalore under the RESPIN and SYSPIN projects, respectively.
ASR
Language: Chhattisgarhi
Train set: 238.85 hours
Dev set: 4.03 hours
Test set: 4.01 hours
Links for downloading:
Train set: https://tinyurl.com/3shwu2dc
Dev set: https://tinyurl.com/5ye4kcdx
Test set: https://tinyurl.com/4974kkyb
Please cite reference [1] (see the References section below) if you use this dataset.
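As a quick sanity check on the ASR split sizes listed above, the following minimal Python sketch sums the three durations and reports each split's share of the total. The names `SPLIT_HOURS` and `split_summary` are illustrative only and are not part of the dataset release; the hour values are copied from the list above.

```python
# Split durations in hours, copied from the ASR dataset details above.
# SPLIT_HOURS and split_summary are hypothetical names used for illustration.
SPLIT_HOURS = {"train": 238.85, "dev": 4.03, "test": 4.01}


def split_summary(hours):
    """Return the total duration and each split's fraction of that total."""
    total = sum(hours.values())
    shares = {name: value / total for name, value in hours.items()}
    return total, shares


total_hours, shares = split_summary(SPLIT_HOURS)
print(f"total: {total_hours:.2f} h")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The dev and test sets are each only a small fraction of the corpus, so results reported on them are best accompanied by the exact split used.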
TTS
Language: Chhattisgarhi
10 hours of TTS data from one male speaker and 10 hours of TTS data from one female speaker (a small fraction of the corpus may have audio-text mismatches)
Links for downloading:
Male TTS data: https://tinyurl.com/jyusbzs5
Female TTS data: https://tinyurl.com/5t4zuvae
Please cite reference [2] (see the References section below) if you use this dataset.
Paper submission
If you have original and unpublished work on any of these topics, or related ones, we invite you to submit a paper to this special session. The paper submission deadline is the same as for regular papers. Papers should follow the SPECOM2023 format and guidelines, which can be found on the conference website: https://www.iitdh.ac.in/specom-2023/Submissions.html. Papers will be peer-reviewed by expert reviewers, and accepted papers will be published in the conference proceedings by Springer.
We look forward to receiving your submissions and seeing you at SPECOM2023!
References
[1] Abhayjeet Singh, Arjun Singh Mehta, Ashish Khuraishi K S, Deekshitha G, Gauri Date, Jai Nanavati, Jesuraja Bandekar, Karnalius Basumatary, Karthika P, Sandhya Badiger, Sathvik Udupa, Saurabh Kumar, Prasanta Kumar Ghosh, Prashanthi V, Priyanka Pai, Raoul Nanavati, Sai Praneeth Reddy Mora, Srinivasa Raghavan, “An ASR corpus in Chhattisgarhi, an under-resourced Indian language,” submitted to SPECOM 2023.
[2] Abhayjeet Singh, Anjali Jayakumar, Deekshitha G, Hitesh Tiwari, Jesuraja Bandekar, Sandhya Badiger, Sathvik Udupa, Saurabh Kumar, Prasanta Kumar Ghosh, “An end-to-end TTS model in Chhattisgarhi, an under-resourced Indian language,” submitted to SPECOM 2023.