Low-Resource Languages vs. Conversational Artificial Intelligence
In natural language processing, languages are categorized as either high-resource or low-resource. Low-resource languages lack data that can be used for machine learning or other processing, while high-resource languages are rich in available data.
The essential first step in building artificial intelligence solutions in any language is acquiring a dataset for that language. According to Ethnologue, a research center for language intelligence, there are 7,151 languages spoken today. Its 2022 report shows that Mandarin Chinese and Spanish lead in native speakers, while English ranks third by native speakers but first by total speakers, followed by Mandarin Chinese.
Only about 20 languages have text corpus databases of hundreds of millions of words. English has the largest amount of data, followed by Chinese and Spanish.
African languages and the majority of Asian languages lack the resources or stored data needed for training or building artificial intelligence systems, and so they are termed low-resource languages.
By now you should have a sense of what low-resource languages are, but let me dive deeper.
About Low-Resource Languages: What Makes a Language Low-Resource?
Low-resource languages are those with relatively little data available for building artificial intelligence systems such as conversational AI, machine translation, and so on.
Languages like English, Spanish, and Chinese are high-resource languages: they already have immense corpora and ecosystems that simplify the essential requirements of training AI models to solve day-to-day challenges.
That brief definition may not be enough to grasp the overall concept; what I really mean is the availability of resources on demand when developing any artificial intelligence solution. Low-resource languages like Swahili, Hindi, Kinyarwanda, and Bengali are spoken by large numbers of people, yet they are poorly represented on the internet compared to Western European languages, Chinese, Spanish, and Japanese.
This makes it complex to build natural language processing solutions like conversational AI systems. Many languages exist mostly in oral form, with very few written resources.
The types of resources required to develop language-based solutions include the following (a small illustrative sketch follows this list):
- Access to a large amount of raw text data from various documents such as books, emails, social media content, and scientific papers
- Task-specific resources such as parallel corpora for machine translation, various kinds of annotated text for part-of-speech tagging, and dictionaries for developing named entity recognition systems.
- Working solutions and available data for fine-tuning models on downstream tasks; their absence limits the range of tasks we can solve with low-resource NLP tools.
- Auxiliary data such as labeled data in other languages, plus lexical, syntactic, and semantic resources such as dictionaries, dependency treebanks, and semantic databases like WordNet.
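To make these resource types concrete, here is a small Python sketch representing each of them as plain data structures. The Swahili snippets, translations, and labels are illustrative placeholders rather than curated data.

```python
# Hypothetical examples of the resource types listed above.

# Raw text corpus: unannotated sentences gathered from documents.
raw_corpus = [
    "Habari za asubuhi",
    "Karibu tena",
]

# Parallel corpus for machine translation: (source, target) pairs.
parallel_corpus = [
    ("Habari za asubuhi", "Good morning"),
    ("Karibu tena", "Welcome back"),
]

# Annotated text for part-of-speech tagging: (token, tag) pairs.
pos_annotated = [
    [("Habari", "NOUN"), ("za", "ADP"), ("asubuhi", "NOUN")],
]

# Lexical resource: a small gazetteer for named entity recognition.
ner_gazetteer = {
    "Dar es Salaam": "LOCATION",
    "Kilimanjaro": "LOCATION",
}

print(f"{len(raw_corpus)} raw sentences, {len(parallel_corpus)} parallel pairs")
```

For a high-resource language, each of these structures would hold millions of entries; for a low-resource language, even small collections like these can be hard to obtain.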
About Conversational Artificial Intelligence
Conversational AI is the subset of artificial intelligence that uses natural language processing and natural language understanding to enable consumers to interact with computer applications the way they would with other humans.
Conversational AI involves three concepts:
- Artificial intelligence,
- Human language, and
- Automation.
Conversational AI systems differ in complexity. At the simple end are chatbots; a good example is an FAQ chatbot that returns canned responses triggered by exact keywords. In other words, these are simple keyword matchers, and most are built for a specific use case, as the sketch below illustrates.
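Here is a minimal Python sketch of such a keyword-matching FAQ bot; the questions and canned answers are hypothetical.

```python
# A minimal sketch of the simplest tier of conversational AI: an FAQ
# chatbot that matches exact keywords in the user's message.
FAQ = {
    "hours": "We are open Monday to Friday, 9am to 5pm.",
    "price": "Plans start at $10 per month.",
    "refund": "Refunds are processed within 7 business days.",
}

def faq_bot(message: str) -> str:
    """Return the first answer whose keyword appears in the message."""
    text = message.lower()
    for keyword, answer in FAQ.items():
        if keyword in text:
            return answer
    return "Sorry, I don't understand. Try asking about hours, price, or refund."

print(faq_bot("What are your opening hours?"))  # -> the hours answer
print(faq_bot("Nina swali"))                    # -> fallback (no keyword match)
```

Note how brittle this is: any phrasing that avoids the exact keywords falls through to the fallback, which is exactly why we do not call such systems truly conversational.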
As complexity increases and natural language processing is used to generate responses, we can call the system truly conversational, because it can learn over time and understand the context of a prompt. Customer-service assistants are a good example, although many are still linear and cannot carry context from one conversation to the next.
At the high end of conversational artificial intelligence we have virtual assistants; good examples are Siri and Alexa. These are advanced conversational assistants that serve a specific purpose with dialog management and contextual understanding.
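To show what dialog management adds over keyword matching, here is a minimal sketch of an assistant that tracks slots across turns, so context from earlier turns shapes later responses. The taxi-booking task and slot names are hypothetical.

```python
# A minimal sketch of dialog management via slot filling: the state
# persists across turns within a conversation.
from dataclasses import dataclass, field

@dataclass
class DialogState:
    slots: dict = field(default_factory=lambda: {"pickup": None, "time": None})

    def missing(self):
        return [name for name, value in self.slots.items() if value is None]

def handle_turn(state: DialogState, slot: str, value: str) -> str:
    """Fill one slot per turn and ask for whatever is still missing."""
    state.slots[slot] = value
    remaining = state.missing()
    if remaining:
        return f"Got it. I still need: {', '.join(remaining)}."
    return f"Booking a taxi from {state.slots['pickup']} at {state.slots['time']}."

state = DialogState()
print(handle_turn(state, "pickup", "the airport"))  # asks for the time
print(handle_turn(state, "time", "9am"))            # completes the booking
```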
Conversational artificial intelligence may take the form of text or voice.
Building Conversational AI for Low-Resource Languages
Let's get back to our topic. Currently, conversational artificial intelligence solutions focus on high-resource languages, even though there are about 3.5 billion speakers of low-resource languages, located mainly in Asia and Africa.
Now we can see what the picture looks like: a large portion of the world's population is still underserved by natural language processing systems because of the various challenges developers face when building conversational AI solutions, or NLP systems in general, for low-resource languages.
Building advanced conversational artificial intelligence solutions requires collections of data for a specific domain: sophisticated language-specific engineering, annotated text, parallel corpora for machine translation, audio recordings for speech and translation models, dictionaries for training NER systems, and so on.
Question: Can we build conversational AI solutions for low-resource languages when these resources are missing?
Let's explore some of the approaches involved in building conversational AI solutions for low-resource languages:
Data augmentation: the shortage of data is the first challenge to address. Collecting more data means compiling text in the target language, and this approach may achieve the best results, but it is expensive: it requires extensive preparatory work, the involvement of experts in the target language, and cleaning and formatting of the collected dataset.
To overcome this challenge, data augmentation approaches create new data automatically without collecting it explicitly. Language-Model-Based Data Augmentation (LAMBADA), a data augmentation method for text classification tasks, is very useful for making low-resource conversational AI solutions effective across multiple domains.
With LAMBADA, it is easy to fine-tune pretrained language models so that they generate synthetic training data for text classification tasks such as intent classification in conversational systems, as sketched below.
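Here is a hedged sketch of the generation step of the LAMBADA idea, using the Hugging Face transformers pipeline. Plain `gpt2` stands in for a model that would first be fine-tuned on the existing labeled utterances, and the full method's classifier-based filtering of generations is omitted; the intent labels are hypothetical.

```python
# Sketch of LAMBADA-style synthetic data generation for intent
# classification: condition a language model on an intent label and
# sample candidate utterances. In the real method the model is first
# fine-tuned on the available labeled data and the generations are
# filtered by a classifier before being added to the training set.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def synthesize(intent: str, n: int = 3) -> list[str]:
    """Generate n candidate utterances conditioned on an intent label."""
    prompt = f"intent: {intent}\nutterance:"
    outputs = generator(
        prompt,
        max_new_tokens=20,
        num_return_sequences=n,
        do_sample=True,
    )
    # Keep only the generated continuation, one utterance per output.
    return [o["generated_text"][len(prompt):].strip().split("\n")[0] for o in outputs]

for intent in ["greeting", "check_balance"]:
    print(intent, "->", synthesize(intent))
```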
Meta-transfer learning: conversational AI solutions for low-resource languages can be further improved by training models on a high-resource language (or reusing existing ones) and then retraining or fine-tuning them on the low-resource language. The model uses the knowledge gained during training on large-scale high-resource language data and transfers it to the low-resource language data, which might significantly improve the model's performance. The sketch below shows this pattern.
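A minimal sketch of this pattern, assuming the `transformers` and `torch` libraries: we start from `xlm-roberta-base`, a multilingual encoder whose pretraining covered many languages including Swahili, and fine-tune it for intent classification on a tiny, hypothetical labeled set.

```python
# Sketch of cross-lingual transfer: a pretrained multilingual encoder
# carries knowledge from high-resource pretraining into fine-tuning on
# a small low-resource dataset. The examples below are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

INTENTS = ["greeting", "check_balance"]
examples = [
    ("Habari za asubuhi", 0),        # greeting
    ("Nataka kuangalia salio", 1),   # check_balance
    ("Karibu tena", 0),              # greeting
]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(INTENTS)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):  # a few passes over the tiny dataset
    for text, label in examples:
        batch = tokenizer(text, return_tensors="pt")
        loss = model(**batch, labels=torch.tensor([label])).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

In practice one would use a real training loop with batching and evaluation, but the key point stands: only the small low-resource dataset is needed at fine-tuning time.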
Meta-learning allows models to learn analogies and patterns from data and transfer this knowledge to specific tasks. The number of samples for those specific tasks in the training dataset may vary from few-shot to one-shot or even zero-shot learning; the zero-shot case is sketched below.
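As a concrete illustration of the zero-shot end of that spectrum, the sketch below uses the transformers zero-shot classification pipeline with the English NLI checkpoint `facebook/bart-large-mnli`; for a low-resource language one would substitute a multilingual NLI model. The intent labels are hypothetical.

```python
# Sketch of zero-shot intent classification: an NLI-based pipeline
# scores candidate intent labels with no task-specific training
# examples at all.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "I would like to check my account balance",
    candidate_labels=["greeting", "check_balance", "refund_request"],
)
print(result["labels"][0], result["scores"][0])  # top intent and its score
```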
Meta-learning thus allows transferring knowledge to new languages and domains, and applying it to low-resource languages might ease the limitations of such models. Conversational AI can use this for specific tasks like translation, intent classification, entity recognition, and so on.
Multilingual virtual assistant vendors: different groups and vendors are working on different low-resource language domains for conversational AI. Partnering with vendors across markets can help build conversational AI solutions that work across multiple languages and cover a larger market space.
Cross-lingual annotation projection: a task-specific classifier is trained in a high-resource language. Using parallel corpora, the unlabeled low-resource data is aligned to its equivalent in the high-resource language, where labels can be obtained from the aforementioned classifier. These labels on the high-resource text can then be projected back onto the text in the low-resource language based on the alignment between tokens in the parallel texts.
This technique helps obtain labeled data for low-resource languages. Cross-lingual projections have commonly been applied in low-resource settings for tasks such as POS tagging and parsing; a minimal sketch appears below.
Training models with cross-lingual transfer learning usually requires linguistic knowledge and resources describing the relation between the source language and the target language.
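Here is that projection step in plain Python, assuming we already have POS labels from an English tagger and a word alignment between the parallel sentences; both are hard-coded here as hypothetical inputs (in practice the labels come from a trained tagger and the alignment from an automatic word aligner).

```python
# Sketch of cross-lingual annotation projection for POS tagging.
en_tokens = ["Good", "morning"]
sw_tokens = ["Habari", "za", "asubuhi"]

# Labels produced by the high-resource (English) classifier.
en_tags = {"Good": "ADJ", "morning": "NOUN"}

# Token alignment: index in sw_tokens -> index in en_tokens.
alignment = {0: 1, 2: 1}  # "Habari"->"morning", "asubuhi"->"morning"

# Project labels back onto the low-resource tokens; unaligned tokens
# (like the particle "za") stay unlabeled.
projected = [
    (tok, en_tags[en_tokens[alignment[i]]] if i in alignment else None)
    for i, tok in enumerate(sw_tokens)
]
print(projected)  # [('Habari', 'NOUN'), ('za', None), ('asubuhi', 'NOUN')]
```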
Final Thoughts
We have a long way to go to close the resource gap in the NLP ecosystem for low-resource languages, but kudos to the communities and platforms pushing to overcome some of these challenges, such as Masakhane, Mozilla Common Voice, NeuralSpace, and more.
It is difficult to build conversational AI solutions for low-resource languages because of the challenges I discussed above, but the approaches outlined here offer a place to start.