"Data is a tool for enhancing intuitions" ~ Hilary Mason data scientist and founder of Fast Forward Lab
When working on Natural Language Processing challenges such as sentiment analysis or text classification, collecting enough labeled observations for each category is often difficult. A good option is to try data augmentation techniques.
What is Data Augmentation?
Data augmentation refers to techniques that generate additional, synthetic data from the data you already have.
According to Wikipedia, data augmentation in data analysis refers to techniques used to increase the amount of data by adding slightly modified copies of already existing data or newly created synthetic data from the original data. It acts as a regularizer and helps reduce over-fitting when training a machine learning model.
Augmentation techniques are super popular in computer vision applications, and they are just as powerful for Natural Language Processing.
Today I will focus on the back-translation technique, applied to Swahili text using Google Translate. If this is your first time hearing about Swahili: Swahili, also known as Kiswahili, is a Bantu language and the native language of the Swahili people. It is one of two official languages of the East African Community countries, namely Tanzania, Burundi, Democratic Republic of Congo, Kenya, Rwanda, South Sudan, and Uganda.
About Back Translation
Back translation, also called reverse translation, is the process of re-translating content from the target language back to its source language in literal terms. In Natural Language Processing it is used as a data augmentation technique. For example, we can translate Swahili content to English and then back to Swahili.
Back translation doesn't impact the translation memory or other resources like glossaries used by the translator. It is most helpful when the content at hand includes slogans, titles, product names, taglines, and puns simply because the implied meaning of the content in one language doesn't necessarily work for another language or region.
Back translation is commonly used as a quality assessment and assurance technique. A linguist translates the original source text into the new language, and then a linguist translates the localized string back into the source language literally, to convey the meaning of the translation. At the end, the content owner or project manager selects the option that best represents the brand, tone, and intention of the source content.
Why Back Translation?
It helps to identify any ambiguities, errors, and confusion that may arise from the nuances of language and helps to evaluate the equivalence of meaning between the source and target text. By comparing the back translation to the original text, the quality and accuracy of the translation into another language can be confirmed.
In Natural Language Processing, we use it as a method to boost the quantity of text data for model training purposes.
As a quality assurance method, it is mostly applied in pharmaceutical companies, medical device companies, and clinical research organizations.
Using Back Translation in Augmenting Swahili data
For this case we will use Google Translate as our main translator, which is easily consumed through an API. We are going to follow the steps below:
- Collecting original Swahili text
- Converting Swahili text into English using Google Translate
- Converting the translated text back into Swahili using Google Translate
- Results assessment.
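The steps above can be sketched in Python as a small round-trip function. This is only an illustration of the idea, not the spreadsheet method used in this article: `back_translate` and `toy_translate` are hypothetical names, and `toy_translate` is a made-up lookup table standing in for a real translation service.

```python
from typing import Callable, Tuple

# A translator takes (text, source_language, target_language) and returns text.
Translator = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translator,
                   source: str = "sw", pivot: str = "en") -> Tuple[str, str]:
    """Translate text into the pivot language, then back into the source language."""
    pivot_text = translate(text, source, pivot)       # step 2: Swahili -> English
    augmented = translate(pivot_text, pivot, source)  # step 3: English -> Swahili
    return pivot_text, augmented


def toy_translate(text: str, source: str, target: str) -> str:
    """Hypothetical stand-in for a real translator (tiny lookup table only)."""
    table = {
        ("sw", "en"): {"Habari ya asubuhi": "Good morning"},
        ("en", "sw"): {"Good morning": "Habari za asubuhi"},
    }
    # Unknown sentences are returned unchanged.
    return table[(source, target)].get(text, text)


if __name__ == "__main__":
    original = "Habari ya asubuhi"                    # step 1: original text
    english, augmented = back_translate(original, toy_translate)
    # step 4: a first, very rough assessment is whether the text changed at all
    print(original, "->", english, "->", augmented, "| changed:", augmented != original)
```

Passing the translator in as an argument keeps the round-trip logic separate from whichever translation service you end up using.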
Google Translate will perform the translation from Swahili to English and then back to Swahili. You can also try this with your native language to see what the translation looks like.
Why Google Translate? Simply because it is the most popular service for this purpose, but using its API directly requires an API key, and it is a paid service.
We are not going to pay here, because we can use Google Translate through a spreadsheet: if you have a Google account, you have access to the Google Sheets web app, which makes the work simpler.
Let's collect some Swahili text. Here we use a simple file with six sentences, but you can apply the same approach to a large text classification challenge.
Then it's time to use Google Translate. First, we add two new columns, one for the English text and one for the augmented text. Then we use GOOGLETRANSLATE() to translate the Swahili sentences to English and then back to Swahili.
GOOGLETRANSLATE() takes three arguments: the text you want to translate, the source language, and the target language. The square brackets in the signature mark the two language arguments as optional, but for back translation we set both explicitly.
GOOGLETRANSLATE(text, [source_language], [target_language])
This is just for understanding the concept of translation. To apply back translation, we can nest the GOOGLETRANSLATE function, calling it twice in a single formula.
Here is the full command:
=GOOGLETRANSLATE(GOOGLETRANSLATE(A2,"sw","en"), "en", "sw" )
After writing the formula, just hit the return key to see the augmented Swahili text.
After the first sentence, we can apply the formula to all observations by selecting the first cell of the AUGMENTED TEXT column and dragging the small square at its bottom-right corner down.
Back translation worked well. If you are working on a sentiment analysis challenge, you can apply this to your training set of sentiments to augment the data.
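If you prefer to generate the formula for many rows instead of dragging, the nested formula can be built as a string. The sketch below is only an illustration: the function name `back_translation_formula` and its defaults are assumptions, and it produces text to paste into a sheet rather than calling any translation service.

```python
def back_translation_formula(cell: str, source: str = "sw", pivot: str = "en") -> str:
    """Build the nested GOOGLETRANSLATE formula for one spreadsheet cell."""
    # Inner call: source language -> pivot language
    inner = f'GOOGLETRANSLATE({cell},"{source}","{pivot}")'
    # Outer call: pivot language -> back to the source language
    return f'=GOOGLETRANSLATE({inner},"{pivot}","{source}")'


# Formulas for rows 2 to 4 of column A
formulas = [back_translation_formula(f"A{row}") for row in range(2, 5)]
for formula in formulas:
    print(formula)
```

The same function works for other language pairs by changing `source` and `pivot` to the matching language codes.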
Note: After applying back translation, some sentences may come back identical to their original. Make sure to filter out all augmented sentences that are the same as their original; this removes duplicates.
Then you can export the CSV file and use the data to train your model.
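The filtering step from the note could look like the sketch below. The helper names `normalize` and `keep_new_sentences` are hypothetical. Treating sentences that differ only in letter case or spacing as duplicates is my own assumption, and the sample sentences are made up for illustration.

```python
from typing import List, Tuple


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(sentence.lower().split())


def keep_new_sentences(pairs: List[Tuple[str, str]]) -> List[str]:
    """Return augmented sentences that match no original and no earlier augmented one."""
    # Start with every original, so an augmented sentence equal to any of them is dropped.
    seen = {normalize(original) for original, _ in pairs}
    kept = []
    for _, augmented in pairs:
        key = normalize(augmented)
        if key in seen:
            continue  # duplicate of an original or of an earlier augmented sentence
        seen.add(key)
        kept.append(augmented)
    return kept


# (original, augmented) pairs, made up for illustration
pairs = [
    ("Habari ya asubuhi", "Habari za asubuhi"),  # changed: kept
    ("Asante sana", "Asante  sana"),             # same apart from spacing: dropped
]
print(keep_new_sentences(pairs))
```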
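Once exported, the file can be turned into training rows with Python's standard `csv` module. This is a minimal sketch under stated assumptions: the column names (`SWAHILI TEXT`, `AUGMENTED TEXT`, `LABEL`) are guesses about how the sheet is laid out, the function name is hypothetical, and the in-memory sample is made up so the example runs without a real file.

```python
import csv
import io
from typing import List, Tuple

# Assumed column names; adjust them to match the header row of your exported sheet.
ORIGINAL_COL = "SWAHILI TEXT"
AUGMENTED_COL = "AUGMENTED TEXT"
LABEL_COL = "LABEL"


def build_training_rows(csv_file) -> List[Tuple[str, str]]:
    """Return (text, label) rows: each original plus its augmented version if it differs."""
    rows = []
    for record in csv.DictReader(csv_file):
        original = record[ORIGINAL_COL].strip()
        augmented = record[AUGMENTED_COL].strip()
        label = record[LABEL_COL].strip()
        rows.append((original, label))
        # The augmented sentence inherits the label of the sentence it came from.
        if augmented and augmented != original:
            rows.append((augmented, label))
    return rows


# Made-up sample standing in for the exported CSV file.
sample = io.StringIO(
    "SWAHILI TEXT,ENGLISH TEXT,AUGMENTED TEXT,LABEL\n"
    "Habari ya asubuhi,Good morning,Habari za asubuhi,neutral\n"
    "Asante sana,Thank you very much,Asante sana,positive\n"
)
training_rows = build_training_rows(sample)
for text, label in training_rows:
    print(label, "|", text)
```

For a real export, open the file with `open(path, newline="", encoding="utf-8")` and pass the file object to `build_training_rows`.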
Text augmentation techniques in NLP are powerful tools for building models that generalize well.
Back translation offers an interesting approach when you have a small training set but want to improve the performance of your model.
Depending on which language you are working with, you are not limited to Google Translate; you can opt for other powerful tools for this task, such as transformer models.
Back translation won't assess quality of expression. This may be important, as most translations need to be more than just accurate: they also need to be well worded and read naturally in the target language.
Back translation also won't identify typos or grammatical and punctuation errors.
Be careful about which cases you use this technique in, to avoid mismatched results.