Custom Swahili named entity recognition using spaCy

Businesses may now extract information from semi-structured and unstructured data using named entity recognition as a cognitive processing tool to help improve processing capability and the efficiency of their everyday operations.

Neurotech Africa

Named Entity Recognition has been called a potential "game-changer" for business. Pulling even standard information out of a large corpus of boilerplate text by hand is a difficult and error-prone task, and NER has helped business operations around the world automate it.

Financial professionals, business leaders, and innovators are increasingly turning to artificial intelligence (AI) technologies so that they can spend less time discovering data and more time acting on the insights it holds. This is where Named Entity Recognition (NER) comes into play as one of the most useful tools of the AI era.

This article looks at NER as it is used at Neurotech Africa, a leading startup in Africa focused on creating powerful Artificial Intelligence and NLP algorithms to automate African businesses by providing sarufi solutions. I will explain common use cases and demonstrate how to create a custom Swahili named entity recognition model using the spaCy library.

Meaning: Named Entity Recognition

Named Entity Recognition (NER) is a technique that automatically identifies the important named entities mentioned in an unstructured text document and classifies them into pre-defined categories such as person names, organizations, locations, monetary values, and so on. The image below illustrates this.

Named Entity Recognition is a first step towards information retrieval tasks. It is also known as entity chunking, entity identification, or entity extraction, and it is used in fields such as Natural Language Processing (NLP) and Machine Learning.
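To make the definition concrete, the sketch below shows the shape of NER output: each entity is a stretch of text with character offsets and a category. It assumes a hand-written lookup table (the hypothetical `GAZETTEER`) instead of a trained model, so it only illustrates the input and output, not how a statistical model makes its decisions. The sentence and names are made up.

```python
# Toy illustration only: GAZETTEER and tag_entities are hypothetical names.
# A real NER model learns from context instead of matching a fixed list.
GAZETTEER = {
    "Asha Juma": "PERSON",
    "Acme Ltd": "ORGANIZATION",
    "Dar es Salaam": "LOCATION",
}

def tag_entities(text):
    """Return (start, end, surface_text, label) for every gazetteer hit."""
    found = []
    for surface, label in GAZETTEER.items():
        start = text.find(surface)
        while start != -1:
            found.append((start, start + len(surface), surface, label))
            start = text.find(surface, start + 1)
    return sorted(found)

sentence = "Asha Juma works at Acme Ltd in Dar es Salaam."
entities = tag_entities(sentence)
for start, end, surface, label in entities:
    print(f"{surface!r} [{start}:{end}] -> {label}")
```

The (start, end, label) form is the same one used for the training annotations later in this article.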

Myth: Named Entity Recognition isn't the future or important in digital businesses.

The power of Named Entity Recognition, in my opinion, comes from the ease with which basic models can be customized, or even built from scratch, to extract a specific business's information from a variety of data sources. This yields high commercial value, which is why NER is important to digital business and a large part of its future.

Let's examine how NER maps onto business use cases.


Use Cases: Relevance of Named Entity Recognition in Businesses' operations

The most successful business operations rely on their customers. Artificial intelligence-powered Named Entity Recognition tools can open up possibilities, in Africa and around the world, for driving economic value in business operations through user satisfaction. Here I will showcase some uses of Named Entity Recognition in business operations.

Automating and Simplifying Customer Support

NER can be used to recognize useful entities in customer complaints and feedback so that each message can be routed to the department in charge of the product it mentions. This saves time and cost and speeds up customer care and feedback handling, which adds business value. As a typical example, Neurotech provides entity recognition APIs that can be integrated into a business to automate the customer handling process.
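The routing idea can be sketched in a few lines. This is a minimal illustration, assuming an NER model has already returned (text, label) pairs for a complaint; the `ROUTES` table, the `PRODUCT` label, and the department names are all hypothetical.

```python
# Hypothetical routing table: product entity -> department. All names are made up.
ROUTES = {
    "mobile app": "Digital Support",
    "credit card": "Cards Team",
    "loan": "Lending Desk",
}

def route_ticket(entities, default="General Support"):
    """Pick a department from (text, label) pairs produced by an NER model."""
    for text, label in entities:
        if label == "PRODUCT" and text.lower() in ROUTES:
            return ROUTES[text.lower()]
    return default

# Entities as an NER model might return them for one complaint (illustrative).
ticket_entities = [("Dar es Salaam", "LOCATION"), ("Credit Card", "PRODUCT")]
print(route_ticket(ticket_entities))
print(route_ticket([("Arusha", "LOCATION")]))
```

A complaint with no recognized product falls back to the default queue, so nothing is dropped when the model misses an entity.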

Powering Recommendation Engine Algorithms

Recommendation systems govern how we find fresh content and ideas in an interconnected world. Named Entity Recognition may be used to build algorithms that automatically filter the information we might be interested in and help us uncover similar, previously undiscovered content based on our prior behavior. This increases customer engagement with products and brings more business value.
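One simple way to turn entities into recommendations is to compare the sets of entities found in each item. The sketch below assumes each article has already been reduced to its entity set by an NER model and ranks the others by Jaccard overlap; the article ids and entity sets are hypothetical, and a production recommender would combine many more signals.

```python
def jaccard(a, b):
    """Overlap between two entity sets: shared items divided by all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical articles, each reduced to the set of entities an NER model found.
article_entities = {
    "a1": {"WHO", "Oxford", "COVID-19"},
    "a2": {"WHO", "COVID-19", "Canada"},
    "a3": {"Tripoli", "Libya"},
}

def recommend(read_id, top_n=1):
    """Rank the other articles by entity overlap with the one just read."""
    scores = [
        (jaccard(article_entities[read_id], ents), other)
        for other, ents in article_entities.items()
        if other != read_id
    ]
    scores.sort(reverse=True)
    return [other for score, other in scores[:top_n] if score > 0]

print(recommend("a1"))
```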

Effective and Efficient optimization of search engine algorithms

A search engine's algorithm is a collection of rules that determines how listings are ranked in response to a search query. Instead of examining millions of online articles and websites for every query, a more efficient design is to run a NER model on the articles once and store the entities associated with each of them permanently. Queries can then be matched against the stored entities, which speeds up search and increases the business value.
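The "run NER once, store the entities" idea amounts to building an inverted index. Here is a minimal sketch, assuming the entities per document have already been extracted; the document ids and the `doc_entities` mapping are hypothetical.

```python
from collections import defaultdict

# Hypothetical corpus: document id -> entities an NER model extracted once, offline.
doc_entities = {
    "doc1": ["WHO", "Marekani"],
    "doc2": ["Oxford"],
    "doc3": ["WHO", "Tripoli"],
}

def build_index(doc_entities):
    """Map each lower-cased entity to the sorted list of documents mentioning it."""
    index = defaultdict(set)
    for doc_id, ents in doc_entities.items():
        for ent in ents:
            index[ent.lower()].add(doc_id)
    return {ent: sorted(ids) for ent, ids in index.items()}

index = build_index(doc_entities)

def search(query):
    """Answer a query with one dictionary lookup instead of rescanning every document."""
    return index.get(query.lower(), [])

print(search("who"))
```

The expensive NER pass happens at indexing time; each query is then a constant-time lookup.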

Implementation: Creating a Custom Swahili Named Entity Recognition Model using spaCy

I hope you now understand Named Entity Recognition, its importance, and its uses in business operations. Let's dive into our topic and see how to create a simple named entity recognition model for the Swahili language using spaCy. But wait, I can see you wondering: what is spaCy?

Meaning: spaCy

Simply put, spaCy is a Python-based open-source framework for sophisticated natural language processing. It is intended for production use and helps with building applications that process and "understand" massive amounts of text. In spaCy, Named Entity Recognition is done by the pipeline component ner, and it is easy to implement. In short, spaCy is to NLP what NumPy is to data science.

Now, let's get started.

Using the pre-trained spaCy NER model

Here we first explore the trained model called xx_ent_wiki_sm, a multilingual model trained on text in several languages. We use it because some languages, including Swahili, do not have a dedicated spaCy NER model. This solution was built on spaCy version 3.2.1, the latest version at the time of writing this article.

Let's start by installing the libraries to be used. The code below shows how to install spaCy and download the multi-language model.

! pip install -U spacy   #install spacy and upgrade to latest version
! python -m spacy download xx_ent_wiki_sm #download the multi language model
! python -m spacy info #checking the info about the spacy installed

Importing the necessary libraries in the project

import spacy
import xx_ent_wiki_sm  # multi-language model package installed by the download step
from tqdm import tqdm  # progress bar for loops
from spacy.tokens import DocBin  # container for efficiently serializing annotated docs
from spacy import displacy  # highlights the recognized named entities in a text
import warnings
warnings.filterwarnings("ignore")  # silence warnings in the notebook output
model = xx_ent_wiki_sm.load()  # load the multi-language model

Test the pre-trained NER model loaded above by giving it some text. Consider the code below.

text_swahili = "Mimi ni Innocent Charles , mjuzi wa akili bandia na sayansi ya data kutoka kampuni ya IPFsoftwares"  # Swahili input text
preds = model(text_swahili)  # run the pipeline to predict the named entities in the text
for ent in preds.ents:
    print(ent.text, ent.label_)  # print each named entity and its label
displacy.render(preds, style="ent", jupyter=True)  # highlight the entities inline in the notebook

Magic! Just like that, the pre-trained spaCy model does well at recognizing the named entities, as shown in the image below.

Let's explore the pre-defined entity labels recognized above by the trained spaCy NER model. Consider the code below.

print("PER Meaning:",spacy.explain("PER"))   #meaning of PER
print("ORG Meaning:",spacy.explain("ORG"))   #meaning of ORG
print("MISC Meaning:",spacy.explain("MISC")) #meaning of MISC
print("LOC Meaning:",spacy.explain("LOC"))   #meaning of LOC

Nice. The image below shows the meaning of each entity label, so now you know what NER is capable of. It was able to recognize a person's name and the organization where Innocent Charles might work.

From the images and code above, you can see that we used the already trained spaCy NER model as-is, without fine-tuning.

Now, let's create our own custom NER model for the Swahili language using spaCy.

Training a Custom Swahili NER Model using spaCy by Updating the Existing Pre-trained Multilingual Model

Preparation of custom data: here I have prepared some training data and validation data with pre-defined entities as labels. Consider the code below.

#training data 
Swahili_training_data=[
    ("Maafisa wa WHO wamesema kwa wiki kadhaa ufuatiliaji wa mlipuko huo umeangazia mabara ya Marekani, na idadi ya Jumapili imeonyesha ongezeko la siku moja la zaidi ya maambukizi 116,000 katika eneo Latin Amerika na Amerika ya Kaskazini.",{"entities":[[0,7,"MTU"],[11,14,"SHIRIKA"],[88,96,"MAHALI"],[110,118,"SIKU"],[175,182,"IDADI"],[195,208,"MAHALI"],[212,232,"MAHALI"]]}),
    ("Watu wawili waliojitolea walipatiwa chanjo hiyo Alhamisi mjini Oxford ambapo timu ya Chuo kikuu hicho ilitengeneza chanjo hiyo katika kipindi chini ya miezi mitatu.",{"entities":[[0,4,"MTU"],[5,11,"IDADI"],[48,56,"SIKU"],[63,69,"MAHALI"],[85,95,"SHIRIKA"]]})
]

#validation data 
Swahili_validation_data=[
    ("Canada, Russia na nchi nyingine pia wanashughulika kutengeneza chanjo, lakini wataalam wanasema hata kama itapatikana inayofaa hivi karibuni, utengenezaji wa chanjo hiyo na usambazaji wake unaweza kuchukua mwaka mmoja au zaidi.",{"entities":[[0,6,"MAHALI"],[8,14,"MAHALI"],[78,86,"MTU"],[206,217,"MUDA"]]}),
    ("Tafiti mbalimbali pia zinaonyesha dawa ya malaria hydroxychloroquine haiponyi virusi hivyo na pengine, ukweli ulivyo, inahatarisha maisha ya wagongwa wa COVID-19.",{"entities":[[42,49,"UGONJWA"],[50,68,"DAWA"],[141,149,"MTU"],[153,162,"UGONJWA"]]})
]
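Counting character offsets by hand is error-prone, so it can help to compute and check them in code. The following is a minimal sketch, not part of the original notebook: the helper names `annotate` and `check` are hypothetical, and `annotate` only handles the first occurrence of each mention.

```python
def annotate(text, mentions):
    """Build [start, end, label] triples from (surface, label) pairs.

    Uses the first occurrence of each surface string and raises if it is absent.
    """
    entities = []
    for surface, label in mentions:
        start = text.find(surface)
        if start == -1:
            raise ValueError(f"{surface!r} not found in text")
        entities.append([start, start + len(surface), label])
    return entities

def check(text, entities):
    """Return the surface text behind each triple so it can be inspected."""
    return [(text[start:end], label) for start, end, label in entities]

text = "Watu wawili waliojitolea walipatiwa chanjo hiyo Alhamisi mjini Oxford"
entities = annotate(text, [("Alhamisi", "SIKU"), ("Oxford", "MAHALI")])
print(entities)
print(check(text, entities))
```

Running `check` over hand-written annotations is a quick way to spot an offset that is off by one before training.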

#loading the pre trained model for doing fine tuning 
custom_NER_model=xx_ent_wiki_sm.load()

Double-check that the model is loaded. Consider the code below.

if(custom_NER_model):
  print("Existing Model is Loaded",custom_NER_model)
else:
  print("Existing Model is not Loaded")

Check the pipeline components and the entity labels each of them knows. Consider the code below.

print(custom_NER_model.pipe_names)
print(custom_NER_model.pipe_labels)
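The labels printed for the pre-trained ner component are the four explained earlier (PER, ORG, LOC, MISC), while the custom data uses Swahili labels. The sketch below compares the two sets. It assumes the pre-trained labels are typed in by hand and uses a shortened sample of the data; in the notebook the labels could be read from `custom_NER_model.get_pipe("ner").labels` instead.

```python
# Assumption: these are the four labels explained above with spacy.explain;
# in the notebook they would come from custom_NER_model.get_pipe("ner").labels.
PRETRAINED_LABELS = {"PER", "ORG", "LOC", "MISC"}

# A shortened sample in the same (text, {"entities": [...]}) format as the training data.
sample_data = [
    ("Maafisa wa WHO", {"entities": [[0, 7, "MTU"], [11, 14, "SHIRIKA"]]}),
    ("mjini Oxford", {"entities": [[6, 12, "MAHALI"]]}),
]

def collect_labels(data):
    """Gather every label used in the annotations."""
    return {label for _, annot in data for _, _, label in annot["entities"]}

custom_labels = collect_labels(sample_data)
new_labels = sorted(custom_labels - PRETRAINED_LABELS)
print("Labels the pre-trained model has never seen:", new_labels)
```

Every custom label is new to the pre-trained component, which is why the model has to be trained on the custom data before it can predict them.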

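Now the key step: the code below converts the prepared data into spaCy's binary data format with the .spacy extension and saves it to disk. The loaded model is used here only to tokenize each text with make_doc; the custom entity labels are stored with the saved annotations and are picked up from there at training time.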

# training data
train_db = DocBin()  # efficiently serializes a collection of Doc objects
for text, annot in tqdm(Swahili_training_data):
    doc = custom_NER_model.make_doc(text)  # tokenize only; creates the Doc object
    ents = []
    for start, end, label in annot["entities"]:
        # map character offsets to a token span; "contract" shrinks to whole tokens
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents  # attach the entity spans to the doc
    train_db.add(doc)
train_db.to_disk("Swahili_training_data.spacy")  # save the DocBin object

# validation data: use a fresh DocBin so the training docs are not saved into this file too
valid_db = DocBin()
for text, annot in tqdm(Swahili_validation_data):
    doc = custom_NER_model.make_doc(text)
    ents = []
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents
    valid_db.add(doc)
valid_db.to_disk("Swahili_validation_data.spacy")
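The "Skipping entity" branch fires when `char_span` returns None, which happens when the character offsets cannot be mapped onto whole tokens. With `alignment_mode="contract"`, the span is shrunk to the tokens that lie completely inside the offsets. The sketch below imitates that rule with plain whitespace tokens; this is a simplification, since spaCy's tokenizer also splits punctuation, and `contract_span` is a hypothetical helper rather than a spaCy function.

```python
import re

def contract_span(text, start, end):
    """Shrink [start, end) to the whitespace tokens that lie fully inside it.

    Returns the contracted (start, end), or None when no whole token fits.
    """
    inside = [
        m.span()
        for m in re.finditer(r"\S+", text)
        if m.start() >= start and m.end() <= end
    ]
    if not inside:
        return None
    return inside[0][0], inside[-1][1]

text = "chanjo hiyo Alhamisi mjini Oxford ambapo"
print(contract_span(text, 27, 33))  # offsets that match the token "Oxford" exactly
print(contract_span(text, 25, 33))  # starts inside "mjini": contracts to "Oxford"
print(contract_span(text, 28, 32))  # covers only part of a token: nothing to keep
```

If many entities are skipped during conversion, the offsets in the annotations are the first thing to check.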

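Next, create the config file for training. This file comes pre-filled with the necessary hyperparameters for the chosen pipeline and language, which saves defining them by hand in code. There are multiple ways of creating a config file, but the CLI is the simplest. Note that a config generated this way defines a new, blank ner component, so the training step that follows learns from the custom data from scratch rather than updating the weights of xx_ent_wiki_sm; to continue training the existing component, the config would have to source it from the pre-trained pipeline.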

! python -m spacy init config config.cfg --lang xx --pipeline ner --optimize efficiency

Finally, use spacy train with the config file to train the model on the prepared data in .spacy format, as shown below.

! python -m spacy train config.cfg --output ./ --paths.train ./Swahili_training_data.spacy --paths.dev ./Swahili_validation_data.spacy
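During training, spaCy reports entity-level precision, recall, and F-score on the validation data (the ENTS_P, ENTS_R, and ENTS_F columns) and keeps the best-scoring pipeline. The sketch below shows how such exact-match scores can be computed, assuming gold and predicted entities are given as sets of (start, end, label) triples. The `prf` helper and the predicted set are hypothetical, and this is an illustration of the metric rather than spaCy's own scorer.

```python
def prf(gold, predicted):
    """Exact-match precision, recall and F1 over sets of (start, end, label)."""
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

gold = {(0, 6, "MAHALI"), (8, 14, "MAHALI"), (78, 86, "MTU")}
predicted = {(0, 6, "MAHALI"), (8, 14, "MTU")}  # hypothetical model output
p, r, f = prf(gold, predicted)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

An entity counts as correct only when both its boundaries and its label match, so a right span with the wrong label lowers precision and recall alike.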

Load the custom Swahili NER model and test it on an unseen Swahili text.

model_test = spacy.load("./model-best")  # spacy train --output ./ saves the best pipeline to ./model-best
test_preds = model_test("Walinzi wa pwani ya Libya wamekamata wahamiaji 400 waliokuwa wakonjiani katika pwani ya Mediterranean ya nchi hiyo wakielekea Ulaya na kuwarejesha katika mji mkuu wa Tripoli masaa 24 yaliyopita, Shirika la uhamiaji la Umoja wa Mataifa UN limesema Jumapili.")
for ent in test_preds.ents:
    print(ent.text, ent.label_)  # print each recognized entity and its custom label
displacy.render(test_preds, style="ent", jupyter=True)  # display the recognized named entities in the given text

Nice job! We have managed to create a simple custom Swahili NER model using spaCy. In this article you have learned about NER and its business use cases, and seen how to use a pre-trained NER model and how to create a custom one with spaCy.

Bottom line

Following my recent exposure to NER, I am quite confident in stating that it is a highly helpful technique in a wide range of business scenarios. However, a number of difficulties must be considered to make the best use of NER.

On the other hand, the rapid advancement of deep learning algorithms as offered by Neurotech Africa and other organizations has resulted in far more powerful NLP models in recent years. You may consider contacting us now to upscale your business and make the most of it.

Author: Innocent Charles, machine learning data scientist and NLP developer advocate based in Africa, focuses on harnessing the power of data and technology to create smart solutions that address complex challenges around Africa. I'm eager to hear about your experience in the data space. Let's keep in touch on LinkedIn.