Custom Named Entity Recognition Model for Medical Purposes: Using a spaCy Model

Pramudya Dika
Jan 6, 2023

Hi! This time we will use custom named entity recognition (NER) to detect pathogens, medical conditions, and medicines in healthcare data.

Sounds good? Let’s get started!

Background and Goals

NER is a way for a computer to identify and classify specific pieces of information (entities) in text, such as the name of a disease or a drug. It has many applications, including in healthcare. By building and deploying a custom NER model, you can automate the extraction of these entities from medical records, saving time and effort.

However, there are challenges and considerations to keep in mind when building a model for healthcare data. That’s why I’m excited to dive into this topic with you today. We’ll talk about all the ins and outs of using custom NER for healthcare, and hopefully, by the end of our discussion, you’ll have a good understanding of how it all works and how you can use it in your projects.

spaCy model

spaCy is a free, open-source Python library for natural language processing (NLP), released under the MIT license. It is designed for building information extraction systems. spaCy works by converting text into a Doc object, which is then passed through a processing pipeline.

A trained pipeline typically includes a tagger, lemmatizer, parser, and entity recognizer. Each component takes the Doc, processes it, and returns it to the next component. spaCy is fast and easy to use, making it a popular choice for NLP tasks in various industries.
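To make the Doc and pipeline idea concrete, here is a minimal sketch. It assumes spaCy v3 is installed and uses a blank English pipeline, so no trained model is needed. The sample sentence is made up for illustration.

```python
import spacy

# A blank English pipeline has only a tokenizer, so nothing needs to be downloaded.
nlp = spacy.blank("en")

# Add an untrained entity recognizer, just to show how components are registered.
nlp.add_pipe("ner")
print(nlp.pipe_names)

# The tokenizer turns raw text into a Doc; each component would then update that Doc.
doc = nlp.make_doc("The patient was given azithromycin for meningitis.")
print([token.text for token in doc])
```

The entity recognizer here is untrained, which is why the sketch only calls the tokenizer through make_doc instead of running the full pipeline.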

Figure: spaCy language processing pipelines

Why Use spaCy?

  • Free and open source
  • Well-organized, structured, and easily accessible documentation on the spacy.io website
  • Capable of analyzing large volumes of structured or unstructured data
  • Popular, with many existing reference projects

How To Do It?

Importing Data Set

First, we import spaCy and download the English pipeline.
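A sketch of this step could look like the code below. The pipeline name en_core_web_sm is an assumption, since the article does not say which English pipeline was downloaded, and load_english_pipeline is a hypothetical helper. If the package is missing, the sketch prints the download command and falls back to a tokenizer-only pipeline.

```python
import spacy
from spacy.util import is_package

# Assumption: the article does not name the pipeline, so a small one is used here.
MODEL_NAME = "en_core_web_sm"


def load_english_pipeline(name=MODEL_NAME):
    """Load an installed English pipeline, or fall back to a tokenizer-only one."""
    if is_package(name):
        return spacy.load(name)
    print(f"{name} is not installed. Install it with: python -m spacy download {name}")
    return spacy.blank("en")


nlp = load_english_pipeline()
print(nlp.lang, nlp.pipe_names)
```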

Then, we import the dataset. You can find this dataset at this link. In this step, we also check its contents.
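Loading and inspecting an annotated dataset might look roughly like this. The file name medical_ner.json, the record schema (text plus a list of entities), and the label names are all assumptions for illustration. The real dataset may be structured differently.

```python
import json
from collections import Counter
from pathlib import Path


def load_records(path):
    """Read a JSON file that holds a list of annotated records."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def label_counts(records):
    """Count how many entities of each label the records contain."""
    return Counter(ent["label"] for rec in records for ent in rec["entities"])


# Tiny stand-in for the real dataset (hypothetical schema and labels).
SAMPLE = [
    {
        "text": "Azithromycin treats meningitis.",
        "entities": [
            {"start": 0, "end": 12, "label": "Medicine"},
            {"start": 20, "end": 30, "label": "MedicalCondition"},
        ],
    }
]

print(len(SAMPLE), "record(s)")
print(label_counts(SAMPLE))
# With a real file: records = load_records("medical_ner.json")  # hypothetical file name
```

Counting labels early is a cheap way to spot classes that have very few examples.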

Prepare the Training Data

After that, we prepare the training data. For each annotated entity, we record a start index, an end index, and a label.

An example of a training record is shown below.
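As a rough illustration of the start/end/label format, the sketch below turns raw annotations into a training record. The function to_training_record, the annotation schema, and the label names are hypothetical and not taken from the original notebook.

```python
def to_training_record(text, annotations):
    """Turn raw annotations into a record with (start, end, label) tuples."""
    entities = []
    for ann in annotations:
        start, end = ann["start"], ann["end"]
        # Skip annotations whose offsets fall outside the text.
        if not 0 <= start < end <= len(text):
            continue
        entities.append((start, end, ann["label"]))
    return {"text": text, "entities": sorted(entities)}


text = "E. coli can cause meningitis."
annotations = [
    {"start": 18, "end": 28, "label": "MedicalCondition"},
    {"start": 0, "end": 7, "label": "Pathogen"},
]
record = to_training_record(text, annotations)
print(record)

# Slicing the text with each pair of offsets should give back the entity itself.
for start, end, label in record["entities"]:
    print(label, "->", text[start:end])
```

The start index is inclusive and the end index is exclusive, so text[start:end] returns exactly the entity text.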

Create Config for the spaCy model

Next, we convert our training data into spaCy Doc objects so that the spaCy model can train on it.
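One common way to do this conversion in spaCy v3 is to collect the Doc objects in a DocBin, which can be saved as a .spacy file for training. The sketch below assumes records with (start, end, label) tuples. The sample record, the label names, and the file name train.spacy are illustrative.

```python
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans

# Hypothetical training record in (start, end, label) form.
TRAIN_DATA = [
    {
        "text": "Azithromycin treats meningitis.",
        "entities": [(0, 12, "Medicine"), (20, 30, "MedicalCondition")],
    }
]


def build_doc_bin(records, nlp):
    """Convert annotated records into a DocBin of Doc objects with entities set."""
    doc_bin = DocBin()
    for record in records:
        doc = nlp.make_doc(record["text"])
        spans = []
        for start, end, label in record["entities"]:
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is None:
                continue  # offsets do not line up with token boundaries
            spans.append(span)
        # filter_spans drops overlapping spans, which doc.ents does not allow.
        doc.ents = filter_spans(spans)
        doc_bin.add(doc)
    return doc_bin


nlp = spacy.blank("en")
doc_bin = build_doc_bin(TRAIN_DATA, nlp)
print(len(doc_bin), "doc(s) converted")
# To save for training: doc_bin.to_disk("train.spacy")  # assumed file name
```

Spans that come back as None are worth logging in a real project, because they usually point to annotation offsets that cut through a token.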

After that, we create a base config by following the guidelines on the spacy.io website.

This time, we select only the NER component.

Then we create a new base config file.

From the base config, we then generate the full config, with all remaining settings filled in.

A new config file is created.
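The step from base config to full config can be sketched with spaCy's init fill-config command, called here from Python. The file names base_config.cfg and config.cfg are assumptions, and the command only runs if the base config is actually present.

```python
import subprocess
import sys
from pathlib import Path

BASE_CONFIG = Path("base_config.cfg")  # assumed name of the file saved from spacy.io
FULL_CONFIG = Path("config.cfg")  # assumed name of the filled config


def fill_config_command(base_path, output_path):
    """Build the `spacy init fill-config` command as an argument list."""
    return [
        sys.executable, "-m", "spacy", "init", "fill-config",
        str(base_path), str(output_path),
    ]


command = fill_config_command(BASE_CONFIG, FULL_CONFIG)
print(" ".join(command))

# Only run the command when the base config is actually present.
if BASE_CONFIG.exists():
    subprocess.run(command, check=True)
```

Using sys.executable keeps the command tied to the same Python environment in which spaCy is installed.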

Train using spaCy model

Now we can train the spaCy model on our dataset.

Training for 25 epochs (EPOCH = 25), we got an average score of 0.94.

The training process creates a new model-best folder, which holds the best-scoring pipeline.
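A minimal sketch of the training step, using spaCy's Python entry point for training, is shown below. All paths are assumed names, and mapping the 25 epochs to the training.max_epochs setting is also an assumption about how the epoch count was set. Training only starts if the config and data files exist.

```python
from pathlib import Path

from spacy.cli.train import train

# Assumed file and folder names; adjust them to your own project layout.
CONFIG_PATH = Path("config.cfg")
TRAIN_PATH = Path("train.spacy")
DEV_PATH = Path("dev.spacy")
OUTPUT_DIR = Path("output")


def build_overrides(train_path, dev_path, max_epochs):
    """Collect the config values to override at training time."""
    return {
        "paths.train": str(train_path),
        "paths.dev": str(dev_path),
        "training.max_epochs": max_epochs,
    }


overrides = build_overrides(TRAIN_PATH, DEV_PATH, max_epochs=25)
print(overrides)

# Only start training when the config and both data files are present.
if all(path.exists() for path in (CONFIG_PATH, TRAIN_PATH, DEV_PATH)):
    train(CONFIG_PATH, OUTPUT_DIR, overrides=overrides)
    print("Best pipeline saved to", OUTPUT_DIR / "model-best")
```

spaCy writes both model-best and model-last into the output folder, so you can compare the best checkpoint with the final one.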

Finally, we can use our model to predict the entities in a new document.
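Prediction with the trained pipeline could be sketched as follows. The folder output/model-best and the sample sentence are assumptions, and extract_entities is a hypothetical helper. The model is only loaded if that folder exists.

```python
from pathlib import Path

import spacy

MODEL_DIR = Path("output/model-best")  # assumed location of the trained pipeline


def extract_entities(nlp, text):
    """Run a pipeline on the text and return (entity text, label) pairs."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


# Only load the trained pipeline when it is actually on disk.
if MODEL_DIR.exists():
    nlp = spacy.load(MODEL_DIR)
    sample = "The patient with meningitis tested positive for E. coli and was given Azithromycin."
    for entity_text, label in extract_entities(nlp, sample):
        print(label, "->", entity_text)
```

For a visual check in a notebook, spaCy's displacy.render(doc, style="ent") highlights the predicted entities in the text.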

Result

Conclusion

Based on the training and prediction results, our model can accurately identify pathogens, medications, and health conditions in a medical document. For example, the model identifies E. coli as a pathogen, meningitis and stomach ache as health conditions, and Azithromycin as a medication (antibiotic).
