Named entity recognition using CoreNLP
Chatbots are a great way to interact with customers and get things done quickly. Instead of making the user click buttons, fill out forms and scroll endlessly, a chatbot can act like a companion and amuse the user with trivia while delivering value in a more conversational, human way.
One of the key tasks in the working of a chatbot is Named Entity Recognition (NER). Wikipedia describes a named entity as a real-world object such as a person, a location or a food item. Below you can see a few examples of named entities.
CoreNLP supports a few types of entities out of the box, such as person, time and date. Here is the list of all the named entities supported by CoreNLP. The other entities mentioned above are custom entities and have to be extracted separately. To achieve this we are going to use CoreNLP’s RegexNER.
As described on their website, RegexNER is a rule-based interface for doing custom entity recognition. A simple text file that lists the entities must be provided at extraction time, and the pipeline takes care of annotating each matching token with the corresponding NER tag. A sample file might look like this.
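As a minimal sketch of such a file, the Java helper below builds a few mapping lines and can write them to disk. The entries (`pizza`, `ice cream`, `masala dosa`), the `FOOD` label and the class name `EntityFileSample` are made up for illustration and are not taken from the original project. Each line holds the text to match, a tab character (written as `\t` so it cannot be confused with spaces), and the entity type.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: produces a tiny RegexNER-style mapping file.
final class EntityFileSample {

    // One entry per line: <text to match> TAB <entity type>.
    // The entries and the FOOD label are illustrative assumptions.
    static List<String> sampleLines() {
        return Arrays.asList(
                "pizza\tFOOD",
                "ice cream\tFOOD",
                "masala dosa\tFOOD");
    }

    // Writes the sample lines as UTF-8 text to the given path.
    static Path write(Path target) throws IOException {
        return Files.write(target, sampleLines(), StandardCharsets.UTF_8);
    }
}
```

Calling `EntityFileSample.write(Paths.get("entities.txt"))` would produce a three-line file with exactly one tab per line.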
Note that the columns are separated by tabs, not spaces. The formatting of the file is very important: if it is malformed, the extraction fails and an exception is thrown. Let’s dig in.
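Because a malformed line only shows up as an exception once the pipeline loads the file, a small pre-check can catch it earlier. The sketch below is not part of CoreNLP. It assumes the default RegexNER layout of two to four tab-separated columns (text to match, entity type, and optionally overwritable types and a priority), and the class name `EntityFileCheck` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-check for a RegexNER-style mapping file.
final class EntityFileCheck {

    // Returns one message per suspicious line; an empty list means no problems were found.
    static List<String> problems(List<String> lines) {
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            // A limit of -1 keeps trailing empty fields, so a dangling tab is noticed.
            String[] fields = lines.get(i).split("\t", -1);
            if (fields.length < 2 || fields.length > 4) {
                // Blank lines and space-separated lines end up here, to stay on the safe side.
                problems.add("line " + (i + 1) + ": expected 2 to 4 tab-separated fields, found "
                        + fields.length);
            } else if (fields[0].trim().isEmpty() || fields[1].trim().isEmpty()) {
                problems.add("line " + (i + 1) + ": text to match and entity type must not be empty");
            }
        }
        return problems;
    }
}
```

Feeding it the result of `Files.readAllLines(path)` before building the pipeline gives a readable list of offending lines instead of a stack trace.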
Before setting up the extraction pipeline we will set the properties that will include our custom entity file.
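A minimal sketch of that setup, using only `java.util.Properties`, might look like the following. The `annotators` and `regexner.mapping` keys are the property names documented for CoreNLP and RegexNER; the exact annotator list and the class name `RegexNerProps` are assumptions, since the original snippet is not shown.

```java
import java.util.Properties;

// Hypothetical helper that assembles the pipeline properties.
final class RegexNerProps {

    // fileName is the location of the custom entity file, e.g. "entities.txt".
    static Properties build(String fileName) {
        Properties props = new Properties();
        // Assumed annotator list: regexner runs last, after the standard ner annotator.
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,regexner");
        // Points RegexNER at the custom mapping file.
        props.setProperty("regexner.mapping", fileName);
        return props;
    }
}
```

A `Properties` object built this way is what the `StanfordCoreNLP` constructor accepts.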
The fileName is the path of the custom entity file. You can provide it from the file system, a URL or the classpath. Here I will include the entities.txt file in the resources folder of the project and the library will pick it up from there. Let us build the pipeline now.
We have built an annotation pipeline and provided the necessary properties, including our entity file. The dateFormat is nothing but an instance of SimpleDateFormat with a date pattern.
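The pattern behind dateFormat is not shown here, so the sketch below assumes `yyyy-MM-dd`, an ISO-style form that is a common choice when a document date is handed to a pipeline as a string. The class name `DocDateSketch` and the explicit time zone parameter are also assumptions, added so that the result does not depend on the machine's default zone.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper that renders a date the way dateFormat might.
final class DocDateSketch {

    // Formats the given instant as yyyy-MM-dd in the given time zone.
    static String format(Date date, TimeZone zone) {
        // SimpleDateFormat is not thread-safe, so a fresh instance is created per call.
        SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd");
        dateFormat.setTimeZone(zone);
        return dateFormat.format(date);
    }
}
```

For example, `DocDateSketch.format(new Date(), TimeZone.getDefault())` yields today's date as a string.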
The annotation now contains all the annotated data, but we cannot see it yet. So let’s extract the output.
Usually, if you look at blogs related to CoreNLP, everybody extracts a CoreMap instance from the annotation, iterates through each token and constructs the output by hand. I find that a little cumbersome and hence came up with a way to avoid it.
The above line of code does a few things behind the scenes. It extracts the annotated data from the annotation, constructs a JSON document and writes it to the output stream that is provided to it.
For the sake of convenience I am just going to print it to the console. Feel free to provide any other stream, such as a file or even an HTTP response stream (just set the content type header to application/json if you are using the HTTP servlet response stream).
Once you run this code you will find a neat JSON printed on your console, and the ner field for each matched token will contain the value that you specified in the file.
The complete source code can be found on GitHub.