In information extraction (IE) frameworks, finding matches for a word in a text is a very common problem. The matching is often based on a given input dictionary. This task is called Named Entity Recognition (NER), and it is useful for classifying the words in a text. For instance, we could have a dictionary with a set of city names (São Paulo, Rio de Janeiro, Paris, London, etc.) and use it to find occurrences of these city names in a text. There are several kinds of dictionaries, storing city/country names, drug names, product names, and so on. The component that finds such matches in an IE framework is called a Gazetteer. Gazetteers typically match exact words, i.e., to find “São Paulo” in a given text, it must be written exactly as “São Paulo”. This is a problem if the text contains misspellings or typing errors, such as “So Paulo”, “Sao Paulo”, “S.Paulo”, etc.
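To make the limitation of exact matching concrete, here is a minimal sketch of a dictionary lookup in Python. It is not the code of any GATE gazetteer; the dictionary `CITY_NAMES`, the function `find_exact`, and the two sample sentences are hypothetical and exist only for illustration.

```python
import re

# Hypothetical toy dictionary; a real gazetteer would load many more entries.
CITY_NAMES = {"São Paulo", "Rio de Janeiro", "Paris", "London"}


def find_exact(text, entries):
    """Return (start, end, entry) for every exact, whole-word occurrence."""
    matches = []
    for entry in entries:
        pattern = r"\b" + re.escape(entry) + r"\b"
        for m in re.finditer(pattern, text):
            matches.append((m.start(), m.end(), entry))
    return sorted(matches)  # ordered by position in the text


clean = "She flew from São Paulo to Paris."
noisy = "She flew from Sao Paulo to Prais."

# Both cities are found in the clean sentence.
print([entry for _, _, entry in find_exact(clean, CITY_NAMES)])
# Nothing is found once the names contain typing errors.
print(find_exact(noisy, CITY_NAMES))
```

A single missing accent or a swapped pair of letters is enough for the exact lookup to miss the entity entirely.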
Approaches that find occurrences of words with errors often use Approximate String Matching (ASM) techniques, commonly based on the Edit Distance (ED) measure. However, this measure does not capture all aspects of text errors, such as phonetic ones, where a misspelling sounds like the correct word but differs from it by several characters.
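As an illustration of the Edit Distance measure, the sketch below computes the classic Levenshtein distance with dynamic programming. The function name `edit_distance` and the sample word pairs are our own choices for this example, and the plug-in may compute the distance differently.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca by cb, or keep a match
            ))
        prev = curr
    return prev[-1]


# A dropped letter costs a single edit.
print(edit_distance("Sao Paulo", "So Paulo"))
print(edit_distance("amoxicillin", "amoxicilin"))
# A spelling that sounds the same can still be several edits away.
print(edit_distance("Philadelphia", "Filadelfia"))
```

With a small maximum distance, the first two pairs would match, while the last pair would be rejected even though both spellings are pronounced alike. This is the kind of error that a purely character-based measure handles poorly.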
We have implemented a Gazetteer for the GATE framework (General Architecture for Text Engineering) that integrates Approximate String Matching and phonetic information. The goal was to find alternative representations that could be more efficient than existing methods. We encoded the input text and the dictionaries using a phonetic conversion function called Metaphone.
| Correct word | Metaphone representation |
| --- | --- |
| medroxalol | MTRKSLL |
| amoxicillin | AMKSSLN |
| New York | NYRK |
| Avondale Estates | AFNTLSTTS |
This means that both the input text and the dictionary are encoded using the representation in the second column of the table above. The main advantage is a reduced input dictionary size (about 70%) while still obtaining good matching results. This makes it an alternative implementation for cases where the size of the dictionary matters.
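The idea of comparing phonetic keys instead of raw strings can be sketched as follows. This is only an illustration under strong assumptions: `toy_phonetic_key` is a deliberately simplified stand-in and is not Metaphone, so its keys differ from those in the table, and the helpers `build_index` and `lookup`, as well as the `max_distance` parameter name, are hypothetical and do not reflect the plug-in's actual code.

```python
import unicodedata


def toy_phonetic_key(text):
    """Simplified stand-in for a phonetic encoder (NOT Metaphone):
    strip accents and non-letters, map PH to F, drop non-initial
    vowels and collapse doubled letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    letters = "".join(c for c in decomposed if c.isascii() and c.isalpha())
    letters = letters.upper().replace("PH", "F")
    key = []
    for i, ch in enumerate(letters):
        if i > 0 and ch in "AEIOU":
            continue  # keep a vowel only in first position
        if i > 0 and letters[i - 1] == ch:
            continue  # collapse doubled letters
        key.append(ch)
    return "".join(key)


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def build_index(entries, encode):
    """Map each phonetic key to the dictionary entries that produce it."""
    index = {}
    for entry in entries:
        index.setdefault(encode(entry), []).append(entry)
    return index


def lookup(candidate, index, encode, max_distance=1):
    """Return entries whose key is within max_distance edits of the
    candidate's key."""
    key = encode(candidate)
    hits = []
    for entry_key, entries in index.items():
        if edit_distance(key, entry_key) <= max_distance:
            hits.extend(entries)
    return sorted(hits)


cities = ["São Paulo", "Rio de Janeiro", "Paris", "London"]
index = build_index(cities, toy_phonetic_key)

for candidate in ["Sao Paulo", "S.Paulo", "Lundon", "Pariz"]:
    print(candidate, "->", lookup(candidate, index, toy_phonetic_key))
```

Because the index stores the encoded keys, which are shorter than the original entries, the dictionary becomes smaller, and variants that share a key match even with a maximum distance of zero. A real phonetic function such as Metaphone applies far more pronunciation rules than this toy encoder.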
Junior Ferri, with the help of Hegler Tissot, implemented the plug-in for GATE, and it is available for download at https://gitlab.c3sl.ufpr.br/faes/asm/tree/master. One strong point of the implementation is that several parameters can be configured, such as the maximum edit distance, the phonetic conversion class, whether matching is case sensitive, and others. The full documentation is available at https://gitlab.c3sl.ufpr.br/faes/asm/wikis/home. We have also published the details of the approach at ADBIS 2018, in Budapest. The paper is available here.