Text analytics as a service

Natural Language Processing

Tag: mhenrikson

Named Entity Recognition Service

We’ve just (re-)launched our named entity recognition service. This builds on the NICTA Named-Entity-Recogniser work of Scott Sanner, Kishor Gawande, William Han, Paul Rivera and Kin Hon Chan.

The new version has been improved by Mats Henrikson and is open-sourced under the GPLv3.0. The public demo web page is hosted on Amazon at http://ner.t3as.org

Sno-code update

Now with configuration options!

We’ve just launched the latest update for our SnoMed-CT text coder. Users can now select which semantic types are used.

SnoMed coder screenshot

SnoMed coder with options

At present we are using a heuristic to filter the semantic types that MetaMap returns to restrict the hits to clinical terms. We will improve the heuristic over time.

The new feature allows the user to select which semantic types are used by MetaMap. Our defaults (the heuristic defaults) are listed, but perhaps you really want to include a search for codes that match fish. In that case, press the “configure” button, select the semantic type “fish” and re-analyse the text.
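To make the idea of filtering by semantic type concrete, here is a minimal sketch in Python. It is not the service's code: the Hit class, the filter_hits function, the default whitelist and the concept identifiers are all hypothetical, and only the semantic type abbreviations (dsyn, sosy, fndg, fish) are taken from the UMLS.

```python
from dataclasses import dataclass

# Hypothetical default whitelist of UMLS semantic type abbreviations;
# the service's real heuristic defaults are not reproduced here.
DEFAULT_TYPES = frozenset({"dsyn", "sosy", "fndg"})


@dataclass(frozen=True)
class Hit:
    """One candidate concept, shaped loosely like a MetaMap result."""
    concept_id: str            # placeholder identifier, not a real CUI
    preferred_name: str
    semantic_types: frozenset  # UMLS semantic type abbreviations


def filter_hits(hits, allowed_types=DEFAULT_TYPES):
    """Keep hits that share at least one semantic type with the allowed set."""
    return [h for h in hits if h.semantic_types & allowed_types]


# Invented example data, for illustration only.
hits = [
    Hit("C0000001", "example disorder", frozenset({"dsyn"})),
    Hit("C0000002", "example fish", frozenset({"fish"})),
]

# With the defaults, the fish concept is dropped.
default_names = [h.preferred_name for h in filter_hits(hits)]

# After "configuring" the extra type, it is kept.
with_fish = [h.preferred_name for h in filter_hits(hits, DEFAULT_TYPES | {"fish"})]
```

Selecting “fish” in the configuration corresponds to widening the allowed set, as in the last line of the sketch.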

As this is a demo system, we don’t store cookies, so if you open a new browser window you’ll have to re-configure.


A Snomed CT Text Analysis suite

We’ve released our work on Snomed-CT text analysis. Our first step is a public web service that can analyse English text and report any SNOMED CT concepts that are detected. The service is available at http://snomedct.t3as.org/

The service uses the National Library of Medicine’s software product MetaMap and the National Library of Medicine’s Unified Medical Language System (UMLS). The screenshot below is generated from the web page, using the input text of Figure 3 of Sager et al., Automatic encoding of Snomed III.

Snomed Coder Screenshot


If you would like to try out some other clinical texts, you might try the Medical History texts from Monash University.

Please note: This demonstration service runs on a publicly accessible server that is not geographically constrained. All text entered in the web page is sent in clear text to the Amazon service. Please do not upload private clinical documents.

The page is a simple front-end for the public web service that can analyse English text and report SNOMED CT concepts that are detected. At present we are using a heuristic to filter the semantic types that MetaMap returns to restrict the hits to clinical terms. We will improve the heuristic over time.

The next feature we add will display the semantic type of each returned concept. This will feed into a system that lets the user view or search over specific types, with some defaults set.

For developers, have a look at the service page to find out how to use the service as an API. We will shortly be releasing the source code as well.


“SNOMED CT or SNOMED Clinical Terms is a systematically organized computer processable collection of medical terms providing codes, terms, synonyms and definitions used in clinical documentation and reporting.” (Wikipedia)

More information on the Australian terminology SnoMed CT-AU can be found at the NEHTA site.


MetaMap is the most commonly used tool for codifying and relating medical concepts in clinical text analysis applications. We found only 1 hit on PubMed for the similar system Mgrap and 12 hits for KnowledgeMap, compared with 74 hits for MetaMap. MetaMap has resulted in substantial performance gains. For example,

With thanks to Mats Henrikson, Hanna Suominen and Neil Bacon

Text analytics for: Patent classification

In this project, we developed a searchable index of patent classification codes that allows search by text and by code. We also extended this to allow users to explore the classification hierarchy. This blog entry describes the demonstration web page.

Pat Class Web

Patent Classification pat-clas.t3as.org

For those unfamiliar with reading patents, we refer to The Lens and the tutorial on how to read a patent. Within patents, classification codes provide significant benefit in understanding and searching patents: patents with similar codes are likely to refer to similar content. From the British Library (emphasis added):

“The usefulness of patent classification as a means of searching for patents information is a by-product of its primary purpose as a tool for patent examiners. Using patent classification as part of a search to identify patents in a particular field can help the non-expert searcher to focus and refine his search and produce a useful set of references… However, it is a massive and complex tool designed for an expert user group and when it is used by anyone outside that user group it should be applied with care.”

For the non-expert, classification codes are difficult to use. For example, a patent for a locomotive on The Lens has two IPC classification codes associated with it: B61C17/04 and B61D27/00. What do these codes mean?

Entry point: web page

The text analytics service for this project is hosted at pat-clas.t3as.org and has a public GitHub repository for all code. The web page is an open HTML file that accesses the service APIs and presents a simple interface for text- or code-based search of CPC, IPC, and USPTO classification codes.


Free text search

The first field of the web page allows users to enter free text and returns matching classification codes. Following our example, let’s choose IPC, and enter “locomotive”. The search returns all IPC codes that contain “locomotive” in their text. The list is sorted by relevance (rank), with all relevant search terms highlighted in the codes. The search returns at most the top 50 items.

The next field allows the user to find the context of a given code. Let’s try B61C17/04:

Code context


The context is built up from the parent codes in the classification hierarchy, with their associated text stubs. Explore can also be used to view the hierarchy of the classification system, for example to find siblings of the code B61C17/04.
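As a rough illustration of what “parent codes” means for an IPC symbol, the Python sketch below splits a symbol such as B61C17/04 into the levels that can be read directly from its spelling: section, class, subclass and main group. This is not the service's implementation, and symbol_ancestors and IPC_PATTERN are hypothetical names. It also cannot recover intermediate subgroups, whose nesting is defined in the classification scheme itself rather than in the symbol, so the real context may contain more levels than shown here.

```python
import re

# Section letter, two-digit class, subclass letter, main group "/" subgroup.
IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])(\d{1,4})/(\d{2,6})$")


def symbol_ancestors(code):
    """Return the ancestors readable from the symbol, outermost first."""
    match = IPC_PATTERN.match(code)
    if match is None:
        raise ValueError(f"not a full IPC symbol: {code!r}")
    section, cls, subclass, main_group, subgroup = match.groups()
    ancestors = [
        section,                   # e.g. "B"
        section + cls,             # e.g. "B61"
        section + cls + subclass,  # e.g. "B61C"
    ]
    # A symbol ending in "/00" is itself a main group, so only true
    # subgroups get the main group added as an ancestor.
    if subgroup.strip("0"):
        ancestors.append(f"{section}{cls}{subclass}{main_group}/00")
    return ancestors


context = symbol_ancestors("B61C17/04")
```

For B61C17/04 this yields B, B61, B61C and B61C17/00; pairing each of these with its text stub gives a context listing of the kind described above.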

Classification hierarchy for Locomotive


The screen flow below outlines how to use the web page.

Under the hood: How it works

All the code is publicly available on GitHub. If you are interested in developing applications that use this code, you should read the README on GitHub.

  1. CPC/IPC/USPTO codes are converted to a list of string descriptions:
    • one for the code itself and
    • one for each ancestor in the hierarchy.

    This is a very simple database app with XML processing to populate the database.

  2. Given a text query, find CPC/IPC/USPTO codes that have descriptions matching the query. A very simple Lucene search app.
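The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy scheme, its codes and its texts are invented, the function names are hypothetical, and a naive term count stands in for the Lucene ranking that the real service uses. The sketch also assumes that a query is matched against each code's own description.

```python
# Toy hierarchy: code -> (parent code or None, description). Codes and
# texts are invented placeholders, not real CPC/IPC/USPTO entries.
TOY_SCHEME = {
    "T": (None, "transport"),
    "T1": ("T", "rail vehicles"),
    "T1A": ("T1", "locomotives"),
    "T1B": ("T1", "passenger carriages"),
}


def descriptions(code, scheme):
    """Step 1: the code's own description, then one per ancestor."""
    result = []
    while code is not None:
        parent, text = scheme[code]
        result.append(text)
        code = parent
    return result


def search(query, scheme, limit=50):
    """Step 2: rank codes by how many query terms their description contains.

    A naive stand-in for a Lucene query; ties are broken by code.
    """
    terms = query.lower().split()
    scored = []
    for code, (_parent, text) in scheme.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score:
            scored.append((score, code))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [code for _score, code in scored[:limit]]


context = descriptions("T1A", TOY_SCHEME)
matches = search("locomotives", TOY_SCHEME)
```

The limit of 50 mirrors the cap on search results mentioned for the demonstration page; a real index would also handle stemming, so that “locomotive” matches “locomotives”, which this sketch does not.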

We’ll post more details soon.

The fine print

  • The service is hosted on Amazon Web Services, with uptime on a best-effort basis and no redundancy.
  • If requested, we may upgrade the hosting to production grade with hardware redundancy.
  • The web page/user interface is designed as a demonstration of the underlying web services, and is not intended to be a user interface designed for any particular use case.

This combines work from Neil Bacon, Gabriela Ferraro and Mats Henrikson