Text analytics as a service

Natural Language Processing

Named Entity Recognition Service

We’ve just (re-)launched our named entity recognition service. This builds on the NICTA Named-Entity-Recogniser work of Scott Sanner, Kishor Gawande, William Han, Paul Rivera and Kin Hon Chan.

The new version has been improved by Mats Henrikson and is open-sourced under GPLv3. The public web-page demo is hosted on Amazon at http://ner.t3as.org.
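For developers who want to script against the demo, a call might look like the sketch below. The endpoint path and response shape are assumptions made for illustration only; check http://ner.t3as.org for the actual API documentation.

```python
# Hedged sketch of calling the NER web service over HTTP.
# The path "/api/analyse" and the JSON response shape are assumptions,
# not the documented API; see http://ner.t3as.org for the real details.
import requests

text = "Scott Sanner worked at NICTA in Canberra."

resp = requests.post(
    "http://ner.t3as.org/api/analyse",   # hypothetical endpoint
    data=text.encode("utf-8"),
    headers={"Content-Type": "text/plain"},
)
resp.raise_for_status()
for entity in resp.json():               # assumed: JSON list of entities
    print(entity)
```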

Patent classification service update

We’ve just made a new version of the patent classification service live.

Changes are:

  • Radio buttons to say what you want to happen when you click on a symbol link (populate Context or Explore).
  • Support for query terms like Symbol:A23* and a Symbol Prefix field (two ways to do the same thing; the former is more flexible, the latter more user-friendly; examples below).
  • Options for unstemmed search.
  • Term auto-complete using exact and fuzzy matches; exact matches are shown first, and both are sorted so that terms matching more classifications come first (other sort orders would be possible).
  • Added a link to the query syntax documentation, which also documents the fields available for querying.
  • Fixed bugs with && and with IPC formatting. There may still be an issue in that IPC “B65H75/00” will always be shown as “B65H75”, whereas CPC can show “B65H75” with a child “B65H75/00”. This is because the CPC data uses this human-friendly format, whereas the IPC data uses the A99AZMMMGGGGGZ format, which we have to reformat for display.
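To make the query-term support concrete, here are a few illustrative query strings of the kind described above. Only Symbol:A23* and the && operator come from this post; combining a field query with a text term is our assumption, so consult the query syntax documentation for the authoritative field list.

```python
# Illustrative patent-classification query strings. Symbol:A23* and the
# && operator are described in the post above; combining them is an
# assumption on our part.
queries = [
    "Symbol:A23*",           # query-term form of a symbol-prefix search
    "locomotive && brake",   # && requires both terms to match
    "Symbol:B61* && brake",  # field query combined with a text term (assumed)
]
for q in queries:
    print(q)
```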

Patent classification update: web browser testing

We’ve now done some extensive browser testing for the web-demo of the patent classification service. The web-demo works on

  • Chrome
  • Firefox
  • Safari
  • Android phones, and
  • iDevices (iPad, iPhone).
  • Internet Explorer is a bit trickier, but the following have been tested:
    • IE 11 on Win 8.1
    • IE 10 on Win 7
    • IE 9 on Win 7

The service itself is agnostic to browsers.

As IE8 doesn’t support manipulation of XML elements embedded in HTML (which the demo uses extensively), IE8 users will need to wait for an update, either from us or by upgrading to a recent version of IE.

Sno-code update

Now with configuration options!

We’ve just launched the latest update for our SNOMED CT text coder. We now allow users to select which semantic types are used.

Screenshot: SnoMed coder with options

At present we are using a heuristic to filter the semantic types that MetaMap returns to restrict the hits to clinical terms. We will improve the heuristic over time.

The new feature allows the user to select which semantic types are used by MetaMap. Our defaults (the heuristic defaults) are listed, but perhaps you really want to include a search for codes that match fish. If so, press the “Configure” button, select the semantic type “fish”, and re-analyse the text.
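For the curious, the filtering itself amounts to something very simple. The sketch below is a minimal illustration of that idea, not the service’s actual code: keep only the concepts whose semantic type is in the configured set. The concept records and CUIs below are illustrative stand-ins for MetaMap output.

```python
# Minimal sketch of semantic-type filtering, as described above.
# Not the service's actual code; the records below are illustrative.
# UMLS semantic-type abbreviations: dsyn = Disease or Syndrome,
# fish = Fish, orch = Organic Chemical.
enabled_types = {"dsyn", "fish"}           # the user's configured types

concepts = [                               # stand-in for MetaMap output
    {"cui": "C0011849", "name": "Diabetes Mellitus", "semtype": "dsyn"},
    {"cui": "C0999999", "name": "Salmon",            "semtype": "fish"},
    {"cui": "C0999998", "name": "Polysaccharide",    "semtype": "orch"},
]

# Keep only concepts whose semantic type the user has enabled.
hits = [c for c in concepts if c["semtype"] in enabled_types]
for c in hits:
    print(c["cui"], c["name"])
```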

As this is a demo system, we don’t store cookies, so if you open a new browser window you’ll have to re-configure.

Sno-Code

A Snomed CT Text Analysis suite

We’ve released our work on SNOMED CT text analysis. Our first step is a public web service that can analyse English text and report any SNOMED CT concepts that are detected. The service is available at http://snomedct.t3as.org/

The service uses the National Library of Medicine’s MetaMap software and its Unified Medical Language System (UMLS). The screenshot below is generated from the web page, using the input text of Figure 3 of Sager et al., Automatic encoding of Snomed III.

Snomed Coder Screenshot

If you would like to try out some other clinical texts, you might try the Medical History texts from Monash University.

Please note: This demonstration service runs on a publicly accessible server that is not geographically constrained. All text entered in the web page is sent in clear text to the Amazon service. Please do not upload private clinical documents.

The page is a simple front-end for the public web service that can analyse English text and report SNOMED CT concepts that are detected. At present we are using a heuristic to filter the semantic types that MetaMap returns to restrict the hits to clinical terms. We will improve the heuristic over time.

The next feature we add will also display the semantic type of each returned concept, feeding into a system that allows the user to view and search over specific types, with some defaults set.

For developers, have a look at the service page to find out how to use the service as an API. We will shortly be releasing the source code as well.
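Until then, a call from Python might look like the hedged sketch below. The endpoint path and response shape are our assumptions; the service page has the authoritative description.

```python
# Hedged sketch of using the SNOMED CT coder as an API.
# The path "/api/analyse" and the JSON response shape are assumptions;
# see the service page at http://snomedct.t3as.org/ for the real API.
import requests

text = "Patient presents with chest pain and shortness of breath."

resp = requests.post(
    "http://snomedct.t3as.org/api/analyse",   # hypothetical endpoint
    data=text.encode("utf-8"),
    headers={"Content-Type": "text/plain"},
)
resp.raise_for_status()
for concept in resp.json():  # assumed: JSON list of SNOMED CT concepts
    print(concept)
```

Remember the caveat above: anything you send goes in clear text to a public server.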

SnoMed

“SNOMED CT or SNOMED Clinical Terms is a systematically organized computer processable collection of medical terms providing codes, terms, synonyms and definitions used in clinical documentation and reporting.” (Wikipedia)

More information on the Australian terminology SNOMED CT-AU can be found at the NEHTA site.

MetaMap

MetaMap is the most common tool in clinical text analysis applications for codifying and relating medical concepts. On PubMed we found only 1 hit for the similar system MGrep and 12 hits for KnowledgeMap, compared with 74 hits for MetaMap, and MetaMap has resulted in substantial performance gains.

With thanks to Mats Henrikson, Hanna Suominen and Neil Bacon.

Patent classification update

The USPTO, CPC and IPC have each recently updated their classification systems. We have updated our underlying classification data:

  • CPC release Dec 2013
  • IPC release Jan 2014
  • USPC release Jan 2014

This brings the patent classification API up to date with current international classification documentation.

Text analytics for: Patent classification

In this project, we developed a searchable index of patent classification codes that allows search by text and by code. We also extended this to allow users to explore the classification hierarchy. This blog entry describes the demonstration web page.

Screenshot: Patent Classification at pat-clas.t3as.org

For those unfamiliar with reading patents, we refer to The Lens and the tutorial “How to read a patent”. Within patents, classification codes provide significant benefit in understanding and searching: patents with similar codes are likely to refer to similar content. From the British Library (emphasis added):

“The usefulness of patent classification as a means of searching for patents information is a by-product of its primary purpose as a tool for patent examiners. Using patent classification as part of a search to identify patents in a particular field can help the non-expert searcher to focus and refine his search and produce a useful set of references… However, it is a massive and complex tool designed for an expert user group and when it is used by anyone outside that user group it should be applied with care.”

For the non-expert, classification codes are difficult to use. For example, a patent for a locomotive on The Lens has two IPC classification codes associated with it: B61C17/04 and B61D27/00. What do these codes mean?

Entry point: web page

The text analytics service for this project is hosted at pat-clas.t3as.org and has a public GitHub repository for all code. The web page is an open HTML file that accesses the service APIs and presents a simple interface for text- or code-based search of CPC, IPC, and USPTO classification codes.

Screenshot: free text search

The first field of the web page lets users enter free text and returns matching classification codes. Following our example, let’s choose IPC and enter “locomotive”. The search returns all IPC codes that contain “locomotive” in their text. The list is sorted by relevance (rank), with all relevant search terms highlighted in the results. The search returns at most the top 50 items.
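As with all our demos, the same search is available as a web service. A call might look like the sketch below; the endpoint and parameter names are assumptions for illustration, with the real interface documented in the GitHub repository.

```python
# Hedged sketch of the free-text search as an API call. The endpoint
# and parameter names are assumptions; see the GitHub repository for
# the real interface.
import requests

resp = requests.get(
    "http://pat-clas.t3as.org/api/search",        # hypothetical endpoint
    params={"format": "IPC", "q": "locomotive"},  # hypothetical parameters
)
resp.raise_for_status()
for hit in resp.json()[:50]:  # the demo returns at most the top 50 items
    print(hit)
```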

The next field allows the user to find the context of a given code. Let’s try B61C17/04:

Code context

The context is built up from the parent codes in the classification hierarchy, with their associated text stubs. Explore can also be used to view the hierarchy of the classification system, for example to find siblings of the code B61C17/04.

Classification hierarchy for Locomotive

The screen flow below outlines how to use the web page.

Under the hood: How it works

All the code is available in a public GitHub repository. If you are interested in developing applications that use this code, you should read the README on GitHub.

  1. CPC/IPC/USPTO codes are converted to a list of string descriptions:
    • one for the code itself and
    • one for each ancestor in the hierarchy.

    This is a very simple database app with XML processing to populate the database.

  2. Given a text query, find CPC/IPC/USPTO codes that have descriptions matching the query. This is a very simple Lucene search app; a sketch of both steps follows below.
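Here is the promised sketch: a naive Python stand-in for both steps that shows the data flow. The real app stores the hierarchy in a database populated from the XML and uses Lucene for the search; the codes and text stubs below are illustrative.

```python
# Naive stand-in for the two steps above (not the real database/Lucene
# implementation). Codes and text stubs are illustrative.

hierarchy = {              # child code -> parent code (None at the top)
    "B61": None,
    "B61C": "B61",
    "B61C17/04": "B61C",
}
text_stub = {
    "B61": "railways",
    "B61C": "locomotives; motor railcars",
    "B61C17/04": "arrangement of braking apparatus on locomotives",
}

def descriptions(code):
    """Step 1: one description for the code plus one per ancestor."""
    out = []
    while code is not None:
        out.append(text_stub[code])
        code = hierarchy[code]
    return out

def search(query):
    """Step 2: codes whose descriptions match the query (Lucene stand-in)."""
    q = query.lower()
    return [c for c in hierarchy if any(q in d for d in descriptions(c))]

print(descriptions("B61C17/04"))  # the "context" shown on the web page
print(search("locomotive"))       # -> ['B61C', 'B61C17/04']
```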

We’ll post more details soon.

The fine print

  • The service is hosted on Amazon Web Services, with uptime on a best-effort basis and no redundancy.
  • If requested, we may upgrade the hosting to production grade with hardware redundancy.
  • The web page/user interface is designed as a demonstration of the underlying web services, and is not intended as a production user interface for any particular use case.

This combines work from Neil Bacon, Gabriela Ferraro and Mats Henrikson.

Our manifesto

Welcome to the Text Analytics as a Service (t3as) blog.

We want to deliver open text-analysis capability to end users via open-source, web-based services. Text analytics is a significant bottleneck for data analysis: analysis of unstructured text is needed, yet text data remains largely unused outside research projects.

Text Analytics is an artisanal (or cottage) industry that does not yet lend itself to engineered processes. Text analytics lacks the standardisation required to deploy technology solutions composed of “off the shelf” components.

We want to build a framework that delivers the benefits of open-source text analytics, whilst overcoming the barriers of open data analysis.

How does it work?

Text Analysis as a Service will support multiple sub-projects. The services are hosted on Amazon Web Services, under the domain www.t3as.org. Projects will deliver:

  • Open Source files:

    We maintain a private GitHub repository. Projects that reach an appropriate level of maturity, including NICTA open-source requirements, will be transitioned to open source under the GNU General Public License v3 (GPLv3) and a public GitHub repository.

  • Application Programming Interfaces (APIs) hosted on a web service (Amazon Web Services)

    These APIs will be usable without a commercial licence. The hosted APIs are designed to deliver trusted code, but not necessarily high bandwidth. The APIs are versioned, and will support remote interrogation by other services: this allows composition of our web services. If/when load becomes an issue, we will develop free API-key access that supports monitoring and potentially limits usage per user.

  • Web page as a wrapper

    We will host web pages that wrap the APIs to show the services. We expect the web page to give a tutorial in the API calls, as well as allowing developers to duplicate the web page and experiment with API calls before developing their own software that calls the API.

  • Published source code, allowing remote and local deployment

    The services work in both remote (as a server-side application) and local (service deployed in-process within the user’s application) modes.
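To make that last point concrete, the sketch below shows one shape this duality could take: a single analyse contract with an in-process implementation and an HTTP-backed one. The class names and endpoint here are our assumptions, not the released code.

```python
# Hedged sketch of the remote/local duality described above. The class
# names and the endpoint are assumptions, not the actual released code.
import requests

class LocalAnalyser:
    """Service deployed in-process within the user's application."""
    def analyse(self, text):
        # Stand-in for invoking the released source code directly.
        return [{"input": text, "concepts": []}]

class RemoteAnalyser:
    """Same contract, backed by the hosted web service."""
    def __init__(self, base_url):
        self.base_url = base_url
    def analyse(self, text):
        resp = requests.post(self.base_url + "/analyse",  # hypothetical path
                             data=text.encode("utf-8"))
        resp.raise_for_status()
        return resp.json()

# Calling code is identical either way:
analyser = LocalAnalyser()  # or RemoteAnalyser("http://snomedct.t3as.org/api")
print(analyser.analyse("some English text"))
```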