E-txt2DB

E-txt2DB is a framework for specifying and executing Entity Recognition (ER) programs. These programs accept as input a text containing potentially interesting entities to be extracted and produce the input text annotated with the recognized entities.

E-txt2DB operates in two distinct phases. First, the training phase creates a classification model from a given ER technique and one or more resources that guide the creation of the model. Examples of such resources are dictionaries for rule-based ER techniques and training data for statistical learning techniques (e.g., Conditional Random Fields). Second, in the execution phase, a previously created classification model receives plain text as input and produces annotations corresponding to the recognized entities.
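The two phases can be pictured with a minimal, self-contained sketch of a dictionary-based recognizer. It is written in Python purely for illustration and does not use the E-txt2DB specification language, Minorthird, or Lingpipe; the function names `train` and `execute`, the dictionary layout, and the inline-tag annotation format are all assumptions made for this example.

```python
import re


def train(dictionary):
    """Training phase (hypothetical sketch): build a model from a resource.

    `dictionary` maps an entity type to a list of surface forms,
    e.g. {"CITY": ["Lisbon", "New York"]}.
    """
    entries = [(phrase, etype)
               for etype, phrases in dictionary.items()
               for phrase in phrases]
    if not entries:
        raise ValueError("the dictionary must contain at least one phrase")
    # Put longer phrases first so they win over shorter ones in the alternation.
    entries.sort(key=lambda entry: len(entry[0]), reverse=True)
    lookup = {phrase.lower(): etype for phrase, etype in entries}
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(phrase) for phrase, _ in entries) + r")\b",
        re.IGNORECASE,
    )
    # The "model" here is simply the compiled pattern plus the type lookup.
    return pattern, lookup


def execute(model, text):
    """Execution phase (hypothetical sketch): annotate plain text with a model."""
    pattern, lookup = model

    def annotate(match):
        etype = lookup[match.group(0).lower()]
        return f"<{etype}>{match.group(0)}</{etype}>"

    return pattern.sub(annotate, text)


model = train({"CITY": ["Lisbon", "New York"], "PERSON": ["Ada Lovelace"]})
print(execute(model, "Ada Lovelace visited lisbon and New York."))
```

The model is built once and can then be applied to any number of input texts, which mirrors the separation between the training and execution phases. A statistical technique would replace the body of `train` with a learning step over annotated training data, while `execute` would keep the same role.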

The E-txt2DB framework consists of a software layer, built on top of Minorthird and Lingpipe, that offers a command-like specification language. Existing Machine Learning Java APIs (such as Minorthird and Lingpipe) provide implementations of Entity Recognition techniques, but some developers of ER applications do not want to deal with the implementation details of those techniques. Instead, they prefer to focus on:

  (i) the choice of the technique to be used;

  (ii) the resources used in the process (e.g., dictionaries);

  (iii) a good set of features that help the ER program make adequate decisions.

The objective of the E-txt2DB specification language is to make the development and tuning of ER programs easier for developers who are mainly concerned with these topics.
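To make item (iii) more concrete, the sketch below shows the kind of per-token features that are commonly fed to statistical ER techniques such as Conditional Random Fields. It is an illustration in Python only: the function name `token_features`, the chosen features, and the `<BOS>`/`<EOS>` boundary markers are assumptions for this example, not features defined by E-txt2DB.

```python
def token_features(tokens, i):
    """Return a feature dictionary for the token at position i (hypothetical sketch)."""
    token = tokens[i]
    return {
        "lower": token.lower(),                        # normalized surface form
        "is_capitalized": token[:1].isupper(),         # e.g. proper nouns
        "is_all_caps": token.isupper(),                # e.g. acronyms
        "has_digit": any(c.isdigit() for c in token),  # e.g. dates, codes
        "suffix3": token[-3:].lower(),                 # crude morphological cue
        # Neighboring tokens give the classifier some context.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


tokens = "Dr. Smith lives in Lisbon".split()
print(token_features(tokens, 1))
```

Tuning an ER program often amounts to adding or removing entries in a feature set like this one and observing how the recognition quality changes.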