# Labs

# Introduction

In Labs, users can create various types of machine learning projects, such as Named-Entity Recognition, Entity Relationship Extraction, Generic Classifier/Tagger, Sentiment Analysis, Doc2vec, and more.

In a Named-Entity Recognition project, for example, the user can define a list of entity types, load text documents, annotate them manually by marking the relevant entities in the text, build a model to annotate further documents automatically, and finally evaluate how good the model is.

This part of the documentation describes how to work with the Named-Entity Recognition project type. The Labs NLP Engine must be installed and running in the system; please refer to the last section of this page (Installing Labs NLP Engine) for guidance.

# Types of Users

There are three types of users that can use Labs:

  • Labs Manager

This user can manage (create, update, delete) projects in the Labs area, assigning the management of each one to another user (or to themselves). The related (global) system role is ROLE_LABS_MANAGER.

  • Project Manager

This user has rights on a specific project and can manage the work inside it. For example, in a Named-Entity Recognition project this user can load new documents, manage the Type System, and assign work tasks. The related (local) system role is ROLE_LABS_PROJECT_ERENER_MANAGE.

  • Document Annotator

In a Named-Entity Recognition project, this role is assigned to a user who should manually annotate documents, marking the relevant entities in the text. The related (local) system role is ROLE_LABS_PROJECT_ERENER_DOCUMENT_ANNOTATE.

# Architecture

As we will see hereafter, to work effectively with Labs you need to involve some Natural Language Processing (NLP) capabilities of the Hume NLP engine (also known as the Annotation Service). More specifically, the Hume NLP Service can perform tasks such as:

  • training of NLP models, accomplished by Hume Training Skills
  • annotation of text (Entity Extraction/Named Entity Recognition, tokenization, sentence splitting, etc.): these capabilities are usually used in the pre-annotation phase of a text and are performed by Entity Extraction skills

Training a model and annotating a text require that the Core API and the Annotation Service be able to share files with each other; an intermediate file repository is used for this purpose.

Labs - Architecture of the training ecosystem

MinIO (https://min.io) is the file storage solution chosen for this purpose; later in this section we'll describe how to configure it and what kind of data flows among all the players of this scenario.

# Working with Projects

A Labs Manager can create a project by giving it a name and choosing the proper project type.
Labs - Working with Projects - Image 01

The Labs Manager can assign the Project Manager permission to another user of the system using the Access Permissions dialog.
Labs - Working with Projects - Image 02

The Project Manager can insert (or upload) documents to be annotated in the project.
Labs - Working with Projects - Image 03 Labs - Working with Projects - Image 04

# Type System and Dictionary

A Project Manager should edit a proper Type System for the project, including a reasonable set of entity types to be recognized in the document text.
Labs - Type System and Dictionary - Image 01

For each entity type a list of terms can be defined, composing a domain-specific dictionary. Please note that regular expressions can be used as well.
Labs - Type System and Dictionary - Image 02

The user can also specify a list of Relationship types that will link two different entities in the documents; the following image shows an example.
Labs - Type System and Dictionary - Image 03

The Dictionary Upload panel in the image below allows users to add items to the entity dictionary by uploading a CSV file whose columns are the term and the entity label, respectively; an illustrative example is shown after the image.
Labs - Type System and Dictionary - Image 04
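Purely as an illustration (the terms and labels below are made up and should be replaced with entries matching the entity types defined in your own Type System), such a CSV could look like this:

  Acme Corporation,ORGANIZATION
  London,LOCATION
  John Smith,PERSON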

# Building an Annotation Skill from a dictionary

Once the dictionary contains a certain number of terms, the project manager can press the "Create Skill" button to create a skill capable of pre-annotating the documents, based on the Type System and the terms in the dictionary just created.
Labs - Type System and Dictionary - Image 05

The newly created skill can be found in the ecosystem panel, and it is available in the whole system.
Labs - Type System and Dictionary - Image 06

# Task Assignment

With this feature a project manager can assign the manual annotation of documents to the annotators. He/she enters a name for the task, chooses an existing skill that the annotator will use to pre-annotate the documents (this is explained later in the documentation), and defines a rule to select the documents to be included in the task.
Labs - Task Assignment - Image 01

After task creation, the system prompts the user to assign the task to an annotator through the permission management panel.
Labs - Task Assignment - Image 02

# Manual annotation

The document annotators should now annotate the documents inside their assigned tasks. Please note that the first thing to do before starting to annotate is to process the document with the Preannotate button. The result of this phase could look like the second image, where you can find some recognised entities.
Labs - Task Assignment - Image 03 Labs - Task Assignment - Image 04

At this point the user can manually recognise entities and annotate them, choosing the correct entity type: it is sufficient to select a part of the text and pick the entity type from the menu at the top right of the page.
Labs - Task Assignment - Image 05 Labs - Task Assignment - Image 06

When the annotation process is done, the user must click the "Mark Ready" button.

# Document Review

When the annotation process has been finished for a document, a project manager should review that document in every task in which it is present.
Labs - Document Review - Image 01

In this phase he/she should resolve annotation conflicts, like the one in the image above, and accept or reject the other annotations.
Labs - Document Review - Image 02

The reviewer can also use the following tool (Batch Review button) to accept all non-conflicting annotations.
Labs - Document Review - Image 03

At the end of the review process the project manager must click the "Complete Review" button.

# Project Versions

Once a reasonable number of documents have been annotated and reviewed, the project manager can decide to freeze the project in a Version, simply by clicking the Create button (no name is needed).
Labs - Project Versions - Image 01

Just after the creation of a version, as we will see, a set of training data is available; these data can be used to train a model. The project manager can download the training data, but he/she can also take advantage of the Hume ecosystem and train a model directly.
Labs - Project Versions - Image 02 Labs - Project Versions - Image 03

# Model training

Labs - Creation of a model

As already described, once the annotation and review process of a given set of documents has finished, it is possible to (1) create a version of the project.

Each version can be thought of as a snapshot of the status of the project; it is materialised as a TSV (Tab-Separated Values) file and stored in the MinIO bucket specified by the setting HUME_BLOBSTORE_CONTAINER (see the Configuration section); this file is the input data of the Training Skill.

As a matter of fact, the (2) Annotation Service will perform the training and (3) will return a model definition.

Then, (4) the model definition will be saved in MinIO in the same bucket specified above, together with its matching training data.
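If you want to see this flow from the storage side, any S3-compatible client can inspect that bucket. The sketch below uses the MinIO Python client with the example endpoint, credentials and bucket name from the Configuration section later on this page (these values are assumptions and should be adjusted to your installation); it simply lists the objects stored in the humelabs bucket, where the training TSV files and model definitions described above end up.

  # Sketch: inspect the Labs bucket with the MinIO Python client.
  # The endpoint, credentials and bucket name are the example values from the
  # Configuration section below; adjust them to your installation.
  from minio import Minio

  client = Minio(
      "localhost:9000",       # MinIO endpoint (blobstore:9000 inside the Docker network)
      access_key="miniokey",
      secret_key="miniopassword",
      secure=False,           # the example setup uses plain HTTP
  )

  # List the training data (TSV files) and model definitions stored by Labs.
  for obj in client.list_objects("humelabs", recursive=True):
      print(obj.object_name, obj.size)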

# How-To

This activity requires a skill in the ecosystem capable of building a model starting from a set of training data, like the one in the following example.
Labs - Project Versions - Image 04

When the project manager clicks the proper icon to train a new model, the system prompts him/her to choose a name for the new model and the skill that will do the job of building it. The system then submits the training data to the training skill, and the user has to wait for the needed processing time.
Labs - Project Versions - Image 05 Labs - Project Versions - Image 06

# Build an Annotation Skill from a model

Labs - Creation of an annotation skill from a model

Once the model definition has been saved in the Labs MinIO folder, it is possible to (1) create one or more Annotation Skills from it. Behind the scenes, Hume Core API will (2) copy the model definition into a new bucket named after the new skill; at the same time, it will also save the pipeline configuration file of the brand-new skill.
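As with the training data, this behaviour can be observed on the storage side by listing the buckets and looking for one named after the new skill. In the sketch below the skill name my-ner-skill is purely hypothetical, and the endpoint and credentials are again the example values from the Configuration section.

  # Sketch: after creating a skill, a bucket with the skill's name should exist
  # and contain the copied model definition plus the pipeline configuration file.
  # The skill name, endpoint and credentials below are illustrative assumptions.
  from minio import Minio

  client = Minio("localhost:9000", access_key="miniokey",
                 secret_key="miniopassword", secure=False)

  # Print all buckets; one of them should be named after the new skill.
  for bucket in client.list_buckets():
      print(bucket.name)

  # Print the contents of the (hypothetical) skill bucket.
  for obj in client.list_objects("my-ner-skill", recursive=True):
      print(obj.object_name)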

# How-To

When the model is ready to be used, the user can create a new annotation skill from it, simply by clicking on the lamp icon.

The system will then ask the user for a name to give to the new skill.
Labs - Project Versions - Image 07 Labs - Project Versions - Image 08

# Annotate a document via Skill

Labs - Creation of a model

After its creation, the new Annotation Skill can be used to annotate other documents belonging to the same business domain as the Labs project we created it from.

For example, the new skill is useful for pre-annotation purposes in further tasks, in order to make the model more and more accurate.

As depicted in the diagram above, (1) Core API will ask the Annotation Service, via the Skill just created, to annotate a document.

The Annotation Service will (2) pick the model definition from the skill folder and will use it for creating a pipeline; then, (3) it will annotate the text of the document according to the configuration of the pipeline and, finally, (4) it will return the result to Core API.

WARNING

Please note that step (2) is executed only once, that is, when the skill's pipeline is invoked for the first time after the Annotation Service starts up.

# How-To

Finally, the project manager can test the new skill, for example by using it in the pre-annotation phase of a document belonging to a task and then looking at the results.
Labs - Project Versions - Image 09 Labs - Project Versions - Image 10

# Installing Labs NLP Engine

MinIO is a file repository based on the Amazon S3 protocol; below you can find a snippet of the docker-compose file for each service, showing how to make them connect to this storage.

# Configuration

WARNING

Please note that a configuration like this is already provided with the installation download.

All the following configuration items must match across every service: in other words, the host, identity and password of the MinIO instance must be correct and must be the same everywhere.

# Core API

  api:
    environment:
      ...
      - HUME_BLOBSTORE_ADAPTER=minio
      - HUME_BLOBSTORE_ADAPTER_S3_HOST=http://blobstore:9000
      - HUME_BLOBSTORE_ADAPTER_S3_IDENTITY=miniokey
      - HUME_BLOBSTORE_ADAPTER_S3_CREDENTIAL=miniopassword
      - HUME_BLOBSTORE_CONTAINER=humelabs
      ...

where:

  • HUME_BLOBSTORE_ADAPTER: the file sharing method. In most installations its value is minio, but it can also be set to s3 for storage on AWS.
  • HUME_BLOBSTORE_ADAPTER_S3_HOST: URL of the MinIO host.
  • HUME_BLOBSTORE_ADAPTER_S3_IDENTITY: MinIO username.
  • HUME_BLOBSTORE_ADAPTER_S3_CREDENTIAL: MinIO password.
  • HUME_BLOBSTORE_CONTAINER: the name of the MinIO/S3 bucket where training data and model definitions will be stored.

# Entity Extraction Service

  annotation-service:
    image: docker.graphaware.com/public/hume-annotation-service:${HUME_VERSION}
    environment:
      - SERVER_PORT=8085
      - server.tomcat.max-connections=2
      - "JAVA_OPTS=-Xmx4g"
      - HUME_CONFIG_STORE_DRIVER=blobstore
      - HUME_CONFIG_STORE_HOST=http://blobstore:9000
      - HUME_CONFIG_STORE_ACCESS_KEY=miniokey
      - HUME_CONFIG_STORE_ACCESS_SECRET=miniopassword
    ports:
      - 8085:8085

where the relevant settings are:

  • HUME_CONFIG_STORE_DRIVER: the value blobstore is the matching method for HUME_BLOBSTORE_ADAPTER=minio on Core API side
  • HUME_CONFIG_STORE_HOST: equivalent of HUME_BLOBSTORE_ADAPTER_S3_HOST on Core API side (must match)
  • HUME_CONFIG_STORE_ACCESS_KEY: equivalent of HUME_BLOBSTORE_ADAPTER_S3_IDENTITY on Core API side (must match)
  • HUME_CONFIG_STORE_ACCESS_SECRET: equivalent of HUME_BLOBSTORE_ADAPTER_S3_CREDENTIAL on Core API side (must match)

# MinIO

  blobstore:
    image: docker.graphaware.com/public/hume-blobstorage:${HUME_VERSION}
    restart: always
    environment:
      - MINIO_ACCESS_KEY=miniokey
      - MINIO_SECRET_KEY=miniopassword
    volumes:
      - "hume_blobstore_data:/data"
    command: "server /data"
    ports:
      - 9000:9000

Here we can choose both the MinIO username and password by setting MINIO_ACCESS_KEY and MINIO_SECRET_KEY. If we change them here, we have to change them for Core API and annotation-service as well.
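As a quick sanity check that all three services really point at the same store, you can connect with the shared credentials and verify that the bucket configured in HUME_BLOBSTORE_CONTAINER exists. The sketch below uses the MinIO Python client with the example values from the snippets above (host, key, secret and bucket name are assumptions; replace them with your own if you changed them):

  # Minimal connectivity check against the blobstore, using the example values above.
  from minio import Minio

  client = Minio(
      "localhost:9000",            # same instance as ..._S3_HOST / HUME_CONFIG_STORE_HOST
                                   # (blobstore:9000 inside Docker, localhost via the mapped port here)
      access_key="miniokey",       # ..._S3_IDENTITY / ..._ACCESS_KEY / MINIO_ACCESS_KEY
      secret_key="miniopassword",  # ..._S3_CREDENTIAL / ..._ACCESS_SECRET / MINIO_SECRET_KEY
      secure=False,
  )

  # The Labs bucket (HUME_BLOBSTORE_CONTAINER) should be reachable with these credentials.
  print(client.bucket_exists("humelabs"))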