# Skills

# What are Skills?

# Capabilities, Implementations and Skills

The Hume Ecosystem offers a multitude of capabilities which, when adjusted or customised for a particular domain, enable enterprises to intelligently reveal knowledge hidden in their data, whether structured or unstructured.

For example, the ability to recognise mentions of names that represent things of high importance in a certain domain is called the Entity Extraction capability.

While this capability could be fulfilled by humans highlighting those mentions with a marker or entering the references manually in an Excel sheet, the ecosystem offers one or more automated implementations, written as standalone software packages, each able to serve a particular capability. Building on the example above, the Hume ecosystem provides the Entity Extraction capability by integrating the Stanford Natural Language Processing library into the ecosystem.

# Integration in Hume

Every skill in Hume runs as its own microservice and is reachable over HTTP (among other protocols).

In order to integrate a skill, you need to do two things:

  • Deploy the Skill microservice in the ecosystem
  • Register the Skill in Hume

# Entity Extraction

Hume provides one implementation for Entity Extraction that is based on StanfordNLP.

Entity Extraction finds mentions of names that represent things in the real world. For example, in the following sentence:

Barack Obama was born in Hawaii. He was the President of the United States.

The following entities will be extracted from the text:

Barack Obama  PERSON
Hawaii        LOCATION
United States LOCATION
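Once the skill is deployed and registered (see the sections below), it is reachable over HTTP on its /annotate endpoint. The following is a minimal sketch of calling it directly with Python's requests library; note that the request payload shown here (a text field plus a pipeline name) is an assumption for illustration, not the documented contract, so adapt it to your deployment.

import requests

# Assumed payload: the exact request schema of the annotation service may differ.
payload = {
    "text": "Barack Obama was born in Hawaii. He was the President of the United States.",
    "pipeline": "default",  # "default" selects the built-in pipeline (see Configuration below)
}

# Port 8085 matches the docker-compose example in the next section.
response = requests.post("http://localhost:8085/annotate", json=payload)
response.raise_for_status()

annotation = response.json()
print(annotation)  # raw NLP output returned by the service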

# Deploy the microservice

To deploy the microservice, add the following to your core Hume docker-compose file (provided by the installation download).

version: '3.7'
services:
  annotation-service:
    image: docker.graphaware.com/public/hume-annotation-service:${HUME_VERSION:-2.2.1}
    environment:
      - SERVER_PORT=8085
      - "JAVA_OPTS=-Xmx4g"
    ports:
      - "8085:8085"

Save the changes to your docker-compose.yml file and run docker-compose up -d to start the new service.
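To verify that the container started and is listening on the published port, a quick connectivity check is enough. This sketch only confirms that port 8085 answers; it does not check that the NLP pipeline is loaded.

import socket

# Confirm that something is accepting TCP connections on the published port 8085.
with socket.create_connection(("localhost", 8085), timeout=5):
    print("annotation-service is reachable on port 8085")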

# Configuration

  • name (required): The user-defined name for this skill
  • url (required): The URL at which this skill is available, including the annotation endpoint (/annotate for standard annotation output, /v2/annotate for streamlined output where multi-token entities are not merged)
  • pipeline (required): The name of the NLP pipeline to use (use default for the default pipeline)
  • version (optional): Sets the annotation endpoint version. Leave blank if you use the standard /annotate endpoint; set it to v2 if you use /v2/annotate (entities not merged)

# Inputs and outputs

# Inputs
  • text (required): The workflow message field that contains the text to be analysed.
# Outputs
  • annotation (default: _annotation): The grammatical representation of the text. Contains the NLP output, which can be useful for downstream components.
  • entities (default: _entities): The entities recognised in the text, grouped by type. When multiple consecutive tokens are tagged with the same entity type, they are merged (Jane Austen instead of Jane and Austen), unless the v2 annotation endpoint is used.
{
  "_entities": {
    "money": [
      "$756 million",
      "$575 million"
    ],
    "organization": [
      "Amazon"
    ],
    "location": [
      "UK"
    ]
  }
}
  • merged entities (default: _merged_entities): The entities recognised in the text, grouped by type. Multi-token entities are merged even when the v2 annotation endpoint is used; a sketch of how this output can be consumed follows the example below.
{
  "_merged_entities": {
    "PERSON": [
      {"sentenceIndex": 0,
       "beginCharacter": 15,
       "endCharacter": 25,
       "label": "PERSON",
       "value": "Jane Austen"},
      {"sentenceIndex": 2,
       "beginCharacter": 4,
       "endCharacter": 7,
       "label": "PERSON",
       "value": "John"}
    ],
    "TITLE": [
      {"sentenceIndex": 0,
       "beginCharacter": 27,
       "endCharacter": 32,
       "label": "TITLE",
       "value": "writer"}
    ]
  }
}
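As an illustration of how a downstream component might consume this structure, the sketch below flattens the _merged_entities field into simple (value, label, sentence index) tuples. The field names are taken from the example output above.

# Flatten the _merged_entities output shown above into a simple list of mentions.
merged = {
    "_merged_entities": {
        "PERSON": [
            {"sentenceIndex": 0, "beginCharacter": 15, "endCharacter": 25,
             "label": "PERSON", "value": "Jane Austen"},
            {"sentenceIndex": 2, "beginCharacter": 4, "endCharacter": 7,
             "label": "PERSON", "value": "John"},
        ],
        "TITLE": [
            {"sentenceIndex": 0, "beginCharacter": 27, "endCharacter": 32,
             "label": "TITLE", "value": "writer"},
        ],
    }
}

mentions = [
    (entity["value"], entity["label"], entity["sentenceIndex"])
    for entities in merged["_merged_entities"].values()
    for entity in entities
]
print(mentions)
# [('Jane Austen', 'PERSON', 0), ('John', 'PERSON', 2), ('writer', 'TITLE', 0)]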

# Creating a dictionary-based pipeline

The Entity Extraction service allows you to easily create a dictionary-based model from a dictionary of terms you already have.

The dictionary format is TSV with two columns: the first column is the term to match and the second column is its entity type.

The first thing to do is mount a directory from your system to the /data folder of the container, so that the entity extraction service can access it from inside the container:

version: '3.7'
services:
  annotation-service:
    image: docker.graphaware.com/public/hume-annotation-service:${HUME_VERSION:-2.2.1}
    environment:
      - SERVER_PORT=8085
      - "JAVA_OPTS=-Xmx4g -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap"
    ports:
      - "8085:8085"
    volumes:
      - "./public/models:/data"

Then restart the container with docker-compose up -d (it only recreates the containers whose configuration has changed).

Say, for example, that you want to recognise mentions of technologies. You would start with a TSV file like the following:

computer vision TECHNOLOGY
automated job-recruitment TECHNOLOGY

Place this file in the mounted directory, in this example ./public/models/techs.tsv.
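If your terms live in code or in a database rather than in a spreadsheet, the dictionary file can also be generated programmatically. A small sketch (the terms below are just the examples above):

import csv

# Terms to recognise: first column is the term, second column is its entity type.
terms = {
    "computer vision": "TECHNOLOGY",
    "automated job-recruitment": "TECHNOLOGY",
}

# Write a tab-separated file into the directory mounted at /data in the container.
with open("./public/models/techs.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for term, entity_type in terms.items():
        writer.writerow([term, entity_type])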

You can now create a pipeline that will use this file as a model:

POST http://localhost:8085/pipeline/new

{
    "name": "tech",
    "language": "en",
    "textProcessor": "com.hume.nlp.processor.stanford.ee.processor.EnterpriseStanfordTextProcessor",
    "annotators": ["tokenize", "ner","dependency","pos"],
    "nerRegexFiles": ["techs.tsv"]
}
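The same request can also be sent programmatically. A minimal sketch using Python's requests library, assuming the service is reachable on localhost:8085 and requires no authentication:

import requests

# Pipeline definition identical to the HTTP example above.
pipeline = {
    "name": "tech",
    "language": "en",
    "textProcessor": "com.hume.nlp.processor.stanford.ee.processor.EnterpriseStanfordTextProcessor",
    "annotators": ["tokenize", "ner", "dependency", "pos"],
    "nerRegexFiles": ["techs.tsv"],
}

response = requests.post("http://localhost:8085/pipeline/new", json=pipeline)
response.raise_for_status()
print("pipeline 'tech' created")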

You can now create a new skill in the ecosystem using that pipeline; its _entities output will contain a technology key if any technologies are recognised in the given texts.

The following text

Vochi, a startup operating out of Belarus that's created a "computer vision-based" video editing and effects app for mobile phones, has raised $1.5 million in seed funding. Leading the round is Ukraine-based Genesis Investments (backer of BetterMe and Jiji). It follows pre-seed funding in April 2019 from Bulba Ventures, where Vochi founder and CEO Ilya.

will yield the following entities:

{
  "_entities":
    {
      "date": ["April 2019"],
      "money": ["$1.5 million"],
      "person": ["Vochi", "Ilya"],
      "organization": ["Genesis Investments", "Jiji"],
      "location": ["Belarus"],
      "technology": ["computer vision"]
    }
}

# Keyword Extraction

Hume also provides the ability to extract the most relevant keywords from a given text and to assign a relevance ranking to each of them. The Hume implementation of the Keyword Extraction capability is named TextRank. It accepts as input a text that has already been tokenized; in other words, entities must already have been extracted by the Entity Extraction skill described above. For this reason, a Keyword Extraction skill cannot be used on its own: it has to work in cascade with an Entity Extraction skill.

# Deploy the microservice

To deploy the microservice, add the following to your core Hume docker-compose file (provided by the installation download).

version: "3.7"
services:
  keyword-service:
    image: docker.graphaware.com/public/hume-textrank-service:${HUME_VERSION:-2.5.0}
    environment:
      - SERVER_PORT=8086
      - server.tomcat.max-connections=100
      - "JAVA_OPTS=-Xmx2g"
    ports:
      - "8086:8086"

# Configuration

  • name (required): The user-defined name for this skill
  • url (required): The URL at which this skill is available, including the extract keywords endpoint (/extractKeywords)
  • parameters (optional): the following list of additional parameters:
    • annotatedText
    • motherNode
    • iterations
    • damp
    • threshold
    • stopwords
    • removeStopwords
    • respectDirections
    • respectSentences
    • useDependencies
    • dependenciesGraph
    • topXTags
    • keywordLabel
    • cleanKeywords
    • admittedPOSs
    • forbiddenPOSs
    • forbiddenNEs
    • allowedDepencies

# Inputs and outputs

# Inputs
  • Annotated text (required, default: null): The grammatical representation of the text; this is the NLP annotation output field produced by the Entity Extraction skill.
# Outputs
  • Keyword results (required, default: _keywords): the list of keywords with their relevance:

    "_keywords": [
        {
          "value": "spain",
          "relevance": 0.038518948637175365
        },
        {
          "value": "england",
          "relevance": 0.030516152155735225
        },
        {
          "value": "everybody",
          "relevance": 0.034293276187748675
        },
        {
          "value": "business",
          "relevance": 0.030516152155735225
        },
        {
          "value": "part",
          "relevance": 0.026955924871608004
        },
        {
          "value": "british",
          "relevance": 0.04507287184259783
        },
        {
          "value": "feeling",
          "relevance": 0.027648381912901754
        },
        {
          "value": "summer",
          "relevance": 0.04044201830604336
        },
        {
          "value": "plaza",
          "relevance": 0.030516152155735225
        },
        {
          "value": "act",
          "relevance": 0.030516152155735225
        },
        {
          "value": "spot",
          "relevance": 0.027321034489396632
        },
        {
          "value": "first",
          "relevance": 0.028144534069453997
        },
        {
          "value": "casey shaddock",
          "relevance": 0.030516152155735225
        },
        {
          "value": "president",
          "relevance": 0.030516152155735225
        }
      ]
    

    which is the output for the following original text:

    The coronavirus pandemic which had kept the tourists away for the first part of the summer seemed to be easing, and the - mainly British - visitors were returning to this little spot on Spain's Costa Blanca. But the feeling of relief was far too short lived: on Saturday, the British government announced it was imposing a two-week quarantine on those returning from Spain. The calls asking to cancel began to pour in almost immediately - after all, many holidaymakers cannot afford to take another two weeks off, especially with the potential of them being unpaid. For the businesses around the plaza though, the reality was the rule change was not just ruining their summer holidays. It has the potential to cripple their livelihoods too. Everybody here is just panicking - we were just getting back on our feet, Casey Shaddock, president of the Villamartin Plaza, told the BBC. Normally, this square would be buzzing - we hold 1,400. On summer nights we do live music, we bring a lot of acts across from England. Now, it is just the birds chirping.
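Downstream of this skill, the _keywords list can easily be ranked by relevance to keep only the strongest terms. A small sketch, using a few of the keywords from the output shown above:

# Keep the top 3 keywords from a _keywords list like the one shown above.
keywords = [
    {"value": "spain", "relevance": 0.038518948637175365},
    {"value": "england", "relevance": 0.030516152155735225},
    {"value": "british", "relevance": 0.04507287184259783},
    {"value": "summer", "relevance": 0.04044201830604336},
    {"value": "part", "relevance": 0.026955924871608004},
]

top_keywords = sorted(keywords, key=lambda k: k["relevance"], reverse=True)[:3]
for keyword in top_keywords:
    print(f'{keyword["value"]}: {keyword["relevance"]:.4f}')
# british: 0.0451
# summer: 0.0404
# spain: 0.0385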


# WebData Extraction

Hume provides one implementation for WebData Extraction that is based on the newspaper3k Python library.

This Skill can be used to fetch publicly available web pages and infer structured information from them.

# Deploy the microservice

To deploy the microservice, add the following to your core Hume docker-compose file (provided by the installation download).

version: "3.7"
services:
  webdata-parser:
    image: docker.graphaware.com/public/hume-webdata-parser:${HUME_VERSION:-2.6.0-RC4}    
    ports:
      - "8087:8000"

# Configuration

  • name (required): The user-defined name for this skill
  • url (required): The URL at which this skill is available

# Inputs and outputs

# Inputs
  • URL to extract (required): the incoming message's field containing the URL to fetch and parse
# Outputs
  • Parsed Data: the new field in the outgoing message where the parsed data is stored:
{
  "data": {
    "top_image": "https://ewscripps.brightspotcdn.com/dims4/default/d49dab0/2147483647/strip/true/crop/400x210+0+8/resize/1200x630!/quality/90/?url=http%3A%2F%2Fmediaassets.fox13now.com%2Ftribune-network%2Ftribkstu-files-wordpress%2F2012%2F04%2Fnational-news-e1486938949489.jpg",
    "movies": [],
    "text": "By Leigh Ann Caldwell\n\nWASHINGTON (CNN) — Not everyone subscribes to a New [...]",
    "keywords": [
      "states",
      "minimum",
      "family",
      "drones",
      "laws",
      "guns",
      "law",
      "national",
      "wage",
      "leave",
      "obamacare",
      "state",
      "latest",
      "pot"
    ],
    "authors": ["Leigh Ann Caldwell"],
    "publish_date": "2013-12-30T00:00:00"
  }
}

which is the output for the url: http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/
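Because the skill wraps newspaper3k, you can reproduce roughly the same fields locally with the library itself. The sketch below is not the skill's own code path, just a way to get a feel for the data it returns (article.nlp() additionally requires NLTK's punkt data for keyword extraction):

from newspaper import Article

# Fetch and parse the same article as in the example output above.
article = Article("http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/")
article.download()
article.parse()

print(article.authors)       # e.g. ['Leigh Ann Caldwell']
print(article.publish_date)  # e.g. 2013-12-30 00:00:00
print(article.top_image)
print(article.text[:200])

article.nlp()                # keyword extraction step
print(article.keywords)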

# PDF Parser

The PDF Parsing Capability allows users to extract the text from PDF files.

# Microservice configuration

version: "3.7"
services:
  tikapdf:
    image: docker.graphaware.com/internal/hume-tika-pdf:2.7.0
    ports:
      - "8089:8089"

# Configuration

  • name (required): The user-defined name for this skill
  • url (required): The URL at which this skill is available, for example http://tikapdf:8089

# Entity Relations Extraction (ERE)

# Rule-based Entity Relations Extraction

Entity Relations Extraction (ERE) is a tool (rule-based approach) or a Machine Learning model which aims to discover links between two named entities based on the context in which they appear. For example, in the sentence "Jane Austen, a Victorian era writer, works nowadays for Google.", the PERSON entity Jane Austen is clearly linked to the ORGANIZATION Google through a WORKS_FOR relationship.

Further details can be found in our blog posts: the rule-based approach is described in Knowledge Graphs with Entity Relations: Is Jane Austen employed by Google?, while the ML approach is covered in Hume in Space: Monitoring Satellite Technology Markets with a ML-powered Knowledge Graph.

A Hume demo with rule-based ERE is described here.

In this section we focus on the rule-based approach. Its main features are:

  • Ready to deploy out-of-the-box: two Orchestra components can be immediately used, after defining domain-specific rules (see below)
  • Very fast (no specific hardware required)
  • Useful for quick PoCs, pre-annotating data in Hume Labs to simplify the work of human annotators and complementing ERE models in production (run them side-by-side)

Limitation: works only on within-sentence relations (both entities have to be present in the same sentence)

The rule-based ERE relies on a combination of NLP (Natural Language Processing, specifically universal dependencies) and its graph representation, allowing users to define generic rules which find patterns (i.e. connect entities) by running Cypher queries on the NLP graph. Two components handle the whole process:

  • PERSISTENCE / Token Occurrence Neo4j Writer: its input is the v2 output of the Entity Extraction component, i.e. the NLP annotation of the input text; this annotation is stored in a graph (a Neo4j Writer resource configured in Hume Ecosystem resources) where each node represents a token and relationships between nodes represent the grammatical relations (universal dependencies). Advice: don't pollute the main KG Neo4j database with this metadata; create a separate "helper" database for storing these graphs
  • DATASCIENCE TOOLKIT / Rule Based Relation Extraction: its input is the node_id (the output of the previous component, which is the document node ID in the NLP graph) and it is configured to use the same Neo4j Writer resource as the previous component; user-defined rules for extracting entity relations are defined in the RULES pane in JSON format and are internally translated into Cypher queries

Details about how to write the JSON ERE rules can be found in the technical part of this documentation.

# SSO for Neo4j

GraphAware Hume provides a Neo4j Security plugin that allows integration with Keycloak for Single Sign-On. In such a setup, the advantage is that users visualising data in Hume perform queries against Neo4j with their own SSO identity rather than with, for example, a shared service account.

# Installation

The installation has 3 required steps:

  1. Create a specific client in Keycloak
  2. Download, install and configure the Neo4j Security plugin
  3. Configure the Neo4j SSO Resource in Hume

# Create a Neo4j client in Keycloak

Log in to Keycloak as an administrator and select the realm for Hume.

Create a new client with the openid protocol and the confidential access type. Disable the standard flow and enable authorization. The settings should look like this:

Keycloak Client

Then click on the Save button.

On the same page, on the right side of the top menu, click on Installation. This brings you to a page where you can select the format in which to download the installation configuration for this client. Choose Keycloak OIDC JSON and download the file to your server (you will need to copy it into the ${NEO4J_HOME}/conf directory).

The JSON file should have this format:

{
  "realm": "hume",
  "auth-server-url": "https://my-keycloak-server.com/auth/",
  "ssl-required": "external",
  "resource": "neo4j-demo-sso",
  "verify-token-audience": true,
  "credentials": {
    "secret": "987721ce-a6d3-de7acb970a2c"
  },
  "use-resource-role-mappings": true,
  "confidential-port": 0,
  "policy-enforcer": {}
}

On the same page, click on the Roles tab in the menu. This is where you configure the roles users can have for this particular client.

The convention is that the role must start with the prefix NEO4J_; that prefix is removed when the user connects to the Neo4j server. The remaining part is also lowercased by the plugin, which means that if a user or group is mapped to the role NEO4J_editor, NEO4J_Editor or NEO4J_EDITOR, they will have the editor role when connecting to Neo4j.

# Download, install and configure the plugin in Neo4j

Download the plugin (ask your GraphAware point of contact for the link) and copy the jar into the ${NEO4J_HOME}/plugins directory.

Take the JSON file downloaded above and copy it into the conf directory of your Neo4j server.

Configure Neo4j (edit conf/neo4j.conf) with the following:

dbms.security.authentication_providers=plugin-keycloak-sso
dbms.security.authorization_providers=plugin-keycloak-sso

If you wish to keep Neo4j native security as well, the configuration should be the following:

dbms.security.authentication_providers=plugin-keycloak-sso,native
dbms.security.authorization_providers=plugin-keycloak-sso,native

Restart your Neo4j server.

::: note
If you use a Neo4j cluster, the procedure above should be done on every member of the cluster.
:::

# Create a Neo4j SSO Resource in Hume

In Hume's Ecosystem, create a Resource with the Neo4j SSO type, give it a name and specify the Neo4j server location:

Neo4j SSO Resource

Use it! You can now use the Neo4j SSO Resource in any perspective and visualisation.

WARNING

Do not use the SSO Resource with Orchestra; SSO is not a use case for machine processes.