# Architecture

# Introduction

# Purpose

This section provides a comprehensive architectural overview of the GraphAware Hume Ecosystem.

# Scope

This section is intended to provide a high-level overview of the components forming the GraphAware Hume Ecosystem. It also provides recommendations for hardware sizing, a component dependency matrix, and areas where the IT operations team should maintain security awareness.

# Definitions

Microservice - A single application providing a specific capability. Microservices communicate with each other over the HTTP/HTTPS protocol and a REST API.

Hume Ecosystem - A set of all microservices, a subset of which forms an architecture that addresses a set of use cases or business requirements. Also referred to as GraphAware Hume or just Hume.

View - A composition of multiple microservices forming the minimal architecture required for Hume to serve a given set of business requirements.

Data Source - An external resource or software system which Hume can connect to in order to read/write data.

NLP - Natural Language Processing.

# View 1 - Data Analysis

# Description

This view is the minimal view. It is intended for use cases where Hume is used only for data analysis purposes. The data to be analysed is stored in a Neo4j database.

# Architectural Diagram

Data Analysis - Architectural Diagram

# Components

# UI Application

This component acts as the frontend application of Hume. It provides users with a web based interface from which they can manage the application, describe data integration and enrichment workflows visually, as well as analyse the data as graph visualisations or charts.

# Specifications
  • Written in JavaScript with the VueJS framework
  • The distribution build is a set of HTML, CSS and JavaScript files
  • Must be served by a web server (e.g. NGINX, Apache HTTP Server)
# Commercial dependencies
  • KeyLines (https://cambridge-intelligence.com/keylines/) - GraphAware has a commercial OEM license with Cambridge Intelligence for the distribution of KeyLines with Hume. There is no commercial or legal impact of this dependency on Hume end users / customers.
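To illustrate the web-server requirement above, the distribution build could be served with an NGINX configuration along the following lines (the domain name and filesystem path are placeholders, not Hume defaults):

```nginx
server {
    listen 80;
    server_name hume.example.com;

    # Serve the static distribution build (HTML, CSS, JavaScript)
    root /opt/hume/ui/dist;
    index index.html;

    location / {
        # Fall back to index.html for client-side (VueJS) routing
        try_files $uri $uri/ /index.html;
    }
}
```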

# Core API

This component provides the API for the UI application. It holds the business logic of the application. It also provides a bridge between the UI application - specifically the graph visualisation - and the Neo4j graph database.

# Specifications
  • Written in Java with the Spring Boot framework
  • Embedded Tomcat web server
  • Requires Java 11
  • Stores and retrieves data from a PostgreSQL database
# Commercial dependencies
  • NIL

# Neo4j

This component is not provided by GraphAware Hume. However, connecting Hume to a Neo4j graph database is a requirement if users want to analyse data as graphs.

# Specifications
  • GraphAware Hume is compatible and tested with Neo4j 3.5.12 and higher, as well as Neo4j 4.0.0 and higher
# Commercial dependencies
  • The end user / customer must have a subscription to Neo4j Enterprise Edition

# View 2 - Data Integration and Enrichment

# Description

This view is intended for use cases where users want to leverage the data integration and enrichment capabilities of GraphAware Hume. This view includes all the components from View 1.

# Architectural Diagram

Data Analysis - Architectural Diagram with Orchestration

# Components

All components from View 1 are required for View 2.

# Orchestra API

This component serves as an API for the orchestration of various data integration and enrichment workflows. The service enables the configuration of workflows and manages them. Each workflow generally consumes data from a Data Source, processes it with some business logic and writes the processed data to another Data Source (the source and target Data Source can be the same).

# Specifications
  • Written in Java with the Spring Boot framework
  • Embedded Tomcat web server
  • Requires Java 11
  • Embedded Apache Camel 3.1
# Commercial dependencies
  • NIL
# Additional Notes
  • The Orchestra API service does not have a dependency on the components from View 1.
  • The Orchestra API service itself can run on its own and be configured with a JSON file.

# View 3 - SSL

# Description

This view is intended for use cases where communication amongst all components must be done using a secure network layer protocol.

# Architectural Diagram

Data Analysis - Architectural Diagram with Orchestration and TLS

# Components

All the components from View 1 and View 2

# Specifications

  • The components support and are tested with TLS 1.2+ only
  • Versions below TLS 1.2 are possible but require additional consulting services
  • The TLS configuration for the UI application is made at the NGINX web server level
  • The TLS configuration for Spring Boot applications is made at the embedded Tomcat web server level
  • A self-signed or provided certificate can be used
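As a sketch of the Spring Boot side, the embedded Tomcat TLS configuration is typically expressed through standard `server.ssl.*` properties; the keystore path and password below are placeholders:

```properties
# application.properties - enable TLS 1.2+ on the embedded Tomcat web server
server.port=8443
server.ssl.enabled=true
server.ssl.key-store=/etc/hume/keystore.p12
server.ssl.key-store-type=PKCS12
server.ssl.key-store-password=changeit
server.ssl.enabled-protocols=TLSv1.2,TLSv1.3
```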

# Commercial dependencies

  • NIL

# View 4 - Natural Language Processing Capabilities

# Description

This view is intended for use cases where textual data has to be analysed, and structured information extracted from it. For example, extracting mentions of people’s names, locations or domain specific entities.

# Architectural Diagram

# Components

# Service Gateway

The purpose of this component is to provide a simple yet effective way to route to NLP microservices and handle cross-cutting concerns such as security, monitoring/metrics, and resiliency.

# Specifications
  • Written in Java with the Spring Boot framework
  • Embedded Netty web server
  • Requires Java 11
  • Supports TLS configuration as described in View 3
  • Embedded Spring Cloud Gateway (https://spring.io/projects/spring-cloud-gateway)
# Commercial dependencies
  • NIL

# Entity Extraction Service

This component provides a Machine Learning based framework for extracting mentions of names representing real-world things. The component receives a piece of text as input and outputs a list of mentions for each type of entity the algorithm has been trained for. Examples include Persons and Locations, as well as custom entity types such as Weapons, Car Manufacturers, etc.

# Specifications
  • Written in Java with the Spring Boot framework
  • Embedded Netty web server
  • Requires Java 11
  • Supports TLS configuration as described in View 3
# Commercial dependencies
  • NIL
# Open source dependencies

This microservice embeds Stanford CoreNLP, which is licensed under GNU GPL v3. For this reason, this microservice alone is licensed under GNU GPL v3. The complete and corresponding source code for this microservice can be requested from GraphAware and will be supplied electronically free of charge.

# Additional Notes

Analysing textual data can be resource-intensive. The time and resources needed to analyse a piece of text are generally linearly proportional to the length of the text itself.

# Keyword Extraction Service

This component offers the ability to extract the most relevant keywords or keyphrases from a given piece of text. This is a proprietary algorithm developed by GraphAware; more information about how it works can be found here: https://graphaware.com/neo4j/2017/10/03/efficient-unsupervised-topic-extraction-nlp-neo4j.html

# Specifications
  • Written in Java with the Spring Boot framework
  • Embedded Netty web server
  • Requires Java 11
  • Supports TLS configuration as described in View 3
# Commercial dependencies
  • NIL

# View 5 - Externalised Identity And Access Management

# Description

This view is intended for use cases where Hume's basic authentication mechanism is not sufficient for security reasons in enterprise production scenarios. To address this requirement, Hume natively integrates with Keycloak, a single sign-on, identity and access management product sponsored by Red Hat.

# Architectural Diagram

# Components

# Keycloak

Keycloak is an open source (Apache License 2.0) software product that provides single sign-on with identity management and access management, aimed at modern applications and services. As of March 2018, this JBoss community project is under the stewardship of Red Hat, who use it as the upstream project for their RH-SSO product.

# Specifications
  • Written in Java
  • Supports Java 11
  • Supports TLS configuration as described in View 3
  • User Federation support with LDAP and Kerberos
  • Social Login support for Google, Github, etc
# Commercial dependencies
  • NIL

# View 6 - Alerting

# Description

This view describes the components used to run the alerting (subscriptions) capability.

# Architectural Diagram

# Components

  • Alerting Controller
  • Alerting Operator
# Specifications
  • Written in Java
  • Supports Java 11
  • Supports TLS configuration as described in View 3
  • The alerting-operator component is not a web application and is thus not exposed via HTTP
# Commercial dependencies
  • NIL

# Minimum Hardware Requirements

For every component, the recommended minimum sizing is provided as well as a level where:

  • LOW = very small footprint
  • MEDIUM = normal to high usage footprint
  • HIGH = data-intensive and resource-hungry

Note that the recommended memory and cores must be summed up if multiple components run on a single physical or virtual machine.
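As a worked example of this additive rule, co-locating the Core API (4 GB, up to 4 cores) and the Orchestra API (6 GB, up to 6 cores) on one virtual machine calls for roughly 10 GB of memory and 10 cores. A small sketch using the figures from the sections below:

```python
# Recommended minimum sizing per component, taken from the tables in this
# document: (memory in GB, upper bound of the recommended cores).
SIZING = {
    "ui": (0.05, 1),             # 50 MB
    "core_api": (4, 4),
    "orchestra_api": (6, 6),
    "service_gateway": (2, 2),
    "service_registry": (2, 2),
    "entity_extraction": (8, 8),
    "keyword_extraction": (2, 2),
}

def machine_requirements(components):
    """Sum memory and cores for components co-located on one machine."""
    memory = sum(SIZING[name][0] for name in components)
    cores = sum(SIZING[name][1] for name in components)
    return memory, cores

# Core API + Orchestra API on a single VM
print(machine_requirements(["core_api", "orchestra_api"]))  # (10, 10)
```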

# UI application

LEVEL = LOW

Recommended memory: 50 MB
Recommended cores: 1

# Core API

LEVEL = MEDIUM

Recommended memory: 4 GB
Recommended cores: 2-4

# Orchestra API

LEVEL = MEDIUM

Recommended memory: 6 GB
Recommended cores: 4-6

# Service Gateway

LEVEL = LOW

Recommended memory: 2 GB
Recommended cores: 2

# Service Registry

LEVEL = LOW

Recommended memory: 2 GB
Recommended cores: 2

# Entity Extraction Service

LEVEL = HIGH

Recommended memory: 8 GB
Recommended cores: 6-8

# Keyword Extraction Service

LEVEL = MEDIUM

Recommended memory: 2 GB
Recommended cores: 2

Note: Every deployment and use case is different. GraphAware can provide recommendations for sizing upon request based on particular requirements.


# Components Dependency Matrix

|           | UI app | Core API | Orchestra | Gateway | Discovery | ER | KE |
|-----------|--------|----------|-----------|---------|-----------|----|----|
| UI app    | -      | x        | o         |         |           |    |    |
| Core API  |        | -        | o         | o       |           |    |    |
| Orchestra |        |          | -         | o       | o         | o  | o  |
| Gateway   |        |          |           | -       | x         |    |    |
| Discovery |        |          |           |         | -         |    |    |
| ER        |        |          |           |         |           | -  |    |
| KE        |        |          |           |         |           |    | -  |

# Deployment strategies

GraphAware Hume is deployable using the following two strategies.

# Docker

GraphAware provides Docker images for each of the microservices presented in this document. Deployments can be made as single-image deployments, or as composed application deployments with docker-compose.
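A composed deployment can be sketched with docker-compose as below. The image names, tags and ports are illustrative placeholders, not the actual GraphAware registry coordinates:

```yaml
# docker-compose sketch - image names and ports are illustrative only
version: "3.7"
services:
  hume-ui:
    image: graphaware/hume-ui:latest
    ports:
      - "8081:80"
  hume-core-api:
    image: graphaware/hume-core-api:latest
    ports:
      - "8080:8080"
    depends_on:
      - postgres
  hume-orchestra:
    image: graphaware/hume-orchestra:latest
  postgres:
    image: postgres:12
    environment:
      POSTGRES_DB: hume
      POSTGRES_PASSWORD: changeit
```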

# Java Archive Files

Since all the Java components share the same base framework, Hume can be installed using a "vanilla" strategy, where the responsible IT operations team has to:

  • Setup the web server for the frontend

  • Create systemd configurations for the lifecycle management of every java application

    GraphAware has successfully tested this method on a RHEL 7 distribution with OpenJDK 11.
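As a sketch, a systemd unit for one of the Java services could look like the following; the paths, user and JVM options are assumptions, not shipped defaults:

```ini
# /etc/systemd/system/hume-core-api.service - illustrative unit file
[Unit]
Description=GraphAware Hume Core API
After=network.target postgresql.service

[Service]
User=hume
ExecStart=/usr/bin/java -Xmx4g -jar /opt/hume/core-api.jar
Restart=on-failure

[Install]
WantedBy=multi-user.target
```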


# Security Awareness

GraphAware Hume provides insights to users based on the results of the integration, enrichment, and processing of data from the Data Sources it is configured to connect to.

Running workflows involves reading data from configured data sources, processing this data (e.g. extracting named entities from text), and storing the processed results in the Neo4j graph database, so that end users can visualise insights produced by this process.

A user initiating a workflow therefore touches areas of the deployment where security must be taken care of.

The following diagram illustrates how potentially sensitive information can flow between the components during a normal lifecycle.

Potential flow of sensitive data

The following sections list these areas and provide, for each of them, a non-exhaustive list of security risk mitigation strategies.

# Authentication

By default, Hume is activated with a native basic authentication mechanism using JSON Web Tokens (JWT).

Mitigation measures

  • Run a safe default configuration, such as using the Externalised Identity and Access Management view with 2FA enabled
  • Use strong password encryption algorithms (e.g. Hume can be configured with FIPS-compliant encryption algorithms)
  • Ensure TLS is enforced on the network protocol layer

# Data Source Configuration

When GraphAware Hume is configured to read from and write to a Data Source, the connection information has to be stored somewhere. GraphAware Hume stores this information in the PostgreSQL database.

When workflows are initiated from the user interface, the configuration itself is communicated to the Orchestra API in the HTTP(S) request.

Mitigation measures

  • Encrypt the PostgreSQL data at rest
  • Ensure a TLS connection between the Core API and the PostgreSQL server
  • Ensure a TLS connection between the Core API and the Orchestra API

# Sensitive Textual Documents

The content of textual documents to be analysed can be sensitive as well. Based on the diagram above, a piece of text will have to flow between various components before the results are stored in the graph database.

Mitigation measures

  • Ensure a TLS connection between all the components

# Repudiation

Any event related to a security aspect is logged in an audit log file. More advanced configurations, such as leveraging Neo4j's logging for insight delivery aspects, can also be used.

# Availability

Various reasons can cause a component to become unavailable. To mitigate this, any component mentioned in the views in this document can be scaled independently to offer a highly available system. Components external to the system are also chosen with availability in mind; Keycloak, for example, can be configured as a highly available cluster.

GraphAware is committed to supporting IT operations teams by providing all the knowledge and information necessary to deploy and configure Hume in a secure environment.

# Audit and Logging

This section describes the components of the ecosystem where audit logging is in place. It covers the following types of events, which are described later in this section:

  • Authentication events
  • Authorization events
  • Resource configuration events
  • Visualisation events
  • Orchestration workflow events

# Authentication

Authentication

Independently of the authentication provider used, every request from the frontend application (or any other request) to the API first hits the embedded web server (Tomcat).

The request is then filtered by the Spring Security layer which, depending on the configuration, uses the appropriate authentication manager to validate the token and authenticate the request.

The authentication is stateless, meaning that authentication happens for every request. Both supported providers (Keycloak and native) are based on JSON Web Tokens (JWT). Types of events logged:

  • Authentication success
  • Authentication failure
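The stateless, per-request token validation described above can be sketched as follows. This is an illustrative example using only the Python standard library and an HS256-signed token; it is not Hume's actual Spring Security filter:

```python
import base64
import hashlib
import hmac
import json
import time

def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the stripped '=' padding."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))

def verify_jwt(token: str, secret: bytes) -> dict:
    """Validate an HS256-signed JWT; return its claims or raise ValueError."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    signing_input = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    # Authentication failure: signature does not match
    if not hmac.compare_digest(expected, b64url_decode(signature_b64)):
        raise ValueError("invalid signature")
    claims = json.loads(b64url_decode(payload_b64))
    # Authentication failure: token expired
    if claims.get("exp", 0) < time.time():
        raise ValueError("token expired")
    # Authentication success: no server-side session is consulted
    return claims
```

Because no server-side session is kept, this check runs on every single request, which is exactly what makes the authentication stateless.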

# Authorization

Authorization

Any request, except the login attempt in native authentication mode, is subject to authorization.

Permissions to perform an action are based on Hume's Role Based Access Control mechanism, and the authorization logic is as follows:

  1. The JWT contains the Principal's roles
  2. The Security layer of the API populates the Security Context with the Principal's details (username, uuid, …) as well as its roles
  3. Once the request hits the Service Layer method, the PermissionEvaluation check is performed, loading the permissions of the role from the database and verifying whether the Principal has any role with granted permission to execute the method
  4. A successful response is returned, or an AccessDenied exception is thrown
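The steps above can be sketched as follows; the role and permission names are hypothetical, and in Hume the role-to-permission mapping is loaded from the database rather than hard-coded:

```python
# Hypothetical role-to-permission mapping; in Hume this is loaded from the
# database during the PermissionEvaluation check (step 3 above).
ROLE_PERMISSIONS = {
    "ADMINISTRATOR": {"RESOURCE_CREATE", "RESOURCE_READ", "RESOURCE_DELETE"},
    "ANALYST": {"RESOURCE_READ"},
}

class AccessDenied(Exception):
    """Raised when no role of the Principal grants the required permission."""

def check_permission(principal_roles, permission_name):
    """Return True if any of the Principal's roles grants the permission."""
    for role in principal_roles:
        if permission_name in ROLE_PERMISSIONS.get(role, set()):
            return True   # -> AccessGranted event
    raise AccessDenied(permission_name)   # -> AccessDenied event

# A Principal whose JWT carried only the ANALYST role:
assert check_permission(["ANALYST"], "RESOURCE_READ")
```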

The following audit events are logged:

  • AccessRequest
  • AccessGranted
  • AccessDenied

All events contain additional information such as:

  • timestamp
  • principalName
  • permissionName
  • objectId (if requesting a specific object)

# Resource Configuration

This is probably the most important aspect for Hume implementers. For Hume to connect to Neo4j for visualisations, or to external data sources in an Orchestra workflow, the administrator has to configure the connection to those systems.

Such a connection is called a Resource in Hume terminology and will often contain credential information, such as database usernames and passwords.

Create Resource process

An administrator, or a user with a role granting the permission to create a Resource, will need to provide the credentials in the user interface.

Those credentials are then sent to the API, which stores them in the Postgres database for later retrieval.

It is highly recommended to use TLS between the frontend application and the Hume API so that values are encrypted in transit. Hume does not currently have a mechanism for storing the values in an encrypted format in the Postgres database.

Depending on their security requirements, implementers can rely on using TLS between the Hume API and the Postgres server, as well as on encrypting the database at rest.

The following audit events are logged:

  • ResourceCreate
  • ResourceUpdate
  • ResourceDelete
  • ResourceAccess
  • ResourceCredentialsAccess (in the case a resource contains a field of type "PASSWORD")

Only users with roles ADMINISTRATOR, SYSTEM_INTEGRATOR or RESOURCE_MANAGER have the permissions to perform operations on the Resources Configuration.

Note that this aspect only concerns creating configurations, which can then be used in other components of Hume. Before other users can make use of such resources, the resource manager has to grant access to them to roles.

When granting access, the following audit events are logged:

  • ResourceAccessPermissionGranted
  • ResourceAccessPermissionRevoked

Both events include the following information:

  • principalName : the resource manager, system integrator or administrator username
  • resourceId : the id of the resource
  • resourceType : the type of the resource
  • role : the role to which this resource is granted
  • permissionName : the permission granted to the role on the resource
  • timestamp

# Resource Usage

# Visualisation

Resource Usage - Visualisation

When Hume users visualise graph data from Neo4j, they first need to search for data.

To do so, they have to be granted permission on the Neo4j resource (via their roles). They must also have the necessary permissions to access the Knowledge Graph and to view or create a visualisation on the specific Neo4j resource.

Hume Knowledge Graphs are based on a Schema, which means that the node labels on which the search query runs are predefined and cannot be specified by the user.

This also means that search terms are never part of the Neo4j query text itself, but are always passed as a parameter of such a query.

Let’s assume a schema with Person, Location and Company. If the user searches for “John”, Hume will generate one query per Class in the schema.

Without full-text search enabled:

MATCH (n:Person) WHERE n.firstName CONTAINS $query OR n.lastName CONTAINS $query RETURN n LIMIT 20

The same applies for Location and Company.

With full-text search enabled:

CALL db.index.fulltext.queryNodes('Person', $query)

The result of such a query is appended to the “Canvas” of the visualisation data in the Postgres database and then returned to the user for visualisation.
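The schema-driven query generation described above can be sketched as follows. The schema itself, and the searchable properties assumed for Location and Company, are illustrative assumptions:

```python
# Assumed schema: node labels mapped to their searchable properties.
# The Person properties match the document; the others are guesses.
SCHEMA = {
    "Person": ["firstName", "lastName"],
    "Location": ["name"],
    "Company": ["name"],
}

def build_search_queries(schema):
    """Generate one parameterised Cypher query per class in the schema.

    The user's search term is only ever bound to the $query parameter;
    it is never concatenated into the Cypher text itself.
    """
    queries = []
    for label, properties in schema.items():
        predicate = " OR ".join(f"n.{p} CONTAINS $query" for p in properties)
        queries.append(f"MATCH (n:{label}) WHERE {predicate} RETURN n LIMIT 20")
    return queries

for query in build_search_queries(SCHEMA):
    print(query)
```

Keeping the search term as a bound parameter is what prevents users from injecting arbitrary Cypher through the search box.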

During a search, the following audit events are logged:

  • VisualisationSearchQuery
  • VisualisationSearchQueryTimeout

Both events contain the following additional information:

  • principalName
  • knowledgeGraphId
  • resourceId
  • visualisationId

Note that the content of the query itself is not logged; instead, the implementer can take advantage of Neo4j query logging. Hume adds the principalName to the metadata of the transaction performing the query against the Neo4j server, which is reflected in Neo4j's query logs (see https://neo4j.com/docs/api/java-driver/current/org/neo4j/driver/TransactionConfig.Builder.html#withMetadata-java.util.Map-).

# Actions

Actions are pre-defined Cypher queries that users can trigger in a visualisation. Actions can either write to or read from the database, hence it is important that action usage is granted to the right roles.

The content of the actions themselves is stored in Postgres.

Running Actions

Resource Usage - Actions

When the user triggers an action, a transaction is created whose Cypher statement is the content of the action, optionally with additional information (the selected node or relationship).

The logging of the Cypher queries in Neo4j's logs follows the same principle as in the Visualisation section above.

The following audit events are logged for actions:

  • KnowledgeGraphActionCreated
  • KnowledgeGraphActionUpdated
  • KnowledgeGraphActionDeleted
  • KnowledgeGraphActionTriggeredWrite
  • KnowledgeGraphActionTriggeredRead

# Workflows

Workflows are meant to perform data processing, transformation, enrichment and persistence.

They will generally trigger a process that will :

  1. Read data from a data source
  2. Process the data ( for example extracting entities )
  3. Create a given Cypher query
  4. Write the transformation to Neo4j

Workflows

Some important aspects:

  1. The Cypher queries are given by the user when creating the workflow
  2. Workflows often need Resource credentials to be able to connect to external resources
  3. Resource credentials are passed over the network to the Orchestra service
  4. The Orchestra service can write to external systems

The following audit events are logged:

  • WorkflowStart
  • WorkflowStop
  • WorkflowRestart
  • WorkflowFailed

The writes to Neo4j can benefit from the same transaction metadata capabilities as described in the Visualisation section above, with the difference that the workflowId, rather than the principalName, is set as the executor of the queries.