# Introduction and Concepts

# What is Orchestra

Orchestra is designed to make data integration easier and more accessible. Technically, it is best described as a workflow engine that automates business processes representing the consumption, transformation, enrichment and persistence of data, be it structured or unstructured. It helps developers and data scientists focus on what they do best, and lets analysts provide insights in a matter of clicks, not days.

# What can you do with Orchestra?

You can solve a wide variety of data integration problems with Orchestra.

Here is a non-exhaustive list of common workflows:

  • Consume live news feeds, extract entities from their texts and store them as a knowledge graph in Neo4j
  • Watch for new connections between entities in a knowledge graph and alert third-party systems or people
  • Migrate data from siloed, legacy systems, and give it a better place to live in a connected data world
  • Read and write data from/to many different data sources such as S3, Azure, queue systems, relational databases and so on
  • ......

# Orchestra Architecture

# Orchestra Workflow

The core concept of Orchestra is the 'workflow'. You can think of a workflow as a sequence of actions executed by small blocks called 'components' that are linked to each other. Components are fine-grained enough that you can combine them to build workflows of varying complexity. You can learn more about components in the next sections.

# Workflow status:
  • RUNNING – The workflow is up and running
  • STOPPED – The workflow is stopped, either by a manual user action or by the engine itself if an issue occurs during startup.
  • NOT_FOUND – The workflow is not available.
  • STARTING – The workflow is going to be started.
  • STOPPING – The workflow is going to be stopped.
  • DELETING – The workflow is going to be deleted.
  • DELETED – The workflow has been deleted.
  • RESTARTING – The workflow is restarting.
  • DRYRUN – This is a special state, usually used for evaluation purposes.
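As an illustration only, the statuses above can be modelled as an enumeration in client code. This is a minimal sketch, not Orchestra's own implementation: the grouping of STARTING, STOPPING, DELETING and RESTARTING as "transitional" is an assumption drawn from their descriptions, and `is_transitional` and `parse_status` are hypothetical helper names.

```python
from enum import Enum


class WorkflowStatus(Enum):
    """Workflow statuses as listed in this section."""
    RUNNING = "RUNNING"
    STOPPED = "STOPPED"
    NOT_FOUND = "NOT_FOUND"
    STARTING = "STARTING"
    STOPPING = "STOPPING"
    DELETING = "DELETING"
    DELETED = "DELETED"
    RESTARTING = "RESTARTING"
    DRYRUN = "DRYRUN"


# Assumption: statuses described as "going to be started/stopped/deleted"
# or "restarting" are in-progress states rather than resting states.
TRANSITIONAL = frozenset({
    WorkflowStatus.STARTING,
    WorkflowStatus.STOPPING,
    WorkflowStatus.DELETING,
    WorkflowStatus.RESTARTING,
})


def is_transitional(status: WorkflowStatus) -> bool:
    """Return True while an operation on the workflow is still in progress."""
    return status in TRANSITIONAL


def parse_status(raw: str) -> WorkflowStatus:
    """Parse a status name; raises ValueError for names not listed above."""
    return WorkflowStatus(raw.strip().upper())
```

A client polling a workflow could, for example, keep waiting while `is_transitional(status)` is true and act only once a resting status is reached.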

A workflow is created by defining the following properties:

  • name (required): The workflow name
  • autoStart (optional, default: false): When this option is enabled and the workflow was in the RUNNING state, the workflow is started automatically when Orchestra restarts

NOTE:

For the autoStart feature to take effect, the corresponding system setting must be configured as well.
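The two properties and the autoStart rule can be sketched as follows. This is a hedged illustration, not Orchestra's API: the class `WorkflowDefinition`, the function `starts_after_restart` and the flag `auto_start_setting_enabled` (standing in for the system setting mentioned in the note) are hypothetical names, and the JSON shape is an assumption.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class WorkflowDefinition:
    """Hypothetical model of the properties used to create a workflow."""
    name: str                # required
    autoStart: bool = False  # optional, default: false

    def __post_init__(self) -> None:
        if not self.name or not self.name.strip():
            raise ValueError("name is required")

    def to_json(self) -> str:
        """Serialize to JSON (the exact payload shape is an assumption)."""
        return json.dumps(asdict(self))


def starts_after_restart(definition: WorkflowDefinition,
                         last_status: str,
                         auto_start_setting_enabled: bool) -> bool:
    """Apply the documented rule: autoStart is enabled, the workflow was
    RUNNING, and the system setting for the feature is configured."""
    return (
        auto_start_setting_enabled
        and definition.autoStart
        and last_status == "RUNNING"
    )
```

Under this reading, a workflow with `autoStart=True` that was STOPPED before the restart stays stopped afterwards.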

# Orchestra Persistence

Orchestra stores all of its data in JSON format on the local filesystem, so that it can recover the information it needs after a restart.

It uses 3 files for each workflow:

  • Workflow definition: it stores all the core workflow information like components, resources and connections;
  • Workflow status: it stores and updates the current status of a workflow, such as Running, Stopped, etc.;
  • Scheduler definition: it stores the workflow's scheduler information, such as start date, stop date, etc.

# Settings

To control where the files described in the previous section are stored, configure the following variables (both can have the same value):

  • orchestra.startup.workflows.directory: absolute path where workflow and scheduler files will be stored. Default value is {system tmp dir}/workflows;
  • orchestra.datasource.filesystemRoot: absolute path where workflow status files will be stored. Default value is {system tmp dir}/orchestra.

There are two different ways to deploy the application:

  • Docker
  • Standalone

When the application is deployed with Docker, add the following variables to the environment of the orchestra service in your docker-compose.yml file:

    version: "3.7"
    services:
      orchestra:
        image: ......
        ports:
          - XXXX:XXXX
        environment:
          - xxxxx=xxxxxxx
          - orchestra.datasource.filesystemRoot=/data/orchestra_db/meta
          - orchestra.startup.workflows.directory=/data/orchestra_db/workflows
        volumes:
          - xxxxxx:xxxxxx
For a standalone deployment, locate the folder containing the application jar and edit the following entries in application.properties:

    orchestra.startup.workflows.directory=${java.io.tmpdir}/workflows
    orchestra.datasource.filesystemRoot=${java.io.tmpdir}/orchestra
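To check what the two settings resolve to on a given machine, the defaulting rule can be mirrored in a few lines. This is a sketch under assumptions: Orchestra resolves `${java.io.tmpdir}` itself, so Python's `tempfile.gettempdir()` is only a stand-in for the system temporary directory, and `resolve_storage_dirs` is a hypothetical helper, not part of Orchestra.

```python
import tempfile
from pathlib import Path

WORKFLOWS_DIR_KEY = "orchestra.startup.workflows.directory"
FILESYSTEM_ROOT_KEY = "orchestra.datasource.filesystemRoot"


def resolve_storage_dirs(settings: dict) -> dict:
    """Return both storage directories, applying the documented defaults
    ({system tmp dir}/workflows and {system tmp dir}/orchestra)."""
    tmp = Path(tempfile.gettempdir())  # stand-in for ${java.io.tmpdir}
    defaults = {
        WORKFLOWS_DIR_KEY: tmp / "workflows",
        FILESYSTEM_ROOT_KEY: tmp / "orchestra",
    }
    resolved = {}
    for key, default in defaults.items():
        path = Path(settings.get(key, default))
        # Both settings are documented as absolute paths.
        if not path.is_absolute():
            raise ValueError(f"{key} must be an absolute path, got {path}")
        resolved[key] = path
    return resolved
```

Because the defaults live under the system temporary directory, which may be cleaned by the operating system, setting both variables explicitly is the safer choice when the files need to survive.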

WARNING

Safety Notes

Currently, the workflow definition files are not encrypted, so any passwords and secrets they contain can be read by anyone with access to the files. To improve security, it is strongly recommended to:

  • grant only the required access permissions (read/write) on these files
  • use environment variables for secrets (for example, the password for database access)

# Workflow Schedulers

You can associate a scheduler with each workflow to customize how and when the workflow runs and stops. A workflow can be scheduled to run once at a specified time, or to repeat at a specified interval.

To access the scheduler settings, click the Scheduling menu item.

If no scheduler is associated with the workflow, you will be invited to add a new one.

# Workflow Scheduler Configuration

Orchestra allows you to configure two kinds of scheduled actions:

  • RunOnce

This kind of action is available when you select NONE from the Repetition drop-down list. It allows you to schedule a workflow START at a specified date and time.

You can also schedule a workflow STOP event.

NOTE: You do not need to specify a Stop Date when your workflow includes a STOP COMPONENT, which stops the workflow by itself.

  • Repeatable

When you need to repeat START and STOP actions over time, you can choose one of the following frequencies:

  • EVERY HOUR: selects the hourly frequency settings.
  • EVERY DAY: selects the daily frequency settings.
  • EVERY WEEK: selects the weekly frequency settings.

The repetition can be ended by specifying either:

  • the number of executions after which it ends
  • an end date (on)
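To make the repetition rules concrete, the following sketch computes the START times produced by a repeating schedule. It is an illustration under assumptions, not Orchestra's scheduler: the function `occurrences` and its parameters `after` and `on` are hypothetical names, the end date is treated as inclusive, and exactly one of the two end conditions is required.

```python
from datetime import datetime, timedelta
from typing import List, Optional

# The three documented frequencies mapped to fixed intervals.
FREQUENCIES = {
    "EVERY HOUR": timedelta(hours=1),
    "EVERY DAY": timedelta(days=1),
    "EVERY WEEK": timedelta(weeks=1),
}


def occurrences(first_start: datetime,
                frequency: str,
                after: Optional[int] = None,
                on: Optional[datetime] = None) -> List[datetime]:
    """List START times, ending after `after` executions or on the
    (assumed inclusive) end date `on`."""
    if (after is None) == (on is None):
        raise ValueError("specify exactly one of 'after' or 'on'")
    step = FREQUENCIES[frequency]  # KeyError for an unknown frequency
    result: List[datetime] = []
    current = first_start
    while True:
        if after is not None and len(result) >= after:
            break
        if on is not None and current > on:
            break
        result.append(current)
        current += step
    return result
```

For example, a schedule starting on 1 January at 09:00 with EVERY HOUR and `after=3` yields starts at 09:00, 10:00 and 11:00 on that day.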

If one or more date and time settings overlap, the system shows a message describing the invalid configuration.
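The kind of check behind such a message can be sketched as an interval-overlap test. This is an assumption about what "overlap" means here: each scheduled run is treated as a START to STOP window, a window that starts exactly when another stops is not counted as overlapping, and `find_overlaps` is a hypothetical helper, not Orchestra's validation code.

```python
from datetime import datetime
from typing import List, Tuple

Window = Tuple[datetime, datetime]  # (start, stop)


def find_overlaps(windows: List[Window]) -> List[Tuple[Window, Window]]:
    """Return every pair of windows that share some time."""
    for start, stop in windows:
        if stop <= start:
            raise ValueError(f"stop {stop} must be after start {start}")
    ordered = sorted(windows)  # sort by start, then stop
    conflicts = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[0] >= first[1]:
                # Sorted by start: no later window can overlap `first`.
                break
            conflicts.append((first, second))
    return conflicts
```

An empty result means the configuration is consistent; any returned pair identifies the two windows that would need to be adjusted.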

You can define multiple schedulers for the same workflow.

NOTE: Orchestra handles time zones automatically, so this is completely transparent from the user's perspective.