
Apache Hop

Inspiring technology by Hunters

Today our Hunters discuss Apache Hop (Hop Orchestration Platform), an open-source project started by Matt Casters and Bart Maertens and rooted in the Pentaho Data Integration community. Hop is a data orchestration platform that uses data processing pipelines and workflows to run ETL processes.

With this tool, developers design data workflows visually, without writing any code, so they can focus on what they need to do rather than how to do it.

What are Apache Hop’s functionalities?

The user designs workflows and pipelines in the Apache Hop interface; the Hop engine then receives these definitions and executes them through plugins.

All Hop functionality is integrated via plugins. It natively includes more than 400 plugins to design pipelines and/or workflows, process both small and large volumes of data locally, and run jobs in different environments (external servers, cloud, clusters...). Hop also integrates natively with Apache Beam, which lets jobs run on Apache Spark, Apache Flink or Google Dataflow.

Hop is made up of command-line tools, each providing a different function:

  • Hop-gui: Opens the UI to create workflows, pipelines, manage projects, configurations and explore functionalities.
  • Hop-server: Starts a server to run workflows or pipelines.
  • Hop-run: Runs workflows and pipelines, specifying the project configuration, environment and/or properties as parameters (see the sketch after this list).
  • Hop-conf: Manages configuration related to environment, project and execution variables, as well as third-party plugins.
  • Hop-search: Searches for metadata within a project directory.
  • Hop-import: Imports third-party files, such as Kettle/PDI jobs and transforms, and converts them to the Apache Hop format.
  • Hop-encrypt: Encrypts plain text passwords for use in XML files or within different interface components.

Concepts

Hop developers can create workflows and pipelines in Hop Gui. These processes can be run natively on Hop (locally or remotely) or on external platforms (Apache Spark or Google Dataflow) thanks to Apache Beam.

These are the key concepts in Apache Hop:

Pipelines are collections of transforms, while workflows are sequences of actions.

Pipelines are used to read data from different sources (databases, text files, REST services, etc.), process data (adding, transforming, deleting rows or columns) and export data in multiple formats. Meanwhile, workflows are used to run sequential tasks.

Transforms are the building blocks of a pipeline and do the actual work on the data: enriching, cleansing, filtering, etc. Each transform processes data and produces an output that becomes the input of the next transform in the pipeline. Transforms in a pipeline normally run in parallel.

Figure 1: Reading an Excel file and performing operations on it.

In the example in Figure 1, an Excel file is read and a transform (a filter or switch) splits the data into different subsets. A different operation is then performed on each output: export to an Excel file, POST to an HTTP server and save to MongoDB.
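
To make the data flow in Figure 1 concrete, here is a purely conceptual Java sketch. It is not Hop's API (in Hop you build this visually and each transform runs in its own thread); the field names and the two output branches are illustrative stand-ins for Hop's Excel, REST and MongoDB output transforms.

import java.util.List;
import java.util.Map;

public class PipelineConcept {
    public static void main(String[] args) {
        // "Read from Excel" transform, stubbed here with in-memory rows.
        List<Map<String, Object>> rows = List.of(
                Map.of("customer", "Alice", "amount", 120),
                Map.of("customer", "Bob", "amount", -5));

        // "Filter/switch" transform: each row is routed to one of the output branches.
        for (Map<String, Object> row : rows) {
            boolean positive = ((Integer) row.get("amount")) >= 0;
            if (positive) {
                exportToExcel(row);  // e.g. write the row to an Excel output file
            } else {
                postToHttp(row);     // e.g. POST the row to an HTTP server
            }
        }
    }

    // Stand-ins for Hop's output transforms.
    static void exportToExcel(Map<String, Object> row) { System.out.println("Excel <- " + row); }
    static void postToHttp(Map<String, Object> row)    { System.out.println("HTTP  <- " + row); }
}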

Actions are the operations performed in a workflow. They return a true or false exit code that can be used to drive the workflow. Normally, actions within a workflow are executed sequentially, though they can also be run in parallel.

Figure 2: Sequencing pipelines, sending an email and packaging data into a ZIP file.

The example in Figure 2 shows a sequence of actions: executing pipelines, sending an email and exporting data to a ZIP file.
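
The true/false result of each action is what drives the workflow. Purely as a conceptual illustration (again, not Hop's API; all names are made up), a sequential workflow can be sketched like this, stopping as soon as an action fails:

import java.util.function.BooleanSupplier;

public class WorkflowConcept {
    public static void main(String[] args) {
        // Each "action" returns true on success or false on failure.
        BooleanSupplier runPipeline = () -> { System.out.println("pipeline executed"); return true; };
        BooleanSupplier zipOutput   = () -> { System.out.println("output zipped");     return true; };
        BooleanSupplier sendMail    = () -> { System.out.println("mail sent");         return true; };

        // Actions run sequentially; a false exit code stops the rest of the workflow.
        for (BooleanSupplier action : new BooleanSupplier[] { runPipeline, zipOutput, sendMail }) {
            if (!action.getAsBoolean()) {
                System.out.println("Action failed, stopping the workflow");
                break;
            }
        }
    }
}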

Projects are sets of pipelines, workflows, metadata, variables and configurations. Within the same project, configurations are separated by environment; by default there are development, test and production environments.

What can you do with Apache Hop?

Apache Hop, in addition to running simple jobs through pipelines or workflows, allows you to:

  • Load and process big data leveraging cloud environments and parallel processing.
  • Populate databases to carry out tests with other tools.
  • Migrate data between applications or between relational and non-relational databases.
  • Clean up and profile data.

Comparison with Pentaho Data Integration and Spring Cloud Data Flow

Pentaho Data Integration

Apache Hop is similar in concept to Pentaho Data Integration. However, Pentaho has no concept of a project, so you cannot create per-environment variables or scripts to configure projects. Shared objects (metadata and configurations) are defined in a shared.xml file in Pentaho, whereas in Hop they are split into subfolders within the project, which makes them easier to structure and their configuration easier to debug.

The Apache Hop engine allows you to run jobs externally via Apache Beam, which is not possible in Pentaho.

Hop was born in 2019 when it split from Pentaho Data Integration, and the two have since evolved into projects with different goals and priorities. Because Hop shares its origins with Pentaho, it can import jobs, transforms, database configurations and properties from Pentaho Data Integration.

The main difference between the two is that only Hop lets you manage projects and environment configurations from the interface itself, as well as search for information within a project or configuration. Hop can also be configured and run from both the command line and the interface, whereas in Pentaho this can only be done through the interface.

In terms of code, Apache Hop is more optimised than Pentaho Data Integration, which has advantages:

  1. Lower resource consumption.
  2. Reduced loading and execution times.
  3. Ease of deployment within a Docker container.

Spring Cloud Data Flow

Spring Cloud Data Flow (SCDF) is a tool for building streaming or batch data processing pipelines based on microservices.

Like Apache Hop, it can be deployed locally, in a Docker container or on a Kubernetes cluster. SCDF provides a web console to view and manage pipelines.

It supports two types of processing:

  • Stream processing: Execution of tasks within a pipeline that communicate through events. This concept is similar to Apache Hop pipelines, where data is processed via transforms (a minimal sketch of such a processor follows this list).
  • Batch processing: ETL microservices that write their log output to a file for monitoring.
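
To give a feel for the stream-processing style, the sketch below shows a Spring Cloud Stream processor of the kind SCDF composes into a streaming pipeline. The class and function names are illustrative, and the Spring Boot and Spring Cloud Stream dependencies are assumed to be on the classpath.

import java.util.function.Function;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
public class UppercaseProcessorApplication {

    public static void main(String[] args) {
        SpringApplication.run(UppercaseProcessorApplication.class, args);
    }

    // When deployed in an SCDF stream, the function's input and output are bound
    // to the messaging middleware (e.g. Kafka or RabbitMQ), so events flow through
    // it much like rows flow through a Hop transform.
    @Bean
    public Function<String, String> uppercase() {
        return payload -> payload.toUpperCase();
    }
}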

Watch this Apache Hop demo!

Want to know more about Hunters?

A Hunter rises to the challenge of trying out new solutions, delivering results that make a difference. Join the Hunters programme and become part of a diverse group that generates and transfers knowledge. Anticipate the digital solutions that will help us grow. Find out more about Hunters on our website.

Tahir Farooq

Software Technician at Altia