A Spring Cloud tool for Streaming and Batch data processing using scheduled tasks.
The aim is to deploy a series of microservices on a platform (such as Kubernetes) that together form a data processing pipeline.
Spring Cloud Data Flow is a platform, built with Spring Boot and Spring Cloud, for creating data processing pipelines. Among other things, it offers:
- Local deployment for testing pipelines, either as a regular application or as a Docker container.
- Cloud deployment in a Kubernetes cluster, either with Helm or as a regular application. It interacts with Kubernetes to deploy scheduled tasks as cron jobs, pulling the images for the job containers from a Docker registry.
- A web console to view and manage pipelines.
- Integration with OAuth security platforms.
- Integration with CI/CD platforms through its REST API.
- Official support for Kubernetes and Cloud Foundry, plus community projects that support OpenShift, Nomad and Apache Mesos.
Spring Cloud Data Flow has two main components:
Stream processing: Used for real-time ETL processing, based on executing tasks within a pipeline in which each stage is an independent application and the stages communicate through events (Apache Kafka or RabbitMQ).
There are three types of nodes:
- Source: Nodes that react to an event and initiate processing.
- Processor: Nodes that transform and process the events emitted by the source nodes.
- Sink: End nodes that receive data from the processors and produce the final output.
Figure 1: Diagram of a Stream.
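The three node roles can be sketched in plain Java. This is only an in-process illustration, assuming hypothetical sample events and the class name `StreamSketch`; it uses no broker and no Spring dependency. It leans on the fact that Spring Cloud Stream's functional model maps the source, processor and sink roles onto `Supplier`, `Function` and `Consumer`, so the same shapes are used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Plain-Java sketch of the three node roles (hypothetical names and data).
class StreamSketch {

    // Source: produces the events that start the pipeline.
    static Supplier<List<String>> source() {
        return () -> List.of("order:1", "order:2", "order:3");
    }

    // Processor: transforms each event it receives.
    static Function<String, String> processor() {
        return event -> event.toUpperCase();
    }

    // Sink: end node; here it only collects the results into a list.
    static Consumer<String> sink(List<String> target) {
        return target::add;
    }

    // Wires source -> processor -> sink inside one JVM. In a deployed stream,
    // each stage is a separate application and a broker carries the events.
    static List<String> runPipeline() {
        List<String> results = new ArrayList<>();
        Function<String, String> processor = processor();
        Consumer<String> sink = sink(results);
        for (String event : source().get()) {
            sink.accept(processor.apply(event));
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(runPipeline()); // prints [ORDER:1, ORDER:2, ORDER:3]
    }
}
```

In a real stream, each of these three functions would live in its own Spring Boot application, and the `for` loop would be replaced by the message broker.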
Batch processing: In this case, the developed microservices must produce execution results for monitoring. These microservices are developed with Spring Cloud Task and Spring Batch. The basic batch entities are:
- Task: A short-lived process executed in a microservice.
- Job: An entity composed of one or more steps that encapsulates a batch process.
- Step: A domain object that encapsulates a sequential phase of a batch job. A chunk-oriented step has an ItemReader, an ItemWriter and, optionally, an ItemProcessor.
Figure 2: Diagram of a Batch.
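The read, process and write cycle of a step can be illustrated with a minimal plain-Java sketch. The `Reader` and `Writer` interfaces, the `runStep` method and the chunk size below are hypothetical stand-ins chosen for this example; they are not the Spring Batch interfaces, which have richer contracts (transactions, restartability, listeners).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Plain-Java sketch of a chunk-oriented step (hypothetical types, not Spring Batch).
class ChunkStepSketch {

    interface Reader<T> { T read(); }                   // returns null when input is exhausted
    interface Writer<T> { void write(List<T> chunk); }  // receives one chunk at a time

    // Reads and processes items one by one, writes them in chunks of chunkSize.
    // Returns the number of items written.
    static <I, O> int runStep(Reader<I> reader, Function<I, O> processor,
                              Writer<O> writer, int chunkSize) {
        int written = 0;
        List<O> chunk = new ArrayList<>();
        I item;
        while ((item = reader.read()) != null) {
            chunk.add(processor.apply(item));
            if (chunk.size() == chunkSize) {
                writer.write(new ArrayList<>(chunk)); // hand over a copy of the full chunk
                written += chunk.size();
                chunk.clear();
            }
        }
        if (!chunk.isEmpty()) {                       // flush the last, partial chunk
            writer.write(new ArrayList<>(chunk));
            written += chunk.size();
        }
        return written;
    }

    // Example run: five integers, doubled, written in chunks of two.
    static List<List<Integer>> demo() {
        Iterator<Integer> input = List.of(1, 2, 3, 4, 5).iterator();
        List<List<Integer>> chunks = new ArrayList<>();
        Reader<Integer> reader = () -> input.hasNext() ? input.next() : null;
        Function<Integer, Integer> doubler = n -> n * 2;
        Writer<Integer> writer = chunks::add;
        runStep(reader, doubler, writer, 2);
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [[2, 4], [6, 8], [10]]
    }
}
```

A job would then be an ordered sequence of such steps, and a task is the short-lived process that launches the job and records its result.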
Deployment: Spring Cloud Data Flow is deployed as a microservice, but it has permissions to access the platform on which it runs, which allows it to execute deployments and schedule jobs. This is possible thanks to the Service Provider Interface (SPI) module, which ships with default mechanisms to connect to Kubernetes and Cloud Foundry.
CI/CD: Spring Cloud Data Flow can be integrated into Continuous Integration and Deployment flows thanks to its REST API. The integration would be as follows:
Figure 3: Diagram of CI/CD deployment.
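As a rough sketch of what a CI/CD step could send to the REST API, the following Java code builds two HTTP requests: one to register a stream definition and one to deploy it. The endpoint paths (`/streams/definitions`, `/streams/deployments/{name}`), the form parameters and the server URL on port 9393 are assumptions based on the Data Flow REST API guide and should be checked against the version in use. The stream name `demo` and the definition `http | log` are hypothetical and presume that applications named `http` and `log` are already registered.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Sketch of the requests a pipeline step might build; nothing is sent here.
class DataFlowRestSketch {

    // Request that registers a stream definition (assumed endpoint and parameters).
    static HttpRequest createStreamRequest(String serverUrl, String name, String dsl) {
        String form = "name=" + URLEncoder.encode(name, StandardCharsets.UTF_8)
                + "&definition=" + URLEncoder.encode(dsl, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(serverUrl + "/streams/definitions"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();
    }

    // Request that deploys a previously registered stream (assumed endpoint).
    // The name is placed in the path as-is, so it must be URL-safe.
    static HttpRequest deployStreamRequest(String serverUrl, String name) {
        return HttpRequest.newBuilder(URI.create(serverUrl + "/streams/deployments/" + name))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        String server = "http://localhost:9393"; // assumed local server address
        System.out.println(createStreamRequest(server, "demo", "http | log").uri());
        System.out.println(deployStreamRequest(server, "demo").uri());
    }
}
```

To actually send a request, it can be passed to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`; a secured server would also need an authorization header, which is omitted here.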
More information and practical examples can be found here.
José Luis Antón Bueso, Solution Architect in Altia.