What is Apache Gobblin?
Apache Gobblin is a unified data ingestion framework for extracting, transforming, and loading large volumes of data from a variety of data sources. It can ingest data from different data sources in the same execution framework and manages the metadata of the different sources in one place.
What are the components of Apache Gobblin?
Gobblin provides six different component interfaces, so it is easy to scale and customize development:
Source
Source is primarily responsible for partitioning the data-ingestion work into a series of work units and indicating what the corresponding extractor is.
Extractor
Extractor receives the data source information through the work unit; for Kafka, for example, the work unit indicates the starting offset of each partition in the topic to be used for this extraction. Gobblin uses the concept of a watermark to record the starting position of each extraction.
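To make the contract concrete, here is a minimal sketch of an extractor, assuming Gobblin's org.apache.gobblin.source.extractor.Extractor interface (package names were plain gobblin.* before the project moved to Apache); the class and its in-memory data are hypothetical:

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import org.apache.gobblin.source.extractor.DataRecordException;
import org.apache.gobblin.source.extractor.Extractor;

/** Hypothetical extractor that replays an in-memory list of records. */
public class InMemoryExtractor implements Extractor<String, String> {

  private final Iterator<String> records;
  private final long highWatermark;

  public InMemoryExtractor(List<String> records, long highWatermark) {
    this.records = records.iterator();
    this.highWatermark = highWatermark;
  }

  @Override
  public String getSchema() {
    return "string"; // a real extractor would derive the schema from the source
  }

  @Override
  public String readRecord(String reuse) throws DataRecordException, IOException {
    // returning null signals that this work unit is exhausted
    return this.records.hasNext() ? this.records.next() : null;
  }

  @Override
  public long getExpectedRecordCount() {
    return 0; // unknown up front for this toy source
  }

  @Override
  public long getHighWatermark() {
    return this.highWatermark; // e.g. the last Kafka offset pulled
  }

  @Override
  public void close() throws IOException {
    // nothing to release for an in-memory source
  }
}
```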
Converter
Converter performs filtering and conversion operations on the extracted data, such as converting byte arrays or JSON-format data into the format that needs to be output. A conversion operation can also map one piece of data into zero or more pieces of data.
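As a sketch of that one-to-zero-or-more mapping, assuming Gobblin's org.apache.gobblin.converter.Converter base class and its SingleRecordIterable helper (the converter class itself is hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;

import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.converter.Converter;
import org.apache.gobblin.converter.DataConversionException;
import org.apache.gobblin.converter.SchemaConversionException;
import org.apache.gobblin.converter.SingleRecordIterable;

/** Hypothetical converter: byte-array records in, UTF-8 strings out; empty records dropped. */
public class BytesToStringConverter extends Converter<String, String, byte[], String> {

  @Override
  public String convertSchema(String inputSchema, WorkUnitState workUnit)
      throws SchemaConversionException {
    return inputSchema; // the schema passes through unchanged
  }

  @Override
  public Iterable<String> convertRecord(String outputSchema, byte[] inputRecord,
      WorkUnitState workUnit) throws DataConversionException {
    if (inputRecord == null || inputRecord.length == 0) {
      return Collections.emptyList(); // one input record mapped to zero output records
    }
    return new SingleRecordIterable<>(new String(inputRecord, StandardCharsets.UTF_8));
  }
}
```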
Quality Checker
It is a quality detector with two types of checkers: record-level and task-level policies. Depending on the policy in force (a required policy or an optional one), data that fails a check is written to an external error file or merely reported with a warning.
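As an illustration, a hypothetical record-level check might extend Gobblin's RowLevelPolicy base class (assumed here as org.apache.gobblin.qualitychecker.row.RowLevelPolicy; exact signatures may differ between Gobblin versions):

```java
import org.apache.gobblin.configuration.State;
import org.apache.gobblin.qualitychecker.row.RowLevelPolicy;

/** Hypothetical row-level policy that fails any null or empty record. */
public class NonEmptyRecordPolicy extends RowLevelPolicy {

  public NonEmptyRecordPolicy(State state, RowLevelPolicy.Type type) {
    super(state, type); // type selects the failure behavior (see FAIL/OPTIONAL/ERR_FILE below)
  }

  @Override
  public Result executePolicy(Object record) {
    return (record == null || record.toString().isEmpty())
        ? Result.FAILED
        : Result.PASSED;
  }
}
```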
Writer
Writer writes out the converted data, but not directly to the output file: records are first written to a staging directory. Once all the data has been written, it is moved to the output path for publishing by the Publisher. The sink can be HDFS, Kafka, or Amazon S3, and the format can be Avro, Parquet, or CSV. The Writer can also place output files into directories named by timestamp, such as per hour or per day.
Publisher
Publisher takes the data from the path written by the Writer and outputs it to the final path. It provides two kinds of commit mechanisms: full commit and partial commit. In full-commit mode, publishing waits until the whole job succeeds; in partial-commit mode, the data of the tasks that succeeded is published to the directory even when the job fails.
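To show how these six components are wired together, here is a hedged sketch of a job configuration expressed as java.util.Properties. The keys (source.class, converter.classes, writer.builder.class, data.publisher.type, and so on) are standard Gobblin job-configuration keys, but the com.example classes are hypothetical placeholders and the Gobblin class names should be verified against your version:

```java
import java.util.Properties;

public class JobConfigSketch {
  public static Properties demoJobConfig() {
    Properties props = new Properties();
    props.setProperty("job.name", "DemoIngestJob");
    // Source: partitions the work and names the extractor (hypothetical class)
    props.setProperty("source.class", "com.example.DemoSource");
    // Converter chain: applied in order to every extracted record (hypothetical class)
    props.setProperty("converter.classes", "com.example.BytesToStringConverter");
    // Writer: stages records before publication
    props.setProperty("writer.builder.class", "org.apache.gobblin.writer.AvroDataWriterBuilder");
    props.setProperty("writer.destination.type", "HDFS");
    props.setProperty("writer.output.format", "AVRO");
    // Publisher: moves the staged output to the final path
    props.setProperty("data.publisher.type", "org.apache.gobblin.publisher.BaseDataPublisher");
    return props;
  }
}
```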
Why Apache Gobblin?
Apache Gobblin is a generic data ingestion framework, easily configurable to ingest data from several different types of sources and easily extensible to new ones. Gobblin handles the common routine tasks required by all data ingestion ETLs, including job/task scheduling, task partitioning, error handling, state management, data quality checking, and data publishing. It ingests data from different data sources in the same execution framework and manages the metadata of the various data sources all in one place. Gobblin features:
- Auto scalability
- Fault tolerance
- Data quality assurance
- Extensibility
- Handling data model evolution
Some challenges Gobblin addresses
- Source integration - Gobblin provides out-of-the-box adaptors for commonly accessed data sources such as S3, Kafka, Google Analytics, MySQL, and Salesforce.
- Processing paradigm - It supports both standalone and scalable platforms, including Hadoop and Yarn. Yarn gives it the capability to run continuous ingestion in addition to scheduled batches.
- Extensibility - Developers can integrate their own adaptors with the Gobblin framework and make them leverageable by other developers in the community.
- Self-service - Gobblin supports a standalone mode, so data ingestion and transformation flows can be composed in a self-service manner, tested locally using standalone mode, and deployed in production using scale-out mode without code changes.
How does Apache Gobblin work?
A Gobblin job ingests data from a data source into a sink. A job may consist of multiple tasks, or work units, each of which represents a unit of work to be done.
Guide to the Computation Model
Gobblin Standalone
- Single process, multi-threading
- Testing, small data, sampling
Gobblin on MapReduce
- Large datasets, horizontally scalable
Gobblin on Yarn
- Better resource utilization
- More scheduling flexibility
Source
- Determines how to partition the work
- Partitioning algorithm can leverage source sharding
- Groups partitions intelligently for performance
- Creates work units to be scheduled
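A sketch of that contract, assuming Gobblin's org.apache.gobblin.source.Source interface (the demo class, its property names, and the reuse of the hypothetical InMemoryExtractor from above are all illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.gobblin.configuration.SourceState;
import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.source.Source;
import org.apache.gobblin.source.extractor.Extractor;
import org.apache.gobblin.source.workunit.WorkUnit;

/** Hypothetical source that splits the work into a fixed number of partitions. */
public class DemoSource implements Source<String, String> {

  @Override
  public List<WorkUnit> getWorkunits(SourceState state) {
    List<WorkUnit> workUnits = new ArrayList<>();
    int partitions = state.getPropAsInt("demo.partitions", 4);
    for (int i = 0; i < partitions; i++) {
      WorkUnit workUnit = WorkUnit.createEmpty();
      workUnit.setProp("demo.partition.id", i); // tells the extractor what to pull
      workUnits.add(workUnit);
    }
    return workUnits;
  }

  @Override
  public Extractor<String, String> getExtractor(WorkUnitState state) {
    // hands each work unit to an extractor, here the hypothetical InMemoryExtractor
    return new InMemoryExtractor(Collections.emptyList(),
        state.getPropAsLong("demo.partition.id", 0L));
  }

  @Override
  public void shutdown(SourceState state) {
    // release any global resources held by this source
  }
}
```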
State Management
- Job execution states
- Watermarks
- Job state, task state, quality-checker output, error code
- Job synchronization
- Job failure handling: policy-driven
Extractor
- Specifies how to pull data from the source and get the schema
- Returns a ResultSet iterator
- Tracks the high watermark
- Tracks extraction metrics
Converter
- Filtering
- Projection
- Type conversion
- Structural change
Quality Checker
- Ensures the quality of any data produced by Gobblin
- Can be run on a per-task, per-record, or per-job basis
- Can define a list of quality checkers to be used:
- Schema compatibility
- Audit check
- Sensitive fields
- Unique key
- Policy-driven:
- FAIL - when the check fails, so does the job
- OPTIONAL - when the check fails, the job continues
- ERR_FILE - the offending row is written to an error file
Writer
- Writes data in Avro format onto HDFS
- One writer per task
- Flexibility:
- Configurable compression codec
- Configurable buffer size
Publisher
- COMMIT_ON_FULL_SUCCESS
- COMMIT_ON_PARTIAL_SUCCESS
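A minimal sketch of selecting between these two commit policies in a job configuration; job.commit.policy is Gobblin's standard key for this switch, though the accepted values should be checked against your version:

```java
import java.util.Properties;

public class CommitPolicyExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    // "full" corresponds to COMMIT_ON_FULL_SUCCESS: publish only if every task succeeds.
    // "partial" corresponds to COMMIT_ON_PARTIAL_SUCCESS: publish whatever tasks completed.
    props.setProperty("job.commit.policy", "full");
    System.out.println("commit policy = " + props.getProperty("job.commit.policy"));
  }
}
```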
How to use Apache Gobblin?
- Self-serve - Users can create jobs programmatically through REST APIs or via a UI on any Gobblin deployment, leaving operations to focus only on deployment and upgrades.
- Optimal resource usage - Users can submit jobs and leave it to Gobblin-as-a-Service (GaaS) to optimally choose the executor instance and compile the logical job into single-tenant jobs based on resource and SLA constraints.
- Failover and upgrades - The technology executing a job behind GaaS can be transparently swapped out for failover or upgrades, without user intervention.
- Global state - The unifying role of GaaS across hybrid technology deployments enables operations teams to easily monitor and manage the global state of the data landscape and lineage in their organization.
What are the best practices of Apache Gobblin?
A child class of EmbeddedGobblin is based on a template. The constructor should call setTemplate(myTemplate) so the template is automatically loaded on construction. All configurations required for the job should be parsed from the constructor arguments: a user should be able to run new MyEmbeddedGobblinExtension(params...).run() and get a sensible job run. Convenience methods should be added for the configurations users most commonly want to change. For example:

```java
public EmbeddedGobblinDistcp simulate() {
  this.setConfiguration(CopySource.SIMULATE, Boolean.toString(true));
  return this;
}
```
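A caller could then run the distcp job in simulate mode, for instance (a hedged sketch; the import path and the source/destination paths are illustrative):

```java
import org.apache.gobblin.runtime.embedded.EmbeddedGobblinDistcp; // package assumed; check your distribution
import org.apache.hadoop.fs.Path;

public class DistcpSimulateExample {
  public static void main(String[] args) throws Exception {
    // plans the copy without actually writing any data
    new EmbeddedGobblinDistcp(new Path("/data/source"), new Path("/data/dest"))
        .simulate()
        .run();
  }
}
```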
If the job requires additional jars in the workers that are not part of the minimal Gobblin ingestion classpath, then the constructor should call distributeJar(myJar) for the additional jars.
What are the benefits of Apache Gobblin?
The benefits of Apache Gobblin are listed below:
- Auto scalability
- Fault tolerance
- Data quality assurance
- Extensibility
- Handling data model evolution
Conclusion
Gobblin combines features such as auto scalability, fault tolerance, data quality assurance, extensibility, and the ability to handle data model evolution. It is an easy-to-use, self-serving, and efficient data ingestion framework.
- Explore about Data Serialization in Apache Hadoop
- Read more about Apache Hudi Architecture and Best Practices