Uber Marmaray Features and Its Best Practices

Chandan Gaur | 10 September 2024

What is Marmaray?

Marmaray is an open-source data ingestion and dispersal framework and library for Apache Hadoop, built on top of the Hadoop ecosystem. Users can ingest data from any source and disperse it to any sink by leveraging Apache Spark. Marmaray ingests raw data into a data lake with an appropriate source schema so that analytical results are reliable.


What are the features of Uber Marmaray?

The features of Uber Marmaray are listed below:

  • Automated Schema Management.
  • Monitoring and Alerting Systems.
  • Fully integrated with a workflow orchestration tool.
  • Extensible architecture.
  • Open Source.

Why is Uber Marmaray Important?

  • Marmaray can produce high-quality, schematized data.
  • Marmaray ingestion brings data from multiple data sources into the Hadoop data lake.
  • It supports processing the ingested data, and calculating and storing business metrics based on data in Hive.
  • Marmaray dispersal serves the processed data from Hive to any data store where users can query it and get results.

Why Marmaray Ingestion?

  • Raw Data needed in the Hadoop data lake.
  • Ingested raw data is the basis for derived datasets.
  • Reliable and correct schematized data.
  • Maintenance of multiple data pipelines.

Why Marmaray Dispersal?

  • Derived datasets in Hive.
  • Duplicate and ad hoc dispersal pipelines.
  • Future dispersal needs.

How does Marmaray work?

The main components of Marmaray's architecture are described below:

Chain of Converters

Converters transform the ingested data as required, and a chain of them can write the results to multiple sinks. If malformed data is found during transformation, such as records with missing fields, it is written to error tables.
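The idea of a converter chain with an error table can be sketched in plain Python. This is a minimal illustration, not Marmaray's actual API: the names `run_chain`, `require_fields`, `rename_field`, and `MalformedRecord` are hypothetical, and a list stands in for the error table.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Converter = Callable[[Record], Record]


class MalformedRecord(Exception):
    """Raised by a converter when a record cannot be transformed."""


def require_fields(*names: str) -> Converter:
    """Build a converter that rejects records with missing fields."""
    def convert(record: Record) -> Record:
        missing = [n for n in names if record.get(n) is None]
        if missing:
            raise MalformedRecord(f"missing fields: {missing}")
        return record
    return convert


def rename_field(old: str, new: str) -> Converter:
    """Build a converter that renames one field to match the sink schema."""
    def convert(record: Record) -> Record:
        out = dict(record)  # copy so the input record is left untouched
        out[new] = out.pop(old)
        return out
    return convert


@dataclass
class ConversionResult:
    converted: List[Record] = field(default_factory=list)
    # Hypothetical stand-in for an error table.
    errors: List[Dict[str, Any]] = field(default_factory=list)


def run_chain(records: List[Record], chain: List[Converter]) -> ConversionResult:
    """Apply each converter in order; route failing records to the error list."""
    result = ConversionResult()
    for record in records:
        try:
            current = record
            for convert in chain:
                current = convert(current)
            result.converted.append(current)
        except MalformedRecord as exc:
            result.errors.append({"record": record, "reason": str(exc)})
    return result


if __name__ == "__main__":
    chain = [require_fields("id", "ts"), rename_field("ts", "event_time")]
    outcome = run_chain([{"id": 1, "ts": 100}, {"id": 2}], chain)
    print(outcome.converted)
    print(outcome.errors)
```

Keeping the original record alongside the failure reason makes it possible to inspect or replay malformed input later without blocking the rest of the batch.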

Work Unit Calculator

The Work Unit Calculator creates the batches of data to process. It determines how much data to read from the source, for example how many messages to fetch from Kafka. It ensures that work units are appropriately sized and do not overwhelm the source or sink systems.
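As a rough sketch of how a batch could be sized for a Kafka-like source, the function below caps the total number of messages per run. The names `OffsetRange` and `compute_work_unit` and the greedy per-partition strategy are assumptions made for illustration; they are not taken from Marmaray.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class OffsetRange:
    partition: int
    start: int  # inclusive
    end: int    # exclusive

    @property
    def size(self) -> int:
        return self.end - self.start


def compute_work_unit(
    checkpoint: Dict[int, int],
    latest: Dict[int, int],
    max_messages: int,
) -> List[OffsetRange]:
    """Pick the offset ranges to read next, capped at max_messages in total.

    checkpoint: next offset to read per partition, saved by the previous run.
    latest: end offset currently available per partition.
    """
    ranges: List[OffsetRange] = []
    budget = max_messages
    for partition in sorted(latest):
        if budget <= 0:
            break
        start = checkpoint.get(partition, 0)
        available = max(latest[partition] - start, 0)
        take = min(available, budget)
        if take > 0:
            ranges.append(OffsetRange(partition, start, start + take))
            budget -= take
    return ranges


if __name__ == "__main__":
    for r in compute_work_unit({0: 10, 1: 5}, {0: 50, 1: 100}, max_messages=60):
        print(r, r.size)
```

This greedy version fills the budget partition by partition, so later partitions can be starved when earlier ones have a large backlog. Splitting the budget evenly across partitions would be a reasonable alternative.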

Metadata Manager

The Metadata Manager is responsible only for storing the relevant metadata for a running job, such as checkpoint information or, in the case of Kafka, partition offsets.

Fork Operator and Fork Function

Why is the Fork Operator needed?

  • To avoid reprocessing input records.
  • To avoid re-reading input records (in the case of Spark, re-executing input transformations).
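The two goals above can be illustrated with a small sketch in which every input record is read and classified exactly once before being split. The `fork` function and the `VALID` and `ERROR` keys are hypothetical names; in Spark, materialising the tagged records would roughly correspond to persisting the tagged dataset before filtering it.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Record = Dict[str, Any]

VALID, ERROR = 0, 1  # hypothetical fork keys


def fork(
    records: Iterable[Record],
    classify: Callable[[Record], int],
) -> Dict[int, List[Record]]:
    """Read and classify each input record exactly once, then split by key."""
    # Materialise the tagged records once so the input is never re-read,
    # even though two separate outputs are derived from it.
    tagged: List[Tuple[int, Record]] = [(classify(r), r) for r in records]
    return {
        VALID: [r for key, r in tagged if key == VALID],
        ERROR: [r for key, r in tagged if key == ERROR],
    }


if __name__ == "__main__":
    rows = [{"id": 1, "value": 10}, {"id": 2, "value": None}]
    split = fork(rows, lambda r: VALID if r["value"] is not None else ERROR)
    print(split[VALID])
    print(split[ERROR])
```

Without the single tagging pass, filtering the source once for valid records and once for error records would consume the input twice.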

How to adopt Marmaray?

  • Marmaray can be used for Data Ingestion and Data dispersal.
  • The user submits an ingestion or dispersal job.
  • Create source and sink specific configuration.
  • Determine Work Unit to Process.
  • Read in raw data from Source.
  • Fork data to split into valid and error records.
  • Convert data to sink schema format.
  • Persist data to sink and update metadata.
  • Report metrics.
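The job steps listed above can be walked through end to end in a simplified sketch. Everything here is an assumption for illustration: in-memory lists stand in for the source and sinks, a dictionary stands in for the metadata store, and `run_ingestion_job` is a hypothetical name rather than a Marmaray entry point.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def run_ingestion_job(
    source: List[Record],
    sink: List[Record],
    error_sink: List[Record],
    metadata: Dict[str, int],
    batch_size: int,
) -> Dict[str, int]:
    """Run one batch of a simplified ingestion job and return its metrics."""
    # 1. Determine the work unit from the last checkpoint.
    start = metadata.get("next_offset", 0)
    end = min(start + batch_size, len(source))

    # 2. Read raw data from the source.
    raw = source[start:end]

    # 3. Fork into valid and error records (here: "id" must be present).
    valid = [r for r in raw if r.get("id") is not None]
    errors = [r for r in raw if r.get("id") is None]

    # 4. Convert valid records to the sink schema (here: a string key).
    converted = [{"key": str(r["id"]), "payload": r} for r in valid]

    # 5. Persist to the sinks, then update the checkpoint.
    sink.extend(converted)
    error_sink.extend(errors)
    metadata["next_offset"] = end

    # 6. Report metrics.
    return {"read": len(raw), "written": len(converted), "errors": len(errors)}


if __name__ == "__main__":
    data = [{"id": 1}, {"id": None}, {"id": 3}]
    out, bad, meta = [], [], {}
    print(run_ingestion_job(data, out, bad, meta, batch_size=2))
    print(run_ingestion_job(data, out, bad, meta, batch_size=2))
```

Updating the checkpoint only after the data has been persisted means a failed run can be retried from the same offset, at the cost of possibly writing some records twice.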

Marmaray vs. Gobblin

Gobblin is similar to Marmaray, but one significant difference is that Gobblin only ingests data from different types of data sources, such as databases, FTP/SFTP servers, and REST APIs, into Hadoop. Marmaray ingests data from any source into Hadoop and can also disperse the ingested data from Hadoop to various sinks by leveraging Apache Spark. Gobblin uses the Hadoop MapReduce framework to transform data, whereas Marmaray does not provide any transformation capabilities.

Both Marmaray and Gobblin handle job and task scheduling and metadata management. Gobblin uses Hadoop MapReduce, whereas Marmaray uses Apache Spark as its primary data processing engine, which has advantages over MapReduce. Spark is typically much faster because it processes data in memory. Spark also provides many built-in transformations, such as grouping, mapping, and filtering, and it can apply multiple transformations without storing the intermediate results in HDFS.
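The point about chaining transformations without storing intermediate results can be illustrated with a plain-Python analogy. This is not Spark code: generators stand in for Spark's pipelined map and filter steps, and the "city,fare" input format is invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Row = Tuple[str, float]


def parse(lines: Iterable[str]) -> Iterator[Row]:
    """Map step: turn 'city,fare' text lines into (city, fare) pairs."""
    for line in lines:
        city, fare = line.split(",")
        yield city, float(fare)


def positive_fares(rows: Iterable[Row]) -> Iterator[Row]:
    """Filter step: drop rows with a non-positive fare."""
    return (row for row in rows if row[1] > 0)


def total_by_city(rows: Iterable[Row]) -> Dict[str, float]:
    """Group step: sum fares per city; the only step that holds state."""
    totals: Dict[str, float] = defaultdict(float)
    for city, fare in rows:
        totals[city] += fare
    return dict(totals)


if __name__ == "__main__":
    lines = ["sf,10.5", "nyc,0", "sf,4.5", "nyc,7"]
    # Each record flows through parse and positive_fares one at a time;
    # no intermediate collection is written out between the steps.
    print(total_by_city(positive_fares(parse(lines))))
```

A MapReduce-style pipeline would instead typically write the output of each stage to disk before the next stage reads it.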


What are the best practices of Uber Marmaray?

One option is to use a new type of HoodieRecordPayload and return the previously persisted record as the output of combineAndGetUpdateValue(...). However, the commit time of the previously persisted record is then updated to the latest commit, which makes downstream incremental ETL count this record twice. Another option is to left join the data frame with all the persisted data by key and insert only the records whose persisted_data.key is null.

The concern with this approach is that it is unclear whether the bloom index and metadata can be used to full advantage. A further option is to add a new flag field to the HoodieRecord, read from the HoodieRecordPayload metadata, that indicates whether a copyOldRecord is needed during writing. Alternatively, pass a flag through the data frame options to enforce that the entire job uses copyOldRecord.
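The left-join approach described above amounts to a left anti join: keep only the incoming records whose key has no match in the persisted data. The sketch below shows that logic on plain Python dictionaries. It does not use the Hudi or Spark APIs, and `records_to_insert` is a hypothetical helper name.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def records_to_insert(
    incoming: List[Record],
    persisted: List[Record],
    key: str,
) -> List[Record]:
    """Return incoming records whose key has no match in the persisted data.

    Equivalent to left joining incoming with persisted on `key` and keeping
    only the rows where the persisted side is null.
    """
    persisted_keys = {r[key] for r in persisted}
    return [r for r in incoming if r[key] not in persisted_keys]


if __name__ == "__main__":
    already_stored = [{"key": "a", "v": 1}]
    new_batch = [{"key": "a", "v": 2}, {"key": "b", "v": 3}]
    print(records_to_insert(new_batch, already_stored, "key"))
```

Because existing records are never rewritten, their commit time stays unchanged, which is what avoids the double counting in downstream incremental ETL. The trade-off is that the join has to look at all persisted keys.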


What are the best Uber Marmaray Tools?

The best Uber Marmaray tools are listed below:

Conclusion

Uber envisions Marmaray as a pipeline that connects raw data from different types of data sources to Hadoop or Hive, and that further connects both raw and derived datasets from Hive to various sinks, depending on SLA, latency, and other customer requirements.