Uber Marmaray Features and Best Practices

What is Marmaray?

Marmaray is an open-source data ingestion and dispersal framework and library for Apache Hadoop, built on top of the Hadoop ecosystem. Users can ingest data from any source and disperse it to any sink by leveraging Apache Spark. Marmaray ingests raw data into the data lake with an appropriate source schema so that analytical results are reliable.

What are the features of Uber Marmaray?

The features of Uber Marmaray are listed below:

Automated Schema Management.
Monitoring and Alerting Systems.
Fully integrated with a workflow orchestration tool.
Extensible architecture.
Open Source.

Why is Uber Marmaray Important?

Marmaray produces quality, schematized data.
It ingests data from multiple data sources into the Hadoop data lake through Marmaray ingestion.
It can process the ingested data, and store and calculate business metrics based on data in Hive.
It serves the processed data from Hive to any data store where users can query it, through Marmaray dispersal.

Why Marmaray Ingestion?

Raw data is needed in the Hadoop data lake.
Ingested raw data feeds derived datasets.
Reliable and correct schematized data.
Maintenance of multiple data pipelines.

Why Marmaray Dispersal?

Derived datasets in Hive.
Duplicate and ad hoc dispersal pipelines.
Future dispersal needs.

How does Marmaray work?

The main components of Marmaray's architecture are described below:

Chain of Converters

Converters transform the ingested data as required, and the converted data can be saved to multiple sinks. If malformed data is found during transformation, for example records with missing fields, it is written to error tables.
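The idea of a converter chain with an error table can be sketched in plain Python. This is only an illustration under stated assumptions, not Marmaray's actual API: the names `run_chain`, `require_fields`, `rename_field`, `MalformedRecord`, and the field names `trip_id` and `fare` are all hypothetical, and the "error table" is just an in-memory list.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Converter = Callable[[Record], Record]


class MalformedRecord(Exception):
    """Raised by a (hypothetical) converter when a record cannot be converted."""


@dataclass
class ConversionResult:
    converted: List[Record] = field(default_factory=list)
    errors: List[Dict[str, Any]] = field(default_factory=list)  # stand-in for an error table


def require_fields(*names: str) -> Converter:
    """Build a converter that rejects records with missing fields."""
    def convert(record: Record) -> Record:
        missing = [n for n in names if record.get(n) is None]
        if missing:
            raise MalformedRecord(f"missing fields: {missing}")
        return record
    return convert


def rename_field(old: str, new: str) -> Converter:
    """Build a converter that renames one field to match the sink schema."""
    def convert(record: Record) -> Record:
        if old not in record:
            raise MalformedRecord(f"missing fields: {[old]}")
        out = dict(record)
        out[new] = out.pop(old)
        return out
    return convert


def run_chain(records: List[Record], chain: List[Converter]) -> ConversionResult:
    """Apply each converter in order; route failing records to the error list."""
    result = ConversionResult()
    for record in records:
        try:
            current = record
            for convert in chain:
                current = convert(current)
            result.converted.append(current)
        except MalformedRecord as exc:
            result.errors.append({"record": record, "reason": str(exc)})
    return result


chain = [require_fields("trip_id", "fare"), rename_field("fare", "fare_usd")]
result = run_chain([{"trip_id": 1, "fare": 12.5}, {"trip_id": 2}], chain)
```

Here the first record passes through both converters, while the second, which lacks `fare`, ends up in `result.errors` together with the reason it was rejected.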

Work Unit Calculator

The Work Unit Calculator creates the batches of data to process. It controls how much data is read in each batch, for example the number of messages fetched from Kafka. It ensures that work units are appropriately sized and don't overwhelm the source or sink systems.
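As a rough sketch of how a bounded work unit could be computed for a Kafka-like source, the function below splits a message budget across partitions in proportion to their backlog. The proportional policy, the `OffsetRange` type, and the `compute_work_unit` name are assumptions made for illustration; the article does not say how Marmaray sizes its batches.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class OffsetRange:
    partition: int
    start: int  # inclusive
    end: int    # exclusive


def compute_work_unit(
    committed: Dict[int, int],
    latest: Dict[int, int],
    max_messages: int,
) -> List[OffsetRange]:
    """Pick per-partition offset ranges that total at most max_messages."""
    # Unread messages per partition since the last committed offset.
    backlog = {p: max(latest[p] - committed.get(p, 0), 0) for p in latest}
    total = sum(backlog.values())
    if total == 0 or max_messages <= 0:
        return []
    budget = min(max_messages, total)
    ranges = []
    for p in sorted(backlog):
        # Floor division keeps the sum of shares within the budget.
        share = backlog[p] * budget // total
        if share > 0:
            start = committed.get(p, 0)
            ranges.append(OffsetRange(p, start, start + share))
    return ranges


ranges = compute_work_unit(
    committed={0: 100, 1: 50},
    latest={0: 400, 1: 150},
    max_messages=200,
)
```

With a backlog of 300 and 100 messages and a budget of 200, partition 0 is assigned 150 messages and partition 1 is assigned 50, so the batch never exceeds the configured limit.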

Metadata Manager

The Metadata Manager stores the relevant metadata for a running job, such as checkpoint information (for example, partition offsets in the case of Kafka).

Fork Operator and Fork Function

Why is the Fork Operator needed?

Avoid reprocessing input records.
Avoid re-reading input records (in the case of Spark, re-executing input transformations).
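The two goals above can be illustrated with a small sketch: each record is read and tagged exactly once, and every output is then filtered from the tagged copy rather than from the source. The `ForkOperator` and `CountingSource` classes below are hypothetical stand-ins written for this example, not Marmaray's classes; in Spark, keeping the tagged copy would roughly correspond to persisting the tagged dataset.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

VALID, ERROR = "valid", "error"
Record = Dict[str, Any]


class ForkOperator:
    """Tag each input record once, then serve every output from the tagged copy."""

    def __init__(self, fork_function: Callable[[Record], str]) -> None:
        self._fork_function = fork_function
        self._tagged: List[Tuple[str, Record]] = []

    def execute(self, records: Iterable[Record]) -> None:
        # Single pass over the input; the tagged result is kept in memory.
        self._tagged = [(self._fork_function(r), r) for r in records]

    def get(self, tag: str) -> List[Record]:
        # Filtering reads the tagged copy, not the original input.
        return [r for t, r in self._tagged if t == tag]


class CountingSource:
    """Test helper that counts how many records were actually read."""

    def __init__(self, records: List[Record]) -> None:
        self._records = records
        self.reads = 0

    def __iter__(self) -> Iterator[Record]:
        for record in self._records:
            self.reads += 1
            yield record


source = CountingSource(
    [{"id": 1, "fare": 9.0}, {"id": 2, "fare": None}, {"id": 3, "fare": 4.5}]
)
fork = ForkOperator(lambda r: VALID if r["fare"] is not None else ERROR)
fork.execute(source)
valid = fork.get(VALID)
errors = fork.get(ERROR)
```

Both `valid` and `errors` are produced while `source.reads` stays at 3, which is the point of forking: two outputs without reading or reprocessing the input twice.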

How to adopt Marmaray?

Marmaray can be used for Data Ingestion and Data dispersal.
The user submits an ingestion/dispersal job.
Create source and sink specific configuration.
Determine Work Unit to Process.
Read in raw data from Source.
Fork data to split into valid and error records.
Convert data to sink schema format.
Persist data to sink and update metadata.
Report metrics.
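The steps above can be sketched as a single batch function. This is a simplified, in-memory illustration under assumptions: `run_ingestion_job`, the `next_offset` checkpoint key, and the record fields `id` and `value` are hypothetical, and lists and a dictionary stand in for the real source, sink, error table, and metadata store.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def run_ingestion_job(
    source: List[Record],
    sink: List[Record],
    error_table: List[Record],
    metadata: Dict[str, int],
    max_records: int,
) -> Dict[str, int]:
    """Run one batch and return its metrics."""
    # Determine the work unit from the last checkpoint.
    start = metadata.get("next_offset", 0)
    end = min(start + max_records, len(source))

    # Read raw data from the source.
    raw = source[start:end]

    # Fork the data into valid and error records.
    valid = [r for r in raw if r.get("id") is not None]
    errors = [r for r in raw if r.get("id") is None]

    # Convert valid records to the (assumed) sink schema.
    converted = [{"id": r["id"], "payload": str(r.get("value", ""))} for r in valid]

    # Persist, then update metadata so the next run resumes after this batch.
    sink.extend(converted)
    error_table.extend(errors)
    metadata["next_offset"] = end

    # Report metrics.
    return {"read": len(raw), "written": len(converted), "errors": len(errors)}


source = [{"id": 1, "value": "a"}, {"value": "b"}, {"id": 3, "value": "c"}]
sink: List[Record] = []
error_table: List[Record] = []
metadata: Dict[str, int] = {}

first = run_ingestion_job(source, sink, error_table, metadata, max_records=2)
second = run_ingestion_job(source, sink, error_table, metadata, max_records=2)
```

The second call picks up where the first one stopped because the checkpoint is only advanced after the batch has been persisted.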

Marmaray vs. Gobblin

Gobblin is similar to Marmaray, but one significant difference is that Gobblin only ingests data from different types of data sources, such as databases, FTP/SFTP servers, and REST APIs, into Hadoop, whereas Marmaray ingests data from any source into Hadoop and can also disperse the ingested data from Hadoop to various sinks by leveraging Apache Spark. Gobblin uses the Hadoop MapReduce framework to transform data, whereas Marmaray doesn't provide any transformation capabilities.

Both Marmaray and Gobblin handle job and task scheduling and metadata management. Gobblin uses Hadoop MapReduce, whereas Marmaray uses Apache Spark as its primary data processing engine, which has advantages over MapReduce. Spark is typically faster because it processes data in memory. Spark also provides many transformations out of the box, such as grouping, mapping, and filtering, and it can apply multiple transformations to data without writing the intermediate results to HDFS.

What are the best practices of Uber Marmaray?

Use a new type of HoodieRecordPayload and keep the previously persisted record as the output of combineAndGetUpdateValue(...). However, the commit time of the previously persisted record is then updated to the latest commit, which makes downstream incremental ETL count this record twice. An alternative is to left join the data frame with all the persisted data by key and insert only the records whose persisted_data.key is null.

The concern is whether bloomIndex / metadata can be taken full advantage of. Put a new flag field in the HoodieRecord, read from the HoodieRecordPayload metadata, to indicate whether a copyOldRecord is needed during writing. Alternatively, pass a flag down through the data frame options to enforce that the entire job uses copyOldRecord.
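The "left join and keep the rows whose persisted key is null" idea can be shown with a minimal plain-Python sketch. It does not use the Hudi or Spark APIs; the function name `records_to_insert` and the sample fields are assumptions made for illustration only.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def records_to_insert(
    incoming: List[Record],
    persisted: List[Record],
    key: str,
) -> List[Record]:
    """Keep only incoming records whose key has no match in the persisted data.

    This is equivalent to left-joining incoming to persisted on `key`
    and keeping the rows where the persisted side is null.
    """
    persisted_keys = {r[key] for r in persisted}
    return [r for r in incoming if r[key] not in persisted_keys]


persisted_data = [{"key": "a", "value": 1}]
incoming_data = [{"key": "a", "value": 2}, {"key": "b", "value": 3}]
new_records = records_to_insert(incoming_data, persisted_data, key="key")
```

Only the record with key `b` is selected for insertion, so the already persisted record with key `a` is not rewritten and keeps its original commit time.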

What are the best Uber Marmaray Tools?

The best Uber Marmaray tools are listed below:

Conclusion

Uber envisions Marmaray as a pipeline that connects raw data from different types of data sources to Hadoop or Hive, and that further connects both derived and raw datasets from Hive to various sinks, depending on SLA, latency, and other customer requirements.