Data / ML, Engineering

Marmaray: An Open Source Generic Data Ingestion and Dispersal Framework and Library for Apache Hadoop

September 12, 2018 / Global
Figure 1: Marmaray both ingests data into our Hadoop Data Lake and disperses data to data stores. (Apache Kafka, Cassandra, Spark, and HDFS logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.)
Figure 2: As Uber continued to expand our operations globally, raw data stored in our Hadoop data lake grew exponentially.
Figure 3: A graph of on-call alerts for our Hadoop Platform team illustrates the overhead involved with maintaining our systems.
Figure 4: Through the app, Uber Eats uses machine learning models to recommend other restaurants an eater might enjoy.
Figure 5: Using Marmaray, Uber Freight’s metrics are updated every few hours, a much more responsive pace than the daily updates available previously. Sample data shown here does not reflect actual data, and is used for illustrative purposes only.
Figure 6: This diagram displays the high-level architecture of the major components in the Marmaray framework.
Figure 7: Marmaray’s Metadata Manager is used to store any relevant metadata for a running job.
Figure 8: The Fork Operator and Fork Function split raw data records into schema-conforming records and error records to ensure high quality in our data lake.
Figure 9: AvroPayload wraps a GenericRecord with useful metadata.
Figure 10: For ingestion and dispersal, Marmaray requires that data be converted into AvroPayload, a wrapper based on Avro’s GenericRecord format. (Apache Kafka, Cassandra, Spark, and HDFS logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.)
Figure 11: Marmaray runs its ingestion and dispersal jobs independently of source or sink, as illustrated by this job flow diagram.
Figure 12: Our self-service UI enables data scientists and other users to move data from any source to any sink without having to know specific data formats.
Figure 13: Marmaray also supports data deletion by leveraging the Hudi storage format.
Danny Chen

Danny Chen is currently an engineering manager for the Hadoop Platform team at Uber. He is a co-creator and co-architect of Marmaray and was previously a tech lead on Uber Maps. Before Uber, he worked on the core storage team at Twitter, building scalable distributed storage systems.

Omkar Joshi

Omkar Joshi is a software engineer on Uber’s Hadoop Platform team, where he is also a co-creator and co-architect of Marmaray. Omkar previously led object store and NFS solutions at Hedvig Inc. and was an initial contributor to Hadoop’s YARN scheduler.

Posted by Danny Chen, Omkar Joshi