
Rio Tinto: From Nested S3 Paths to Databricks Gold

Rio Tinto's Databricks solution paves the way forward for future data integrations with robust, repeatable processes.

Tags

Data Platforms, Data and AI, DevOps, AWS, Databricks, Resources

Date

Apr 2024

Client

Rio Tinto

The Client

Rio Tinto is a leading global mining group that focuses on finding, mining and processing the Earth's mineral resources. Rio Tinto has started a programme of work to ingest and provision geoscientific data for Rio Tinto Exploration (RTX). This data takes many forms, such as spectral and image data, and comes from a myriad of sources such as devices, aircraft, third party APIs, databases, and internal apps.

The Challenge

Mechanical Rock were engaged to build a solution that takes images from a third party repository and replicates them in the central data platform of choice, in this case Databricks. The current solution architecture uses patterns defined by a centralised team. Features include ingestion and processing within AWS, plus Databricks for storage and analytics, including AI/ML workloads using SageMaker. The data did not need to be real-time, but the data platform had to hold a faithful replica of all data in the third party repository, refreshed at least once daily.

The Solution

Running daily, the application calls the third party API, operates within the environmental constraints set by Rio Tinto, and processes only the changes made since the previous run. AWS Lambda functions were chosen as the primary compute mechanism to enable horizontal scaling. AWS queues allow failed messages to be retried; after three failed attempts, the message is sent to a dead-letter queue and a notification is sent to a human user to resolve the failing process. Amazon EventBridge is used to schedule and initiate the ingestion process. The ingestion process constructs a hierarchy to determine how to optimally batch and download images.
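The retry and dead-letter behaviour described above can be illustrated with a minimal in-memory sketch. This is not the project's code: `RetryingQueue`, `download_batch`, and the batch names are hypothetical stand-ins, and in the real system the managed AWS queue and its dead-letter configuration would do this work rather than application code. The only detail taken from the case study is the limit of three failed attempts.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List

MAX_ATTEMPTS = 3  # from the case study: dead-letter after three failed attempts


@dataclass
class Message:
    body: str
    attempts: int = 0


@dataclass
class RetryingQueue:
    """Hypothetical in-memory stand-in for a queue with a dead-letter queue."""
    handler: Callable[[str], None]     # processes one message body
    notify: Callable[[Message], None]  # alerts a human about a dead-lettered message
    pending: Deque[Message] = field(default_factory=deque)
    dead_letters: List[Message] = field(default_factory=list)

    def send(self, body: str) -> None:
        self.pending.append(Message(body))

    def drain(self) -> None:
        # Process messages until none are pending; a failed message is
        # re-queued until it has failed MAX_ATTEMPTS times.
        while self.pending:
            msg = self.pending.popleft()
            msg.attempts += 1
            try:
                self.handler(msg.body)
            except Exception:
                if msg.attempts >= MAX_ATTEMPTS:
                    self.dead_letters.append(msg)
                    self.notify(msg)
                else:
                    self.pending.append(msg)


def download_batch(body: str) -> None:
    # Hypothetical handler: one batch always fails, to exercise the retry path.
    if body == "batch-bad":
        raise RuntimeError("download failed")


alerts: List[Message] = []
queue = RetryingQueue(handler=download_batch, notify=alerts.append)
queue.send("batch-ok")
queue.send("batch-bad")
queue.drain()
```

After `drain()` returns, the failing batch sits in `queue.dead_letters` with three recorded attempts, and `alerts` holds the same message for a person to investigate.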

Figure: Serverless data ingestion architecture

To make Imago data in S3 more generally available to consumers, Mechanical Rock built a catalogue of Imago metadata in Databricks.
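The case study does not describe the catalogue's schema. As a rough sketch of one plausible ingredient, the snippet below turns a nested S3 object key into a flat metadata record that could be loaded into a table. The `ImageRecord` fields, the bucket name, and the key layout are all assumptions made for illustration, not the actual Imago structure.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Tuple


@dataclass(frozen=True)
class ImageRecord:
    """Hypothetical catalogue row derived from one S3 object key."""
    bucket: str
    key: str
    levels: Tuple[str, ...]  # folder names above the file, outermost first
    file_name: str
    extension: str           # lower-cased, without the leading dot


def to_record(bucket: str, key: str) -> ImageRecord:
    # S3 keys use "/" as a separator, so PurePosixPath splits them safely
    # regardless of the operating system running this code.
    path = PurePosixPath(key)
    return ImageRecord(
        bucket=bucket,
        key=key,
        levels=path.parts[:-1],
        file_name=path.name,
        extension=path.suffix.lower().lstrip("."),
    )


# Assumed example key; the real path hierarchy is not given in the source.
record = to_record("example-bucket", "workspace-a/dataset-1/core-001.JPG")
```

Keeping the original `key` alongside the derived fields means each catalogue row can still be traced back to the exact object in S3.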

Figure: Data pipeline

The Benefits

The data from the third party repository is now available in a central data platform at Rio Tinto and is updated daily. It can now serve as a single point of access for data products for any customers at Rio Tinto. Furthermore, this was the first use case for Databricks at RTX, and it has provided a template for future data ingestion from other data sources, underpinned by robust software engineering processes that ensure quality and repeatability.


