
Omni Logistics
Case Study
About
This project presents a full-scale redesign of the data ingestion pipeline at OmniLogistics Inc., a rapidly expanding global logistics and supply chain firm. The company’s daily operations rely heavily on real-time, accurate data from multiple sources such as shipment tracking systems, inventory databases, and partner reports. Timely access to this data is critical for operational visibility, decision-making, and predictive analytics.
The Challenge
The existing ingestion pipeline, built on Azure Data Factory (ADF), lacked intelligence and flexibility. It was configured to reprocess all files from the raw data zone daily, regardless of whether they had changed. This approach led to high compute costs, unnecessary resource usage, and delays in data availability, often leaving key metrics up to 24 hours out of date.
The system also suffered from poor fault tolerance. A single malformed file could cause the entire pipeline to fail, requiring manual troubleshooting. No mechanism existed to track file-level processing outcomes, resulting in poor auditability and compliance risk. Additionally, the absence of automated archival led to cluttered storage zones and increased storage management overhead.
Solution
A new, metadata-driven framework was developed to address these gaps. The revised system is built around Azure Data Factory and introduces several enhancements to improve efficiency, reliability, and maintainability:
- Incremental Loading - A high-watermark method was implemented so that only newly added or updated files are processed. The pipeline retrieves the timestamp of the last successful run and compares it against each file's lastModified time to determine eligibility for processing (see the sketch after this list).
- Atomic Processing and Archival - Each file is handled as a standalone unit. Upon successful processing, the file is moved to a date-partitioned archive location, ensuring an immutable historical record for traceability and regulatory needs.
- Granular Error Handling - The system is designed to isolate failures. If one file encounters an error, it is logged separately, and the remaining files continue to process uninterrupted. This avoids pipeline-wide failures caused by single-file issues.
- Detailed Audit Logging - A structured audit log captures the processing status of each file, including success, failure, and archival. This supports complete transparency, improves compliance, and accelerates root-cause analysis when issues occur.
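For illustration, below is a minimal Python sketch of the high-watermark eligibility check described in the first bullet. In the actual solution this logic lives in ADF expressions; the control-file name and field shown here are assumptions.

```python
from datetime import datetime, timezone

# Hypothetical shape of the high-watermark control file in the control
# container, e.g. control/ingestion_watermark.json (name assumed):
# {"last_successful_run_utc": "2024-01-15T02:00:00Z"}

def parse_utc(ts: str) -> datetime:
    """Parse an ISO-8601 UTC timestamp, such as an ADLS lastModified value."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)

def is_eligible(file_last_modified: str, watermark: str) -> bool:
    """A file is picked up only if it changed after the last successful run."""
    return parse_utc(file_last_modified) > parse_utc(watermark)

# Example with an illustrative watermark: only the second file would be processed.
watermark = "2024-01-15T02:00:00Z"
print(is_eligible("2024-01-14T23:10:00Z", watermark))  # False - already ingested
print(is_eligible("2024-01-15T06:45:00Z", watermark))  # True  - new or updated
```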
Results and Business Impact
The new ingestion system converted a costly, error-prone batch process into a highly automated, resilient, and scalable solution. Key benefits delivered include:
- Lower Operational Costs - Reduced compute usage by eliminating redundant file reprocessing.
- Improved Data Freshness - Faster availability of data for analytics and business reporting.
- Higher Trust and Compliance - Stronger error handling and audit capabilities support data governance.
- Future-Readiness - The modular, parameterized design allows rapid onboarding of new data sources without major rework.
This modernized framework now forms a foundational layer in OmniLogistics' data platform, supporting strategic goals around real-time insights and data-driven decision-making.

Containers
In this project, Azure Data Lake Storage (ADLS) Gen2 is organized into logical containers to separate and manage the data lifecycle efficiently. The raw container stores incoming unprocessed files, while the processed container holds cleansed, structured data organized by date for analytics use. An archive container retains original source files for compliance, traceability, and recovery. A control container manages metadata and state tracking, including the high-watermark JSON file. Finally, an audit container logs operations such as file deletions and failures.
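As a rough illustration of this layout, the sketch below models the container roles and a date-partitioned path convention. The container names come from the description above; the exact folder structure and file names are assumptions, not the production configuration.

```python
from datetime import date

# Container roles as described above (names taken from the case study).
CONTAINERS = {
    "raw": "incoming, unprocessed source files",
    "processed": "cleansed, structured data partitioned by date",
    "archive": "original source files kept for compliance, traceability, and recovery",
    "control": "metadata and state, including the high-watermark JSON file",
    "audit": "per-file processing logs (successes, failures, deletions)",
}

def processed_path(file_name: str, run_date: date) -> str:
    """Build an assumed year/month/day partitioned path in the processed zone."""
    return f"processed/{run_date:%Y/%m/%d}/{file_name}"

def archive_path(file_name: str, run_date: date) -> str:
    """Mirror the same date partitioning in the archive zone."""
    return f"archive/{run_date:%Y/%m/%d}/{file_name}"

print(processed_path("shipments_001.csv", date(2024, 1, 15)))
# processed/2024/01/15/shipments_001.csv
```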

Data Ingestion
The parent pipeline, dev_pl_ingest_incremental_files, serves as the orchestration layer for the entire data ingestion process. It begins by retrieving the last processed timestamp from a control file, then uses a Get Metadata activity to list the files in the raw zone. A ForEach loop checks each file's lastModified timestamp against the watermark to identify new or updated files. Eligible files are passed to a child pipeline for processing. Failures are captured in a variable, and if any occur, the pipeline halts with a detailed error message. On successful completion, the watermark is updated, ensuring accurate incremental processing on the next run.
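The control flow of dev_pl_ingest_incremental_files can be summarised in plain Python as follows. This is a sketch of the orchestration logic only; in ADF it is implemented with Lookup, Get Metadata, ForEach, Execute Pipeline, and variable activities, and the helper functions below are placeholders.

```python
def ingest_incremental_files(list_raw_files, process_single_file,
                             read_watermark, write_watermark):
    """Sketch of the parent pipeline's control flow (helpers are placeholders)."""
    watermark = read_watermark()        # Lookup: timestamp of the last successful run
    files = list_raw_files()            # Get Metadata: file names + lastModified values
    failures = []

    for f in files:                     # ForEach over the raw zone
        # Timestamps are assumed to be directly comparable (e.g. parsed datetimes).
        if f["lastModified"] <= watermark:
            continue                    # skip files already ingested
        try:
            process_single_file(f["name"])          # Execute Pipeline: child pipeline
        except Exception as exc:
            failures.append(f"{f['name']}: {exc}")  # capture failure in a variable

    if failures:
        # Halt the run with a detailed error message listing every failed file.
        raise RuntimeError("Ingestion failed for:\n" + "\n".join(failures))

    # Advance the watermark only after a fully successful run.
    new_watermark = max((f["lastModified"] for f in files), default=watermark)
    write_watermark(new_watermark)
```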

Incremental Load
The child pipeline, dev_pl_process_single_file, handles the end-to-end processing of a single file passed in by the parent pipeline. It first copies the raw file to a structured, date-partitioned folder in the processed zone, applying any required schema mapping or transformations. In parallel, it archives the original file to a long-term storage location for traceability and compliance. After both copies complete successfully, a Delete activity removes the source file from the raw zone to prevent reprocessing. This pipeline keeps file handling atomic and idempotent, with a clean separation of concerns, enabling scalable and reliable ingestion of high-volume data with minimal operational overhead.
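A condensed sketch of the child pipeline's steps is shown below; the copy, delete, and audit helpers are placeholders for the corresponding ADF activities, and the partition scheme is assumed.

```python
def process_single_file(file_name, run_date, copy_file, delete_file, log_audit):
    """Sketch of dev_pl_process_single_file (helper callables are placeholders)."""
    raw_path = f"raw/{file_name}"
    processed_path = f"processed/{run_date:%Y/%m/%d}/{file_name}"  # assumed partitioning
    archive_path = f"archive/{run_date:%Y/%m/%d}/{file_name}"

    try:
        copy_file(raw_path, processed_path)  # Copy: structured, transformed output
        copy_file(raw_path, archive_path)    # Copy: immutable original for compliance
        delete_file(raw_path)                # Delete: remove source to prevent reprocessing
        log_audit(file_name, status="succeeded")
    except Exception as exc:
        log_audit(file_name, status="failed", detail=str(exc))
        raise  # surface the failure to the parent pipeline
```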

Failure Alerts
To ensure immediate visibility into ingestion issues at OmniLogistics, I configured a failure alert in Azure Data Factory under Alerts & Metrics. The rule monitors the "Activity Failed Runs" metric and triggers when its value is greater than 0, indicating at least one failed activity in a pipeline run.
This alert is categorized as Sev1 and linked to the 'ADF-OmniLogistics' action group, which dispatches real-time email notifications to the support team. This setup enables rapid response, minimizes downtime, and supports a more resilient data ingestion process by surfacing failures as soon as they occur.
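For reference, the key settings of this alert rule can be summarised as below. The values mirror the description above; the field names and evaluation cadence are illustrative, not an exact Azure Monitor API payload.

```python
# Illustrative summary of the failure alert configuration
# (simplified field names, not an Azure Monitor API payload).
adf_failure_alert = {
    "name": "adf-omnilogistics-activity-failures",   # assumed rule name
    "severity": 1,                                   # Sev1
    "metric": "Activity Failed Runs",                # ADF metric being monitored
    "operator": "GreaterThan",
    "threshold": 0,                                  # fire on the first failed activity
    "evaluation": {"frequency_minutes": 1, "window_minutes": 5},  # assumed cadence
    "action_group": "ADF-OmniLogistics",             # emails the support team
}
```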

GitHub
Integrating the Omni Logistics Azure Data Factory with a GitHub repository establishes robust source control, providing versioning and a complete audit trail for all development artifacts. By centralizing the codebase, this integration forms the essential foundation for CI/CD automation.
Automated pipelines, triggered by Git commits and pull requests, ensure that every change is validated and packaged consistently. This process significantly enhances collaboration, improves deployment reliability, and reduces manual errors, allowing for faster and safer delivery of new analytics capabilities.