
    AWS Data Lake Architecture

    Designed and deployed a serverless data lake on AWS processing 500GB of raw data daily with automated schema cataloguing and pay-per-query analytics.

    500GB: raw data processed daily
    80%: lower storage cost than equivalent RDS capacity
    Serverless: zero infrastructure to manage
    Automatic: schema cataloguing via Glue crawlers
    The Challenge

    A logistics SaaS company was generating 500GB of raw operational data daily — GPS telemetry, delivery events, vehicle sensor data — but had no way to query it cost-effectively. Storing it in RDS was prohibitively expensive at that volume; querying raw files on S3 was technically painful. Analytics were months out of date.

    What We Built

    We designed a serverless data lake using AWS native services. Raw data lands in S3 via Lambda ingest functions. AWS Glue crawlers automatically catalogue new data and update the schema registry. Athena provides SQL-on-S3 querying with no infrastructure to manage. A lightweight transformation layer (Glue Jobs) builds curated summary tables for the most common query patterns. CloudWatch monitors pipeline health.

    How It Works

    The company's data team had a real problem: they were sitting on a goldmine of operational data but couldn't query it without enormous cost or complexity. RDS couldn't handle the volume economically; spinning up and managing a Redshift cluster felt like overkill for their team size.

    The architecture we designed is fully serverless. Raw data arrives via API and is ingested by Lambda functions that partition it by date, event type, and region before writing to S3 in Parquet format. This partitioning is what makes Athena queries fast — filters on partition keys skip irrelevant data entirely.
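The ingest step above hinges on a consistent, Hive-style key layout so that Athena can prune partitions. A minimal sketch of that key-building logic is below; the field names (`timestamp`, `event_type`, `region`) and the `raw/` prefix are illustrative assumptions, not the client's actual schema.

```python
from datetime import datetime

def partition_key(event: dict, prefix: str = "raw") -> str:
    """Build a Hive-style S3 key prefix (dt=/event_type=/region=)
    so Athena queries filtering on these columns skip other partitions."""
    ts = datetime.fromisoformat(event["timestamp"])
    return (
        f"{prefix}/dt={ts:%Y-%m-%d}/"
        f"event_type={event['event_type']}/"
        f"region={event['region']}/"
    )

# Inside the Lambda handler, a batch of events would be converted to
# Parquet (e.g. with pyarrow) and written under this prefix via boto3:
#   s3.put_object(Bucket=BUCKET,
#                 Key=partition_key(evt) + f"{batch_id}.parquet",
#                 Body=parquet_bytes)
```

Hive-style `key=value` path segments are what lets Glue register each directory level as a partition column without any manual DDL.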

    AWS Glue Crawlers run every hour, scanning new S3 partitions and updating the Glue Data Catalog. This means new event types or schema changes are automatically catalogued — the data team never manually defines table structures.

    For the most common query patterns (daily delivery completion rates, vehicle utilisation by region, SLA breach analysis), we run nightly Glue Jobs that produce summarised tables. These pre-aggregated tables make the dashboards instant — no full table scans at query time.
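The aggregation itself is straightforward. As a plain-Python sketch of what one nightly job computes (in production this runs as a Glue Job over Parquet, not in-memory Python), assuming each event row carries a `dt` partition date and a `status` field:

```python
from collections import defaultdict

def daily_completion_rates(events):
    """Summarise delivery events into a per-day completion-rate table,
    the pre-aggregated shape the dashboards query instead of raw rows."""
    totals = defaultdict(int)
    completed = defaultdict(int)
    for e in events:
        totals[e["dt"]] += 1
        if e["status"] == "completed":
            completed[e["dt"]] += 1
    return {day: completed[day] / totals[day] for day in totals}
```

Because the dashboards read these small summary tables rather than the raw partitions, a dashboard load scans kilobytes instead of gigabytes.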

    The cost model was transformative. S3 storage costs a fraction of RDS at this volume. Athena charges per terabyte of data scanned by each query; with proper partitioning and Parquet compression, the average query scans 95% less data than it would over the raw files. The entire data lake runs for roughly $800/month versus an estimated $6,000+ for equivalent RDS capacity.
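The per-query arithmetic is easy to check. A rough estimator, assuming Athena's published $5 per TB scanned (verify current pricing for your region):

```python
def athena_query_cost(bytes_scanned: float, price_per_tb: float = 5.0) -> float:
    """Estimate the cost of one Athena query from bytes scanned,
    assuming a flat price per TB (default $5/TB, an assumption)."""
    return bytes_scanned / 1e12 * price_per_tb

# A query forced to scan a full day of raw data (~500 GB) versus the
# same query over partitioned, compressed Parquet (~95% less scanned):
raw_cost = athena_query_cost(500e9)      # ~$2.50 per query
pruned_cost = athena_query_cost(25e9)    # ~$0.13 per query
```

Multiplied across hundreds of dashboard and ad-hoc queries per day, that scan reduction is where most of the $800-vs-$6,000 gap comes from.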
