Querying S3 Logs at Scale with Amazon Athena: SQL Without a Database
You have millions of log files sitting in S3 — application logs, ALB access logs, CloudFront logs, VPC flow logs — and you need answers fast. Loading them into a traditional database is expensive, slow, and operationally heavy. Amazon Athena lets you run standard SQL directly against those files in S3, paying only for the data scanned, with zero infrastructure to manage.
TL;DR
| Concern | Athena's Answer |
|---|---|
| Infrastructure to manage | None — fully serverless |
| Data movement required | No — queries run directly on S3 |
| Query language | Standard SQL (Presto/Trino engine) |
| Cost model | Per-query, based on data scanned (check AWS pricing page for current rates) |
| Performance optimization | Partitioning + columnar formats (Parquet/ORC) |
| Schema management | AWS Glue Data Catalog |
How Athena Works: The Architecture
Athena is a serverless query engine built on Presto/Trino. It uses the AWS Glue Data Catalog as its metastore — storing table definitions, schema, and partition metadata — while the actual data remains in S3. When you fire a query, Athena's distributed engine reads only the relevant S3 objects, applies the SQL logic, and writes results to a designated S3 output location.
- Client (Console/SDK/JDBC): Submits a SQL query to Athena.
- Glue Data Catalog: Athena resolves the table schema, column types, and partition locations from the catalog.
- S3 Log Bucket: Athena's distributed workers read only the relevant partitions and files — not the entire bucket.
- Query Engine (Presto/Trino): Processes, filters, aggregates data in-memory across distributed workers.
- S3 Results Bucket: Final query results are written as CSV to your designated output location.
- Client receives results via the Athena API or console.
Analogy: Think of Athena like a librarian who knows exactly which shelf and which page to open (via the Glue catalog index), rather than reading every book in the library from cover to cover. Partitioning is the Dewey Decimal System — without it, the librarian checks every single book.
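To make that round trip concrete, here is a minimal sketch of submitting a query and fetching results with boto3 (the database, table, and bucket names are this article's running examples, not fixed values):

import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# 1. Submit SQL; Athena resolves the schema from the Glue Data Catalog.
query_id = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM app_logs WHERE year = '2024'",
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)["QueryExecutionId"]

# 2. Poll while the distributed workers read the relevant S3 objects.
while True:
    state = athena.get_query_execution(
        QueryExecutionId=query_id
    )["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# 3. Results land in the results bucket as CSV; read them back via the API.
if state == "SUCCEEDED":
    result = athena.get_query_results(QueryExecutionId=query_id)
    print(result["ResultSet"]["Rows"])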
Step 1: Organize Your S3 Logs with a Partition Strategy
Before you write a single line of SQL, your S3 prefix structure has already determined your query cost and speed. A well-partitioned layout means Athena skips irrelevant data entirely (partition pruning).
Recommended S3 prefix structure (Hive-compatible):
s3://my-log-bucket/app-logs/
    year=2024/
        month=07/
            day=15/
                hour=14/
                    logfile_001.gz
                    logfile_002.gz
This Hive-style partitioning (key=value) is natively understood by Glue crawlers and Athena's partition projection feature.
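On the producer side, whatever ships your logs just needs to build keys in that key=value shape. A minimal writer sketch, assuming gzipped newline-delimited JSON and this article's bucket name:

import gzip
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def put_log_batch(records: list[dict]) -> None:
    """Write a batch of log records under the Hive-style partition prefix."""
    now = datetime.now(timezone.utc)
    # The key=value path segments are what enable Athena's partition pruning.
    key = (
        f"app-logs/year={now:%Y}/month={now:%m}/day={now:%d}/hour={now:%H}/"
        f"logfile_{now:%M%S}.gz"
    )
    body = gzip.compress(
        "\n".join(json.dumps(r) for r in records).encode("utf-8")
    )
    s3.put_object(Bucket="my-log-bucket", Key=key, Body=body)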
Step 2: Define the Table in Glue Data Catalog
You have two options: run a Glue Crawler to auto-discover schema, or define the table manually with a CREATE EXTERNAL TABLE DDL in Athena. For production log pipelines, the DDL approach gives you explicit control.
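For reference, the crawler route can be scripted as well; a rough boto3 sketch (the crawler name, schedule, and IAM role here are hypothetical placeholders):

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# One-time setup: a crawler that infers the schema and registers partitions
# in the logs_db database. GlueCrawlerRole is a placeholder IAM role.
glue.create_crawler(
    Name="app-logs-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="logs_db",
    Targets={"S3Targets": [{"Path": "s3://my-log-bucket/app-logs/"}]},
    Schedule="cron(15 * * * ? *)",  # hourly re-crawl to pick up new partitions
)
glue.start_crawler(Name="app-logs-crawler")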
Below is a complete example for querying gzip-compressed JSON application logs:
CREATE EXTERNAL TABLE DDL (JSON logs, partitioned by date):
CREATE EXTERNAL TABLE app_logs (
  request_id  STRING,
  `timestamp` STRING,  -- backticked: timestamp is a reserved word in Athena DDL
  level       STRING,
  service     STRING,
  message     STRING,
  duration_ms INT,
  status_code INT,
  user_id     STRING
)
PARTITIONED BY (
  year  STRING,
  month STRING,
  day   STRING,
  hour  STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'ignore.malformed.json' = 'TRUE'
)
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-log-bucket/app-logs/'
TBLPROPERTIES (
  'has_encrypted_data' = 'false',
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2023,2025',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23',
  'projection.hour.digits' = '2',
  'storage.location.template' = 's3://my-log-bucket/app-logs/year=${year}/month=${month}/day=${day}/hour=${hour}'
);
Key decisions explained:
- JsonSerDe: Handles JSON log lines. For CSV/TSV logs, use `LazySimpleSerDe`.
- Backticked `timestamp`: timestamp is a reserved word in Athena, so the DDL backticks it and the example queries below double-quote it.
- Partition Projection: Eliminates the need to run `MSCK REPAIR TABLE` or manually add partitions every hour (see the sketch after this list). Athena computes partition paths dynamically from the defined ranges — critical for high-frequency log ingestion.
- LOCATION: Points to the S3 prefix root, not a specific file.
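For contrast, the sketch referenced above: without projection, every new hour of data needs an explicit registration step before Athena can see it, roughly like this (reusing the article's table and bucket names):

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Without partition projection, something (a cron job, a Lambda) must run
# this for every new hour of logs.
athena.start_query_execution(
    QueryString=(
        "ALTER TABLE app_logs ADD IF NOT EXISTS "
        "PARTITION (year='2024', month='07', day='15', hour='14')"
    ),
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)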
Step 3: Run SQL Queries Against Your Logs
With the table defined, you can immediately run SQL. Always include partition filters in your WHERE clause — this is the single most impactful cost-control mechanism.
Example Queries:
-- Query 1: Error rate by service for a specific day
SELECT
    service,
    COUNT(*) AS total_requests,
    SUM(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END) AS error_count,
    ROUND(
        100.0 * SUM(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END) / COUNT(*), 2
    ) AS error_rate_pct
FROM app_logs
WHERE year = '2024' AND month = '07' AND day = '15'
GROUP BY service
ORDER BY error_rate_pct DESC;

-- Query 2: P99 and P95 latency per service for a specific day
SELECT
    service,
    APPROX_PERCENTILE(duration_ms, 0.99) AS p99_latency_ms,
    APPROX_PERCENTILE(duration_ms, 0.95) AS p95_latency_ms,
    AVG(duration_ms) AS avg_latency_ms
FROM app_logs
WHERE year = '2024' AND month = '07' AND day = '15'
GROUP BY service
ORDER BY p99_latency_ms DESC;

-- Query 3: Find all ERROR logs for a specific user in a time window
SELECT "timestamp", service, message, duration_ms
FROM app_logs
WHERE year = '2024'
  AND month = '07'
  AND day = '15'
  AND hour IN ('10', '11', '12')
  AND level = 'ERROR'
  AND user_id = 'usr_abc123'
ORDER BY "timestamp" ASC;
Step 4: Optimize for Cost and Performance
Raw gzip JSON is convenient but not optimal for Athena. The following optimizations can reduce data scanned by an order of magnitude.
- Raw JSON/CSV (gzip): Your starting point. Athena must scan entire files; no column pruning is possible.
- Convert to Parquet/ORC: Use AWS Glue ETL jobs or Amazon EMR to convert raw logs to columnar format. Athena then reads only the columns referenced in your `SELECT`.
- Apply Partitioning: Combined with columnar format, Athena skips entire S3 prefixes that don't match your `WHERE` partition filters.
- Result: Minimal scan — only the relevant columns from relevant partitions are read. (A CTAS-based conversion sketch follows the table below.)
| Format | Column Pruning | Compression | Relative Scan Size |
|---|---|---|---|
| JSON (gzip) | ❌ No | ✅ Yes | Baseline |
| CSV (gzip) | ❌ No | ✅ Yes | Baseline |
| Parquet (Snappy) | ✅ Yes | ✅ Yes | Significantly reduced (varies by query selectivity) |
| ORC (zlib) | ✅ Yes | ✅ Yes | Significantly reduced (varies by query selectivity) |
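If a full Glue job (Step 5) is more than you need, a one-shot conversion can be sketched as an Athena CTAS statement submitted from Python. The table and bucket names follow this article's examples; note that a single CTAS query writes at most 100 new partitions, which the one-day filter here respects:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# CTAS: convert one day of raw JSON logs into partitioned, Snappy-compressed
# Parquet. The partition columns must come last in the SELECT list.
ctas = """
CREATE TABLE logs_db.app_logs_parquet
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://my-log-bucket/app-logs-parquet/',
    partitioned_by = ARRAY['year', 'month', 'day', 'hour']
) AS
SELECT request_id, "timestamp", level, service, message,
       duration_ms, status_code, user_id,
       year, month, day, hour
FROM logs_db.app_logs
WHERE year = '2024' AND month = '07' AND day = '15'
"""

athena.start_query_execution(
    QueryString=ctas,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)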
Step 5: Automate with AWS Glue ETL (Raw → Parquet Pipeline)
For a production pipeline, use an AWS Glue job triggered on a schedule (or by S3 event via EventBridge) to continuously convert incoming raw logs to partitioned Parquet.
Glue ETL Job: JSON to Parquet conversion (PySpark):
import sys

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read raw JSON logs from the Glue catalog
raw_logs = glueContext.create_dynamic_frame.from_catalog(
    database="logs_db",
    table_name="app_logs_raw"
)

# Write as partitioned Parquet to the optimized S3 location
glueContext.write_dynamic_frame.from_options(
    frame=raw_logs,
    connection_type="s3",
    connection_options={
        "path": "s3://my-log-bucket/app-logs-parquet/",
        "partitionKeys": ["year", "month", "day", "hour"]
    },
    format="parquet",
    format_options={"compression": "snappy"}
)

job.commit()
IAM: Least-Privilege Permissions for Athena
Athena requires permissions to read from your source S3 bucket, write query results to the output bucket, and interact with the Glue Data Catalog. It also needs to read results back from the output bucket, since GetQueryResults fetches the CSV from S3. Below is a minimal IAM policy:
IAM Policy for Athena Query Execution:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AthenaQueryExecution",
      "Effect": "Allow",
      "Action": [
        "athena:StartQueryExecution",
        "athena:GetQueryExecution",
        "athena:GetQueryResults",
        "athena:StopQueryExecution",
        "athena:GetWorkGroup"
      ],
      "Resource": "arn:aws:athena:us-east-1:123456789012:workgroup/primary"
    },
    {
      "Sid": "GlueCatalogAccess",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetTable",
        "glue:GetPartitions"
      ],
      "Resource": [
        "arn:aws:glue:us-east-1:123456789012:catalog",
        "arn:aws:glue:us-east-1:123456789012:database/logs_db",
        "arn:aws:glue:us-east-1:123456789012:table/logs_db/app_logs"
      ]
    },
    {
      "Sid": "S3ReadSourceLogs",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-log-bucket",
        "arn:aws:s3:::my-log-bucket/*"
      ]
    },
    {
      "Sid": "S3WriteQueryResults",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::my-athena-results-bucket",
        "arn:aws:s3:::my-athena-results-bucket/*"
      ]
    }
  ]
}
Cost Control: Athena Workgroups
Use Athena Workgroups to enforce per-query data scan limits, preventing runaway queries from scanning terabytes of data. This is a critical guardrail in production environments.
# Create a workgroup with a 1 GB per-query scan limit
aws athena create-work-group \
  --name log-analysis-wg \
  --configuration 'ResultConfiguration={OutputLocation=s3://my-athena-results-bucket/},EnforceWorkGroupConfiguration=true,BytesScannedCutoffPerQuery=1073741824' \
  --region us-east-1
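On the submit side, queries opt into the guardrail by naming the workgroup. A minimal boto3 sketch, reusing this article's table, database, and workgroup names:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# With EnforceWorkGroupConfiguration=true, the workgroup supplies the output
# location, and Athena cancels any query whose scan would exceed the 1 GB
# BytesScannedCutoffPerQuery limit.
response = athena.start_query_execution(
    QueryString=(
        "SELECT COUNT(*) FROM app_logs "
        "WHERE year = '2024' AND month = '07' AND day = '15'"
    ),
    QueryExecutionContext={"Database": "logs_db"},
    WorkGroup="log-analysis-wg",
)
print(response["QueryExecutionId"])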
Querying AWS-Native Log Formats
Athena has first-class support for several AWS service log formats. AWS provides pre-built DDL templates in the official documentation for these sources:
- ALB Access Logs: Use the AWS-provided SerDe and DDL from the Elastic Load Balancing documentation.
- CloudFront Access Logs: Tab-delimited format with a specific SerDe configuration documented by AWS.
- VPC Flow Logs: AWS provides a ready-to-use DDL in the VPC Flow Logs documentation for Athena integration.
- CloudTrail Logs: Use the CloudTrail SerDe (`com.amazon.emr.hive.serde.CloudTrailSerde`) documented in the CloudTrail Athena integration guide.
Always use the DDL templates from the official AWS documentation for these formats to ensure correctness.
Glossary
| Term | Definition |
|---|---|
| Partition Pruning | Athena's ability to skip S3 prefixes entirely based on partition key filters in the WHERE clause, reducing data scanned. |
| Partition Projection | An Athena feature that computes partition locations dynamically from table properties, eliminating the need to register partitions in the Glue catalog manually. |
| SerDe | Serializer/Deserializer — a library that tells Athena how to parse a specific file format (JSON, CSV, Regex, etc.) into rows and columns. |
| Columnar Format | A storage format (Parquet, ORC) that organizes data by column rather than by row, enabling Athena to read only the columns referenced in a query. |
| Workgroup | An Athena resource that groups queries, enforces configuration (output location, scan limits), and enables cost tracking per team or use case. |