Querying S3 Logs at Scale with Amazon Athena: SQL Without a Database
You have millions of log files sitting in S3 — application logs, ALB access logs, CloudFront logs, VPC flow logs — and you need answers fast. Loading them into a traditional database is expensive, slow, and operationally heavy. Amazon Athena lets you run standard SQL directly against those files in S3, paying only for the data scanned, with zero infrastructure to manage.
TL;DR
| Concern | Athena's Answer |
|---|---|
| Infrastructure to manage | None — fully serverless |
| Data movement required | No — queries run directly on S3 |
| Query language | Standard SQL (Presto/Trino engine) |
| Cost model | Per-query, based on data scanned (check AWS pricing page for current rates) |
| Performance optimization | Partitioning + columnar formats (Parquet/ORC) |
| Schema management | AWS Glue Data Catalog |
How Athena Works: The Architecture
Athena is a serverless query engine built on Presto/Trino. It uses the AWS Glue Data Catalog as its metastore — storing table definitions, schema, and partition metadata — while the actual data remains in S3. When you fire a query, Athena's distributed engine reads only the relevant S3 objects, applies the SQL logic, and writes results to a designated S3 output location.
- Client (Console/SDK/JDBC): Submits a SQL query to Athena.
- Glue Data Catalog: Athena resolves the table schema, column types, and partition locations from the catalog.
- S3 Log Bucket: Athena's distributed workers read only the relevant partitions and files — not the entire bucket.
- Query Engine (Presto/Trino): Processes, filters, aggregates data in-memory across distributed workers.
- S3 Results Bucket: Final query results are written as CSV to your designated output location.
- Client receives results via the Athena API or console.
Analogy: Think of Athena like a librarian who knows exactly which shelf and which page to open (via the Glue catalog index), rather than reading every book in the library from cover to cover. Partitioning is the Dewey Decimal System — without it, the librarian checks every single book.
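To make that round trip concrete, here is a minimal sketch of submitting a query and fetching results with boto3 (the database, table, and bucket names are this article's running examples, not fixed values):

import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# 1. Submit SQL; Athena resolves the schema from the Glue Data Catalog.
query_id = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM app_logs WHERE year = '2024'",
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)["QueryExecutionId"]

# 2. Poll while the distributed workers read the relevant S3 objects.
while True:
    state = athena.get_query_execution(
        QueryExecutionId=query_id
    )["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# 3. Results land in the results bucket as CSV; read them back via the API.
if state == "SUCCEEDED":
    result = athena.get_query_results(QueryExecutionId=query_id)
    print(result["ResultSet"]["Rows"])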
Step 1: Organize Your S3 Logs with a Partition Strategy
Before you write a single line of SQL, your S3 prefix structure has already determined your query cost and speed. A well-partitioned layout means Athena skips irrelevant data entirely (partition pruning).
Recommended S3 prefix structure (Hive-compatible):
s3://my-log-bucket/app-logs/
    year=2024/
        month=07/
            day=15/
                hour=14/
                    logfile_001.gz
                    logfile_002.gz
This Hive-style partitioning (key=value) is natively understood by Glue crawlers and Athena's partition projection feature.
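On the producer side, whatever ships your logs just needs to build keys in that key=value shape. A minimal writer sketch, assuming gzipped newline-delimited JSON and this article's bucket name:

import gzip
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def put_log_batch(records: list[dict]) -> None:
    """Write a batch of log records under the Hive-style partition prefix."""
    now = datetime.now(timezone.utc)
    # The key=value path segments are what enable Athena's partition pruning.
    key = (
        f"app-logs/year={now:%Y}/month={now:%m}/day={now:%d}/hour={now:%H}/"
        f"logfile_{now:%M%S}.gz"
    )
    body = gzip.compress(
        "\n".join(json.dumps(r) for r in records).encode("utf-8")
    )
    s3.put_object(Bucket="my-log-bucket", Key=key, Body=body)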
Step 2: Define the Table in Glue Data Catalog
You have two options: run a Glue Crawler to auto-discover schema, or define the table manually with a CREATE EXTERNAL TABLE DDL in Athena. For production log pipelines, the DDL approach gives you explicit control.
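For reference, the crawler route can be scripted as well; a rough boto3 sketch (the crawler name, schedule, and IAM role here are hypothetical placeholders):

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# One-time setup: a crawler that infers the schema and registers partitions
# in the logs_db database. GlueCrawlerRole is a placeholder IAM role.
glue.create_crawler(
    Name="app-logs-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="logs_db",
    Targets={"S3Targets": [{"Path": "s3://my-log-bucket/app-logs/"}]},
    Schedule="cron(15 * * * ? *)",  # hourly re-crawl to pick up new partitions
)
glue.start_crawler(Name="app-logs-crawler")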
Below is a complete example for querying gzip-compressed JSON application logs:
CREATE EXTERNAL TABLE DDL (JSON logs, partitioned by date):
CREATE EXTERNAL TABLE app_logs (
  request_id  STRING,
  `timestamp` STRING,  -- backticked: timestamp is a reserved word in Athena DDL
  level       STRING,
  service     STRING,
  message     STRING,
  duration_ms INT,
  status_code INT,
  user_id     STRING
)
PARTITIONED BY (
  year  STRING,
  month STRING,
  day   STRING,
  hour  STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'ignore.malformed.json' = 'TRUE'
)
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-log-bucket/app-logs/'
TBLPROPERTIES (
  'has_encrypted_data' = 'false',
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2023,2025',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23',
  'projection.hour.digits' = '2',
  'storage.location.template' = 's3://my-log-bucket/app-logs/year=${year}/month=${month}/day=${day}/hour=${hour}'
);
Key decisions explained:
- JsonSerDe: Handles JSON log lines. For CSV/TSV logs, use `LazySimpleSerDe`.
- Backticked `timestamp`: timestamp is a reserved word in Athena, so the DDL backticks it and the example queries below double-quote it.
- Partition Projection: Eliminates the need to run `MSCK REPAIR TABLE` or manually add partitions every hour (see the sketch after this list). Athena computes partition paths dynamically from the defined ranges — critical for high-frequency log ingestion.
- LOCATION: Points to the S3 prefix root, not a specific file.
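For contrast, the sketch referenced above: without projection, every new hour of data needs an explicit registration step before Athena can see it, roughly like this (reusing the article's table and bucket names):

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Without partition projection, something (a cron job, a Lambda) must run
# this for every new hour of logs.
athena.start_query_execution(
    QueryString=(
        "ALTER TABLE app_logs ADD IF NOT EXISTS "
        "PARTITION (year='2024', month='07', day='15', hour='14')"
    ),
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)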
Step 3: Run SQL Queries Against Your Logs
With the table defined, you can immediately run SQL. Always include partition filters in your WHERE clause — this is the single most impactful cost-control mechanism.
Example Queries:
-- Query 1: Error rate by service for a specific day
SELECT
    service,
    COUNT(*) AS total_requests,
    SUM(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END) AS error_count,
    ROUND(
        100.0 * SUM(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END) / COUNT(*), 2
    ) AS error_rate_pct
FROM app_logs
WHERE year = '2024' AND month = '07' AND day = '15'
GROUP BY service
ORDER BY error_rate_pct DESC;

-- Query 2: P99 and P95 latency per service for a specific day
SELECT
    service,
    APPROX_PERCENTILE(duration_ms, 0.99) AS p99_latency_ms,
    APPROX_PERCENTILE(duration_ms, 0.95) AS p95_latency_ms,
    AVG(duration_ms) AS avg_latency_ms
FROM app_logs
WHERE year = '2024' AND month = '07' AND day = '15'
GROUP BY service
ORDER BY p99_latency_ms DESC;

-- Query 3: Find all ERROR logs for a specific user in a time window
SELECT "timestamp", service, message, duration_ms
FROM app_logs
WHERE year = '2024'
  AND month = '07'
  AND day = '15'
  AND hour IN ('10', '11', '12')
  AND level = 'ERROR'
  AND user_id = 'usr_abc123'
ORDER BY "timestamp" ASC;
Step 4: Optimize for Cost and Performance
Raw gzip JSON is convenient but not optimal for Athena. The following optimizations can reduce data scanned by an order of magnitude.
- Raw JSON/CSV (gzip): Your starting point. Athena must scan entire files; no column pruning is possible.
- Convert to Parquet/ORC: Use AWS Glue ETL jobs or Amazon EMR to convert raw logs to columnar format. Athena then reads only the columns referenced in your `SELECT`.
- Apply Partitioning: Combined with columnar format, Athena skips entire S3 prefixes that don't match your `WHERE` partition filters.
- Result: Minimal scan — only the relevant columns from relevant partitions are read. (A CTAS-based conversion sketch follows the table below.)
| Format | Column Pruning | Compression | Relative Scan Size |
|---|---|---|---|
| JSON (gzip) | ❌ No | ✅ Yes | Baseline |
| CSV (gzip) | ❌ No | ✅ Yes | Baseline |
| Parquet (Snappy) | ✅ Yes | ✅ Yes | Significantly reduced (varies by query selectivity) |
| ORC (zlib) | ✅ Yes | ✅ Yes | Significantly reduced (varies by query selectivity) |
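If a full Glue job (Step 5) is more than you need, a one-shot conversion can be sketched as an Athena CTAS statement submitted from Python. The table and bucket names follow this article's examples; note that a single CTAS query writes at most 100 new partitions, which the one-day filter here respects:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# CTAS: convert one day of raw JSON logs into partitioned, Snappy-compressed
# Parquet. The partition columns must come last in the SELECT list.
ctas = """
CREATE TABLE logs_db.app_logs_parquet
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://my-log-bucket/app-logs-parquet/',
    partitioned_by = ARRAY['year', 'month', 'day', 'hour']
) AS
SELECT request_id, "timestamp", level, service, message,
       duration_ms, status_code, user_id,
       year, month, day, hour
FROM logs_db.app_logs
WHERE year = '2024' AND month = '07' AND day = '15'
"""

athena.start_query_execution(
    QueryString=ctas,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)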
Step 5: Automate with AWS Glue ETL (Raw → Parquet Pipeline)
For a production pipeline, use an AWS Glue job triggered on a schedule (or by S3 event via EventBridge) to continuously convert incoming raw logs to partitioned Parquet.
Glue ETL Job: JSON to Parquet conversion (PySpark):
import sys

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read raw JSON logs from the Glue catalog
raw_logs = glueContext.create_dynamic_frame.from_catalog(
    database="logs_db",
    table_name="app_logs_raw"
)

# Write as partitioned Parquet to the optimized S3 location
glueContext.write_dynamic_frame.from_options(
    frame=raw_logs,
    connection_type="s3",
    connection_options={
        "path": "s3://my-log-bucket/app-logs-parquet/",
        "partitionKeys": ["year", "month", "day", "hour"]
    },
    format="parquet",
    format_options={"compression": "snappy"}
)

job.commit()
IAM: Least-Privilege Permissions for Athena
Athena requires permissions to read from your source S3 bucket, write query results to the output bucket, and interact with the Glue Data Catalog. It also needs to read results back from the output bucket, since GetQueryResults fetches the CSV from S3. Below is a minimal IAM policy:
IAM Policy for Athena Query Execution:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AthenaQueryExecution",
      "Effect": "Allow",
      "Action": [
        "athena:StartQueryExecution",
        "athena:GetQueryExecution",
        "athena:GetQueryResults",
        "athena:StopQueryExecution",
        "athena:GetWorkGroup"
      ],
      "Resource": "arn:aws:athena:us-east-1:123456789012:workgroup/primary"
    },
    {
      "Sid": "GlueCatalogAccess",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetTable",
        "glue:GetPartitions"
      ],
      "Resource": [
        "arn:aws:glue:us-east-1:123456789012:catalog",
        "arn:aws:glue:us-east-1:123456789012:database/logs_db",
        "arn:aws:glue:us-east-1:123456789012:table/logs_db/app_logs"
      ]
    },
    {
      "Sid": "S3ReadSourceLogs",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-log-bucket",
        "arn:aws:s3:::my-log-bucket/*"
      ]
    },
    {
      "Sid": "S3WriteQueryResults",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::my-athena-results-bucket",
        "arn:aws:s3:::my-athena-results-bucket/*"
      ]
    }
  ]
}
Cost Control: Athena Workgroups
Use Athena Workgroups to enforce per-query data scan limits, preventing runaway queries from scanning terabytes of data. This is a critical guardrail in production environments.
# Create a workgroup with a 1 GB per-query scan limit
aws athena create-work-group \
  --name log-analysis-wg \
  --configuration 'ResultConfiguration={OutputLocation=s3://my-athena-results-bucket/},EnforceWorkGroupConfiguration=true,BytesScannedCutoffPerQuery=1073741824' \
  --region us-east-1
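On the submit side, queries opt into the guardrail by naming the workgroup. A minimal boto3 sketch, reusing this article's table, database, and workgroup names:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# With EnforceWorkGroupConfiguration=true, the workgroup supplies the output
# location, and Athena cancels any query whose scan would exceed the 1 GB
# BytesScannedCutoffPerQuery limit.
response = athena.start_query_execution(
    QueryString=(
        "SELECT COUNT(*) FROM app_logs "
        "WHERE year = '2024' AND month = '07' AND day = '15'"
    ),
    QueryExecutionContext={"Database": "logs_db"},
    WorkGroup="log-analysis-wg",
)
print(response["QueryExecutionId"])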
Querying AWS-Native Log Formats
Athena has first-class support for several AWS service log formats. AWS provides pre-built DDL templates in the official documentation for these sources:
- ALB Access Logs: Use the AWS-provided SerDe and DDL from the Elastic Load Balancing documentation.
- CloudFront Access Logs: Tab-delimited format with a specific SerDe configuration documented by AWS.
- VPC Flow Logs: AWS provides a ready-to-use DDL in the VPC Flow Logs documentation for Athena integration.
- CloudTrail Logs: Use the CloudTrail SerDe (`com.amazon.emr.hive.serde.CloudTrailSerde`) documented in the CloudTrail Athena integration guide.
Always use the DDL templates from the official AWS documentation for these formats to ensure correctness.
Glossary
| Term | Definition |
|---|---|
| Partition Pruning | Athena's ability to skip S3 prefixes entirely based on partition key filters in the WHERE clause, reducing data scanned. |
| Partition Projection | An Athena feature that computes partition locations dynamically from table properties, eliminating the need to register partitions in the Glue catalog manually. |
| SerDe | Serializer/Deserializer — a library that tells Athena how to parse a specific file format (JSON, CSV, Regex, etc.) into rows and columns. |
| Columnar Format | A storage format (Parquet, ORC) that organizes data by column rather than by row, enabling Athena to read only the columns referenced in a query. |
| Workgroup | An Athena resource that groups queries, enforces configuration (output location, scan limits), and enables cost tracking per team or use case. |