Queries that scan billions of rows bring databases to their
knees. A single poorly designed table can turn sub-second queries into multi-minute ordeals. Table partitioning
solves this by dividing large tables into smaller, manageable chunks that queries can target specifically,
dramatically improving performance and reducing costs.
This guide covers production partitioning strategies across
major databases—Postgres, MySQL, Snowflake, and BigQuery. We’ll transform slow, full-table scans into lightning-fast
targeted queries.
Why Partitioning Transforms Performance
The Problem with Large Tables
Without partitioning, large tables suffer from:
- Full table scans: Query touches all billions of rows even when filtering
- Slow queries: Minutes instead of seconds for common operations
- High costs: Cloud warehouses bill by the amount of data scanned
- Difficult maintenance: VACUUM and ANALYZE runs take hours
- Poor concurrency: Lock contention on massive tables
- Inefficient archiving: Can’t easily drop old data
Partitioning Benefits
- Partition pruning: Query only relevant partitions, skip the rest
- 10-100x faster queries: Scan GB instead of TB
- Lower costs: Pay only for data scanned
- Easy maintenance: Work on individual partitions
- Simple archiving: Drop old partitions instantly
- Better concurrency: Lock individual partitions, not entire table
Pattern 1: Range Partitioning (Time-Based)
PostgreSQL Range Partitioning
-- Create partitioned table
CREATE TABLE events (
event_id BIGINT,
user_id BIGINT,
event_type VARCHAR(50),
event_data JSONB,
created_at TIMESTAMP NOT NULL
) PARTITION BY RANGE (created_at);
-- Create monthly partitions for 2024
CREATE TABLE events_2024_01 PARTITION OF events
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE events_2024_02 PARTITION OF events
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
CREATE TABLE events_2024_03 PARTITION OF events
FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');
-- Create indexes on partitions
CREATE INDEX idx_events_2024_01_user ON events_2024_01(user_id);
CREATE INDEX idx_events_2024_02_user ON events_2024_02(user_id);
-- Query with partition pruning
EXPLAIN ANALYZE
SELECT * FROM events
WHERE created_at >= '2024-01-15'
AND created_at < '2024-01-20'
AND user_id = 12345;
-- Only scans events_2024_01, skips all other partitions
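One detail worth noting for the setup above: an insert whose created_at falls outside every defined range is rejected outright. A DEFAULT partition (available since PostgreSQL 11) catches such rows; a minimal sketch, with an arbitrary partition name:
-- Catch-all for rows outside the defined monthly ranges
CREATE TABLE events_default PARTITION OF events DEFAULT;
Keep an eye on it, though: rows that land there must be moved out before you can create a regular partition covering their range.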
Automated Partition Management
# Python script to auto-create partitions
import psycopg2
from datetime import datetime

def create_monthly_partition(conn, table_name, year, month):
    """Create a monthly partition with its indexes."""
    start_date = datetime(year, month, 1)
    if month == 12:
        end_date = datetime(year + 1, 1, 1)
    else:
        end_date = datetime(year, month + 1, 1)

    partition_name = f"{table_name}_{year}_{month:02d}"
    create_sql = f"""
        CREATE TABLE IF NOT EXISTS {partition_name}
        PARTITION OF {table_name}
        FOR VALUES FROM ('{start_date:%Y-%m-%d}') TO ('{end_date:%Y-%m-%d}');
    """

    with conn.cursor() as cur:
        cur.execute(create_sql)
        # Create indexes on the new partition
        cur.execute(f"""
            CREATE INDEX IF NOT EXISTS idx_{partition_name}_user
            ON {partition_name}(user_id);
        """)
        cur.execute(f"""
            CREATE INDEX IF NOT EXISTS idx_{partition_name}_created
            ON {partition_name}(created_at);
        """)

    conn.commit()
    print(f"Created partition: {partition_name}")

# Create partitions for the next 12 months
conn = psycopg2.connect("dbname=mydb user=postgres")
now = datetime.now()

for i in range(12):
    # Step by calendar month rather than 30-day blocks so no month is skipped
    year = now.year + (now.month - 1 + i) // 12
    month = (now.month - 1 + i) % 12 + 1
    create_monthly_partition(conn, 'events', year, month)

conn.close()
Pattern 2: List Partitioning (Category-Based)
MySQL List Partitioning
-- Partition by region
CREATE TABLE customer_orders (
order_id BIGINT,
customer_id BIGINT,
region VARCHAR(20),
amount DECIMAL(10,2),
order_date DATE
) PARTITION BY LIST COLUMNS(region) (
PARTITION p_north_america VALUES IN ('US', 'CA', 'MX'),
PARTITION p_europe VALUES IN ('UK', 'DE', 'FR', 'ES'),
PARTITION p_asia VALUES IN ('JP', 'CN', 'IN', 'SG'),
PARTITION p_other VALUES IN ('AU', 'BR', 'ZA')
);
-- Query specific partition
SELECT COUNT(*), SUM(amount)
FROM customer_orders
WHERE region IN ('US', 'CA') -- Only scans p_north_america
AND order_date > '2024-01-01';
-- Show partition info
SELECT
PARTITION_NAME,
TABLE_ROWS,
AVG_ROW_LENGTH,
DATA_LENGTH / 1024 / 1024 AS size_mb
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_NAME = 'customer_orders';
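One caveat with LIST partitioning in MySQL: an insert whose region is not in any VALUES IN list fails, so new regions have to be added explicitly. A sketch using REORGANIZE PARTITION to split a new group out of p_other without losing its existing rows (the regrouping and the added region code 'NZ' are hypothetical):
ALTER TABLE customer_orders
REORGANIZE PARTITION p_other INTO (
    PARTITION p_oceania VALUES IN ('AU', 'NZ'),
    PARTITION p_other VALUES IN ('BR', 'ZA')
);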
Pattern 3: Hash Partitioning (Even Distribution)
PostgreSQL Hash Partitioning
-- Create hash-partitioned table
CREATE TABLE user_sessions (
session_id UUID,
user_id BIGINT NOT NULL,
session_data JSONB,
created_at TIMESTAMP
) PARTITION BY HASH (user_id);
-- Create 8 hash partitions
CREATE TABLE user_sessions_0 PARTITION OF user_sessions
FOR VALUES WITH (MODULUS 8, REMAINDER 0);
CREATE TABLE user_sessions_1 PARTITION OF user_sessions
FOR VALUES WITH (MODULUS 8, REMAINDER 1);
CREATE TABLE user_sessions_2 PARTITION OF user_sessions
FOR VALUES WITH (MODULUS 8, REMAINDER 2);
-- ... create remaining partitions 3-7
-- Queries automatically route to correct partition
SELECT * FROM user_sessions
WHERE user_id = 12345; -- Routes to one partition based on hash
-- Parallel processing benefits
-- Each partition can be scanned in parallel
SET max_parallel_workers_per_gather = 8;
SELECT user_id, COUNT(*)
FROM user_sessions
GROUP BY user_id;
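Writing eight nearly identical CREATE TABLE statements by hand gets tedious. A small DO block (sketch) can generate the whole set, including the partitions 3-7 elided above:
DO $$
BEGIN
    FOR i IN 0..7 LOOP
        EXECUTE format(
            'CREATE TABLE IF NOT EXISTS user_sessions_%s PARTITION OF user_sessions
             FOR VALUES WITH (MODULUS 8, REMAINDER %s)', i, i);
    END LOOP;
END $$;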
Pattern 4: Composite Partitioning
Multi-Level Partitioning
-- First partition by date, then by region
CREATE TABLE sales (
sale_id BIGINT,
region VARCHAR(20),
amount DECIMAL(10,2),
sale_date DATE
) PARTITION BY RANGE (sale_date);
-- Create yearly partitions
CREATE TABLE sales_2024 PARTITION OF sales
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01')
PARTITION BY LIST (region);
-- Sub-partition 2024 by region
CREATE TABLE sales_2024_us PARTITION OF sales_2024
FOR VALUES IN ('US');
CREATE TABLE sales_2024_eu PARTITION OF sales_2024
FOR VALUES IN ('UK', 'DE', 'FR');
CREATE TABLE sales_2024_asia PARTITION OF sales_2024
FOR VALUES IN ('JP', 'CN', 'IN');
-- Query hits single sub-partition
SELECT SUM(amount)
FROM sales
WHERE sale_date BETWEEN '2024-06-01' AND '2024-06-30'
AND region = 'US';
-- Only scans sales_2024_us
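It is worth verifying that pruning actually reaches the sub-partition level. EXPLAIN makes this visible; for the query above, the plan should reference only the sales_2024_us leaf:
EXPLAIN (COSTS OFF)
SELECT SUM(amount)
FROM sales
WHERE sale_date BETWEEN '2024-06-01' AND '2024-06-30'
  AND region = 'US';
-- If every sub-partition shows up in the plan, the filter is not matching the partition keys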
Pattern 5: Snowflake Clustering
Micro-Partition Optimization
-- Snowflake automatically partitions data into micro-partitions
-- But we can optimize with clustering
-- Create table with clustering key
CREATE TABLE events (
event_id NUMBER,
user_id NUMBER,
event_type VARCHAR,
event_timestamp TIMESTAMP_NTZ
) CLUSTER BY (event_timestamp, user_id);
-- Check clustering quality
SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(event_timestamp, user_id)');
-- Manual reclustering (ALTER TABLE ... RECLUSTER) is deprecated;
-- use Automatic Clustering instead (recommended for production)
ALTER TABLE events RESUME RECLUSTER;
-- Pause it if clustering credit usage spikes
-- ALTER TABLE events SUSPEND RECLUSTER;
-- Query benefits from clustering
SELECT *
FROM events
WHERE event_timestamp BETWEEN '2024-01-01' AND '2024-01-31'
AND user_id = 12345;
-- Snowflake prunes micro-partitions automatically
-- Check how much pruning a query actually achieved
-- (PARTITIONS_SCANNED vs PARTITIONS_TOTAL; the same numbers appear in the query profile;
--  note that ACCOUNT_USAGE can lag very recent queries)
SELECT QUERY_TEXT, PARTITIONS_SCANNED, PARTITIONS_TOTAL, START_TIME
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE QUERY_TEXT LIKE '%events%'
ORDER BY START_TIME DESC
LIMIT 1;
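Clustering quality also degrades as new data arrives, so it pays to track it over time. SYSTEM$CLUSTERING_DEPTH returns the average overlap depth of micro-partitions for a clustering key; lower is better. A quick check (sketch):
-- Values near 1 mean well-clustered; a steadily rising number suggests reclustering is falling behind
SELECT SYSTEM$CLUSTERING_DEPTH('events', '(event_timestamp, user_id)');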
Pattern 6: BigQuery Partitioning
Native Time and Integer Partitioning
-- Time-partitioned table
CREATE TABLE myproject.mydataset.events
(
event_id INT64,
user_id INT64,
event_type STRING,
event_data JSON,
event_timestamp TIMESTAMP
)
PARTITION BY DATE(event_timestamp)
OPTIONS(
partition_expiration_days=365,
require_partition_filter=true
);
-- Clustered + partitioned for optimal performance
CREATE TABLE myproject.mydataset.events_optimized
(
event_id INT64,
user_id INT64,
event_type STRING,
event_data JSON,
event_timestamp TIMESTAMP
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type
OPTIONS(
partition_expiration_days=365,
require_partition_filter=true
);
-- Query with partition pruning
SELECT
event_type,
COUNT(*) as event_count
FROM myproject.mydataset.events
WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31'
AND user_id = 12345
GROUP BY event_type;
-- Costs: Only scans January partitions, not entire table
BigQuery Cost Optimization
-- Check partition statistics
SELECT
partition_id,
total_rows,
total_logical_bytes / 1024 / 1024 / 1024 as size_gb
FROM myproject.mydataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'events'
ORDER BY partition_id DESC;
-- Worst case: estimate the cost of a full scan of the table
-- (assumes on-demand pricing of roughly $5 per scanned TB; check current rates)
SELECT
  SUM(size_bytes) / 1024 / 1024 / 1024 AS table_size_gb,
  SUM(size_bytes) / 1024 / 1024 / 1024 * 5 / 1000 AS full_scan_cost_usd
FROM myproject.mydataset.__TABLES__
WHERE table_id = 'events';
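For actual rather than estimated spend, the INFORMATION_SCHEMA JOBS views record what each query billed after the fact. A sketch (the region-us qualifier is an assumption; use your dataset's region):
-- Most expensive queries in the last 24 hours, by bytes billed
SELECT
  creation_time,
  user_email,
  total_bytes_billed / POW(1024, 3) AS billed_gb,
  query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_bytes_billed DESC
LIMIT 10;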
Pattern 7: Partition Maintenance
Automated Partition Lifecycle
# Automated partition management
import psycopg2
from datetime import datetime

class PartitionManager:
    def __init__(self, conn_string):
        self.conn = psycopg2.connect(conn_string)

    def create_partition(self, table_name, year, month):
        """Create one monthly partition plus its user_id index."""
        start = datetime(year, month, 1)
        end = datetime(year + (month // 12), month % 12 + 1, 1)
        partition_name = f"{table_name}_{year}_{month:02d}"
        with self.conn.cursor() as cur:
            cur.execute(f"""
                CREATE TABLE IF NOT EXISTS {partition_name}
                PARTITION OF {table_name}
                FOR VALUES FROM ('{start:%Y-%m-%d}') TO ('{end:%Y-%m-%d}');
            """)
            cur.execute(f"""
                CREATE INDEX IF NOT EXISTS idx_{partition_name}_user
                ON {partition_name}(user_id);
            """)
        self.conn.commit()
        print(f"Created partition: {partition_name}")

    def create_future_partitions(self, table_name, months_ahead=3):
        """Create partitions for upcoming months."""
        now = datetime.now()
        for i in range(1, months_ahead + 1):
            # Step by calendar month so no month is skipped or duplicated
            year = now.year + (now.month - 1 + i) // 12
            month = (now.month - 1 + i) % 12 + 1
            partition_name = f"{table_name}_{year}_{month:02d}"
            with self.conn.cursor() as cur:
                # Skip partitions that already exist
                cur.execute("SELECT 1 FROM pg_tables WHERE tablename = %s",
                            (partition_name,))
                if not cur.fetchone():
                    self.create_partition(table_name, year, month)

    def drop_old_partitions(self, table_name, retention_months=12):
        """Drop partitions older than the retention period."""
        now = datetime.now()
        year = now.year + (now.month - 1 - retention_months) // 12
        month = (now.month - 1 - retention_months) % 12 + 1
        cutoff_name = f"{table_name}_{year}_{month:02d}"
        with self.conn.cursor() as cur:
            # Zero-padded YYYY_MM names sort chronologically, so a plain
            # string comparison is enough to find expired partitions
            cur.execute("""
                SELECT tablename FROM pg_tables
                WHERE tablename LIKE %s AND tablename < %s
            """, (f"{table_name}_%", cutoff_name))
            for (partition_name,) in cur.fetchall():
                print(f"Dropping old partition: {partition_name}")
                cur.execute(f"DROP TABLE {partition_name}")
        self.conn.commit()

    def analyze_partitions(self, table_name):
        """Update planner statistics for all partitions."""
        with self.conn.cursor() as cur:
            cur.execute("SELECT tablename FROM pg_tables WHERE tablename LIKE %s",
                        (f"{table_name}_%",))
            for (partition_name,) in cur.fetchall():
                print(f"Analyzing: {partition_name}")
                cur.execute(f"ANALYZE {partition_name}")
        self.conn.commit()

# Usage
manager = PartitionManager("dbname=mydb user=postgres")

# Create next 3 months of partitions
manager.create_future_partitions('events', months_ahead=3)

# Drop data older than 12 months
manager.drop_old_partitions('events', retention_months=12)

# Update statistics
manager.analyze_partitions('events')
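Rolling your own lifecycle script works, but a common production setup delegates it to extensions. If pg_partman manages the partition sets and pg_cron is installed (both are assumptions here, not used elsewhere in this guide), the whole lifecycle reduces to one scheduled call:
-- Assumes the pg_partman and pg_cron extensions are installed and configured
SELECT cron.schedule(
    'partition-maintenance',
    '0 3 * * *',                         -- daily at 03:00
    $$SELECT partman.run_maintenance()$$
);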
Performance Comparison
Before vs After Partitioning (illustrative numbers for a multi-terabyte events table)
| Metric | Without Partitioning | With Partitioning | Improvement |
|---|---|---|---|
| Query Time | 120 seconds | 2 seconds | 60x faster |
| Data Scanned | 5 TB | 50 GB | 100x less |
| Query Cost (BigQuery) | $25.00 | $0.25 | 100x cheaper |
| Maintenance Window | 6 hours | 15 minutes | 24x faster |
Best Practices
1. Choose the Right Partition Key
- Time-based: Use for event logs, transactions, time-series data
- Category: Use for region, status, type when evenly distributed
- Hash: Use for even distribution when no natural partitioning key
- Avoid: High-cardinality columns (user_id) unless using hash
2. Partition Size Guidelines
- PostgreSQL/MySQL: Aim for roughly 10-50 GB per partition (see the size-check sketch after this list)
- Snowflake: Automatic micro-partitions (roughly 50-500 MB of uncompressed data each); tune layout with clustering keys
- BigQuery: Partitions are managed automatically; partition by day or month
- Too many partitions: Thousands of tiny partitions add planner and metadata overhead
- Too few partitions: Loses the benefits of partitioning
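To check whether existing partitions fall in that size range, PostgreSQL 12+ can walk the partition tree directly; a sketch:
-- Size of every leaf partition under the events table (PostgreSQL 12+)
SELECT
    relid AS partition,
    pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_partition_tree('events')
WHERE isleaf
ORDER BY pg_total_relation_size(relid) DESC;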
3. Index Strategy
-- Option 1: index the parent table (PostgreSQL 11+)
-- Automatically cascades to every existing and future partition,
-- but cannot be built with CREATE INDEX CONCURRENTLY
CREATE INDEX idx_events_user ON events(user_id);
-- Option 2: index each partition individually
-- More objects to manage, but each index can be built CONCURRENTLY
-- without blocking writes on the other partitions
CREATE INDEX idx_events_2024_01_user ON events_2024_01(user_id);
CREATE INDEX idx_events_2024_02_user ON events_2024_02(user_id);
-- Or automate per-partition index creation with a script
DO $$
DECLARE
    partition_name text;
BEGIN
    FOR partition_name IN
        SELECT tablename FROM pg_tables
        WHERE tablename LIKE 'events_2024_%'
    LOOP
        EXECUTE format(
            'CREATE INDEX IF NOT EXISTS %I ON %I(user_id)',
            'idx_' || partition_name || '_user',
            partition_name
        );
    END LOOP;
END $$;
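If you want the convenience of a single parent-level index but cannot afford index builds that block writes, PostgreSQL 11+ lets you build the pieces concurrently and attach them afterwards. A sketch for one partition (repeat steps 2-3 for each partition; the index names are illustrative):
-- 1. Create an (initially invalid) index on the parent only, without cascading
CREATE INDEX idx_events_user_parent ON ONLY events(user_id);
-- 2. Build the matching index on a partition without blocking writes
CREATE INDEX CONCURRENTLY idx_events_2024_03_user ON events_2024_03(user_id);
-- 3. Attach it; once every partition's index is attached, the parent index becomes valid
ALTER INDEX idx_events_user_parent ATTACH PARTITION idx_events_2024_03_user;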
Common Pitfalls
- Not filtering on partition key: Query still scans all partitions
- Too many partitions: Thousands of tiny partitions slow down planning and metadata operations
- Uneven partition sizes: One huge partition defeats the purpose
- Forgetting indexes: Make sure every partition carries the indexes your queries rely on
- No maintenance plan: Need automated creation/deletion of partitions
- Wrong partition granularity: Hourly partitions when daily would suffice
Key Takeaways
- Partition tables >100 GB, especially time-series or event data
- Use range partitioning (time-based) for 80% of use cases
- Always filter queries on partition key for pruning benefits
- Automate partition creation and deletion
- Aim for 10-50 GB per partition for optimal performance
- Create indexes on individual partitions, not the parent table
- Monitor partition sizes and query patterns
- In cloud warehouses (Snowflake, BigQuery), partitioning = massive cost savings
Table partitioning is one of the highest-ROI database optimizations. For large tables, it’s the difference
between queries that take minutes and queries that take seconds—and in cloud warehouses, the difference between
spending thousands vs. tens of dollars per month.