Queries that scan billions of rows bring databases to their
knees. A single poorly designed table can turn sub-second queries into multi-minute ordeals. Table partitioning
solves this by dividing large tables into smaller, manageable chunks that queries can target specifically,
dramatically improving performance and reducing costs.
This guide covers production partitioning strategies across
major databases—Postgres, MySQL, Snowflake, and BigQuery. We’ll transform slow, full-table scans into lightning-fast
targeted queries.
Why Partitioning Transforms Performance
The Problem with Large Tables
Without partitioning, large tables suffer from:
- Full table scans: Query touches all billions of rows even when filtering
- Slow queries: Minutes instead of seconds for common operations
- High costs: Cloud warehouses bill by the amount of data scanned
- Difficult maintenance: VACUUM and ANALYZE runs take hours
- Poor concurrency: Lock contention on massive tables
- Inefficient archiving: Can’t easily drop old data
Partitioning Benefits
- Partition pruning: Query only relevant partitions, skip the rest
- 10-100x faster queries: Scan GB instead of TB
- Lower costs: Pay only for data scanned
- Easy maintenance: Work on individual partitions
- Simple archiving: Drop old partitions instantly
- Better concurrency: Lock individual partitions, not entire table
Pattern 1: Range Partitioning (Time-Based)
PostgreSQL Range Partitioning
-- Create partitioned table
CREATE TABLE events (
event_id BIGINT,
user_id BIGINT,
event_type VARCHAR(50),
event_data JSONB,
created_at TIMESTAMP NOT NULL
) PARTITION BY RANGE (created_at);
-- Create monthly partitions for 2024
CREATE TABLE events_2024_01 PARTITION OF events
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE events_2024_02 PARTITION OF events
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
CREATE TABLE events_2024_03 PARTITION OF events
FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');
-- Create indexes on partitions
CREATE INDEX idx_events_2024_01_user ON events_2024_01(user_id);
CREATE INDEX idx_events_2024_02_user ON events_2024_02(user_id);
-- Query with partition pruning
EXPLAIN ANALYZE
SELECT * FROM events
WHERE created_at >= '2024-01-15'
AND created_at < '2024-01-20'
AND user_id = 12345;
-- Only scans events_2024_01, skips all other partitions
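One detail worth noting for the setup above: an insert whose created_at falls outside every defined range is rejected outright. A DEFAULT partition (available since PostgreSQL 11) catches such rows; a minimal sketch, with an arbitrary partition name:
-- Catch-all for rows outside the defined monthly ranges
CREATE TABLE events_default PARTITION OF events DEFAULT;
Keep an eye on it, though: rows that land there must be moved out before you can create a regular partition covering their range.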
Automated Partition Management
# Python script to auto-create partitions
import psycopg2
from datetime import datetime

def create_monthly_partition(conn, table_name, year, month):
    """Create a monthly partition with its indexes."""
    start_date = datetime(year, month, 1)
    if month == 12:
        end_date = datetime(year + 1, 1, 1)
    else:
        end_date = datetime(year, month + 1, 1)

    partition_name = f"{table_name}_{year}_{month:02d}"
    create_sql = f"""
        CREATE TABLE IF NOT EXISTS {partition_name}
        PARTITION OF {table_name}
        FOR VALUES FROM ('{start_date:%Y-%m-%d}') TO ('{end_date:%Y-%m-%d}');
    """

    with conn.cursor() as cur:
        cur.execute(create_sql)
        # Create indexes on the new partition
        cur.execute(f"""
            CREATE INDEX IF NOT EXISTS idx_{partition_name}_user
            ON {partition_name}(user_id);
        """)
        cur.execute(f"""
            CREATE INDEX IF NOT EXISTS idx_{partition_name}_created
            ON {partition_name}(created_at);
        """)

    conn.commit()
    print(f"Created partition: {partition_name}")

# Create partitions for the next 12 months
conn = psycopg2.connect("dbname=mydb user=postgres")
now = datetime.now()

for i in range(12):
    # Step by calendar month rather than 30-day blocks so no month is skipped
    year = now.year + (now.month - 1 + i) // 12
    month = (now.month - 1 + i) % 12 + 1
    create_monthly_partition(conn, 'events', year, month)

conn.close()
Pattern 2: List Partitioning (Category-Based)
MySQL List Partitioning
-- Partition by region
CREATE TABLE customer_orders (
order_id BIGINT,
customer_id BIGINT,
region VARCHAR(20),
amount DECIMAL(10,2),
order_date DATE
) PARTITION BY LIST COLUMNS(region) (
PARTITION p_north_america VALUES IN ('US', 'CA', 'MX'),
PARTITION p_europe VALUES IN ('UK', 'DE', 'FR', 'ES'),
PARTITION p_asia VALUES IN ('JP', 'CN', 'IN', 'SG'),
PARTITION p_other VALUES IN ('AU', 'BR', 'ZA')
);
-- Query specific partition
SELECT COUNT(*), SUM(amount)
FROM customer_orders
WHERE region IN ('US', 'CA') -- Only scans p_north_america
AND order_date > '2024-01-01';
-- Show partition info
SELECT
PARTITION_NAME,
TABLE_ROWS,
AVG_ROW_LENGTH,
DATA_LENGTH / 1024 / 1024 AS size_mb
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_NAME = 'customer_orders';
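One caveat with LIST partitioning in MySQL: an insert whose region is not in any VALUES IN list fails, so new regions have to be added explicitly. A sketch using REORGANIZE PARTITION to split a new group out of p_other without losing its existing rows (the regrouping and the added region code 'NZ' are hypothetical):
ALTER TABLE customer_orders
REORGANIZE PARTITION p_other INTO (
    PARTITION p_oceania VALUES IN ('AU', 'NZ'),
    PARTITION p_other VALUES IN ('BR', 'ZA')
);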
Pattern 3: Hash Partitioning (Even Distribution)
PostgreSQL Hash Partitioning
-- Create hash-partitioned table
CREATE TABLE user_sessions (
session_id UUID,
user_id BIGINT NOT NULL,
session_data JSONB,
created_at TIMESTAMP
) PARTITION BY HASH (user_id);
-- Create 8 hash partitions
CREATE TABLE user_sessions_0 PARTITION OF user_sessions
FOR VALUES WITH (MODULUS 8, REMAINDER 0);
CREATE TABLE user_sessions_1 PARTITION OF user_sessions
FOR VALUES WITH (MODULUS 8, REMAINDER 1);
CREATE TABLE user_sessions_2 PARTITION OF user_sessions
FOR VALUES WITH (MODULUS 8, REMAINDER 2);
-- ... create remaining partitions 3-7
-- Queries automatically route to correct partition
SELECT * FROM user_sessions
WHERE user_id = 12345; -- Routes to one partition based on hash
-- Parallel processing benefits
-- Each partition can be scanned in parallel
SET max_parallel_workers_per_gather = 8;
SELECT user_id, COUNT(*)
FROM user_sessions
GROUP BY user_id;
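Writing eight nearly identical CREATE TABLE statements by hand gets tedious. A small DO block (sketch) can generate the whole set, including the partitions 3-7 elided above:
DO $$
BEGIN
    FOR i IN 0..7 LOOP
        EXECUTE format(
            'CREATE TABLE IF NOT EXISTS user_sessions_%s PARTITION OF user_sessions
             FOR VALUES WITH (MODULUS 8, REMAINDER %s)', i, i);
    END LOOP;
END $$;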
Pattern 4: Composite Partitioning
Multi-Level Partitioning
-- First partition by date, then by region
CREATE TABLE sales (
sale_id BIGINT,
region VARCHAR(20),
amount DECIMAL(10,2),
sale_date DATE
) PARTITION BY RANGE (sale_date);
-- Create yearly partitions
CREATE TABLE sales_2024 PARTITION OF sales
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01')
PARTITION BY LIST (region);
-- Sub-partition 2024 by region
CREATE TABLE sales_2024_us PARTITION OF sales_2024
FOR VALUES IN ('US');
CREATE TABLE sales_2024_eu PARTITION OF sales_2024
FOR VALUES IN ('UK', 'DE', 'FR');
CREATE TABLE sales_2024_asia PARTITION OF sales_2024
FOR VALUES IN ('JP', 'CN', 'IN');
-- Query hits single sub-partition
SELECT SUM(amount)
FROM sales
WHERE sale_date BETWEEN '2024-06-01' AND '2024-06-30'
AND region = 'US';
-- Only scans sales_2024_us
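It is worth verifying that pruning actually reaches the sub-partition level. EXPLAIN makes this visible; for the query above, the plan should reference only the sales_2024_us leaf:
EXPLAIN (COSTS OFF)
SELECT SUM(amount)
FROM sales
WHERE sale_date BETWEEN '2024-06-01' AND '2024-06-30'
  AND region = 'US';
-- If every sub-partition shows up in the plan, the filter is not matching the partition keys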
Pattern 5: Snowflake Clustering
Micro-Partition Optimization
-- Snowflake automatically partitions data into micro-partitions
-- But we can optimize with clustering
-- Create table with clustering key
CREATE TABLE events (
event_id NUMBER,
user_id NUMBER,
event_type VARCHAR,
event_timestamp TIMESTAMP_NTZ
) CLUSTER BY (event_timestamp, user_id);
-- Check clustering quality
SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(event_timestamp, user_id)');
-- Manual reclustering (ALTER TABLE ... RECLUSTER) is deprecated;
-- use Automatic Clustering instead (recommended for production)
ALTER TABLE events RESUME RECLUSTER;
-- Pause it if clustering credit usage spikes
-- ALTER TABLE events SUSPEND RECLUSTER;
-- Query benefits from clustering
SELECT *
FROM events
WHERE event_timestamp BETWEEN '2024-01-01' AND '2024-01-31'
AND user_id = 12345;
-- Snowflake prunes micro-partitions automatically
-- Check how much pruning a query actually achieved
-- (PARTITIONS_SCANNED vs PARTITIONS_TOTAL; the same numbers appear in the query profile;
--  note that ACCOUNT_USAGE can lag very recent queries)
SELECT QUERY_TEXT, PARTITIONS_SCANNED, PARTITIONS_TOTAL, START_TIME
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE QUERY_TEXT LIKE '%events%'
ORDER BY START_TIME DESC
LIMIT 1;
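Clustering quality also degrades as new data arrives, so it pays to track it over time. SYSTEM$CLUSTERING_DEPTH returns the average overlap depth of micro-partitions for a clustering key; lower is better. A quick check (sketch):
-- Values near 1 mean well-clustered; a steadily rising number suggests reclustering is falling behind
SELECT SYSTEM$CLUSTERING_DEPTH('events', '(event_timestamp, user_id)');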
Pattern 6: BigQuery Partitioning
Native Time and Integer Partitioning
-- Time-partitioned table
CREATE TABLE myproject.mydataset.events
(
event_id INT64,
user_id INT64,
event_type STRING,
event_data JSON,
event_timestamp TIMESTAMP
)
PARTITION BY DATE(event_timestamp)
OPTIONS(
partition_expiration_days=365,
require_partition_filter=true
);
-- Clustered + partitioned for optimal performance
CREATE TABLE myproject.mydataset.events_optimized
(
event_id INT64,
user_id INT64,
event_type STRING,
event_data JSON,
event_timestamp TIMESTAMP
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type
OPTIONS(
partition_expiration_days=365,
require_partition_filter=true
);
-- Query with partition pruning
SELECT
event_type,
COUNT(*) as event_count
FROM myproject.mydataset.events
WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31'
AND user_id = 12345
GROUP BY event_type;
-- Costs: Only scans January partitions, not entire table
BigQuery Cost Optimization
-- Check partition statistics
SELECT
partition_id,
total_rows,
total_logical_bytes / 1024 / 1024 / 1024 as size_gb
FROM myproject.mydataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'events'
ORDER BY partition_id DESC;
-- Worst case: estimate the cost of a full scan of the table
-- (assumes on-demand pricing of roughly $5 per scanned TB; check current rates)
SELECT
  SUM(size_bytes) / 1024 / 1024 / 1024 AS table_size_gb,
  SUM(size_bytes) / 1024 / 1024 / 1024 * 5 / 1000 AS full_scan_cost_usd
FROM myproject.mydataset.__TABLES__
WHERE table_id = 'events';
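For actual rather than estimated spend, the INFORMATION_SCHEMA JOBS views record what each query billed after the fact. A sketch (the region-us qualifier is an assumption; use your dataset's region):
-- Most expensive queries in the last 24 hours, by bytes billed
SELECT
  creation_time,
  user_email,
  total_bytes_billed / POW(1024, 3) AS billed_gb,
  query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_bytes_billed DESC
LIMIT 10;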
Pattern 7: Partition Maintenance
Automated Partition Lifecycle
# Automated partition management
import psycopg2
from datetime import datetime

class PartitionManager:
    def __init__(self, conn_string):
        self.conn = psycopg2.connect(conn_string)

    def create_partition(self, table_name, year, month):
        """Create one monthly partition plus its user_id index."""
        start = datetime(year, month, 1)
        end = datetime(year + (month // 12), month % 12 + 1, 1)
        partition_name = f"{table_name}_{year}_{month:02d}"
        with self.conn.cursor() as cur:
            cur.execute(f"""
                CREATE TABLE IF NOT EXISTS {partition_name}
                PARTITION OF {table_name}
                FOR VALUES FROM ('{start:%Y-%m-%d}') TO ('{end:%Y-%m-%d}');
            """)
            cur.execute(f"""
                CREATE INDEX IF NOT EXISTS idx_{partition_name}_user
                ON {partition_name}(user_id);
            """)
        self.conn.commit()
        print(f"Created partition: {partition_name}")

    def create_future_partitions(self, table_name, months_ahead=3):
        """Create partitions for upcoming months."""
        now = datetime.now()
        for i in range(1, months_ahead + 1):
            # Step by calendar month so no month is skipped or duplicated
            year = now.year + (now.month - 1 + i) // 12
            month = (now.month - 1 + i) % 12 + 1
            partition_name = f"{table_name}_{year}_{month:02d}"
            with self.conn.cursor() as cur:
                # Skip partitions that already exist
                cur.execute("SELECT 1 FROM pg_tables WHERE tablename = %s",
                            (partition_name,))
                if not cur.fetchone():
                    self.create_partition(table_name, year, month)

    def drop_old_partitions(self, table_name, retention_months=12):
        """Drop partitions older than the retention period."""
        now = datetime.now()
        year = now.year + (now.month - 1 - retention_months) // 12
        month = (now.month - 1 - retention_months) % 12 + 1
        cutoff_name = f"{table_name}_{year}_{month:02d}"
        with self.conn.cursor() as cur:
            # Zero-padded YYYY_MM names sort chronologically, so a plain
            # string comparison is enough to find expired partitions
            cur.execute("""
                SELECT tablename FROM pg_tables
                WHERE tablename LIKE %s AND tablename < %s
            """, (f"{table_name}_%", cutoff_name))
            for (partition_name,) in cur.fetchall():
                print(f"Dropping old partition: {partition_name}")
                cur.execute(f"DROP TABLE {partition_name}")
        self.conn.commit()

    def analyze_partitions(self, table_name):
        """Update planner statistics for all partitions."""
        with self.conn.cursor() as cur:
            cur.execute("SELECT tablename FROM pg_tables WHERE tablename LIKE %s",
                        (f"{table_name}_%",))
            for (partition_name,) in cur.fetchall():
                print(f"Analyzing: {partition_name}")
                cur.execute(f"ANALYZE {partition_name}")
        self.conn.commit()

# Usage
manager = PartitionManager("dbname=mydb user=postgres")

# Create next 3 months of partitions
manager.create_future_partitions('events', months_ahead=3)

# Drop data older than 12 months
manager.drop_old_partitions('events', retention_months=12)

# Update statistics
manager.analyze_partitions('events')
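Rolling your own lifecycle script works, but a common production setup delegates it to extensions. If pg_partman manages the partition sets and pg_cron is installed (both are assumptions here, not used elsewhere in this guide), the whole lifecycle reduces to one scheduled call:
-- Assumes the pg_partman and pg_cron extensions are installed and configured
SELECT cron.schedule(
    'partition-maintenance',
    '0 3 * * *',                         -- daily at 03:00
    $$SELECT partman.run_maintenance()$$
);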
Performance Comparison
Before vs After Partitioning (illustrative numbers for a multi-terabyte events table)
| Metric | Without Partitioning | With Partitioning | Improvement |
|---|---|---|---|
| Query Time | 120 seconds | 2 seconds | 60x faster |
| Data Scanned | 5 TB | 50 GB | 100x less |
| Query Cost (BigQuery) | $25.00 | $0.25 | 100x cheaper |
| Maintenance Window | 6 hours | 15 minutes | 24x faster |
Best Practices
1. Choose the Right Partition Key
- Time-based: Use for event logs, transactions, time-series data
- Category: Use for region, status, type when evenly distributed
- Hash: Use for even distribution when no natural partitioning key
- Avoid: High-cardinality columns (user_id) unless using hash
2. Partition Size Guidelines
- PostgreSQL/MySQL: Aim for roughly 10-50 GB per partition (see the size-check sketch after this list)
- Snowflake: Automatic micro-partitions (roughly 50-500 MB of uncompressed data each); tune layout with clustering keys
- BigQuery: Partitions are managed automatically; partition by day or month
- Too many partitions: Thousands of tiny partitions add planner and metadata overhead
- Too few partitions: Loses the benefits of partitioning
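To check whether existing partitions fall in that size range, PostgreSQL 12+ can walk the partition tree directly; a sketch:
-- Size of every leaf partition under the events table (PostgreSQL 12+)
SELECT
    relid AS partition,
    pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_partition_tree('events')
WHERE isleaf
ORDER BY pg_total_relation_size(relid) DESC;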
3. Index Strategy
-- Option 1: index the parent table (PostgreSQL 11+)
-- Automatically cascades to every existing and future partition,
-- but cannot be built with CREATE INDEX CONCURRENTLY
CREATE INDEX idx_events_user ON events(user_id);
-- Option 2: index each partition individually
-- More objects to manage, but each index can be built CONCURRENTLY
-- without blocking writes on the other partitions
CREATE INDEX idx_events_2024_01_user ON events_2024_01(user_id);
CREATE INDEX idx_events_2024_02_user ON events_2024_02(user_id);
-- Or automate per-partition index creation with a script
DO $$
DECLARE
    partition_name text;
BEGIN
    FOR partition_name IN
        SELECT tablename FROM pg_tables
        WHERE tablename LIKE 'events_2024_%'
    LOOP
        EXECUTE format(
            'CREATE INDEX IF NOT EXISTS %I ON %I(user_id)',
            'idx_' || partition_name || '_user',
            partition_name
        );
    END LOOP;
END $$;
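If you want the convenience of a single parent-level index but cannot afford index builds that block writes, PostgreSQL 11+ lets you build the pieces concurrently and attach them afterwards. A sketch for one partition (repeat steps 2-3 for each partition; the index names are illustrative):
-- 1. Create an (initially invalid) index on the parent only, without cascading
CREATE INDEX idx_events_user_parent ON ONLY events(user_id);
-- 2. Build the matching index on a partition without blocking writes
CREATE INDEX CONCURRENTLY idx_events_2024_03_user ON events_2024_03(user_id);
-- 3. Attach it; once every partition's index is attached, the parent index becomes valid
ALTER INDEX idx_events_user_parent ATTACH PARTITION idx_events_2024_03_user;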
Common Pitfalls
- Not filtering on partition key: Query still scans all partitions
- Too many partitions: Thousands of tiny partitions slow down planning and metadata operations
- Uneven partition sizes: One huge partition defeats the purpose
- Forgetting indexes: Make sure every partition carries the indexes your queries rely on
- No maintenance plan: Need automated creation/deletion of partitions
- Wrong partition granularity: Hourly partitions when daily would suffice
Key Takeaways
- Partition tables >100 GB, especially time-series or event data
- Use range partitioning (time-based) for 80% of use cases
- Always filter queries on partition key for pruning benefits
- Automate partition creation and deletion
- Aim for 10-50 GB per partition for optimal performance
- Create indexes on individual partitions, not the parent table
- Monitor partition sizes and query patterns
- In cloud warehouses (Snowflake, BigQuery), partitioning = massive cost savings
Table partitioning is one of the highest-ROI database optimizations. For large tables, it’s the difference
between queries that take minutes and queries that take seconds—and in cloud warehouses, the difference between
spending thousands vs. tens of dollars per month.