RDB Design Series Part 2: Normalization and Denormalization — Not Theory, but Judgment Calls



Introduction

In the previous post, we covered the fundamentals you need to decide before creating a table (naming, data types, PK, NULL). This post goes one level deeper — “How should you split and group columns?”

When you hear “normalization,” textbooks come to mind. 1NF, 2NF, 3NF… But in practice, normalization boils down to one simple question: “If I put this data here, will it cause problems later?”

This guide explains normalization from start to finish using a single scenario: an online store. Minimal jargon, maximum intuition — you’ll feel why splitting tables makes sense.


1. Why Split Tables at All?

What if the entire store is one table?

Imagine a junior developer builds an online store with everything in one table.

CREATE TABLE orders (
    order_id BIGINT PRIMARY KEY,
    customer_name VARCHAR(50),
    customer_email VARCHAR(320),
    product_name VARCHAR(100),
    product_price DECIMAL(15, 0),
    quantity INT,
    order_date TIMESTAMP
);
| order_id | customer_name | customer_email  | product_name | product_price | quantity |
|----------|--------------|-----------------|-------------|--------------|---------|
| 1        | Alice Kim     | alice@email.com | Keyboard     | 50,000        | 1       |
| 2        | Alice Kim     | alice@email.com | Mouse        | 30,000        | 2       |
| 3        | Bob Lee       | bob@email.com   | Keyboard     | 50,000        | 1       |

Data goes in fine. Queries work. But once the service goes live, three kinds of pain emerge.

Pain 1: Updating the same data in multiple places

Alice changes her email.

UPDATE orders SET customer_email = 'new@email.com' WHERE customer_name = 'Alice Kim';
-- Must update 2 rows. If she has 10 orders, update 10. 100 orders, update 100.
-- Miss one? Alice now has two different emails in the system.

Analogy: Your friend changes their phone number, but they’re saved in your contacts 10 times. You have to update all 10. Miss one and you’ll call the old number.

Pain 2: Can’t insert data that should exist

You want to register a new product “Monitor” — but nobody’s ordered it yet.

INSERT INTO orders (product_name, product_price) VALUES ('Monitor', 500000);
-- Can't! order_id (PK) is required.
-- No order = no way to store product info.

Analogy: A restaurant can’t add a new dish to the menu until someone orders it. That makes no sense.

Pain 3: Deleting data destroys unrelated info

Bob cancels his order.

DELETE FROM orders WHERE order_id = 3;
-- Bob's order is gone... but so is all evidence that Bob was ever a customer!
-- "Keyboard costs 50,000" — that product info is also gone.

Analogy: Deleting a vacation photo from your album also deletes your friend’s contact info because they were in the picture.

Summary of the three pains

| Pain | Cause | Formal Name |
|------|-------|-------------|
| Must update the same data in many rows | One fact duplicated across rows | Update Anomaly |
| Can't insert data that should exist | Unrelated data forced into one table | Insert Anomaly |
| Deleting data destroys other info | Unrelated data forced into one table | Delete Anomaly |

Normalization is the process of eliminating these three pains. The method is simple — split data that belongs together into separate tables.
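
As a rough preview of where that takes this schema (a sketch; the exact split is derived step by step in the next section):

-- One fact, one table
CREATE TABLE customers (
    id BIGINT PRIMARY KEY,
    name VARCHAR(50) NOT NULL,
    email VARCHAR(320) NOT NULL UNIQUE
);

CREATE TABLE products (
    id BIGINT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    price DECIMAL(15, 0) NOT NULL
);

CREATE TABLE orders (
    id BIGINT PRIMARY KEY,
    customer_id BIGINT NOT NULL,
    order_date TIMESTAMP NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers(id)
);
-- Update Alice's email? One row in customers.
-- Register "Monitor"? One INSERT into products, no order needed.
-- Cancel Bob's order? DELETE from orders; Bob and the product info survive.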


2. Normalization Levels — Making the Store Progressively Cleaner

2.1 1NF — “One value per cell”

First Normal Form has one rule: One cell, one value.

-- One cell has multiple values
| order_id | products           |
|----------|--------------------|
| 1        | Keyboard, Mouse    |
| 2        | Monitor            |

“I want to cancel just the Keyboard from order 1” — you’d have to parse and split the cell. Indexes can’t help either.

-- Split into rows: problem solved
| order_id | product  |
|----------|----------|
| 1        | Keyboard |
| 1        | Mouse    |
| 2        | Monitor  |

Analogy: In a spreadsheet, putting “Seoul, Busan, Daegu” in one cell makes filtering impossible. One row per value = you can sort, search, and filter.

Common 1NF Violations in Practice

| Pattern | Example | Why It's a Problem |
|---------|---------|--------------------|
| Comma-separated | tags = "java,spring,docker" | LIKE '%spring%' also matches "springframework"; can't use an index |
| Numbered columns | phone1, phone2, phone3 | Need a 4th? ALTER TABLE. Empty columns wasted |
| JSON arrays | tags = ["java", "spring"] | MySQL: limited indexing. Hard to sort/aggregate |

The fix is always the same — separate table.
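
For the comma-separated tags example above, the separate-table fix could look like this (post_id and the table name are illustrative):

CREATE TABLE post_tags (
    post_id BIGINT NOT NULL,
    tag VARCHAR(50) NOT NULL,
    PRIMARY KEY (post_id, tag)
);

CREATE INDEX idx_post_tags_tag ON post_tags (tag);

-- Exact match instead of LIKE '%spring%': no false "springframework" hits,
-- and the index is actually used
SELECT post_id FROM post_tags WHERE tag = 'spring';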

JSON can be an exception: PostgreSQL’s JSONB + GIN indexes search well. Ask: “Do I need to search or aggregate this data?” If no, JSON is fine. If yes, separate table.

2.2 2NF — “Depends on the whole key, not part of it”

Look at this 1NF-compliant order items table:

CREATE TABLE order_items (
    order_id BIGINT,
    product_id BIGINT,
    product_name VARCHAR(100),     -- Does this belong here?
    product_price DECIMAL(15, 0),  -- And this?
    quantity INT,
    PRIMARY KEY (order_id, product_id)
);

Ask of each column: “What do I need to know to determine this value?”

  • quantity → Need both order_id and product_id. “How many of product A in order 1?” ✅
  • product_name → Only need product_id. The order number is irrelevant. ❌

Product name is about the product itself, not the order. But since it’s in the order table, if the same product is ordered 100 times, the name is duplicated 100 times.

-- Product info goes in a products table
CREATE TABLE products (
    id BIGINT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    price DECIMAL(15, 0) NOT NULL
);

CREATE TABLE order_items (
    order_id BIGINT,
    product_id BIGINT,
    quantity INT NOT NULL,
    PRIMARY KEY (order_id, product_id),
    FOREIGN KEY (product_id) REFERENCES products(id)
);

Analogy: A class roster listing student names with each assignment submitted: “Alice — HW1, Alice — HW2, Alice — HW3”… Alice’s name repeats endlessly. Separate the student list from the submissions, and the name appears just once.

Practical tip: Most teams use single-column BIGINT AUTO_INCREMENT PKs, so 2NF violations are rare. Watch for them only in junction tables (N:M relationships) with composite keys.

2.3 3NF — “Only values determined directly by the PK”

With 2NF satisfied, look at an employees table:

CREATE TABLE employees (
    id BIGINT PRIMARY KEY,
    name VARCHAR(50),
    department_id INT,
    department_name VARCHAR(50)   -- Does this belong here?
);
| id | name    | department_id | department_name |
|----|---------|:---:|------------|
| 1  | Alice   | 10  | Engineering |
| 2  | Bob     | 10  | Engineering |
| 3  | Charlie | 20  | Marketing   |

Ask: “What do I need to know to determine the department name?”

  • Employee ID → Department ID → Department name. The department name is determined by the department, not the employee.
  • But it’s in the employee table, so if Engineering has 100 people, “Engineering” is stored 100 times.

Engineering gets renamed to “Product Engineering”?

UPDATE employees SET department_name = 'Product Engineering' WHERE department_id = 10;
-- Must update 100 rows. Miss one?
-- Some department_id = 10 employees say "Engineering", others say "Product Engineering".
-- Department info goes in a departments table
CREATE TABLE departments (
    id INT PRIMARY KEY,
    name VARCHAR(50) NOT NULL
);

CREATE TABLE employees (
    id BIGINT PRIMARY KEY,
    name VARCHAR(50) NOT NULL,
    department_id INT NOT NULL,
    FOREIGN KEY (department_id) REFERENCES departments(id)
);
-- Rename department = update 1 row in departments. Done.

Analogy: Think of zip codes and addresses. If zip code “10001” gets redistricted, you update one row in the zip code table — not every customer’s address individually.

The Key 3NF Test

“If this column’s value changes, do I need to update multiple rows?” — If yes, it’s likely a 3NF violation.

| Violation | What triggers multi-row updates? | Fix |
|-----------|----------------------------------|-----|
| orders.customer_name | Customer name change | Separate customers table |
| products.category_name | Category rename | Separate categories table |
| employees.department_name | Department rename | Separate departments table |
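
If you suspect an existing table already has this problem, a quick drift check can confirm it. A sketch against the employees example above (adapt the column names to your schema):

-- Departments recorded with more than one name: evidence of update anomalies
SELECT department_id, COUNT(DISTINCT department_name) AS name_variants
FROM employees
GROUP BY department_id
HAVING COUNT(DISTINCT department_name) > 1;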

2.4 BCNF (Boyce-Codd Normal Form) — A Special Case of 3NF

Rare in practice, but worth a brief mention.

Sometimes 3NF is satisfied but problems remain: a non-PK column determines another column.

Example: University course system
- Each professor teaches only one course
- Students take multiple courses

| student_id | course    | professor  |
|:---:|-----------|------------|
| 1          | DB Design | Prof. Kim  |
| 1          | Networks  | Prof. Lee  |
| 2          | DB Design | Prof. Kim  |

PK: (student_id, course)
Problem: professor -> course (knowing the professor determines the course)
         but professor is not a candidate key!

-> Need a separate professor-course table.
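
A sketch of that decomposition (names are illustrative):

-- professor -> course gets its own table, so the dependency has a key
CREATE TABLE professor_courses (
    professor VARCHAR(50) PRIMARY KEY,   -- one course per professor
    course VARCHAR(100) NOT NULL
);

CREATE TABLE enrollments (
    student_id BIGINT NOT NULL,
    professor VARCHAR(50) NOT NULL,
    PRIMARY KEY (student_id, professor),
    FOREIGN KEY (professor) REFERENCES professor_courses(professor)
);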

In practice, 3NF is sufficient. BCNF only matters in domains with complex composite keys (academic systems, reservation systems).


3. Normalization Summary — At a Glance

| Normal Form | One-Line Summary | How Often Violated |
|-------------|------------------|--------------------|
| 1NF | One value per cell | Frequently (comma strings, JSON arrays) |
| 2NF | No partial key dependencies | Rarely (auto-satisfied with a single-column PK) |
| 3NF | No non-PK column determining another | Most commonly violated |
| BCNF | Every determinant is a candidate key | Almost never encountered |

The one normalization question: “If this column changes, do I need to update multiple rows?” → If yes, split the table.


4. Denormalization — When It’s OK to Break the Rules

Normalization optimizes for data integrity. Denormalization optimizes for read performance. They’re a trade-off.

4.1 Signals That Denormalization May Be Needed

"This query has 5 JOINs and runs on every page load?"
"The sales stats API takes 3 seconds and users are complaining"
"The same JOIN pattern is repeated in 10 different places"

Denormalization is a last resort. Try these first — only denormalize when everything else fails:

  1. Add/optimize indexes
  2. Refactor queries
  3. Caching (Redis, application cache)
  4. Read replicas
  5. Materialized Views (PostgreSQL)
  6. Still slow? → Now consider denormalization

4.2 Denormalization Patterns — Online Store Examples

Pattern 1: Pre-calculate and Store

The order list API needs to show order totals.

-- Normalized: calculate every time
SELECT o.id, SUM(oi.price * oi.quantity) AS total
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
GROUP BY o.id;
-- With 1M orders? JOIN + SUM every time is slow.

-- Denormalized: pre-store the total
ALTER TABLE orders ADD COLUMN total_amount DECIMAL(15, 0) NOT NULL DEFAULT 0;

Trade-off: Reads are fast, but you must recalculate and update the total whenever order items change.

-- MySQL: auto-sync with trigger
DELIMITER //
CREATE TRIGGER trg_order_items_after_insert
AFTER INSERT ON order_items
FOR EACH ROW
BEGIN
    UPDATE orders
    SET total_amount = (
        SELECT SUM(price * quantity) FROM order_items WHERE order_id = NEW.order_id
    )
    WHERE id = NEW.order_id;
END //
DELIMITER ;

-- PostgreSQL: trigger function
CREATE OR REPLACE FUNCTION update_order_total()
RETURNS TRIGGER AS $$
BEGIN
    UPDATE orders
    SET total_amount = (
        SELECT COALESCE(SUM(price * quantity), 0)
        FROM order_items WHERE order_id = NEW.order_id
    )
    WHERE id = NEW.order_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_order_items_after_insert
AFTER INSERT ON order_items
FOR EACH ROW EXECUTE FUNCTION update_order_total();

⚠️ In practice, application-level sync is preferred over triggers. Triggers are “invisible code” — hard to debug during incidents, conflict with ORM caches, and difficult to unit test. Most teams update within the same transaction in the Service layer, or use CDC/event-driven async sync (Debezium, Kafka, etc.). Rare cases where triggers are appropriate: audit logs, legacy system integration, complex integrity rules that can’t be expressed as CHECK constraints.
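
As a sketch of that application-level alternative, here is the same sync done in one transaction (the SQL your service layer would issue; IDs are hardcoded for illustration):

BEGIN;

INSERT INTO order_items (order_id, product_id, price, quantity)
VALUES (1, 10, 50000, 2);

-- Recalculate the denormalized total inside the same transaction
UPDATE orders
SET total_amount = (
    SELECT COALESCE(SUM(price * quantity), 0)
    FROM order_items WHERE order_id = 1
)
WHERE id = 1;

COMMIT;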

If you’re on PostgreSQL, consider a Materialized View first. It caches aggregated results without adding columns to your table.

CREATE MATERIALIZED VIEW order_totals AS
SELECT order_id, SUM(price * quantity) AS total_amount
FROM order_items GROUP BY order_id;

-- CONCURRENTLY requires a unique index on the MV
CREATE UNIQUE INDEX idx_order_totals_order_id ON order_totals (order_id);

-- Refresh without blocking reads
REFRESH MATERIALIZED VIEW CONCURRENTLY order_totals;

MySQL doesn’t have this feature — you must manage summary tables manually.

Pattern 2: Summary Table

The sales dashboard needs daily revenue.

-- Calculating from all orders every time is slow
SELECT DATE(order_date) AS day, COUNT(*), SUM(total_amount)
FROM orders WHERE order_date >= '2026-01-01'
GROUP BY DATE(order_date);

-- Summary table: pre-aggregated
CREATE TABLE daily_sales_summary (
    sale_date DATE PRIMARY KEY,
    total_orders INT NOT NULL DEFAULT 0,
    total_amount DECIMAL(15, 0) NOT NULL DEFAULT 0,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- MySQL: UPSERT to refresh
INSERT INTO daily_sales_summary (sale_date, total_orders, total_amount)
SELECT DATE(order_date), COUNT(*), SUM(total_amount)
FROM orders WHERE DATE(order_date) = CURRENT_DATE
GROUP BY DATE(order_date)
ON DUPLICATE KEY UPDATE
    total_orders = VALUES(total_orders),
    total_amount = VALUES(total_amount),
    updated_at = CURRENT_TIMESTAMP;

-- PostgreSQL: ON CONFLICT to refresh
INSERT INTO daily_sales_summary (sale_date, total_orders, total_amount)
SELECT DATE(order_date), COUNT(*), SUM(total_amount)
FROM orders WHERE DATE(order_date) = CURRENT_DATE
GROUP BY DATE(order_date)
ON CONFLICT (sale_date) DO UPDATE SET
    total_orders = EXCLUDED.total_orders,
    total_amount = EXCLUDED.total_amount,
    updated_at = CURRENT_TIMESTAMP;

What UPSERT does: It attempts an INSERT, and if a row with the same key (sale_date) already exists, it executes an UPDATE instead.

  • MySQL, ON DUPLICATE KEY UPDATE: When the PRIMARY KEY or UNIQUE KEY conflicts, the UPDATE clause runs instead. VALUES(total_orders) references the value that was about to be inserted.
  • PostgreSQL, ON CONFLICT (sale_date) DO UPDATE SET: When a conflict occurs on the specified column (sale_date), it runs the UPDATE. EXCLUDED is a virtual table representing the row that was about to be inserted.

This lets you handle “insert if new, update if exists” in a single atomic query — ideal for refreshing summary tables.

Analogy: Instead of counting every book in the library daily, you keep a “today’s checkout summary” board. Not real-time, but fast enough.

Pattern 3: Snapshots (This Isn’t Actually Denormalization!)

-- Preserve the product price at the time of order
CREATE TABLE order_items (
    id BIGINT PRIMARY KEY,
    order_id BIGINT NOT NULL,
    product_id BIGINT NOT NULL,
    unit_price DECIMAL(15, 0) NOT NULL, -- price at order time
    quantity INT NOT NULL
);

The product price goes from 50,000 to 60,000. If a customer who paid 50,000 suddenly sees 60,000 on their receipt? That's a lawsuit.
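
A sketch of how the snapshot gets written, copying the current catalog price at order time (IDs hardcoded for illustration):

INSERT INTO order_items (id, order_id, product_id, unit_price, quantity)
SELECT 1001, 1, p.id, p.price, 2   -- p.price copied as of right now
FROM products p
WHERE p.id = 10;
-- Later UPDATEs to products.price never touch this unit_price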

| Type | Meaning | Example |
|------|---------|---------|
| Snapshot | Preserving a point-in-time value — a business requirement | Order-time price, shipping address, terms version |
| Denormalization | Intentionally duplicating data for read performance | Storing customer name in orders table |

Snapshots are correct design. Denormalization is a trade-off. Don’t confuse them.

Pattern 4: Table Merging

-- 1:1 relationship, two tables
CREATE TABLE users (
    id BIGINT PRIMARY KEY,
    email VARCHAR(320) NOT NULL UNIQUE,
    password_hash VARCHAR(255) NOT NULL
);

CREATE TABLE user_profiles (
    user_id BIGINT PRIMARY KEY REFERENCES users(id),
    nickname VARCHAR(50),
    bio TEXT,
    avatar_url VARCHAR(2048)
);
-- Almost always queried together -> JOIN every time

-- Merged into one
CREATE TABLE users (
    id BIGINT PRIMARY KEY,
    email VARCHAR(320) NOT NULL UNIQUE,
    password_hash VARCHAR(255) NOT NULL,
    nickname VARCHAR(50),
    bio TEXT,
    avatar_url VARCHAR(2048)
);

| OK to Merge | Don't Merge |
|-------------|-------------|
| 1:1, almost always queried together | One side queried alone frequently |
| Separated table is small | One side is very large (row size ↑ → cache efficiency ↓) |
| Split reason was "felt right" | Different access permissions (passwords vs profile) |

4.3 Pre-Denormalization Checklist

[ ] Index optimization attempted?
[ ] Queries refactored?
[ ] Caching layer (Redis, etc.) considered?
[ ] Read replicas considered?
[ ] Materialized Views (PostgreSQL) considered?

-> Tried everything and still slow -> Now consider denormalization

If you denormalize, you must:

  • Automate sync with triggers or events (manual updates will be missed)
  • Write inconsistency detection queries (periodically verify original vs duplicate; see the sketch after this list)
  • Comment the reason — leave comments in DDL (table/column definitions) and related queries
    -- In the column definition
    ALTER TABLE orders
        ADD COLUMN total_amount DECIMAL(15, 0) NOT NULL DEFAULT 0;
        -- Denormalized: order list API 3s -> 0.2s improvement (2026-04)
        -- Source: SELECT SUM(price * quantity) FROM order_items
    
    -- In the sync query as well
    -- Denormalization sync: orders.total_amount <- order_items aggregation
    UPDATE orders SET total_amount = (...) WHERE id = ?;
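
A sketch of such a detection query for the total_amount example (run it from a scheduled job and alert on any rows returned):

-- Orders whose stored total has drifted from the source aggregation
SELECT o.id, o.total_amount,
       COALESCE(SUM(oi.price * oi.quantity), 0) AS actual_total
FROM orders o
LEFT JOIN order_items oi ON oi.order_id = o.id
GROUP BY o.id, o.total_amount
HAVING o.total_amount <> COALESCE(SUM(oi.price * oi.quantity), 0);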

5. Normalization vs Denormalization — Decision Guide

| Scenario | Keep Normalized | Consider Denormalization |
|----------|:---:|:---:|
| Data changes frequently | O | |
| Integrity is critical (finance, healthcare) | O | |
| Table is small (under 100K rows) | O | |
| Reads vastly outnumber writes | | O |
| Aggregation/stats queries are frequent | | O |
| 5+ JOIN queries repeat often | | O |
| Indexes and caching already maxed out | | O |

OLTP vs OLAP — Why They Need Different Normalization Levels

Databases serve two fundamentally different purposes.

OLTP (Online Transaction Processing) — “The DB handling orders right now”

- Users place orders, make payments, cancel, and update shipping status
- Short transactions reading/writing 1-10 rows, hundreds to thousands per second
- Priority: "This order data must be correct" -> integrity first
- Higher normalization is better (no duplicates, always accurate)

OLAP (Online Analytical Processing) — “The DB showing yesterday’s revenue”

- Admins view dashboards, analyze monthly trends, generate reports
- Heavy queries scanning millions to billions of rows for aggregation
- Priority: "Show results fast" -> read speed first
- Denormalization is better (fewer JOINs, pre-aggregated data)

| Property | OLTP (Orders, Payments) | OLAP (Dashboards, Reports) |
|----------|-------------------------|----------------------------|
| Typical queries | INSERT, UPDATE, SELECT ... WHERE id = ? | SELECT SUM/AVG/COUNT ... GROUP BY ... date range |
| Data per query | 1-10 rows | Tens of thousands to billions of rows |
| Normalization level | High (3NF) | Low (denormalized, star schema) |
| Optimized for | Write speed, integrity | Read speed, aggregation |
| Real-world examples | Service DB (MySQL, PostgreSQL) | DW, BI (BigQuery, Redshift, ClickHouse) |
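
A rough sketch of what "low normalization, star schema" means on the OLAP side (illustrative names; real warehouses vary):

-- Fact table: one row per sale, measures only
CREATE TABLE fact_sales (
    date_id INT NOT NULL,          -- -> dim_date
    product_id BIGINT NOT NULL,    -- -> dim_product
    quantity INT NOT NULL,
    amount DECIMAL(15, 0) NOT NULL
);

-- Dimension: descriptive attributes, deliberately denormalized
CREATE TABLE dim_product (
    product_id BIGINT PRIMARY KEY,
    name VARCHAR(100),
    category_name VARCHAR(50)      -- a 3NF violation, on purpose
);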

Why You Shouldn’t Do Both in the Same DB

Scenario: Running sales reports (OLAP) directly on the service DB (OLTP)

1. SELECT SUM(total_amount) FROM orders WHERE order_date >= '2026-01-01'
   -> Full-scans 1M rows, hammering disk IO and evicting hot pages from the buffer pool

2. Meanwhile, user order INSERTs start queuing up
   -> "I can't place an order!" outage begins

3. The report query also slows down due to transaction isolation
   -> Both are slow. Nobody is happy.

How to Separate Them in Practice

[OLTP DB]  ->  Sync  ->  [OLAP DB / DW]
(Service)     (CDC, ETL)    (Analytics)

- CDC (Change Data Capture): Stream DB change logs to analytics DB in real-time
  e.g., Debezium, AWS DMS
- ETL (Extract-Transform-Load): Periodically extract, transform, and load data
  e.g., Airflow, dbt
- Simple cases: Materialized View (PostgreSQL) or Read Replica

Bottom line: Keep your OLTP DB normalized. If you need analytics, separate it into a dedicated OLAP DB. “Running report queries directly on the service DB” is the classic mistake of mixing OLTP and OLAP.


6. MySQL vs PostgreSQL — Normalization/Denormalization Differences

| Feature | MySQL | PostgreSQL | Value as Denormalization Alternative |
|---------|-------|------------|--------------------------------------|
| Materialized View (MV) | Not available | Available (REFRESH CONCURRENTLY) | In PG, check MV before denormalizing |
| Partial indexes | Not available | CREATE INDEX ... WHERE condition | Specific queries solved by index -> no denormalization needed |
| Generated Column | VIRTUAL + STORED | STORED only | Same-table calculations work in both |
| JSONB | JSON (limited indexing) | JSONB + GIN (powerful) | PG handles semi-structured data without normalization |
| UPSERT | ON DUPLICATE KEY UPDATE | ON CONFLICT DO UPDATE | Summary table refresh works in both |
| Triggers | FOR EACH ROW only | FOR EACH ROW + FOR EACH STATEMENT | PG triggers more flexible |

Terminology Explained

Materialized View (MV)

A “cached view” that physically stores query results. A regular View re-executes the query every time, but an MV stores results like a table — reads are fast.

-- Cache daily revenue as an MV (PostgreSQL)
CREATE MATERIALIZED VIEW daily_sales AS
SELECT DATE(order_date) AS day, SUM(total_amount) AS revenue
FROM orders GROUP BY DATE(order_date);

-- CONCURRENTLY requires a unique index on the MV
CREATE UNIQUE INDEX idx_daily_sales_day ON daily_sales (day);

-- Query: reads as fast as a regular table
SELECT * FROM daily_sales WHERE day = '2026-04-01';

-- Must refresh manually when data changes
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales;

You get the same effect as manually creating a summary table with UPSERT sync — in a single DDL statement. MySQL doesn’t have this feature, so you must manage summary tables yourself.

Partial Index

An index on only the rows matching a condition, not the entire table. Smaller index size, faster targeted queries.

-- "We frequently query unpaid orders" -> full index is wasteful
-- PostgreSQL: index only unpaid orders
CREATE INDEX idx_orders_unpaid ON orders (created_at)
WHERE status = 'UNPAID';

-- This query uses the partial index
SELECT * FROM orders WHERE status = 'UNPAID' ORDER BY created_at;

MySQL doesn't support partial indexes. There you'd need a full index on the column, or maintain a separate table holding only the unpaid orders.
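
The usual MySQL workaround is a full composite index covering the same query (a sketch; the index name is illustrative):

-- Serves WHERE status = 'UNPAID' ORDER BY created_at, but indexes every row
CREATE INDEX idx_orders_status_created ON orders (status, created_at);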

Generated Column

A column automatically computed from other columns. You can’t INSERT or UPDATE it directly — the database manages it.

-- MySQL: VIRTUAL (computed on read) / STORED (persisted to disk)
ALTER TABLE order_items
ADD COLUMN subtotal DECIMAL(15, 0)
    GENERATED ALWAYS AS (price * quantity) STORED;

-- PostgreSQL: STORED only
ALTER TABLE order_items
ADD COLUMN subtotal DECIMAL(15, 0)
    GENERATED ALWAYS AS (price * quantity) STORED;

-- Just read subtotal directly
SELECT * FROM order_items WHERE subtotal > 100000;

For same-table calculations, consider Generated Columns before denormalization. No sync concerns.

JSONB

A PostgreSQL-specific type that stores JSON in binary format with indexing support. Unlike MySQL’s JSON, it supports GIN indexes for fast internal key searches.

-- PostgreSQL: JSONB + GIN index
CREATE TABLE products (
    id BIGINT PRIMARY KEY,
    name VARCHAR(200) NOT NULL,
    attributes JSONB NOT NULL DEFAULT '{}'
);

-- Index for searching inside JSONB
CREATE INDEX idx_products_attrs ON products USING GIN (attributes);

-- Find products where color is "red" — uses the index
SELECT * FROM products WHERE attributes @> '{"color": "red"}';

Semi-structured data can be stored and searched efficiently without normalizing into separate tables. MySQL’s JSON type doesn’t support GIN indexes, so search performance is limited.

PostgreSQL often requires less denormalization than MySQL thanks to Materialized Views, partial indexes, and JSONB. What you’d denormalize for performance in MySQL may be solvable with MVs or partial indexes in PostgreSQL.


Summary

| Key Point | Details |
|-----------|---------|
| Normalization = eliminate duplication | "If this value changes, do I need to update multiple rows?" -> If yes, split the table |
| 3NF is the practical target | 1NF (one value per cell) -> 2NF (watch composite keys) -> 3NF (split out values determined by non-PK columns) |
| Denormalization is a last resort | Indexes -> query optimization -> caching -> MV -> still slow? -> denormalize |
| Snapshots != Denormalization | Storing order-time price/address is a business requirement, not duplication |
| MySQL vs PG | PG needs less denormalization thanks to MVs, partial indexes, JSONB |

Next up: Constraints and Data Integrity — CHECK, UNIQUE, FK trade-offs, and defensive schema design.
