RDB Design Series Part 2: Normalization and Denormalization — Not Theory, but Judgment Calls
Introduction
In the previous post, we covered the fundamentals you need to decide before creating a table (naming, data types, PK, NULL). This post goes one level deeper — “How should you split and group columns?”
When you hear “normalization,” textbooks come to mind. 1NF, 2NF, 3NF… But in practice, normalization boils down to one simple question: “If I put this data here, will it cause problems later?”
This guide explains normalization from start to finish using a single scenario: an online store. Minimal jargon, maximum intuition — you’ll feel why splitting tables makes sense.
1. Why Split Tables at All?
What if the entire store is one table?
Imagine a junior developer builds an online store with everything in one table.
CREATE TABLE orders (
order_id BIGINT PRIMARY KEY,
customer_name VARCHAR(50),
customer_email VARCHAR(320),
product_name VARCHAR(100),
product_price DECIMAL(15, 0),
quantity INT,
order_date TIMESTAMP
);
| order_id | customer_name | customer_email | product_name | product_price | quantity |
|----------|--------------|-----------------|-------------|--------------|---------|
| 1 | Alice Kim | alice@email.com | Keyboard | 50,000 | 1 |
| 2 | Alice Kim | alice@email.com | Mouse | 30,000 | 2 |
| 3 | Bob Lee | bob@email.com | Keyboard | 50,000 | 1 |
Data goes in fine. Queries work. But once the service goes live, three kinds of pain emerge.
Pain 1: Updating the same data in multiple places
Alice changes her email.
UPDATE orders SET customer_email = 'new@email.com' WHERE customer_name = 'Alice Kim';
-- Must update 2 rows. If she has 10 orders, update 10. 100 orders, update 100.
-- Miss one? Alice now has two different emails in the system.
Analogy: Your friend changes their phone number, but they’re saved in your contacts 10 times. You have to update all 10. Miss one and you’ll call the old number.
Pain 2: Can’t insert data that should exist
You want to register a new product “Monitor” — but nobody’s ordered it yet.
INSERT INTO orders (product_name, product_price) VALUES ('Monitor', 500000);
-- Can't! order_id (PK) is required.
-- No order = no way to store product info.
Analogy: A restaurant can’t add a new dish to the menu until someone orders it. That makes no sense.
Pain 3: Deleting data destroys unrelated info
Bob cancels his order.
DELETE FROM orders WHERE order_id = 3;
-- Bob's order is gone... but so is all evidence that Bob was ever a customer!
-- "Keyboard costs 50,000" — that product info is also gone.
Analogy: Deleting a vacation photo from your album also deletes your friend’s contact info because they were in the picture.
Summary of the three pains
| Pain | Cause | Formal Name |
|---|---|---|
| Must update same data in many rows | One fact duplicated across rows | Update Anomaly |
| Can’t insert data that should exist | Unrelated data forced into one table | Insert Anomaly |
| Deleting data destroys other info | Unrelated data forced into one table | Delete Anomaly |
Normalization is the process of eliminating these three pains. The method is simple — split data that belongs together into separate tables.
2. Normalization Levels — Making the Store Progressively Cleaner
2.1 1NF — “One value per cell”
First Normal Form has one rule: One cell, one value.
-- One cell has multiple values
| order_id | products |
|----------|--------------------|
| 1 | Keyboard, Mouse |
| 2 | Monitor |
“I want to cancel just the Keyboard from order 1” — you’d have to parse and split the cell. Indexes can’t help either.
-- Split into rows: problem solved
| order_id | product |
|----------|----------|
| 1 | Keyboard |
| 1 | Mouse |
| 2 | Monitor |
Analogy: In a spreadsheet, putting “Seoul, Busan, Daegu” in one cell makes filtering impossible. One row per value = you can sort, search, and filter.
Common 1NF Violations in Practice
| Pattern | Example | Why It’s a Problem |
|---|---|---|
| Comma-separated | tags = "java,spring,docker" | LIKE '%spring%' also matches “springframework”. Can’t index |
| Numbered columns | phone1, phone2, phone3 | Need a 4th? ALTER TABLE. Empty columns wasted |
| JSON arrays | tags = ["java", "spring"] | MySQL: limited indexing. Hard to sort/aggregate |
The fix is always the same — separate table.
JSON can be an exception: PostgreSQL’s
JSONB+ GIN indexes search well. Ask: “Do I need to search or aggregate this data?” If no, JSON is fine. If yes, separate table.
2.2 2NF — “Depends on the whole key, not part of it”
Look at this 1NF-compliant order items table:
CREATE TABLE order_items (
order_id BIGINT,
product_id BIGINT,
product_name VARCHAR(100), -- Does this belong here?
product_price DECIMAL(15, 0), -- And this?
quantity INT,
PRIMARY KEY (order_id, product_id)
);
Ask: “What do I need to know the product name?”
quantity→ Need bothorder_idandproduct_id. “How many of product A in order 1?” ✅product_name→ Only needproduct_id. The order number is irrelevant. ❌
Product name is about the product itself, not the order. But since it’s in the order table, if the same product is ordered 100 times, the name is duplicated 100 times.
-- Product info goes in a products table
CREATE TABLE products (
id BIGINT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
price DECIMAL(15, 0) NOT NULL
);
CREATE TABLE order_items (
order_id BIGINT,
product_id BIGINT,
quantity INT NOT NULL,
PRIMARY KEY (order_id, product_id),
FOREIGN KEY (product_id) REFERENCES products(id)
);
Analogy: A class roster listing student names with each assignment submitted: “Alice — HW1, Alice — HW2, Alice — HW3”… Alice’s name repeats endlessly. Separate the student list from the submissions, and the name appears just once.
Practical tip: Most teams use single-column
BIGINT AUTO_INCREMENTPKs, so 2NF violations are rare. Watch for them only in junction tables (N:M relationships) with composite keys.
2.3 3NF — “Only values determined directly by the PK”
With 2NF satisfied, look at an employees table:
CREATE TABLE employees (
id BIGINT PRIMARY KEY,
name VARCHAR(50),
department_id INT,
department_name VARCHAR(50) -- Does this belong here?
);
| id | name | department_id | department_name |
|----|---------|:---:|------------|
| 1 | Alice | 10 | Engineering |
| 2 | Bob | 10 | Engineering |
| 3 | Charlie | 20 | Marketing |
Ask: “What do I need to know the department name?”
- Employee ID → Department ID → Department name. The department name is determined by the department, not the employee.
- But it’s in the employee table, so if Engineering has 100 people, “Engineering” is stored 100 times.
Engineering gets renamed to “Product Engineering”?
UPDATE employees SET department_name = 'Product Engineering' WHERE department_id = 10;
-- Must update 100 rows. Miss one?
-- Some department_id = 10 employees say "Engineering", others say "Product Engineering".
-- Department info goes in a departments table
CREATE TABLE departments (
id INT PRIMARY KEY,
name VARCHAR(50) NOT NULL
);
CREATE TABLE employees (
id BIGINT PRIMARY KEY,
name VARCHAR(50) NOT NULL,
department_id INT NOT NULL,
FOREIGN KEY (department_id) REFERENCES departments(id)
);
-- Rename department = update 1 row in departments. Done.
Analogy: Think of zip codes and addresses. If zip code “10001” gets redistricted, you update one row in the zip code table — not every customer’s address individually.
The Key 3NF Test
“If this column’s value changes, do I need to update multiple rows?” — If yes, it’s likely a 3NF violation.
| Violation | What triggers multi-row updates? | Fix |
|---|---|---|
orders.customer_name | Customer name change | Separate customers table |
products.category_name | Category rename | Separate categories table |
employees.department_name | Department rename | Separate departments table |
2.4 BCNF (Boyce-Codd Normal Form) — A Special Case of 3NF
Rare in practice, but worth a brief mention.
Sometimes 3NF is satisfied but problems remain: a non-PK column determines another column.
Example: University course system
- Each professor teaches only one course
- Students take multiple courses
| student_id | course | professor |
|:---:|-----------|------------|
| 1 | DB Design | Prof. Kim |
| 1 | Networks | Prof. Lee |
| 2 | DB Design | Prof. Kim |
PK: (student_id, course)
Problem: professor -> course (knowing the professor determines the course)
but professor is not a candidate key!
-> Need a separate professor-course table.
In practice, 3NF is sufficient. BCNF only matters in domains with complex composite keys (academic systems, reservation systems).
3. Normalization Summary — At a Glance
| Normal Form | One-Line Summary | How Often Violated |
|---|---|---|
| 1NF | One value per cell | Frequently (comma strings, JSON arrays) |
| 2NF | No partial key dependencies | Rare (auto-satisfied with single PK) |
| 3NF | No non-PK column determining another | Most commonly violated |
| BCNF | Every determinant is a candidate key | Almost never encountered |
The one normalization question: “If this column changes, do I need to update multiple rows?” → If yes, split the table.
4. Denormalization — When It’s OK to Break the Rules
Normalization optimizes for data integrity. Denormalization optimizes for read performance. They’re a trade-off.
4.1 Signals That Denormalization May Be Needed
"This query has 5 JOINs and runs on every page load?"
"The sales stats API takes 3 seconds and users are complaining"
"The same JOIN pattern is repeated in 10 different places"
Denormalization is a last resort. Try these first — only denormalize when everything else fails:
- Add/optimize indexes
- Refactor queries
- Caching (Redis, application cache)
- Read replicas
- Materialized Views (PostgreSQL)
- Still slow? → Now consider denormalization
4.2 Denormalization Patterns — Online Store Examples
Pattern 1: Pre-calculate and Store
The order list API needs to show order totals.
-- Normalized: calculate every time
SELECT o.id, SUM(oi.price * oi.quantity) AS total
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
GROUP BY o.id;
-- With 1M orders? JOIN + SUM every time is slow.
-- Denormalized: pre-store the total
ALTER TABLE orders ADD COLUMN total_amount DECIMAL(15, 0) NOT NULL DEFAULT 0;
Trade-off: Reads are fast, but you must recalculate and update the total whenever order items change.
-- MySQL: auto-sync with trigger
DELIMITER //
CREATE TRIGGER trg_order_items_after_insert
AFTER INSERT ON order_items
FOR EACH ROW
BEGIN
UPDATE orders
SET total_amount = (
SELECT SUM(price * quantity) FROM order_items WHERE order_id = NEW.order_id
)
WHERE id = NEW.order_id;
END //
DELIMITER ;
-- PostgreSQL: trigger function
CREATE OR REPLACE FUNCTION update_order_total()
RETURNS TRIGGER AS $$
BEGIN
UPDATE orders
SET total_amount = (
SELECT COALESCE(SUM(price * quantity), 0)
FROM order_items WHERE order_id = NEW.order_id
)
WHERE id = NEW.order_id;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_order_items_after_insert
AFTER INSERT ON order_items
FOR EACH ROW EXECUTE FUNCTION update_order_total();
⚠️ In practice, application-level sync is preferred over triggers. Triggers are “invisible code” — hard to debug during incidents, conflict with ORM caches, and difficult to unit test. Most teams update within the same transaction in the Service layer, or use CDC/event-driven async sync (Debezium, Kafka, etc.). Rare cases where triggers are appropriate: audit logs, legacy system integration, complex integrity rules that can’t be expressed as CHECK constraints.
If you’re on PostgreSQL, consider a Materialized View first. It caches aggregated results without adding columns to your table.
CREATE MATERIALIZED VIEW order_totals AS SELECT order_id, SUM(price * quantity) AS total_amount FROM order_items GROUP BY order_id; -- Refresh without blocking reads REFRESH MATERIALIZED VIEW CONCURRENTLY order_totals;MySQL doesn’t have this feature — you must manage summary tables manually.
Pattern 2: Summary Table
The sales dashboard needs daily revenue.
-- Calculating from all orders every time is slow
SELECT DATE(order_date) AS day, COUNT(*), SUM(total_amount)
FROM orders WHERE order_date >= '2026-01-01'
GROUP BY DATE(order_date);
-- Summary table: pre-aggregated
CREATE TABLE daily_sales_summary (
sale_date DATE PRIMARY KEY,
total_orders INT NOT NULL DEFAULT 0,
total_amount DECIMAL(15, 0) NOT NULL DEFAULT 0,
updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- MySQL: UPSERT to refresh
INSERT INTO daily_sales_summary (sale_date, total_orders, total_amount)
SELECT DATE(order_date), COUNT(*), SUM(total_amount)
FROM orders WHERE DATE(order_date) = CURRENT_DATE
GROUP BY DATE(order_date)
ON DUPLICATE KEY UPDATE
total_orders = VALUES(total_orders),
total_amount = VALUES(total_amount),
updated_at = CURRENT_TIMESTAMP;
-- PostgreSQL: ON CONFLICT to refresh
INSERT INTO daily_sales_summary (sale_date, total_orders, total_amount)
SELECT DATE(order_date), COUNT(*), SUM(total_amount)
FROM orders WHERE DATE(order_date) = CURRENT_DATE
GROUP BY DATE(order_date)
ON CONFLICT (sale_date) DO UPDATE SET
total_orders = EXCLUDED.total_orders,
total_amount = EXCLUDED.total_amount,
updated_at = CURRENT_TIMESTAMP;
What UPSERT does: It attempts an INSERT, and if a row with the same key (sale_date) already exists, it executes an UPDATE instead.
- MySQL —
ON DUPLICATE KEY UPDATE: When the PRIMARY KEY or UNIQUE KEY conflicts, the UPDATE clause runs instead.VALUES(total_orders)references the value that was about to be inserted. - PostgreSQL —
ON CONFLICT (sale_date) DO UPDATE SET: When a conflict occurs on the specified column (sale_date), it runs the UPDATE.EXCLUDEDis a virtual table representing the row that was about to be inserted.
This lets you handle “insert if new, update if exists” in a single atomic query — ideal for refreshing summary tables.
Analogy: Instead of counting every book in the library daily, you keep a “today’s checkout summary” board. Not real-time, but fast enough.
Pattern 3: Snapshots (This Isn’t Actually Denormalization!)
-- Preserve the product price at the time of order
CREATE TABLE order_items (
id BIGINT PRIMARY KEY,
order_id BIGINT NOT NULL,
product_id BIGINT NOT NULL,
unit_price DECIMAL(15, 0) NOT NULL, -- price at order time
quantity INT NOT NULL
);
The product price goes from $50 to $60. If a customer who paid $50 suddenly sees $60 on their receipt? That’s a lawsuit.
| Type | Meaning | Example |
|---|---|---|
| Snapshot | Preserving a point-in-time value — a business requirement | Order-time price, shipping address, terms version |
| Denormalization | Intentionally duplicating data for read performance | Storing customer name in orders table |
Snapshots are correct design. Denormalization is a trade-off. Don’t confuse them.
Pattern 4: Table Merging
-- 1:1 relationship, two tables
CREATE TABLE users (
id BIGINT PRIMARY KEY,
email VARCHAR(320) NOT NULL UNIQUE,
password_hash VARCHAR(255) NOT NULL
);
CREATE TABLE user_profiles (
user_id BIGINT PRIMARY KEY REFERENCES users(id),
nickname VARCHAR(50),
bio TEXT,
avatar_url VARCHAR(2048)
);
-- Almost always queried together -> JOIN every time
-- Merged into one
CREATE TABLE users (
id BIGINT PRIMARY KEY,
email VARCHAR(320) NOT NULL UNIQUE,
password_hash VARCHAR(255) NOT NULL,
nickname VARCHAR(50),
bio TEXT,
avatar_url VARCHAR(2048)
);
| OK to Merge | Don’t Merge |
|---|---|
| 1:1, almost always queried together | One side queried alone frequently |
| Separated table is small | One side is very large (row size ↑ → cache efficiency ↓) |
| Split reason was “felt right” | Different access permissions (passwords vs profile) |
4.3 Pre-Denormalization Checklist
[] Index optimization attempted?
[] Queries refactored?
[] Caching layer (Redis, etc.) considered?
[] Read replicas considered?
[] Materialized Views (PostgreSQL) considered?
-> Tried everything and still slow -> Now consider denormalization
If you denormalize, you must:
- Automate sync with triggers or events (manual updates will be missed)
- Write inconsistency detection queries (periodically verify original vs duplicate)
- Comment the reason — leave comments in DDL (table/column definitions) and related queries
-- In the column definition ALTER TABLE orders ADD COLUMN total_amount DECIMAL(15, 0) NOT NULL DEFAULT 0; -- Denormalized: order list API 3s -> 0.2s improvement (2026-04) -- Source: SELECT SUM(price * quantity) FROM order_items -- In the sync query as well -- Denormalization sync: orders.total_amount <- order_items aggregation UPDATE orders SET total_amount = (...) WHERE id = ?;
5. Normalization vs Denormalization — Decision Guide
| Scenario | Keep Normalized | Consider Denormalization |
|---|---|---|
| Data changes frequently | O | |
| Integrity is critical (finance, healthcare) | O | |
| Table is small (under 100K rows) | O | |
| Reads vastly outnumber writes | O | |
| Aggregation/stats queries are frequent | O | |
| 5+ JOIN queries repeat often | O | |
| Indexes and caching already maxed out | O |
OLTP vs OLAP — Why They Need Different Normalization Levels
Databases serve two fundamentally different purposes.
OLTP (Online Transaction Processing) — “The DB handling orders right now”
- Users place orders, make payments, cancel, and update shipping status
- Short transactions reading/writing 1-10 rows, hundreds to thousands per second
- Priority: "This order data must be correct" -> integrity first
- Higher normalization is better (no duplicates, always accurate)
OLAP (Online Analytical Processing) — “The DB showing yesterday’s revenue”
- Admins view dashboards, analyze monthly trends, generate reports
- Heavy queries scanning millions to billions of rows for aggregation
- Priority: "Show results fast" -> read speed first
- Denormalization is better (fewer JOINs, pre-aggregated data)
| Property | OLTP (Orders, Payments) | OLAP (Dashboards, Reports) |
|---|---|---|
| Typical queries | INSERT, UPDATE, SELECT ... WHERE id = ? | SELECT SUM/AVG/COUNT ... GROUP BY ... date range |
| Data per query | 1-10 rows | Tens of thousands to billions of rows |
| Normalization level | High (3NF) | Low (denormalized, star schema) |
| Optimized for | Write speed, integrity | Read speed, aggregation |
| Real-world examples | Service DB (MySQL, PostgreSQL) | DW, BI (BigQuery, Redshift, ClickHouse) |
Why You Shouldn’t Do Both in the Same DB
Scenario: Running sales reports (OLAP) directly on the service DB (OLTP)
1. SELECT SUM(total_amount) FROM orders WHERE order_date >= '2026-01-01'
-> Full-scans 1M rows, consuming table locks and IO
2. Meanwhile, user order INSERTs start queuing up
-> "I can't place an order!" outage begins
3. The report query also slows down due to transaction isolation
-> Both are slow. Nobody is happy.
How to Separate Them in Practice
[OLTP DB] -> Sync -> [OLAP DB / DW]
(Service) (CDC, ETL) (Analytics)
- CDC (Change Data Capture): Stream DB change logs to analytics DB in real-time
e.g., Debezium, AWS DMS
- ETL (Extract-Transform-Load): Periodically extract, transform, and load data
e.g., Airflow, dbt
- Simple cases: Materialized View (PostgreSQL) or Read Replica
Bottom line: Keep your OLTP DB normalized. If you need analytics, separate it into a dedicated OLAP DB. “Running report queries directly on the service DB” is the classic mistake of mixing OLTP and OLAP.
6. MySQL vs PostgreSQL — Normalization/Denormalization Differences
| Feature | MySQL | PostgreSQL | Value as Denormalization Alternative |
|---|---|---|---|
| Materialized View (MV) | Not available | Available (REFRESH CONCURRENTLY) | In PG, check MV before denormalizing |
| Partial indexes | Not available | CREATE INDEX ... WHERE condition | Specific queries solved by index -> no denormalization needed |
| Generated Column | VIRTUAL + STORED | STORED only | Same-table calculations work in both |
| JSONB | JSON (limited indexing) | JSONB + GIN (powerful) | PG handles semi-structured data without normalization |
| UPSERT | ON DUPLICATE KEY UPDATE | ON CONFLICT DO UPDATE | Summary table refresh works in both |
| Triggers | FOR EACH ROW only | FOR EACH ROW + FOR EACH STATEMENT | PG triggers more flexible |
Terminology Explained
Materialized View (MV)
A “cached view” that physically stores query results. A regular View re-executes the query every time, but an MV stores results like a table — reads are fast.
-- Cache daily revenue as an MV (PostgreSQL)
CREATE MATERIALIZED VIEW daily_sales AS
SELECT DATE(order_date) AS day, SUM(total_amount) AS revenue
FROM orders GROUP BY DATE(order_date);
-- Query: reads as fast as a regular table
SELECT * FROM daily_sales WHERE day = '2026-04-01';
-- Must refresh manually when data changes
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales;
You get the same effect as manually creating a summary table with UPSERT sync — in a single DDL statement. MySQL doesn’t have this feature, so you must manage summary tables yourself.
Partial Index
An index on only the rows matching a condition, not the entire table. Smaller index size, faster targeted queries.
-- "We frequently query unpaid orders" -> full index is wasteful
-- PostgreSQL: index only unpaid orders
CREATE INDEX idx_orders_unpaid ON orders (created_at)
WHERE status = 'UNPAID';
-- This query uses the partial index
SELECT * FROM orders WHERE status = 'UNPAID' ORDER BY created_at;
MySQL doesn’t support partial indexes. In such cases, you’d need to denormalize the status column into a separate table or create a full index.
Generated Column
A column automatically computed from other columns. You can’t INSERT or UPDATE it directly — the database manages it.
-- MySQL: VIRTUAL (computed on read) / STORED (persisted to disk)
ALTER TABLE order_items
ADD COLUMN subtotal DECIMAL(15, 0)
GENERATED ALWAYS AS (price * quantity) STORED;
-- PostgreSQL: STORED only
ALTER TABLE order_items
ADD COLUMN subtotal DECIMAL(15, 0)
GENERATED ALWAYS AS (price * quantity) STORED;
-- Just read subtotal directly
SELECT * FROM order_items WHERE subtotal > 100000;
For same-table calculations, consider Generated Columns before denormalization. No sync concerns.
JSONB
A PostgreSQL-specific type that stores JSON in binary format with indexing support. Unlike MySQL’s JSON, it supports GIN indexes for fast internal key searches.
-- PostgreSQL: JSONB + GIN index
CREATE TABLE products (
id BIGINT PRIMARY KEY,
name VARCHAR(200) NOT NULL,
attributes JSONB NOT NULL DEFAULT '{}'
);
-- Index for searching inside JSONB
CREATE INDEX idx_products_attrs ON products USING GIN (attributes);
-- Find products where color is "red" — uses the index
SELECT * FROM products WHERE attributes @> '{"color": "red"}';
Semi-structured data can be stored and searched efficiently without normalizing into separate tables. MySQL’s JSON type doesn’t support GIN indexes, so search performance is limited.
PostgreSQL often requires less denormalization than MySQL thanks to Materialized Views, partial indexes, and JSONB. What you’d denormalize for performance in MySQL may be solvable with MVs or partial indexes in PostgreSQL.
Summary
| Key Point | Details |
|---|---|
| Normalization = eliminate duplication | ”If this value changes, do I need to update multiple rows?” -> If yes, split the table |
| 3NF is the practical target | 1NF (one per cell) -> 2NF (watch composite keys) -> 3NF (separate values determined by non-PK) |
| Denormalization is a last resort | Indexes -> query optimization -> caching -> MV -> still slow? -> denormalize |
| Snapshots != Denormalization | Storing order-time price/address is a business requirement, not duplication |
| MySQL vs PG | PG needs less denormalization thanks to MVs, partial indexes, JSONB |
Next up: Constraints and Data Integrity — CHECK, UNIQUE, FK trade-offs, and defensive schema design.