RDB Design Series Part 1: Schema Basics — Naming, Data Types, PK Strategy, NULL

RDB Design Series Part 1: Schema Basics — Naming, Data Types, PK Strategy, NULL


Introduction

“How hard can it be to create a table?” — that’s what everyone thinks at first. Just throw some columns into CREATE TABLE and you’re done.

But as the service grows, things you decided carelessly at the start come back to haunt you. Column names are inconsistent so you have to check the ERD for every query, VARCHAR(255) everywhere bloats your indexes, and the INT primary key overflows past 2.1 billion, triggering an emergency migration at 3 AM.

This post covers 4 things you need to decide before creating a table:

  1. Naming conventions
  2. Data type selection
  3. PK strategy
  4. NULL semantics

All of these “seem unimportant now but are extremely expensive to change later.”


1. Naming Conventions

1.1 Why It Matters

Table and column names outlive your code. You can refactor the application anytime, but renaming a column means touching every query, ORM mapping, API response, index, and constraint.

-- Imagine inheriting this schema
CREATE TABLE TBL_USR_INF (
    usrSeqNo BIGINT,
    usrNm VARCHAR(100),
    usrStCd VARCHAR(2),      -- 01: active, 02: withdrawn, 03: suspended...?
    crtDtm DATETIME,
    updDtm DATETIME
);

Without abbreviation lookup tables, this schema is unreadable. It forces you to open the ERD every time you write a query.

1.2 Why snake_case?

There are clear reasons why snake_case is the default in the database world.

ReasonExplanation
Avoids case sensitivity trapsMySQL behavior varies by OS, PostgreSQL folds to lowercase without quotes. OrderItem works in some environments, fails in others. snake_case is all lowercase, so it behaves identically across every DB and OS
Reads well with SQLSQL keywords are conventionally uppercase (SELECT, FROM, WHERE). Mixed-case identifiers blur the line between keywords and names. SELECT OrderDate FROM OrderItems vs SELECT order_date FROM order_items — the latter is immediately readable
ORM auto-mappingJPA/Hibernate automatically maps camelCase entity fields to snake_case columns (ImplicitNamingStrategy). If the DB uses snake_case, everything works without @Column(name=...) annotations
CLI/terminal convenienceYou can type identifiers directly in psql, mysql without quotes or backticks. Typing SELECT * FROM "OrderItems" with quotes every time is painful
Industry standardPostgreSQL official docs, MySQL official examples, and major frameworks (Rails, Django, Laravel) all default to snake_case
-- What happens when you use camelCase
CREATE TABLE "OrderItems" ("orderId" BIGINT, "productName" VARCHAR(100));

-- 1. Every query needs quotes (PostgreSQL)
SELECT "orderId", "productName" FROM "OrderItems";  -- every single time

-- 2. Forget quotes? Error.
SELECT orderId FROM OrderItems;  -- Looks for "orderitems"."orderid"

-- 3. pg_dump and other tools may drop quotes -> restore fails

-- With snake_case?
CREATE TABLE order_items (order_id BIGINT, product_name VARCHAR(100));
SELECT order_id, product_name FROM order_items;  -- clean, no quotes needed

1.3 Table Names

RuleGoodBadWhy
snake_caseorder_itemOrderItem, orderitemAll the reasons explained above
Pluralorders, usersorder, userA table is a collection of rows. Plural is natural
No prefixesorderstbl_orders, t_ordersPrefixes carry zero information. Just noise
Avoid reserved wordsuser_accountsuser, orderuser is reserved in PostgreSQL/MySQL. Requires quoting every time

Singular vs plural debate: Honestly, both conventions are common. What matters is picking one and sticking to it. This guide recommends plural, but if your team uses singular, stay consistent.

Case Sensitivity: MySQL vs PostgreSQL

This is a subtle issue that many people overlook until it breaks in production.

-- MySQL: depends on lower_case_table_names setting
-- 0 (Linux default): case-sensitive -> OrderItems != orderitems
-- 1 (Windows/macOS default): stored as lowercase -> OrderItems = orderitems
-- 2 (macOS): compared as lowercase but original name preserved

-- PostgreSQL: always folds to lowercase without quotes
CREATE TABLE OrderItems (...);   -- Actually creates "orderitems"
SELECT * FROM OrderItems;        -- Queries orderitems
SELECT * FROM "OrderItems";      -- This preserves case (not recommended)
BehaviorMySQLPostgreSQL
Case sensitivityDepends on OS/setting (lower_case_table_names)Always folds to lowercase without quotes
Accessing OrderItemsSuccess or failure depending on settingAccessible as orderitems, "OrderItems" is separate
Conclusionsnake_case is safesnake_case is safe

Regardless of which DB you use, sticking to snake_case eliminates this problem entirely.

Reserved Word Pitfalls

-- In PostgreSQL, "user" is reserved
SELECT * FROM user;          -- Error
SELECT * FROM "user";        -- Works, but requires quotes every time
SELECT * FROM users;         -- Clean

-- In MySQL, "order" is reserved
SELECT * FROM order;         -- Error
SELECT * FROM `order`;       -- Works, but requires backticks every time
SELECT * FROM orders;        -- Clean

1.4 Column Names

RuleGoodBadWhy
snake_casecreated_atcreatedAt, CreatedAtORMs handle auto-conversion (JPA: ImplicitNamingStrategy)
No abbreviationsstatus, descriptionsts, descdesc is also a reserved word (ORDER BY ... DESC)
Boolean: is/has prefixis_active, has_couponactive, coupon_ynMakes the meaning explicit. _yn is a legacy convention
Timestamps: _at suffixcreated_at, deleted_atreg_date, crt_dtmClearly indicates a timestamp
FK: referenced_table_iduser_id, order_idusr_seq, fk_orderImmediately obvious which table’s PK is referenced

PK: id vs table_name_id — Which is Better?

iduser_id (includes table name)
ProsConcise, ORM default (JPA @Id), reads naturally as user.idColumn name alone shows origin in JOINs, better SQL readability
ConsJOINs require table prefix: users.id = orders.user_iduser.user_id feels redundant, needs ORM mapping
Preferred byRails, JPA/Hibernate, Django (ORM-centric)PostgreSQL community, DBA-led teams, SQL-heavy environments

The most common pattern in practice is id for the table’s own PK, referenced_table_id for FKs.

CREATE TABLE users (
    id BIGINT PRIMARY KEY,        -- Own table: id
    name VARCHAR(50) NOT NULL
);

CREATE TABLE orders (
    id BIGINT PRIMARY KEY,        -- Own table: id
    user_id BIGINT NOT NULL,      -- FK: referenced_table_id
    FOREIGN KEY (user_id) REFERENCES users(id)
);

Either approach works — what matters most is consistency within your team. This post uses the ORM-friendly id + referenced_table_id pattern.

1.5 Index and Constraint Names

Without explicit names, the DB auto-generates names like SYS_C007342. When this shows up in production error logs, it’s meaningless.

-- Recommended patterns
ALTER TABLE orders ADD CONSTRAINT pk_orders PRIMARY KEY (id);
ALTER TABLE orders ADD CONSTRAINT uq_orders_order_number UNIQUE (order_number);
ALTER TABLE order_items ADD CONSTRAINT fk_order_items_order_id
    FOREIGN KEY (order_id) REFERENCES orders(id);
CREATE INDEX idx_orders_user_id_created_at ON orders (user_id, created_at);
ObjectPatternExample
PKpk_{table}pk_orders
Uniqueuq_{table}_{column}uq_orders_order_number
FKfk_{table}_{column}fk_order_items_order_id
Indexidx_{table}_{columns}idx_orders_user_id_created_at

2. Data Type Selection

2.1 VARCHAR — Why Length Matters More Than You Think

-- Common mistake: 255 for everything
CREATE TABLE users (
    name VARCHAR(255),
    email VARCHAR(255),
    phone VARCHAR(255),       -- 255 for a phone number?
    zip_code VARCHAR(255)     -- 255 for a zip code?
);

“It’s variable length anyway, so why not go big?” — Wrong.

ImpactExplanation
Index sizeInnoDB’s max index key size is 3072 bytes. VARCHAR(255) + utf8mb4 = up to 1020 bytes. Three composite columns and you’re at the limit
Memory allocationMySQL’s MEMORY engine and temp tables allocate VARCHAR at max length. Ten VARCHAR(255) columns = 2,550 bytes per row
Intent documentationVARCHAR(20) says “this field holds short values.” 255 says “I didn’t think about it”

PostgreSQL is different: PostgreSQL stores VARCHAR(n) and TEXT identically under the hood. The length limit acts like a CHECK constraint — there’s no performance or storage difference. That’s why the PostgreSQL community often advises “just use TEXT if you don’t need a length limit.” However, if you might use MySQL too, or want the schema itself to serve as documentation, specifying appropriate VARCHAR lengths is still a good practice.

Practical Length Guidelines

Use CaseLengthRationale
Person nameVARCHAR(50)Covers most international names
EmailVARCHAR(320)RFC 5321 max: 320 chars (local 64 + @ + domain 255)
Phone numberVARCHAR(20)International format max 15 digits + separators
Zip/postal codeVARCHAR(10)US ZIP+4 is 10 chars
URLVARCHAR(2048)Practical browser max
Short code/statusVARCHAR(30)ACTIVE, PENDING_APPROVAL, etc.

DB-Specific Notes

  • MySQL (InnoDB): VARCHAR lengths of 255 or less use 1 byte to store the length prefix; 256 or more use 2 bytes. It’s a small difference, but worth knowing that internal storage behavior changes at the 255 boundary.
  • Oracle: Uses VARCHAR2, and you must explicitly choose between character and byte semantics — e.g., VARCHAR2(50 CHAR) vs VARCHAR2(50 BYTE). When handling multibyte characters (Korean, Japanese, etc.), always specify CHAR semantics.
  • PostgreSQL: As noted above, VARCHAR(n) and TEXT are stored identically, so the choice is less about length and more about whether the limit represents a business rule.
  • SQL Server: VARCHAR is byte-based, while NVARCHAR is character-based (UTF-16, 2 bytes per character). For non-ASCII text (Korean, Japanese, etc.), use NVARCHAR. Note the max length difference: VARCHAR(8000) vs NVARCHAR(4000).

The recommended lengths in the table above are based on RFCs and international standards — they apply regardless of which database you use.

2.2 Charset & Collation — The Hidden Minefield of Strings

Just as important as VARCHAR length — but far more often ignored — are charset (character encoding) and collation (sorting/comparison rules).

MySQL: utf8 ≠ UTF-8

MySQL’s most famous gotcha:

-- ❌ utf8 is only 3 bytes max → can't store emoji (💡)
CREATE TABLE posts (
    title VARCHAR(200)
) CHARACTER SET utf8;

-- ✅ utf8mb4 is real UTF-8 (up to 4 bytes)
CREATE TABLE posts (
    title VARCHAR(200)
) CHARACTER SET utf8mb4;

Since MySQL 8.0, the default charset is utf8mb4, but if you’re working with legacy databases, always verify.

How Collation Affects Your Queries

Collation determines how strings are compared and sorted. The same data can produce different results for WHERE and ORDER BY depending on the collation.

CollationBehaviorUse Case
utf8mb4_unicode_ciCase-insensitive, accent-insensitiveGeneral text (names, emails)
utf8mb4_binExact byte-level comparisonHashes, tokens, password hashes
utf8mb4_0900_ai_ciMySQL 8.0 default. Unicode 9.0-based, more accurate sortingNew projects on MySQL 8.0+
-- Results differ based on collation
-- utf8mb4_unicode_ci: 'cafe' = 'café' = 'CAFE' (all equal)
-- utf8mb4_bin:        'cafe' ≠ 'café' ≠ 'CAFE' (all different)

-- Pro tip: you can set collation per column
CREATE TABLE users (
    email VARCHAR(320) COLLATE utf8mb4_unicode_ci,  -- case-insensitive search
    api_key VARCHAR(64) COLLATE utf8mb4_bin          -- exact matching
);

PostgreSQL: Encoding Is Simple, Collation Is Complex

PostgreSQL sets encoding at database creation, and UTF-8 is the de facto standard. There’s no charset trap like MySQL’s.

-- Set at database level
CREATE DATABASE myapp
    ENCODING = 'UTF8'
    LC_COLLATE = 'en_US.UTF-8';

-- PostgreSQL 12+: ICU collation for finer control
CREATE COLLATION korean (provider = icu, locale = 'ko-KR');

ALTER TABLE users
    ALTER COLUMN name TYPE VARCHAR(50) COLLATE "korean";

PostgreSQL collation comes in two flavors: the traditional OS locale-dependent approach and the ICU provider. For new projects, prefer ICU — it prevents the nasty surprise of sort order changing after an OS upgrade.

SQL Server: NVARCHAR + Collation

SQL Server sets the default collation at the database level, not per table or column.

-- Database default collation
CREATE DATABASE MyApp COLLATE Korean_Wansung_CI_AS;

-- CI = Case Insensitive, AS = Accent Sensitive
-- Use Korean_Wansung family for Korean sorting

What the Collation Abbreviations Mean:

AbbreviationMeaningExample
CI (Case Insensitive)Treats upper/lowercase as equal'abc' = 'ABC'true
CS (Case Sensitive)Distinguishes upper/lowercase'abc' = 'ABC'false
AI (Accent Insensitive)Treats accented chars as equal'café' = 'cafe'true
AS (Accent Sensitive)Distinguishes accented chars'café' = 'cafe'false

Why Accent Sensitivity matters in practice: If your application handles European languages, whether é, ë, è match e directly affects search results. For Korean-only services, the AS/AI difference is rarely noticeable since Korean doesn’t use accents — but for multilingual services, this is a must-consider setting.

As mentioned in the DB-specific notes above, use NVARCHAR when handling non-ASCII characters like Korean or Japanese.

Practical Rules

DBRecommended CharsetRecommended Collation
MySQL 8.0+utf8mb4 (default)utf8mb4_0900_ai_ci (default, AI=Accent Insensitive, CI=Case Insensitive)
MySQL 5.7utf8mb4 (must specify!)utf8mb4_unicode_ci
PostgreSQLUTF8ICU provider-based (ko-KR, etc.)
SQL ServerUse NVARCHARKorean_Wansung_CI_AS (for Korean services)

Key takeaway: Charset and collation should be decided once at project inception and kept consistent. Changing them later requires a table rebuild + index recreation.

2.3 Integer Types — INT vs BIGINT

TypeBytesRange (UNSIGNED)When It Runs Out
INT40 ~ ~2.1 billion100K rows/day ~ 58 years. Looks safe, but…
BIGINT80 ~ ~9.2 quintillionEffectively never

“2.1 billion should be enough, right?” — Here’s the trap:

-- Consider these scenarios
1. Orders table: 500K/day x 365 days x 10 years = 1.825 billion -> INT almost maxed
2. Log table: 10M/day -> overflows in 7 months
3. Deletions: AUTO_INCREMENT never decreases after deletes

Practical rule: Always start PKs with BIGINT. Saving 4 bytes isn’t worth an emergency migration at 3 AM.

INT to BIGINT Migration Cost: MySQL vs PostgreSQL

DBWhat INT -> BIGINT requiresImpact
MySQL (InnoDB)Table rebuild (ALGORITHM=COPY)100M rows = tens of minutes to hours, writes may be blocked
PostgreSQLAlso table rebuild (ALTER COLUMN TYPE)ACCESS EXCLUSIVE lock on entire table -> reads AND writes blocked

Both databases make this an extremely expensive operation on large tables. Start with BIGINT from the beginning.

Note: MySQL has online DDL tools like pt-online-schema-change or gh-ost for zero-downtime changes. PostgreSQL uses a new-column + gradual-copy + column-swap strategy. Both are complex and risky.

2.4 Money — DECIMAL vs FLOAT

-- FLOAT trap
SELECT CAST(0.1 + 0.2 AS FLOAT);
-- Result: 0.30000000000000004  (not 0.30)

-- DECIMAL is exact
SELECT CAST(0.1 AS DECIMAL(10,2)) + CAST(0.2 AS DECIMAL(10,2));
-- Result: 0.30
TypePrecisionUse Case
FLOAT / DOUBLEApproximate (IEEE 754)Scientific calculations, coordinates, sensor data
DECIMAL(p, s)ExactMoney, quantities, rates — anything where even 1 cent matters

Choosing DECIMAL Precision

-- Currencies with no decimals (KRW, JPY)
price DECIMAL(15, 0)           -- Up to 999 trillion

-- Currencies with 2 decimals (USD, EUR)
price DECIMAL(15, 2)           -- Up to 9,999,999,999,999.99

-- Exchange rates, interest rates
exchange_rate DECIMAL(12, 6)   -- 1,234.567890

Rule: Never use FLOAT for money columns. No exceptions.

2.5 Date/Time — DATETIME vs TIMESTAMP

This matters more than you’d think — and the same type name behaves differently in MySQL vs PostgreSQL, so you must understand both.

MySQL Date/Time Types

PropertyDATETIMETIMESTAMP
StorageStored as-isConverted to UTC before storage
Range1000-01-01 ~ 9999-12-311970-01-01 ~ 2038-01-19
TimezoneNot affectedConverted based on time_zone setting
Size5 bytes4 bytes
-- MySQL: Timezone difference demo
SET time_zone = '+09:00';
INSERT INTO test (dt, ts) VALUES (NOW(), NOW());

SET time_zone = '+00:00';
SELECT dt, ts FROM test;
-- dt: 2026-04-05 14:00:00  (unchanged)
-- ts: 2026-04-05 05:00:00  (converted to UTC)

PostgreSQL Date/Time Types

PropertyTIMESTAMPTIMESTAMPTZ
StorageStored as-isConverted to UTC before storage
Range4713 BC ~ 294276 AD4713 BC ~ 294276 AD
TimezoneNot affectedConverted based on timezone setting
Size8 bytes8 bytes
-- PostgreSQL: Timezone difference demo
SET timezone = 'Asia/Seoul';
INSERT INTO test (ts, tstz) VALUES (NOW(), NOW());

SET timezone = 'UTC';
SELECT ts, tstz FROM test;
-- ts: 2026-04-05 14:00:00    (unchanged — stored exactly as entered)
-- tstz: 2026-04-05 05:00:00  (converted to UTC for display)

SQL Server Date/Time Types

PropertyDATETIME2DATETIMEOFFSET
StorageStored as-isStored with UTC offset
Range0001-01-01 ~ 9999-12-310001-01-01 ~ 9999-12-31
TimezoneNot affectedIncludes offset info (+09:00, etc.)
Size6-8 bytes (depends on precision)8-10 bytes
PrecisionUp to 100 nanoseconds (DATETIME2(7))Up to 100 nanoseconds
-- SQL Server: Timezone difference demo
DECLARE @dt DATETIME2 = '2026-04-05 14:00:00';
DECLARE @dto DATETIMEOFFSET = '2026-04-05 14:00:00 +09:00';

SELECT @dt;   -- 2026-04-05 14:00:00.0000000 (no offset)
SELECT @dto;  -- 2026-04-05 14:00:00.0000000 +09:00

-- Convert to UTC
SELECT SWITCHOFFSET(@dto, '+00:00');
-- 2026-04-05 05:00:00.0000000 +00:00

DATETIME vs DATETIME2: SQL Server also has a legacy DATETIME type, but its range (1753~9999) and precision (3.33ms) are limited. Always use DATETIME2 for new projects.

MySQL vs PostgreSQL vs SQL Server Type Mapping

PurposeMySQLPostgreSQLSQL ServerNote
Timezone-aware timeTIMESTAMPTIMESTAMPTZDATETIMEOFFSETDifferent names, same role
Timezone-naive timeDATETIMETIMESTAMPDATETIME2Caution: same name, different role!
Date onlyDATEDATEDATESame
Time onlyTIMETIME / TIMETZTIMEPostgreSQL has timezone variant

Confusion point: MySQL’s TIMESTAMP and PostgreSQL’s TIMESTAMP share the same name but behave differently. MySQL TIMESTAMP is timezone-aware, while PostgreSQL TIMESTAMP is timezone-naive. PostgreSQL’s timezone-aware type is TIMESTAMPTZ. SQL Server uses a distinct name DATETIMEOFFSET, which avoids this confusion.

The 2038 Problem

MySQL’s TIMESTAMP is internally stored as a 4-byte integer (Unix timestamp). It overflows on January 19, 2038. PostgreSQL uses 8 bytes and SQL Server’s DATETIME2 uses 6-8 bytes, so neither has this problem.

ScenarioMySQLPostgreSQLSQL Server
Global serviceTIMESTAMP (watch for 2038)TIMESTAMPTZDATETIMEOFFSET
Single-region serviceDATETIMETIMESTAMPTZ (still recommended)DATETIME2
Birth dateDATEDATEDATE
Event schedulingDATETIMETIMESTAMPDATETIME2
created_at, updated_atTIMESTAMP or DATETIMETIMESTAMPTZDATETIME2 or DATETIMEOFFSET

PostgreSQL tip: The official PostgreSQL docs recommend “use TIMESTAMPTZ for almost everything.” Plain TIMESTAMP (without timezone) is only for rare cases where you need an absolute time in a specific timezone, like “2 PM KST for an event.”

SQL Server tip: DATETIMEOFFSET stores the offset value (+09:00) alongside the time, preserving which timezone the data was entered in. Useful for global services. For single-region apps, DATETIME2 is sufficient.

2.6 ENUM vs Lookup Tables

-- Option 1: ENUM
CREATE TABLE orders (
    status ENUM('PENDING', 'PAID', 'SHIPPED', 'CANCELLED')
);

-- Option 2: Lookup table
CREATE TABLE order_statuses (
    id INT PRIMARY KEY,
    name VARCHAR(30) NOT NULL UNIQUE
);

CREATE TABLE orders (
    status_id INT REFERENCES order_statuses(id)
);

-- Option 3: Plain VARCHAR
CREATE TABLE orders (
    status VARCHAR(30) NOT NULL  -- Used with CHECK constraint
);
ApproachProsCons
ENUMStorage efficient (1-2 bytes), value restrictionAdding/removing values requires ALTER TABLE (MySQL: table rebuild). PostgreSQL allows ADD VALUE but can’t remove or rename — requires type recreation
Lookup tableFlexible add/remove, can hold extra attributes (description, sort order, active flag)Requires JOIN, slightly more complex
VARCHAR + CHECKSimple, more flexible than ENUMAdding values requires ALTER TABLE DROP/ADD CONSTRAINT (DDL change). Typo risk, larger storage

The “It’ll Never Change” Trap

At first, PENDING, PAID, CANCELLED — 3 values seem enough. But in production:

v1.0: PENDING, PAID, CANCELLED                          — 3 values
v1.3: + REFUNDED                                        — 4 values
v2.0: + PARTIAL_REFUND, DISPUTED, PAYMENT_FAILED        — 7 values
v2.5: CANCELLED -> CANCELLED_BY_USER, CANCELLED_BY_ADMIN — 8 values + rename

With ENUM or VARCHAR+CHECK, every change requires a DDL modification. On large tables, that can impact the live service.

ENUM Pitfalls by Database

IssueMySQLPostgreSQL
Adding valuesALTER TABLE -> table rebuild (minutes on large tables)ALTER TYPE ... ADD VALUE -> fast, but can’t run inside a transaction
Removing/renaming valuesALTER TABLE -> table rebuildNot possible — must create a new type and swap
ORM syncJava enum <-> DB ENUM mismatch = runtime error. DB must be updated before app deploymentSame
SortingSorts by internal index (declaration order)Declaration order, not alphabetical — unexpected sort results

Lookup Tables Are Better

-- Adding a value: one INSERT. No DDL change. Zero service impact.
INSERT INTO order_statuses (id, name) VALUES (5, 'REFUNDED');

-- Freely add metadata later
ALTER TABLE order_statuses ADD COLUMN display_name VARCHAR(50);
ALTER TABLE order_statuses ADD COLUMN is_terminal BOOLEAN DEFAULT FALSE;

“But what about JOIN cost?” — Lookup tables typically have tens of rows. The entire table is cached in memory, so the measured performance difference is negligible. Application-level caching eliminates the JOIN entirely.

Decision Guide

Default choice                          -> Lookup table (most flexible and safe)
Values that absolutely never change     -> ENUM or VARCHAR+CHECK (e.g., gender M/F/X, weekdays, ISO codes)
No permission to create new tables      -> VARCHAR+CHECK (when lookup tables aren't an option)

Practical tip: Don’t trust “this value will never change” at face value. When in doubt, a lookup table is the choice that causes the least pain down the road.

2.7 BOOLEAN Type

-- MySQL: Actually an alias for TINYINT(1)
is_active BOOLEAN DEFAULT TRUE    -- Internal: TINYINT(1) DEFAULT 1

-- PostgreSQL: Real BOOLEAN
is_active BOOLEAN DEFAULT TRUE    -- Stores true/false/null

Watch out:

-- MySQL BOOLEAN is TINYINT(1), so this works
INSERT INTO users (is_active) VALUES (2);   -- No error!
INSERT INTO users (is_active) VALUES (99);  -- Also fine!

-- Defend with CHECK constraint (MySQL 8.0.16+)
ALTER TABLE users ADD CONSTRAINT chk_is_active CHECK (is_active IN (0, 1));

2.8 TEXT vs VARCHAR

PropertyVARCHAR(n)TEXT
Max sizeMySQL: 65,535 bytes (per row)MySQL: 65,535 bytes, PostgreSQL: 1GB
IndexingDirectPrefix index only (MySQL)
Default valueSupportedMySQL: not supported, PostgreSQL: supported
Use caseShort strings with predictable lengthPost bodies, descriptions, JSON, etc.

MySQL vs PostgreSQL Differences

MySQL:      VARCHAR(n) != TEXT — different storage, indexing, and default value support
PostgreSQL: VARCHAR(n) ≈ TEXT — identical internal storage. VARCHAR(n) just adds a length check
DifferenceMySQLPostgreSQL
INDEX on TEXTPrefix index only (INDEX(col(255)))Regular index possible (up to ~2700 bytes)
DEFAULT on TEXTNot supportedSupported
VARCHAR vs TEXT performanceVARCHAR can be better (temp tables)No difference

Practical rule: For MySQL, use VARCHAR when length is predictable, TEXT otherwise. For PostgreSQL, use VARCHAR(n) only when length limits are a business rule — otherwise TEXT is fine for everything.


3. PK (Primary Key) Strategy

Choosing a PK isn’t just “what should the id be?” It’s an architectural decision that directly affects index structure, INSERT performance, and distributed system compatibility.

3.1 AUTO_INCREMENT / IDENTITY (Sequential Integer)

-- MySQL
CREATE TABLE orders (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    ...
);

-- PostgreSQL (recommended: IDENTITY — SQL standard)
CREATE TABLE orders (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    ...
);

-- PostgreSQL (legacy: SERIAL — sequence-based)
CREATE TABLE orders (
    id BIGSERIAL PRIMARY KEY,   -- Internally: sequence + DEFAULT combo
    ...
);

Why BIGINT + IDENTITY instead of SERIAL?

This choice addresses two independent concerns: syntax and size.

Syntax: SERIAL vs IDENTITY

  • SERIAL is PostgreSQL-specific. Under the hood, it’s just a macro that creates a sequence and sets DEFAULT nextval(...).
  • Problem 1: Sequence ownership is messy. DROP TABLE may leave orphaned sequences, and pg_dump ordering can break.
  • Problem 2: Nothing stops INSERT INTO orders(id) VALUES (999)arbitrary values bypass the sequence. When the sequence catches up to that value, you get duplicate key errors.
  • GENERATED ALWAYS AS IDENTITY is SQL:2003 standard and blocks arbitrary value insertion by default. (Requires explicit OVERRIDING SYSTEM VALUE to bypass.)

Size: INT (SERIAL) vs BIGINT (BIGSERIAL)

  • SERIAL = INTEGER (4 bytes, max ~2.1 billion), BIGSERIAL = BIGINT (8 bytes)
  • As covered in section 2.3, INT runs out faster than you’d expect. Saving 4 bytes per row isn’t worth a 3 AM emergency migration.
  • Switching INT→BIGINT means changing the PK type + all FK column types + rebuilding every index. On large tables, this can require hours of downtime.

Bottom line: Use BIGINT GENERATED ALWAYS AS IDENTITY instead of SERIAL or BIGSERIAL — it solves both the syntax and size problems at once.

ProsCons
Simple and intuitiveSecurity risk if exposed (total order count guessable)
Best fit for clustered indexes (sequential inserts = no page splits)Collisions in distributed setups (multiple DB servers)
Sort order = creation orderDB-dependent (can’t know ID before INSERT)
Small (8 bytes)ID conflicts during cross-table migrations

What Is a Clustered Index?

In InnoDB (MySQL), PK = clustered index. Data is physically stored in PK order.

PostgreSQL is different: PostgreSQL has no clustered index by default. Tables (heaps) store data in insertion order, and the PK is just a separate B+Tree index. The CLUSTER command can sort data once, but subsequent INSERTs won’t maintain order. This means random UUID inserts aren’t as devastating as in MySQL — though index size and cache efficiency concerns still apply.

[AUTO_INCREMENT — Sequential inserts]
Page 1: [1, 2, 3, 4, 5]
Page 2: [6, 7, 8, 9, 10]
Page 3: [11, 12, ...]        <- Always appends to last page. Clean.

[UUID — Random inserts]
Page 1: [3a2f..., 7b1c..., a9d4...]
Page 2: [1e8b..., 5c3a..., f2e1...]
INSERT -> 0x4d7... -> Must squeeze between pages 1 and 2 -> Page split!

Page splits cause increased disk I/O, index fragmentation, and degraded INSERT performance.

3.2 UUID v4 (Random)

-- MySQL 8.0
CREATE TABLE orders (
    id BINARY(16) PRIMARY KEY,  -- Store UUID as binary (not 36-byte string)
    ...
);

-- PostgreSQL
CREATE TABLE orders (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    ...
);
ProsCons
Globally unique — no collisions in distributed systemsRandom = poor clustered index performance (page splits)
Can be generated client-side without DBLarge (16 bytes binary, 36 bytes as string)
Can’t guess data from IDHard for humans to read (550e8400-e29b-41d4-a716-446655440000)
Larger indexes — every secondary index includes the PK

Storing UUID as VARCHAR(36) in MySQL is the worst option. Always convert to BINARY(16). MySQL 8.0’s UUID_TO_BIN(uuid, 1) function can also enable time-based ordering.

3.3 UUID v7 / ULID (Time-Sortable)

These solve UUID v4’s main weakness (random = page splits).

UUID v7 structure:
|-- 48-bit timestamp --|-- random --|
017F22E2-79B0-7CC3-98C4-DC0C0C07398F
^^^^^^^^^^^^^^^^
Time-sortable!

ULID structure:
|-- 48-bit timestamp --|-- 80-bit random --|
01ARZ3NDEKTSV4RRFFQ69G5FAV
^^^^^^^^^^
Time-sortable!
PropertyUUID v4UUID v7ULID
Time-sortableNoYesYes
Clustered index friendlyNoYesYes
Size16 bytes16 bytes16 bytes (26-char string)
StandardRFC 4122RFC 9562 (2024)Unofficial (de facto standard)
Native DB supportPostgreSQLPostgreSQL 17+No (app-generated)
// Generating UUID v7 in Java (requires library)
// com.github.f4b6a3:uuid-creator
UUID uuidV7 = UuidCreator.getTimeOrderedEpoch();

// Generating ULID
// com.github.f4b6a3:ulid-creator
Ulid ulid = UlidCreator.getMonotonicUlid();

3.4 Snowflake ID (Distributed)

Created by Twitter. Encodes time + machine ID + sequence into a 64-bit integer.

Snowflake structure (64-bit):
|1 bit (unused)|41 bits (timestamp)|10 bits (machine ID)|12 bits (sequence)|

- Timestamp: ~69 years of coverage
- Machine ID: up to 1,024 machines
- Sequence: 4,096 per millisecond
ProsCons
Fits in BIGINT (8 bytes)Needs ID generation server (or library)
Time-sortableMachine ID management required
No collisions in distributed systemsClock synchronization dependency (NTP)
Clustered index friendlyImplementation complexity

3.5 Which PK Strategy Should You Use?

ScenarioRecommendedWhy
Single DB, typical serviceAUTO_INCREMENT (BIGINT)Simple, best performance, sufficient for most cases
ID exposed externallyUUID v7 or ULIDCan’t guess order or total count
Microservices, multi-DBUUID v7 or SnowflakeGenerate IDs without DB, no collisions
Event sourcingUUID v7Event ordering + global uniqueness

Practical tip: “Start with AUTO_INCREMENT and switch when needed” is the most realistic approach. But always start with BIGINT for the PK type. INT to BIGINT requires a table rebuild, while BIGINT to UUID can be done via a new column + gradual migration.

Natural Key vs Surrogate Key

One more important decision: Should the PK be a business value, or a meaningless artificial key?

  • Natural Key: A unique value from the data itself used as PK (email, SSN, student ID, etc.)
  • Surrogate Key: A value with no business meaning used as PK (AUTO_INCREMENT id, UUID, etc.)
-- Natural Key: business value = PK
CREATE TABLE countries (
    code CHAR(2) PRIMARY KEY,  -- 'KR', 'US', 'JP'
    name VARCHAR(100)
);

-- Surrogate Key: artificial key = PK
CREATE TABLE countries (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    code CHAR(2) NOT NULL UNIQUE,
    name VARCHAR(100)
);
TypeProsCons
Natural KeyMeaningful without JOIN, built-in dedupBusiness rule changes require PK changes → cascading FK updates
Surrogate KeyPK never changes, easy FK managementNeed JOIN to see meaning, separate UNIQUE constraint needed

Why Natural Keys Are Dangerous — A Real-World Scenario

Natural Keys cause problems when the assumption “this value never changes” breaks.

Scenario: Email as PK

-- At design time: "Email is unique per user, let's make it the PK"
CREATE TABLE users (
    email VARCHAR(320) PRIMARY KEY,
    name VARCHAR(50)
);

CREATE TABLE orders (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    user_email VARCHAR(320) REFERENCES users(email),  -- FK
    amount DECIMAL(10,2)
);

CREATE TABLE reviews (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    user_email VARCHAR(320) REFERENCES users(email),  -- FK
    content TEXT
);

Six months later, users request an email change feature. Here’s what happens:

-- To change an email:
-- 1. Update the PK in users table
-- 2. Update the FK in orders table
-- 3. Update the FK in reviews table
-- 4. Every other table referencing user_email... all of them

-- CASCADE handles this automatically, but
-- on large tables → millions of rows updated → locks + downtime
UPDATE users SET email = 'new@email.com' WHERE email = 'old@email.com';

What if you’d used a Surrogate Key?

CREATE TABLE users (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    email VARCHAR(320) NOT NULL UNIQUE,  -- UNIQUE constraint, not PK
    name VARCHAR(50)
);

CREATE TABLE orders (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    user_id BIGINT REFERENCES users(id),  -- numeric FK
    amount DECIMAL(10,2)
);

-- Change email? Just 1 row in users. No FK changes needed.
UPDATE users SET email = 'new@email.com' WHERE id = 42;

Index Efficiency

Natural Keys also impact index size when used as FKs.

-- FK as BIGINT (8 bytes)
orders.user_id: 1M rows × 8 bytes = ~8MB index

-- FK as VARCHAR(320) (up to 1280 bytes in utf8mb4)
orders.user_email: 1M rows × ~30 bytes avg = ~30MB index
-- The gap widens further with composite indexes

When Is a Natural Key Actually Safe?

A Natural Key is safe when it meets all three conditions:

  1. The value never changes — ISO country codes (KR), currency codes (USD), etc.
  2. Few FK references from other tables — or the data volume is small even if referenced
  3. The value is short and fixed-lengthCHAR(2), CHAR(3) level. No index efficiency concerns.
-- ✅ Natural Key is appropriate
CREATE TABLE currencies (
    code CHAR(3) PRIMARY KEY,  -- 'USD', 'KRW', 'JPY' — ISO 4217, never changes
    name VARCHAR(50),
    symbol VARCHAR(5)
);

-- ❌ Natural Key is risky
CREATE TABLE users (
    email VARCHAR(320) PRIMARY KEY,     -- can change
    ...
);

CREATE TABLE products (
    sku VARCHAR(50) PRIMARY KEY,        -- SKU scheme can change with company policy
    ...
);

Practical rule: Almost always use a Surrogate Key as the PK, and protect Natural Keys with UNIQUE constraints. Even when you’re sure “this value will never change,” if the table is widely referenced via FKs, a Surrogate Key is the safer bet.


4. NULL Semantics

NULL doesn’t mean “no value.” It means “unknown.” This distinction is where all the confusion starts.

4.1 Three-Valued Logic

SQL uses three-valued logic with TRUE, FALSE, and UNKNOWN.

-- Comparisons with NULL always yield UNKNOWN
SELECT NULL = NULL;      -- NULL (not TRUE!)
SELECT NULL != NULL;     -- NULL (not TRUE!)
SELECT NULL > 0;         -- NULL
SELECT NULL = 0;         -- NULL

-- UNKNOWN acts like FALSE in WHERE clauses
SELECT * FROM users WHERE deleted_at = NULL;     -- 0 rows!
SELECT * FROM users WHERE deleted_at IS NULL;    -- Correct way

NULL Traps

-- Trap 1: NOT IN with NULL
SELECT * FROM orders WHERE user_id NOT IN (1, 2, NULL);
-- Result: 0 rows! (Every comparison becomes UNKNOWN, filtering everything)

-- Safe alternative
SELECT * FROM orders WHERE user_id NOT IN (1, 2);
-- Or
SELECT * FROM orders WHERE user_id NOT IN (
    SELECT id FROM blocked_users WHERE id IS NOT NULL
);

-- Trap 2: Aggregate functions and NULL
SELECT AVG(score) FROM reviews;
-- NULL rows are ignored. If 2 of 5 rows are NULL, it averages only 3

SELECT COUNT(score) FROM reviews;  -- Excludes NULL
SELECT COUNT(*) FROM reviews;      -- Includes all rows

-- Trap 3: UNIQUE constraint and NULL
-- MySQL: NULL is allowed multiple times in UNIQUE columns
INSERT INTO users (email) VALUES (NULL);  -- OK
INSERT INTO users (email) VALUES (NULL);  -- Also OK! (NULL != NULL)

-- PostgreSQL 14 and below: same as MySQL (multiple NULLs allowed)
-- PostgreSQL 15+: NULLS NOT DISTINCT option added
CREATE TABLE users (
    email VARCHAR(320),
    CONSTRAINT uq_users_email UNIQUE NULLS NOT DISTINCT (email)
);
-- Now NULL is only allowed once!

-- SQL Server: Only one NULL allowed (default behavior)

4.2 NOT NULL + DEFAULT vs Nullable

-- Option 1: Nullable (NULL until deleted)
deleted_at TIMESTAMP NULL

-- Option 2: NOT NULL + DEFAULT
status VARCHAR(20) NOT NULL DEFAULT 'ACTIVE'
retry_count INT NOT NULL DEFAULT 0

Decision Guide

NULL Is AppropriateNOT NULL Is Appropriate
”Unknown” is meaningful (deleted_at, approved_at)Value should always exist (name, email, created_at)
Optional relationship (mentor_id — might not have a mentor)Default is obvious (retry_count DEFAULT 0)
Value not yet determined (graduated_at — still enrolled)NULL checks would be needed everywhere in business logic

4.3 NULL and Indexes

-- MySQL InnoDB: NULL values ARE included in indexes
CREATE INDEX idx_users_deleted_at ON users (deleted_at);
-- IS NULL conditions CAN use the index
SELECT * FROM users WHERE deleted_at IS NULL;  -- Index usable

-- PostgreSQL: Same — NULL is included in indexes
-- But partial indexes are more efficient
CREATE INDEX idx_users_active ON users (id) WHERE deleted_at IS NULL;
-- Only indexes non-deleted users -> dramatically smaller index

4.4 Practical NULL Design Principles

1. Default to NOT NULL
   - When adding a column, always ask: "Can this value be absent?"
   - If the answer is "no", make it NOT NULL

2. When allowing NULL, make the meaning explicit
   - deleted_at IS NULL -> "not deleted" (clear)
   - phone IS NULL -> "phone not registered" (clear)
   - score IS NULL -> "not yet graded" vs 0 points (clear distinction)

3. Defend in application code
   - Java: Wrap with Optional, @Column(nullable = false)
   - SQL: Use COALESCE for default values
// JPA defense
@Entity
public class User {
    @Column(nullable = false)
    private String name;

    @Column(nullable = false)
    private String email;

    private LocalDateTime deletedAt;  // nullable — NULL until deleted
}
-- Safe NULL handling with COALESCE
SELECT
    name,
    COALESCE(phone, 'Not registered') AS phone,
    COALESCE(score, 0) AS score
FROM users;

Summary

TopicKey Principle
Namingsnake_case, plural tables, no abbreviations, avoid reserved words. Consistency is what matters most
Data typesVARCHAR lengths with justification, charset/collation decided early and consistent, DECIMAL for money, BIGINT for PKs, time types matched to service needs
PK strategyAUTO_INCREMENT + BIGINT covers most cases. UUID v7 or ULID for external exposure or distributed systems
NULLDefault to NOT NULL. When allowing NULL, you must be able to answer “what does the absence of this value mean?”

Cutting corners on these 4 decisions means coming back later for a migration. Spending 5 extra minutes when first creating a table is 100x cheaper than changing a column type on a live service.

Next up: Normalization and Denormalization — not the theory, but the practical judgment calls for when it’s right to break normalization rules.

Shop on Amazon

As an Amazon Associate, I earn from qualifying purchases.