SQL - Normalization (1NF, 2NF, 3NF, BCNF)
Before diving into normal forms, you need to understand functional dependencies. A functional dependency X → Y means that if you know the value of X, you can determine the value of Y. In a table with...
Key Insights
- Normalization eliminates data redundancy and update anomalies by systematically decomposing tables according to functional dependencies, with each normal form building upon the previous one.
- First Normal Form (1NF) requires atomic values, Second Normal Form (2NF) eliminates partial dependencies, Third Normal Form (3NF) removes transitive dependencies, and Boyce-Codd Normal Form (BCNF) addresses remaining anomalies with overlapping candidate keys.
- While normalization improves data integrity and reduces storage waste, real-world applications often require strategic denormalization for performance optimization—understanding the trade-offs is essential for effective database design.
Understanding Functional Dependencies
Before diving into normal forms, you need to understand functional dependencies. A functional dependency X → Y means that if you know the value of X, you can determine the value of Y. In a table with columns (StudentID, StudentName, CourseID, CourseName), StudentID → StudentName is a functional dependency because each student ID uniquely determines the student’s name.
-- Example table with multiple functional dependencies
CREATE TABLE student_enrollments (
student_id INT,
student_name VARCHAR(100),
student_email VARCHAR(100),
course_id INT,
course_name VARCHAR(100),
instructor_name VARCHAR(100),
grade CHAR(2)
);
-- Functional dependencies present:
-- student_id → student_name, student_email
-- course_id → course_name, instructor_name
-- (student_id, course_id) → grade
First Normal Form (1NF): Atomic Values
A table is in 1NF when every column contains only atomic (indivisible) values, and each row is unique. No repeating groups or arrays are allowed.
Violation Example:
-- NOT in 1NF: phone_numbers contains multiple values
CREATE TABLE customers_bad (
customer_id INT PRIMARY KEY,
name VARCHAR(100),
phone_numbers VARCHAR(200) -- "555-1234, 555-5678, 555-9012"
);
1NF Solution:
-- Approach 1: Separate table for phone numbers
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
name VARCHAR(100)
);
CREATE TABLE customer_phones (
customer_id INT,
phone_number VARCHAR(20),
phone_type VARCHAR(20), -- 'mobile', 'home', 'work'
PRIMARY KEY (customer_id, phone_number),
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
-- Insert example
INSERT INTO customers VALUES (1, 'John Doe');
INSERT INTO customer_phones VALUES
(1, '555-1234', 'mobile'),
(1, '555-5678', 'home');
Another common 1NF violation involves repeating columns:
-- NOT in 1NF: repeating column groups
CREATE TABLE orders_bad (
order_id INT PRIMARY KEY,
product1_id INT,
product1_qty INT,
product2_id INT,
product2_qty INT,
product3_id INT,
product3_qty INT
);
-- 1NF solution
CREATE TABLE orders (
order_id INT PRIMARY KEY,
order_date DATE
);
CREATE TABLE order_items (
order_id INT,
product_id INT,
quantity INT,
PRIMARY KEY (order_id, product_id),
FOREIGN KEY (order_id) REFERENCES orders(order_id)
);
Second Normal Form (2NF): Eliminate Partial Dependencies
A table is in 2NF if it’s in 1NF and every non-key attribute is fully functionally dependent on the entire primary key. This matters only for tables with composite primary keys.
Violation Example:
-- NOT in 2NF: partial dependencies exist
CREATE TABLE course_enrollments (
student_id INT,
course_id INT,
student_name VARCHAR(100), -- depends only on student_id
student_major VARCHAR(50), -- depends only on student_id
course_name VARCHAR(100), -- depends only on course_id
instructor VARCHAR(100), -- depends only on course_id
grade CHAR(2), -- depends on both (full dependency)
PRIMARY KEY (student_id, course_id)
);
The problem: student_name depends only on student_id, not on the full key (student_id, course_id). This creates redundancy and update anomalies.
2NF Solution:
-- Decompose into three tables
CREATE TABLE students (
student_id INT PRIMARY KEY,
student_name VARCHAR(100),
student_major VARCHAR(50)
);
CREATE TABLE courses (
course_id INT PRIMARY KEY,
course_name VARCHAR(100),
instructor VARCHAR(100)
);
CREATE TABLE enrollments (
student_id INT,
course_id INT,
grade CHAR(2),
PRIMARY KEY (student_id, course_id),
FOREIGN KEY (student_id) REFERENCES students(student_id),
FOREIGN KEY (course_id) REFERENCES courses(course_id)
);
-- Query to get full enrollment details
SELECT
s.student_name,
s.student_major,
c.course_name,
c.instructor,
e.grade
FROM enrollments e
JOIN students s ON e.student_id = s.student_id
JOIN courses c ON e.course_id = c.course_id;
Third Normal Form (3NF): Eliminate Transitive Dependencies
A table is in 3NF if it’s in 2NF and no non-key attribute depends on another non-key attribute (no transitive dependencies).
Violation Example:
-- NOT in 3NF: transitive dependency exists
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
employee_name VARCHAR(100),
department_id INT,
department_name VARCHAR(100), -- transitively dependent via department_id
department_location VARCHAR(100) -- transitively dependent via department_id
);
-- Problem: department_name depends on department_id, not directly on employee_id
-- employee_id → department_id → department_name (transitive)
3NF Solution:
CREATE TABLE departments (
department_id INT PRIMARY KEY,
department_name VARCHAR(100),
department_location VARCHAR(100)
);
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
employee_name VARCHAR(100),
department_id INT,
FOREIGN KEY (department_id) REFERENCES departments(department_id)
);
-- Query with department details
SELECT
e.employee_id,
e.employee_name,
d.department_name,
d.department_location
FROM employees e
JOIN departments d ON e.department_id = d.department_id;
Another practical example:
-- NOT in 3NF
CREATE TABLE orders_bad (
order_id INT PRIMARY KEY,
customer_id INT,
customer_name VARCHAR(100),
customer_credit_limit DECIMAL(10,2), -- depends on customer_id
order_date DATE,
order_total DECIMAL(10,2)
);
-- 3NF solution
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
customer_name VARCHAR(100),
customer_credit_limit DECIMAL(10,2)
);
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
order_total DECIMAL(10,2),
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
Boyce-Codd Normal Form (BCNF): Stricter Than 3NF
BCNF is a stronger version of 3NF. A table is in BCNF if for every functional dependency X → Y, X must be a superkey. Most tables in 3NF are also in BCNF, but violations occur with overlapping candidate keys.
Violation Example:
-- NOT in BCNF despite being in 3NF
CREATE TABLE course_schedule (
student_id INT,
course_id INT,
instructor VARCHAR(100),
PRIMARY KEY (student_id, course_id)
);
-- Constraint: Each instructor teaches only one course
-- Functional dependencies:
-- (student_id, course_id) → instructor (satisfies 3NF)
-- instructor → course_id (violates BCNF: instructor is not a superkey)
The issue: if an instructor teaches only one course, instructor → course_id, but instructor isn’t a superkey.
BCNF Solution:
CREATE TABLE instructor_courses (
instructor VARCHAR(100) PRIMARY KEY,
course_id INT
);
CREATE TABLE student_instructors (
student_id INT,
instructor VARCHAR(100),
PRIMARY KEY (student_id, instructor),
FOREIGN KEY (instructor) REFERENCES instructor_courses(instructor)
);
-- Query to get student course assignments
SELECT
si.student_id,
ic.course_id,
si.instructor
FROM student_instructors si
JOIN instructor_courses ic ON si.instructor = ic.instructor;
Practical Considerations and Denormalization
While normalization improves data integrity, it can impact query performance due to multiple joins. Strategic denormalization is often necessary:
-- Denormalized for read-heavy workloads
CREATE TABLE order_summary (
order_id INT PRIMARY KEY,
customer_id INT,
customer_name VARCHAR(100), -- denormalized
order_date DATE,
order_total DECIMAL(10,2),
item_count INT -- computed/denormalized
);
-- Maintain with triggers or application logic
CREATE TRIGGER update_order_summary
AFTER INSERT ON order_items
FOR EACH ROW
BEGIN
UPDATE order_summary
SET item_count = item_count + 1,
order_total = order_total + NEW.quantity * NEW.unit_price
WHERE order_id = NEW.order_id;
END;
Use denormalization when:
- Query performance is critical and data rarely changes
- Aggregations are expensive to compute on-the-fly
- Read operations vastly outnumber writes
Maintain normalized tables as the source of truth and create denormalized views or materialized views for performance:
CREATE MATERIALIZED VIEW order_details_mv AS
SELECT
o.order_id,
c.customer_name,
c.customer_email,
o.order_date,
SUM(oi.quantity * oi.unit_price) as order_total,
COUNT(oi.product_id) as item_count
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id, c.customer_name, c.customer_email, o.order_date;
-- Refresh periodically
REFRESH MATERIALIZED VIEW order_details_mv;
Normalization isn’t dogma—it’s a tool. Apply it to maintain data integrity, then selectively denormalize based on measured performance needs.