An Introduction to Database Normalization
By
Mike Hillyer
When users ask for advice about their database applications, one of the first things I try to help them with is the normalization of their
table structure. Normalization is the process of removing redundant data from your tables in order to improve storage efficiency, data integrity and
scalability. This improvement is balanced against an increase in complexity and potential performance losses from the joining of the normalized tables at query-time.
Mike's Bookshop
Let's say you were looking to start an online bookstore. You would need to track certain information about the books available to your site viewers, such as:
- Title
- Author
- Author
Biography
- ISBN
- Price
- Subject
|
- Number
of Pages
- Publisher
- Description
- Review
- Reviewer
Name
|
Let's start by adding a couple of books written by Luke Welling and Laura
Thomson. Because this book has two authors, we are going to need to accommodate both in our table. Lets take a look at a typical approach I encounter:
| Title |
Author1 |
Author2 |
ISBN |
Subject |
Pages |
Publisher |
| PHP and MySQL Web Development |
Luke
Welling |
Laura
Thomson |
0672317842 |
PHP, MySQL |
867 |
Sams |
| MySQL Tutorial |
Luke
Welling |
Laura
Thomson |
0672325845 |
MySQL |
300 |
Sams |
Let's take a look at some issues involved in this table design:
First, this table is not very efficient with storage. Lets imagine for a second that Luke and Laura were extremely busy writers and managed to produce
500 books for our database. The combination of their two names is 25 characters
long, and since we will repeat their two names in 500 rows we are wasting 25 × 500 = 12,500 bytes of storage space unnecessarily.
Second, this design does not protect data integrity. Lets once again imagine that Luke and Laura have written 500 books. Someone has had to type their names
into the database 500 times, and it is very likely that one of their names will be misspelled at least once (i.e.. Thompson instead of Thomson). Our
data is now corrupt, and anyone searching for book by author name will find some of the results missing. The same thing could happen with publisher name. Sams
publishes hundreds of titles and if the publisher's name were misspelled even once the list of books by publisher would be missing titles.
Third, this table does not scale well. First of all, we have limited ourselves to only two authors, yet some books are written by over a
dozen people. This kind of limitation is often exhibited in personal info tables where the typical design includes Phone1, Phone2, and Phone3 columns.
In both cases we are limiting future growth of our data. Another limitation to scalability is that fact that with all our data in one table our one
table file will grow faster and we will have more trouble with file size limitations of our underlying operating system.
First Normal Form
The normalization process involves getting our data to conform to three progressive normal forms, and a higher level of normalization cannot be
achieved until the previous levels have been achieved (there are actually five normal forms, but the last two are mainly academic and will not be discussed).
The First Normal Form (or 1NF) involves removal of redundant data from horizontal rows. We want to ensure that there is no duplication of data
in a given row, and that every column stores the least amount of information possible (making the field atomic).
In our table above we have two violations of First Normal Form:. First, we have more than one author field, and our subject field contains more
than one piece of information. With more than one value in a single field, it
would be very difficult to search for all books on a given subject. In addition, with
two author fields we have two fields to search in order to look for a book by a
specific author
We could get around these problems by modifying our table to have only one author field, with two entries for a book with two authors, as in the following example:
| Title |
Author |
ISBN |
Subject |
Pages |
Publisher |
| PHP and MySQL Web Development |
Luke
Welling |
0672317842 |
MySQL |
867 |
Sams |
| PHP and MySQL Web Development |
Laura
Thomson |
0672317842 |
PHP |
867 |
Sams |
| MySQL Tutorial |
Laura
Thomson |
0672325845 |
MySQL |
300 |
Sams |
| MySQL Tutorial |
Luke
Welling |
0672325845 |
MySQL |
300 |
Sams |
While this
approach has no redundant columns and
the subject column has only
one piece of information, we do have a problem that we now have two
rows for a
single book. Also, to ensure that we can do a search of author and
subject
(i.e. books on PHP by Luke Welling), we would need four rows to ensure
that we
had each combination of author and subject. Additionally, we would be
violating
the Second Normal Form, which will be described below.
A better
solution to our problem would be to
separate the redundant data
into separate tables, with the tables related to each other. In this
case we
will create an Author table and a Subject table to store our
information,
removing that information from the Book table:
Author Table:
| Author_ID |
Last
Name |
First
Name |
Bio |
| 1 |
Welling |
Luke |
Author
of books on PHP and MySQL |
| 2 |
Thomson |
Laura |
Another
author of books on PHP and MySQL |
Subject
Table:
| Subject_ID |
Subject |
| 1 |
PHP |
| 2 |
MySQL |
Book Table:
| ISBN |
Title |
Pages |
Publisher |
| 0672317842 |
PHP
and MySQL Web Development |
867 |
Sams |
| 0672325845 |
MySQL
Tutorial |
300 |
Sams |
While
creating the Author table, I also split
author name down into separate
first name and last name columns, in order to follow the requirement
for
storing as little information as possible in a given column. Each table
has a
primary key value which can be used for joining tables together when
querying
the data. A primary key value must be unique with in the table (no two
books
can have the same ISBN number), and a primary key is also an INDEX,
which
speeds up data retrieval based on the primary key.
Defining
Relationships
As you can
see, while our data is now split up,
relationships between the
table have not been defined. There are various types of relationships
that can
exist between two tables:
- One to
(Zero or) One
- One to
(Zero or) Many
- Many to
Many
The
relationship between the Book table and the
Author table is a
many-to-many relationship: A book can have more than one author, and an
author
can write more than one book. To represent a many-to-many relationship
in a
relational database we need a third table to serve as a link between
the
two:
Book_Author
Table:
| ISBN |
Author_ID |
| 0672317842 |
1 |
| 0672317842 |
2 |
| 0672325845 |
1 |
| 0672325845 |
2 |
Similarly,
the Subject table also has a
many-to-many relationship with the
Book table, as a book can cover multiple subjects, and a subject can be
explained by multiple books:
Book_Subject
Table:
| ISBN |
Subject_ID |
| 0672317842 |
1 |
| 0672317842 |
2 |
| 0672325845 |
2 |
The
one-to-many relationship will be covered in
the next section. As you can
see, we now have established the relationships between the Book,
Author, and
Subject tables. A book can have an unlimited number of authors, and can
refer
to an unlimited number of subjects. We can also easily search for books
by a
given author or referring to a given subject.
In the
linking tables above the values stored
refer to primary key values
from the Book, Author, and Subject tables. Columns in a table that
refer to
primary keys from another table are known as foreign keys, and serve
the
purpose of defining data relationships. In database systems which
support
referential integrity, such as the InnoDB storage engine for MySQL,
defining a
column as a foreign key will also allow to database software to enforce
the
relationships you define. For example, with foreign keys defined, the
InnoDB
storage engine will not allow you to insert a row into the Book_Subject
table
unless the book and subject in question already exist in the book and
subject
tables. Such systems will also prevent the deletion of books from the
book
table that have 'child' entries in the book_subject or book_author
tables.
Second
Normal Form
Where the
First Normal Form deals with redundancy
of data across a
horizontal row, Second Normal Form (or 2NF) deals with redundancy of
data in
vertical columns. As stated earlier, the normal forms are progressive,
so to
achieve Second Normal Form, your tables must already be in First Normal
Form.
Lets take another look at our new Book table:
Book Table:
| ISBN |
Title |
Pages |
Publisher |
| 0672317842 |
PHP
and MySQL Web Development |
867 |
Sams |
| 0672325845 |
MySQL
Tutorial |
300 |
Sams |
This table is
in First Normal Form; we do not
repeat the same data in any
two columns, and each column holds only the minimum amount of data.
However, in
the Publisher column we have the same publisher repeating in each row.
This is
a violation of Second Normal Form. As I stated above, this leaves the
chance
for spelling errors to occur. The user needs to type in the publisher
name each
time and such an approach is also inefficient in regards to storage
usage,
and the more data a table has the more IO requests are needed to scan
the
table, resulting in slower queries.
To normalize
this table to Second Normal Form, we
will once again break data
off to a separate table. In this case we will move publisher
information to
a separate table as follows:
Publisher
Table:
| Publisher_ID |
Publisher
Name |
| 1 |
Sams |
In this case
we have a one-to-many relationship
between the book table and
the publisher. A given book has only one publisher (for our purposes),
and a
publisher will publish many books. When we have a one-to-many
relationship,
we place a foreign key in the many, pointing to the primary key of the
one.
Here is the new Book table:
Book Table:
| ISBN |
Title |
Pages |
Publisher_ID |
| 0672317842 |
PHP
and MySQL Web Development |
867 |
1 |
| 0672325845 |
MySQL
Tutorial |
300 |
1 |
Since the
Book table is the 'many' portion of our
one-to-many relationship,
we have placed the primary key value of the 'one' portion in the
Publisher_ID
column as a foreign key.
The other
requirement for Second Normal Form is
that you cannot have any data
in a table with a composite key that does not relate to all portions of
the
composite key. A composite key is a primary key that incorporates more
than
one column, as with our Book_Author table:
Let's say we
wanted to add in tracking of reviews
and started with the
following table:
Review Table:
| ISBN |
Reviewer_ID |
Review_Text |
Reviewer_Name |
| 0672325845 |
1 |
What a
great book! I learned a lot! |
Mike
Hillyer |
In this table
the combination of Reviewer_ID and
ISBN form the primary key
(a composite key), ensuring that no reviewer writes more than one
review for
the same book. In this case the reviewer name is only related to the
Reviewer_ID, and not the ISBN, and is therefore in violation of Second
Normal
Form.
In this case
we solve the problem by having a
separate table for the data
that relates to only one part of the key. If the portion of the primary
key
that the field relates to is a foreign key to another table, move the
data
there (as in this situation where the Reviewer_ID will be a separate
table):
Reviewer
Table:
| Reviewer_ID |
Reviewer_Name |
Reviewer_House |
Reviewer_Street |
Reviewer_City |
Reviewer_Province |
Reviewer_PostalCode |
| 1 |
Mike
Hillyer |
123 |
Main
Street |
Calgary |
Alberta |
T1S-2N2 |
Review Table:
| ISBN |
Reviewer_ID |
Review_Text |
| 0672325845 |
1 |
What a
great book! I learned a lot! |
Third Normal
Form
I have a
confession to make; I do not often use
Third Normal Form. In Third
Normal Form we are looking for data in our tables that is not fully
dependant
on the primary key, but dependant on another value in the table. In the
reviewer table above the Reviewer_Street and Reviewer_City fields are
really
dependant on the Reviewer_PostalCode and not the Reviewer_ID. To bring
this
table into compliance with Third Normal Form, we would need a table
based on
Postal Code:
PostalCode
Table:
| Postal_Code |
Street |
City |
| T1S-2N2 |
Main
Street |
Calgary |
Now of course
this new table violates Second
Normal Form as the Street and
City will be vertically redundant, and must be broken off to separate
Street
and City tables, and Province will need to be in it's own table which
the City
table will refer to as a foreign key.
The point of
the last example is that
Normalization is a tradeoff. In fact,
there are two additional normal forms that are generally recognized,
but are
mainly academic. Additional normalization results in more complex JOIN
queries,
and this can cause a decrease in performance. In some cases performance
can
even be increased by de-normalizing table data, but de-normalization is
beyond
the scope of this article.
Conclusion
It is
generally a good idea to make sure that your
data at least conforms to
Second Normal Form. Our goal is to avoid data redundancy to prevent
corruption
and make the best possible use of storage. Make sure that the same
value is not
stored in more than one place. With data in multiple locations we have
to
perform multiple updates when data needs to be changed, and corruption
can
begin to creep into the database.
Third Normal
Form removed even more data
redundancy, but at the cost of
simplicity and performance. In our example above, do we really expect
the City
and Street names to change very regularly? In this situation Third
Normal Form
still prevents misspelling of City and Street names, but it is up to
you to
find a balance between Normalization and Speed/Simplicity. In an
upcoming
article we will look at how to form queries that join our normalized
data
together.
Article keywords
|