• (3 Pages)
  • +
  • 1
  • 2
  • 3

Relational Database Design - Normalization

#1 Atli  Icon User is offline

  • D.I.C Lover
  • member icon

Reputation: 3718
  • View blog
  • Posts: 5,986
  • Joined: 08-June 10

Posted 25 June 2010 - 09:25 PM

*
POPULAR

It's common for people new to relational databases to approach the design of the tables like they would a flat, Excel-like spreadsheet: to design the tables as if they were isolated, with all the data needed for a particular functionality crammed into a single table.

This is not how relational databases are meant to be used. Tables in a relational database are linked together, forming a web of sorts, where all the data needed for a single query can be spread over a great many tables, and a change in one table can have an affect on every other table in that database.

It is therefore extremely important that each an every table in the database be properly designed, not only in regards to the data that each table is mean to store, but also in respect to every other table in the "web". If you ever find yourself in need to expand your database, or refactor it in order to fix or improve your existing applications, a proper design can save you an enormous amount of time and effort.

Key concepts
Let me start by naming a few of the most important concepts in relational database design. A basic understanding of these will be required to understand the rest of the article.

  • Primary Key (PK)
    A column with a unique value for each row. Although not all database management systems (DBMS) require you to put a PK into each table, from a design perspective a PK is a requirement. No table should be without one.

  • Foreign Key (FK)
    These define relationships between tables. When you want a row in one table to be linked to a row in another table, you place a FK column in the child table and use the value of the parent row's PK as the value of the FK field.

  • Composite Key
    This is a key that is made up of more than one column. This is typically used when you want to prevent a table from using the same combination of values twice. For example, in a table that lists item prizes for shops, you would only want each shop to have a single price for each item. So, you create a FK for the shop and a FK for the item, and then you create a composite PK out of those two columns. This would cause the DBMS to forcefully restrict entries that would create rows where the combined values of these fields are duplicated. - This type of key is commonly used in N:M relationships. (Explained below... with visual aids.)

  • One-To-One (1:1) relationship
    A relationship between two tables, where a single row in one table is linked to a single row in another table.
    +------------+     +----------------+
    | person     |     | person_contact |
    +------------+     +----------------+
    | person_id  |1---1| person_id      |
    | first_name |     | email          |
    | last_name  |     | phone          |
    +------------+     +----------------+
    

    This type of relationship is practically non-existent in normalized relational designs. They exist mostly to get around limitations in databases like Access, where the number of column was limited, thus creating the need to split tables up. They are also sometimes used to optimize the performance of the database.

  • One-To-Many (1:N) relationship
    A relationship between two tables, where multiple rows in a child table can be linked to a single row in a parent table. For example:
    +------------+     +------------+
    | person     |     | country    |
    +------------+     +------------+
    | person_id  |  |-1| country_id |
    | name       |  |  | name       |
    | country_id |*-|  +------------+
    +------------+
    
    This is in fact the only "real" type of relationship in a relational database. (See the next point for the reasoning behind that assertion.)

  • Many-To-Many (N:M) relationship
    A relationship between two tables, where multiple rows in one table can be linked to multiple rows in another table. This type is "artificial" in a a way, because this kind of relationship can not be created directly between tables. To accomplish this type of relationship you need to create a third table; an intermediary table that contains FKs to both parents, linked via a set of 1:N relationships.
    +-----------+     +--------------+     +--------------+
    | shop      |     | prices       |     | product      |
    +-----------+     +--------------+     +--------------+
    | shop_id   |1-|  | product_id   |*---1| product_id   |
    | shop_name |  |-*| shop_id      |     | product_name |
    +-----------+     | price        |     +--------------+
                      +--------------+
    


Normalization
To help us properly design our tables we have a set of guidelines which, if followed properly, will help reduce the redundancy and chance of data corruption. We call this "Normalization".

There are several steps involved in normalizing a database. The steps are referred to as "Normal Forms" (abbreviated: NF). There are at least seven NF ranging from 1NF to 6NF. Each NF requires that the NF before it has also been satisfied. The spot between 3NF and 4NF is reserved for the BCNF (Boyce-Codd normal form), which was developed later as a slightly stronger version of the 3NF, to address certain shortcomings.

Tables that have reached the 3NF are generally considered "normalized". Specifically aiming for a higher level is unusual, but a table that is designed to be in 3NF is very likely also in the 5NF.

The first three Normal Forms aren't complex at all. They can take some getting used to, but after working with them for a while you may well find yourself developing 3NF compliant tables without even thinking about it.


The First Normal Form (1NF)
The first normal form is both the simplest and the most important of the three steps. It simply requires that tables must not contain repeating groups of data. This means that if you need to store multiple, identical pieces of data for a single entry (row) then you can not serialize them into a single field, or create multiple identical columns.

Consider the following example. It's a simple list of persons, where each person is listed with his/her name and phone numbers.
+----+------+--------------------+
| id | name | phone              |
+----+------+--------------------+
|  1 | Joe  | 588-5522,789-85522 | 
|  2 | Anna | 589-4567,987-12354 |
+----+------+--------------------+

Note that both entries store multiple phone numbers in a single field, separated by a comma. There are two major problems with this approach:
  • Your database management system (DBMS) regards each field as a single value, so it can not differentiate between the individual phone numbers. From your DBMS perspective, the phone field is just a normal string containing a single value. - This complicates the use of that data, especially when you need to do things like search for or extract a single value, or filter a data-set based on a single value in the field.
  • It also means you need to manually manage the data; to write the code that separates the values when they are retrieved and constructs the string when they are inserted. This makes it FAR more likely that your data will become corrupt or incompatible between applications. The DBMS has NO control over how the data is stored. It simply considers it a single string, and stores it as such. The internal structure and integrity of the data is completely up to you.

To remedy the situation, you would need to separate the data into individual fields. Your first instinct might be to simply create multiple "phone" columns, like so:
+----+------+----------+-----------+
| id | name | phone1   | phone2    |
+----+------+----------+-----------+
|  1 | Joe  | 588-5522 | 789-85522 |
|  2 | Anna | 589-4567 | 987-12354 |
+----+------+----------+-----------+

But this is NOT an acceptable solution. It does solve both of the problems I listed above, but it creates a new problem.

Namely that now we have restricted each person to two phone numbers and ONLY two phone numbers. What if a person needs to store three numbers? Using this table, the only way would be to add a third "phone" column, which would also add a third number to ALL other persons in the table. (What if we need to store a hundred phone numbers? Or a thousand?)

What we want to do is allow each person to supply as many phones as that person needs. To allow for that, we need to extract the phone numbers from that table altogether and put it into a new table dedicated to listing phone numbers. The design for that may look something like this:
+-----------+     +--------------+
| person    |     | phone_number |
+-----------+     +--------------+
| person_id |1-|  | phone_id     |
| name      |  |-*| person_id    |
+-----------+     | number       |
                  +--------------+

There each row in the phone_number table contains a column with the ID of a person. This column identifies the person who's number this is. A relationship like this is referred to as a One-To-Many (1:N) relationship, because each row in the parent table (person) can be linked to multiple rows in the child table (phone_number), but not the other way around.

The person_id column in the phone_number table is what we call a Foreign Key. It is an indication that the value in the column is meant to reference the value of another table. In many cases the DBMS will enforce this link and reject phone_number entries that do not provide a person_id that exists in the person table.


The Second Normal Form (2NF)
This requires that no field should only be partially dependent on any candidate key in the table. This does not only include the PK, but any fields combinations that would uniquely identify a row.

Consider this design:
+---------------+
| prices        |
+---------------+
| price_id (PK) |
| product       |
| shop          |
| unit_price    |
| qty           |
| shop_address  |
| unit_weight   |
+---------------+

The price_id column is the PK there, but because the combined values of the product and shop columns could also act as a composite PK, together they are considered a "candidate key".

Lets look at a few example entries into that table.
+----------+---------+--------+---------+------+--------------+------------+
| price_id | product | shop   | u_price | qty  | shop_address |unit_weight |
+----------+---------+--------+---------+------+--------------+------------+
|        1 | Beer    | Bob's  |    9.50 | 12.0 | Main Road 5  |      15.00 |
|        2 | Pepper  | Bob's  |   19.50 |  2.5 | Side Road 10 |       2.00 |
|        3 | Beer    | Jill's |    3.50 |  6.0 | Main Steet 1 |       1.50 |
|        4 | Pepper  | Jill's |    8.50 | 30.0 | Main Road 1  |      20.00 |
|        5 | Salt    | Jill's |   27.50 | 3.14 | Main Road 10 |     250.00 |
+---------+---------+--------+---------+------+--------------+-------------+


The problem becomes apparent when we examine the values of the shop_address and unit_weight fields against the above mentioned candidate key. The shop_address values should be identical in all rows with the same shop value, and the unit_weight should be identical for the same product values. However, this design allows us to specify different shop addresses and unit weights for the same shops and products, which in reality makes no sense.

So, to turn this into a 2NF design, you move these fields out of the table into their own tables, creating FKs to link the new tables to the main sale table.
+-----------+     +--------------+     +--------------+
| shop      |     | prices       |     | product      |
+-----------+     +--------------+     +--------------+
| name (PK) |1-|  | price_id     |  |-1| name (PK)    |
| address   |  |  | product      |*-|  | weight       |
+-----------+  |-*| shop         |     +--------------+
                  | unit_price   |
                  | qty          |
                  +--------------+

This ensures that all prices entries will be linked to a proper address and weight values.

The Third Normal Form (3NF)
The last of the forms needed for a database to be considered normalized. It requires that columns should depend only upon the primary key of the table. Basically what that means is that any column that is not solely dependent on the primary key of this table, or only partially, should be moved out of the table.

Lets look at two examples:

First, if we add a "country" column to the "persons" table we created for the 1NF.
+-----------+------+----------+
| person_id | name | country  |
+-----------+------+----------+
|         1 | Joe  | France   |
|         2 | Anna | England  |
+-----------+------+----------+

This looks fine at first glance, but if you look closer, the name of the country does not belong in this table.

Consider what happens if you are asked to list ALL possible countries. (E.g. for a drop-down box on a website.) The problem with that is, because the countries are dependent on a person to exist, the design will only allow us to list countries of people that exist in the database. No country that does not have a representative in the person table can exist in the database.

The solution for this is to move the country out of the table, into it's own table, and add a Foreign Key to the person table that references the country the person belongs to:
+------------+     +------------+
| person     |     | country    |
+------------+     +------------+
| person_id  |  |-1| country_id |
| name       |  |  | name       |
| country_id |*-|  +------------+
+------------+

Now you can list all the countries in the country table, and just link the persons to their respective countries.


Second, to address another common 3NF conformance issue, if we wanted to add more than just the country name for each person. If we were to add the city and address info as well, it might looks something like:
+-----------+------+----------+--------+------------------+
| person_id | name | country  | city   | address          |
+-----------+------+----------+--------+------------------+
|         1 | Joe  | France   | Paris  | Eiffel blwd. 101 |
|         2 | Anna | England  | London | English road 302 |
+-----------+------+----------+--------+------------------+

This is basically the same problem we had with the previous example, but we have added two more columns, neither of which belong in this table either. So, lets try to apply the same solution as last time, but this time move all three location items into a location table, and link the location to the person:
+-------------+     +-------------+
| person      |     | location    |
+-------------+     +-------------+
| person_id   |  |-1| location_id |
| name        |  |  | country     |
| location_id |*-|  | city        |
+-------------+     | address     |
                    +-------------+

This is better, but can you spot the problem?

As you remember, to reach 3NF no field can depend on anything but the PK of the table. But if you look at the location table, notice that "address" depends not only on the PK, but on the "city" field as well. (If the city field changes, so must the address field). The same applies to the city field; it depends on the country field.

This means that, in order to reach 3NF, you need to move country, city and address all into their own tables, each linked to each other. The address table should be linked to the person, the city table linked to the address table, and finally the country table linked to the city table.
+-------------+     +------------+
| person      |     | address    |
+-------------+     +------------+
| person_id   |  |-1| address_id |
| location_id |*-|  | city_id    |*-|
| name        |     | name       |  |
+-------------+     +------------+  |
                                    |
+------------+     +------------+   |
| country    |     | city       |   |
+------------+     +------------+   |
| country_id |1-|  | city_id    |1--|
| name       |  |-*| country_id |
+------------+     | name       |
                   +------------+

Now this design is in 3NF, and could be considered "normalized".

Epilogue
It's should be mentioned that in some cases, putting your tables into 3NF can have a negative effect on the performance and usability of the table. Like, for example, in the latest example. Before normalizing the table, you could query all the data from the single table with a simple SELECT query. After normalizing it, fetching the exact same result requires three JOINs.

My point is this; normalization is a design goal. It may not always be the practical solution. It may in fact be impractical in certain situations to design fully normalized tables. If performance and resource conservation matters a great deal more to you than data integrity, then you may well be better of aiming for a lower NF level. I would advice you to ALWAYS make sure you tables are in 1NF, but beyond that you must judge the situation for yourself.

I can only say that in 99% of cases, you are be better of at least trying to go for the 3NF. It may spare you some otherwise wasted efforts later on.

In any case, I hope this has been helpful.

All the best,
- Atli

Edit, 2012-02-11 - Improved the section about the 2NF. The purpose of that rule should be clearer this way.
Edit, 2012-06-21 - Had a couple of the relationships in the 3NF diagrams reversed. Thanks to thava for pointing that out!

Is This A Good Question/Topic? 31
  • +

Replies To: Relational Database Design - Normalization

#2 Sethro117  Icon User is offline

  • Still the sexiest mofo.
  • member icon

Reputation: 236
  • View blog
  • Posts: 2,378
  • Joined: 14-January 09

Posted 26 June 2010 - 10:46 AM

Great tutorial.
Was This Post Helpful? 0
  • +
  • -

#3 dorknexus  Icon User is offline

  • or something bad...real bad.
  • member icon

Reputation: 1256
  • View blog
  • Posts: 4,618
  • Joined: 02-May 04

Posted 26 June 2010 - 03:48 PM

Very nice. Normalization is often overlooked and can cause a lot of headaches.
Was This Post Helpful? 0
  • +
  • -

#4 westmatrix99  Icon User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 20
  • Joined: 29-September 07

Posted 20 July 2010 - 08:30 AM

I do like this tut, but it's missing the actual SELECT statements used to get the results...
Any way of posting those?
Was This Post Helpful? 0
  • +
  • -

#5 Atli  Icon User is offline

  • D.I.C Lover
  • member icon

Reputation: 3718
  • View blog
  • Posts: 5,986
  • Joined: 08-June 10

Posted 22 July 2010 - 06:54 PM

View Postwestmatrix99, on 20 July 2010 - 02:30 PM, said:

I do like this tut, but it's missing the actual SELECT statements used to get the results...
Any way of posting those?

The exact syntax of those aren't the same for all databases, and I didn't want to tie this down to a specific database. All the example layouts in the article can be queried with a few simple JOINs though.

However, I'll post a couple of ANSI compliant examples. You may have to change this slightly to fit your database, but these should work almost unaltered for the most popular choices.

For the 1NF example, this would give you a list of phone numbers for a given person's name:
SELECT
    phone_id,
    number
FROM phone_number n
JOIN person p
    on p.person_id = n.person_id
WHERE
    p.name = "John Doe";


For the 2NF example, this would select relevant data for a single shop, identified by it's ID:
SELECT
    sale.quanity,
    shop.shop_name,
    product.product_name,
    product.price
FROM sale
LEFT JOIN product
    ON sale.product_id = product.product_id
LEFT JOIN shop
    shop.shop_id = sale.shop_id
WHERE
    sale.sale_id = 1


For the last 3NF example, this would select a persons location info based on his/her name:
SELECT
    a.name AS 'adress',
    ci.name AS 'city',
    co.name AS 'country'
FROM person p
LEFT JOIN address a
    ON p.location_id = a.address_id
LEFT JOIN city ci
    ON ci.city_id = a.city_id
LEFT JOIN country co
    ON co.country_id = ci.country_id
WHERE
    p.name = 'John Doe';


Hope that's what your looking for :)
Was This Post Helpful? 2
  • +
  • -

#6 Guest_fritz the cat*


Reputation:

Posted 26 August 2010 - 05:36 AM

Very nice and clear tutorial, thanks
Was This Post Helpful? 0

#7 Guest_Joe Johnson*


Reputation:

Posted 05 November 2010 - 11:49 AM

I just wanted to say thanks for taking the time to write such a great tutorial; unfortunately, in my experience, database design is one of the most often over-looked aspects of development. No matter what language you're learning (PHP, ColdFusion, etc.) it is always assumed that the "student" already has a basic understanding of RDMS, foreign keys and primary keys.

It took me a long time (being self-taught) to really understand DB design principles; this article helped clear up a few things for me, and I'm sure it will be inspirational to others, as well.

With the author's permission, may I use your article in my video tutorials when I cover databases? Obviously, I'll give you credit. :)

Thanks again.


Sincerely,



Joe Johnson
Web Development Videos.com
Was This Post Helpful? 1

#8 Atli  Icon User is offline

  • D.I.C Lover
  • member icon

Reputation: 3718
  • View blog
  • Posts: 5,986
  • Joined: 08-June 10

Posted 05 November 2010 - 03:26 PM

View PostJoe Johnson, on 05 November 2010 - 06:49 PM, said:

With the author's permission, may I use your article in my video tutorials when I cover databases? Obviously, I'll give you credit. :)

Hey Joe.

Sure, feel free. If it can be of use to you viewers, I think that's great. :)
Was This Post Helpful? 1
  • +
  • -

#9 Mr.T  Icon User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 8
  • Joined: 09-October 10

Posted 29 January 2011 - 08:47 AM

Great tutorial, its just what i was looking for, I can see fully the benefits of the 1nf example, and it fits exactly what I need for my project, how ever I cant see how to write the info into mysql, I have a web page for ppl to input their details and input it to the db (all working ok) but dont know how to get their id from that table to add it to the next table for the relay legs they will ride (can be nay number of legs from 1 to 12).
Was This Post Helpful? 0
  • +
  • -

#10 nK0de  Icon User is offline

  • Catch me As Exception
  • member icon

Reputation: 205
  • View blog
  • Posts: 823
  • Joined: 21-December 11

Posted 22 January 2012 - 09:51 AM

I have a test on RDBMS coming up. Though I'm pretty much okay with queries, I suck at designing databases. :/ I was looking for a good tutorial on Relational Database Designing all over the web but didn't know DIC had a good tutorial! great tutorial Atli :) thanks
Was This Post Helpful? 1
  • +
  • -

#11 roohafza  Icon User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 2
  • Joined: 10-February 12

Posted 10 February 2012 - 09:57 PM

Hi there
Very nice topic, but I have a question

What is your idea about this below.

+-----------+------+------------+---------+------------------+
| person_id | name | country_id | city_id | address          |
+-----------+------+------------+---------+------------------+
|         1 | Joe  | 1          | 6       | Eiffel blwd. 101 |
|         2 | Anna | 2          | 7       | English road 302 |
+-----------+------+------------+---------+------------------+

+-------------+     +------------+
| person      |     | country    |
+-------------+     +------------+
| person_id   |  |-*| country_id |
| name        |  |  | name       |
| country_id  |1-|  +------------+
| city_id     |1-|  
| address     |  |
+-------------+  |   
                 |                   
+------------+   | 
| City       |   | 
+------------+   | 
| city_id    |*--|
| name       |
+------------+

Was This Post Helpful? 0
  • +
  • -

#12 Atli  Icon User is offline

  • D.I.C Lover
  • member icon

Reputation: 3718
  • View blog
  • Posts: 5,986
  • Joined: 08-June 10

Posted 11 February 2012 - 02:17 AM

Hey, roohafza.

In that design, in the person table, the address field depends on the city_id field, and the city_id field depends on the country_id field. Both of those would therefore violate the 3NF.

Also consider that the City table entries are not linked to a country entry, so the only way to find out the country each city belongs to is to search for the non-link* in the person table. This means that City entries that have no representative in the person table can not be linked to a country.

* I say "non-link" because there is actually no relationship between the city_id and country_id in that table. From the perspective of the database, they are no more related than any other two non-prime attributes.
Was This Post Helpful? 1
  • +
  • -

#13 roohafza  Icon User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 2
  • Joined: 10-February 12

Posted 11 February 2012 - 09:54 AM

Thanks
Was This Post Helpful? 0
  • +
  • -

#14 Che7tzs  Icon User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 7
  • Joined: 21-February 12

Posted 06 March 2012 - 05:17 AM

very helpful tutorials
Was This Post Helpful? 0
  • +
  • -

#15 v1sh23  Icon User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 1
  • Joined: 01-May 12

Posted 07 May 2012 - 09:27 PM

very useful. thanks :)
Was This Post Helpful? 0
  • +
  • -

  • (3 Pages)
  • +
  • 1
  • 2
  • 3