Separating lists and normalizing data

By David J. Lake
2006-09-03

Understanding database normalization is one of the more important tasks to undertake for those of us who wish to become advanced users.

One of the ways in which new users make a mistake in not normalizing their data is to think that a comma separted list of values will be beneficial in looking up records or that it will save space.

Unfortunately that is usually not the case, and as the database scales the data becomes cumbersome to search effectively.

Let's take a look at how such a list is commonly used, how it has to be searched and what can be done to rectify this. We will use this as one of many steps towards database normalization as well as becoming a useful tool in your arsenal as a developer.

We'll use a very basic real estate example with a listing of property addresses and a list of property assets that would be commonly searched on when shopping for a new home.

First we will create a table and for the sake of brevity eliminate a number of common items.

create table properties (
property_id int auto_increment primary key,
address varchar(50),
assets text);

Now let's fill that table with data, using a comma separated list of values under the assets for each property:

insert into properties values
(1, '82 Fairington', 'pool,fireplace,side-split,garage,4 bedrooms,3 bathrooms,34 yrs old'),
(2, '10 Primrose','fireplace,two storey,garage,3 bedrooms,3 bathrooms,21 yrs old'),
(3, '2 Frontenac','carport,3 bedrooms,2 bathrooms,45 yrs old,bungalow'),
(4, '786 Ossington','3 bedrooms,1 bathrooms,100 yrs old,two storey');

Retrieving all the data isn't difficult of course, a simple select statement will take care of that:

select
property_id,
address,
assets
from properties
\G

*************************** 1. row ***************************
property_id: 1
    address: 82 Fairington
     assets: pool,fireplace,side-split,garage,4 bedrooms,3 bathrooms,34 yrs old
*************************** 2. row ***************************
property_id: 2
    address: 10 Primrose
     assets: fireplace,two storey,garage,3 bedrooms,3 bathrooms,21 yrs old
*************************** 3. row ***************************
property_id: 3
    address: 2 Frontenac
     assets: carport,3 bedrooms,2 bathrooms,45 yrs old,bungalow
*************************** 4. row ***************************
property_id: 4
    address: 786 Ossington
     assets: 3 bedrooms,1 bathrooms,100 yrs old,two storey
4 rows in set (0.00 sec)

But how about searching on the data? lets return all properties with 2 bathrooms.

select
property_id,
address,
assets
from properties
where assets like '%2 bathrooms%'
\G

*************************** 1. row ***************************
property_id: 3
    address: 2 Frontenac
     assets: carport,3 bedrooms,2 bathrooms,45 yrs old,bungalow
1 row in set (0.47 sec)

The result is the one property meeting the requirement of 2 bathrooms.

Even in the simplest search we have to use a like clause and, making the situation worse, because we don't know where in the list of assets the search term might occur we have to precede our search term with %.

The difficulty with that is that now you can't use an index to do your search. That's because when a search term begins with a % the index on the column is automatically ignored.

Another difficulty of course comes up when we want additional search terms. Let's search for all properties with 3 bathrooms and that are 2 storeys.

select
property_id,
address,
assets
from properties
where assets like '%3 bathrooms%'
and assets like '%two storey%'
\G

*************************** 1. row ***************************
property_id: 2
    address: 10 Primrose
     assets: fireplace,two storey,garage,3 bedrooms,3 bathrooms,21 yrs old

1 rows in set (0.48 sec)

So we could use an OR to get properties matching any of the assets, in addition we can use AND to get exact number of assets. But what if we wanted to show properties that contained a certain number, say 3 of a list of 5 desired assets?

Each of the rows of assets is only counted once, so you can't easily get a separate count of the assets.

So what do we do about this? Simply separate the comma separated lists and normalize our data. this also involves the creation of two other tables.

create table new_properties (
property_id int auto_increment primary key,
address varchar(50));

insert into new_properties values
(1, '82 Fairington'),
(2, '10 Primrose'),
(3, '2 Frontenac'),
(4, '786 Ossington');

we now create a table that holds the various types of assets, since we will want to reference these as a lookup table we will create an id column as well.

create table assets(
asset_id int auto_increment primary key,
asset varchar(20) unique key );

now we will insert a single row for each asset without duplication

insert into assets values
(1,'pool'),
(2,'fireplace'),
(3,'side-split'),
(4,'garage'),
(5,'4 bedrooms'),
(6,'3 bathrooms'),
(7,'34 yrs old'),
(8,'two storey'),
(9,'3 bedrooms'),
(10,'21 yrs old'),
(11,'carport'),
(12,'2 bathrooms'),
(13,'45 yrs old'),
(14,'bungalow'),
(15,'1 bathrooms'),
(16,'100 yrs old');

Lastly we will create what is known as a pivot table which gives a listing of the property_id and asset for each asset owned by the house. Something like this:

create table features(
pid int,
aid int,
primary key (pid,aid)
)

Note that we create a primary key over two columns here. The reasoning is twofold. One, you don't want to list the same asset for a house more than one time, and secondly a primary key over more than one column will still allow the optimizer to use this index when you are retrieving all the assets for a given property (i.e. when you need only the left-most column of the two-column index).

insert into features values
(1,1), (1,2), (1,3), (1,4), (1,5), (1,6), (1,7),
(2,2),(2,8),(2,4),(2,9),(2,10),(2,6),
(3,11),(3,9),(3,12),(3,13),(3,14),
(4,9),(4,15),(4,16),(4,8);

you could put all of the values above on a single line, or one on each line, when you input them. I separated them here by property so it was easy to recongize I had the correct number of assets for each property.

Now getting the data from the table is a little more complicated for something as simple as displaying all the assets, but you would write the query out once and then modify it accordingly for other queries. You will see the advantages of splitting up the tables as we go forward into the more complicated queries.

Our original query above shows the property id, the address and the list of assets for each property so let's do that.

select
property_id,
address,
group_concat(asset) as assets
from new_properties
inner join features
on property_id = pid
inner join assets
on
aid = asset_id
group by
property_id,
address
\G

*************************** 1. row ***************************
property_id: 1
    address: 82 Fairington
     assets: pool,fireplace,side-split,garage,4 bedrooms,3 bathrooms,34 yrs old
*************************** 2. row ***************************
property_id: 2
    address: 10 Primrose
     assets: fireplace,garage,3 bathrooms,two storey,3 bedrooms,21 yrs old
*************************** 3. row ***************************
property_id: 3
    address: 2 Frontenac
     assets: 3 bedrooms,carport,2 bathrooms,45 yrs old,bungalow
*************************** 4. row ***************************
property_id: 4
    address: 786 Ossington
     assets: two storey,3 bedrooms,1 bathrooms,100 yrs old
4 rows in set (0.47 sec)

Now you may be thinking "that's a lot of effort to get what we had with the single table", and you may be right but there are two things to note. First, you won't be typing in the query time after time, if you have a large query save it to a text file and then run source on the text file like this:

source myquery.txt

Second, and more importantly, your data is properly normalized and you will have no trouble scaling this.

Now back to the problem of searching for multiple assets. Remember we had trouble doing that from the table with the comma separated list. There are only a couple of changes to the code above to get what we need. First we will get a list of properties with ANY number of those assets:

select
property_id,
address,
group_concat(asset) as assets
from new_properties
inner join features
on property_id = pid
inner join assets
on
aid = asset_id
and (asset = '3 bathrooms' or asset = 'two storey')
group by
property_id,
address
\G

*************************** 1. row ***************************
property_id: 1
    address: 82 Fairington
     assets: 3 bathrooms
*************************** 2. row ***************************
property_id: 2
    address: 10 Primrose
     assets: 3 bathrooms,two storey
*************************** 3. row ***************************
property_id: 4
    address: 786 Ossington
     assets: two storey
3 rows in set (0.00 sec)

and now we add a HAVING clause in order to specify that we only want those properties with both 3 bedrooms that are type of two storey.

select
property_id,
address,
group_concat(asset) as assets
from new_properties
inner join features
on property_id = pid
inner join assets
on
aid = asset_id
and (asset = '3 bathrooms' or asset = 'two storey')
group by
property_id,
address
having count(*) = 2
\G

*************************** 1. row ***************************
property_id: 2
    address: 10 Primrose
     assets: 3 bathrooms,two storey
1 row in set (0.00 sec)

For those of you worried about all the extra key strokes, you can reduce the list of assets above to a more simple IN list, which is just a short form of rewriting all those OR clauses:

select
property_id,
address,
group_concat(asset) as assets
from new_properties
inner join features
on property_id = pid
inner join assets
on
aid = asset_id
and asset IN ('3 bathrooms','two storey')
group by
property_id,
address
having count(*) = 2
\G

*************************** 1. row ***************************
property_id: 2
    address: 10 Primrose
     assets: 3 bathrooms,two storey
1 row in set (0.00 sec)

The having clause is also helpful when you want to choose, say a house that has at least 3 out of 5 preferred features, (or any amount of course). To do that we can modify the query this way:

select
property_id,
address,
group_concat(asset) as assets
from new_properties
inner join features
on property_id = pid
inner join assets
on
aid = asset_id
and asset IN ('3 bathrooms', 'two storey', 'pool', 'garage','fireplace')
group by
property_id,
address
having count(*) >= 3
\G

*************************** 1. row ***************************
property_id: 1
    address: 82 Fairington
     assets: pool,fireplace,garage,3 bathrooms
*************************** 2. row ***************************
property_id: 2
    address: 10 Primrose
     assets: fireplace,garage,3 bathrooms,two storey
2 rows in set (0.00 sec)

Finally you may also weight your query using a CASE statement. Perhaps you want to choose 3 of those 5 assets but if there was a pool you might want to have that count double. The case statement says that when asset matches pool count it as double, otherwise count the assets just once. and of course the having clause changes so that we are using the alias, otherwise using count(*) would not give us the weighted total.

select
property_id,
address,
group_concat(asset) as assets,
sum(case when asset='pool' then 2 else 1 end) as assettotal
from new_properties
inner join features
on property_id = pid
inner join assets
on
aid = asset_id
and asset IN ('1 bathrooms', 'two storey', 'pool', 'car port','fireplace')
group by
property_id,
address
having assettotal >= 3
\G

*************************** 1. row ***************************
property_id: 1
    address: 82 Fairington
     assets: pool,fireplace
 assettotal: 3
1 row in set (0.09 sec)

Now having seen the benefits of splitting your data, as well as the efficiency of searching your data and the fact that you can weight your clauses, you will have a better understanding of the importance of data normalization.

I hope the explanation above explained the pitfalls of arranging your data without lists.


I'll add one last example because I see it discussd a lot. That is a list of numbers. Perhaps a list of players and a list of golf scores.

(1,3,4,4,5,4,6,4,4,7),
(2,4,5,5,3,3,5,3,6,5),
(3,4,4,4,4,4,4,4,4,4);

Now that was just 9 holes for three golfers, so not too hard to pick out the obvious, like these guys are pretty good golfers, and golfer #3 is very consistant.

But say I told you par for those 9 holes is 35. Would it be easy to calculate the scores above/below par for each golfer? How about which of the golfers did best on the fifth hole? Perhaps it is a skins game, who won how many holes?

Again, you are better to have three columns, one for the golfer id, one for the score, and one to indicate the hole being played. You would have 27 rows for the table and such calculations would be much easier as discussed in the real estate example above.