(adapted from an article on my UNIFACE page)
Every table in a relational database requires a primary key - a unique identity for each occurrence (row). This primary key is obtained from one or more fields (columns) that exist within the table's structure, and it is the value contained within the primary key field(s) that is used to identify each individual occurrence.
For each primary key the database will build an index. This contains the primary key values and a pointer to the corresponding data record. When searching for entries via a primary key value it is quicker to scan the index than the main data table. The performance benefit from using an index is balanced by the amount of disk space required to hold it and the extra processing required to maintain it.
An important fact to note about primary keys is that in most databases once created the value cannot be changed. If a change is required the only option is to create a new occurrence with the new primary key, then delete the original occurrence with the old primary key.
A review of this article published in http://r937.com/moresurrogatekeys.html quibbles over my statement that primary keys cannot be changed. In my defence I state the following:
As an example of a primary key consider a table called COUNTRY which contains items of data such as NAME, CURRENCY and LANGUAGE. It would be possible to use NAME as the primary key, but it is usual to use a shortened code rather than a long textual description.
primary key | |||
---|---|---|---|
COUNTRY_ID | COUNTRY_NAME | CURRENCY | LANGUAGE |
UK | United Kingdom | Sterling | English |
USA | United States of America | US Dollar | English |
FR | France | French Franc | French |
For objects such as countries it is commonplace to use the internationally recognised standard country codes as the primary key.
If a primary key contains more than one column none of those columns is allowed to be null. This is enforced by not allowing any column in a primary key to have the NOT NULL constraint.
A candidate key is a means of providing an additional unique key into a table. Candidate keys are entirely optional, so a table may contain none, one or several of them.
For example, in a payroll system the EMPLOYEE table may contain columns for PAYROLL_NO and NATIONAL_INSURANCE_NO, both of which are unique to each employee. If both of these were required as lookups into the table only one could be defined as the primary key, so the other would have to be defined as a candidate key.
Like primary keys each candidate key will require additional resources for indexing purposes.
The advantage of a candidate key over a primary key is that its value may be changed, although not to one that already exists.
If a candidate key contains more than one column then some databases, not all, will allow some of those columns to be nullable, and if any column contains NULL the key will not be stored. The key will only be stored and validated against if all the columns contain NOT NULL values.
Where two database entities are involved in a one-to-many relationship a column in the 'many/child' entity is linked to the primary key of the 'one/parent' entity. This column in the 'many/child' entity is referred to as a FOREIGN KEY, as shown is the following example:
primary key | foreign key | |
---|---|---|
COMPANY_ID | COMPANY_NAME | COUNTRY_ID |
MS | Microsoft | USA |
IBM | International Business Machines | USA |
ICL | International Computers Ltd | UK |
A foreign key must exist between two tables before those tables can be the subjects of a JOIN in a SELECT statement.
You should not attempt to use a field as a foreign key where the corresponding field on the foreign table is not unique as this would violate the one-to-many principle. This could result in a child record having multiple parents. Although it is possible to link a foreign key to a candidate key this is not recommended as it is possible to change the value of a candidate key. As this does not automatically update the corresponding foreign key values in all child rows you would effectively break the relationship with those children.
Note that for a foreign key an index is optional.
This is a key for which the possible values have an obvious meaning to the user or the data. For example, a semantic primary key for a COUNTRY entity might contain the value 'USA' for the occurrence describing the United States of America.
This is a key for which the possible values have no obvious meaning to the user or the data. Technical primary keys let you enforce the uniqueness property within a table while making it easy to generate values invisibly and automatically. A typical method of generating values is to take the next number from a named sequence which is maintained by the database engine using code similar to the following:
SELECT <seq_name>.NEXTVAL FROM DUAL
In a database which does not have sequence numbers you could use something similar to the following:
SELECT max(id) FROM <tablename>
Remember that the result of this statement will get bigger and bigger as time goes on, so the field which holds the value must be large enough so as not to overflow in the foreseeable future. I have seen an interface definition of 'N10' used on more than one occasion for this purpose.
In some cases the columns within a table do not readily provide a unique value that can be used as a primary key. Take the example of a PERSON table which contains the following columns:
first_name, last_name, initials, sex, date_of_birth, address.
Which column, or combination of columns, could be guaranteed to provide a unique reference? Remember that a primary key, once assigned, cannot be changed. Once you exclude those columns that may be changed during a person's lifetime not much is left, certainly not enough to guarantee uniqueness.
In such a case as this the solution would be to create a new column (e.g. PERSON_ID) for use as either a semantic or technical key. If the value needed to be remembered for later use, such as a login id to the main computer system, then a semantic key may be more appropriate - a person is more liable to remember a short text string such as 'AJM' or 'TONY27' than a random 10 digit number.
It is possible to have a primary key that is comprised of more than 1 column. This is known as a COMPOUND or CONCATENATED key. If the number of columns required is too great, or the combined length is rather large, it is usual practice to insert a technical key. However, if you still need to create a compound candidate key to help prevent duplicate values then an additional technical key is superfluous.
Where a table has a semantic key it is not unknown for the user to want to change the value of that key. As explained previously this is just not possible with a primary key. Although the occurrence can be deleted and recreated with the new value this is no good if there are large numbers of occurrences on other tables which use the original value as a foreign key. The overhead of deleting and recreating all those occurrences could be enormous.
The solution here is to make the user-defined value into a candidate key and insert a new column for use as a technical primary key. Provided that all related occurrences are linked to the unchanging primary key then no relationship will be broken if the candidate key were to be changed.
There are some people who think that technical keys are such a good idea that they insist on putting them on every table without further thought and regardless of the consequences. Although there are situations where the use of technical keys can provide positive benefits, there are also situations where they may actually produce negligible or even negative results. The following sections identify several instances where I think that the inclusion of technical keys is ever-so-slightly questionable.
Where an entity already has a recognisable semantic key which is to be used as a foreign key in related entities there is no advantage in adding a technical key. Take the following example:
primary key | candidate key | |
---|---|---|
COUNTRY_ID | COUNTRY_CODE | COUNTRY_NAME |
1 | UK | United Kingdom |
2 | USA | United States of America |
3 | FR | France |
Where users are likely to be familiar with the semantic key values and have the ability to enter them directly they are unlikely to be as familiar with a random number. Everybody knows that a COUNTRY_CODE of 'USA' refers to the United States of America, but who can say the same for the number 2?
A relationship between two entities is defined within the application model as a one-to-many, where the number of occurrences of the 'many' entity may be zero, one or more. If a particular implementation requires that there never be more than one occurrence of the 'many' entity there is no apparent way that this restraint can be defined in the database schema. Take the following example:
|
|
This particular structure does not disallow the possibility of adding more than one occurrence of the 'many' entity for each occurrence of the 'one' entity. If software is released which does not enforce this non-obvious restriction the results can be, shall we say, interesting?
I have this (perverse?) expectation that if a database is designed in order to satisfy certain business rules, then it should be possible to reverse engineer the database design in order to recreate those rules. This is not possible with the above implementation. My preferred approach to this situation, one that I have used with complete success on more than one occasion, is to make the primary key of 'many' exactly the same as the primary key of 'one'. The primary key of 'many' can therefore serve a dual purpose as the foreign key back to 'one', as demonstrated in the following example:
|
|
As the primary key and foreign key are now one and the same there is no need for a separate index on the foreign key. This saves both disk space and processing time.
This method makes it obvious to all concerned that only one occurrence of 'many' should exist for each occurrence of 'one', and indeed makes it impossible to do otherwise. However, it does upset some people because it violates their rule that each database table has its own number sequence for generating the primary key. As this rule does not exist in any database system that I have ever encountered I consider it to be artificial, and because it produces inferior results I choose to ignore it.
In this example the 'many' entity does not have a semantic key - the primary key is purely arbitrary as each occurrence is just one of many that may be owned by an occurrence of the 'one' entity. This is typically implemented using the following structure:
'many' entity | ||
primary key | foreign key | |
---|---|---|
PKEY_MANY | PKEY_ONE | DATA |
27 | 104 | stuff for 104,27 |
34 | 104 | stuff for 104,34 |
35 | 208 | stuff for 208,35 |
56 | 335 | stuff for 335,56 |
Note the following disadvantages:
Now consider an alternative structure. Here the value for SEQ_NO is not unique for the table, but is unique for each value of PKEY_ONE. It is obtained by code similar to the following:
ONE.LAST_SEQ_NO = ONE.LAST_SEQ_NO + 1 MANY.SEQ_NO = ONE.LAST_SEQ_NO
'many' entity | ||
primary key | ||
---|---|---|
foreign key | ||
PKEY_ONE | SEQ_NO | DATA |
104 | 1 | stuff for 104,1 |
104 | 2 | stuff for 104,2 |
208 | 1 | stuff for 208,1 |
335 | 1 | stuff for 335,1 |
Note the following advantages:
An alternative way of finding out the current highest value for SEQ_NO would be to use a SELECT
statement, as in the following example:
$query = "SELECT max(seq_no) FROM 'many' WHERE (many.pkey_one = one.pkey_one)"; $count = $this->getCount($query); many.seq_no. = $count + 1
As the columns PKEY_ONE and SEQ_NO together form the primary key, this particular SELECT
statement can be satisfied by a quick look at the index without the need to trawl through large numbers of data records.
In this example each occurrence of 'many' is only valid for a finite amount of time, therefore it must include a field to hold a date or time stamp. This is a structure that I have seen somebody else implement:
'many' entity | |||
primary key | candidate key | ||
---|---|---|---|
foreign key | |||
PKEY_MANY | PKEY_ONE | START_DATE | DATA |
27 | 104 | 01-Jan-2000 | stuff for 104 |
34 | 104 | 01-Nov-2000 | stuff for 104 |
35 | 208 | 01-Jan-2000 | stuff for 208 |
56 | 335 | 01-Jan-2000 | stuff for 335 |
The candidate key is there (apparently) to ensure that an instance of PKEY_ONE does not have more than 1 occurrence for a particular date. However, the program still has to read in the occurrence with the highest date to ensure that the start date for a new occurrence is later, not earlier.
This structure allows entries to be created with dates in the future, therefore the procedure for identifying which occurrence is valid for a particular date is a little convoluted (and achieved with a database view):
This structure allows START_DATE to be amended, but the following rules have to be obeyed:
Here is an alternative structure that I have implemented on numerous occasions:
'many' entity | ||||
primary key | ||||
---|---|---|---|---|
foreign key | ||||
PKEY_ONE | SEQ_NO | DATA | START_DATE | END_DATE |
104 | 1 | stuff for 104 | 01-Jan-2000 | 31-Oct-2000 |
104 | 2 | stuff for 104 | 01-Nov-2000 | 31-Dec-9999 |
208 | 1 | stuff for 208 | 01-Jan-2000 | 31-Dec-9999 |
335 | 1 | stuff for 335 | 01-Jan-2000 | 31-Dec-9999 |
The advantages of the SEQ_NO field have been explained in the previous section.
The procedure for identifying which occurrence applies to today's date is a little simpler, and can be done without the need for a database view:
SELECT ... WHERE (start_date <= 'target_date' & end_date >= 'target_date')
Note that this will not work if undefined end dates are left as null. I always use a dummy date to simulate 'sometime in the future', as explained in Dealing with null End Dates.
This structure does require a little more effort to maintain, but I think the results are worth it. For example, when adding a new occurrence the following rules must be obeyed:
When amending an existing occurrence the following rules apply:
If you are being efficient and modern and using a 3 tier architecture then these rules can be defined in a single component in the business layer rather than being duplicated in numerous components in the presentation layer, but that's another story.
In this situation there are several groups of lookup codes, such as country codes, organisation types, address types, etc. Instead of having a separate database table for each group they are all lumped together on a single table with a group identifier to separate them. A big advantage of this structure is that new groups can be added very easily without having to amend the database structure. However, a disadvantage is that it makes the actual lookups a bit more complicated. Exactly how much more complicated depends on the inclusion of the all-pervasive technical primary key. Here is a structure that I saw somebody else design fairly recently:
It starts with a table that defines the various code groups:
'one' entity | ||
primary key | candidate key | |
---|---|---|
ID | GROUP_ID | GROUP_DESC |
1 | COUNTRY | Country Code |
2 | ORGTYPE | Organisation Type |
3 | ADDRTYP | Address Type |
The table that contains the actual codes looks like this:
'many' entity | |||
primary key | foreign key | ||
---|---|---|---|
candidate key | |||
ID | GROUP_ID | CODE_ID | CODE_DESC |
1 | COUNTRY | USA | United States of America |
2 | COUNTRY | UK | United Kingdom |
3 | ADDRTYP | UK | Unknown |
This is how we do things on the planet Earth...
|
|
This is a relationship between two entities (for example, X and Y), where more than one occurrence of Y can exist for each occurrence of X, and more than one occurrence of X can exist for each occurrence of Y. This can be represented in the following diagram:
|
|
|
As it is not possible to define a many-to-many relationship directly the solution is to create an additional link or cross reference entity which can act as the 'many' entity in a pair of one-to-many relationships, as shown in the following diagram:
|
|
|
|
|
This link entity requires as a minimum a foreign key for entity X and a foreign key for entity Y. If you insist on a technical primary key the resulting structure looks like this:
primary key | candidate key | |
---|---|---|
foreign key | foreign key | |
PKEY_LINK | PKEY_X | PKEY_Y |
1 | X1 | Y6 |
2 | X9 | Y42 |
3 | X3 | Y27 |
If there is not a unique key which combines the two foreign keys it will not be easy to check that a particular X+Y pair already exists, and without this check it will be possible to create duplicate entries. In the above structure this can only be implemented as a candidate key. If you think about it carefully you should see that all requirements could be satisfied by dropping the PKEY_LINK column and converting the candidate key into the primary key. This removes 1 redundant column and 1 redundant index, producing a simplified structure as follows:
primary key | |
---|---|
foreign key | foreign key |
PKEY_X | PKEY_Y |
X1 | Y6 |
X9 | Y42 |
X3 | Y27 |
This structure has the problems of a standard one-to-many relationship as highlighted previously, but is compounded by each additional level. Examine the following diagram:
|
|
|
|
|
Here there is a one-to-many relationship between 'top' and 'middle', and another one-to-many relationship between 'middle' and 'bottom'.
'top' entity | |
primary key | |
---|---|
PKEY_TOP | DATA |
104 | 104 data |
208 | 208 data |
335 | 335 data |
'middle' entity | ||
primary key | foreign key | |
---|---|---|
PKEY_MIDDLE | PKEY_TOP | DATA |
27 | 104 | 104,27 data |
35 | 208 | 208,35 data |
56 | 335 | 335,56 data |
'bottom' entity | ||
primary key | foreign key | |
---|---|---|
PKEY_BOTTOM | PKEY_MIDDLE | DATA |
1001 | 27 | 104,27,1001 data |
1002 | 35 | 208,35,1002 data |
1003 | 56 | 335,56,1003 data |
With structure it will not be possible to satisfy the request 'select BOTTOM where PKEY_TOP = 104' without traversing through the MIDDLE entity. If there is more than one entity in the middle then this will add further steps in the journey. Now look at an alternative structure:
'middle' entity | ||
primary key | ||
---|---|---|
foreign key | ||
PKEY_TOP | MIDDLE_SEQ | DATA |
104 | 1 | 104,1 data |
104 | 2 | 104,2 data |
208 | 1 | 208,1 data |
56 | 1 | 335,1 data |
'bottom' entity | |||
primary key | |||
---|---|---|---|
foreign key | foreign key | ||
PKEY_TOP | MIDDLE_SEQ | BOTTOM_SEQ | DATA |
104 | 1 | 1 | 104,1,1 data |
104 | 1 | 2 | 104,1,2 data |
104 | 2 | 1 | 104,2,1 data |
104 | 2 | 2 | 104,2,2 data |
208 | 1 | 1 | 208,1,1 data |
335 | 1 | 1 | 335,1,1 data |
In this structure the request 'select BOTTOM where PKEY_TOP = 104' can be satisfied without the need to navigate through any intermediate tables, therefore would be quicker to process.
I once joined a project where some bright spark had the brilliant idea of giving every technical key the name 'ID'. When asked why he explained that when building the low-level component which actually retrieved records from the database it was not necessary to change the inherited code as the name of the primary key was always 'ID', and could therefore be hard coded.
This bright spark was obviously not an experienced programmer as he was unfamiliar with the various methods by which the names of the primary key field(s) can be obtained.
A problem begins to appear in relationships - if the primary key is called 'ID' then a foreign key cannot also be called 'ID', therefore something will have to change. The solution was to change the foreign key name to 'somethingelse_ID', where 'somethingelse' was an abbreviation of the entity name. Those of you with more than two brain cells to rub together will notice that this results in a field which is named one thing on one entity but something else on another. This is a practice that gives me the heebie-jeebies.
This becomes a problem when you attempt to use standard generic code to pass context from one component to another. A typical scenario that I use all the time is to have a component that lists multiple occurrences from table 'A', and within this component the user can select one particular occurrence and jump to another component which shows related occurrences from table 'B'. Using generic code it is a simple matter to extract the primary key of the occurrence selected in the first component and pass it as a string in the format "column='value'" to the second component which simply includes it as the WHERE clause in an sql SELECT statement. If the column names are different between the two tables then you have to interrupt the generic code with some sort of translation mechanism. If a table is involved in more than one relationship this will usually involve having to know where the "column='value'" string came from in order to carry out the correct translation. For example, if the string "id=209" can be passed from tables TOM, DICK and HARRY then you will need to know which one in order to translate "id" into "tom_id", "dick_id" or "harry_id". Just think of how much simpler it would be if the passed string could be used "as-is" without the need for any translation. It makes no difference to the database what the column names are called, but if program code has to be inserted to translate column names then you are introducing an area of delay, expense and potential error. If an idea involves cost but has no tangible benefit, then what is the point?
The convention that I have used for many years is to give each primary key field the name '<entname>_ID
'. This means that when used as a foreign key the name does not have to change. Thus any field name that ends in '_ID
' is a key field, and when appearing in the body of a field list it will not only be apparent that it is a foreign key, but will also identify the name of the entity to which it is related. Simple and effective. Regarding the above problem, if the field name does not have to be converted then it is possible to maximise the use of generic code without the need for constant translations.
The worst idea that I ever heard, though fortunately never saw implemented, was the idea that each technical key, instead of containing just a number, should also have a string prefix which identified the entity to which it related. I'm sure this sounds like a very clever idea, but what on earth is the benefit? I have always used techniques that have been simple and effective, so I do not see the point in switching to a new method that is less simple and less effective just because it is fashionable.
It has recently been suggested to me that my preference for prefixing "ID" with "<entityname>_" is not good practice as it results in redundancy. The argument is that in the statement "customer.customer_id" the word "customer" is used twice. This is "clearly redundant and is therefore not good practice".
In my opinion this argument is weak and short-sighted as it focuses on only one aspect of the usage of primary keys. In my many years of experience I have had to cater for other considerations. Some of these are:
There is no rule that says a primary key must always consist of a single field, therefore compound keys are permissible. Likewise there is no rule that says a primary key must always be a technical key, therefore semantic keys are also permissible. The trick is knowing when to use one form in preference to the other.
There are two ways of using technical primary keys - intelligently and indiscriminately.
Although the use of a technical primary key may avoid potential problems in some circumstances (refer to When using a technical key is a good idea) it is incorrect to assume that there will be benefits in all circumstances. Some people may argue that a single technical key is more efficient than a compound semantic key, but even if this were true (and I have yet to see any evidence in support of this claim) I would counter this by saying that a reduction in the number of additional candidate keys and indexes without a corresponding reduction in the accessibility of the data would actually produce an overall net saving.
If you create records with a technical primary but without checking for uniqueness with a semantic secondary key then it is possible to create records with duplicate data, and what effect would that have in your application? If you are able to check for uniqueness via a semantic secondary key then is the technical key redundant? Why have two unique keys if you can get by with one?
Some people say that you should use a numeric key instead of a character key as it makes lookups more efficient. What they fail to realise is that if you have to perform a lookup on the secondary key in order to obtain the primary key before you can perform lookups on other tables then any savings you make by having a numeric key have been flushed down the toilet.
It is my personal opinion that the indiscriminate use of technical keys shows a lack of thought and understanding, and leads me to wonder what other horrors are lurking in the database design. Putting a technical primary key on EVERY table by default may be fashionable, but it is not very intelligent. Where a technical key is employed unnecessarily valuable resources are actually being wasted:
If you have a database with hundreds of tables and millions of occurrences this could add up to a significant sum. Think about it. Look at the different ways in which an objective may be achieved, then examine the pros and cons of each one. Do not jump at a so-called 'solution' just because it solved a problem that appeared with a particular set of circumstances in the past. If those same circumstances do not exist in your current situation then the 'solution' may actually turn out to be more of a 'problem' instead.
13 Mar 20215 | Updated Primary Keys and Candidate keys to indicate if compound keys can contain null columns or not. |
29 May 2005 | Added Field names which include entity names introduce redundancy. |
11 Apr 2005 | Added a comment regarding a review in another web site. |