Every table in a relational database requires a primary key - a unique identity for each occurrence (row). This primary key is obtained from one or more fields (columns) that exist within the table's structure, and it is the value contained within the primary key field(s) that is used to identify each individual occurrence.
For each primary key the database will build an index. This contains the primary key values and a pointer to the corresponding data record. When searching for entries via a primary key value it is quicker to scan the index than the main data table. The performance benefit from using an index is balanced by the amount of disk space required to hold it and the extra processing required to maintain it.
An important fact to note about primary keys is that once created the value cannot be changed. If a change is required the only option is to create a new occurrence with the new primary key, then delete the original occurrence with the old primary key.
As an example of a primary key consider a table called COUNTRY which contains items of data such as NAME, CURRENCY and LANGUAGE. It would be possible to use NAME as the primary key, but it is usual to use a shortened code rather than a long textual description.
primary key | |||
---|---|---|---|
COUNTRY_ID | COUNTRY_NAME | CURRENCY | LANGUAGE |
UK | United Kingdom | Sterling | English |
USA | United States of America | US Dollar | English |
FR | France | French Franc | French |
For objects such as countries it is commonplace to use the internationally recognised standard country codes as the primary key.
A candidate key is a means of providing an additional unique key into a table. Candidate keys are entirely optional, so a table may contain none, one or several of them.
For example, in a payroll system the EMPLOYEE table may contain columns for PAYROLL_NO and NATIONAL_INSURANCE_NO, both of which are unique to each employee. If both of these were required as lookups into the table only one could be defined as the primary key, so the other would have to be defined as a candidate key.
Like primary keys each candidate key will require additional resources for indexing purposes.
The advantage of a candidate key over a primary key is that its value may be changed, although not to one that already exists.
Where two database entities are involved in a one-to-many relationship a column in the 'many' entity is linked to the primary key of the 'one' entity. This column in the 'many' entity is referred to as a FOREIGN KEY, as shown is the following example:
primary key | foreign key | |
---|---|---|
COMPANY_ID | COMPANY_NAME | COUNTRY_ID |
MS | Microsoft | USA |
IBM | International Business Machines | USA |
ICL | International Computers Ltd | UK |
Although it is possible to link a foreign key to a candidate key this is not recommended as it is possible to change the value of a candidate key. As this does not automatically update the corresponding foreign key values in all related occurrences you would effectively break the relationship.
Note that for a foreign key an index is optional.
This is a key for which the possible values have an obvious meaning to the user or the data. For example, a semantic primary key for a COUNTRY entity might contain the value 'USA' for the occurrence describing the United States of America.
This type of key may also be known as a SURROGATE key. It is a key for which the possible values have no obvious meaning to the user or the data. Technical primary keys let you enforce the uniqueness property within a table while making it easy to generate values invisibly and automatically. A typical method of generating values is to take the next number from a named sequence which is maintained by the database engine using code similar to the following:
SQL "SELECT <seq_name>.NEXTVAL FROM DUAL","DEF"
Remember that the result of this statement will get bigger and bigger as time goes on, so the field which holds the value must be large enough so as not to overflow in the foreseeable future. I have seen an interface definition of 'N10' used on more than one occasion for this purpose.
In some cases the columns within a table do not readily provide a unique value that can be used as a primary key. Take the example of a PERSON table which contains the following columns:
first_name, last_name, initials, sex, date_of_birth, address.
Which column, or combination of columns, could be guaranteed to provide a unique reference? Remember that a primary key, once assigned, cannot be changed. Once you exclude those columns that may be changed during a person's lifetime not much is left, certainly not enough to guarantee uniqueness.
In such a case as this the solution would be to create a new column (e.g. PERSON_ID) for use as either a semantic or technical key. If the value needed to be remembered for later use, such as a login id to the main computer system, then a semantic key may be more appropriate - a person is more liable to remember a short text string such as 'AJM' or 'TONY27' than a random 10 digit number.
It is possible to have a primary key that is comprised of more than 1 column. This is known as a COMPOUND or CONCATENATED key. If the number of columns required is too great, or the combined length is rather large, it is usual practice to insert a technical key. However, if you still need to create a compound candidate key to help prevent duplicate values then an additional technical key is superfluous.
Where a table has a semantic key it is not unknown for the user to want to change the value of that key. As explained previously this is just not possible with a primary key. Although the occurrence can be deleted and recreated with the new value this is no good if there are large numbers of occurrences on other tables which use the original value as a foreign key. The overhead of deleting and recreating all those occurrences could be enormous.
The solution here is to make the user-defined value into a candidate key and insert a new column for use as a technical primary key. Provided that all related occurrences are linked to the unchanging primary key then no relationship will be broken if the candidate key were to be changed.
There are some people who think that technical keys are such a good idea that they insist on putting them on every table without further thought and regardless of the consequences. Although there are situations where the use of technical keys can provide positive benefits, there are also situations where they may actually produce negligible or even negative results. The following sections identify several instances where I think that the inclusion of technical keys is ever-so-slightly questionable.
Where an entity already has a recognisable semantic key which is to be used as a foreign key in related entities there is no advantage in adding a technical key. Take the following example:
primary key | candidate key | |
---|---|---|
COUNTRY_ID | COUNTRY_CODE | COUNTRY_NAME |
1 | UK | United Kingdom |
2 | USA | United States of America |
3 | FR | France |
Where users are likely to be familiar with the semantic key values and have the ability to enter them directly they are unlikely to be as familiar with a random number. Everybody knows that a COUNTRY_CODE of 'USA' refers to the United States of America, but who can say the same for the number 2?
A relationship between two entities is defined within the application model as a one-to-many, where the number of occurrences of the 'many' entity may be zero, one or more. If a particular implementation requires that there never be more than one occurrence of the 'many' entity there is no apparent way that this restraint can be defined in the database schema. Take the following example:
|
|
This particular structure does not disallow the possibility of adding more than one occurrence of the 'many' entity for each occurrence of the 'one' entity. If software is released which does not enforce this non-obvious restriction the results can be, shall we say, interesting?
I have this (perverse?) expectation that if a database is designed in order to satisfy certain business rules, then it should be possible to reverse engineer the database design in order to recreate those rules. This is not possible with the above implementation. My preferred approach to this situation, one that I have used with complete success on more than one occasion, is to make the primary key of 'many' exactly the same as the primary key of 'one'. The primary key of 'many' can therefore serve a dual purpose as the foreign key back to 'one', as demonstrated in the following example:
|
|
As the primary key and foreign key are now one and the same there is no need for a separate index on the foreign key. This saves both disk space and processing time.
This method makes it obvious to all concerned that only one occurrence of 'many' should exist for each occurrence of 'one', and indeed makes it impossible to do otherwise. However, it does upset some people because it violates their rule that each database table has its own number sequence for generating the primary key. As this rule does not exist in any database system that I have ever encountered I consider it to be artificial, and because it produces inferior results I choose to ignore it.
In this example the 'many' entity does not have a semantic key - the primary key is purely arbitrary as each occurrence is just one of many that may be owned by an occurrence of the 'one' entity. This is typically implemented using the following structure:
'many' entity | ||
primary key | foreign key | |
---|---|---|
PKEY_MANY | PKEY_ONE | DATA |
27 | 104 | stuff for 104,27 |
34 | 104 | stuff for 104,34 |
35 | 208 | stuff for 208,35 |
56 | 335 | stuff for 335,56 |
Note the following disadvantages:
Now consider an alternative structure. Here the value for SEQ_NO is not unique for the table, but is unique for each value of PKEY_ONE. It is obtained by code similar to the following:
LAST_SEQ_NO.ONE = LAST_SEQ_NO.ONE + 1 SEQ_NO.MANY = LAST_SEQ_NO.ONE
'many' entity | ||
primary key | ||
---|---|---|
foreign key | ||
PKEY_ONE | SEQ_NO | DATA |
104 | 1 | stuff for 104,1 |
104 | 2 | stuff for 104,2 |
208 | 1 | stuff for 208,1 |
335 | 1 | stuff for 335,1 |
Note the following advantages:
An alternative way of finding out the current highest value for SEQ_NO would be to use a selectdb
statement, as in the following example:
selectdb max(seq_no) from "many" %\ u_where (pkey_one.many = pkey_one.one) %\ to $1 seq_no.many = $1 + 1
As the columns PKEY_ONE and SEQ_NO together form the primary key, this particular selectdb
statement can be satisfied by a quick look at the index without the need to trawl through large numbers of data records.
In this example each occurrence of 'many' is only valid for a finite amount of time, therefore it must include a field to hold a date or time stamp. This is a structure that I have seen somebody else implement:
'many' entity | |||
primary key | candidate key | ||
---|---|---|---|
foreign key | |||
PKEY_MANY | PKEY_ONE | START_DATE | DATA |
27 | 104 | 01-Jan-2000 | stuff for 104 |
34 | 104 | 01-Nov-2000 | stuff for 104 |
35 | 208 | 01-Jan-2000 | stuff for 208 |
56 | 335 | 01-Jan-2000 | stuff for 335 |
The candidate key is there (apparently) to ensure that an instance of PKEY_ONE does not have more than 1 occurrence for a particular date. However, the program still has to read in the occurrence with the highest date to ensure that the start date for a new occurrence is later, not earlier.
This structure allows entries to be created with dates in the future, therefore the procedure for identifying which occurrence is valid for a particular date is a little convoluted (and achieved with a database view):
This structure allows START_DATE to be amended, but the following rules have to be obeyed:
Here is an alternative structure that I have implemented on numerous occasions:
'many' entity | ||||
primary key | ||||
---|---|---|---|---|
foreign key | ||||
PKEY_ONE | SEQ_NO | DATA | START_DATE | END_DATE |
104 | 1 | stuff for 104 | 01-Jan-2000 | 31-Oct-2000 |
104 | 2 | stuff for 104 | 01-Nov-2000 | 31-Dec-9999 |
208 | 1 | stuff for 208 | 01-Jan-2000 | 31-Dec-9999 |
335 | 1 | stuff for 335 | 01-Jan-2000 | 31-Dec-9999 |
The advantages of the SEQ_NO field have been explained in the previous section.
The procedure for identifying which occurrence applies to today's date is a little simpler, and can be done without the need for a database view:
read u_where (START_DATE <= $DATE & END_DATE >= $DATE)
Note that this will not work if undefined end dates are left as null. I always use a dummy date to simulate 'sometime in the future', as explained in Dealing with null End Dates from my Tips & Trix page.
This structure does require a little more effort to maintain, but I think the results are worth it. For example, when adding a new occurrence the following rules must be obeyed:
When amending an existing occurrence the following rules apply:
If you are being efficient and modern and using a 3 tier structure then these rules can be defined in a single service component in the business layer rather than being duplicated in numerous form components in the presentation layer, but that's another story.
In this situation there are several groups of lookup codes, such as country codes, organisation types, address types, etc. Instead of having a separate database table for each group they are all lumped together on a single table with a group identifier to separate them. A big advantage of this structure is that new groups can be added very easily without having to amend the database structure. However, a disadvantage is that it makes the actual lookups a bit more complicated. Exactly how much more complicated depends on the inclusion of the all-pervasive technical primary key. Here is a structure that I saw somebody else design fairly recently:
It starts with a table that defines the various code groups:
'one' entity | ||
primary key | candidate key | |
---|---|---|
ID | GROUP_ID | GROUP_DESC |
1 | COUNTRY | Country Code |
2 | ORGTYPE | Organisation Type |
3 | ADDRTYP | Address Type |
The table that contains the actual codes looks like this:
'many' entity | |||
primary key | foreign key | ||
---|---|---|---|
candidate key | |||
ID | GROUP_ID | CODE_ID | CODE_DESC |
1 | COUNTRY | USA | United States of America |
2 | COUNTRY | UK | United Kingdom |
3 | ADDRTYP | UK | Unknown |
This is how we do things on the planet Earth...
|
|
This is a relationship between two entities (for example, X and Y), where more than one occurrence of Y can exist for each occurrence of X, and more than one occurrence of X can exist for each occurrence of Y. This can be represented in the following diagram:
|
|
|
As it is not possible to define a many-to-many relationship directly the solution is to create an additional link or cross reference entity which can act as the 'many' entity in a pair of one-to-many relationships, as shown in the following diagram:
|
|
|
|
|
This link entity requires as a minimum a foreign key for entity X and a foreign key for entity Y. If you insist on a technical primary key the resulting structure looks like this:
primary key | candidate key | |
---|---|---|
foreign key | foreign key | |
PKEY_LINK | PKEY_X | PKEY_Y |
1 | X1 | Y6 |
2 | X9 | Y42 |
3 | X3 | Y27 |
If there is not a unique key which combines the two foreign keys it will not be easy to check that a particular X+Y pair already exists, and without this check it will be possible to create duplicate entries. In the above structure this can only be implemented as a candidate key. If you think about it carefully you should see that all requirements could be satisfied by dropping the PKEY_LINK column and converting the candidate key into the primary key. This removes 1 redundant column and 1 redundant index, producing a simplified structure as follows:
primary key | |
---|---|
foreign key | foreign key |
PKEY_X | PKEY_Y |
X1 | Y6 |
X9 | Y42 |
X3 | Y27 |
This structure has the problems of a standard one-to-many relationship as highlighted previously, but is compounded by each additional level. Examine the following diagram:
|
|
|
|
|
Here there is a one-to-many relationship between 'top' and 'middle', and another one-to-many relationship between 'middle' and 'bottom'.
'top' entity | |
primary key | |
---|---|
PKEY_TOP | DATA |
104 | 104 data |
208 | 208 data |
335 | 335 data |
'middle' entity | ||
primary key | foreign key | |
---|---|---|
PKEY_MIDDLE | PKEY_TOP | DATA |
27 | 104 | 104,27 data |
35 | 208 | 208,35 data |
56 | 335 | 335,56 data |
'bottom' entity | ||
primary key | foreign key | |
---|---|---|
PKEY_BOTTOM | PKEY_MIDDLE | DATA |
1001 | 27 | 104,27,1001 data |
1002 | 35 | 208,35,1002 data |
1003 | 56 | 335,56,1003 data |
With structure it will not be possible to satisfy the request 'select BOTTOM where PKEY_TOP = 104' without traversing through the MIDDLE entity. If there is more than one entity in the middle then this will add further steps in the journey. Now look at an alternative structure:
'middle' entity | ||
primary key | ||
---|---|---|
foreign key | ||
PKEY_TOP | MIDDLE_SEQ | DATA |
104 | 1 | 104,1 data |
104 | 2 | 104,2 data |
208 | 1 | 208,1 data |
56 | 1 | 335,1 data |
'bottom' entity | |||
primary key | |||
---|---|---|---|
foreign key | foreign key | ||
PKEY_TOP | MIDDLE_SEQ | BOTTOM_SEQ | DATA |
104 | 1 | 1 | 104,1,1 data |
104 | 1 | 2 | 104,1,2 data |
104 | 2 | 1 | 104,2,1 data |
104 | 2 | 2 | 104,2,2 data |
208 | 1 | 1 | 208,1,1 data |
335 | 1 | 1 | 335,1,1 data |
In this structure the request 'select BOTTOM where PKEY_TOP = 104' can be satisfied without the need to navigate through any intermediate tables, therefore would be quicker to process.
I once joined a project where some bright spark had the brilliant idea of giving every technical key the name 'ID'. When asked why he explained that when building the low-level service component which actually retrieved records from the database it was not necessary to change the code inherited from the component template as the name of the primary key was always 'ID', and could therefore be hard coded.
This bright spark was obviously not an experienced Uniface programmer as he was unfamiliar with the $keyfields
function. Give this function the name of an entity and it will return a list containing the names of the primary key field(s).
A problem begins to appear in relationships - if the primary key is called 'ID' then a foreign key cannot also be called 'ID', therefore something will have to change. The solution was to change the foreign key name to 'somethingelse_ID', where 'somethingelse' was an abbreviation of the entity name. Those of you with more than two brain cells to rub together will notice that this results in a field which is named one thing on one entity but something else on another.
This is not normally an issue as standard Uniface processing will automatically transport the key values from one related entity to another. It knows what field names to use as these are defined in the relationship. However, this standard processing is not available when you are dealing with a component in which no entities are painted. Confused? Let me explain. The designers of this particular system had decreed that it would be N-tier, with a separate application model for the presentation and data layers. It is important to note that they did not use any Uniface features that were designed to support these concepts, therefore everything had to be hand-cranked. In order to keep the presentation layer component as simple as possible it was decreed that it would use an intermediate component, called a decorator, to handle its data. When called with a 'GET' operation the decorator would communicate with lower-level service components, one for each entity. It would receive a separate string (not an XML stream but an associative list) from each of these service components, then merge them so that it could pass back a single string to the presentation layer. When called with a 'PUT' operation the decorator had to split its input string into smaller strings so that it could pass to each service component only that data that it was designed to handle. This decorator did not communicate with the database directly therefore it had no entities painted within its structure.
Now imagine what has to be done when retrieving data for related entities - for the 'one' entity the retrieve profile contains 'ID=value
', but for the 'many' entity the corresponding name is 'somethingelse_ID
'. The only way to deal with this translation was to code it manually. An attempt was made to use a standard component template for the decorator, but the inheritance for proc code was broken every time the developer had to perform this manual translation.
The convention that I have used for many years is to give each primary key field the name '<entname>_ID
'. This means that when used as a foreign key the name does not have to change. Thus any field name that ends in '_ID
' is a key field, and when appearing in the body of a field list it will not only be apparent that it is a foreign key, but will also identify the name of the entity to which it is related. Simple and effective. Regarding the above problem, if the field name does not have to be converted then the key transportation can be done with a pair of putlistitems/getlistitems statements.
The worst idea that I ever heard, though fortunately never saw implemented, was the idea that each technical key, instead of containing just a number, should also have a string prefix which identified the entity to which it related. I'm sure this sounds like a very clever idea, but what on earth is the benefit? I have always used techniques that have been simple and effective, so I do not see the point in switching to a new method that is less simple and less effective just because it is fashionable.
There is no rule that says a primary key must always consist of a single field, therefore compound keys are permissible. Likewise there is no rule that says a primary key must always be a technical key, therefore semantic keys are also permissible. The trick is knowing when to use one form in preference to the other.
Although the use of a technical primary key may avoid potential problems in some circumstances (refer to When using a technical key is a good idea) it is incorrect to assume that there will be benefits in all circumstances. Some people may argue that a single technical key is more efficient than a compound semantic key, but even if this were true (and I have yet to see any evidence in support of this claim) I would counter this by saying that a reduction in the number of additional candidate keys and indexes without a corresponding reduction in the accessibility of the data would actually produce an overall net saving.
It is my personal opinion that the indiscriminate use of technical keys shows a lack of thought and understanding, and leads me to wonder what other horrors are lurking in the database design. Putting a technical primary key on EVERY table by default may be fashionable, but it is not very intelligent. Where a technical key is employed unnecessarily valuable resources are actually being wasted:
If you have a database with hundreds of tables and millions of occurrences this could add up to a significant sum. Think about it. Look at the different ways in which an objective may be achieved, then examine the pros and cons of each one. Do not jump at a so-called 'solution' just because it solved a problem that appeared with a particular set of circumstances in the past. If those same circumstances do not exist in your current situation then the 'solution' may actually turn out to be a 'problem' instead.
Tony Marston
26th November 2000
mailto:tony@tonymarston.net
mailto:TonyMarston@hotmail.com
http://www.tonymarston.net
7th March 2001 | Added section 3.g In a one-to-many-to-many relationship |