Constraints and Predicates: A Brief Tutorial (Part 1)

C.J. Date Author, Read Author Bio || Read All Articles by C.J. Date

This three-part article is offered as a tutorial on the crucial notion of integrity constraints (which I'll usually abbreviate to just constraints) and the associated and equally crucial notion of predicates. As I've written elsewhere -- see, e.g., the book Foundation for Future Database Systems: The Third Manifesto, by Hugh Darwen and myself^[1] (2nd edition, Addison-Wesley, 2000), referred to hereinafter as the Third Manifesto book, or just The Third Manifesto for short -- these notions are absolutely fundamental, and yet they don't seem to be very well understood in the database community at large. Indeed, they seem to be surrounded by confusion (not least in the academic world, as I'll show in a moment). My aim in this article is to try to clear up some of that confusion, if I can.

Of course, I'm sure most readers will agree that the integrity constraint concept is easy enough to understand at an intuitive level; intuitively, a constraint is just a conditional expression -- also known as a boolean, truth-valued, or logical expression -- that's required to evaluate to true. Here are a few simple examples, all based on the well-known suppliers and parts database (see later for a definition of that database):

Every supplier status value is in the range 1 to 100 inclusive.
Every supplier in London has status 20.
If there are any parts at all, at least one of them is blue.
No two distinct suppliers have the same supplier number.
Every shipment involves an existing supplier.
No supplier with status less than 20 supplies any part in a quantity greater than 500.

And so on.

Now, these examples are all fairly straightforward, and I'm sure you didn't have any difficulty in understanding any of them. If history is anything to go by, however, pinning matters down more precisely turns out to be a little trickier than you might think. In support of this claim, I could cite among other things the numerous attempts -- most of them not very successful -- to come up with a sensible taxonomy or classification scheme for constraints. I've made several such attempts myself! -- see among other things my Relational Database Writings series, especially the 1991-1994 and 1994-1997 volumes (Addison-Wesley, 1995 and 1998, respectively), and the two editions, coauthored with Hugh Darwen, of the Third Manifesto book mentioned above (Addison-Wesley, 1998 and 2000, respectively). Other writers who have also tried to come up with classification schemes include:

Ted Codd (see his book The Relational Model for Database Management Version 2, Addison-Wesley, 1990);
Mike Stonebraker (one of the earliest attempts, so far as I'm aware -- see his paper on the subject in the 1975 ACM SIGMOD Conference Proceedings);
Jeff Ullman and Jennifer Widom (see their book A First Course in Database Systems, Prentice Hall, 1997);
Ralph Kimball (see his articles "There Are No Guarantees" and "Enforcing the Rules" in Intelligent Enterprise 3, Nos. 11 and 12, August 1st and 18th, 2000, respectively);
Ron Ross (see his book The Business Rule Book: Classifying, Defining, and Modeling Rules, 2nd edition, Business Rule Solutions LLC, 1997);

and several others. Note: In particular, I simply can't resist mentioning the book Understanding the New SQL: A Complete Guide, by Jim Melton and Alan R. Simon (Morgan Kaufmann, 1993). That book has a chapter entitled "Constraints, Assertions, and Referential Integrity." What would you think of a book on biology that had a chapter entitled "Birds, Feathered Bipeds, and Woodpeckers"? The parallel is exact.^[2]

I'd like to go on and offer some more evidence, evidence that suggests that a fair number of database professionals (including in particular authors of textbooks) don't seem to understand or appreciate the fundamental nature and importance of integrity in a database context. A quick and admittedly not very scientific survey of a whole shelfload of database textbooks -- 37 in all, including essentially all of the best-known ones^[3] -- reveals the following:

Only one book had an entire chapter devoted to the topic of integrity -- and even there I have severe reservations about the quality of the treatment. Note: At first glance it looked as if there were three others that had a whole chapter on the subject too, but closer examination revealed that one of those books was using the term to refer to normalization issues only, while the other two were taking it to refer, not to integrity in its usual sense at all, but rather to locking and concurrency control issues. Caveat lector!
Most of the books examined didn't even mention integrity in a chapter title at all, and those few that did tended to bundle it with other topics in what seems to me a very haphazard fashion ("Integrity, Views, Security, and Catalogs" is a typical example).
I couldn't find a good explanation or definition of the concept, let alone the kind of emphasis I think the concept deserves, in any of the books at all.

With all of the foregoing in mind, I offer this brief tutorial as an aid to clarification and precise thinking in this potentially confusing area.

Let me begin by observing that, logically speaking, a database is a variable -- a large and complicated variable, perhaps, but a variable nonetheless. Certainly it's something that has different values at different times, and that's more or less the definition of a variable. Thus, the operation of "updating the database" causes the current value of the variable that's the database in question to be replaced by another value. And, of course, the values I'm talking about here are both database values, and the variable is a database variable. In other words, the crucial distinction between values and variables that I've discussed in general terms in this occasional series of articles several times already -- see, e.g., "Is a Circle an Ellipse?" (www.dbdebunk.com, April 2001) -- applies to databases in particular.

Likewise, the relations in a given database are also variables: They too have different values at different times. In fact, in The Third Manifesto, Hugh Darwen and I explicitly introduced the term relation variable, distinguishing it from the term relation value (relation values are the kinds of values that relation variables are permitted to have). Historically, of course, we've typically used the same term, relation (or, in SQL contexts, table), for both concepts; but this usage has demonstrably led to a lot of confusion in the past, and so we decided in The Third Manifesto that we really did need two different terms. For simplicity, however, we also decided to allow those terms to be abbreviated whenever it seemed safe to do so: relation variable to relvar, and relation value to just relation (exactly as we conventionally allow, e.g., the term integer value to be abbreviated to just integer).

In the same kind of way, it would make good logical sense to talk explicitly of database values and database variables (probably allowing those terms to be abbreviated to databases and dbvars, respectively); a database variable would be a container for a collection of relation variables (and no other kinds of variables!), and a database value would be the value of a database variable at some particular time. Slightly against my better judgment, however, in this article I've decided to stick to the simple and familiar term database for both concepts. But I'll definitely reserve the term relation to mean a relation value specifically, and I'll use the term relvar whenever I mean a relation variable specifically.

With all of that preamble out of the way, let's get down to the matter at hand. As usual, I'll base my examples on the suppliers and parts database, so let me first give an appropriate definition for that database. Note: That definition -- which should, I hope, be more or less self-explanatory -- is expressed in a slightly simplified form of Tutorial D, which is a language Hugh Darwen and I use as a basis for examples in the Third Manifesto book and elsewhere. For simplicity, I've omitted the underlying type (or domain) definitions, showing only the three relvar definitions -- but I'm going to assume, where it makes any difference, that types INTEGER (integers) and CHAR (character strings of arbitrary length) are builtin or system-defined types, while types S#, NAME, P#, COLOR, WEIGHT, and QTY are user-defined types.

VAR S RELATION
{ S# S#,
SNAME NAME,
STATUS INTEGER,
CITY CHAR }
KEY { S# } ;

VAR P RELATION
{ P# P#,
PNAME NAME,
COLOR COLOR,
WEIGHT WEIGHT,
CITY CHAR }
KEY { P# } ;

VAR SP RELATION
{ S# S#,
P# P#,
QTY QTY }
KEY { S#, P# }
FOREIGN KEY { S# } REFERENCES S
FOREIGN KEY { P# } REFERENCES P ;

Now we can start to look at integrity constraints as such. Essentially, an integrity constraint is a constraint on the values some given variable, or some given combination of variables, is permitted to assume. Thus, the very fact that a given variable is defined to be of some given type represents an a priori constraint on the variable in question -- the values that can be assumed by that variable obviously have to be values of that type. And, as a special case of the foregoing, the very fact that each attribute of a given relvar is defined to be of some given type represents an a priori constraint on the relvar in question. For example, relvar S (suppliers) is constrained to contain values that are relations in which every S# value is a supplier number (a value of type S#), every SNAME value is a name (a value of type NAME), and so on.

However, these simple a priori constraints are certainly not the only possible ones; in fact, none of the examples given at the beginning of this article is an a priori constraint in the foregoing sense. Let's take a closer look at the first of those earlier examples:

1. Every supplier status value is in the range 1 to 100 inclusive.

Here's an alternative way of saying the same thing:

If s is a supplier, then s has a status value in the range 1 to 100 inclusive.

My reason for giving this alternative formulation should become clear in a few moments.

Now, the foregoing natural-language statements are obviously quite informal. A more formal and more precise formulation might look something like this:

Before I try to explain this expression in detail, let me say that -- for reasons that are beyond the scope of this article -- the syntax I'll be using in my formal examples is not exactly Tutorial D syntax as such. But it is close to a relational calculus version of that syntax, and in particular it makes use of the truth-valued operators (the technical term is quantifiers) FORALL and EXISTS. To elaborate briefly:

The expression FORALL x (p) evaluates to true if and only if the truth-valued expression p evaluates to true for all values of x.
The expression EXISTS x (p) evaluates to true if and only if the truth-valued expression p evaluates to true for at least one value of x.

Note: The "x" in both FORALL x (p) and EXISTS x (p) is an example of what's called a bound variable. I'll have more to say about bound variables in Parts II and III of this article.

The syntax I'm using also makes use of the truth-valued operator "e", which can be read as is a member of or belongs to, or simply in (it does resemble SQL's IN operator, somewhat).

Now, I think the various formal examples (the one shown above and all of the others to follow) are all pretty much self-explanatory; however, if you'd like further explanation of FORALL, EXISTS, "e", and related matters, then I refer you to my book An Introduction to Database Systems (7th edition, Addison-Wesley, 2000).

Back to the example. It follows from all of the above that the formal version of the constraint can be read as follows (in rather stilted English):

For all supplier numbers s# and all names sn and all integers st and all character strings sc, if a tuple with S# = s# and SNAME = sn and STATUS = st and CITY = sc appears in the suppliers relvar, then st is greater than or equal to 1 and less than or equal to 100.

Perhaps you can now see why I gave that alternative natural-language formulation of the constraint when I originally began to discuss this example. The fact is, that alternative natural-language version, the formal and precise formulation, and the stilted English counterpart all have a certain overall "shape," as it were, that looks something like this:^[4]

IF a certain tuple appears in a certain relvar, THEN that tuple satisfies a certain condition.

This "shape" is an example of what's known in logic as an implication (sometimes a logical or material implication). In general, an implication takes the form

IF p THEN q

where p and q are truth-valued expressions, usually referred to in logic as the antecedent and the consequent, respectively. Like the antecedent and the consequent, the overall implication is itself a truth-valued expression; it evaluates to false if p is true and q is false, and to true otherwise (in other words, it's logically equivalent to the expression (NOT p) OR q).

By the way, note how the foregoing shape implies the FORALL quantification -- "IF a certain tuple appears" means, implicitly, "FORALL tuples that do appear."

Suppose, then, that some given constraint does in fact have the foregoing shape. Then it's specifically the "certain condition" that appears in the consequent of the implication that most people would regard, informally, as the integrity constraint per se. In SQL, for example, the constraint under discussion -- "If s is a supplier, then s has a status value in the range 1 to 100 inclusive" -- would typically be expressed as follows:

CHECK ( STATUS BETWEEN 1 AND 100 )

This expression would appear as part of the definition of relvar S, or what SQL would call "base table" S. And it does strongly suggest that the "condition" -- i.e., the expression between parentheses -- is the whole of the constraint (after all, it does say that's what's to be checked, doesn't it?).

However, while the foregoing perception (that the "condition" is whole of the constraint) might perhaps be unobjectionable in informal contexts, and possibly even desirable for human factors reasons, it shouldn't be allowed to obscure the fact that there's really a lot more going on. That is, the SQL expression shown above must be understood as shorthand for something like the more explicit, and more complete, formal expression shown earlier.

(As an aside, I note that SQL doesn't actually include any direct support for logical implication -- i.e., for truth-valued expressions of the form IF p THEN q. This fact might go some way toward explaining why the SQL formulation of the constraint takes the form it does.)

To conclude Part I of this article, let's take a look at the rest of the introductory examples. I'll show a precise formulation in each case and offer some comments where I think they're needed, but I'll leave the "stilted English" and SQL formulations to you. Please note that I make no claim that the precise formulations shown are unique, or even that they're as simple as possible; I claim only that they're correct. Note too that each example does illustrate at least one new point.

2. Every supplier in London has status 20.

FORALL s# eS#, sn eNAME, st eINTEGER, sc eCHAR
( IF { S# s#, SNAME sn, STATUS st, CITY sc }e S
THEN ( IF sc = 'London'
THEN st = 20 ) )

In this example, the consequent of the implication takes the form of an implication itself.

3. If there are any parts at all, at least one of them is blue.

IF
EXISTS p# eP#, pn eNAME, pl eCOLOR, pw eWEIGHT, pc eCHAR
( { P# p#, PNAME pn, COLOR pl, WEIGHT pw, CITY pc }eP )
THEN
EXISTS p# eP#, pn eNAME, pl eCOLOR, pw eWEIGHT, pc eCHAR
( { P# p#, PNAME pn, COLOR pl, WEIGHT pw, CITY pc }eP
AND pl = COLOR ('Blue') )

The expression "COLOR ('Blue')" in the last line but one here is a COLOR literal. Note that we can't just say "at least one part is blue" -- we do have to worry about the case where there aren't any parts at all. I remark in passing that SQL tries to "help" the user with the formulation of constraints of this kind but (predictably enough) makes a hash of things ... To be specific, if the clause CHECK(q) is specified as part of the definition of base table BT, but BT happens to be empty, then SQL regards the constraint as satisfied no matter what form the truth-valued expression q takes -- even if it takes the form "BT must not be empty"! (or even the form "BT must have cardinality -5," or the form "1 = 0," come to that).

4. No two distinct suppliers have the same supplier number.

FORALL x# eS#, xn eNAME, xt eINTEGER, xc eCHAR,
y# eS#, yn eNAME, yt eINTEGER, yc eCHAR
( IF { S# x#, SNAME xn, STATUS xt, CITY xc }e S AND
{ S# y#, SNAME yn, STATUS yt, CITY yc }e S
THEN ( IF x# = y#
THEN xn = yn AND xt = yt AND xc = yc ) )

This expression is just a formal statement of the fact that {S#} is a candidate key for suppliers,^[5] thus, candidate key constraints are just a special case of constraints in general. The Tutorial D syntax KEY {S#} might be regarded as shorthand for the foregoing more longwinded expression. Note: In practice, of course, our list of constraints ought also to include (a) an analogous one for parts, saying that no two distinct parts have the same part number, and (b) an analogous one for shipments, saying that no two distinct shipments have the same supplier number and the same part number, but I've omitted these two constraints for simplicity.

5. Every shipment involves an existing supplier.

FORALL s# eS#, p# eP#, q eQTY
( IF { S# s#, P# p#, QTY q } e SP
THEN EXISTS sn eNAME, st eINTEGER, sc eCHAR
( { S# s#, SNAME sn, STATUS st, CITY sc }e S ) )

This expression is a formal statement of the fact that {S#} is a foreign key for shipments, matching the candidate key {S#} for suppliers; thus, foreign key constraints too are just a special case of constraints in general. Note: In practice, of course, our list of constraints ought really to include one to say that every shipment involves an existing part as well, but again I've omitted this constraint for simplicity.

This example is the first we've seen to involve two distinct relvars (SP and S, in the example) -- all the previous examples involved just one. In the Third Manifesto book and elsewhere, I've used the term relvar constraint for a constraint involving exactly one relvar and the term database constraint for a constraint involving more than one. However, while that distinction might be useful from a pragmatic point of view, it's not very important from a theoretical one, and I won't bother with it much in the remainder of this article.

6. No supplier with status less than 20 supplies any part in a quantity greater than 500.

Like the previous example, this one too involves two distinct relvars (S and SP again, of course), but it's certainly not a foreign key constraint.

...continued

Notes

*/ ?>

[1] In passing, let me acknowledge helpful reviews of earlier drafts of this article by Hugh Darwen, Ron Ross, and Fabian Pascal.

[2] In fairness, I should perhaps say that the SQL standard itself might be partly to blame here, inasmuch as it uses the keyword CONSTRAINT in connection with some constraints and the keyword ASSERTION in connection with others. While most constraints can be specified using either the CONSTRAINT style or the ASSERTION style, some have to use the CONSTRAINT style and some have to use the ASSERTION style. I have no idea why this is.

[3] It wouldn't have been appropriate to include any of my own books in the survey, and I didn't. I also didn't include Fabian Pascal's book Practical Issues in Database Management (Addison-Wesley, 2000), since I reviewed and commented on that book in draft form and also wrote a foreword for it, and might therefore be accused of bias. Let me simply note for the record that Fabian's book, unlike most of the others, does include a chapter on integrity.

[4] I don't mean to suggest that all constraints have this particular shape, just that many of them do. See Example 3, later, for an example of one that doesn't.

[5] It would be more accurate to say that it's a formal statement of the fact that {S#} is a superkey for suppliers. A superkey is a superset -- not necessarily a proper superset, of course -- of a candidate key; thus, all candidate keys are superkeys, but some superkeys are not candidate keys. But I don't want to get too far off on this particular tangent here.

# # #

Standard citation for this article:

C.J. Date, "Constraints and Predicates: A Brief Tutorial (Part 1)" Business Rules Journal, Vol. 2, No. 7, (Jul. 2001)
URL: http://www.brcommunity.com/a2001/b065a.html

About our Contributor:

C.J. Date Author,

C. J. Date is an independent author, lecturer, researcher, and consultant, specializing in relational database technology. He is best known for his book An Introduction to Database Systems (eighth edition, Addison-Wesley, 2004), which has sold some 725,000 copies and is used by several hundred colleges and universities worldwide. He is also the author of many other books on database management.

Read All Articles by C.J. Date