# Constraints and Predicates: A Brief Tutorial (Part 1)

This three-part article is offered as a tutorial on the crucial
notion of **integrity constraints** (which I'll usually abbreviate to just *constraints*)
and the associated and equally crucial notion of **predicates**. As I've
written elsewhere -- see, e.g., the book *Foundation for Future Database Systems:
The Third Manifesto,* by Hugh Darwen and myself^{[1]}
(2nd edition, Addison-Wesley, 2000), referred to hereinafter as the *Third Manifesto*
book, or just *The Third Manifesto* for short -- these notions are absolutely
fundamental, and yet they don't seem to be very well understood in the database community
at large. Indeed, they seem to be surrounded by confusion (not least in the
academic world, as I'll show in a moment). My aim in this article is to try
to clear up some of that confusion, if I can.

Of course, I'm sure most readers will agree that the integrity constraint concept
is easy enough to understand at an intuitive level; intuitively, a constraint is
just a *conditional expression* -- also known as a *boolean, truth-valued,*
or *logical* expression -- that's required to evaluate to *true*.
Here are a few simple examples, all based on the well-known suppliers and parts database
(see later for a definition of that database):

- Every supplier status value is in the range 1 to 100 inclusive.
- Every supplier in London has status 20.
- If there are any parts at all, at least one of them is blue.
- No two distinct suppliers have the same supplier number.
- Every shipment involves an existing supplier.
- No supplier with status less than 20 supplies any part in a quantity greater than 500.

And so on.

Now, these examples are all fairly straightforward, and I'm sure you didn't have
any difficulty in understanding any of them. If history is anything to go by,
however, pinning matters down more precisely turns out to be a little trickier than
you might think. In support of this claim, I could cite among other things
the numerous attempts -- most of them not very successful -- to come up with a sensible
**taxonomy** or **classification scheme** for constraints. I've made
several such attempts myself! -- see among other things my *Relational Database
Writings* series, especially the 1991-1994 and 1994-1997 volumes (Addison-Wesley,
1995 and 1998, respectively), and the two editions, coauthored with Hugh Darwen,
of the *Third Manifesto* book mentioned above (Addison-Wesley, 1998 and 2000,
respectively). Other writers who have also tried to come up with classification
schemes include:

- Ted Codd (see his book
*The Relational Model for Database Management Version 2*, Addison-Wesley, 1990); - Mike Stonebraker (one of the earliest attempts, so far as I'm aware -- see his paper on the subject in the 1975 ACM SIGMOD Conference Proceedings);
- Jeff Ullman and Jennifer Widom (see their book
*A First Course in Database Systems*, Prentice Hall, 1997); - Ralph Kimball (see his articles "There Are No Guarantees" and "Enforcing
the Rules" in
*Intelligent Enterprise*3, Nos. 11 and 12, August 1st and 18th, 2000, respectively); - Ron Ross (see his book
*The Business Rule Book: Classifying, Defining, and Modeling Rules*, 2nd edition, Business Rule Solutions LLC, 1997);

and several others. *Note: *In particular,
I simply can't resist mentioning the book *Understanding the New SQL: A Complete
Guide,* by Jim Melton and Alan R. Simon (Morgan Kaufmann, 1993). That book
has a chapter entitled "Constraints, Assertions, and Referential Integrity."
What would you think of a book on biology that had a chapter entitled "Birds,
Feathered Bipeds, and Woodpeckers"? The parallel is exact.^{[2]}

I'd like to go on and offer some more evidence, evidence
that suggests that a fair number of database professionals (including in particular
authors of textbooks) don't seem to understand or appreciate the fundamental nature
and importance of integrity in a database context. A quick and admittedly not
very scientific survey of a whole shelfload of database textbooks -- 37 in all, including
essentially all of the best-known ones^{[3]}
-- reveals the following:

- Only one book had an entire chapter devoted to the topic of integrity -- and
even there I have severe reservations about the quality of the treatment.
*Note:*At first glance it looked as if there were three others that had a whole chapter on the subject too, but closer examination revealed that one of those books was using the term to refer to normalization issues only, while the other two were taking it to refer, not to integrity in its usual sense at all, but rather to locking and concurrency control issues.*Caveat lector!* - Most of the books examined didn't even mention integrity in a chapter title at all, and those few that did tended to bundle it with other topics in what seems to me a very haphazard fashion ("Integrity, Views, Security, and Catalogs" is a typical example).
- I couldn't find a good
*explanation*or*definition*of the concept, let alone the kind of emphasis I think the concept deserves, in any of the books at all.

With all of the foregoing in mind, I offer this brief tutorial as an aid to clarification and precise thinking in this potentially confusing area.

Let me begin by observing that, logically speaking, a database is a **variable**
-- a large and complicated variable, perhaps, but a variable nonetheless. Certainly
it's something that has different values at different times, and that's more or less
the *definition* of a variable. Thus, the operation of "updating
the database" causes the current value of the variable that's the database in
question to be replaced by another value. And, of course, the values I'm talking
about here are both **database values,** and the variable is a **database variable**.
In other words, the crucial distinction between values and variables that I've discussed
in general terms in this occasional series of articles several times already -- see,
e.g., "Is a Circle an Ellipse?" (*www.dbdebunk.com**,* April 2001) -- applies to databases
in particular.

Likewise, the relations in a given database are also variables: They too
have different values at different times. In fact, in *The Third Manifesto,*
Hugh Darwen and I explicitly introduced the term **relation variable,** distinguishing
it from the term **relation value** (relation values are the kinds of values that
relation variables are permitted to have). Historically, of course, we've typically
used the same term, *relation* (or, in SQL contexts, *table*), for both
concepts; but this usage has demonstrably led to a lot of confusion in the past,
and so we decided in *The Third Manifesto* that we really did need two different
terms. For simplicity, however, we also decided to allow those terms to be
abbreviated whenever it seemed safe to do so: *relation variable* to **relvar,**
and *relation value* to just **relation** (exactly as we conventionally allow,
e.g., the term *integer value* to be abbreviated to just *integer*).

In the same kind of way, it would make good logical sense to talk explicitly of
*database values* and *database variables* (probably allowing those terms
to be abbreviated to *databases* and *dbvars,* respectively); a database
variable would be a container for a collection of relation variables (and no other
kinds of variables!), and a database value would be the value of a database variable
at some particular time. Slightly against my better judgment, however, in this
article I've decided to stick to the simple and familiar term *database* for
both concepts. But I'll definitely reserve the term *relation* to mean
a relation value specifically, and I'll use the term *relvar* whenever I mean
a relation variable specifically.

With all of that preamble out of the way, let's get down to the matter at hand.
As usual, I'll base my examples on the suppliers and parts database, so let me first
give an appropriate definition for that database. *Note: *That definition
-- which should, I hope, be more or less self-explanatory -- is expressed in a slightly
simplified form of **Tutorial D,** which is a language Hugh Darwen and I use as
a basis for examples in the *Third Manifesto* book and elsewhere. For
simplicity, I've omitted the underlying **type** (or *domain*) definitions,
showing only the three *relvar* definitions -- but I'm going to assume, where
it makes any difference, that types INTEGER (integers) and CHAR (character strings
of arbitrary length) are builtin or *system-defined* types, while types S#,
NAME, P#, COLOR, WEIGHT, and QTY are *user-defined* types.

`VAR S RELATION`

`{ S# S#,`

`SNAME NAME,`

`STATUS INTEGER,`

`CITY CHAR }`

`KEY { S# } ;`

`VAR P RELATION`

`{ P# P#,`

`PNAME NAME,`

`COLOR COLOR,`

`WEIGHT WEIGHT,`

`CITY CHAR }`

`KEY { P# } ;`

`VAR SP RELATION`

`{ S# S#,`

`P# P#,`

`QTY QTY }`

`KEY { S#, P# }`

`FOREIGN KEY { S# } REFERENCES S`

`FOREIGN KEY { P# } REFERENCES P ;`

Now we can start to look at integrity constraints as such. Essentially,
an integrity constraint is a constraint on the values some given variable, or some
given combination of variables, is permitted to assume. Thus, the very fact
that a given variable is defined to be of some given type represents an *a priori*
constraint on the variable in question -- the values that can be assumed by that
variable obviously have to be values of that type. And, as a special case of
the foregoing, the very fact that each attribute of a given relvar is defined to
be of some given type represents an *a priori* constraint on the relvar in question.
For example, relvar S (suppliers) is constrained to contain values that are relations
in which every S# value is a supplier number (a value of type S#), every SNAME value
is a name (a value of type NAME), and so on.

However, these simple *a priori* constraints are certainly not the only possible
ones; in fact, none of the examples given at the beginning of this article is an
*a priori* constraint in the foregoing sense. Let's take a closer look
at the first of those earlier examples:

1. Every supplier status value is in the range 1 to 100 inclusive.

Here's an alternative way of saying the same thing:

If

sis a supplier, thenshas a status value in the range 1 to 100 inclusive.

My reason for giving this alternative formulation should become clear in a few moments.

Now, the foregoing natural-language statements are obviously quite informal. A more formal and more precise formulation might look something like this:

Before I try to explain this expression in detail, let me say that -- for reasons
that are beyond the scope of this article -- the syntax I'll be using in my formal
examples is not exactly **Tutorial D** syntax as such. But it *is*
close to a relational calculus version of that syntax, and in particular it makes
use of the truth-valued operators (the technical term is *quantifiers*) FORALL
and EXISTS. To elaborate briefly:

- The expression FORALL
*x (p)*evaluates to*true*if and only if the truth-valued expression*p*evaluates to*true*for all values of*x*. - The expression EXISTS
*x (p)*evaluates to*true*if and only if the truth-valued expression*p*evaluates to*true*for at least one value of*x.*

*Note:* The "*x*" in both FORALL *x (p)* and EXISTS *x
(p)* is an example of what's called a *bound variable*. I'll have more
to say about bound variables in Parts II and III of this article.

The syntax I'm using also makes use of the truth-valued operator "`e`

", which can be read as *is a member of*
or *belongs to,* or simply *in* (it does resemble SQL's IN operator, somewhat).

Now, I think the various formal examples (the one shown above and all of the others
to follow) are all pretty much self-explanatory; however, if you'd like further explanation
of FORALL, EXISTS, "`e`

",
and related matters, then I refer you to my book *An Introduction to Database Systems*
(7th edition, Addison-Wesley, 2000).

Back to the example. It follows from all of the above that the formal version of the constraint can be read as follows (in rather stilted English):

For all supplier numbers *s#* and all names *sn* and all integers *st*
and all character strings *sc,* if a tuple with S# = *s#* and SNAME = *sn*
and STATUS = *st* and CITY = *sc* appears in the suppliers relvar, then
*st* is greater than or equal to 1 and less than or equal to 100.

Perhaps you can now see why I gave that alternative natural-language
formulation of the constraint when I originally began to discuss this example.
The fact is, that alternative natural-language version, the formal and precise formulation,
and the stilted English counterpart all have a certain overall "shape,"
as it were, that looks something like this:^{[4]}

IFa certain tuple appears in a certain relvar,THENthat tuple satisfies a certain condition.

This "shape" is an example of what's known in logic as an *implication*
(sometimes a *logical* or *material* implication). In general, an
implication takes the form

`IF`

pTHENq

where *p* and *q* are truth-valued expressions, usually referred to
in logic as the *antecedent* and the *consequent,* respectively.
Like the antecedent and the consequent, the overall implication is itself a truth-valued
expression; it evaluates to *false* if *p* is *true* and *q*
is *false,* and to *true* otherwise (in other words, it's logically equivalent
to the expression (NOT *p*) OR *q*).

By the way, note how the foregoing shape *implies* the FORALL quantification
-- "IF a certain tuple appears" means, implicitly, "FORALL tuples
that do appear."

Suppose, then, that some given constraint does in fact have the foregoing shape.
Then it's specifically the "certain condition" that appears in the consequent
of the implication that most people would regard, informally, as the integrity constraint
*per se*. In SQL, for example, the constraint under discussion -- "If
*s* is a supplier, then *s* has a status value in the range 1 to 100 inclusive"
-- would typically be expressed as follows:

`CHECK ( STATUS BETWEEN 1 AND 100 )`

This expression would appear as part of the definition of relvar S, or what SQL would call "base table" S. And it does strongly suggest that the "condition" -- i.e., the expression between parentheses -- is the whole of the constraint (after all, it does say that's what's to be checked, doesn't it?).

However, while the foregoing perception (that the "condition" is whole of the constraint) might perhaps be unobjectionable in informal contexts, and possibly even desirable for human factors reasons, it shouldn't be allowed to obscure the fact that there's really a lot more going on. That is, the SQL expression shown above must be understood as shorthand for something like the more explicit, and more complete, formal expression shown earlier.

(As an aside, I note that SQL doesn't actually include any direct support for
logical implication -- i.e., for truth-valued expressions of the form IF *p*
THEN *q*. This fact might go some way toward explaining why the SQL formulation
of the constraint takes the form it does.)

To conclude Part I of this article, let's take a look at the rest of the introductory
examples. I'll show a precise formulation in each case and offer some comments
where I think they're needed, but I'll leave the "stilted English" and
SQL formulations to you. Please note that I make no claim that the precise
formulations shown are unique, or even that they're as simple as possible; I claim
only that they're *correct*. Note too that each example does illustrate
at least one new point.

2. Every supplier in London has status 20.

`FORALL`

s#`e`

`S#,`

sn

`e`

`NAME,`

st

`e`

`INTEGER,`

sc`e`

`CHAR`

`( IF { S#`

s#,SNAMEsn,STATUSst,CITYsc}`e`

`S`

`THEN ( IF`

sc= 'London'

`THEN`

st= 20 ) )

In this example, the consequent of the implication takes the form of an implication itself.

3. If there are any parts at all, at least one of them is blue.

`IF`

`EXISTS`

p#`e`

`P#,`

pn

`e`

`NAME,`

pl

`e`

`COLOR,`

pw

`e`

`WEIGHT,`

pc

`e`

`CHAR`

`( { P#`

p#,PNAMEpn,COLORpl,WEIGHTpw,CITYpc}`e`

`P )`

`THEN`

`EXISTS`

p#`e`

`P#,`

pn

`e`

`NAME,`

pl

`e`

`COLOR,`

pw

`e`

`WEIGHT,`

pc

`e`

`CHAR`

`( { P#`

p#,PNAMEpn,COLORpl,WEIGHTpw,CITYpc}`e`

`P`

`AND`

pl= COLOR ('Blue') )

The expression "COLOR ('Blue')" in the last line but one here is a COLOR
literal. Note that we can't just say "at least one part is blue"
-- we do have to worry about the case where there aren't any parts at all.
I remark in passing that SQL tries to "help" the user with the formulation
of constraints of this kind but (predictably enough) makes a hash of things ...
To be specific, if the clause CHECK(*q*) is specified as part of the definition
of base table *BT,* but *BT* happens to be empty, then SQL regards the
constraint as satisfied **no matter what form the truth-valued expression q
takes** -- even if it takes the form "

*BT*must not be empty"! (or even the form "

*BT*must have cardinality -5," or the form "1 = 0," come to that).

4. No two distinct suppliers have the same supplier number.

`FORALL`

x#`e`

`S#,`

xn

`e`

`NAME,`

xt

`e`

`INTEGER,`

xc`e`

`CHAR,`

y#`e`

`S#,`

yn

`e`

`NAME,`

yt

`e`

`INTEGER,`

yc

`e`

`CHAR`

`( IF { S#`

x#,SNAMExn,STATUSxt,CITYxc}`e`

`S AND`

`{ S#`

y#,SNAMEyn,STATUSyt,CITYyc}`e`

`S`

`THEN ( IF`

x#=y#

`THEN`

xn=ynANDxt=ytANDxc=yc) )

This expression is just a formal statement of the fact that
{S#} is a candidate key for suppliers,^{[5]}
thus, candidate key constraints are just a special case of constraints in general.
The **Tutorial D** syntax KEY {S#} might be regarded as shorthand for the foregoing
more longwinded expression. *Note: *In practice, of course, our
list of constraints ought also to include (a) an analogous one for parts, saying
that no two distinct parts have the same part number, and (b) an analogous one for
shipments, saying that no two distinct shipments have the same supplier number and
the same part number, but I've omitted these two constraints for simplicity.

5. Every shipment involves an existing supplier.

`FORALL`

s#`e`

`S#,`

p#

`e`

`P#,`

q

`e`

`QTY`

`( IF { S#`

s#,P#p#,QTYq}`e`

`SP`

`THEN EXISTS`

sn`e`

`NAME,`

st

`e`

`INTEGER,`

sc

`e`

`CHAR`

`( { S#`

s#,SNAMEsn,STATUSst,CITYsc}`e`

`S ) )`

This expression is a formal statement of the fact that {S#} is a foreign key for
shipments, matching the candidate key {S#} for suppliers; thus, foreign key constraints
too are just a special case of constraints in general. *Note: *In
practice, of course, our list of constraints ought really to include one to say that
every shipment involves an existing part as well, but again I've omitted this constraint
for simplicity.

This example is the first we've seen to involve two distinct relvars (SP and S,
in the example) -- all the previous examples involved just one. In the *Third
Manifesto* book and elsewhere, I've used the term *relvar constraint* for
a constraint involving exactly one relvar and the term *database constraint*
for a constraint involving more than one. However, while that distinction might
be useful from a pragmatic point of view, it's not very important from a theoretical
one, and I won't bother with it much in the remainder of this article.

6. No supplier with status less than 20 supplies any part in a quantity greater than 500.

Like the previous example, this one too involves two distinct relvars (S and SP again, of course), but it's certainly not a foreign key constraint.

### Notes

*/ ?>[1] In passing, let me acknowledge helpful reviews of earlier drafts of this article by Hugh Darwen, Ron Ross, and Fabian Pascal.

[2] In fairness, I should perhaps say that the SQL standard
itself might be partly to blame here, inasmuch as it uses the keyword CONSTRAINT
in connection with some constraints and the keyword ASSERTION in connection with
others. While most constraints can be specified using either the CONSTRAINT
style or the ASSERTION style, some *have* to use the CONSTRAINT style and some
*have* to use the ASSERTION style. I have no idea why this is.

[3] It wouldn't have been appropriate to include any of
my own books in the survey, and I didn't. I also didn't include Fabian Pascal's
book *Practical Issues in Database Management* (Addison-Wesley, 2000), since
I reviewed and commented on that book in draft form and also wrote a foreword for
it, and might therefore be accused of bias. Let me simply note for the record
that Fabian's book, unlike most of the others, does include a chapter on integrity.

[4] I don't mean to suggest that *all* constraints
have this particular shape, just that many of them do. See Example 3, later,
for an example of one that doesn't.

[5] It would be more accurate to say that it's a formal
statement of the fact that {S#} is a *superkey* for suppliers. A superkey
is a superset -- not necessarily a proper superset, of course -- of a candidate key;
thus, all candidate keys are superkeys, but some superkeys are not candidate keys.
But I don't want to get too far off on this particular tangent here.

Copyright (C) C.J. Date 2001.

# # #

### About our Contributor:

**Spotlight**

## Online Interactive Training Series

In response to a great many requests, Business Rule Solutions now offers at-a-distance learning options. No travel, no backlogs, no hassles. Same great instructors, but with schedules, content and pricing designed to meet the special needs of busy professionals.