# Constraints and Predicates: A Brief Tutorial (Part 2)

This is the second part of a three-part tutorial article on the subject of integrity constraints. In the first part, after a number of preliminaries, I discussed a set of six examples, giving in each case a natural-language statement and a formal expression of the constraint in question. Now I want to move on to introduce some simple but fundamental concepts and terminology.

Let's return to the first example from Part 1 ("Every supplier status value is in the range 1 to 100 inclusive"). Here again is the precise formulation:

As I pointed out in Part 1, this constraint involves just a single relvar; in
fact, it involves just one variable of any kind -- namely, the suppliers relvar S.
Let me immediately qualify this remark! In fact, of course, the constraint
does also involve certain so-called "bound" variables: namely, *s#, sn,
st,* and *sc*. However, bound variables aren't variables in the usual
programming language sense -- instead, they're variables in the sense of *logic*.
You can think of a bound variable as a kind of "dummy" variable,
in a sense. For consider:

- If we replace all appearances within the formal expression shown above of, e.g.,
the symbol
*st*by any other symbol, say the symbol*xyz,***the constraint remains logically unchanged**. - By contrast, the same is certainly not true if we replace all appearances of, e.g., the symbol S -- which denotes a "nondummy" variable, of course -- by, say, the symbol SP.

From this point forward, I'll take the term *variable* to mean a variable
in the usual programming language sense specifically, not a variable in the sense
of logic (barring explicit statements to the contrary, of course).

To say it again, then, the formal expression of the constraint involves a variable,
S. Thus, although that expression is indeed truth-valued, *we can't say
what its value is* -- i.e., we can't say what truth value it yields -- *until
we substitute a value for that variable*. (Indeed, different substitutions
will yield different truth values, in general.) In other words, the expression
is a **predicate**.

What's a predicate? In general, a predicate is *a truth-valued function*
(i.e, a function that, when it's invoked, returns a truth value); it has a set of
*parameters* (as do all functions), and an argument value has to be supplied
for each such parameter when the function is invoked ("when the predicate is
instantiated," as the logicians say). In the case at hand, then, when
we want to instantiate the predicate -- which is to say, when we want to check the
constraint -- we supply as argument (the *sole* argument, as it happens) the
relation that's the current value of relvar S, and the expression can then be evaluated.

Now, when we do instantiate that predicate, and in effect replace the sole parameter by some argument, we wind up with a truth-valued expression that involves no variables at all, only values. And, of course, analogous remarks apply to constraints involving two, three, four, ..., or any number of relvars; in all cases, when we need to evaluate the expression (i.e., when we need to check the constraint), we replace each parameter by the relation that's the current value of the applicable relvar, and what we wind up with is an expression that involves no variables at all.

In logic, a truth-valued expression that involves no variables at all is said
to be a **proposition**. A proposition thus evaluates to either *true*
or *false,* unequivocally (it can be thought of as a degenerate predicate --
i.e., a predicate that involves zero variables). Here are a few simple examples:

- The sun is a star.
- The moon is a star.
- The sun is further away from us than the moon.
- George W. Bush won the US presidential election in the year 2000.

I'll leave it to you to decide which of these propositions evaluate to *true*
and which to *false*. Do please note, however, that not all propositions
are ones that evaluate to *true* (to think otherwise is a mistake all too easily
made).

*Terminology:* Propositions are also said to be **closed expressions,**
while expressions that involve at least one variable are said to be **open** expressions.
In general, therefore, propositions are closed expressions and predicates
are open ones (except for the degenerate case in which the predicate in question
is in fact a proposition).

Anyway, it should be clear from all of the above that, logically speaking, a constraint
as formally stated is a predicate -- but when the constraint is checked, argument
values are substituted for the parameters, thereby reducing the predicate to a proposition,
and that proposition is then required to evaluate to *true*.

Of course, any given relvar will be subject to many constraints, in general.
Let *R* be a relvar. Then **the relvar predicate** for *R* is
the logical AND or *conjunction* of all of the constraints that apply to (in
other words, mention) that relvar *R*. Please don't get confused here!
-- each individual constraint is a predicate in its own right, as we already know,
but **"the"** relvar predicate is the conjunction of all of those individual
predicates.

For example, if we assume for simplicity that the six constraints discussed in
Part 1 of this article are the *only* constraints that apply to the suppliers
and parts database (apart from *a priori* ones), then the relvar predicate for
suppliers is the conjunction of Numbers 1, 2, 4, 5, and 6, and the relvar predicate
for shipments is the conjunction of Numbers 5 and 6. Note that these two relvar
predicates "overlap" each other, in a sense, inasmuch as they have certain
constituent constraints in common.

*Note:* I might possibly be accused of moving
the goalposts here, slightly. In my book *An Introduction to Database Systems
(7th edition)*,^{[1]} I defined
the relvar predicate for relvar *R* to be the conjunction of all *relvar*
constraints that applied to *R* (where, as explained in Part 1, the term *relvar
constraint* means a constraint that refers to just one relvar). In *The
Third Manifesto,* by contrast, Hugh Darwen and I refined that definition; to be
specific, we now define the relvar predicate for relvar *R* to be the conjunction
of *all* constraints, not just all *relvar* constraints, that apply to
*R*. Apologies to anyone who might find this shift in terminology confusing.

Now let *R* be a relvar, and let *RP* be the relvar predicate for *R*.
Clearly, then, *R* must never be allowed to have a value that, when it's
substituted for *R* in *RP* (and when any other necessary argument-for-parameter
substitutions have also been made in *RP*), causes *RP* to evaluate to
*false*. In fact, I can now introduce **The Golden Rule** (or the
first version of that rule, at any rate):

No update operation must ever assign to any relvar a value that causes its relvar
predicate to evaluate to false. |

Now let *D* be a database, and let *D* contain relvars *R1, R2, ...Rn*
(only). Let the relvar predicates for those relvars be *RP1, RP2, ..., RPn,*
respectively. Then **the database predicate** for *D* -- *DP,*
say -- is the conjunction of all of those relvar predicates:

`DP`

`_`

RP1ANDRP2AND ... ANDRPn

As an aside, let me remind you that (as we've seen) two distinct relvar predicates
*RPi* and *RPj* (*i* =/ *j*) might have certain constituent constraints
in common. It follows that the very same constraint might appear many times
over in the database predicate *DP*. From a logical point of view, of
course, there's no harm in this state of affairs, because if *c* is a constraint,
then *c* AND *c* is logically equivalent to just *c* -- though I naturally
hope the system would be smart enough in such a situation to evaluate *c* once
and not twice.

Here then is the extended (more general, and final) version of **The Golden Rule:**

No update operation must ever assign to any database a value that causes its database
predicate to evaluate to false. |

Of course, a database predicate will evaluate to *false* if and only if at
least one of its constituent relvar predicates does so too. And a relvar predicate
will evaluate to *false* if and only if at least one of its constituent constraints
does so too.

I'll have more to say about **The Golden Rule** in a few moments. First,
however, I want to consider the question of what everything I've said so far implies
for the practical issue of actually doing constraint checking. Once again,
let's return to our very first example ("Every supplier status value is in the
range 1 to 100 inclusive"):

As I pointed out in Part 1 of this article, this constraint is effectively saying
that **IF** a certain tuple appears in relvar S, **THEN** that tuple has to
satisfy a certain condition ("status in the range 1 to 100," in the case
at hand). So, apparently, if we try to insert a new supplier tuple with status
(say) 200, the sequence of events has to be:

- Insert the new tuple.
- Check the constraint.
- Undo the update (because the check fails).

But this is absurd! Clearly, we would like to catch the error before the
INSERT is done in the first place. So what the implementation clearly has
to do is use the formal expression of the constraint to **infer** the appropriate
check(s) to be performed on tuples that are presented for insertion.

-- i.e., if the antecedent (the piece between the IF and the THEN) in some implication within the overall predicate is of the form "some tuple appears in S" -- then the consequent (the piece after the THEN) in that implication is essentially a constraint on tuples that are presented for insertion into relvar S.

I remark too that if the database is designed in accordance
with *The Principle of Orthogonal Design*^{[2]}
then any tuple presented for insertion will be a plausible INSERT candidate for at
most one relvar in the database (implying that the tuple will have to be checked
against the relvar predicate for at most one relvar in that database). But
now I'm beginning to stray into an area that deserves a sizable discussion of its
own, and I think I'd better leave it to a subsequent article.

Back to The Golden Rule:

No update operation must ever assign to any database a value that causes its database
predicate to evaluate to false. |

Now, I didn't point out the fact explicitly before, but you
might have realized for yourself that the rule as stated implies that **all checking
is immediate**. Why? Because it talks in terms of *update operations*
-- in other words, INSERT, DELETE, and UPDATE operations, loosely speaking -- and
**not** in terms of transactions (see below). In effect, therefore, **The
Golden Rule** requires integrity constraints to be satisfied *at statement boundaries*,.^{[3]} and there's no notion of "deferred"
or COMMIT-time integrity checking at all.

In order to explain the foregoing properly, I first need to say a little more about transactions. Of course, transactions are another big topic in their own right, even if we limit our attention (as I want to do here) to their integrity aspects specifically, so what follows is only the briefest of brief sketches. I plan to elaborate on the points involved in another subsequent article.

First of all, then, I do assume you're familiar with the
transaction concept: in particular, with the so-called "ACID properties"
of transactions. ACID is an acronym for *atomicity - consistency - isolation -
durability*. *Atomicity* means transactions are "all or nothing."
*Consistency* means they transform a consistent state of the database
into another consistent state, without necessarily preserving consistency at all
intermediate points. *Isolation* means that any given transaction's updates
are concealed from all other transactions, until the given transaction commits.
*Durability* means that once a transaction commits, its updates survive in the
database, even if there's a subsequent system crash. The standard reference
on transactions is the -- highly recommended -- book, *Transaction Processing:
Concepts and Techniques*.^{[4]}

But -- I hear some readers objecting -- surely some integrity checks *have*
to be deferred, don't they? As a trivial example, consider the constraint
"Supplier S1 and part P1 are in the same city." If supplier S1 moves,
say from London to Paris, then part P1 must move from London to Paris as well.
The conventional solution to this problem is to wrap the two updates up into a single
transaction, like this:

In this conventional solution, the constraint is defined to be DEFERRED, and the
checking is done at COMMIT -- and the database is inconsistent between the two UPDATE
operations. Note in particular that if the transaction performing the UPDATEs
were to ask the question "Are supplier S1 and part P1 in the same city?"
between the two UPDATE operations, it would get the answer *no*.

By contrast, *The Third Manifesto* solves this kind
of problem by introducing a **multiple assignment** operator, which lets us carry
out several assignments as a single operation (i.e., a single statement), without
any integrity checking being done until all of the assignments in question have been
executed. (INSERT, DELETE, and UPDATE are of course just shorthand for certain assignment
operations, as I pointed out in an earlier article in this occasional series, *viz.,*
"There's Only One Relational Model!"^{[5]}
For example:

Note the comma separator, which means the two UPDATEs are
both part of the same statement. (Of course, I don't mean to suggest that
the problem under consideration can be solved simply by a tiny change in syntax!
Rather, the point is that we do need to be able to perform the two UPDATEs
as part of the same statement -- in parallel, in fact -- and so we need some syntax
to specify the necessary "bundling." So we use a comma for that
purpose, and no integrity checking is done "until we get to the semicolon."
Note in particular that there's now no way for the transaction^{[6]} to see an inconsistent state of the database between
the two UPDATEs, because the notion of "between the two UPDATEs" is one
that has no meaning in this context.

Given the availability of the multiple assignment operator, there's now no need at all for deferred checking in the traditional sense (i.e., checking that's deferred to end-of-transaction). As already indicated, however, this is a topic that deserves an article of its own, and I hope to write that article soon.

*...to be continued*

#### References

[1] C.J. Date, *An Introduction to Database Systems
(7th edition)*, Addison-Wesley (2000).

[2] See the book, *An Introduction to Database Systems
(7th edition)*, mentioned a couple of times already:

[3] extent on the particular language we're dealing with.
For present purposes, suffice it to say that constraints must be satisfied at the
end of each and every statement that contains no other statement nested syntactically
inside itself. Loosely: *Constraints must be satisfied at semicolons*.

[4] Jim Gray and Andreas Reuter, *Transaction Processing:
Concepts and Techniques*, Morgan Kaufmann (1993).

[5] C.J. Date, "There's Only One Relational Model!,"
*www.dbdebunk.com*,
(February 2001).

[6] Of course, we do still need the transaction concept (in particular, we need it as the unit of recovery and the unit of concurrency), even though we reject it as a unit of integrity.

# # #

### About our Contributor:

**Spotlight**

## Online Interactive Training Series

In response to a great many requests, Business Rule Solutions now offers at-a-distance learning options. No travel, no backlogs, no hassles. Same great instructors, but with schedules, content and pricing designed to meet the special needs of busy professionals.