Constraints and Predicates: A Brief Tutorial (Part 2)
This is the second part of a three-part tutorial article on the subject of integrity constraints. In the first part, after a number of preliminaries, I discussed a set of six examples, giving in each case a natural-language statement and a formal expression of the constraint in question. Now I want to move on to introduce some simple but fundamental concepts and terminology.
Let's return to the first example from Part 1 ("Every supplier status value is in the range 1 to 100 inclusive"). Here again is the precise formulation:
As I pointed out in Part 1, this constraint involves just a single relvar; in fact, it involves just one variable of any kind -- namely, the suppliers relvar S. Let me immediately qualify this remark! In fact, of course, the constraint does also involve certain so-called "bound" variables: namely, s#, sn, st, and sc. However, bound variables aren't variables in the usual programming language sense -- instead, they're variables in the sense of logic. You can think of a bound variable as a kind of "dummy" variable, in a sense. For consider:
- If we replace all appearances within the formal expression shown above of, e.g., the symbol st by any other symbol, say the symbol xyz, the constraint remains logically unchanged.
- By contrast, the same is certainly not true if we replace all appearances of, e.g., the symbol S -- which denotes a "nondummy" variable, of course -- by, say, the symbol SP.
From this point forward, I'll take the term variable to mean a variable in the usual programming language sense specifically, not a variable in the sense of logic (barring explicit statements to the contrary, of course).
To say it again, then, the formal expression of the constraint involves a variable, S. Thus, although that expression is indeed truth-valued, we can't say what its value is -- i.e., we can't say what truth value it yields -- until we substitute a value for that variable. (Indeed, different substitutions will yield different truth values, in general.) In other words, the expression is a predicate.
What's a predicate? In general, a predicate is a truth-valued function (i.e, a function that, when it's invoked, returns a truth value); it has a set of parameters (as do all functions), and an argument value has to be supplied for each such parameter when the function is invoked ("when the predicate is instantiated," as the logicians say). In the case at hand, then, when we want to instantiate the predicate -- which is to say, when we want to check the constraint -- we supply as argument (the sole argument, as it happens) the relation that's the current value of relvar S, and the expression can then be evaluated.
Now, when we do instantiate that predicate, and in effect replace the sole parameter by some argument, we wind up with a truth-valued expression that involves no variables at all, only values. And, of course, analogous remarks apply to constraints involving two, three, four, ..., or any number of relvars; in all cases, when we need to evaluate the expression (i.e., when we need to check the constraint), we replace each parameter by the relation that's the current value of the applicable relvar, and what we wind up with is an expression that involves no variables at all.
In logic, a truth-valued expression that involves no variables at all is said to be a proposition. A proposition thus evaluates to either true or false, unequivocally (it can be thought of as a degenerate predicate -- i.e., a predicate that involves zero variables). Here are a few simple examples:
- The sun is a star.
- The moon is a star.
- The sun is further away from us than the moon.
- George W. Bush won the US presidential election in the year 2000.
I'll leave it to you to decide which of these propositions evaluate to true and which to false. Do please note, however, that not all propositions are ones that evaluate to true (to think otherwise is a mistake all too easily made).
Terminology: Propositions are also said to be closed expressions, while expressions that involve at least one variable are said to be open expressions. In general, therefore, propositions are closed expressions and predicates are open ones (except for the degenerate case in which the predicate in question is in fact a proposition).
Anyway, it should be clear from all of the above that, logically speaking, a constraint as formally stated is a predicate -- but when the constraint is checked, argument values are substituted for the parameters, thereby reducing the predicate to a proposition, and that proposition is then required to evaluate to true.
Of course, any given relvar will be subject to many constraints, in general. Let R be a relvar. Then the relvar predicate for R is the logical AND or conjunction of all of the constraints that apply to (in other words, mention) that relvar R. Please don't get confused here! -- each individual constraint is a predicate in its own right, as we already know, but "the" relvar predicate is the conjunction of all of those individual predicates.
For example, if we assume for simplicity that the six constraints discussed in Part 1 of this article are the only constraints that apply to the suppliers and parts database (apart from a priori ones), then the relvar predicate for suppliers is the conjunction of Numbers 1, 2, 4, 5, and 6, and the relvar predicate for shipments is the conjunction of Numbers 5 and 6. Note that these two relvar predicates "overlap" each other, in a sense, inasmuch as they have certain constituent constraints in common.
Note: I might possibly be accused of moving the goalposts here, slightly. In my book An Introduction to Database Systems (7th edition), I defined the relvar predicate for relvar R to be the conjunction of all relvar constraints that applied to R (where, as explained in Part 1, the term relvar constraint means a constraint that refers to just one relvar). In The Third Manifesto, by contrast, Hugh Darwen and I refined that definition; to be specific, we now define the relvar predicate for relvar R to be the conjunction of all constraints, not just all relvar constraints, that apply to R. Apologies to anyone who might find this shift in terminology confusing.
Now let R be a relvar, and let RP be the relvar predicate for R. Clearly, then, R must never be allowed to have a value that, when it's substituted for R in RP (and when any other necessary argument-for-parameter substitutions have also been made in RP), causes RP to evaluate to false. In fact, I can now introduce The Golden Rule (or the first version of that rule, at any rate):
|No update operation must ever assign to any relvar a value that causes its relvar predicate to evaluate to false.|
Now let D be a database, and let D contain relvars R1, R2, ...Rn (only). Let the relvar predicates for those relvars be RP1, RP2, ..., RPn, respectively. Then the database predicate for D -- DP, say -- is the conjunction of all of those relvar predicates:
_ RP1 AND RP2 AND ... AND RPn
As an aside, let me remind you that (as we've seen) two distinct relvar predicates RPi and RPj (i =/ j) might have certain constituent constraints in common. It follows that the very same constraint might appear many times over in the database predicate DP. From a logical point of view, of course, there's no harm in this state of affairs, because if c is a constraint, then c AND c is logically equivalent to just c -- though I naturally hope the system would be smart enough in such a situation to evaluate c once and not twice.
Here then is the extended (more general, and final) version of The Golden Rule:
|No update operation must ever assign to any database a value that causes its database predicate to evaluate to false.|
Of course, a database predicate will evaluate to false if and only if at least one of its constituent relvar predicates does so too. And a relvar predicate will evaluate to false if and only if at least one of its constituent constraints does so too.
I'll have more to say about The Golden Rule in a few moments. First, however, I want to consider the question of what everything I've said so far implies for the practical issue of actually doing constraint checking. Once again, let's return to our very first example ("Every supplier status value is in the range 1 to 100 inclusive"):
As I pointed out in Part 1 of this article, this constraint is effectively saying that IF a certain tuple appears in relvar S, THEN that tuple has to satisfy a certain condition ("status in the range 1 to 100," in the case at hand). So, apparently, if we try to insert a new supplier tuple with status (say) 200, the sequence of events has to be:
- Insert the new tuple.
- Check the constraint.
- Undo the update (because the check fails).
But this is absurd! Clearly, we would like to catch the error before the INSERT is done in the first place. So what the implementation clearly has to do is use the formal expression of the constraint to infer the appropriate check(s) to be performed on tuples that are presented for insertion.
-- i.e., if the antecedent (the piece between the IF and the THEN) in some implication within the overall predicate is of the form "some tuple appears in S" -- then the consequent (the piece after the THEN) in that implication is essentially a constraint on tuples that are presented for insertion into relvar S.
I remark too that if the database is designed in accordance with The Principle of Orthogonal Design then any tuple presented for insertion will be a plausible INSERT candidate for at most one relvar in the database (implying that the tuple will have to be checked against the relvar predicate for at most one relvar in that database). But now I'm beginning to stray into an area that deserves a sizable discussion of its own, and I think I'd better leave it to a subsequent article.
Back to The Golden Rule:
|No update operation must ever assign to any database a value that causes its database predicate to evaluate to false.|
Now, I didn't point out the fact explicitly before, but you might have realized for yourself that the rule as stated implies that all checking is immediate. Why? Because it talks in terms of update operations -- in other words, INSERT, DELETE, and UPDATE operations, loosely speaking -- and not in terms of transactions (see below). In effect, therefore, The Golden Rule requires integrity constraints to be satisfied at statement boundaries,. and there's no notion of "deferred" or COMMIT-time integrity checking at all.
In order to explain the foregoing properly, I first need to say a little more about transactions. Of course, transactions are another big topic in their own right, even if we limit our attention (as I want to do here) to their integrity aspects specifically, so what follows is only the briefest of brief sketches. I plan to elaborate on the points involved in another subsequent article.
First of all, then, I do assume you're familiar with the transaction concept: in particular, with the so-called "ACID properties" of transactions. ACID is an acronym for atomicity - consistency - isolation - durability. Atomicity means transactions are "all or nothing." Consistency means they transform a consistent state of the database into another consistent state, without necessarily preserving consistency at all intermediate points. Isolation means that any given transaction's updates are concealed from all other transactions, until the given transaction commits. Durability means that once a transaction commits, its updates survive in the database, even if there's a subsequent system crash. The standard reference on transactions is the -- highly recommended -- book, Transaction Processing: Concepts and Techniques.
But -- I hear some readers objecting -- surely some integrity checks have to be deferred, don't they? As a trivial example, consider the constraint "Supplier S1 and part P1 are in the same city." If supplier S1 moves, say from London to Paris, then part P1 must move from London to Paris as well. The conventional solution to this problem is to wrap the two updates up into a single transaction, like this:
In this conventional solution, the constraint is defined to be DEFERRED, and the checking is done at COMMIT -- and the database is inconsistent between the two UPDATE operations. Note in particular that if the transaction performing the UPDATEs were to ask the question "Are supplier S1 and part P1 in the same city?" between the two UPDATE operations, it would get the answer no.
By contrast, The Third Manifesto solves this kind of problem by introducing a multiple assignment operator, which lets us carry out several assignments as a single operation (i.e., a single statement), without any integrity checking being done until all of the assignments in question have been executed. (INSERT, DELETE, and UPDATE are of course just shorthand for certain assignment operations, as I pointed out in an earlier article in this occasional series, viz., "There's Only One Relational Model!" For example:
Note the comma separator, which means the two UPDATEs are both part of the same statement. (Of course, I don't mean to suggest that the problem under consideration can be solved simply by a tiny change in syntax! Rather, the point is that we do need to be able to perform the two UPDATEs as part of the same statement -- in parallel, in fact -- and so we need some syntax to specify the necessary "bundling." So we use a comma for that purpose, and no integrity checking is done "until we get to the semicolon." Note in particular that there's now no way for the transaction to see an inconsistent state of the database between the two UPDATEs, because the notion of "between the two UPDATEs" is one that has no meaning in this context.
Given the availability of the multiple assignment operator, there's now no need at all for deferred checking in the traditional sense (i.e., checking that's deferred to end-of-transaction). As already indicated, however, this is a topic that deserves an article of its own, and I hope to write that article soon.
...to be continued
 extent on the particular language we're dealing with. For present purposes, suffice it to say that constraints must be satisfied at the end of each and every statement that contains no other statement nested syntactically inside itself. Loosely: Constraints must be satisfied at semicolons.
 C.J. Date, "There's Only One Relational Model!," www.dbdebunk.com, (February 2001).
# # #