The Business View of Data and Data Quality: The Six Dimensions of Semantic Quality

Ronald G. Ross, Co-Founder & Principal, Business Rule Solutions, LLC; Executive Editor, Business Rules Journal; Co-Chair, Building Business Capability (BBC)

Business has a fundamental problem with data quality.  In some places it's merely painful; in others it's near catastrophic.  Why is the problem so pervasive?  Why does it never seem to get fixed?  Perhaps we've been thinking about the problem wrong.  Time for a fresh look.

The central flaw in the long-running discussion over data quality is literally its focus on 'data'.  Stored data is merely the system or database residue of things that have already happened in the business, a memory of past events.

To truly fix 'data quality' problems requires a business perspective, a shift in the focus from data design or data cleansing to what occurs in the business itself.  Our sights should be trained squarely on the business activity that results in the data.  This discussion introduces six dimensions of semantic quality to ensure our aim is true.


Consider what workers are actually doing when they create a piece of data.  In the business world, of course, they're probably just doing a bit of work.  Look more closely, however, and in some ways what they're doing is actually quite profound.  Think about it this way:

Creating data is a business communication to people in the future.

In other words, the act of creating data is the act of sending a message.

Normally we think of communication in terms of either direct conversations or (in the spirit of the times) a flurry of text messages exchanged more or less in real time with people we know.  In either case there's usually a shared context within which the meaning of the messages can be interpreted.

What's distinct about creating data as an act of business communication is that you're almost certainly not going to be face-to-face with the recipients of the message or connected live with them via an interactive network.  That fact rules out body language (e.g., raised eyebrows or emoticons) and dialog (including grunts and groans, or more emoticons) to clarify what you mean.  In that sense the communication is blind, as illustrated in Figure 1.

    Figure 1.  The Act of Creating Data as a Blind Business Communication to People in the Future.

As a consequence, the data a worker creates literally needs to speak for itself.  The emphasis needs to be on communication quality.

Communication quality focuses on whether the meaning of a message is clear.  Just formatting data correctly doesn't get you there.  If the meaning isn't clear a business communication won't be properly understood.  In other words, you need semantic quality — not just data quality.

The six dimensions of semantic quality are presented in Table 1.  Don't be put off by the word semantic.  We're simply talking about the meaning of messages.[1]

Table 1.  The Six Dimensions of Semantic Quality
| Dimension of Semantic Quality | Description | Comment |
|---|---|---|
| Readable | not encoded or cryptic | Meaning not obscured by choice of signifiers (terms/codes/words). |
| Understandable | readily intelligible | Only standard terms that have solid business definitions used. |
| Precise | internally consistent | Underlying meaning consistent with a structured business vocabulary (a shared concept model). |
| Reliable | free from flaws or defects that might impair usefulness | Compliant with all relevant business rules. |
| Useful | fit for purpose | An explicit shared purpose agreed by the parties. |
| Sufficient | adequate for satisfying a defined need | The explicit extent (scope) of the need agreed by the parties. |

The six dimensions of semantic quality are discussed more fully later.  They seem largely self-evident, but as you might imagine, there is much more to them than initially meets the eye.

The Role of Data/System Architectures and What Data Quality is Really About

Because of the time delay in delivering blind communications to everyone in the future who might need them, a secure, well-organized holding area is needed.  IT professionals, hopefully guided by knowledgeable data architects, create data/system architectures for that purpose as illustrated by Figure 2.

    Figure 2.  Data/System Architecture as a Rest Stop for Business Communications (Data).

Unfortunately, most data quality measures in current use (refer to Appendix 1) focus on the health of the content of the data/system architecture rather than on the semantic quality of the original business communications.  That focus serves a purpose for data management, but misses the mark almost entirely in clarifying what practices produce good business communications in the first place.  Compared with the semantic quality dimensions presented above, typical data quality dimensions are:

  • Retroactive rather than proactive
  • Quantitative rather than qualitative
  • Systemic rather than semantic

The quality of data in a data/system architecture can never be any better than the quality of the business communications that produced it.  A systematic means to manage data at rest simply does not guarantee the vitality — the semantic health — of the business communications it supports.  Unfortunately, many IT professionals fail to understand this point.[2]

To make the point differently, it is entirely possible to assess your data quality as outstanding even though the business communications that produced the data were confusing, contradictory, unintelligible, or otherwise ineffective.  Such an assessment would of course be nonsense.

Forming High-Quality Business Communications

Rather than retroactively focusing on data already formed, business people and professionals need proactive measures to form high-quality messages in the first place.

What should the recipients of blind messages expect?  They have the right to expect:

  1. High-quality evidence about what the content means.
  2. No need for any significant assumptions, whether unconscious or deliberate, to supplement that evidence.
  3. The content representing exactly the reality the evidence suggests.

What form does evidence available to recipients take?

  • names including codes[3]
  • definitions
  • concept model
  • business rules
  • documentation of purpose
  • scope of the need

The dimensions of semantic quality arise directly from these six kinds of evidence, respectively.  They provide the context for blind communications.  The six dimensions are discussed individually below with examples.

1. Readable

A readable message is one that is not unintentionally encoded or cryptic; that is, one whose meaning is not obscured by choice of signifiers (names or codes).  If a message is encrypted (as security of course usually demands these days) the encryption should be on top of the message, not a by-product of forming the message (data) itself.

Cryptic names and codes are rampant in IT systems; they are encouraged by programming languages, software platforms, and legacy computer tradecraft.  Here are some typical examples:

  1. PID-RAD2-TYPE.  Who but programmers might know what this field name would represent?!

  2. A coding scheme for the values of a field where '0' stands for 'no' and '1' stands for 'yes'.  Why?!

  3. The abbreviation 'PT'.  Without adequate evidence, this abbreviation could stand for many things, including the following:[4]
    • PT Emp → Part-time employee
    • PTCRSR → PT Cruiser (Personal Transportation Cruiser)
    • Blk pt chassis → Black platinum chassis
    • 24pt bk → Manual published in 24-point type
    • 2 pt asbl → Two-part assembly
    • 1 pt → One pint
    • LIS PT → Lisbon, Portugal

Making names and codes readable is not always easy.  For example, concepts that are highly computed or derived are often difficult to label succinctly with names or codes of reasonable size.  Supplemental evidence is even more necessary in those cases.  For the basic or elemental concepts of a problem domain, however, there is simply no excuse for cryptic names or codes.
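The yes/no coding scheme in the second example can be made readable with very little effort.  What follows is only a minimal sketch in Python, assuming a legacy field that stores '0' for 'no' and '1' for 'yes'.  The names YesNo and decode_legacy_flag are hypothetical, invented for illustration.

```python
from enum import Enum


class YesNo(Enum):
    """Readable signifiers for a yes/no answer (hypothetical names)."""
    NO = "no"
    YES = "yes"


# Assumed legacy coding scheme: '0' stands for 'no', '1' stands for 'yes'.
_LEGACY_CODES = {"0": YesNo.NO, "1": YesNo.YES}


def decode_legacy_flag(code: str) -> YesNo:
    """Translate a cryptic stored code into a readable value.

    Anything outside the assumed coding scheme raises ValueError,
    rather than guessing what the sender meant.
    """
    try:
        return _LEGACY_CODES[code]
    except KeyError:
        raise ValueError(f"unrecognized yes/no code: {code!r}") from None


print(decode_legacy_flag("1").value)  # prints "yes"
```

The translation table is itself a piece of evidence.  Without it, a recipient in the future has to assume what '0' and '1' mean.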

2. Understandable

An understandable message uses only standard terms that have solid business (not data) definitions.  Failings in this regard can arise from:

  • Naming things wrong or obscurely — a term is used for a concept that could easily or subtly be misconstrued.

  • Defining things poorly or inaccurately — a term's definition is absent, unclear, imprecise, incomplete, and/or un-business-like.[5]

Naming and defining things (that is, creating a solid business vocabulary) is the fundamental purpose of a concept model.[6]  A good concept model provides adequate vocabulary to support discriminating messages, indicating precisely the right word(s) to use for a given concept.

For example:

  • Suppose someone calls something a site.  The subject matter has to do with immunology.  Does site refer to a location where a vaccination took place (e.g., a doctor's office) or to an anatomical location where a vaccination was injected?  A good concept model would provide terminology to clearly distinguish the two concepts.

  • Suppose someone calls something a loss.  This designation could mean either the event of a loss (e.g., a house burns down) or the amount of the loss (e.g., the house was a 50% loss).  A good concept model would avoid this ambiguity, perhaps by offering two distinct terms loss event and loss amount.[7]

  • Suppose someone says vaccination.  This term could either mean a whole vaccination series (if a particular vaccination requires more than one dose) or an injection of any one dose in a particular vaccination series.  A good concept model will provide distinct vocabulary to talk about both meanings, as needed.[8]

  • Suppose someone says person has vehicle.  This verbal connection could mean any of the following:  person owns vehicle, person leases vehicle, person borrows vehicle, or person has access to vehicle.  Assuming the need, a good concept model will provide distinct wordings for each of these meanings.
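The 'loss' example above shows how distinct terms keep two meanings apart once they reach a system.  The sketch below is illustrative only.  The type names LossEvent and LossAmount follow the two terms suggested above, while the field names and the choice of a fraction as the unit are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossEvent:
    """The event of a loss, e.g. a house burns down."""
    description: str


@dataclass(frozen=True)
class LossAmount:
    """The extent of a loss, expressed as a fraction of value (assumed unit)."""
    event: LossEvent
    fraction_of_value: float  # 0.5 means a 50% loss


fire = LossEvent("house burns down")
amount = LossAmount(event=fire, fraction_of_value=0.5)

# The two meanings of 'loss' can no longer be confused with each other.
print(f"{amount.event.description}: {amount.fraction_of_value:.0%} loss")
```

With a single field named 'loss', a recipient would have to guess which of the two meanings the sender intended.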

3. Precise

A precise[9] message is one that uses terms and wordings from a concept model correctly.

In subject matter of any complexity — which is to say virtually all business subject matter — precision in word choice can make a huge difference in the ultimate effectiveness of a communication.  There is simply no word like exactly the right word.

Sometimes the choice of word for some concept in a message is simply wrong.   Such usage can be highly misleading.  For example:

  • Using extension to mean 'an offering of a product given to a prospect when the prospect clicks on an ad', rather than how the concept model defines it, 'an additional period of time given to a prospect to accept an offer'.[10]

  • Using borrower to refer to a party that completes a loan application.  A party becomes a borrower only if their loan application is approved and funded.  Assuming a robust concept model, the error of such usage would be obvious.

A robust concept model addresses the deeper semantics from which misunderstandings and misinterpretations of terminology often spring.  Root causes of ambiguity can often be eliminated only by reflecting each concept's logical connections with other concepts.  By addressing these connections, a concept model actually represents more than just a business vocabulary; it represents a structured business vocabulary.

Examples of this kind of problem in creating messages:

  • Use of a term for a concept where a role name would be more accurate, or vice versa (e.g., 'party' vs. 'applicant' vs. 'owner' vs. 'leaser').[11]

  • Use of a term for a concept where a term for one of the concept's categories or the concept's super-category would be more accurate (e.g., 'limited liability corporation' vs. 'corporation' vs. 'party').

  • Use of a term for a concept where a term for either the whole or a part of the whole would be more accurate (e.g., 'chassis' vs. 'vehicle').

  • Use of a term for the class of a thing where a term for the thing itself should be used, or vice versa (e.g., 'tower' vs. 'Eiffel Tower').
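The borrower example lends itself to a short sketch.  Assuming a loan application carries 'approved' and 'funded' indicators (hypothetical field names), the correct role name for the party can be derived rather than left to individual word choice.  The function name role_of_party is likewise invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class LoanApplication:
    """A loan application completed by some party (field names are assumptions)."""
    party_name: str
    approved: bool = False
    funded: bool = False


def role_of_party(application: LoanApplication) -> str:
    """Return the role name that fits the party's current situation.

    A party becomes a borrower only if their loan application is
    approved and funded.  Until then the party is an applicant.
    """
    if application.approved and application.funded:
        return "borrower"
    return "applicant"


pending = LoanApplication("Pat Example")
print(role_of_party(pending))  # prints "applicant"
```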

4. Reliable

A reliable message is one that is compliant with all relevant business rules.

Much confusion arises over business rules.  Professionals who work with data/system architectures often have a technical view of them.  That's off-target.  Business rules are not data rules or system rules.  A true business rule is a criterion for shaping behavior or making decisions in actually running the business.  Business rules are about shaping business activity, not data — at least directly.

I recently read the following statement about data quality:  "Business rules capture accurate data content values."  No.  Business rules are about running the business correctly.

If the business is run correctly then, of course, its business communications will be formed correctly.  If its business communications are formed correctly, then the content of its data/system architecture will also be correct.  So yes, business rules result in correct data, but more importantly correct data arises because business activity is conducted correctly in the first place.

In other words, data quality isn't really about the quality of your data, it's about the quality of your business rules.

Problems with data quality arising from failure to consistently follow appropriate business rules in business activity[12] are often illustrated by very simple examples such as the following.  Don't be fooled!  These examples barely scratch the surface — they just happen to be relatively easy to talk about.

  1. Data in a field is invalid because it violates some definitional business rule(s) — for example, social security numbers are found in a field for the last name of a person.  Reasonable definitional rules would disallow numeric values.

  2. Data in a field is invalid because it violates some minimum or maximum threshold — for example, a number greater than 99 is found in a percentile field.

  3. Numeric data in a computed field fails to comply with some computation rule — for example, social security tax is calculated incorrectly.

  4. Alpha data in a derived field fails to comply with some derivation rule — for example, a valid-candidate-for-insurance flag is set to 'yes' although the person has been convicted of a felony involving a motor vehicle.
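Three of the four simple examples can be sketched as checks on individual values.  This is a minimal Python sketch, not a recommended rule engine.  All function names and messages are hypothetical, and the sketch assumes only what the examples state: numeric characters do not belong in a last name, a percentile may not exceed 99, and a person convicted of a felony involving a motor vehicle is not a valid candidate for insurance.

```python
from typing import List


def last_name_problems(value: str) -> List[str]:
    """Definitional rule: a last name may not contain numeric characters."""
    problems = []
    if any(ch.isdigit() for ch in value):
        problems.append("last name contains numeric characters")
    return problems


def percentile_problems(value: int) -> List[str]:
    """Threshold rule: a percentile may not be greater than 99."""
    problems = []
    if value > 99:
        problems.append("percentile is greater than 99")
    return problems


def candidate_flag_problems(flag: str, has_motor_vehicle_felony: bool) -> List[str]:
    """Derivation rule: the flag may not be 'yes' for a person convicted
    of a felony involving a motor vehicle."""
    problems = []
    if flag == "yes" and has_motor_vehicle_felony:
        problems.append("valid-candidate flag is 'yes' despite felony conviction")
    return problems


print(last_name_problems("123-45-6789"))
print(percentile_problems(120))
print(candidate_flag_problems("yes", has_motor_vehicle_felony=True))
```

Checks like these catch a violation only after the data exists.  The business rule itself belongs upstream, in the business activity that produces the value.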

Each of the problems above basically addresses values of just a single field.  Often, data in one or more fields can collectively violate some business rule(s) so as to represent a conflicting or prohibited business situation.  The following examples illustrate.  Each example is first expressed as a business rule[13] then as a corresponding data constraint.  Incidentally, these examples also illustrate the difference between communicating in business terms vs. communicating in data-speak.

  1. Business rule:  A customer must have an assigned agent if the customer has placed an order.

Expressed as a corresponding data constraint:  To be correct, valid data is required in the assigned-agent field of a customer record if any orders are listed for that customer record.[14]

  2. Business rule:  A claim may include at most one of an assigned adjudicator or a litigating lawyer.

Expressed as a corresponding data constraint:  To be correct, no data is permitted in the assigned-adjudicator field of a claim record if data appears in the litigating-lawyer field of that record, and vice versa.

  3. Business rule:  The payee of a claim payment for a claim must be a party who makes the claim.

Expressed as a corresponding data constraint:  To be correct, any data in the payee field of a claim-payment record must indicate a party who is one of the parties listed as having made the claim.

  4. Business rule:  A loan application for a subject property that is subject to litigation may be approved only if the income of the applicant is greater than 20% of the estimated value of the property.

Expressed as a corresponding data constraint:  To be correct, a loan-application record linked to a subject-property record that is flagged as being subject to litigation must not be flagged as approved if the (numeric) data in the income field of the applicant record is not more than 20% of the (numeric) data in the estimated-value field of the property record.
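The four business rules above can also be sketched as checks written in business terms rather than in terms of fields and records.  The following Python sketch is illustrative only.  The class, attribute, and function names are assumptions made for the example, and each function simply answers whether the situation it is given complies with the rule.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Customer:
    name: str
    assigned_agent: Optional[str] = None
    orders: List[str] = field(default_factory=list)


@dataclass
class Claim:
    claimants: List[str]
    assigned_adjudicator: Optional[str] = None
    litigating_lawyer: Optional[str] = None


def customer_has_agent_if_ordered(customer: Customer) -> bool:
    """A customer must have an assigned agent if the customer has placed an order."""
    return not customer.orders or customer.assigned_agent is not None


def claim_has_at_most_one_handler(claim: Claim) -> bool:
    """A claim may include at most one of an assigned adjudicator or a litigating lawyer."""
    return claim.assigned_adjudicator is None or claim.litigating_lawyer is None


def payee_made_the_claim(payee: str, claim: Claim) -> bool:
    """The payee of a claim payment for a claim must be a party who makes the claim."""
    return payee in claim.claimants


def loan_approval_permitted(applicant_income: float,
                            estimated_property_value: float,
                            property_in_litigation: bool) -> bool:
    """A loan application for a property subject to litigation may be approved
    only if the applicant's income is greater than 20% of the estimated value."""
    if not property_in_litigation:
        return True  # this particular rule places no restriction
    return applicant_income > 0.20 * estimated_property_value


customer = Customer("Acme", orders=["order-1"])
print(customer_has_agent_if_ordered(customer))  # prints "False"
```

Note how each function name and docstring reads like the business rule, not like the data constraint.  That is the difference between communicating in business terms and communicating in data-speak.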

5. Useful

A useful message is one that is fit for business purpose.  Assessing fitness for business purpose requires that the purpose of the message, and all others like it, is stated or described explicitly.  For example, the purpose might be "to take and fulfill orders for products of a certain kind."  If all other semantic quality dimensions are satisfied, the message will be deemed useful for that purpose.  For other purposes it is likely to prove less useful — or not useful at all.

In the simplest terms, the purpose of a message is documentation that explains the intended use of the message.  That purpose and its exact scope might not be fully evident from the content of the message itself.  The purpose is evidence that can be inspected to pin down the exact context in which the message is meaningful or relevant.  For example, given the purpose "to take and fulfill orders for products of a certain kind" a message might not be useful for products not of that certain kind.

To be useful, the parties accessing the message need to have agreed to the purpose explicitly.

  • Actively participating parties (e.g., ones inside the immediate value stream of business activity) should agree to the common purpose in advance.  Ideally, this purpose should arise from, and be aligned with, an explicit business strategy for the problem domain.

  • Passively participating parties (e.g., ones outside the immediate value stream of business activity) should explicitly acknowledge and accept the purpose when they eventually attempt to make use of the message.

6. Sufficient

A sufficient[15] message is one that can be taken with confidence to satisfy a need.  Assessing whether a message satisfies a need requires that the need, like purpose, be agreed explicitly.  Unlike purpose, however, the need itself does not have to be documented explicitly.  Instead, the parties must agree that messages of all types falling within scope can convey everything necessary to satisfy the purpose.

In effect, the parties are simply stipulating that the concept model is complete.  If some vocabulary needed in some message to satisfy the purpose cannot be found in the concept model, then the need cannot be fully satisfied.  There would literally be no words (standard business vocabulary) to talk about it.  Therefore, it could not be communicated (at least with any confidence).

For example, suppose the purpose is again "to take and fulfill orders for products of a certain kind."   Someone on the receiving end of a message within scope needs 'wheel diameter', but 'wheel diameter' cannot be found in the concept model.  As a consequence, the message cannot legitimately supply it; the message is therefore not sufficient.

The concept model thus plays a sovereign role in determining whether any message within scope can be deemed sufficient.[16]
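The 'wheel diameter' example can be sketched as a simple vocabulary check.  This is a minimal sketch, assuming the concept model can be reduced to a set of standard terms.  A real concept model is far richer than a word list.  The function name missing_vocabulary and the sample terms other than 'wheel diameter' are hypothetical.

```python
from typing import Iterable, List


def missing_vocabulary(needed_terms: Iterable[str],
                       concept_model_terms: Iterable[str]) -> List[str]:
    """Return the needed terms for which the concept model has no standard term.

    An empty result means the need can be expressed with the vocabulary at hand.
    """
    known = {term.lower() for term in concept_model_terms}
    return sorted(term for term in needed_terms if term.lower() not in known)


# Hypothetical vocabulary for the purpose "to take and fulfill orders
# for products of a certain kind".
concept_model = {"customer", "order", "product", "quantity ordered"}
needed = ["product", "quantity ordered", "wheel diameter"]

print(missing_vocabulary(needed, concept_model))  # prints "['wheel diameter']"
```

Any term that comes back from this check marks a need the message cannot legitimately satisfy.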

Summary

The six dimensions of semantic quality get to root causes of 'data quality' problems.  Communicating about difficult subject matter is hard to begin with.  Blind communication to people you can't converse or interact with directly is the hardest of all.  It requires order-of-magnitude sophistication in the techniques used to form the messages.  Concept models and business rules provide the necessary tools.

Appendix 1.  Typical Data Quality Dimensions[17]

| Dimension of Data Quality | Description | Measure | Unit of Measure |
|---|---|---|---|
| Completeness[18] | The proportion of stored data against the potential of "100% complete". | A measure of the absence of blank (null or empty string) values or the presence of non-blank values.[19] | Percentage |
| Uniqueness | No thing will be recorded more than once based upon how that thing is identified. | Analysis of the number of things as assessed in the 'real world' compared to the number of records of things in the data set. | Percentage |
| Timeliness | The degree to which data represent reality from the required point in time. | Time difference. | Time |
| Validity | Data is valid if it conforms to the syntax (format, type, range) of its definition.[20] | Comparison between the data and the metadata or documentation for the data item. | Percentage of data items deemed valid to invalid |
| Accuracy | The degree to which data correctly describes the "real world" object or event being described.[21] | The degree to which the data mirrors the characteristics of the real world object or objects it represents. | Percentage of data entries that pass the data accuracy rules |
| Consistency | The absence of difference, when comparing two or more representations of a thing against a definition.[22] | Analysis of pattern and/or value frequency. | Percentage |
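To see why a quantitative measure of this kind says little about semantic quality, consider a minimal sketch of the completeness measure as described in the table.  The function name completeness_percentage is hypothetical, and the sketch assumes 'blank' means a null or an empty (or whitespace-only) string.

```python
from typing import Optional, Sequence


def completeness_percentage(values: Sequence[Optional[str]]) -> float:
    """Percentage of values that are not blank (null or empty string)."""
    if not values:
        raise ValueError("cannot measure completeness of an empty column")
    non_blank = sum(1 for v in values if v is not None and v.strip() != "")
    return 100.0 * non_blank / len(values)


# Every value is present, so the column scores 100% complete,
# yet nothing here tells a recipient what 'PT' is supposed to mean.
codes = ["PT", "PT", "PT", "PT"]
print(completeness_percentage(codes))  # prints "100.0"
```

A perfect score here is fully compatible with a message that fails the very first dimension of semantic quality, Readable.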

References

[1]  I would almost prefer to call them the six dimensions of meaning quality, but unfortunately that label is a bit cryptic. (!)

[2]  Many data professionals do understand the point, but do not know quite how to articulate it or feel powerless to do much about it.

[3]  From a business communications perspective, a code used as a stored value in a field is actually a name for a shared concept about some thing.  That thing sometimes, but by no means always, exists in the real world.  For example, the real MA (Massachusetts) can't be put in a file or database (way too large!).  The code 'MA' is the name for our shared understanding (concept) of that U.S. state as it exists in the real world.

[4]  From "Six Myths about Data Quality", by Steven Sarsfield, January 28, 2017 https://www.ewsolutions.com/six-myths-data-quality/

[5]  For examples and guidelines, refer to How to Define Business Terms in Plain English:  A Primer (free download) http://www.brsolutions.com/b_ipspeakprimers.php

[6]  Refer to:  "What Is a Concept Model?" by Ronald G. Ross, Business Rules Journal, Vol. 15, No. 10 (Oct. 2014) http://www.brcommunity.com/a2014/b779.html

[7]  There are many words in English that are used loosely either for some thing, or for a quantity of that thing.  They generally make poor terms without qualification.

[8]  Providing vocabulary to distinguish between a thing and a series of events related to that thing is a common challenge in concept modeling.

[9]  The word 'accurate' or 'accuracy' (often cited as a data quality dimension) is not the best choice for this dimension.  You can never really be sure that a message is true about the real world beyond what its specified semantics allow.

[10]  Perhaps even worse is being inconsistent in usage, e.g., sometimes the term means one thing, and sometimes another.  Such terms are called homonyms (one word or word phrase, but multiple meanings).  Synonyms (different words or word phrases standing for the same meaning) also present challenges for effective communication, though generally not as difficult.

[11]  Role names are always related to verb concepts, i.e., named connections or characteristics pertaining to noun concepts.  For example, applicant arises from the verb concept party applies for mortgage; owner arises from party owns property; leaser arises from party leases property.

[12]  or following the wrong rules or following no rules at all

[13]  Using RuleSpeak®, free on www.RuleSpeak.com

[14]  Refer to Business Rule Concepts:  Getting to the Point of Knowledge (4th ed), Ronald G. Ross, 2013, pp. 99-100.

[15]  The term 'complete' or 'completeness' (often cited as a data quality dimension) is not the best choice for this dimension.  Often there are no inherent criteria for how 'complete' some message could possibly be.

[16]  A data model won't do.  Data models are simply not robust business vocabulary models.

[17]  Taken from:  The Six Primary Dimensions For Data Quality Assessment:  Defining Data Quality Dimensions, by the DAMA UK Working Group, 2013, 16pp, https://www.whitepapers.em360tech.com/wp-content/files_mf/1407250286DAMAUKDQDimensionsWhitePaperR37.pdf

[18]  "Business rules … define what '100% complete' represents."  This sentence is the only reference to business rules in the DAMA UK Working Group document.  Business rules that dictate when something must exist can be quite complex.  Also, just because a property of a thing is known doesn't mean what's known satisfies all relevant business rules.

[19]  Business people don't naturally talk about blank values, nulls, or empty strings.

[20]  "Range" is actually semantic rather than simply syntactic.  Such constraints arise from the definition and/or business rules.

[21]  "It is common to use third party reference data from sources which are deemed trustworthy and of the same chronology."  DAMA UK Working Group document.

[22]  An assessment of multiple copies of data about the same thing.

For further information, please visit BRSolutions.com     

# # #

Standard citation for this article:


Ronald G. Ross , "The Business View of Data and Data Quality: The Six Dimensions of Semantic Quality" Business Rules Journal Vol. 19, No. 4, (Apr. 2018)
URL: http://www.brcommunity.com/a2018/b946.html

About our Contributor:



Ronald G. Ross is Principal and Co-Founder of Business Rule Solutions, LLC, where he actively develops and applies the IPSpeak methodology including RuleSpeak®, DecisionSpeak and TableSpeak.

Ron is recognized internationally as the "father of business rules." He is the author of ten professional books including the groundbreaking first book on business rules, The Business Rule Book, in 1994.


Ron serves as Executive Editor of BRCommunity.com and its flagship publication, Business Rules Journal. He is a sought-after speaker at conferences world-wide. More than 50,000 people have heard him speak; many more have attended his seminars and read his books.

Ron has served as Chair of the annual International Business Rules & Decisions Forum conference since 1997, now part of the Building Business Capability (BBC) conference where he serves as Co-Chair. He was a charter member of the Business Rules Group (BRG) in the 1980s, and an editor of its Business Motivation Model (BMM) standard and the Business Rules Manifesto. He is active in OMG standards development, with core involvement in SBVR.

Ron holds a BA from Rice University and an MS in information science from Illinois Institute of Technology. Find Ron's blog on http://www.brsolutions.com/category/blog/. For more information about Ron visit www.RonRoss.info. Tweets: @Ronald_G_Ross
