What are reference data?

Reference data are data that structure or control other data. Reference data are often stored in what people call ‘code tables’. A very well-known example is the ISO 3166-1 country code table. By using it, you don’t have to manage a list of countries yourself. It also makes data exchange easier: when you state that you identify a country with that widely known code, the chances are much higher that the receiving end will understand which country you mean.

But wait, is that all you can do with reference data? No!

Like all data, reference data is just a means to an end. So what can you do with reference data?

Read on…

With reference data you can:

Refer to externally managed entities in a controlled and standardized way
Build a taxonomy by classifying data
Categorize data for better navigation
Define the possible states of a type

Refer to externally managed entities in a controlled and standardized way

The example of country codes is a form of reference data you can use to refer to externally managed entities. A country is actually an entity, but not one that you are likely to manage internally. You won’t decide that the name of an existing country changes, nor will you decide that a country ceases to exist. You do, however, want to be able to refer to countries properly, especially when you are active in an international industry. The same goes for things like currencies, member states of the EU, members of trade unions, etc. For most of these topics, internationally standardized code lists are available. Often there are also industry-specific ones, like IATA codes for airlines and airports, and SWIFT codes for banks.
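To make this a bit more concrete, here is a minimal sketch in Python of validating a value against an externally managed code list. The table name COUNTRY_CODES and the function resolve_country are made up for this illustration, the table holds only a handful of ISO 3166-1 alpha-2 codes, and the display names are informal labels rather than the official ISO names.

```python
# Illustrative subset of ISO 3166-1 alpha-2 codes. A real system would load
# the full, externally maintained list instead of hard-coding a few entries.
# The display names are informal labels (an assumption of this sketch).
COUNTRY_CODES = {
    "DE": "Germany",
    "JP": "Japan",
    "NL": "Netherlands",
    "US": "United States of America",
}


def resolve_country(code: str) -> str:
    """Return the display name for a country code, or raise if it is unknown."""
    normalized = code.strip().upper()
    if normalized not in COUNTRY_CODES:
        raise ValueError(f"Unknown country code: {code!r}")
    return COUNTRY_CODES[normalized]


if __name__ == "__main__":
    print(resolve_country("nl"))
```

The point of the sketch is that the application only refers to the codes. Which countries exist and what they are called is decided elsewhere.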

Build a taxonomy by classifying data

When dealing with many instances of data (like many Customers or Products) you want to put some structure into all those data by giving them a specific type. Examples are ‘Private’ versus ‘Business’ customers for a bank. Or ‘Laptop’ versus ‘Desktop’ products for a computer store. Classification is about assigning an instance to a type and thus saying that this instance belongs to a group of instances with a distinct and observable characteristic. In classifications, one can create subtypes of types, resulting in a hierarchy (also called a taxonomy). One has to choose between mutually exclusive types, and therefore one creates a hierarchy with unique leaves. Defining the different types you want to use, and determining when data belongs to a certain type, is actually not that easy. An example could be the definition of a ‘Gold customer’ for customers who have placed a certain number of orders, and therefore get discounts or exclusive deals. But how many orders? In which time frame? And what if the customer places orders but doesn’t pay accordingly? Determining what is an instance of a concept is not easy! Ron Ross explains this very well in his book: Rules – Shaping Behavior and Knowledge.
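As a rough sketch of what such a classification could look like in code (Python here), the hierarchy can be stored as a table in which every type points to its parent type, and every instance is assigned exactly one type. All names below (PARENT_OF, PRODUCT_TYPE, the SKU identifiers, and the intermediate ‘Computer’ type) are hypothetical and only serve the illustration.

```python
# Hypothetical product taxonomy: each type points to its parent type
# (None marks the root). The type names are illustrative only.
PARENT_OF = {
    "Product": None,
    "Computer": "Product",
    "Laptop": "Computer",
    "Desktop": "Computer",
}

# Classification: every instance is assigned exactly one type, so each
# instance ends up under exactly one leaf of the hierarchy.
PRODUCT_TYPE = {
    "SKU-1001": "Laptop",
    "SKU-1002": "Desktop",
}


def type_path(type_name: str) -> list[str]:
    """Return the path from the root of the taxonomy down to the given type."""
    path = []
    current = type_name
    while current is not None:
        path.append(current)
        current = PARENT_OF[current]
    return list(reversed(path))


def classify(instance_id: str) -> list[str]:
    """Return the full type path for a classified instance."""
    return type_path(PRODUCT_TYPE[instance_id])


if __name__ == "__main__":
    print(classify("SKU-1001"))
```

Because each instance maps to a single type, asking for its place in the hierarchy always gives one unambiguous path.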

Categorize data for better navigation

When customers/consumers have to find their way through a lot of data, companies want to offer them some kind of navigation. For example, a grocery store wants to categorize its products into: fruits, vegetables, dairy, meat, fish, etc. Categorization is the grouping of instances of the same kind of data based on predefined criteria. Categorization often makes use of the available classifications of data. For example, you can group products on a combination of several characteristics (like beverages and dairy products). Because categorization can be done based on multiple criteria, the hierarchy that is created does not need to result in unique leaves. It might happen that one instance falls into multiple possible categories (e.g. milk) while others fall into just one category (e.g. yoghurt and orange juice). Categorization is also a form of taxonomy, slightly different from classification.
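The difference with classification shows up clearly in a small sketch (Python, with made-up names such as CATEGORIES_OF and products_in): an instance maps to a set of categories instead of to a single type.

```python
# Hypothetical category assignments for a grocery store. Unlike in a
# classification, one product may appear in several categories.
CATEGORIES_OF = {
    "milk": {"beverages", "dairy"},
    "yoghurt": {"dairy"},
    "orange juice": {"beverages"},
}


def products_in(category: str) -> list[str]:
    """Return all products in a category, sorted by name."""
    return sorted(
        product
        for product, categories in CATEGORIES_OF.items()
        if category in categories
    )


if __name__ == "__main__":
    print(products_in("dairy"))
    print(products_in("beverages"))
```

Milk shows up under both ‘dairy’ and ‘beverages’, which is exactly what you want for navigation and exactly what a classification would not allow.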

Define the possible states of a type

Most entities can have distinct states that mark the significant phases in their life cycle. For example, a ‘Customer request’ can be in states like: sent, received, validated, approved, rejected, in progress, handled, notified, closed. Because you don’t want to determine the possible states for each individual instance of the same kind, organizations keep state tables that control the possible options. These state tables are actually not just static tables, but often also contain constraints to disallow illegal/unwanted state changes (e.g. from ‘rejected’ to ‘in progress’). So these state options should actually be properly modeled as a state transition diagram instead of a “dumb” list.
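A state transition diagram can be captured in a simple table of allowed transitions. The sketch below (Python) uses the states from the ‘Customer request’ example. Which transitions are allowed is purely an assumption made for this illustration, as are the names ALLOWED_TRANSITIONS and change_state. Your own process will most likely look different.

```python
# Allowed state changes for a 'Customer request'. The states come from the
# example in the text; the transitions themselves are assumed for this sketch.
ALLOWED_TRANSITIONS = {
    "sent": {"received"},
    "received": {"validated", "rejected"},
    "validated": {"approved", "rejected"},
    "approved": {"in progress"},
    "rejected": {"notified"},
    "in progress": {"handled"},
    "handled": {"notified"},
    "notified": {"closed"},
    "closed": set(),
}


def change_state(current: str, target: str) -> str:
    """Return the new state, or raise if the transition is not allowed."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"Unknown state: {current!r}")
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Illegal state change: {current!r} -> {target!r}")
    return target


if __name__ == "__main__":
    state = change_state("received", "validated")
    print(state)
    try:
        change_state("rejected", "in progress")
    except ValueError as error:
        print(error)
```

The list of states alone would happily accept a jump from ‘rejected’ to ‘in progress’. The transition table is what blocks it.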

To summarize

Reference data restricts what otherwise would be free text data values and makes sure that only allowed/controlled options can be used.

How do you recognize reference data in practice? Reference data is what users typically encounter as the allowed values in the drop-down boxes or select input fields of the forms/screens they work with. Reference data is often an underestimated/undervalued form of data. This is because code tables are often hidden away in some obscure corner of the database. Or even worse, hard-coded into the application source code. There is actually often a lot of valuable knowledge hidden in reference data! So please take more notice of this form of data.

Reference data as a first class citizen

In Hapsah we made reference data a first-class citizen in our apps, allowing modelers and users to manage these data sets without needing any technical coding or database skills. And we use reference data as input for business rules to work with.