Original article published by Information Management, April 6, 2015.
Ever wonder how to migrate from a relational database management system (RDBMS) schema structure to a big data platform (NoSQL database implementation). This article (part 2 of an overall series; Big Data: part 1) offers some practical advice for such a migration.
In the following sections you’ll find information that will show you the ins and outs of migrating from a traditional RDBMS schema structure to a big data storage structure. While most of the examples are made to be fairly generic, the final structure will be created for use in the “Wide Column / Column Family Store” variety. There are many ways of getting to the end goal. The process I illustrate is one of many, but it does work well for what I’m trying to show.
Warning! Here’s where things will get a little deep, technically speaking. Not to worry though! Just like in American football when you know you’ll be breaking through the defenses line, buckle up your chin-strap and keep going. The reality is that even if this discussion causes you to twitch a bit (indicating that you may be on the upper end of the non-technical speaking population) you’ll probably come through it just fine.
Relational to Composite Structure
For the sake of narrowing the scope of this discussion, I’m going to limit the focus to the “De-normalize and duplicate” approach. We’ll be ignoring the “Balanced” approach because it takes too much time to explain correctly. We’ll also be ignoring the “Model it as-is” approach because that’s a poor a practice and doesn’t perform well.
With our discussion scope sufficiently narrowed, we’ll start by tackling a relatively simple relational structure. The very first thing we’ll need to do is to evaluate which entities can be de-normalized to become what I call super-classes. “Super-class” is not a standard big data term. It’s my term and I find it makes things easier for the initial discussion. I’ll explain why later. Each of these super-classes will be used to help define the new composite structure (an actual Big Data term). We’ll be using the following Entity Relationship Diagram (ERD) to lay out the steps needed to identify our super-classes.
In this example we have a relational structure that is used to manage information about Really Simple Syndication (RSS) feeds. Remember, there are no joins and no referential integrity in the No SQL database world. Thus, we’ll be adapting the relational structure represented in the ERD to one that matches more of a composite structure.
Finding and de-normalizing the entities in our diagram into our super-classes can be as easy as asking a few simple questions. These questions will help us identify the super-class candidates, and the de-normalization will help us create them correctly. Ready? Let’s go!
Does the entity contain foreign keys (references) to other entities?
If the answer to the foreign key question is no, we have to ask an additional clarifying question. Do other entities reference this entity? If yes, in most cases, we have a pure reference entity that will be absorbed into one or more super-classes. Reference entities become part of the super-classes that reference them (de-normalization).
If yes, this is an indicator that the entity could be either an intersection/junction or an actual super-class. To discover which, we’ll ask a clarifying question. Does the entity contain only foreign keys, foreign keys that all make up the primary key or foreign keys with an additional, separate primary key, dates or other superfluous information? A ‘no’ means we’ve identified a super-class. ‘Yes’ indicates that we have identified an intersection/junction entity. We’ll ignore any entities of this type (for now).
Let’s take a look at how this type of analysis works with our existing schema structure. Using the questions we just went through, it looks as though we have three reference entities: “categories”, “sources”, and “item_types.” We also have an intersection/junction entity: “feed_contents.” That leaves the two remaining entities: “feeds” and “contents” as our identified super-classes. That’s it, right? Not quite.
We haven’t touched on the ever-so-popular: (pause for effect…) common modeling jargon, lingo, you know – the terms you just have to know! Let’s get that taken care of now, shall we? The following table shows the common terms that we’ll need to understand to make that all important transition in our thinking when modeling from relational to non-relational.
See! That wasn’t so bad. The key here is that unlike in the relational world we have to being to think about structures in more finite terms. A relational column is broken down into a name-value pair. Primary keys become row keys. Tables are column families and databases are key spaces. Column Families can be related to one another but not in a traditional sense so the concepts of joins and referential integrity do not translate well at all.
Before we get into the resulting structure, we still have to address a few missing parts. Namely, we need to follow through with the de-normalization of the reference entities. Let’s start with the reference table “item_types.” Its information will need to be duplicated into both the “feeds” and the “contents” super-classes. That doesn’t mean simply adding the columns from one table to another. Instead, think of it more as nesting. I’ll use braces and brackets to illustrate table boundaries, nesting depth, and column ownership as follows:
You see, the “item_types” entity literally becomes part of the “feeds” entity. Using that same technique, let’s fast-forward and show how our “feeds” super-class looks after de-normalization.
Along with the modeling terms and the de-normalization technique we just learned about, we’ll also need to know about composite structures and how they can look in Wide Column Stores. Each has their uses, pros, and cons. While I’m going to introduce these structures and their uses, I will not get into all of the pros and cons as there are currently too many discussions for, and arguments against, one structure over another to list them accurately. That said, here we go.
The column family structure is what I would call a “class.” It’s a pretty generic structure and can be easily used to model simple entities.
The super column family structure is derived from what I’ve been calling a “super-class.” It’s the one structure that allows us to model our de-normalized entities and accommodates elements of the column family structure also.
Let’s look at an illustration of how our “feeds” super-class converts into the “feeds” super column family.
I know what you’re thinking: “what about that intersection/junction entity we’ve been ignoring?” You’re right. Now is definitely the right time to bring that up. How are we going to handle the many-to-many relationship? My preferred way is by modeling it explicitly.
While this may seem like a waste of space, this model allows us to efficiently query all the contents for a specific feed as well as all feeds that have a specific content without using the actual super column families making the whole lookup process more efficient. This form of modeling can also be called: Query Pattern Modeling.
How many lookup column families and which data columns you have in those families depends what you’re intending to use in your application data-wise and how you’d like to model your query patterns. Ultimately, modeling the data structure based on the type of data you’re after is the right way to go. I will note that some structures will require much less de-normalization than ours. In our example we opted for super column families to hold the bulk of our data and did our de-normalization for that up-front.
A note on performance: Modeling the intersection/junction explicitly caused us to create two lookup column families. It also caused some obvious, additional, de-normalization for both “feeds” and “contents.” Even with the extra duplication, this method is still very reasonable to populate lookup information for both feed and content detail pages. Both will perform well with this model. Both will cause two lookups: one to query feeds and one to query contents. This will remain constant regardless of the total number of feeds and the total number of contents (a good thing).
In this article, I’ve illustrated how to define a data model for a “Wide Column / Column Family Store” NoSQL database implementation. I’ve shown how to define the model and discussed an applicable modeling technique for our narrowly defined scope.
The information presented is relevant regardless of structure modeling technique you end up using for the “Wide Column / Column Family Store” NoSQL database implementation you choose.