Thomas Casselberry

Original article published by Information Management, April 6, 2015.

Ever wonder how to migrate from a relational database management system (RDBMS) schema structure to a big data platform (a NoSQL database implementation)? This article (part 2 of an overall series; part 1 follows below) offers some practical advice for such a migration.

In the following sections you’ll find information that will show you the ins and outs of migrating from a traditional RDBMS schema structure to a big data storage structure. While most of the examples are fairly generic, the final structure will be of the “Wide Column / Column Family Store” variety. There are many ways of reaching the end goal; the process I illustrate is one of many, but it works well for what I’m trying to show.

Warning! Here’s where things get a little deep, technically speaking. Not to worry, though! Just like in American football when you know you’ll be breaking through the defense’s line, buckle up your chin strap and keep going. The reality is that even if this discussion causes you to twitch a bit (indicating that you may be on the upper end of the non-technical-speaking population), you’ll probably come through it just fine.

Relational to Composite Structure

For the sake of narrowing the scope of this discussion, I’m going to limit the focus to the “De-normalize and duplicate” approach. We’ll be ignoring the “Balanced” approach because it takes too much time to explain correctly. We’ll also be ignoring the “Model it as-is” approach because that’s a poor practice and doesn’t perform well.

With our discussion scope sufficiently narrowed, we’ll start by tackling a relatively simple relational structure. The very first thing we’ll need to do is to evaluate which entities can be de-normalized to become what I call super-classes. “Super-class” is not a standard big data term. It’s my term and I find it makes things easier for the initial discussion. I’ll explain why later. Each of these super-classes will be used to help define the new composite structure (an actual Big Data term). We’ll be using the following Entity Relationship Diagram (ERD) to lay out the steps needed to identify our super-classes.

[Figure: Entity Relationship Diagram for the RSS feed schema (BigDataProjects_part2_1)]

In this example we have a relational structure used to manage information about Really Simple Syndication (RSS) feeds. Remember, there are no joins and no referential integrity in the NoSQL database world. Thus, we’ll be adapting the relational structure represented in the ERD to one that more closely matches a composite structure.

Finding and de-normalizing the entities in our diagram into our super-classes can be as easy as asking a few simple questions. These questions will help us identify the super-class candidates, and the de-normalization will help us create them correctly. Ready? Let’s go!

Does the entity contain foreign keys (references) to other entities?

If yes, this is an indicator that the entity could be either an intersection/junction entity or an actual super-class. To discover which, we ask a clarifying question: does the entity contain nothing but foreign keys (whether they all make up the primary key or sit alongside a separate primary key), plus perhaps dates or other superfluous information? A ‘yes’ indicates that we have identified an intersection/junction entity; we’ll ignore entities of this type for now. A ‘no’ means we’ve identified a super-class.

If the answer to the foreign key question is no, we ask an additional clarifying question: do other entities reference this entity? If they do, in most cases we have a pure reference entity that will be absorbed into one or more super-classes. Reference entities become part of the super-classes that reference them (de-normalization).

Let’s take a look at how this type of analysis works with our existing schema structure. Using the questions we just went through, it looks as though we have three reference entities: “categories”, “sources”, and “item_types.” We also have an intersection/junction entity: “feed_contents.” That leaves the two remaining entities: “feeds” and “contents” as our identified super-classes. That’s it, right? Not quite.
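The classification questions above can be sketched as code. This is a hypothetical helper, not a standard tool, and the foreign-key lists below are my assumptions about the ERD (the diagram itself isn’t reproduced here); they are chosen to match the conclusions the article reaches.

```python
def classify(entity, schema):
    """Classify an entity as 'reference', 'junction', or 'super-class'."""
    fks = schema[entity]["foreign_keys"]
    if not fks:
        # No foreign keys: if other entities reference us,
        # we're a pure reference entity to be absorbed.
        referenced = any(entity in schema[e]["foreign_keys"] for e in schema)
        return "reference" if referenced else "super-class"
    # Has foreign keys: nothing but keys (plus dates or other
    # superfluous columns) means intersection/junction.
    return "junction" if schema[entity]["only_keys_and_dates"] else "super-class"

# Assumed shape of the RSS-feed schema, reduced to what the questions need.
schema = {
    "categories":    {"foreign_keys": [], "only_keys_and_dates": False},
    "sources":       {"foreign_keys": [], "only_keys_and_dates": False},
    "item_types":    {"foreign_keys": [], "only_keys_and_dates": False},
    "feeds":         {"foreign_keys": ["sources", "item_types", "categories"],
                      "only_keys_and_dates": False},
    "contents":      {"foreign_keys": ["item_types"],
                      "only_keys_and_dates": False},
    "feed_contents": {"foreign_keys": ["feeds", "contents"],
                      "only_keys_and_dates": True},
}

result = {name: classify(name, schema) for name in schema}
```

Running this classification reproduces the analysis below: three reference entities, one junction entity, and two super-classes.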

We haven’t touched on the ever-so-popular (pause for effect…) common modeling jargon – you know, the terms you just have to know! Let’s get that taken care of now, shall we? The following table shows the common terms we’ll need to understand to make that all-important transition in our thinking when modeling from relational to non-relational.

[Table: relational-to-NoSQL modeling terms (BigDataProjects_part2_2)]

See! That wasn’t so bad. The key here is that, unlike in the relational world, we have to begin to think about structures in more granular terms. A relational column breaks down into a name-value pair. Primary keys become row keys. Tables become column families, and databases become keyspaces. Column families can be related to one another, but not in the traditional sense, so the concepts of joins and referential integrity do not translate well at all.
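The term mapping can be made concrete with plain Python dicts. This is only a mental model, and every name here (“rss”, “feeds”, the row key format) is illustrative:

```python
# keyspace <- database, column family <- table,
# row key <- primary key, name-value pair <- column
keyspace = {
    "feeds": {                         # a column family
        "feed:1001": {                 # a row key
            "title": "Tech News",      # name-value pairs
            "url": "http://example.com/rss",
        },
    },
}

# Access is always by keyspace, column family, then row key:
row = keyspace["feeds"]["feed:1001"]
```

Note there is no way to express a join between two of these structures; any relationship has to be modeled by duplicating data, which is exactly where de-normalization comes in.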

Before we get into the resulting structure, we still have to address a few missing parts. Namely, we need to follow through with the de-normalization of the reference entities. Let’s start with the reference table “item_types.” Its information will need to be duplicated into both the “feeds” and the “contents” super-classes. That doesn’t mean simply adding the columns from one table to another. Instead, think of it more as nesting. I’ll use braces and brackets to illustrate table boundaries, nesting depth, and column ownership as follows:

[Figure: “item_types” nested within “feeds” (BigDataProjects_part2_3)]

You see, the “item_types” entity literally becomes part of the “feeds” entity. Using that same technique, let’s fast-forward and show how our “feeds” super-class looks after de-normalization.

[Figure: the de-normalized “feeds” super-class (BigDataProjects_part2_4)]
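Here is a hedged sketch of that nesting in code: the “item_types” reference entity is duplicated into both super-classes rather than referenced. The column names are assumptions, since the original diagram is an image not reproduced here:

```python
# A "feeds" row with the reference entity nested inside it.
feed_row = {
    "title": "Tech News",
    "url": "http://example.com/rss",
    "item_type": {                     # nested copy of an "item_types" row
        "name": "article",
        "description": "Long-form article",
    },
}

# The same "item_types" row is duplicated into "contents" as well.
content_row = {
    "headline": "Big Data in Practice",
    "item_type": dict(feed_row["item_type"]),  # duplicated, not referenced
}
```

The duplication is deliberate: each super-class carries its own private copy, so reading a feed or a content never requires a second trip to fetch item-type details.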

Along with the modeling terms and the de-normalization technique we just learned, we’ll also need to know about composite structures and how they can look in Wide Column Stores. Each has its uses, pros, and cons. While I’m going to introduce these structures and their uses, I will not get into all of the pros and cons, as there are currently too many discussions for, and arguments against, one structure over another to list them accurately. That said, here we go.

[Figure: the column family structure (BigDataProjects_part2_5)]

The column family structure is what I would call a “class.” It’s a pretty generic structure and can be easily used to model simple entities.

[Figure: the super column family structure (BigDataProjects_part2_6)]

The super column family structure is derived from what I’ve been calling a “super-class.” It’s the one structure that allows us to model our de-normalized entities, and it also accommodates elements of the column family structure.
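The difference between the two structures is one level of nesting. A minimal sketch, with all keys and values illustrative:

```python
# Column family: row key -> flat name-value pairs.
column_family = {
    "cat:1": {"name": "Technology"},
}

# Super column family: row key -> super columns -> name-value pairs.
# The extra level is where a de-normalized entity (like "item_type") lives.
super_column_family = {
    "feed:1001": {
        "details":   {"title": "Tech News", "url": "http://example.com/rss"},
        "item_type": {"name": "article"},
    },
}
```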

Let’s look at an illustration of how our “feeds” super-class converts into the “feeds” super column family.

[Figure: the “feeds” super-class converted into the “feeds” super column family (BigDataProjects_part2_7)]

I know what you’re thinking: “what about that intersection/junction entity we’ve been ignoring?” You’re right. Now is definitely the right time to bring that up. How are we going to handle the many-to-many relationship? My preferred way is by modeling it explicitly.

[Figure: lookup column families modeling the many-to-many relationship (BigDataProjects_part2_8)]

While this may seem like a waste of space, this model allows us to efficiently query all the contents for a specific feed, as well as all the feeds that carry a specific content, without touching the actual super column families, which makes the whole lookup process more efficient. This form of modeling is also called query pattern modeling.
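A sketch of that query-pattern model: one lookup column family per direction of the many-to-many relationship. All keys and values here are illustrative stand-ins for the structures in the figure:

```python
# "Give me all contents for a feed."
contents_by_feed = {
    "feed:1001": {"content:9": "Big Data in Practice",
                  "content:12": "NoSQL Basics"},
}

# "Give me all feeds carrying a content."
feeds_by_content = {
    "content:9": {"feed:1001": "Tech News"},
}

def contents_for(feed_key):
    # A single read against the lookup family; the super column
    # families holding the full rows are never touched.
    return sorted(contents_by_feed.get(feed_key, {}))

def feeds_for(content_key):
    return sorted(feeds_by_content.get(content_key, {}))
```

Each direction costs exactly one read regardless of data volume, which is the performance property discussed below.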

How many lookup column families you create, and which data columns you put in those families, depends on what your application intends to use data-wise and how you’d like to model your query patterns. Ultimately, modeling the data structure based on the type of data you’re after is the right way to go. I will note that some structures require much less de-normalization than ours. In our example we opted for super column families to hold the bulk of our data and did our de-normalization up-front.

A note on performance: modeling the intersection/junction explicitly caused us to create two lookup column families. It also caused some obvious additional de-normalization for both “feeds” and “contents.” Even with the extra duplication, this method is still a very reasonable way to populate lookup information for both feed and content detail pages. Both will perform well with this model, and each requires two lookups: one to query feeds and one to query contents. This remains constant regardless of the total number of feeds and the total number of contents (a good thing).

In this article, I’ve illustrated how to define a data model for a “Wide Column / Column Family Store” NoSQL database implementation. I’ve shown how to define the model and discussed an applicable modeling technique for our narrowly defined scope.

The information presented is relevant regardless of the structure modeling technique you end up using for the “Wide Column / Column Family Store” NoSQL database implementation you choose.

 

Thomas Casselberry

Original article published by Information Management, January 21, 2015.

 

So, you’ve succumbed to the buzz and now you’re looking around trying to make heads or tails of the mass amounts of information out there hyped up as “big data.” Or perhaps you’re even ready to start your own internal project to get your existing applications on the bandwagon. In either case, terrific! Your decision is a good one.

Unfortunately, now comes the flurry of potentially overwhelming questions:

  • Where do I start?
  • What are my expectations?
  • What does big data mean to my company?
  • What does big data mean in the context of our applications?
  • How do I assess my application needs?
  • How do I know or determine if big data solutions will work for us?

After some online research, you’ll quickly find that most folks are merely picking a place at the edge of the pool, dipping their toes in here and there to test the water temperature.

The reality is that it is incredibly difficult to define the term big data. Its meaning includes so much more than just storing and using large data sets. When you hear people referring to big data, what they’re actually referring to is the use of NoSQL database implementations to store and process large amounts of information.

Don’t be discouraged! In the following sections we’ll get into what those NoSQL databases are and how to identify which is best, if any, for your project[s]. That’s right; the goal here is to provide you with information that will allow you to draw the correct conclusions for your organization and your specific application[s] or project[s]. (In part two of this article, I explain how to migrate from relational databases to NoSQL.)

NoSQL databases aren’t really databases. In fact, they are nothing like a traditional relational database management system (RDBMS). Instead they are implementations of various data stores which do not have fixed schemas, referential integrity, defined joins, or a common storage model. Also, they typically do not adhere to ACID principles (atomicity, consistency, isolation, and durability) and have sometimes widely varied technologies behind them. The term NoSQL (or Not-only SQL) is intended to imply that many of these implementations also support SQL-like query capabilities.

In this big data market where the NoSQL database is king, there are more than 100 different offerings available in various licensed models. The fact that these non-databases vary is no accident. Each distinct implementation has different strengths, weaknesses, and generally accepted uses. However, the bulk of these break down into four major categories based on some common underlying characteristics — as shown in this chart:

[Chart: the four major NoSQL database categories]

Choosing the Right Path

A heavy emphasis should be placed on the definition of your requirements. What are those? Well, that’s a large discussion all by itself. However, I’ll try to quickly paraphrase for the purpose of furthering this topic of discussion: Data requirements are artifacts captured during the process of defining application behavior with respect to gathering, storing, retrieving, or displaying information (data).

For example, in your application are you processing stock quotes, working with CRM data, or processing social information? There are different needs for different application types and thus a varied number of NoSQL implementations, not all of which are designed to be applicable to your needs.

It may be that after careful evaluation, you determine that your current RDBMS approach is valid and appropriate for your application. That’s not a bad thing. Traditional RDBMS certainly has its place and will remain very relevant for business use well into the future.

You see, there’s loads of confusion about this big data thing because there is no One Path concept. It just doesn’t exist. What’s good for one business may not be good for another even though they are doing similar things. There are many factors that go into selecting the right implementation and, honestly, not everyone is careful or critical when evaluating their needs.

An Assessment

So you still think that you might be better off migrating away from RDBMS to a NoSQL implementation. This decision isn’t for the faint of heart. It requires real consideration and planning, which raises several additional questions:

Question 1: How do I know which implementation is correct for my needs?

Fortunately, there are certain high-level criteria that help us get the decision process started. One such criterion is determined by answering this question: is the application read- or write-intensive (e.g., large numbers of transactions in an OLTP system or large numbers of reads in an OLAP system) today? If the answer is no, and we’re merely dealing with volumes of information, we can automatically exclude items that fall into the “Column Families / Wide Column Stores” category. There are always caveats, but that’s a good general rule of thumb; an initial litmus test, if you will.

Let’s test this with an example: We’ve got a compiled application that processes transactions for a global book seller. This application sees no fewer than 70,000 transactions a minute, 24 hours per day, 7 days per week. Which category of NoSQL database implementations fits? It’s obvious, right? You bet! It’s the “Column Families/Wide Column Stores” category.

Let’s try another. In this example we’ve got a web-based application that allows title companies to securely sign large numbers of documents. The transactional volume isn’t large, nor is the number of reads. Which category fits now? Right! It’s the “Document Store” category, primarily because only a small amount of data changes; the signatures and whole documents need to be stored.

I think you’ve probably got a handle on how to determine initial fits by category now. So long as we ask the right questions, with the right view of our data requirements, we should always be able to identify the right category to start trying to assess which implementations might meet our big data needs.
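The litmus test from the two examples can be written as a tiny rule of thumb. The inputs and category names below are my simplifications for illustration; a real assessment weighs far more criteria:

```python
def initial_category(read_write_intensive, mostly_documents):
    """First-pass category fit based on the litmus-test questions."""
    if read_write_intensive:
        # Heavy OLTP writes or OLAP reads point at wide column stores.
        return "Column Families / Wide Column Stores"
    if mostly_documents:
        # Low volume, mostly whole documents to store and retrieve.
        return "Document Store"
    return "needs further assessment"

# The two worked examples:
book_seller = initial_category(read_write_intensive=True, mostly_documents=False)
title_company = initial_category(read_write_intensive=False, mostly_documents=True)
```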

We’re done then, right? We can pick any random implementation from the category we’ve identified, get it installed and configured, and deploy our application[s] so we can start touting our Big Data story! Hold on there; we’re not really done yet.

Question 2: What will we be using to communicate?

The language[s] your application[s] use to communicate are an important consideration when choosing the right NoSQL database implementation. In the earlier examples we were able to identify the right categories based on our knowledge of the requirements and the application usage. Now we need to narrow down the field of choices and zero in on what is likely to be right for our needs. Unfortunately, we don’t yet have enough information to identify, even at a high level, which implementation matches our communication needs.

We’ll need help to answer this and other questions, so we’ll employ the use of another table. In this table we’ve laid out a few of the most popular NoSQL database implementations, their protocol[s], API[s], licenses and replication models:

[Table: popular NoSQL database implementations with their protocols, APIs, licenses, and replication models (BigDataProjects_2)]

Clear as mud? Right! Basically what this table shows is exactly what you’ll find on the web: Every NoSQL database implementation has its own way of getting data into and out of its store. Fortunately though, this is precisely what we need to match our requirements with implementations in our selected category because, in the end, our application[s] will need to know how to interact with it.

Shall we try another example? This time, we’ll use the high-transaction application for our global book seller, which we’ve already matched to NoSQL database implementations in the “Column Families / Wide Column Stores” category. Now we take a look at the application’s communication requirements. The application today communicates via an ODBC driver (a compiled binary). We could assume that if we can use one compiled driver, we can use any of them. That line of thinking is not unheard of, and it would lead us to select Cassandra from the table above, because it resides in the right category and because it also uses a compiled driver: its Thrift driver. A stretch? Perhaps a small one, but it’s close enough for the purposes of our discussion.

Can it really be that easy? Well, yes and no. There are many other factors that we should consider when choosing a NoSQL database implementation. Some of these will leverage information from the table above. The additional criteria can be determined by answering the following questions:

  • Is this a commercial application? If so, is this for internal or external use? You see, there may be specific licensing restrictions that must be considered, especially in commercial, for-profit, applications.
  • Are there existing deployment restrictions such as supported operating systems or others like single vs. multi-server, replication model, or even needing to meet specific backup requirements for the company’s disaster recovery plan?
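The narrowing process above amounts to filtering a table of candidates by category, driver style, and license. Here is a hedged sketch; the rows are simplified stand-ins for the full table referenced earlier, and the attribute names are my own:

```python
# Partial, assumed stand-in for the implementations table.
implementations = [
    {"name": "Cassandra", "category": "wide-column",
     "driver": "compiled (Thrift)", "license": "Apache 2.0"},
    {"name": "HBase", "category": "wide-column",
     "driver": "compiled", "license": "Apache 2.0"},
    {"name": "MongoDB", "category": "document",
     "driver": "compiled", "license": "open source"},
]

def candidates(category, needs_compiled_driver):
    """Filter implementations by category and driver requirement."""
    return [impl["name"] for impl in implementations
            if impl["category"] == category
            and (not needs_compiled_driver or "compiled" in impl["driver"])]

matches = candidates("wide-column", needs_compiled_driver=True)
```

In practice you would extend the filter with the licensing and deployment criteria from the bullets above before declaring a match.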

Failing to ask these questions, or to heed the answers, may cause your project to fail due to a poorly selected match.

I hope this has helped in some small way to eliminate some of the confusion around big data and what it might mean for your company.