Cassandra Column Families

This article builds upon the explanation of HBase column families given in the article Column Families 101. If you are unfamiliar with the NoSQL concept of column families then you should read that article first. If you know what a column family is, but need to know specifically how Cassandra column families are mapped in Toad for Cloud Databases, then you’re in the right place!

First, we need to review Cassandra’s terminology and data organization. Both differ somewhat from those of HBase and, confusingly, Cassandra uses the term “column family” to mean something entirely different.


File:Cassandra_column_families_figure01.jpg

Figure 1 shows the internal structure of an HBase database, where the column family is simply an organization unit that groups related columns together to tell HBase that data for these columns should be stored in close proximity on disk. The only real difficulty in this is that column names can actually be dynamic data instead of static identifiers, which means Toad for Cloud Databases needs to pivot those columns and present them as rows in a child table to model this data in a relational format. In HBase though, the column family names are always static identifiers, much like the column names in a relational table.

Cassandra takes this problem to a whole new level of difficulty by allowing the super column names that correspond to HBase column families to also be dynamic data instead of static identifiers. With this extra permutation, there are actually five basic data patterns that can appear in Cassandra databases (one for standard column families and four for super column families), all of which I’m about to try and explain...  :-)

Standard Column Families

Cassandra has two different types of column family: standard column families, which may only contain static columns, and super column families, which are like HBase column families on steroids! More on those later.

File:Cassandra_column_families_figure02.jpg

Figure 2 shows how standard column families appear in a Cassandra database. The names of standard column families and their columns must be static identifiers. This makes life easy for Toad for Cloud Databases: it simply shows each standard column family as a separate remote object and maps it to a single relational table.
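To make the mapping concrete, here is a minimal Python sketch of how a standard column family could be turned into a single relational table. It is only an illustration of the idea, not the way Toad for Cloud Databases is implemented. The sample rows, the column names and the helper standard_to_table are all hypothetical.

```python
# Hypothetical standard column family: row key -> {column name: value}.
# The column names and values are illustrative only.
person = {
    "1": {"name": "Arthur", "email": "arthur@example.com"},
    "2": {"name": "Merlin"},  # a row does not have to supply every column
}

def standard_to_table(column_family, key_column="id"):
    """Return (header, rows) for a standard column family."""
    # The static column names become the table columns, in sorted order.
    columns = sorted({name for row in column_family.values() for name in row})
    header = [key_column] + columns
    rows = [
        [key] + [row.get(name) for name in columns]  # missing column -> None
        for key, row in sorted(column_family.items())
    ]
    return header, rows

header, rows = standard_to_table(person)
```

The row key becomes an ordinary key column (called "id" here by assumption), and any column a row does not supply simply shows up as an empty value.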

One thing to note at this point is that Cassandra does not have the concept of a “database” or a “table”. Instead, the outermost data container is called a keyspace, and it contains column families as the equivalent of tables. To my mind this structure exposes the internal Cassandra architecture in a way that introduces yet another obstacle to knowledge transference without really achieving anything. For all the semantic difference that keyspaces and column families offer, they should have just been called databases and tables, and the majority of people would have been much clearer about how the data was organized.

Super Column Families

The other type of Cassandra column family is a super column family. As illustrated in Figure 3, a super column family contains one or more super columns. Each super column contains zero or more columns and is organizationally equivalent to an HBase column family.

File:Cassandra_column_families_figure03.jpg

The problem here is that both the names of the super columns and those of the columns can be either static identifiers or dynamic data. This gives four different permutations that are all mapped differently in Toad for Cloud Databases.

In each of the four data patterns, Toad for Cloud Databases will show the super column family as the remote object, which is mapped to the main (or parent) table. Super columns that have dynamic names or contain columns with dynamic names are mapped as sub-tables. Note that two or more of these patterns may appear in the same super column family, resulting in multiple sub-tables being defined in the mapping process.
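The rule in the previous paragraph can be summarized in a few lines of Python. This is just a restatement of the rule as described above; the function name mapping_target and the pattern labels are my own and not part of Toad for Cloud Databases.

```python
def mapping_target(super_name_is_static, column_names_are_static):
    """Return where a super column lands in the relational mapping."""
    # Only the fully static pattern is folded into the main (parent) table.
    if super_name_is_static and column_names_are_static:
        return "main table"
    return "sub-table"

# Enumerate the four permutations as "<super column>/<column>".
patterns = {
    f"{'static' if s else 'dynamic'}/{'static' if c else 'dynamic'}": mapping_target(s, c)
    for s in (True, False)
    for c in (True, False)
}
```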

Data Pattern #1 – Static / Static

This is the easiest of the four patterns as it requires no sub-table. Borrowing from the example data in Column Families 101, this pattern is illustrated in Figure 4.

File:Cassandra_column_families_figure04.jpg

Both the super column names and the column names are static identifiers – hence the term “static/static”. In this case, Toad for Cloud Databases will map the “PERSON” super column family as the main table. It will also map the two super columns “Personal” and “Demographic” into the main table, which will produce a single table like the one shown in Figure 5, where the value of the “id” column is the row key from Figure 4.

File:Cassandra_column_families_figure05.jpg
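As a rough sketch of the static/static mapping, the following Python flattens each row of a super column family into one record of the main table. The column names inside “Personal” and “Demographic”, the sample values, and the convention of prefixing each column with its super column name are assumptions made for illustration; the actual column naming used by Toad for Cloud Databases may differ.

```python
# Hypothetical PERSON rows: row key -> {super column: {column: value}}.
person = {
    "1": {"Personal": {"name": "Arthur"}, "Demographic": {"city": "Camelot"}},
    "2": {"Personal": {"name": "Merlin"}, "Demographic": {"city": "Avalon"}},
}

def flatten_static(row_key, super_columns, key_column="id"):
    """Fold every static super column of one row into a single flat record."""
    record = {key_column: row_key}
    for super_name, columns in super_columns.items():
        for column_name, value in columns.items():
            # Assumed naming convention: <super column>_<column>.
            record[f"{super_name}_{column_name}"] = value
    return record

main_table = [flatten_static(key, row) for key, row in sorted(person.items())]
```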

Data Pattern #2 – Static / Dynamic

This pattern is the equivalent of the HBase dynamic column family, where the super column name is static but the column names are dynamic data values. The HBase example was that of the cars each person had owned, where the name of the car was the column name, and the year of manufacture was the column value. Figure 6 shows this data as it appears in Cassandra, and confirms the little-known fact that Merlin never owned a car and preferred to walk everywhere.

File:Cassandra_column_families_figure06.jpg

As with the HBase example, the static/static super columns still get mapped into the main table, and the dynamic columns in the static super column “Cars” are pivoted to become the rows of a “PERSON_CARS” sub-table as shown in Figure 7.

File:Cassandra_column_families_figure07.jpg

For this type of mapping, the user needs to give Toad for Cloud Databases static identifiers for both the column name and the column value. It knows the sub-table is about “Cars” from the super column name, but because the column name is itself data, and not a static identifier for the column value, the user needs to label both of them during the mapping process.
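The pivot for the static/dynamic pattern can be sketched in Python as follows. The car names, years, the labels "car_name" and "year", and the helper pivot_columns are hypothetical; they stand in for the two identifiers the user supplies during mapping.

```python
# Hypothetical data: row key -> {super column: {column name: value}}.
# Inside "Cars" the column NAME is data (the car) and the value is the year.
person = {
    "1": {"Cars": {"Mini": "1965", "Beetle": "1972"}},
    "2": {},  # Merlin: no "Cars" columns, so no sub-table rows
}

def pivot_columns(rows, super_name, name_label, value_label, key_column="id"):
    """Turn each dynamic column of one static super column into a sub-table row."""
    sub_table = []
    for row_key, super_columns in sorted(rows.items()):
        columns = super_columns.get(super_name, {})
        for column_name, value in sorted(columns.items()):
            sub_table.append(
                {key_column: row_key, name_label: column_name, value_label: value}
            )
    return sub_table

person_cars = pivot_columns(person, "Cars", "car_name", "year")
```

Each sub-table row carries the parent row key, so the sub-table can be joined back to the main table.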

Data Pattern #3 – Dynamic / Static

This data pattern is used where there is a group of the same repeating columns and the super column names are multiple instances of the same data entity. Clear as mud, no? An example of blog posts that each person has made is depicted in Figure 8.

File:Cassandra_column_families_figure08.jpg

Each of the dynamic super columns represents a separate posting in that person’s blog. The format of each post, however, is consistent and repeats the same set of columns with the same static identifiers as the column names.

For this data pattern, Toad for Cloud Databases will pivot both the columns and the super column itself to create rows in the sub-table. This is subtly different from pivoting only the columns as we did in the static/dynamic pattern. Because the columns are all static, they simply become additional columns in the sub-table as shown in Figure 9.

File:Cassandra_column_families_figure09.jpg

The user needs to provide two pieces of information when creating this type of mapping. Toad for Cloud Databases needs to be given a static identifier for the super column (in this example, “BLOG”) and also an identifier for the column (in this example, “post_id”).
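Here is a minimal Python sketch of the dynamic/static pivot. The post identifiers, the "title" and "body" columns, and the helper pivot_super_columns are hypothetical. I am also assuming that “post_id” labels the sub-table column that holds the dynamic super column name, and that every super column in the input is a blog post.

```python
# Hypothetical data: row key -> {dynamic super column name: {static column: value}}.
person = {
    "1": {
        "post-001": {"title": "Hello", "body": "First post"},
        "post-002": {"title": "Again", "body": "Second post"},
    },
}

def pivot_super_columns(rows, name_label, key_column="id"):
    """Turn each dynamic super column into one sub-table row."""
    sub_table = []
    for row_key, super_columns in sorted(rows.items()):
        for super_name, columns in sorted(super_columns.items()):
            # The super column name is data, so it gets its own labeled column.
            record = {key_column: row_key, name_label: super_name}
            # The static columns simply become additional sub-table columns.
            record.update(columns)
            sub_table.append(record)
    return sub_table

person_blog = pivot_super_columns(person, "post_id")
```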

A word of warning though... A super column family could conceivably have more than one repeating data entity saved into super columns. For example, the PERSON super column family could have blog entries, relatives, holidays and many others all mapped as super columns. The type of each super column could only be identified by the set of columns it contained or maybe a consistent pattern in the super column name. Data of this complexity is currently beyond the scope of Toad for Cloud Databases.
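To illustrate what identifying a super column “by the set of columns it contains” might look like, here is a small Python sketch that groups the super columns of one row by their column-name signature. This is purely a thought experiment with hypothetical data and a hypothetical helper, not something Toad for Cloud Databases does.

```python
from collections import defaultdict

# Hypothetical row mixing two repeating entities: blog posts and relatives.
row = {
    "post-001": {"title": "Hello", "body": "First post"},
    "post-002": {"title": "Again", "body": "Second post"},
    "uncle-bob": {"relation": "uncle", "age": "61"},
}

def group_by_signature(super_columns):
    """Group super column names by the set of column names they contain."""
    groups = defaultdict(list)
    for super_name, columns in super_columns.items():
        # frozenset(columns) is the unordered set of column names.
        groups[frozenset(columns)].append(super_name)
    return dict(groups)

groups = group_by_signature(row)
```

Even this simple approach breaks down as soon as individual posts omit optional columns, which hints at why data of this complexity is hard to discover automatically.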

Data Pattern #4 – Dynamic / Dynamic

This data pattern is where both the super column name and the column name contain dynamic data instead of static identifiers. A potential use case for this pattern would be some form of data that is aggregated at two levels. Figure 10 shows how web page visits by each person could be aggregated by the domain name, and by a relative page within that domain.

File:Cassandra_column_families_figure10.jpg

This is the most complex data pattern for discovery and mapping because essentially there is no pattern! None of the super columns or their columns have identifiers that are consistent across keys (rows). The only thing that is consistent for each super column is the two-level aggregation pattern.

Therefore it is up to the user to define this pattern during the mapping process. They do this by giving Toad for Cloud Databases the overall name of the data in the super columns; in Figure 10 this would be “web use”. Then they must also label the data in both the column names and column values. In this example, these names would be something like “web page” and “visit count”. Armed with this information, Toad for Cloud Databases would create the relational model shown in Figure 11.

File:Cassandra_column_families_figure11.jpg
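A two-level pivot of this kind could be sketched in Python as shown below. The domains, pages and counts are made up, and the "domain" label for the super column name is my own assumption; "web_page" and "visit_count" follow the labels suggested above. Since this pattern is not yet supported, treat this as one possible shape for the result rather than a description of the product.

```python
# Hypothetical data: row key -> {domain: {relative page: visit count}}.
# Both the super column names and the column names are data.
person = {
    "1": {
        "example.com": {"/index.html": 3, "/about.html": 1},
        "example.org": {"/": 7},
    },
}

def pivot_two_levels(rows, super_label, name_label, value_label, key_column="id"):
    """Turn every (super column, column) pair into one sub-table row."""
    sub_table = []
    for row_key, super_columns in sorted(rows.items()):
        for super_name, columns in sorted(super_columns.items()):
            for column_name, value in sorted(columns.items()):
                sub_table.append({
                    key_column: row_key,
                    super_label: super_name,   # first aggregation level
                    name_label: column_name,   # second aggregation level
                    value_label: value,
                })
    return sub_table

web_use = pivot_two_levels(person, "domain", "web_page", "visit_count")
```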

This data pattern is currently not supported by Toad for Cloud Databases version 1.1, but it is planned for the next release. We are hoping that by then someone may have provided more feedback about dynamic/dynamic data requirements. If you have any comments then please contribute to the Toad for Cloud Databases forums.