Column Families 101

Jump to: navigation, search

One of the challenges in creating a product like Toad for Cloud Databases is the mapping of NoSQL concepts to a relational model that will support SQL queries. The modelling of NoSQL column families is the clearest example of this problem.

So what is a column family? It helps if you start with the problem that column families solve. Imagine you have a database that contains census data. The main census table has one row for each person who participated in and would probably be keyed by a unique synthetic key. All singleton attributes such as date of birth, gender, address and so forth would exist in this table. Some repeating attributes like work history would be normalised out into related tables. Depending upon the size of the sample, a census may take in hundreds of millions of people, and would look something like Figure 1 - Census Data.

File:tcd_column_families_Figure1.jpg

The obvious problem is that analysing this census data to answer a question such as “How many men were born in each year?” also entails reading the name and address of each person, together with whatever other data is present in each row.

Columnar databases were devised to solve this problem. These databases store each column separately so that aggregate operations for one column of the entire table are significantly quicker than the traditional row storage model. The problem with this columnar approach is that getting all the data for a single person becomes very expensive because the database must fetch data from numerous places on disk and glue together all those columns to represent a single row.

Enter the column family. Columns of related data are grouped together within one table as shown in Figure 2 - Census Data in Column Families. The PERSON table has now been subdivided so that all personal name and address data is grouped together, as is statistical demographic data for each person. Any other columns in the table would be grouped accordingly as well.

File:tcd_column_families_Figure2.jpg

The clever bit is that NoSQL databases that support column families, such as HBase and Cassandra, only fetch the column families of those columns that are required by a query. This allows our previous statistical query to group men by age without fetching the long text columns that are present in each row. Note that the row key is not in a column family, but cuts a horizontal slice across all columns and their families to unify all data for one row.

Furthermore, NoSQL databases group all the columns in a column family together on the disk. This means they can fetch multiple rows of one column family in a single read operation. Grouping the columns like this to speed up the reading of data is called data locality.

Now that we understand the genesis of column families, let’s add in the wrinkle that causes Toad for Cloud Databases such grief! A normal RDBMS column has a static name and a single value per row. In HBase however, a family can also be used as a container for columns where the column name itself contains data. Confused? Take a look at Figure 3 - Data Column Names.

File:tcd_column_families_Figure3.jpg

In this example, a column family in our PERSON table is a list of cars the person has owned, and the year of manufacture for each car. This would entail a sub-table in a relational data model but an HBase column family can accommodate this because it can contain many name/value pairs, where the name is the column name and the value is the value of that column for that row. It is important to realise that the names of the columns in a single family can vary arbitrarily for each row.

So how does Toad for Cloud Databases map column families to a relational model? It starts with the data discovery that occurs when a user first chooses to map an HBase table. The Data Hub will read a number of rows from the table and examine the column names in each family for each row. It will then divide the column families into static and dynamic families. Static families like personal_data and demographic in Figure 2 have mostly the same column names on every row. Dynamic families like cars_owned contain mostly different column names for each row. Cassandra refers to these as standard and super column families respectively.

Obviously this method of discovery may be open to some degree of error, but an accurate trend should emerge over a sample set of one thousand rows. The user also has the chance to override the Data Hub’s choices before the remote data is mapped. Figure 4 - Toad Mapping Dialog shows what this table would look like in the Toad for Cloud Databases dialog for mapping remote objects.

File:tcd_column_families_Figure4.jpg

In this case, the user would enter “make” as the Name Column and “year” as the Value Column. They should also change the Value Data Type to “integer” to accurately reflect the type of data that is stored in these columns.

What will Toad for Cloud Databases make of this metadata when the OK button is clicked? Two tables will be created. The first table will be called person, and will contain all columns from the static column families where the column names have the format columnFamily_columnName, although this may have been overridden by the user.

The second table will be called person_cars_owned and will have three columns: id, make and year. This table pivots the data in the dynamic column family so that each column becomes a row, as shown in Figure 5 - Query From person_cars_owned, where the data is queried with a SQL statement.

File:tcd_column_families_Figure5.jpg

In summary, column families at their simplest are just a grouping element that sit between the table and column levels, and provide data locality for more efficient read operations. In some NoSQL databases, they may also be a container for data in the form of name and value pairs.