Open source Java projects: Apache Phoenix

Use standard SQL queries in a NoSQL database with Phoenix

flickr kdooley jw osjp apache phoenix
Kevin Dooley (Creative Commons BY or BY-SA)

Apache Phoenix is a relatively new open source Java project that provides a JDBC driver and SQL access to Hadoop's NoSQL database: HBase. It was created as an internal project at Salesforce, open sourced on GitHub, and became a top-level Apache project in May 2014. If you have strong SQL programming skills and would like to be able to use them with a powerful NoSQL database, Phoenix could be exactly what you're looking for!

This installment of Open source Java projects introduces Java developers to Apache Phoenix. Since Phoenix runs on top of HBase, we'll start with an overview of HBase and how it differs from relational databases. You'll learn how Phoenix bridges the gap between SQL and NoSQL, and how it's optimized to efficiently interact with HBase. With those basics out of the way, we'll spend the remainder of the article learning how to work with Phoenix. You'll set up and integrate HBase and Phoenix, create a Java application that connects to HBase through Phoenix, and you'll write your first table, insert data, and run a few queries on it.

HBase: A primer

Apache HBase is a NoSQL database that runs on top of Hadoop as a distributed and scalable big data store. HBase is a column-oriented database that leverages the distributed processing capabilities of the Hadoop Distributed File System (HDFS) and Hadoop's MapReduce programming paradigm. It was designed to host large tables with billions of rows and potentially millions of columns, all running across a cluster of commodity hardware.

Apache HBase combines the power and scalability of Hadoop with the ability to query for individual records and execute MapReduce processes.

In addition to capabilities inherited from Hadoop, HBase is a powerful database in its own right: it combines real-time queries with the speed of a key/value store, a robust table-scanning strategy for quickly locating records, and it supports batch processing using MapReduce. As such, Apache HBase combines the power and scalability of Hadoop with the ability to query for individual records and execute MapReduce processes.

HBase's data model

HBase organizes data differently from traditional relational databases, supporting a four-dimensional data model in which each "cell" is represented by four coordinates:

  1. Row key: Each row has a unique row key that is represented internally by a byte array, but does not have any formal data type.
  2. Column family: The data contained in a row is partitioned into column families; each row has the same set of column families, but each column family does not need to maintain the same set of column qualifiers. You can think of column families as being similar to tables in a relational database.
  3. Column qualifier: These are similar to columns in a relational database.
  4. Version: Each column can have a configurable number of versions. If you request the data contained in a column without specifying a version then you receive the latest version, but you can request older versions by specifying a version number.

Figure 1 shows how these four dimensional coordinates are related.

osjp phoenix fig01

Figure 1. HBase data mode

The model in Figure 1 shows that a row is comprised of a row key and an arbitrary number of column families. Each row key is associated to a collection of "rows in tables," each of which has its own columns. While each table must exist, the columns in tables may be different across rows. Each column family has a set of columns, and each column has a set of versions that map to the actual data in the row.

If we were modeling a person, the row key might be the person's social security number (to uniquely identify them), and we might have column families like address, employment, education, and so forth. Inside the address column family we might have street, city, state, and zip code columns, and each version might correspond to where the person lived at any given time. The latest version might list the city "Los Angeles," while the previous version might list "New York." You can see this example model in Figure 2.

osjp phoenix fig02

Figure 2. Person model in HBase

In sum, HBase is a column-oriented database that represents data in a four dimensional model. It is built on top of the Hadoop Distributed File System (HDFS), which partitions data across potentially thousands of commodity machines. Developers using HBase can access data directly by accessing a row key, by scanning across a range of row keys, or by using batch processing via MapReduce.

Bridging the NoSQL gap: Apache Phoenix

Apache Phoenix is a top-level Apache project that provides an SQL interface to HBase, mapping HBase models to a relational database world. Of course, HBase provides its own API and shell for performing functions like scan, get, put, list, and so forth, but more developers are familiar with SQL than NoSQL. The goal of Phoenix is to provide a commonly understood interface for HBase.

In terms of features, Phoenix does the following:

  • Provides a JDBC driver for interacting with HBase.
  • Supports much of the ANSI SQL standard.
  • Supports DDL operations such as CREATE TABLE, DROP TABLE, and ALTER TABLE.
  • Supports DML operations such as UPSERT and DELETE.
  • Compiles SQL queries into native HBase scans and then maps the response to JDBC ResultSets.
  • Supports versioned schemas.

In addition to supporting a vast set of SQL operations, Phoenix is also very high performing. It analyzes SQL queries, breaks them down into multiple HBase scans, and runs them in parallel, using the native API instead of MapReduce processes.

Phoenix uses two strategies--co-processors and custom filters--to bring computations closer to the data:

  • Co-processors perform operations on the server, which minimizes client/server data transfer.
  • Custom filters reduce the amount of data returned in a query response from the server, which further reduces the amount of transferred data. Custom filters are used in a few ways:
    1. When executing a query, a custom filter can be used to identify only the essential column families required to satisfy the search.
    2. A skip scan filter uses HBase's SEEK_NEXT_USING_HINT to quickly navigate from one record to the next, which speeds up point queries.
    3. A custom filter can "salt the data," meaning that it adds a hash byte at the beginning of row key so that it can quickly locate records.

In sum, Phoenix leverages direct access to HBase APIs, co-processors, and custom filters to give you millisecond-level performance for small datasets and second-level performance for humongous ones. Above all, Phoenix exposes these capabilities to developers via a familiar JDBC and SQL interface.

Get started with Phoenix

In order to use Phoenix, you need to download and install both HBase and Phoenix. You can find the Phoenix download page (and HBase compatibility notes) here.

Download and setup

At the time of this writing, the latest version of Phoenix is 4.6.0 and the download page reads that 4.x is compatible with HBase version 0.98.1+. For my example, I downloaded the latest version of Phoenix that is configured to work with HBase 1.1. You can find it in the folder: phoenix-4.6.0-HBase-1.1/.

Here's the setup:

  1. Download and decompress this archive and then use one of the recommended mirror pages here to download HBase. For instance, I selected a mirror, navigated into the 1.1.2 folder, and downloaded hbase-1.1.2-bin.tar.gz.
  2. Decompress this file and create an HBASE_HOME environment variable that points to it; for example, I added the following to my ~/.bash_profile file (on Mac): export HBASE_HOME=/Users/shaines/Downloads/hbase-1.1.2.

Integrate Phoenix with HBase

The process to integrate Phoenix into HBase is simple:

  1. Copy the following file from the Phoenix root directory to the HBase lib directory: phoenix-4.6.0-HBase-1.1-server.jar.
  2. Start HBase by executing the following script from HBase's bin directory:./
  3. With HBase running, test that Phoenix is working by executing the SQLLine console, by executing following command from Phoenix's bin directory: ./ localhost.

The SQLLine console is a Python script that starts a console that connects to HBase's Zookeeper address; localhost in this case. You can walk through an example that I am going to summarize in this section here.

First, let's view all of the tables in HBase by executing !table:

0: jdbc:phoenix:localhost> !tables
|                TABLE_CAT                 |               TABLE_SCHEM                |                TABLE_NAME                |                TABLE_TYPE                |                 REMARKS  |
|                                          | SYSTEM                                   | CATALOG                                  | SYSTEM TABLE                             |                          |
|                                          | SYSTEM                                   | FUNCTION                                 | SYSTEM TABLE                             |                          |
|                                          | SYSTEM                                   | SEQUENCE                                 | SYSTEM TABLE                             |                          |
|                                          | SYSTEM                                   | STATS                                    | SYSTEM TABLE                             |                          |

Because this is a new instance of HBase the only tables that exist are system tables. You can create a table by executing a create table command:

0: jdbc:phoenix:localhost> create table test (mykey integer not null primary key, mycolumn varchar);
No rows affected (2.448 seconds)

This command creates a table named test, with an integer primary key named mykey and a varchar column named mycolumn. Now insert a couple rows by using the upsert command:

0: jdbc:phoenix:localhost> upsert into test values (1,'Hello');
1 row affected (0.142 seconds)
0: jdbc:phoenix:localhost> upsert into test values (2,'World!');
1 row affected (0.008 seconds)

UPSERT is an SQL command for inserting a record if it does not exist or updating a record if it does. In this case, we inserted (1,'Hello') and (2,'World!'). You can find the complete Phoenix command reference here. Finally, query your table to see the values that you upserted by executing select * from test:

0: jdbc:phoenix:localhost> select * from test;

|                  MYKEY                   |                 MYCOLUMN                 |
| 1                                        | Hello                                    |
| 2                                        | World!                                   |
2 rows selected (0.111 seconds)

As expected, you'll see the values that you just inserted. If you want to clean up the table, execute a drop table test command.

1 2 Page 1
Page 1 of 2
NEW! Download the Winter 2018 digital edition of CIO magazine