Snowflake: Hybrid Relational JSON Tables

Have you ever had a new field or a field rename break your ETL pipelines to the data warehouse and, ultimately, your consumers of data? If you have ever done a data project, the answer is undoubtedly “yes”. To solve this age-old problem, the confluence of Hadoop’s schema-on-read tenets and Snowflake’s separation of storage and compute come together in beautiful harmony on the Snowflake Data Cloud. In Snowflake you can have a hybrid relational and JSON table, with both in the same table. I have customers doing this to the tune of 70,000,000,000 (seventy billion) records in one table. This hybrid method scales beautifully to multiple petabytes in a non-brittle fashion via ELT.

Hybrid relational columns and JSON in the same table in Snowflake

What do Hybrid Tables Enable?
You get all the goodness of a relational table with RBAC and ACID without the fragility of data types, schema evolution, and pipelines failing for mostly cosmetic reasons. If the data is bogus, you have the ability to transform it in the database. This ELT evolution lets you use the power of Snowflake’s compute and Time Travel to undo data issues quickly. This core tenet is huge when it comes to Snowflake, because you can elastically scale up thousands of CPUs and unwind an issue in minutes instead of days, without the bottleneck of an ETL tool. Further, storing most fields as JSON instead of columns allows schema evolution to flow through the database instead of being stopped by it. Most of the fields in a table are never used in a SELECT clause anyway. Let them live on the long tail in JSON, and promote important fields, key fields, and dates to columns for ease of querying and joins. When new fields appear in your JSON payload without notice, no worries: they are already in the database and queryable via v:parent.child.new_field::cast. Ultimately, this allows for incredibly fast ingest to the database, to the tune of millions of events per second. Take the event from the message broker as is, store it in a hybrid table, and use views to unwind the JSON for analytics. This will increase your data agility by an order of magnitude and make the database the star of the show instead of the constant perceived bottleneck to moving your organization into real-time data.
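As a quick sketch of that resilience (the table name and JSON path here are hypothetical, purely for illustration), a brand-new field is queryable the moment it lands, with no ALTER TABLE and no pipeline change:

--new field appeared in the payload yesterday? It is already queryable
select
    v:device.firmware_version::string as firmware_version,
    count(*) as events
from raw_events
where v:device.firmware_version is not null
group by 1;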

Two distinct ELT patterns emerge on the way from raw JSON to report-ready tables and views: a pull-through or a push-through from raw to conformed to the reporting objects. In ELT pattern 1 (pull), blob storage is the raw zone; Snowpipe unwinds the JSON a little, with further views ready for reporting. In ELT pattern 2 (push), Snowflake stores the data in a raw hybrid table; then MERGE logic unwinds the data into tables and views ready for reporting. Both patterns are viable, and the complexity of the transform logic will drive your choice. The higher the complexity, the more likely you are to use pattern 2 (push), as MERGE logic can support more complex transforms and upserts.

ELT Pipeline Pattern 1: Push JSON to S3 and Pull from Snowpipe with the Transform Logic

ELT Pipeline Pattern 2: Push JSON through to Metadata / JSON tables in Snowflake and Transform in MERGE Logic
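As a rough sketch of the pattern 2 MERGE step (the target table, match key, and field choices here are hypothetical, not from a real pipeline), the transform pulls the hot fields out of the JSON and upserts on a natural key:

--upsert from the raw hybrid table into a conformed table; names are illustrative
merge into demo_db.iot.drive_stats t
using (
    select
        v:serial_number::string as serial_number,
        v:date::date as log_date,
        v as payload
    from demo_db.iot.json_hybrid
) s
on t.serial_number = s.serial_number and t.log_date = s.log_date
when matched then update set t.payload = s.payload
when not matched then insert (serial_number, log_date, payload)
    values (s.serial_number, s.log_date, s.payload);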

Hybrid Table Design Patterns
Some level of schema design is necessary for our hybrid tables. Typically, it is best to promote cluster, join, and common WHERE-predicate fields into full-fledged columns in Snowflake. Further, it is a good idea to pull out metadata like the file dts, insert dts, file name, and file path, and to create a sequence or UUID on the table. Leave everything else as JSON. Let me say that again: leave everything else as JSON. Snowflake does something very special to JSON and other VARIANT types as you ingest. It creates a fully columnarized dictionary over the JSON, greatly speeding up query times on the JSON document. This means you get MPP column performance without the constraints of a physical schema. Here is a taste of what Snowflake collects, even for this simple table. This unseen metadata dictionary step is what allows JSON hybrid tables to scale and perform to trillions of rows and petabytes in Snowflake.

//AUTOMATICALLY GENERATED DATA DICTIONARY EXAMPLE JSON INGESTED INTO SNOWFLAKE
{
   "columns":[
      {
         "columnId":3,
         "columnName":"V",
         "dataType":"FIXED",
         "distinctValues":1,
         "minValue":100,
         "nullCount":0,
         "nullable":true,
         "path":"smart_198_normalized"
      },
      {
         "columnId":3,
         "columnName":"V",
         "dataType":"FIXED",
         "distinctValues":1,
         "minValue":100,
         "nullCount":0,
         "nullable":true,
         "path":"smart_187_normalized"
      },
//...this is over 900 lines so I cut it...
      {
         "columnId":3,
         "columnName":"V",
         "dataType":"VARIANT",
         "dictBytes":12,
         "dictCnt":9,
         "dictValueBits":10,
         "distinctValues":46,
         "maxLength":811,
         "nullCount":0,
         "nullable":true,
         "offset":6088,
         "size":880,
         "textBytes":817,
         "textCompression":"snappy",
         "textSize":7293,
         "valueBits":4,
         "vectorBytes":23
      }
   ],
   "created":1608825979,
   "database":"DEMO_DB",
   "filter":"F6C96BC029F4C0DE186464F8135B1835B302C4D02F58D723C9291FCA0C6100EA0C5B0E25064D900331C91775F8315E18409DF683E6D1C940BE5B3BCA1E4472C228FF8303318290189D7A17EBF92A10C03293F8536F2504393FB2D7808CDA16CAFE6DE64DBCEAA6E1EE042DA09A653B71580EBB9CCDF03154ECB727E255B06B10",
   "majorVersion":2,
   "minorVersion":5,
   "rows":46,
   "tableName":"DEMO_DB.IOT.JSON_HYBRID",
   "tableSourceId":"XXXXXXXXXXXXXXXXX"
}
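Because that dictionary records per-path statistics such as min/max and distinct values, predicates on JSON paths can prune data much like predicates on real columns. A hedged illustration against the demo table (the threshold is arbitrary):

--predicates on JSON paths benefit from the metadata dictionary above
select
    v:serial_number::string as serial_number,
    v:smart_9_raw::number as power_on_hours
from demo_db.iot.json_hybrid
where v:smart_9_raw::number > 40000;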

Public Facing Example
This is great, I want to try it! Yes, yes you do, and I have something special for the holidays for you. I extended a simple example from my friend Chris Richardson’s hard work. First, we will create the database, file format, and stage for the 11 JSON files I am hosting on public S3.

--I am using demo_db.iot for my example work
create database if not exists demo_db;
create schema if not exists demo_db.iot;

--file format
create or replace file format demo_db.iot.json
    type = 'json'
    compression = 'auto'
    enable_octal = false
    allow_duplicate = false
    strip_outer_array = true
    strip_null_values = false
    ignore_utf8_errors = false;
    
--show file formats, look at that!
show file formats;

--stage from s3://snowflake-corp-se-workshop/VHOL_Hybrid_Tables/
create or replace stage hybrid
    url = 's3://snowflake-corp-se-workshop/VHOL_Hybrid_Tables/';
    
--list 11 JSON files
ls @hybrid;
This is why you love Snowflake: even the file format object is itself a JSON hybrid, so much more agile and best-practice-following!

Now we can create the hybrid table using Snowflake’s special VARIANT type (further reading).

--create a hybrid table
create or replace table demo_db.iot.json_hybrid (
    log_ts timestamp,
    serial_number string,
    v variant,
    path_name string,
    file_name string
) cluster by (date(log_ts), serial_number);
  
--copy data from stage
copy into demo_db.iot.json_hybrid
    from (
    select
        current_timestamp::timestamp log_ts,
        parse_json($1):serial_number::string serial_number,
        regexp_replace($1, '\[[0-9]+\]', '[]') v,
        left(metadata$filename,86) path_name,
        regexp_replace (metadata$filename,'.*\/(.*)','\\1') file_name
    from
        @hybrid
) file_format = 'json';

--select data
select * from demo_db.iot.json_hybrid;
Columns and JSON coexisting without pandemonium.

The final step is to use Snowflake’s dot notation over the JSON to create an analytics-ready view. As new data lands in the hybrid table, it is instantly selectable through the view… it’s that simple.

--view to unwind for analysts
create or replace view demo_db.iot.json_hybrid_vw as (
    select
        log_ts as LogDate,
        v:date::date as Date,
        v:serial_number::string as SerialNumber,
        v:model::string as Model,
        v:capacity_bytes::number as CapacityBytes,
v:failure::number as Failure,
        round(CapacityBytes/power(1024,3),2) as Capacity_GB,
        round(CapacityBytes/power(1024,4),2) as Capacity_TB,
        v:smart_4_raw::number as StartStopCount_smart_4,
v:smart_5_raw::number as Reallocated_Sector_Count_smart_5,
        v:smart_9_raw::number as PowerOnHours_smart_9,
        v:smart_12_raw::number as PowerCycleCount_smart_12,
        v:smart_187_raw::number as Reported_Uncorrectable_Errors_smart_187,
        v:smart_188_raw::number as Command_Timeout_smart_188,
        v:smart_197_raw::number as Current_Pending_Sector_Count_smart_197,
        v:smart_198_raw::number as Offline_Uncorrectable_smart_198,
        v:smart_194_raw::number as TemperatureC_smart_194,
        v:smart_240_raw::number as HeadFlyingHours_smart_240,
        v:smart_241_raw::number as TotalWriteLba_smart_241,
        v:smart_242_raw::number as TotalReadLba_smart_242,
        round(v:smart_9_raw/24) as PowerOnDays,
        round(v:smart_9_raw/24/365,1) as PowerOnYears,
        round(v:smart_241_raw::number*512/1000000000) as WriteGb,
        round(v:smart_242_raw::number*512/1000000000) as ReadGb,
v as content,
        path_name as path,
        file_name as file
    from demo_db.iot.json_hybrid
);

--select from view
select * from demo_db.iot.json_hybrid_vw;
Analytics ready, stored once as JSON, scales to trillions of records and petabytes.
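From here the view queries like any relational table. For instance, a quick failure scan by model (a hypothetical analyst query, not part of the original walkthrough):

--analysts never see the JSON; the view does the unwinding
select
    Model,
    count(*) as drive_days,
    sum(Failure) as failures
from demo_db.iot.json_hybrid_vw
group by Model
order by failures desc;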

Conclusion
First and foremost, let the Data Cloud do the hard work. Ingest data into relational JSON hybrid tables, and Snowflake will columnarize the data without your typing SQL for days. Store the important fields as columns and leave the rest to Snowflake by keeping everything else as JSON, without sacrificing performance. Merging the tenets of schema-on-read, blob storage, and smart metadata with ACID and a cloud-first architecture is what allows Snowflake to scale up to webscale. Happy JSON querying!
