Why You Should Check Out BigQuery for Big Data

(AKA why I can't stop talking about it)

Posted by Justin Carmony on April 18th, 2017

If you have a conversation with me right now, you’ll likely end up hearing about BigQuery. For the last year and a half we’ve been using BigQuery to answer Big Data(ish) questions at Deseret Digital Media and the Deseret News. We’ve gained insights that weren’t possible with our existing tools, and we’ve been big fans ever since.

So here is why I recommend you check it out:

It’s Fast


Holy cow it’s fast.

For example, one of our largest datasets is 12 billion rows at a total size of 11 terabytes, which averages out to about 1 KB per record. Every day we download our ad serving logs and upload them to BigQuery; each record represents an individual ad impression on one of our websites. Let’s say I want to calculate the average click-through rate across all impressions for the entire year of 2016. I’m doing some ad-hoc analysis, so I write a quick query:

# Select Impressions, Clicks, and Click-Through-Rate
SELECT 
  COUNT(*) AS impressions,
  SUM(A.Clicked) AS clicks,
  ROUND(IFNULL(SUM(A.Clicked), 0) / COUNT(*) * 100, 2) AS ctr
FROM (
  # Our tables are partitioned by day,
  # this allows me to control how much data
  # I access, which helps reduce my price.
  SELECT * FROM TABLE_DATE_RANGE(dfp_all.unified_,
  TIMESTAMP('2016-01-01'), 
  TIMESTAMP('2016-12-31'))
) A
  # We copy our Ad Server information to BigQuery to easily join it in for filtering.
  INNER JOIN dfp_api.line_items LI ON A.LineItemId = LI.id
WHERE 
  # Most of our Line Items use the correct category of STANDARD, SPONSOR, or AD_EXCHANGE
  (LI.lineItemType IN ('STANDARD', 'SPONSOR', 'AD_EXCHANGE') 
    # But we have some marked as PRICE_PRIORITY that we want to include as well
    OR LI.name LIKE '%AdX%' OR LI.name LIKE '%AdExchange%')
   # We only want ads that show up in the right-hand position on our website
   AND A.CT_Pos = 'right'

A couple of notes about this query (a real query we ran just the other day). We had to add a couple of WHERE conditions to “clean up” our data, since not all line items have click data, and to limit the results to a particular position. Normally, adding this type of filtering would slow a query down massively. However, click execute and you’ll see it runs in just two minutes:

./bigquery1.png

A couple of notes about this query and why it’s awesome:

  • It processed 209 GB of data instead of all 11 TB. Under the hood, BigQuery is column-based, so you are only billed for the columns your query actually touches. If we had a thousand extra columns, it wouldn’t impact our billing for query execution. This means you never have to worry about putting too much data into BigQuery. Have the data? Put it up there! It won’t affect your query performance unless you reference those columns in your query.
  • The cost was $1.04, which isn’t bad for ad-hoc analysis against an entire year’s worth of data, especially since we no longer have to wait an hour like we did on our previous system. I’ll cover costs in more detail later in the article.
  • Because our tables are partitioned by day, I can control how many tables the query executes against with TABLE_DATE_RANGE.

Now, it might be more typical to query the last 31 days to see how your CTR is performing. Querying one month of data took 5.8 seconds and cost $0.09.

./bigquery2.png

It’s Easy


We haven’t had to spin up large clusters of servers or spend lots of time on configuration. BigQuery abstracts away all of that complexity, and we treat it pretty much like a regular database (only more awesome).

There are two main ways of getting data into BigQuery: uploading CSV/JSON files to import via load jobs, and streaming data directly into BigQuery. Every day we run an ETL job to take our ad impression logs, transform them, and load a day’s worth of logs into BigQuery.
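Since our tables are sharded by day, the daily load job just targets a table name with a date suffix. Here’s a quick sketch of how that name can be built (the helper is hypothetical; the `unified_` prefix matches the tables in the query above):

```javascript
// Build the date-sharded table name (e.g. unified_20160101) that a
// daily load job would target. Helper names are mine, not from our ETL.
function pad2(n) { return ('0' + n).slice(-2); }

function shardedTableName(prefix, date) {
    return prefix +
        date.getUTCFullYear() +
        pad2(date.getUTCMonth() + 1) +
        pad2(date.getUTCDate());
}

// shardedTableName('unified_', new Date(Date.UTC(2016, 0, 1)))
// -> 'unified_20160101'
```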

But if you want to track some custom data through HTTP website beacons, you can stream it directly to BigQuery. Over the weekend I used Boomerang.js to capture real-time user performance metrics on our website. On every profiled pageview, it sends data to an HTTP Cloud Function, which uses a little code to validate the request, transform it, and send it to BigQuery. Here is an example:

// Code to be executed by Google Cloud Functions

// Require lodash.forown and the BigQuery client library
var forOwn = require('lodash.forown');
var bigquery = require('@google-cloud/bigquery')({
    projectId: 'ddm-dbi',
    keyFilename: './config/auth.json'
});

function validateReq(req) {
    // ...
    // logic to validate request
    // ...
}

function formatRow(req) {
    // ...
    // logic to format request
    // ...

    // Notice I'm creating JSON strings instead of specifying all 40+ metrics returned
    var row = {
        'Time': Math.floor(Date.now() / 1000),
        'ResponseTimingJson': JSON.stringify(responseTiming),
        'TimingJson': JSON.stringify(timing),
        'NavigationTimingJson': JSON.stringify(navigationTiming),
        'OtherJson': JSON.stringify(other),
        'UserIp': req.connection.remoteAddress
    };

    return row;
}

// Format the data then insert to BigQuery
function insertData(req) {
    return new Promise(function(resolve, reject){
        // validate is a legitimate boomerang request
        if(!validateReq(req)) {
            reject('Invalid request');
            return;
        }

        // Take the data and format it
        var row = formatRow(req);

        // Grab the dataset
        var dataset = bigquery.dataset('boomerang_performance_logs');

        // Grab the table
        var table = dataset.table('beacon_logs');

        // Insert (stream) the row into BigQuery, ignoring fields the schema doesn't know about
        table.insert(row, { ignoreUnknownValues: true }, function(err) {
            if(err) {
                reject(err);
                console.error("Error occurred", err);
            } else {
                resolve();
            }
        });
    });
}

// The function that Google Cloud Functions calls
exports.collect = function collect (req, res) {
    insertData(req).then(function(){
        res.send('1');
    }, function(err) {
        res.send('0');
    });

};

Now here is my schema:

./bigquery-schema1.png

Notice how I’m storing JSON instead of creating a field for every possible value passed by Boomerang? This was a quick way to get something up and working so I could move straight to analysis. After a few weeks of using the data, I’ll likely create a more detailed schema with all the columns listed out. In the meantime, let’s query against the JSON strings:

SELECT CEIL(INTEGER(JSON_EXTRACT(TimingJson, '$.dfp_lib_loaded')) / 100) * 100 AS DfpLoadedBucket,
  COUNT(*) as TheCount
FROM boomerang_performance_logs.beacon_logs 
WHERE JSON_EXTRACT(TimingJson, '$.dfp_lib_loaded') IS NOT NULL
  AND INTEGER(JSON_EXTRACT(TimingJson, '$.dfp_lib_loaded')) > 0
GROUP BY DfpLoadedBucket
ORDER BY DfpLoadedBucket ASC

Using the JSON_EXTRACT function, I can pull data out of a JSON string with a JSONPath expression. In the SELECT statement I divide by 100, apply CEIL, and multiply by 100 again to group values into 100-millisecond buckets.
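The same bucketing math, sketched in JavaScript to make the rounding direction obvious (the function name is mine, not part of the query):

```javascript
// Round a millisecond timing up into its 100 ms bucket,
// mirroring CEIL(value / 100) * 100 from the query above.
function bucket100ms(ms) {
    return Math.ceil(ms / 100) * 100;
}

// 1 ms through 100 ms all land in the 100 bucket; 101 ms lands in 200.
```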

I got my data, but raw numbers are hard to visualize. So I click Save to Google Sheets, and in a few seconds I have a Google Sheet with my results. I select the two columns I need, choose Create Chart, and boom, I have something I can show my team:

./bigquery-chart1.png

Because it’s just SQL, almost any developer or data analyst can start working with the data immediately. The BigQuery UI is super functional, allowing people to start querying and seeing results right away.

It’s Cheap


Not only is it powerful, it’s cheap. It has been the most cost-effective Big Data store we’ve found. New data in BigQuery costs $0.02 per GB per month to store; after 90 days, if the data hasn’t been modified, that drops to $0.01 per GB, half the price. This makes partitioning data super effective. Our 22 TB of data at $0.01/GB costs about $225 a month to store.

Where it can get more expensive is if you do lots of querying, since that incurs execution costs. Right now the rate is $5.00 per TB processed. One of my favorite tools is the BigQuery Mate extension for Chrome, which shows you the estimated cost of any query before you execute it.
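If you’d rather sanity-check query costs yourself, the math is straightforward. A sketch using the $5.00/TB rate above (I’m assuming 1 TB = 2^40 bytes here, and the function name is mine):

```javascript
// Estimate on-demand query cost at $5.00 per TB of data processed.
// Assumes 1 TB = 2^40 bytes; helper name is mine, not an API.
function estimateQueryCostUSD(bytesProcessed) {
    var TB = Math.pow(2, 40);
    return bytesProcessed / TB * 5.0;
}

// A full 1 TB scan costs $5.00; our 209 GB year-long query
// comes out to around a dollar, in line with the screenshot above.
```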

Because storage is so cheap, we’ve found that for commonly run queries or reports, it pays to create summary tables after inserting the raw data into BigQuery. That way our automated tools execute against gigabytes of data, not terabytes. It is super easy to run a query and save its output as another table. Between using partitioned tables and summarizing data into additional tables, it’s pretty easy to control your costs.

You can also have data auto-expire after a period of time. I’ve set my Boomerang logs to be kept for 120 days, or about 4 months; after that, BigQuery automatically deletes the outdated table partitions. My Boomerang example generates about 250 MB a day, so 30 GB over 4 months, which is $0.60 per month! The Cloud Function costs about $0.13 per day, so $4.03 per month to collect the data: $4.63 per month total. That is crazy cheap for detailed user performance data on a website with millions of monthly users.
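The math behind those numbers is simple enough to sketch (rates as quoted above; the function name is mine):

```javascript
// Monthly storage cost for an auto-expiring, date-partitioned log.
// With expiration, the retained size settles at daily volume * retention days.
function monthlyStorageCostUSD(gbPerDay, retentionDays, pricePerGB) {
    return gbPerDay * retentionDays * pricePerGB;
}

// 0.25 GB/day retained for 120 days at $0.02/GB is about $0.60 a month.
```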

Also, if you have a Google Analytics 360 Premium account, you get $500 a month credit towards BigQuery costs.

It’s Growing


Since we started using BigQuery in November of 2015, they have added many new features. Here are just a few:

  • A new Standard SQL dialect, making it even easier to use with existing tools.
  • Long-term storage pricing, cutting our storage costs in half.
  • Saving results to Google Sheets
  • Flat-rate pricing for high-volume/enterprise customers
  • User-defined functions
  • The BigQuery Data Transfer Service, which lets you export data from many Google services directly into BigQuery. (We won’t need our DFP ETLs anymore!)

Not only that, but many tools are quickly adding BigQuery support. Since Google announced that Data Studio is free for everyone, we’ve been building out many dashboards and reports on top of our BigQuery data.

It’s Dirt-Cheap for Medium-Sized Data

Let’s get real: most companies aren’t at “Big Data” scale yet. If that’s the case for you, it’s even better! I’m serious, BigQuery is super fast and powerful even when you only have tens or hundreds of gigabytes of data. Think I’m kidding? I saw a tweet from someone in the community looking for suggestions for his smaller data set of 15 GB.

Don’t get me wrong, in MySQL or another regular database setup, 15 GB can be a pain to manage. But price it with BigQuery: at $0.02/GB, storing 15 GB costs about $0.30 a month.

It is downright crazy.

Okay! I’m Super Excited, Now What?

Find some data you want to get into BigQuery, sign up for an account, and stick it in there. CSV imports are super easy, though I prefer to use JSON. Once your data is in, start writing some queries! A couple of caveats to be aware of:

  • Most examples use the Standard SQL syntax, but be aware there is also a Legacy SQL syntax, in case you find some examples online and wonder why they don’t work.
  • Tables now support “behind the scenes” partitioning (i.e. beacon_logs) instead of manually date-suffixed table names (i.e. beacon_logs_20170101). Just be aware that both methods exist. Feel free to tweet at me if you have a question about which to use in which scenario. For now, we still use the older manual tables, since they integrate better with Google Data Studio, though I’m positive that will improve in the future.
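One gotcha if you take the JSON route for load jobs: BigQuery expects newline-delimited JSON, one object per line, rather than a JSON array. A quick sketch of formatting rows for upload (the helper name is mine):

```javascript
// Serialize rows as newline-delimited JSON, the format BigQuery
// load jobs expect for JSON imports (one object per line).
function toNDJSON(rows) {
    return rows.map(function(row) {
        return JSON.stringify(row);
    }).join('\n');
}

// toNDJSON([{a: 1}, {a: 2}]) produces two lines:
// {"a":1}
// {"a":2}
```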

I think that’s it; go ahead and get started. Big Data (or Medium Data) has never been easier!

About Justin Carmony

Justin is the Sr. Director of Engineering for the Deseret News and Deseret Digital Media, a past President of the Utah PHP Usergroup, and a frequent contributor to the Utah Open Source Foundation, which organizes the OpenWest Conference. Justin loves just about anything involving web technologies, from PHP, JavaScript, and Node.js to Salt and managing engineering teams.

Learn More