Library Technology – Reviews, Tips, Giveaways, Freeware

What is a BigQuery Data Warehouse?

Posted In Business - By Techtiplib on Friday, September 22nd, 2017 With No Comments »

Google BigQuery is the external version of one of the company’s core technologies—the Dremel execution engine. BigQuery can be set up to replicate the architecture of a traditional data warehouse in the cloud.

BigQuery takes a serverless approach to warehousing (see this article by Boolean World for more details). This means that you don’t need to provision any resources to set up your warehouse—you simply use the service, and BigQuery dynamically allocates computing resources based on your patterns of use.

BigQuery’s serverless approach differs markedly from other popular cloud warehouses such as Amazon Redshift and traditional data warehouses, in which you must provision, set up, and manage the computing resources required to meet your querying and storage needs.

BigQuery Concepts

Storage Architecture

Data in BigQuery is stored in a special Capacitor columnar storage format. Columnar storage optimizes data warehouses for querying because table data is stored in chunks of columns, meaning aggregations are made much simpler due to fewer required I/O operations for reading the data.

Image source: http://saphanatutorial.com/wp-content/uploads/2013/09/Difference-between-Column-based-and-Row-based-Tables.png

All data is stored in Google’s Colossus file storage system, which replicates data to multiple data centers for high availability.

Resource Provisioning

As mentioned, Google dynamically provisions resources based on user requirements.

  • You receive storage allocations based on how much data you store. When you remove data, BigQuery revises your storage allocations downward. When you add data, you are allocated more storage.
  • Query allocations are based on query type and complexity. Queries use slots, which are pre-defined units of allocation that use up RAM and CPU on Google’s machines.      

Maintenance

You don’t need to update or maintain your BigQuery data warehouse. The BigQuery engineering team completes all upgrades and maintenance tasks.

Permissions

To control access to view or query your data on BigQuery, you create cloud IAM roles that define a set of permissions. You then assign each cloud IAM role to different users or groups of users depending on what type of access you want to give to the data.

Loading Data

To load data, it needs to be in a BigQuery-supported format: Avro, JSON, or CSV. Data can be loaded in batches using load jobs or by streaming data one record at a time.           

  • Load jobs are used to import large datasets into the system. Load jobs in BigQuery are free of charge.    
  • Streaming allows you to load data on a row-by-row basis in real-time. You get charged for streaming inserts.          

Denormalization

The BigQuery service recommends denormalizing data tables before loading them into the system because denormalized data is better optimized for performing queries.

BigQuery Data

Image source: https://www.codeproject.com/KB/database/359654/a3.jpg

Denormalized data uses more storage space because it groups data and doesn’t eliminate data redundancies. But the improved query performance is considered worth it, especially for organizations with lots of data to be loaded.

What is a BigQuery Query?

BigQuery uses the Dremel execution engine to run queries on your data. You can run your queries using standard SQL.

Query costs are based on the amount of data scanned, so it’s important to only reference relevant columns in each query you use. The following query scans 100 billion rows and completes in less than 30 seconds.

Image Source: https://cloud.google.com/blog/big-data/2016/01/anatomy-of-a-bigquery-query

The reason queries execute so quickly on a BigQuery dataset is because the execution engine uses massively parallel processing to run queries on thousands of servers simultaneously. A tree architecture is used to dispatch queries across many machines at once and aggregate the results.

Image source: https://cloud.google.com/files/BigQueryTechnicalWP.pdf

BigQuery vs Google Analytics

For analysis purposes on a website or application, BigQuery offers significant advantages over Google Analytics data.

When building reports using GA data, the data returned using the API is aggregated and processed. If you connect Google Analytics to BigQuery, GA exports raw session data to BigQuery, giving you the option of more in-depth analysis.

Second, sampling in BigQuery is not a problem. In Google Analytics, reports are based on data subsets when they retrieve a certain number of sessions (100 million sessions for Google Analytics 360, and 500,000 for regular Google Analytics users). With BigQuery, there is no sampling—reports are based on the entire dataset because of the powerful BigQuery engine.

Closing Thoughts

  • Google BigQuery is a cloud-based data warehouse solution optimized for easy of use.
  • You don’t need to worry about provisioning computing resources for BigQuery, and you only get charged based on how much data you store and the complexity of queries you run.
  • BigQuery’s Dremel execution engine uses massively parallel processing and a tree architecture to perform queries with impressive speed on huge data sets.
  • To get the most out of BigQuery as a data warehouse, consider using a solution that integrates all your data sources into the BigQuery system for easy data loading and analysis.

More contents in:

About - Hey, this blog belongs to me! I am the founder of TechTipLib and managing editor right now. And I love to hear what do you think about this article, leave comment below! Thank you so much...