Bytes, Pipelines & Queries #4 – Optimizing Data Workflows with BigQuery
February 12, 2025 • 3 min read
Welcome to the “Bytes, Pipelines & Queries” series. This post focuses on BigQuery, sharing insights on optimizing data workflows for massive datasets, drawn from Module 3 of the Data Engineering Zoomcamp.
Like most engineers, I’ve interacted with databases before—running queries, optimizing indexes, and making sure transactions commit successfully. But what happens when the data size explodes? When millions (or billions) of records flow in every day? That’s where Data Warehousing comes into play, and my recent deep dive into BigQuery revealed just how different analytical databases are from transactional ones.
Although I knew about BigQuery and its capabilities through my prep work for the Google Cloud Certified Associate Cloud Engineer exam, I never really delved into the nitty-gritty of its use. Module 3 of the zoomcamp, however, provided a great opportunity to explore BigQuery in depth.
OLTP vs OLAP: The Right Database for the Right Job
Most engineers are comfortable working with OLTP (Online Transaction Processing) systems—databases optimized for handling frequent, small transactions like inserting a new user, updating an order, or processing a payment. Think of MySQL, PostgreSQL, or any standard relational database. In fact, those were the only ones I was used to.
But when you need to analyze massive datasets—billions of records across years—OLTP databases struggle. That’s where OLAP (Online Analytical Processing) systems come in. These databases are optimized for aggregations, complex queries, and historical analysis.
Below is a table comparing the key features of OLTP and OLAP databases:
| Feature | OLTP (Transactional) | OLAP (Analytical) |
|---|---|---|
| Purpose | Fast transactions | Large-scale analytics |
| Data Size | Small to medium | Terabytes to petabytes |
| Query Type | Simple reads/writes | Aggregations, joins, historical trends |
| Examples | PostgreSQL, MySQL | BigQuery, Snowflake, Redshift |
BigQuery falls squarely into the OLAP category—designed for high-speed analytical queries on massive datasets.
Partitioning & Clustering: How to Avoid Costly Queries
BigQuery charges based on the amount of data processed per query, meaning badly written queries can get expensive very fast. Thankfully, Google provides tools to optimize query performance and reduce costs.
1️⃣ Partitioning: Process Only the Data You Need
Partitioning lets you divide your table into smaller chunks based on a column, usually dates. Instead of scanning an entire table, queries only process relevant partitions.
Example: Creating a partitioned table by `tpep_dropoff_datetime` (taxi trip end time) using yellow taxi data from the TLC Trip Record Dataset:
```sql
CREATE OR REPLACE TABLE my_project.my_dataset.yellow_taxi_partitioned
PARTITION BY DATE(tpep_dropoff_datetime) AS
SELECT * FROM my_project.my_dataset.yellow_taxi_materialized;
```
Now, if we run a query for March 2024 data, BigQuery skips irrelevant partitions, reducing the amount of data scanned. Check by running this query (using a half-open date range so all of March 31 is included, which `BETWEEN '2024-03-01' AND '2024-03-31'` would miss on a timestamp column):

```sql
SELECT COUNT(*) FROM my_project.my_dataset.yellow_taxi_partitioned
WHERE tpep_dropoff_datetime >= '2024-03-01'
  AND tpep_dropoff_datetime < '2024-04-01';
```
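To see the savings, you can run the same count against the non-partitioned table from earlier and compare the “Bytes processed” figure BigQuery reports in the job details for each query:

```sql
-- Without partitions, BigQuery must read the tpep_dropoff_datetime
-- value for every row instead of just the March 2024 partitions
SELECT COUNT(*) FROM my_project.my_dataset.yellow_taxi_materialized
WHERE tpep_dropoff_datetime >= '2024-03-01'
  AND tpep_dropoff_datetime < '2024-04-01';
```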
2️⃣ Clustering: Optimize Filtering & Sorting
Clustering further optimizes performance by sorting data within each partition based on frequently used columns.
Example: Creating a table partitioned by date and clustered by `VendorID`:
```sql
CREATE OR REPLACE TABLE my_project.my_dataset.yellow_taxi_optimized
PARTITION BY DATE(tpep_dropoff_datetime)
CLUSTER BY VendorID AS
SELECT * FROM my_project.my_dataset.yellow_taxi_materialized;
```
Now, queries filtering by `VendorID` will be faster and cheaper, as in the example below.
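For instance, a query like this (a minimal sketch; `VendorID = 2` is just an illustrative value from the TLC data) lets BigQuery prune on both dimensions:

```sql
-- Partition pruning on the date filter, block pruning on the clustered column
SELECT COUNT(*) FROM my_project.my_dataset.yellow_taxi_optimized
WHERE DATE(tpep_dropoff_datetime) = '2024-03-15'
  AND VendorID = 2;
```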
External Tables vs Materialized Tables: Choosing the Right Approach
BigQuery supports External Tables, which allow you to query data directly from a data lake like Google Cloud Storage (GCS) without ingesting it into BigQuery. This is useful for raw data analysis but can be slow and expensive for frequent queries.
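For context, here's roughly what defining such an external table looks like. The bucket path and Parquet format are assumptions for illustration; adjust them to your own data lake layout:

```sql
-- Hypothetical GCS location; no data is copied into BigQuery here
CREATE OR REPLACE EXTERNAL TABLE my_project.my_dataset.yellow_taxi_external
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/yellow_taxi/*.parquet']
);
```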
On the other hand, Materialized Tables store data inside BigQuery, leading to faster query performance. Below is a comparison of the two:
| Feature | External Table | Materialized Table |
|---|---|---|
| Data Location | Google Cloud Storage | BigQuery Storage |
| Query Performance | Slower (reads from GCS) | Faster (BQ-optimized storage) |
| Storage Cost | No extra cost (pay per query) | Charged for storage, but cheaper queries |
| Best Use Case | Rarely queried data | Frequently queried data |
Creating a Materialized Table from an External Table
After connecting your external data source to BigQuery, you can create a materialized table using the following SQL:
```sql
CREATE OR REPLACE TABLE my_project.my_dataset.yellow_taxi_materialized AS
SELECT * FROM my_project.my_dataset.yellow_taxi_external;
```
This copies data from GCS into BigQuery, making future queries much faster and cheaper.
Final Thoughts
Exploring BigQuery and Data Warehousing has been an eye-opener. Unlike traditional OLTP databases, OLAP systems focus on cost-efficient loading, partitioning, and data modeling for large-scale data. Also, BigQuery is serverless and scales with your data, making it a powerful tool for data analysis.
Partitioning and clustering are great for reducing query costs, and understanding when to use External vs Materialized Tables helps balance cost and performance.
There’s still so much more to explore, from real-time analytics to ML models within BigQuery—but one thing is clear: efficient querying is just as important as writing the query itself. 🚀