Cumulative Total by Date with Multiple Conditions in BigQuery
Introduction
BigQuery is Google Cloud's fully managed data warehouse. It lets users analyze large datasets with SQL. In this article, we will explore how to calculate the cumulative total of sales quantity, ordered by date, within each combination of category, sub_category1, and sub_category2.
Problem Statement
The problem at hand is to calculate a running total of sales quantity over time for each combination of category, sub_category1, and sub_category2. The query provided uses the SUM function with an OVER clause that partitions by these three columns and orders by date. However, the result does not contain all of the rows that were expected.
Understanding the Query
The original query is as follows:
SELECT
  * EXCEPT(quantity),
  SUM(quantity) OVER (
    PARTITION BY
      category,
      sub_category1,
      sub_category2
    ORDER BY date) AS running_total_quantity
FROM sales
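This query partitions the rows by the category, sub_category1, and sub_category2 columns, orders each partition by date, and calculates the cumulative sum of the quantity column within each partition. The result is a new column named running_total_quantity, which takes the place of quantity in the output because of EXCEPT(quantity). Since no window frame is specified, BigQuery applies its default for an ordered window, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows that share the same date within a partition receive the same running total.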
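To see the partitioning in action, here is a minimal sketch that runs the same window over a few made-up rows. The inline sales CTE and its values are hypothetical and only stand in for the real table. The frame clause is written out explicitly; it is the frame BigQuery uses by default when a window has an ORDER BY.

```sql
-- Hypothetical sample rows standing in for the real `sales` table.
WITH sales AS (
  SELECT * FROM UNNEST([
    STRUCT(DATE '2022-01-01' AS date, 'Electronic' AS category,
           'Computer' AS sub_category1, 'Laptop' AS sub_category2,
           2 AS quantity),
    STRUCT(DATE '2022-01-02', 'Electronic', 'Computer', 'Laptop', 3),
    STRUCT(DATE '2022-01-02', 'Electronic', 'Computer', 'Desktop', 5)
  ])
)
SELECT
  * EXCEPT (quantity),
  SUM(quantity) OVER (
    PARTITION BY category, sub_category1, sub_category2
    ORDER BY date
    -- Explicit form of the default frame for an ordered window.
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total_quantity
FROM sales
ORDER BY category, sub_category1, sub_category2, date;
-- Desktop: 2022-01-02 -> 5
-- Laptop:  2022-01-01 -> 2, 2022-01-02 -> 5
```

Laptop and Desktop accumulate separately because sub_category2 is part of the PARTITION BY list.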
Analysis
The rows for January 3rd through January 9th, which were expected in the result, appear to be missing.
Let’s analyze this further:
SELECT
  date,
  category,
  sub_category1,
  sub_category2,
  quantity
FROM sales
ORDER BY date, category, sub_category1, sub_category2
This query lists the raw rows so that we can see which dates exist for each combination. The ORDER BY clause sorts the results by date, category, sub_category1, and sub_category2.
SELECT
  *
FROM sales
WHERE (date BETWEEN '2022-01-03' AND '2022-01-09') AND quantity > 0
This query returns the records from January 3rd to January 9th that have a positive quantity. If it returns fewer rows than expected, or none at all, the gap is in the sales table itself rather than in the running-total query.
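A complementary check is to list the calendar dates that have no row at all. The sketch below is one way to do that, assuming the date column has type DATE. The inline sales rows, the calendar CTE, and the chosen date range are hypothetical and only illustrate the idea.

```sql
WITH sales AS (
  -- Hypothetical rows; nothing is recorded for January 3rd.
  SELECT * FROM UNNEST([
    STRUCT(DATE '2022-01-01' AS date, 'Electronic' AS category,
           'Computer' AS sub_category1, 'Laptop' AS sub_category2,
           2 AS quantity),
    STRUCT(DATE '2022-01-02', 'Electronic', 'Computer', 'Laptop', 3),
    STRUCT(DATE '2022-01-04', 'Electronic', 'Computer', 'Laptop', 1)
  ])
),
calendar AS (
  -- One row per day in the window under inspection (inclusive bounds).
  SELECT calendar_date
  FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2022-01-01', DATE '2022-01-04'))
    AS calendar_date
)
SELECT
  c.calendar_date AS missing_date
FROM calendar AS c
LEFT JOIN sales AS s
  ON s.date = c.calendar_date
WHERE s.date IS NULL  -- no sales row exists for this day
ORDER BY missing_date;
-- With the sample rows above, this returns a single row: 2022-01-03.
```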
Solution
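To solve this problem, we need to understand why the rows for January 3rd to January 9th are missing. After re-examining the data, I found that every expected date was present in the data set except those between January 3rd and January 9th. A window function such as SUM(...) OVER (...) only adds a column to rows that already exist; it never generates new rows, so dates with no rows in sales cannot appear in the output.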
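One common way to make absent dates appear is to build a date spine and join the sales rows onto it, so that every combination has a row for every day. The following is a sketch under stated assumptions, not necessarily the fix used for the original data: the inline sales rows are made up, the combos, calendar, and filled CTE names are hypothetical, and the category columns are assumed to be non-NULL so that the equality join matches.

```sql
WITH sales AS (
  -- Hypothetical rows with a gap on January 2nd.
  SELECT * FROM UNNEST([
    STRUCT(DATE '2022-01-01' AS date, 'Electronic' AS category,
           'Computer' AS sub_category1, 'Laptop' AS sub_category2,
           2 AS quantity),
    STRUCT(DATE '2022-01-03', 'Electronic', 'Computer', 'Laptop', 4)
  ])
),
combos AS (
  -- Every category combination that occurs in the data.
  SELECT DISTINCT category, sub_category1, sub_category2
  FROM sales
),
calendar AS (
  -- Every day between the first and last sale, inclusive.
  SELECT calendar_date
  FROM UNNEST(GENERATE_DATE_ARRAY(
    (SELECT MIN(date) FROM sales),
    (SELECT MAX(date) FROM sales))) AS calendar_date
),
filled AS (
  -- One row per combination per day; days without sales get quantity 0.
  SELECT
    cal.calendar_date AS date,
    c.category,
    c.sub_category1,
    c.sub_category2,
    IFNULL(SUM(s.quantity), 0) AS quantity
  FROM combos AS c
  CROSS JOIN calendar AS cal
  LEFT JOIN sales AS s
    ON s.date = cal.calendar_date
    AND s.category = c.category
    AND s.sub_category1 = c.sub_category1
    AND s.sub_category2 = c.sub_category2
  GROUP BY cal.calendar_date, c.category, c.sub_category1, c.sub_category2
)
SELECT
  date,
  category,
  sub_category1,
  sub_category2,
  quantity,
  SUM(quantity) OVER (
    PARTITION BY category, sub_category1, sub_category2
    ORDER BY date
  ) AS running_total_quantity
FROM filled
ORDER BY category, sub_category1, sub_category2, date;
-- 2022-01-01: quantity 2, running total 2
-- 2022-01-02: quantity 0, running total 2  (gap filled)
-- 2022-01-03: quantity 4, running total 6
```

On filled-in days the running total simply carries forward, because the added quantity is 0.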
Let’s try a different approach that uses BigQuery’s ROW_NUMBER() function:
SELECT
  date,
  category,
  sub_category1,
  sub_category2,
  quantity,
  SUM(quantity) OVER (
    PARTITION BY
      category,
      sub_category1,
      sub_category2
    ORDER BY date
  ) AS running_total_quantity
FROM (
  SELECT
    -- Number the rows that share a date within each combination.
    -- All rows in such a group tie on date, so their order is arbitrary.
    ROW_NUMBER() OVER (
      PARTITION BY
        category,
        sub_category1,
        sub_category2,
        date
      ORDER BY
        date) AS row_num,
    date,
    category,
    sub_category1,
    sub_category2,
    quantity
  FROM sales
)
-- Keep the first row of each group, plus rows numbered above the subquery's count.
WHERE row_num = 1 OR row_num > (SELECT COUNT(*) FROM (
  SELECT DISTINCT category,
    sub_category1,
    sub_category2,
    MIN(date) AS min_date
  FROM sales
  GROUP BY category, sub_category1, sub_category2
) AS t WHERE date >= '2022-01-03' AND date < '2022-01-10')
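The inner query uses ROW_NUMBER() to number the rows that share the same date within each combination of category, sub_category1, and sub_category2. The outer WHERE clause keeps the first row of each such group (row_num = 1), along with any row whose number exceeds the count returned by the subquery, and the running total is then computed over the rows that remain. Note that this filter can only keep or discard existing rows; it cannot produce rows for dates that are absent from sales.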
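If the goal of numbering rows is to end up with one row per date and combination, an alternative worth considering is to sum the rows that share a date instead of keeping only one of them, since discarding rows also discards their quantities. This swaps in a different technique (GROUP BY before the window) and is only a sketch: the inline sales rows and the daily CTE name are hypothetical.

```sql
WITH sales AS (
  -- Hypothetical rows; two sales fall on January 1st.
  SELECT * FROM UNNEST([
    STRUCT(DATE '2022-01-01' AS date, 'Electronic' AS category,
           'Computer' AS sub_category1, 'Laptop' AS sub_category2,
           2 AS quantity),
    STRUCT(DATE '2022-01-01', 'Electronic', 'Computer', 'Laptop', 3),
    STRUCT(DATE '2022-01-02', 'Electronic', 'Computer', 'Laptop', 1)
  ])
),
daily AS (
  -- Collapse same-day rows into one row per date and combination.
  SELECT
    date,
    category,
    sub_category1,
    sub_category2,
    SUM(quantity) AS quantity
  FROM sales
  GROUP BY date, category, sub_category1, sub_category2
)
SELECT
  date,
  category,
  sub_category1,
  sub_category2,
  quantity,
  SUM(quantity) OVER (
    PARTITION BY category, sub_category1, sub_category2
    ORDER BY date
  ) AS running_total_quantity
FROM daily
ORDER BY category, sub_category1, sub_category2, date;
-- 2022-01-01: quantity 5, running total 5
-- 2022-01-02: quantity 1, running total 6
```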
Once sales contains a row for every date, the final result should include all expected dates between January 3rd and January 9th, for example:
| date | category | sub_category1 | sub_category2 | quantity | running_total_quantity |
|-------------|-------------|----------------|----------------|----------|-------------------------|
| 2022-01-03 | Electronic | Computer | Laptop | 2 | 2 |
| 2022-01-03 | Electronic | Computer | Desktop | 5 | 7 |
| ... | ... | ... | ... | ... | ... |
| 2022-01-09 | Electronic | Computer | Laptop | 6 | 12 |
Conclusion
Calculating the cumulative total of sales quantity for each combination of category, sub_category1, and sub_category2 in BigQuery comes down to using SUM as a window function: the OVER clause partitions by these columns and orders by date. When expected dates are missing from the result, inspect the source data first, because a window function only returns rows that already exist. From there, approaches such as ROW_NUMBER() or a closer analysis of the data can help find the solution that best fits the need.
Last modified on 2025-02-12