Build and Optimize Data Warehouses with BigQuery: Challenge Lab

In this article, we will go through the lab to Build and Optimize Data Warehouses with BigQuery: Challenge Lab. In the previous, lab you will get familiar with Creating a Data Warehouse Through Joins and Unions,
Creating Date-Partitioned Tables in BigQuery, Troubleshooting and Solving Data Join Pitfalls, Working with JSON, Arrays, and Structs in BigQuery, and Build and Execute MySQL, PostgreSQL, and SQLServer to Data Catalog Connectors.

The challenge contains 6 required tasks.

  1. Create a table partitioned by date.
  2. Correctly add appropriate columns to a table.
  3. Populate the population column.
  4. Populate the country_area column.
  5. Populate the mobility STRUCT.
  6. Query missing data in population & country_area columns.

Challenge scenario

You are part of an international public health organization that is tasked with developing a machine learning model to predict the daily case count for countries during the Covid-19 pandemic. As a junior member of the Data Science team, you’ve been assigned to use your data warehousing skills to develop a table containing the features for the machine learning model.

1. Create a table partitioned by date

The First step is to create a Dataset Name ecommerce

  • In the GCP Console go to Navigation Menu >BigQury.
  • Click on Create Database and Name it as ecommerce.
  • Then Click o Create Database.
  • We can check on our Project Id Database has been Created or Not.
  • Click on Compose New Query.
  • Copy and paste the following code to the Query editor and then click Run
CREATE OR REPLACE TABLE ecommerce.sample 
PARTITION BY date OPTIONS ( partition_expiration_days=90,description="COVID 19 data") AS
SELECT *
FROM bigquery-public-data.covid19_govt_response.oxford_policy_tracker WHERE alpha_3_code != 'USA' AND alpha_3_code != 'GBR'
  • The query should return the Table that has been created.

2.Add new columns to your table

This step is to add new columns to your table

  • Click on compose new query and then copy and paste the following query into the BigQuery Query editor.
ALTER TABLE ecommerce.sample
ADD COLUMN IF NOT EXISTS population INT64,
ADD COLUMN IF NOT EXISTS country_area FLOAT64,
ADD COLUMN IF NOT EXISTS mobility STRUCT<avg_retail FLOAT64,
avg_grocery FLOAT64,
avg_parks FLOAT64,
avg_transit FLOAT64,
avg_workplace FLOAT64,
avg_residential FLOAT64>
  • The query should return the new columns to your table.

3.Add country population data to the population column

  • Click on compose new query and then copy and paste the following query into the BigQuery Query editor.
  • Replace Project Id.
UPDATE `<PROJECT_ID>.ecommerce.sample` count
SET count.population = count1.pop_data_2019
FROM `bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide` count1
WHERE count.date = count1.date AND count.alpha_3_code=count1.country_territory_code

4.Add country area data to the country_area column

  • Click on compose new query and then copy and paste the following query into the BigQuery Query editor.
  • Replace Project Id.
UPDATE `<PROJECT_ID>.ecommerce.sample` count
SET count.country_area = count1.country_area
FROM `bigquery-public-data.census_bureau_international.country_names_area` count1
WHERE count.country_name = count1.country_name

5.Populate the mobility record data

  • Click on compose new query and then copy and paste the following query into the BigQuery Query editor.
  • Replace Project Id.
UPDATE `<PROJECT_ID>.ecommerce.sample` count
SET count.mobility = STRUCT<avg_retail FLOAT64, avg_grocery FLOAT64, avg_parks FLOAT64, avg_transit FLOAT64, avg_workplace FLOAT64, avg_residential FLOAT64>
(count1.avg_retail, count1.avg_grocery, count1.avg_parks, count1.avg_transit, count1.avg_workplace, count1.avg_residential)FROM ( SELECT country_region, date, AVG(retail_and_recreation_percent_change_from_baseline) as avg_retail, AVG(grocery_and_pharmacy_percent_change_from_baseline)as avg_grocery,
AVG(parks_percent_change_from_baseline) as avg_parks,
AVG(transit_stations_percent_change_from_baseline) as avg_transit, AVG( workplaces_percent_change_from_baseline ) as avg_workplace,
AVG( residential_percent_change_from_baseline) as avg_residential FROM `bigquery-public-data.covid19_google_mobility.mobility_report` GROUP BY country_region, date) AS count1
WHERE count.country_name = count1.country_region
AND count.date = count1.date

6.Query missing data in population & country_area columns

  • Click on compose new query and then copy and paste the following query into the BigQuery Query editor.
  • Replace Project Id.
SELECT DISTINCT country_name
FROM `Project Id.ecommerce.sample`
WHERE population is NULL
UNION ALL
SELECT DISTINCT country_name
FROM `Project Id.ecommerce.sample`
WHERE country_area IS NULL ORDER BY country_name ASC
  • The query should return the missing countries in the population and country_area data.

Congratulations! Done with the challenge lab.

Stay tuned till the next blog

If you Want to Connect with Me:

Linkedin: https://www.linkedin.com/in/mayank-chourasia-38421a134/

Twitter: https://twitter.com/ChourasiaMayank.

Youtube: https://youtu.be/Luc03Dgs_yU

Google Developers Google Cloud Qwiklabs ChallengePost

I am Google Cloud Certified Associate Cloud Engineer and MS Azure Certifed. I have been working on different fields including Cloud Computing , Sap ABAP.