Create ML Models with BigQuery ML: Challenge Lab

6 min readApr 30, 2021

This article will go through the lab to Create ML Models with BigQuery ML: Challenge Lab. In the previous, lab you will get familiar with Getting Started with BQML, Predict Visitor Purchases with a Classification Model in BQML, Predict Taxi Fare with a BigQuery ML Forecasting Model, Bracketology with Google Machine Learning, and Implement a Helpdesk Chatbot with Dialogflow & BigQuery ML.

The challenge contains 5 required tasks.

1.Create a dataset to store your machine learning models.

2.Create a forecasting BigQuery machine learning model.

3.Create the second machine learning model.

4.Evaluate the two machine learning models.

5.Use the subscriber type machine learning model to predict average trip durations.

Challenge Scenario

One of the projects you are working on needs to provide analysis based on real-world data that will help in the selection of new bicycle models for public bike share systems. Your role in this project is to develop and evaluate machine learning models that can predict average trip durations for bike schemes using the public data from Austin’s public bike share scheme to train and evaluate your models.

Two of the senior data scientists in your team have different theories on what factors are important in determining the duration of a bike-share trip and you have been asked to prioritize these to start. The first data scientist maintains that the key factors are the start station, the location of the start station, the day of the week, and the hour the trip started. While the second data scientist argues that this is an over-complication and the key factors are simply start station, subscriber type, and the hour the trip started.

You have been asked to develop a machine learning model based on each of these input features. Given the fact that stay-at-home orders were in place for Austin during parts of 2020 as a result of COVID-19, you will be working on data from previous years. You have been instructed to train your models on data from 2018 and then evaluate them against data from 2019 based on Mean Absolute Error and the square root of Mean Squared Error.

You can access the public data for the Austin bike share scheme in your project by opening this link to the Austin bike share dataset in the browser tab for your lab.

As a final step, you must create and run a query that uses the model that includes subscriber type as a feature, to predict the average trip duration for all trips from the busiest bike-sharing station in 2019 (based on the number of trips per station in 2019) where the subscriber type is ‘Single Trip’.

1.Create a dataset to store your machine learning models

This dataset will be used to store our BQML models.

In the GCP Console go to Navigation Menu >BigQuery.
Click on Create Dataset.

Write the name of the Dataset you want.

Click on Create Dataset.

2.Create a forecasting BigQuery machine learning model

This task asks us to train a model-based.

Copy and paste the following code to the Query editor and then click Run.

CREATE OR REPLACE MODEL royalbike.model
OPTIONS
  (model_type='linear_reg', labels=['duration_minutes']) AS
SELECT
    start_station_name,
    EXTRACT(HOUR FROM start_time) AS start_hour,
    EXTRACT(DAYOFWEEK FROM start_time) AS day_of_week,
    duration_minutes
FROM
    `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS trips
JOIN
    `bigquery-public-data.austin_bikeshare.bikeshare_stations` AS stations
ON
    trips.start_station_name = stations.name
WHERE
    EXTRACT(YEAR FROM start_time) = 2018
    AND duration_minutes > 0

The query should return the starting station name, the hour the trip started, the weekday of the trip, and the address of the start station.

3.Create the second machine learning model

Similar to the second task, this task requires you to train a second model.

Click on compose new query and then copy and paste the following query into the BigQuery Query editor.

CREATE OR REPLACE MODEL royalbike.subscriber
OPTIONS
  (model_type='linear_reg', labels=['duration_minutes']) AS
SELECT
    start_station_name,
    EXTRACT(HOUR FROM start_time) AS start_hour,
    subscriber_type,
    duration_minutes
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS trips
WHERE EXTRACT(YEAR FROM start_time) = 2018

The query should return to predict the trip duration for bike trips.

4.Evaluate the two machine learning models

Click on compose new query and then copy and paste the following query into the BigQuery Query editor.

-- Evaluation metrics for location_model
SELECT
  SQRT(mean_squared_error) AS rmse,
  mean_absolute_error
FROM
  ML.EVALUATE(MODEL royalbike.model, (
  SELECT
    start_station_name,
    EXTRACT(HOUR FROM start_time) AS start_hour,
    EXTRACT(DAYOFWEEK FROM start_time) AS day_of_week,
    duration_minutes
  FROM
    `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS trips
  JOIN
   `bigquery-public-data.austin_bikeshare.bikeshare_stations` AS stations
  ON
    trips.start_station_name = stations.name
  WHERE EXTRACT(YEAR FROM start_time) = 2019)
)

Click on compose new query and then copy and paste the following query into the BigQuery Query editor.

-- Evaluation metrics for subscriber_modelSELECT
  SQRT(mean_squared_error) AS rmse,
  mean_absolute_error
FROM
  ML.EVALUATE(MODEL royalbike.subscriber, (
  SELECT
    start_station_name,
    EXTRACT(HOUR FROM start_time) AS start_hour,
    subscriber_type,
    duration_minutes
  FROM
    `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS trips
  WHERE
    EXTRACT(YEAR FROM start_time) = 2019)
)

The query should return both the Mean Absolute Error and the Root Mean Square Error of the Model.

5.Use the subscriber type machine learning model to predict average trip durations

In this task, we are required to calculate the average predicted trip time for the busiest station.

Click on compose new query and then copy and paste the following query into the BigQuery Query editor.

SELECT
  start_station_name,
  COUNT(*) AS trips
FROM
  `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE
  EXTRACT(YEAR FROM start_time) = 2019
GROUP BY
  start_station_name
ORDER BY
  trips DESC

Click on compose new query and then copy and paste the following query into the BigQuery Query editor.
The query should return the busiest station

SELECT AVG(predicted_duration_minutes) AS average_predicted_trip_length
FROM ML.predict(MODEL royalbike.subscriber, (
SELECT
    start_station_name,
    EXTRACT(HOUR FROM start_time) AS start_hour,
    subscriber_type,
    duration_minutes
FROM
  `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE 
  EXTRACT(YEAR FROM start_time) = 2019
  AND subscriber_type = 'Single Trip'
  AND start_station_name = '21st & Speedway @PCL'))