# Introduction

This post provides an overview of how to use **BigQuery ML for time series prediction**. The process will be mostly carried out using SQL queries. For this example, we will be using real-world data from Paris free bike service, Vélib. We’ll try to predict the availability of bikes in a station. We'll be using an ARIMA model, which is one of BigQuery ML built-in models.

Let’s begin with a short refresher on the main theoretical aspects.

## Time series components

A time series can fundamentally be broken down into four components:

**Trend**: This component captures the underlying pattern of progression in a time series, which could be either increasing or decreasing over time. It's the general direction in which the data is developing over long periods.**Seasonal**: The seasonal component represents regular fluctuations or patterns that occur at specific intervals such as daily, weekly, annually, etc. These patterns repeat over the short term and are often associated with the seasons or specific events.**Cyclical**: The cyclical component refers to fluctuations that occur over longer periods of time, usually over two or more years. These cycles are not of a fixed and regular period, unlike seasonal patterns.**Residual**: The residual component includes all other unexplained variations in the time series. These are fluctuations that are not captured by the trend, cyclical, or seasonal components. The residual component can be seen as "noise" or random variation.

## ARIMA model

An ARIMA model, which stands for AutoRegressive Integrated Moving Average, is a class of models that is widely used in time series forecasting. This type of model captures a suite of different standard temporal structures in time series data.

The model is characterized by three main parameters:

**Auto-Regressive (AR)**component: involves regressing the variable on its own lagged (past) values.**Integration (I)**or**Differencing**component: differencing is required to make the time series*stationary*. It is the process of subtracting the previous value from the current value, effectively calculating the change between consecutive time periods.**Moving Average (MA)**component: models the error term as a linear combination of error terms occurring contemporaneously and at various times in the past.

The complete ARIMA model is denoted as *ARIMA(p, d, q)*, where:

`p`

is the order of the**AR**component, i.e. how many past values are used in the model.`d`

is the degree of the**I**(differencing) component, which indicates how many times the data have had past values subtracted.`q`

is the order of the**MA**component, i.e. how many past residuals are used in the model.

## BigQuery commands

We’ll use the following BigQuery commands to create and use our model:

`CREATE MODEL`

to create a time series model.`ML.EVALUATE`

and`ML.ARIMA_EVALUATE`

to evaluate the model.`ML.ARIMA_COEFFICIENTS`

to inspect the model coefficients.`ML.FORECAST`

to forecast daily total visits.`ML.EXPLAIN_FORECAST`

to retrieve various components of the time series (such as seasonality and trend) that can be used to explain the results.

# Explore the data

For this tutorial, I have extracted 30 days of data from one specific Vélib station, collected at an hourly interval.

```
SELECT
time_slot,
num_bikes_available
FROM `velib.velib_hourly_training`
ORDER BY time_slot
```

The line represents the number of available bikes at the station. Daily patterns are clearly visible, although with a dip on the 22nd and 23rd of September. We'll attempt to predict the last five days (highlighted in orange) using BigQuery ML.

# Train the model

We use an ARIMA model to train on the 25 first days of data. Although a number of parameters can be customized, in this example we adhere to the default settings, except for `HOLIDAY_REGION`

, that we set to France.

Training a model is done with the `CREATE [OR REPLACE] MODEL`

instruction. In this case, the training only takes a few seconds.

```
CREATE OR REPLACE MODEL
`velib.velib_arima_plus` OPTIONS(
MODEL_TYPE='ARIMA_PLUS',
TIME_SERIES_TIMESTAMP_COL='time_slot',
TIME_SERIES_DATA_COL='num_bikes_available',
HOLIDAY_REGION='fr'
)
AS
SELECT
time_slot,
(avg_num_bikes_mechanical + avg_num_bikes_ebike) AS num_bikes_available
FROM
`velib.velib_hourly_training`
WHERE
time_slot < '2023-09-26'
```

# Inspect the model

We can start by looking at quality indicators of the models with `ML.EVALUATE`

.

`SELECT * FROM ML.ARIMA_EVALUATE(MODEL `velib.velib_arima_plus`)`

This returns a number of models (41 in our example) with several indicators. Here are the 5 first rows:

non_seasonal_p | non_seasonal_d | non_seasonal_q | has_drift | log_likelihood | AIC | variance | seasonal_periods | has_holiday_effect | has_spikes_and_dips | has_step_changes | error_message |

1 | 1 | 4 | False | -1514.685432 | 3041.370864 | 9.090127 | [DAILY, WEEKLY] | False | False | False | |

1 | 1 | 4 | True | -1513.843211 | 3041.686423 | 9.039186 | [DAILY, WEEKLY] | False | False | False | |

3 | 1 | 2 | True | -1517.780857 | 3049.561714 | 9.170558 | [DAILY, WEEKLY] | False | False | False | |

3 | 1 | 2 | False | -1519.275303 | 3050.550606 | 9.231439 | [DAILY, WEEKLY] | False | False | False | |

2 | 1 | 3 | False | -1519.450261 | 3050.900522 | 9.260623 | [DAILY, WEEKLY] | False | False | False |

- The first 4 columns are related to the coefficients of the ARIMA model
`log_likelihood`

,`AIC`

and`variance`

provide information about model quality.`AIC`

stands for Akaike information criterion and is a relative indicator of quality. The model with the lowest AIC is considered the best model.

Also, we can inspect the **model coefficients** with `ML.ARIMA_COEFFICIENTS`

:

`SELECT * FROM ML.ARIMA_COEFFICIENTS(MODEL `velib.velib_arima_plus`)`

ar_coefficients | ma_coefficients | intercept_or_drift |

[0.8896916571920077] | [0.627531724223125, -0.52700607139083722, -0.074155752108028508, 0.23338453677391041] | 0.0 |

`ar_coefficients`

shows the model coefficients of the autoregressive (AR) part`ma_coefficients`

shows the model coefficients of moving-average (MA) part.

They are both arrays, whose lengths are equal to `non_seasonal_p`

and `non_seasonal_q`

`intercept_or_drift`

is the constant term in the ARIMA model.

More usefully, we can have an **evaluation of the model** against an evaluation dataset with `ML.EVALUATE`

, that will return several performance indicators:

```
SELECT * FROM ML.EVALUATE(
MODEL `velib.velib_arima_plus`,
TABLE `velib.velib_hourly_test`,
STRUCT(120 AS horizon)
)
```

mean_absolute_error | mean_squared_error | root_mean_squared_error | mean_absolute_percentage_error | symmetric_mean_absolute_percentage_error |

8.855005 | 105.212068 | 10.257293 | 133.110359 | 61.387895 |

# Forecast

Now let’s use the model to predict 5 days of bikes availability. with the `ML.FORECAST`

instruction. Since the timeframe is hourly, we set `120`

as prediction horizon. The confidence interval level can also be customized, and is 0.95 by default.

```
SELECT *
FROM ML.FORECAST(
MODEL `velib.velib_arima_plus`,
STRUCT(
120 AS horizon,
0.95 AS confidence_level
)
)
```

# Decompose the prediction

Finally, we can get a deeper understanding of the prediction with the `ML.EXPLAIN_FORECAST`

that will return a number of indicators for historical and predicted data:

```
SELECT *
FROM
ML.EXPLAIN_FORECAST(
MODEL `velib.velib_arima_plus`,
STRUCT(72 AS horizon)
)
```

# Going beyond a basic model

In addition to predicting a single time series, BigQuery ML offers more advanced options for enhancing your forecasting capabilities, on two main aspects:

**Simultaneously forecast****multiple time series**, up to hundreds of thousands. This can be achieved by utilizing the`TIME_SERIES_ID_COL`

parameter, which lets you manage and analyze multiple datasets concurrently for more extensive insights.**Include****external regressors**. This allows your model to account for additional variables that can impact your time series data, providing a more holistic and accurate prediction.