Methodology
Discussions on the methodology of the project.
Table of contents
Model Selection
CNN-LSTM
For the task of weather prediction using machine learning, the CNN-LSTM model is typically used. A CNN-LSTM model utilizes a convolutional layer to learn spatial features, which are then passed to an LSTM layer to capture temporal dependencies. The final fully connected layer processes the LSTM layer’s output to minimize variability and improve predictions (Sainath et al., 2015). However, the LSTM layer linearizes its input to the fully connected layer into a 1-dimensional array (Hu et al., 2020), leading to a loss of spatial considerations while retaining only the temporal ones (Shi et al., 2015); (Hu et al., 2020).
Typical CNN-LSTM Model Architecture (Oh et al., 2018)
ConvLSTM2D
In contrast, ConvLSTM2D performs convolutional operations within the LSTM cell, allowing for a 3-dimensional input incorporating spatial and temporal dimensions (Hu et al., 2020). This results in the retention of both spatial and temporal features, thus enhancing the learning of correlations within the data (Shi et al., 2015); (Gaur et al., 2020).
Inner Structure of ConvLSTM2D (Shi et al., 2015)
Data Source
This research utilizes the Weather Research & Forecasting (WRF) Model dataset provided by Singapore’s Climate ArtificiaL intelligence Engine (SgCALE). The data was bias-corrected, downscaled, and refined from Global Climate Models (GCMs) and the European Centre for Medium-Range Weather Forecasts Reanalysis 5 (ERA5) dataset.
Data Downscaling Process (SgCALE, 2022)
Dataset Characteristic
Characteristics | Details |
---|---|
Source | SgCALE |
Resolution | 500 m grids |
Time Frame | 1981 - 2020 |
NaN Filling | Linear interpolation of adjacent grids |
Input Variables | Temperature, Relative Humidity, Surface Pressure, Cloud Fraction, Wind Speed |
Output Variable | Precipitation |
Dataset Split | Training (70%), Validation (15%), Test (15%) |
Batch Size | 32 |
Lookback | 7 |
The model takes in data from the past 7-days (t-6, t-5, …, t) to predict the next day precipitation (t+1).
ConvLSTM2D Model Architecture
The ConvLSTM2D model was designed with an input layer, two hidden ConvLSTM2D layers, a batch normalization layer, and a dense output layer.
Layer | Filters/Units | Filter Size | Output Shape | Parameters |
---|---|---|---|---|
ConvLSTM2D | 64 | 3 x 3 | (None, 7, 120, 160, 64) | 159,232 |
Batch Normalisation | - | - | (None, 7, 120, 160, 64) | 256 |
ConvLSTM2D | 64 | 3 x 3 | (None, 7, 120, 160, 64) | 295,168 |
Dense | 1 | - | (None, 120, 160, 1) | 65 |
Considerations
- Two ConvLSTM2D layers with 64 filters were ideal for learning the dataset’s spatiotemporal dependencies.
- More layers or filters led to overfitting and extended training times.
- Fewer layers or filters compromised performance.
- Batch normalization was utilized to:
- Ensure faster model convergence by normalizing data between layers, as gradients maintain similar scales.
- Mitigate the risk of vanishing or exploding gradients as they maintain similar scales.
- Reduce overfitting by reducing the internal covariate shift.
- Batch normalization is not used before the fully connected layer as it negatively impacts performance.
- Likely due to undesired shift in our data scale, mean, or variance.
Model Training
We trained the proposed ConvLSTM2D model for 100 epochs, employing only model checkpointing without early stopping to save the model with the best validation loss during training. This approach helps avoid the double descent phenomenon (Heckel & Yilmaz, 2020) and ensures optimal model performance.
As for the optimization method, we chose the Adaptive Moment Estimation (Adam) optimizer. Adam is a popular choice due to its adaptive nature, adjusting the learning rate throughout the training process, thereby ensuring faster convergence and improved generalisation of the model.
Loss Function
Unlike traditional convolutional outputs where loss computations often revolve around pixel-to-pixel comparisons, our model utilizes a custom loss function that computes loss over an area.
Neighbourhood Scanning Loss Function (Uphoff, et al., 2021)
The predicted grids with rain are only one grid away from the observed values. Using the built-in loss functions, such as Mean Squared Error (MSE) loss, would result in the model being penalized twice for what could be considered a reasonable prediction. The first penalty would be applied to the grid that has observed precipitation but no predicted precipitation, while the second would apply to the grid with predicted precipitation but no observed precipitation. This is despite the model having fairly accurately identified the areas experiencing precipitation.
To overcome this issue, we implement a custom loss function called the Fractions Skill Score (FSS) loss. The FSS loss scans an area of size m x m (where m refers to the user-defined mask size), calculating the average precipitation within that area, and then computing the losses between the true and predicted values. This approach better accommodates the spatial nature of our data and mitigates overly penalizing reasonable predictions.
# Define modified FSS loss
def make_FSS_loss(mask_size):
def my_FSS_loss(y_true, y_pred):
cutoff = 0.5
c = 10
y_true_binary = tf.math.sigmoid( c * ( y_true - cutoff ))
y_pred_binary = tf.math.sigmoid( c * ( y_pred - cutoff ))
pool1 = tf.keras.layers.AveragePooling2D(pool_size=(mask_size, mask_size), strides=(1, 1), padding='same')
y_true_density = pool1(y_true_binary);
n_density_pixels = tf.cast( (tf.shape(y_true_density)[1] * tf.shape(y_true_density)[2]) , tf.float32 )
pool2 = tf.keras.layers.AveragePooling2D(pool_size=(mask_size, mask_size), strides=(1, 1), padding='same')
y_pred_density = pool2(y_pred_binary);
# calculate MSE
MSE_n = tf.keras.losses.MeanSquaredError()(y_true_density, y_pred_density)
O_n_squared_image = tf.keras.layers.Multiply()([y_true_density, y_true_density])
O_n_squared_vector = tf.keras.layers.Flatten()(O_n_squared_image)
O_n_squared_sum = tf.reduce_sum(O_n_squared_vector)
M_n_squared_image = tf.keras.layers.Multiply()([y_pred_density, y_pred_density])
M_n_squared_vector = tf.keras.layers.Flatten()(M_n_squared_image)
M_n_squared_sum = tf.reduce_sum(M_n_squared_vector)
MSE_n_ref = (O_n_squared_sum + M_n_squared_sum) / n_density_pixels
# calculate MAE
MAE_n = tf.keras.losses.MeanAbsoluteError()(y_true_density, y_pred_density)
MAE_n_ref = tf.reduce_sum(tf.abs(tf.subtract(y_true_density, y_pred_density))) / n_density_pixels
# initialize weights
alpha = 0.70 # for MSE loss
beta = 0.30 # for MAE loss
my_epsilon = tf.keras.backend.epsilon() # this is 10^(-7)
return (alpha * (MSE_n / (MSE_n_ref + my_epsilon))) + (beta * (MAE_n / (MAE_n_ref + my_epsilon)))
return my_FSS_loss
We modified the FSS loss to combine MSE and Mean Absolute Error (MAE) loss, weighted at 0.70 and 0.30 respectively. This places slightly less emphasis on the extreme values and more on the average values, which might be counterintuitive for our focus on the prediction of extreme weather events. However, we found that this approach resulted in better model prediction for both floods and droughts. The overprediction of precipitation intensity across all areas result in the underprediction of drought intensity, which is undesirable. Lastly, we used a mask size of 9 x 9 to scan the area, as it demonstrated the best performance.