
Probabilistic weather forecasting with machine learning

Task definition and general approach

A general formulation of the task of probabilistic weather forecasting from the present time \(t = 0\) into the future is to model the joint probability distribution \(P(\bar{\mathbf{X}}^{0:T} \mid \mathbf{O}^{\le 0})\), where \(T\) is the forecast horizon, \(\bar{\mathbf{X}}^{t}\) denotes the atmospheric state at time \(t\) and \(\mathbf{O}^{\le 0}\) are observations made up to the forecast initialization time \(t = 0\). This joint distribution can be factored as

$$P(\bar{\mathbf{X}}^{0:T} \mid \mathbf{O}^{\le 0})=\underbrace{P(\bar{\mathbf{X}}^{0} \mid \mathbf{O}^{\le 0})}_{\text{State inference}}\;\underbrace{P(\bar{\mathbf{X}}^{1:T} \mid \bar{\mathbf{X}}^{0})}_{\text{Forecast model}}.$$

Our innovation in this work is an MLWP-based Forecast model, and we adopt a traditional NWP-based State inference approach. We make several approximations to the above general formulation, as follows.

Full atmospheric states are not directly observed, and so we approximate each \({\bar{{\bf{X}}}}^{t}\) with a best-estimate NWP-based analysis state Xt, which has been generated at finite resolution, using a window of observations in a process known as data assimilation. In our case, each Xt is an 84 × 720 × 1,440 array, which includes six surface variables and six atmospheric variables at each of 13 vertical pressure levels (Extended Data Table 1), on a 0.25° latitude–longitude grid. We generate 15-day forecasts, at 12 h steps, so T = 30.

As a first, standard approximation29, we use the analyses X0:T as evaluation targets. This means we are, in effect, evaluating each model's forecast as a predictive distribution,

$$P(\mathbf{X}^{0:T} \mid \mathbf{O}^{\le 0}),$$

over sequences of future best-estimate NWP analyses.

Second, we wish to rely on a Markov assumption. However, although the underlying atmospheric state sequence \(\bar{\mathbf{X}}^{1:T}\) is Markov, it is only partially observed in X1:T. In our models GenCast and GenCast-Perturbed, we therefore make a weaker second-order Markov approximation, under which we factorize

$$P(\mathbf{X}^{-1},\mathbf{X}^{0:T} \mid \mathbf{O}^{\le 0})=P(\mathbf{X}^{0},\mathbf{X}^{-1} \mid \mathbf{O}^{\le 0})\prod_{t=1}^{T}P(\mathbf{X}^{t} \mid \mathbf{X}^{t-1},\mathbf{X}^{t-2}).$$

We found that conditioning on two previous time steps works better than one.

For GenCast, the initialization P(X0, X−1|O≤0) is handled by fixing (X0, X−1) to their values obtained from two consecutive best-estimate analyses from the ERA5 dataset27. For GenCast-Perturbed, additional perturbations are added; see Supplementary Information section A.4. With initialization dealt with, the problem is reduced to modelling P(Xt|Xt−1, Xt−2), and samples of X1:T can be generated autoregressively.
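As an illustration, the autoregressive generation this implies can be sketched as follows (a minimal Python/NumPy sketch; `sample_step` is a hypothetical stand-in for one full diffusion sampling pass of P(Xt|Xt−1, Xt−2)):

```python
import numpy as np

def rollout(sample_step, x0, x_minus1, num_steps=30, rng=None):
    """Sample a forecast trajectory X^{1:T} autoregressively (a sketch).

    `sample_step(x_prev, x_prev2, rng)` stands in for one full diffusion
    sampling pass of P(X^t | X^{t-1}, X^{t-2}); `x0` and `x_minus1` are
    the two consecutive initial analyses. With 12-h steps, num_steps=30
    yields a 15-day forecast.
    """
    rng = rng if rng is not None else np.random.default_rng()
    x_prev2, x_prev = x_minus1, x0
    trajectory = []
    for _ in range(num_steps):
        x_next = sample_step(x_prev, x_prev2, rng)
        trajectory.append(x_next)
        x_prev2, x_prev = x_prev, x_next  # slide the two-step context window
    return trajectory
```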

Diffusion model specification

Beyond image and video generation, diffusion models21,22,23 have also been applied in the geophysical domain, to tasks including data assimilation50, NWP ensemble emulation51 and climate downscaling52. In this work, we model P(Xt|Xt−1, Xt−2) with a diffusion model, which enables us to sample forecast trajectories.

Rather than sampling Xt directly, our approach is to sample a residual Zt with respect to the most recent weather state Xt−1, where the residuals have been normalized to unit variance on a per-variable and per-level basis, as was done for GraphCast2. Xt is then computed as Xt = Xt−1 + SZt, where S is a diagonal matrix that inverts the normalization. The one exception is precipitation, for which we set Xt = SZt without adding the previous state.
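The mapping from a sampled residual back to a weather state can be sketched as below (array shapes and the boolean `is_precip` mask are assumptions of this example):

```python
import numpy as np

def state_from_residual(z_t, x_prev, scale, is_precip):
    """Map a sampled normalized residual Z^t back to a weather state X^t.

    `scale` plays the role of the diagonal matrix S, holding the
    per-variable, per-level standard deviations that invert the
    normalization of the residuals.
    """
    x_t = x_prev + scale * z_t                    # X^t = X^{t-1} + S Z^t
    return np.where(is_precip, scale * z_t, x_t)  # precipitation: X^t = S Z^t
```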

We broadly follow the diffusion framework presented in ref. 21, and refer the reader to their paper for a more rigorous introduction to diffusion, as well as a detailed treatment of the available modelling decisions. We adopt their choices of noise schedule, noise scaling, loss weighting by noise level and preconditioning. However, we make changes to the noise distribution, the training-time distribution of noise levels and add additional loss weightings, all of which are described below. These changes improve performance on the task of probabilistic weather forecasting.

Sampling process

The sampling process begins by drawing an initial sample \(\mathbf{Z}_{0}^{t}\) from a noise distribution on the sphere Pnoise(·|σ0), at a high initial noise level σ0. After N steps of transformation, we end up at \(\mathbf{Z}_{N}^{t} := \mathbf{Z}^{t}\), our sample from the target distribution at noise level σN = 0. To take us from one to the other, we apply an ODE solver to a probability flow ODE21,22. Each step of this solver is denoted by rθ (Fig. 1), with

$$\mathbf{Z}_{i+1}^{t}=r_{\theta}(\mathbf{Z}_{i}^{t};\mathbf{X}^{t-1},\mathbf{X}^{t-2},\sigma_{i+1},\sigma_{i})$$

taking us from noise level σi to the next (lower) noise level σi+1, conditioned on (Xt−1, Xt−2).

We use the second-order DPMSolver++2S solver53, augmented with the stochastic churn (again making use of Pnoise) and noise inflation techniques used in ref. 21 to inject further stochasticity into the sampling process. In conditioning on previous time steps, we follow the conditional denoising estimator approach outlined and motivated in ref. 54.

Each step rθ of the solver makes use of a learned denoiser Dθ with parameters θ, described in detail below. We take N = 20 solver steps per generated forecast time step. As we are using a second-order solver, each step rθ requires two function evaluations of the denoiser Dθ (except the last step, which requires only a single evaluation). This results in 39 function evaluations in total. See Supplementary Information section A.2.1 for further details, including a full list of sampling hyperparameters.
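The overall structure of the sampling loop is sketched below. For clarity it uses a Heun-style second-order step (the probability flow ODE dZ/dσ = (Z − Dθ(Z; σ))/σ, following refs. 21,22) as a stand-in for DPMSolver++2S, and omits stochastic churn and noise inflation; `denoise` and `noise_on_sphere` are hypothetical stand-ins for Dθ and for sampling from Pnoise.

```python
import numpy as np

def sample_forecast_step(denoise, noise_on_sphere, x_prev, x_prev2, sigmas):
    """Skeleton of one forecast step's diffusion sampling (a sketch).

    `sigmas` is the decreasing noise schedule [sigma_0, ..., sigma_N = 0].
    With N = 20 steps and two denoiser evaluations per step (one on the
    last), this performs 2N - 1 = 39 function evaluations.
    """
    z = sigmas[0] * noise_on_sphere()  # Z_0^t ~ P_noise(. | sigma_0)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = (z - denoise(z, x_prev, x_prev2, sigma)) / sigma  # dZ/dsigma
        z_euler = z + (sigma_next - sigma) * d
        if sigma_next > 0:  # second-order correction: second evaluation
            d2 = (z_euler - denoise(z_euler, x_prev, x_prev2, sigma_next)) / sigma_next
            z = z + (sigma_next - sigma) * 0.5 * (d + d2)
        else:               # final step to sigma_N = 0: single evaluation
            z = z_euler
    return z  # Z^t, the sampled residual
```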

Noise distribution on the sphere

At the core of a diffusion model is the addition and removal of noise, drawn from some distribution Pnoise(·|σ) parameterized by noise level σ. When using diffusion to generate natural images55, Pnoise is usually chosen to be independent and identically distributed (i.i.d.) Gaussian. However, we have found it beneficial to use a noise distribution that better respects the spherical geometry of global weather variables. Rather than sampling i.i.d. Gaussian noise on the latitude–longitude grid, we instead sample isotropic Gaussian white noise on the sphere, which is then projected onto the grid. This choice of Pnoise has the consequence that the noise has a flat spherical harmonic power spectrum in expectation. For motivation and details of these changes, see Supplementary Information section A.2.3.
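One simple way to realize such a noise distribution is sketched below, under the assumption of a band-limited expansion with i.i.d. Gaussian coefficients in a real spherical harmonic basis, which gives a flat angular power spectrum in expectation (the paper's exact projection is described in Supplementary Information section A.2.3):

```python
import numpy as np
from scipy.special import sph_harm  # sph_harm(m, l, azimuth, colatitude)

def isotropic_sphere_noise(lats, lons, lmax=15, rng=None):
    """Isotropic Gaussian white noise on the sphere, band-limited at lmax.

    Draws i.i.d. N(0, 1) coefficients for a real spherical harmonic basis
    and evaluates the field on a latitude-longitude grid (degrees).
    """
    rng = rng if rng is not None else np.random.default_rng()
    colat = np.deg2rad(90.0 - np.asarray(lats))[:, None]  # polar angle
    lon = np.deg2rad(np.asarray(lons))[None, :]           # azimuth
    field = np.zeros((len(lats), len(lons)))
    for l in range(lmax + 1):
        field += rng.standard_normal() * sph_harm(0, l, lon, colat).real
        for m in range(1, l + 1):
            ylm = sph_harm(m, l, lon, colat)
            field += np.sqrt(2) * (rng.standard_normal() * ylm.real
                                   + rng.standard_normal() * ylm.imag)
    return field
```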

Denoiser architecture

To recap, our diffusion sampling process involves taking several solver steps rθ, and each solver step calls a denoiser Dθ as part of its computation. We parameterize the denoiser Dθ following ref. 21 as a preconditioned version of a neural network function fθ:

$$D_{\theta}(\mathbf{Z}_{\sigma}^{t};\mathbf{X}^{t-1},\mathbf{X}^{t-2},\sigma):=c_{\text{skip}}(\sigma)\,\mathbf{Z}_{\sigma}^{t}+c_{\text{out}}(\sigma)\,f_{\theta}\big(c_{\text{in}}(\sigma)\,\mathbf{Z}_{\sigma}^{t};\,\mathbf{X}^{t-1},\mathbf{X}^{t-2},c_{\text{noise}}(\sigma)\big).$$

Here \(\mathbf{Z}_{\sigma}^{t}\) denotes a noise-corrupted version of the target Zt at noise level σ, and cin, cout, cskip and cnoise are preconditioning functions taken from table 1 in ref. 21, with σdata = 1 because of the normalization of the targets.
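For concreteness, with σdata = 1 the preconditioning functions from table 1 of ref. 21 take the following form (a direct transcription of that table, shown here as a sketch):

```python
import numpy as np

def preconditioning(sigma, sigma_data=1.0):
    """Preconditioning functions from table 1 of Karras et al. (ref. 21).

    The denoiser is D(z) = c_skip * z + c_out * f(c_in * z, ..., c_noise);
    sigma_data = 1 because the targets are normalized to unit variance.
    """
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / np.sqrt(sigma**2 + sigma_data**2)
    c_in = 1.0 / np.sqrt(sigma**2 + sigma_data**2)
    c_noise = 0.25 * np.log(sigma)
    return c_skip, c_out, c_in, c_noise
```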

The architecture used for fθ is related to the GraphCast architecture2. To be precise, the Encoder and Decoder architectures stay the same, and the inputs to the Encoder corresponding to the previous two time steps are normalized to zero mean and unit variance in the same way. However, whereas GraphCast uses a message-passing GNN for its Processor, similar to those in its Encoder and Decoder, in GenCast the Processor is a graph transformer operating on a spherical mesh that computes neighbourhood-based self-attention. Unlike the multimesh used in GraphCast, the mesh in GenCast is a six-times refined icosahedral mesh2, with 41,162 nodes and 246,960 edges. The Processor consists of 16 consecutive standard transformer blocks26,56, with feature dimension equal to 512. The four-head self-attention mechanism in each block is such that each node in the mesh attends to itself and to all other nodes within its 32-hop neighbourhood on the mesh.

To condition on previous time steps (Xt−1, Xt−2), we concatenate these along the channel dimension with the input to be denoised and feed this as input to the model. Conditioning on noise level σ is achieved by replacing all layer-norm layers in the architecture with conditional layer-norm57 layers. We transform log noise levels into a vector of sine–cosine Fourier features at 32 frequencies with base period 16 and pass them through a two-layer MLP to obtain 16-dimensional noise-level encodings. Each of the conditional layer-norm layers applies a further linear layer to output replacements for the standard scale and offset parameters of layer norm, conditioned on these noise-level encodings.
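A sketch of this conditioning pathway is below. Only the counts (32 frequencies, base period 16, 16-dimensional encodings) come from the text; the exact frequency spacing and the identity-like initialization of the scale are assumptions of this example.

```python
import numpy as np

def noise_level_features(sigma, num_freqs=32, base_period=16.0):
    """Sine-cosine Fourier features of the log noise level (a sketch).

    In the model, a two-layer MLP (not shown) maps these features to
    16-dimensional noise-level encodings.
    """
    x = np.log(sigma)
    freqs = 2.0 * np.pi * np.arange(1, num_freqs + 1) / base_period
    return np.concatenate([np.sin(freqs * x), np.cos(freqs * x)])

def conditional_layer_norm(h, cond, w_scale, w_offset, eps=1e-5):
    """Layer norm whose scale and offset are produced by a linear layer
    applied to the noise-level encoding `cond`, replacing the usual
    learned constants. `w_scale`/`w_offset` are hypothetical weight
    matrices of shape (feature_dim, cond_dim)."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    h_norm = (h - mu) / np.sqrt(var + eps)
    scale, offset = w_scale @ cond, w_offset @ cond
    return (1.0 + scale) * h_norm + offset  # identity-like when scale ~ 0
```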

Training the denoiser

At training time, we apply the denoiser to a version of the target Zt that has been corrupted by adding noise ε ~ Pnoise(·|σ) at noise level σ:

$$\mathbf{Y}^{t}=D_{\theta}(\mathbf{Z}^{t}+\boldsymbol{\varepsilon};\,\mathbf{X}^{t-1},\mathbf{X}^{t-2},\sigma).$$

We train its output, denoted Yt, to predict the expectation of the noise-free target Zt by minimizing the following mean-squared-error objective, weighted per variable and pressure level and by latitude–longitude cell area,

$$\sum_{t\in D_{\text{train}}}\mathbb{E}\left[\lambda(\sigma)\frac{1}{|G||J|}\sum_{i\in G}\sum_{j\in J}w_{j}\,a_{i}\,\big(Y_{i,j}^{t}-Z_{i,j}^{t}\big)^{2}\right],$$

where

  • t indexes the different time steps in the training set Dtrain;

  • j ∈ J indexes the variable, and for atmospheric variables the pressure level, that is, J = {z1000, z850, …, 2t, msl};

  • i ∈ G indexes the location (latitude and longitude coordinates) in the grid;

  • wj is the per-variable-level loss weight, set as in GraphCast2 with the additional sea surface temperature variable weighted at 0.1;

  • ai is the area of the latitude–longitude grid cell, which varies with latitude and is normalized to unit mean over the grid;

  • λ(σ) is the per-noise-level loss weight in ref. 21; and

  • the expectation is taken over σ ~ Ptrain, ε ~ Pnoise(·|σ).

Instead of using the log-normal distribution for Ptrain suggested in ref. 21, we construct a distribution whose quantiles match the noise-level schedule used for sample generation, assigning higher probability to noise levels that are closer together during sampling. Details are in Supplementary Information section A.2.2. As in GraphCast2, we weight the squared error at each latitude–longitude grid cell by a per-variable-level loss weight, as well as by the normalized area of that grid cell; this is also a departure from ref. 21.
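A per-example sketch of this weighted objective (the (n_lat, n_lon, n_vars) array packing is illustrative; wj, ai and λ(σ) are as defined above):

```python
import numpy as np

def weighted_mse_loss(y, z, lats, w_var, lam_sigma):
    """Area- and variable-weighted denoising objective for one example.

    y, z: denoiser output Y^t and clean target Z^t, shape
    (n_lat, n_lon, n_vars); w_var: per-variable-and-level weights w_j;
    lam_sigma: the per-noise-level weight lambda(sigma).
    """
    area = np.cos(np.deg2rad(lats))
    area = area / area.mean()                  # a_i, normalized to unit mean
    sq_err = (y - z) ** 2
    weighted = sq_err * w_var[None, None, :] * area[:, None, None]
    return lam_sigma * weighted.mean()         # mean over the |G||J| terms
```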

Unlike GraphCast, which is fine-tuned by back-propagating gradients through 12-step trajectories (3 days with 6 h steps) produced by feeding the model its own predictions as inputs during training, GenCast is only ever trained using targets that consist of the next 12-h state, without ever being provided its own predictions on previous steps as inputs.

Resolution training schedule

The GenCast results reported in this paper were generated by a model trained in a two-stage process. Stage 1 was a pre-training stage of 2 million training steps. During this stage, the ground truth dataset was bilinearly downsampled from 0.25° to 1° and the denoiser architecture used a 5-refined icosahedral mesh. This stage takes a little over 3.5 days using 32 TPUv5 instances. Stage 2 then fine-tuned the model to 0.25°, taking 64,000 further training steps and just under 1.5 days using 32 TPUv5 instances. During stage 2, the ground truth data are kept at 0.25°, and the denoiser architecture is updated to take 0.25° inputs, produce 0.25° outputs and operate on a 6-refined icosahedral mesh. The GNN and graph-transformer architectures are such that the same model weights can operate at the higher data and mesh resolutions without any alterations. We do, however, make a minor modification before beginning the fine-tuning stage to reduce the shock to the model of operating on higher-resolution data. In the Encoder GNN, which performs message passing between the grid and mesh nodes, when the data resolution increases from 1° to 0.25°, the number of messages received by each mesh node increases by a factor of 16. To approximately preserve the scale of the incoming signal to the mesh nodes at the start of fine-tuning, we divide the sum of these message vectors by 16. The optimization hyperparameters used for both stages of training are detailed in Supplementary Information section A.3.

Training data

We trained GenCast on a dataset built from the ERA5 archive of ECMWF27, a large corpus of global reanalysis data. Our dataset contains the best-estimate analyses of ERA5 for a subset of the available variables, on 13 pressure levels (see Extended Data Table 1 for a complete list of variables and pressure levels), on a 0.25° equiangular grid. We also subsampled the temporal resolution from 1 h to 6 h, corresponding to 00:00, 06:00, 12:00 and 18:00 UTC times each day. From this dataset, we extracted sequences at 12-h temporal resolution (sequences of 00/12 UTC or 06/18 UTC times) to train GenCast.

Although its temporal resolution is hourly, ERA5 assimilates observations in 12-h windows, from 21 UTC to 09 UTC and from 09 UTC to 21 UTC. This means that steps taken within a single 12-h assimilation window have a different, less dispersed distribution from steps that jump from one window into the next. By choosing a 12-h time step, we avoid training on this bimodal distribution and ensure that our model always predicts a target from the next assimilation window.
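The construction of training examples at a 12-h step from the 6-hourly dataset can be illustrated as follows (function and variable names are hypothetical):

```python
def make_training_examples(states, hours):
    """Build (X^{t-2}, X^{t-1}, X^t) triples at a 12-h step (a sketch).

    `states[i]` is the analysis at `hours[i]` (hours since the dataset
    start, at 00/06/12/18 UTC). Conditioning on two 12-h-separated states
    and predicting the next one keeps every target in the assimilation
    window following its inputs (00/12 UTC or 06/18 UTC sequences).
    """
    index = {h: i for i, h in enumerate(hours)}
    examples = []
    for h in hours:
        if h - 12 in index and h - 24 in index:
            examples.append((states[index[h - 24]],   # X^{t-2}
                             states[index[h - 12]],   # X^{t-1}
                             states[index[h]]))       # target X^t
    return examples
```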

For accumulated variables such as precipitation, instead of subsampling the data in time, we accumulated values over the 12-h period preceding each time.

Our dataset covers the period 1979–2019. During the development phase of GenCast, we used dates from 1979 to 2017 for training and validated results in 2018. Before starting the test phase, we froze all model and training choices, retrained the model on data from 1979 to 2018 and evaluated results in 2019.

GenCast-Perturbed training protocol

GenCast-Perturbed is trained by taking the GenCast architecture for fθ described above in the section ‘Denoiser architecture’, removing the conditioning on noise level and noisy targets, and training it at 0.25° resolution as a deterministic forecast model using the same training dataset. It takes (Xt−1, Xt−2) as inputs and outputs a single forecast of the normalized residual target Zt. It is trained to minimize the mean-squared error of its single-step 12-h forecasts. Specifically, we minimize

$$\sum_{t\in D_{\text{train}}}\mathbb{E}\left[\frac{1}{|G||J|}\sum_{i\in G}\sum_{j\in J}w_{j}\,a_{i}\,\big(Y_{i,j}^{t}-Z_{i,j}^{t}\big)^{2}\right],$$

where in this case Yt is the deterministic forecast rather than the output of a denoising step, and \(t\), \(j\in J\), \(i\in G\), \(w_j\) and \(a_i\) are all defined as above. The optimization hyperparameters are detailed in Supplementary Information section A.3.

Statistical methods

We compare GenCast with ENS on several verification metrics (detailed in Supplementary Information section A.5) computed on our 2019 evaluation set. For each relevant metric (and where applicable at each lead time, level, quantile and cost/loss ratio), we test the null hypothesis of no difference in the metric between GenCast and ENS, against the two-sided alternative. Specifically, we are testing for differences in the values of the metrics that would be attained in the limit of infinite years of evaluation data, assuming the stationarity of the climate.

Most of our metrics are computed from time series of spatially aggregated values given at n = 730 12-hourly initialization times from 2019. For these metrics, we apply a paired-differences significance test based on the stationary block bootstrap58, which handles temporal dependence by resampling blocks of the time-series data from which the metric is computed. We use automatic block length selection59,60.
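For illustration, a simplified paired bootstrap is sketched below, using fixed-length circular blocks and percentile intervals; the actual tests use the stationary block bootstrap with automatic block-length selection and BCa intervals.

```python
import numpy as np

def block_bootstrap_diff(metric_a, metric_b, block_len=30, n_boot=10000,
                         rng=None):
    """Simplified paired block-bootstrap test (a sketch).

    metric_a/metric_b: per-initialization-time metric values for the two
    models (n = 730 here). Resampling blocks of the paired differences
    preserves temporal dependence within blocks.
    """
    rng = rng if rng is not None else np.random.default_rng()
    d = np.asarray(metric_a) - np.asarray(metric_b)
    n = d.size
    n_blocks = int(np.ceil(n / block_len))
    means = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, n, size=n_blocks)
        idx = (starts[:, None] + np.arange(block_len)) % n  # circular blocks
        means[b] = d[idx.ravel()[:n]].mean()
    lo, hi = np.percentile(means, [2.5, 97.5])
    return d.mean(), (lo, hi)  # reject H0 if the interval excludes 0
```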

By contrast, deterministic cyclone position error is only obtained for a given cyclone at select times at which pairing criteria are met. For this metric, we instead perform a cluster bootstrap61 that assumes independence between (but not within) cyclones.

We base all our tests on bias-corrected and accelerated (BCa) bootstrap confidence intervals62. Further details of the statistical tests are given in Supplementary Information section A.6.

Local surface extremes evaluation

We evaluate GenCast and ENS on the task of predicting when surface weather variables exceed high (99.99th, 99.9th and 99th) and low (0.01st, 0.1st and 1st) climatological percentiles. These percentiles are computed per latitude–longitude grid cell using seven years of 6-hourly data from 2016 to 2022, taken from the corresponding ground truth dataset of each model (ERA5 for GenCast and HRES-fc0 for ENS). For each grid cell, the 99.99th and 0.01st percentiles correspond to a return period of approximately 7 years.
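A minimal sketch of computing the per-cell thresholds and exceedance events (the array layout is an assumption):

```python
import numpy as np

def exceedance_events(ground_truth, q=99.99):
    """Binary high-percentile exceedance events per grid cell (a sketch).

    `ground_truth`: array of shape (n_times, n_lat, n_lon) holding
    6-hourly data for 2016-2022 from a model's own analysis dataset.
    Returns the per-cell climatological threshold and the event mask.
    """
    threshold = np.percentile(ground_truth, q, axis=0)  # per lat-lon cell
    events = ground_truth > threshold
    return threshold, events
```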

Tropical cyclone evaluation

We extract cyclone trajectories from ENS and GenCast forecasts using the same cyclone tracker, TempestExtremes, downsampling ENS forecasts from 6-h to 12-h resolution for a fair comparison with GenCast. We also apply the same cyclone tracker to chunks of HRES-fc0 and ERA5 spanning the same time period as the forecast trajectories, generating ground truth cyclone tracks for each model and initialization. The ensemble cyclone forecast skill of each model is then evaluated against its own ground truth. We use established deterministic and probabilistic verification methods from the tropical cyclone literature, detailed below. See Supplementary Information section A.7 for a comparison between our two cyclone evaluations, the motivation behind our choice of ground truth and further cyclone tracker details.

Cyclone position error evaluation

We evaluate ensemble mean cyclone trajectory forecasts from GenCast and ENS using position error48. To be able to compare GenCast and ENS against the same cyclones (despite being evaluated against different ground truths), we first associate the ERA5 and HRES-fc0 cyclone trajectories with named cyclones from the International Best Track Archive for Climate Stewardship (IBTrACS)63,64. Ground truth cyclones that are within 200 km (in geodesic distance) of an IBTrACS cyclone at lead time zero are retained and any others are removed (TempestExtremes and IBTrACS have different definitions of a cyclone, meaning that they do not necessarily identify exactly the same set of cyclones).

Next, for both models, each ensemble member cyclone trajectory is paired to a TempestExtremes named ground truth cyclone if it is within 100 km of that cyclone at lead time zero (otherwise it is removed). We then compute the ensemble mean cyclone location for each named cyclone as the cyclone progresses until fewer than 50% of the ensemble member cyclones remain active and compute the position error between each ensemble mean cyclone centre and its corresponding ground truth cyclone centre. To account for the 6-h offsets between GenCast and ENS initializations, we estimate the position error of ENS at the same 06/18 UTC initializations as GenCast by averaging the two position errors on either side of that initialization with the same lead time2. For a fair comparison, we evaluate GenCast and ENS against exactly the same cyclones and lead times by computing average position error over the intersection of named cyclone and lead time pairs for which both a GenCast and ENS ensemble mean track position error exists (Fig. 4b).
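A sketch of the core geometric computations follows; averaging member positions via unit vectors on the sphere is an assumption of this example (the paper does not specify an averaging convention), and the position error is the geodesic distance between the ensemble-mean centre and the ground truth centre.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Geodesic (great-circle) distance in km between two points."""
    p1, p2 = np.deg2rad(lat1), np.deg2rad(lat2)
    dlat, dlon = p2 - p1, np.deg2rad(lon2 - lon1)
    a = np.sin(dlat / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def ensemble_mean_position(lats, lons):
    """Mean cyclone centre of the active ensemble members, computed by
    averaging unit vectors on the sphere and renormalizing."""
    lat, lon = np.deg2rad(lats), np.deg2rad(lons)
    xyz = np.stack([np.cos(lat) * np.cos(lon),
                    np.cos(lat) * np.sin(lon),
                    np.sin(lat)]).mean(axis=1)
    x, y, z = xyz / np.linalg.norm(xyz)
    return np.rad2deg(np.arcsin(z)), np.rad2deg(np.arctan2(y, x))
```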

Cyclone track probability evaluation

To evaluate the probabilistic skill of ensemble cyclone tracks, we compute 1° resolution track probability heatmaps for each time step, in which the predicted probability in each 1° cell is the fraction of ensemble members predicting a cyclone centre within that cell. We choose 1° as it corresponds to 111 km at the equator, which is close to 120 km, a common radius used for defining cyclone track probability48. We convert the ground truth cyclone tracks from ERA5 and HRES-fc0 to binary ground truth maps for each initialization time and lead time. Finally, we follow ref. 34 in computing the REV of the track probability forecast of each model against their respective binary ground truth heatmaps.
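A minimal sketch of building one such heatmap for a single lead time (the grid indexing convention is an assumption):

```python
import numpy as np

def track_probability(member_lats, member_lons, n_members):
    """1-degree track-probability heatmap for one lead time (a sketch).

    member_lats/lons: cyclone-centre positions of ensemble members that
    predict an active cyclone. The probability in each cell is the
    fraction of all members whose predicted centre falls in that cell.
    """
    heatmap = np.zeros((180, 360))
    rows = np.clip(np.floor(90 - np.asarray(member_lats)).astype(int), 0, 179)
    cols = np.floor(np.asarray(member_lons)).astype(int) % 360
    np.add.at(heatmap, (rows, cols), 1.0 / n_members)
    return heatmap
```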

Unlike the paired position error analysis above, this track probability analysis does not restrict the ground truth TempestExtremes tracks to IBTrACS-named cyclones, nor does it evaluate GenCast and ENS against exactly the same cyclones. Owing to differences between HRES-fc0 and ERA5, the TempestExtremes cyclone tracker identifies 23% more cyclones in HRES-fc0 than in ERA5. However, REV accounts for this difference in base rates by virtue of its normalizations with respect to climatology and the perfect forecast (Supplementary Information section A.5.6), and is thus a fair metric to use when comparing methods evaluated against different ground truths. Furthermore, even when using HRES-fc0 as the ground truth of GenCast, which puts GenCast at a disadvantage, GenCast outperforms ENS beyond one day lead times (Supplementary Information Fig. B12).

Cyclone tracker

To extract cyclone trajectories from gridded forecasts and analysis datasets, we use the TempestExtremes v2.1 cyclone tracking algorithm47. TempestExtremes is open-source on GitHub (https://github.com/ClimateGlobalChange/tempestextremes) and has been used in a wide range of cyclone studies47. The algorithm has two stages. The first stage, DetectNodes, finds candidate tropical cyclones where minima in mean sea level pressure are co-located with upper-level warm cores. The second stage, StitchNodes, stitches these locations together to form trajectories. Further details of how the tracker identifies cyclones and what is involved in each tracker stage are given in Supplementary Information section A.7.3, and readers are referred to refs. 47,65 for full details.

In their 2017 work, the authors of ref. 66 optimized the hyperparameters of TempestExtremes so that when applied to 6-hourly reanalysis datasets the resulting tracks closely match the observed tracks from the IBTrACS dataset63,64. We made two changes to the StitchNodes hyperparameters of the tracker (Supplementary Information section A.7.3) to account for the 12-hourly (instead of 6-hourly) temporal resolution of our evaluation, but otherwise left all tracker hyperparameters at their default values. We then used the same set of tracker hyperparameters for each model and each analysis dataset.

As TempestExtremes performs a global optimization when stitching nodes, the track results at a particular lead time depend on raw predictions at nearby lead times. We therefore prepend 10 days of the respective ground truth of each model (ERA5 or HRES-fc0) to each forecast before running the cyclone tracker. This prevents cyclones from being dropped when a forecast is initialized close to the end of a cyclone's lifetime, in which case the short duration of the cyclone within the forecast period would otherwise fail the criteria of the tracker. Similarly, we report results only up to lead times of 9 days, despite providing 15 days of predictions to the tracker, because the tracker may drop cyclones that begin close to the end of the forecast period.

Spatially pooled CRPS evaluation

To evaluate skill at forecasting spatial structure, we compute spatially pooled versions of CRPS. Our approach is an instance of neighbourhood verification39, adapted to the surface of a sphere. We define pool centres as the nodes of a k-times refined icosahedral mesh. Pooling regions are defined within a fixed geodesic distance of each pool centre, with radii set to the mean distance between mesh nodes. To capture performance at different spatial scales, we do this separately for 6 mesh refinement levels (k = 7, 6, …, 2), resulting in a wide range of pool sizes: 120 km, 241 km, 481 km, 962 km, 1,922 km and 3,828 km. We evaluate performance on two types of pooling aggregation: average pooling and max pooling. Forecasts and targets are first aggregated over pooling regions and then standard skill scores are computed on these pooled counterparts. For average pooling, the grid cells are weighted by their area. Finally, to account for slight non-uniformities in the distribution of pooling centres when computing the global average-pooled CRPS, we weight each pooling region by the area of the Voronoi cell of the pooling centre.
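The pool-then-score recipe is sketched below. The CRPS estimator shown uses the fair (unbiased) ensemble form, which is an assumption of this example (exact definitions are in Supplementary Information section A.5), and the Voronoi-cell weighting of pool centres is omitted for brevity.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Fair ensemble CRPS estimator for a scalar quantity (a sketch)."""
    m = members.shape[0]
    term1 = np.abs(members - obs).mean(axis=0)
    term2 = np.abs(members[:, None] - members[None, :]).sum((0, 1)) / (2 * m * (m - 1))
    return term1 - term2

def pooled_crps(members, obs, pool_masks, cell_area, pool="mean"):
    """Spatially pooled CRPS: aggregate forecasts and targets over each
    pooling region first, then score the pooled scalars.

    members: (n_members, n_cells); obs: (n_cells,); pool_masks: boolean
    masks, one per pooling region (cells within the geodesic radius of a
    pool centre); cell_area: per-cell weights for area-weighted pooling.
    """
    scores = []
    for mask in pool_masks:
        if pool == "mean":
            w = cell_area[mask] / cell_area[mask].sum()
            fc, ob = members[:, mask] @ w, obs[mask] @ w
        else:  # max pooling
            fc, ob = members[:, mask].max(axis=1), obs[mask].max()
        scores.append(crps_ensemble(fc, ob))
    return np.mean(scores)
```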

These metrics are computed for 2 m temperature, 10 m wind speed, 12-h accumulated precipitation and mean sea level pressure at 0.25° (Supplementary Figs. B13 and B14).

We also compute pooled CRPS scorecards for wind speed, geopotential, temperature and specific humidity at all pressure levels (Extended Data Figs. 6 and 7 and Supplementary Figs. B15 and B16). To reduce the computational cost of these pooled scorecard evaluations that include all pressure levels, forecasts and targets were subsampled to 1° before pooling. In this case, we skipped the smallest pool size because 120 km corresponds to approximately 1° at the equator, making it similar to a univariate evaluation of the subsampled forecasts.

Supplementary Information section A.8 provides further motivation and details on the pooled metrics evaluation.

Regional wind power evaluation

For the regional wind power forecasting experiment, we use all 5,344 wind farm locations and their nominal capacities from the Global Power Plant Database (GPPD)44, which captures about 40% of all global wind farm capacity as of 2020 (ref. 44). We first bilinearly interpolate 10 m wind speed forecasts and analysis states at each wind farm location. We then map 10 m wind speed to load factor, the ratio of actual wind turbine power output to maximum power output, using an idealized International Electrotechnical Commission Class II 2 MW turbine power curve from the WIND Toolkit67. This power curve has a cut-in speed of 3 m s−1, maximum output at 14 m s−1 and curtailment at 25 m s−1 (Supplementary Fig. A1). The load factor is then multiplied by the nominal capacity to obtain idealized power generation in megawatts at each wind farm.
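A sketch of this mapping, in which the curve shape between cut-in and rated speed is approximated by a normalized cubic ramp (an assumption; the paper uses the tabulated WIND Toolkit curve shown in Supplementary Fig. A1):

```python
import numpy as np

def load_factor(wind_speed_10m, cut_in=3.0, rated=14.0, cut_out=25.0):
    """Idealized turbine load factor in [0, 1] from 10 m wind speed (m/s).

    Cubic ramp between cut-in and rated speed, constant maximum output up
    to the cut-out speed, and zero output (curtailment) beyond it.
    """
    v = np.asarray(wind_speed_10m, dtype=float)
    ramp = (v**3 - cut_in**3) / (rated**3 - cut_in**3)  # 0 at cut-in, 1 at rated
    lf = np.clip(ramp, 0.0, 1.0)
    return np.where(v >= cut_out, 0.0, lf)

power_mw = load_factor(12.0) * 2.0  # e.g. one idealized 2 MW turbine
```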

To generate arbitrary groupings of wind farms across the globe at a range of spatial scales, we use a procedure similar to the pooled evaluation. Pooling centres are defined on a 7-times refined icosahedral mesh and separate evaluations are performed using pool sizes of 120 km, 240 km and 480 km. The 120 km scale contains 3,648 groups with a mean capacity of 272 MW, the 240 km scale contains 7,759 groups with a mean capacity of 513 MW and the 480 km scale contains 15,913 groups with a mean capacity of 996 MW. The power output is summed over the wind farm sites in each group and CRPS is computed for this derived quantity. We then compute the average CRPS across all wind farm groups. By using power as the target variable, more weight is given in the global average CRPS to pools containing more wind farm capacity.

Accounting for assimilation windows

During our 2019 test period, ENS was initialized with analyses whose assimilation window had between 3 h and 5 h of look-ahead beyond the stated initialization time68. The 06/18 UTC ERA5 initializations of the ML models afford them only 3 h of look-ahead. The 00/12 UTC states of ERA5 have 9 h of look-ahead, which, as we show in Supplementary Fig. B20, translates into improved metrics for 00/12 UTC initializations over 06/18 UTC initializations. Overall, the difference in assimilation windows used in our evaluation leaves ENS with a small advantage of up to 2 h of additional look-ahead over the ML models, for all variables except sea surface temperature.

ENS initialization and evaluation times

As discussed above, we evaluate GenCast only on forecasts initialized at 06/18 UTC, as using 00/12 UTC-initialized forecasts would give GenCast an additional advantage because of the longer data-assimilation look-ahead. Ideally, we would compare all models at the same 06/18 UTC initialization times. However, ENS forecasts from 06/18 UTC are archived only up to 6-day lead times and are not free for public download. Hence, we evaluate ENS on forecasts initialized at 00/12 UTC. For globally averaged metrics, this should not matter; in fact, ref. 2 found that 00/12 UTC initialization tends to give a small RMSE advantage to the deterministic HRES forecast over 06/18 UTC initialization, and we expect a similar minor advantage to apply to ENS. However, the regional wind power evaluation is sensitive to the diurnal cycle because wind power capacity is sparsely and non-uniformly distributed around the world. Thus, in this case, it is important to compare ENS and GenCast forecasts at the same set of validity times. We therefore evaluate ENS (initialized at 00/12 UTC) at the same 06/18 UTC targets as GenCast. However, GenCast produces 06/18 UTC forecasts at lead times of 12 h, 24 h, 36 h and so on, whereas for ENS we obtain only 06/18 UTC forecasts at lead times of 6 h, 18 h, 30 h and so on. To estimate the 06/18 UTC regional wind power CRPS of ENS at the same lead times as GenCast, we linearly interpolate the CRPS curve of ENS. In Supplementary Information section B.8.1, we validate this approach on 2018 data, for which we had access to ENS 06/18 UTC initializations, showing that this lead-time interpolation overestimates the performance of ENS, particularly at short lead times.
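Concretely, the interpolation amounts to the following (a sketch; `ens_crps` would hold per-lead-time CRPS values computed from the ENS forecasts):

```python
import numpy as np

ens_leads = np.arange(6, 355, 12)       # ENS 06/18 UTC validity: 6 h, 18 h, ..., 354 h
gencast_leads = np.arange(12, 349, 12)  # GenCast lead times kept inside the ENS range

def ens_crps_at(target_leads, ens_leads, ens_crps):
    """Linearly interpolate ENS's CRPS-vs-lead-time curve onto GenCast's
    lead times; keeping targets inside the ENS lead-time range means
    np.interp never extrapolates."""
    return np.interp(target_leads, ens_leads, ens_crps)
```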
