Task definition and general approach
A general formulation of the task of probabilistic weather forecasting from the present time t = 0 into the future is to model the joint probability distribution \(P({\bar{{\bf{X}}}}^{0:T}| {{\bf{O}}}^{\le 0})\), where T is the forecast horizon, \({\bar{{\bf{X}}}}^{t}\) denotes the atmospheric state at time t and O≤0 are observations made up to the forecast initialization time t = 0. This joint distribution can be factored as
$$P({\bar{{\bf{X}}}}^{0:T}| {{\bf{O}}}^{\le 0})=\mathop{\underbrace{P({\bar{{\bf{X}}}}^{0}| {{\bf{O}}}^{\le 0})}}\limits_{{\rm{State}}\,{\rm{inference}}}\,\mathop{\underbrace{P({\bar{{\bf{X}}}}^{1:T}| {\bar{{\bf{X}}}}^{0})}}\limits_{{\rm{Forecast}}\,{\rm{model}}}$$
Our innovation in this work is an MLWP-based Forecast model, and we adopt a traditional NWP-based State inference approach. We make several approximations to the above general formulation, as follows.
Full atmospheric states are not directly observed, and so we approximate each \({\bar{{\bf{X}}}}^{t}\) with a best-estimate NWP-based analysis state Xt, which has been generated at finite resolution, using a window of observations in a process known as data assimilation. In our case, each Xt is an 84 × 720 × 1,440 array, which includes six surface variables and six atmospheric variables at each of 13 vertical pressure levels (Extended Data Table 1), on a 0.25° latitude–longitude grid. We generate 15-day forecasts, at 12-h steps, so T = 30.
As a first, standard approximation29, we use analysis X0:T as evaluation targets. This means we are in effect evaluating the forecast of each model as a predictive distribution,
$$P({{\bf{X}}}^{0:T}| {{\bf{O}}}^{\le 0}),$$
over sequences of future best-estimate NWP analyses.
Second, we wish to rely on a Markov assumption; however, although the underlying atmospheric state sequence \({\bar{{\bf{X}}}}^{1:T}\) is Markov, it is only partially observed in X1:T. In our models GenCast and GenCast-Perturbed, we therefore make a weaker second-order Markov approximation, under which we factorize
$$P({{\bf{X}}}^{-1},{{\bf{X}}}^{0:T}| {{\bf{O}}}^{\le 0})=P({{\bf{X}}}^{0},{{\bf{X}}}^{-1}| {{\bf{O}}}^{\le 0})\mathop{\prod }\limits_{t=1}^{T}P({{\bf{X}}}^{t}| {{\bf{X}}}^{t-1},{{\bf{X}}}^{t-2}).$$
We found that conditioning on two previous time steps works better than one.
For GenCast, the initialization P(X0, X−1|O≤0) is handled by fixing (X0, X−1) to their values obtained from two consecutive best-estimate analyses from the ERA5 dataset27. For GenCast-Perturbed, additional perturbations are added; see Supplementary Information section A.4. With initialization dealt with, the problem is reduced to modelling P(Xt|Xt−1, Xt−2), and samples of X1:T can be generated autoregressively.
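For illustration, the autoregressive generation implied by this second-order factorization can be sketched as follows. The `sample_step` function stands in for one draw from P(Xt|Xt−1, Xt−2); here it is a toy noisy update, not the actual diffusion sampler.

```python
import numpy as np

def rollout(sample_step, x_prev, x_prev2, num_steps, rng):
    """Autoregressively sample a trajectory X^1..X^T: each new state is
    drawn conditioned on the two most recent states, and the conditioning
    window then shifts forward by one step."""
    trajectory = []
    for _ in range(num_steps):
        x_new = sample_step(x_prev, x_prev2, rng)
        trajectory.append(x_new)
        x_prev, x_prev2 = x_new, x_prev
    return trajectory

# Toy stand-in sampler (purely illustrative, not the diffusion model).
def toy_sample_step(x1, x2, rng):
    return 0.9 * x1 + 0.1 * x2 + 0.01 * rng.standard_normal(x1.shape)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))   # stand-in for an 84 x 720 x 1440 state
xm1 = rng.standard_normal((4, 8))
traj = rollout(toy_sample_step, x0, xm1, num_steps=30, rng=rng)  # T = 30
```

With T = 30 steps of 12 h each, the rollout spans the 15-day forecast horizon.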
Diffusion model specification
Beyond image and video generation, diffusion models21,22,23 have also been applied in the geophysical domain, to tasks including data assimilation50, NWP ensemble emulation51 and climate downscaling52. In this work, we model P(Xt|Xt−1, Xt−2) with a diffusion model, which enables us to sample forecast trajectories.
Rather than sampling Xt directly, our approach is to sample a residual Zt with respect to the most recent weather state Xt−1, in which the residuals have been normalized to unit variance on a per-variable and per-level basis, as was done for GraphCast2. Xt is then computed as Xt = Xt−1 + SZt, where S is a diagonal matrix that inverts the normalization. The one exception to this is precipitation, for which we set Xt = SZt without adding the previous state.
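A minimal sketch of this un-normalization step, with illustrative shapes; S is represented by its diagonal of per-variable scales, and the `is_precip` mask is a hypothetical name introduced here only to mark the precipitation exception:

```python
import numpy as np

def next_state(x_prev, z, scale, is_precip):
    """x_prev, z: arrays of shape (num_vars, lat, lon); scale: diagonal of
    S (per-variable std devs); is_precip: boolean mask over variables.
    Residual variables are added to the previous state after inverting the
    normalization; precipitation is predicted directly."""
    delta = scale[:, None, None] * z
    return np.where(is_precip[:, None, None], delta, x_prev + delta)

# Tiny example with three variables, the last being "precipitation".
x_prev = np.ones((3, 2, 2))
z = np.full((3, 2, 2), 0.5)
scale = np.array([2.0, 4.0, 1.0])
is_precip = np.array([False, False, True])
x_next = next_state(x_prev, z, scale, is_precip)
```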
We broadly follow the diffusion framework presented in ref. 21, and refer the reader to their paper for a more rigorous introduction to diffusion, as well as a detailed treatment of the available modelling decisions. We adopt their choices of noise schedule, noise scaling, loss weighting by noise level and preconditioning. However, we make changes to the noise distribution and the training-time distribution of noise levels, and add additional loss weightings, all of which are described below. These changes improve performance on the task of probabilistic weather forecasting.
Sampling process
The sampling process begins by drawing an initial sample \({{\bf{Z}}}_{0}^{t}\) from a noise distribution on the sphere Pnoise(·|σ0), at a high initial noise level σ0. After N steps of transformation, we end up at \({{\bf{Z}}}_{N}^{t}:= {{\bf{Z}}}^{t}\), our sample from the target distribution at noise level σN = 0. To take us from one to the other, we apply an ODE solver to a probability flow ODE21,22. Each step of this solver is denoted by rθ (Fig. 1), with
$${{\bf{Z}}}_{i+1}^{t}={r}_{\theta }({{\bf{Z}}}_{i}^{t};{{\bf{X}}}^{t-1},{{\bf{X}}}^{t-2},{\sigma }_{i+1},{\sigma }_{i})$$
taking us from noise level σi to the next (lower) noise level σi+1, conditioned on (Xt−1, Xt−2).
We use the second-order DPMSolver++2S solver53, augmented with the stochastic churn (again making use of Pnoise) and noise inflation techniques used in ref. 21 to inject further stochasticity into the sampling process. In conditioning on previous time steps, we follow the conditional denoising estimator approach outlined and motivated in ref. 54.
Each step rθ of the solver makes use of a learned denoiser Dθ with parameters θ, described in detail below. We take N = 20 solver steps per generated forecast time step. As we are using a second-order solver, each step rθ requires two function evaluations of the denoiser Dθ (except the last step, which requires only a single evaluation). This results in 39 function evaluations in total. See Supplementary Information section A.2.1 for further details, including a full list of sampling hyperparameters.
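The function-evaluation count follows directly from the solver order:

```python
def num_function_evals(num_solver_steps):
    # Each second-order solver step costs two denoiser evaluations,
    # except the final step, which needs only one.
    return 2 * (num_solver_steps - 1) + 1

assert num_function_evals(20) == 39  # N = 20 steps per forecast time step
```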
Noise distribution on the sphere
At the core of a diffusion model is the addition and removal of noise, drawn from some distribution Pnoise(·|σ) parameterized by noise level σ. When using diffusion to generate natural images55, Pnoise is usually chosen to be independent and identically distributed (i.i.d.) Gaussian. However, we have found it beneficial to use a noise distribution that better respects the spherical geometry of global weather variables. Rather than sampling i.i.d. Gaussian noise on the latitude–longitude grid, we instead sample isotropic Gaussian white noise on the sphere, which is then projected onto the grid. This choice of Pnoise has the consequence that the noise has a flat spherical harmonic power spectrum in expectation. For motivation and details of these changes, see Supplementary Information section A.2.3.
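The flat-spectrum property can be illustrated directly in spectral space: drawing every spherical harmonic coefficient i.i.d. from a standard normal gives an expected power of 1 at every degree l. (Projecting such coefficients onto a latitude–longitude grid requires a spherical harmonic transform library, omitted here.)

```python
import numpy as np

# Draw i.i.d. standard-normal spherical harmonic coefficients a_lm and
# estimate the power per degree, C_l = mean over m of a_lm^2. In
# expectation this spectrum is flat: C_l = 1 for every degree l.
rng = np.random.default_rng(0)
l_max, n_draws = 32, 1000
power = np.empty((n_draws, l_max + 1))
for d in range(n_draws):
    for l in range(l_max + 1):
        a_lm = rng.standard_normal(2 * l + 1)  # coefficients for degree l
        power[d, l] = np.mean(a_lm ** 2)
mean_power = power.mean(axis=0)  # close to 1 at every degree
```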
Denoiser architecture
To recap, our diffusion sampling process involves taking several solver steps rθ, and each solver step calls a denoiser Dθ as part of its computation. Following ref. 21, we parameterize the denoiser Dθ as a preconditioned version of a neural network function fθ:
$${D}_{\theta }({{\bf{Z}}}_{\sigma }^{t};{{\bf{X}}}^{t-1},{{\bf{X}}}^{t-2},\sigma )\,:= \,{c}_{{\rm{skip}}}(\sigma )\cdot {{\bf{Z}}}_{\sigma }^{t}+{c}_{{\rm{out}}}(\sigma )\cdot {f}_{\theta }({c}_{{\rm{in}}}(\sigma ){{\bf{Z}}}_{\sigma }^{t};\,{{\bf{X}}}^{t-1},{{\bf{X}}}^{t-2},{c}_{{\rm{noise}}}(\sigma )).$$
Here \({{\bf{Z}}}_{\sigma }^{t}\) denotes a noise-corrupted version of the target Zt at noise level σ, and cin, cout, cskip and cnoise are preconditioning functions taken from table 1 in ref. 21, with σdata = 1 because of the normalization of the targets.
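With σdata = 1, the preconditioning functions from table 1 of ref. 21 take the simple forms sketched below; fθ is left abstract.

```python
import numpy as np

def preconditioning(sigma, sigma_data=1.0):
    """Preconditioning functions of table 1 in ref. 21; with
    sigma_data = 1 they reduce to simple functions of sigma alone."""
    c_skip = sigma_data ** 2 / (sigma ** 2 + sigma_data ** 2)
    c_out = sigma * sigma_data / np.sqrt(sigma ** 2 + sigma_data ** 2)
    c_in = 1.0 / np.sqrt(sigma ** 2 + sigma_data ** 2)
    c_noise = 0.25 * np.log(sigma)
    return c_skip, c_out, c_in, c_noise

def denoise(f, z_sigma, cond, sigma):
    # The preconditioned denoiser D_theta from the equation above;
    # `f` stands for the neural network f_theta.
    c_skip, c_out, c_in, c_noise = preconditioning(sigma)
    return c_skip * z_sigma + c_out * f(c_in * z_sigma, cond, c_noise)
```

As σ → 0, c_skip → 1 and c_out → 0, so the denoiser approaches the identity on its input; at large σ, the network output dominates.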
The architecture used for fθ is related to the GraphCast architecture2. To be precise, the Encoder and Decoder architectures stay the same, and the inputs to the Encoder corresponding to the previous two time steps are normalized to zero mean and unit variance in the same way. However, whereas GraphCast uses a message-passing GNN similar to those of the Encoder and Decoder for its Processor, in GenCast the Processor is a graph-transformer model operating on a spherical mesh that computes neighbourhood-based self-attention. Unlike the multimesh used in GraphCast, the mesh in GenCast is a six-times refined icosahedral mesh2, with 41,162 nodes and 246,960 edges. The Processor consists of 16 consecutive standard transformer blocks26,56, each with a feature dimension of 512. The four-head self-attention mechanism in each block is such that each node in the mesh attends to itself and to all other nodes within its 32-hop neighbourhood on the mesh.
To condition on previous time steps (Xt−1, Xt−2), we concatenate these along the channel dimension with the input to be denoised and feed this as input to the model. Conditioning on noise level σ is achieved by replacing all layer-norm layers in the architecture with conditional layer-norm57 layers. We transform log noise levels into a vector of sine–cosine Fourier features at 32 frequencies with base period 16 and pass them through a two-layer MLP to obtain 16-dimensional noise-level encodings. Each of the conditional layer-norm layers applies a further linear layer to output replacements for the standard scale and offset parameters of layer norm, conditioned on these noise-level encodings.
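A sketch of the noise-level encoding and conditional layer norm described above. The exact frequency spacing of the Fourier features is an assumption (harmonics of the base period), and the learned MLP and linear layers that produce the scale and offset are only indicated, not modelled:

```python
import numpy as np

def fourier_features(log_sigma, num_freqs=32, base_period=16.0):
    """Sine-cosine features of the log noise level at 32 frequencies with
    base period 16; harmonic frequency spacing is an assumption."""
    k = np.arange(1, num_freqs + 1)
    phase = 2.0 * np.pi * k * log_sigma / base_period
    return np.concatenate([np.sin(phase), np.cos(phase)])  # shape (64,)

def conditional_layer_norm(x, scale, offset, eps=1e-6):
    """Layer norm whose scale and offset would come from the noise-level
    encoding via a learned linear layer (not modelled here)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return scale * (x - mu) / np.sqrt(var + eps) + offset
```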
Training the denoiser
At training time, we apply the denoiser to a version of the target Zt that has been corrupted by adding noise ε ~ Pnoise(·|σ) at noise level σ:
$${{\bf{Y}}}^{t}={D}_{\theta }({{\bf{Z}}}^{t}+{\boldsymbol{\varepsilon }}\,;{{\bf{X}}}^{t-1},{{\bf{X}}}^{t-2},\sigma ).$$
We train its output, denoted Yt, to predict the expectation of the noise-free target Zt by minimizing the following mean-squared-error objective, weighted per variable and level and by latitude–longitude cell area,
$$\sum _{t\in {D}_{{\rm{train}}}}E\left[\lambda (\sigma )\frac{1}{| G| | \,J| }\sum _{i\in G}\sum _{j\in J}{w}_{j}{a}_{i}{({Y}_{i,j}^{t}-{Z}_{i,j}^{t})}^{2}\right],$$
where

- t indexes the different time steps in the training set Dtrain;
- j ∈ J indexes the variable, and for atmospheric variables the pressure level, that is, J = {z1000, z850, …, 2t, msl};
- i ∈ G indexes the location (latitude and longitude coordinates) in the grid;
- wj is the per-variable-level loss weight, set as in GraphCast2 with the additional sea surface temperature variable weighted at 0.1;
- ai is the area of the latitude–longitude grid cell, which varies with latitude and is normalized to unit mean over the grid;
- λ(σ) is the per-noise-level loss weight in ref. 21; and
- the expectation is taken over σ ~ Ptrain, ε ~ Pnoise(·; σ).
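The inner weighted mean-squared error can be sketched numerically as follows, with area weights proportional to cos(latitude) and normalized to unit mean; the λ(σ) factor and the expectation over noise levels are omitted for brevity:

```python
import numpy as np

def area_weights(lats_deg):
    # On an equiangular grid, cell area is proportional to cos(latitude);
    # normalize to unit mean over the grid, as in the text.
    w = np.cos(np.deg2rad(lats_deg))
    return w / w.mean()

def weighted_mse(y, z, var_weights, lats_deg):
    """y, z: (num_vars, num_lats, num_lons); var_weights: w_j per
    variable-level; area weights a_i vary with latitude only."""
    a = area_weights(lats_deg)[None, :, None]
    w = var_weights[:, None, None]
    return np.mean(w * a * (y - z) ** 2)

lats = np.linspace(-88.75, 88.75, 72)
y = np.zeros((2, 72, 144))
z = y - 1.0  # unit squared error everywhere
loss = weighted_mse(y, z, np.ones(2), lats)  # equals 1 by normalization
```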
Instead of using the log-normal distribution for Ptrain suggested in ref. 21, we construct a distribution whose quantiles match the noise-level schedule used for sample generation, assigning higher probability to noise levels that are closer together during sampling. Details are in Supplementary Information section A.2.2. As done by GraphCast2, we weight the squared error made at each latitude–longitude grid cell by a per-variable-level loss weight, as well as by the normalized area of that grid cell; this is also a departure from ref. 21.
Unlike GraphCast, which is fine-tuned by back-propagating gradients through 12-step trajectories (3 days at 6-h steps) produced by feeding the model its own predictions as inputs during training, GenCast is only ever trained using targets that consist of the next 12-h state, without ever being provided its own predictions on previous steps as inputs.
Resolution training schedule
The GenCast results reported in this paper were generated by a model that was trained in a two-stage process. Stage 1 was a pre-training stage, taking 2 million training steps. During this stage, the ground truth dataset was bilinearly downsampled from 0.25° to 1° and the denoiser architecture used a 5-refined icosahedral mesh. This training stage takes a little over 3.5 days using 32 TPUv5 instances. After this training phase was complete, stage 2 was conducted, fine-tuning the model to 0.25°, taking 64,000 further training steps. This takes just under 1.5 days using 32 TPUv5 instances. During stage 2, the ground truth data are kept at 0.25°, and the denoiser architecture is updated to take in 0.25° data, output 0.25° outputs and operate on a 6-refined icosahedral mesh. The GNN and graph-transformer architectures are such that the same model weights can operate on the higher data and mesh resolutions without any alterations. We do, however, make a minor modification before beginning the fine-tuning stage to decrease the shock to the model of operating on higher-resolution data. In the Encoder GNN, which performs message passing between the grid and mesh nodes, when the data resolution increases from 1° to 0.25°, the number of messages received by each mesh node increases by a factor of 16. To approximately preserve the scale of the incoming signal to all mesh nodes at the start of fine-tuning, we divide the sum of these message vectors by 16. The optimization hyperparameters used for both stages of training are detailed in Supplementary Information section A.3.
Training data
We trained GenCast on a dataset built from the ERA5 archive of ECMWF27, a large corpus of global reanalysis data. Our dataset contains the best-estimate analyses of ERA5 for a subset of the available variables, on 13 pressure levels (see Extended Data Table 1 for a complete list of variables and pressure levels), on a 0.25° equiangular grid. We also subsampled the temporal resolution from 1 h to 6 h, corresponding to 00:00, 06:00, 12:00 and 18:00 UTC each day. From this dataset, we extracted sequences at 12-h temporal resolution (sequences of 00/12 UTC or 06/18 UTC times) to train GenCast.
Although its temporal resolution is hourly, ERA5 only assimilates observations in 12-h windows, from 21 UTC to 09 UTC and from 09 UTC to 21 UTC. This means that steps taken within a single 12-h assimilation window have a different, less dispersed distribution than steps that jump from one window into the next. By choosing a 12-h time step, we avoid training on this bimodal distribution and ensure that our model always predicts a target from the next assimilation window.
For accumulated variables such as precipitation, instead of subsampling the data in time, we accumulated values over the 12-h period preceding each time.
Our dataset covers the period 1979–2019. During the development phase of GenCast, we used dates from 1979 to 2017 for training and validated results on 2018. Before starting the test phase, we froze all model and training choices, retrained the model on data from 1979 to 2018 and evaluated results on 2019.
GenCast-Perturbed training protocol
GenCast-Perturbed is trained by taking the GenCast architecture for fθ described above in the section 'Denoiser architecture', removing the conditioning on noise level and noisy targets, and training it at 0.25° resolution as a deterministic forecast model using the same training dataset. It takes (Xt−1, Xt−2) as inputs and outputs a single forecast of the normalized residual target Zt. It is trained to minimize the mean-squared error of its single-step 12-h forecasts. Specifically, we minimize
$$\sum _{t\in {D}_{{\rm{train}}}}{\rm{E}}\left[\frac{1}{| G| | \,J| }\sum _{i\in G}\sum _{j\in J}{w}_{j}{a}_{i}{({Y}_{i,j}^{t}-{Z}_{i,j}^{t})}^{2}\right],$$
where in this case Yt is the deterministic forecast rather than the output of a denoising step and \(t,j\in J,i\in G,{w}_{j},{a}_{i}\) are all defined as above. The optimization hyperparameters are detailed in Supplementary Information section A.3.
Statistical methods
We compare GenCast with ENS on several verification metrics (detailed in Supplementary Information section A.5) computed on our 2019 evaluation set. For each relevant metric (and where applicable at each lead time, level, quantile and cost/loss ratio), we test the null hypothesis of no difference in the metric between GenCast and ENS, against the two-sided alternative. Specifically, we are testing for differences in the values of the metrics that would be attained in the limit of infinite years of evaluation data, assuming the stationarity of the climate.
Most of our metrics are computed from time series of spatially aggregated values given at n = 730 12-hourly initialization times from 2019. For these metrics, we apply a paired-differences significance test based on the stationary block bootstrap58, which handles temporal dependence by resampling blocks of the time-series data from which the metric is computed. We use automatic block length selection59,60.
By contrast, deterministic cyclone position error is only obtained for a given cyclone at select times at which pairing criteria are met. For this metric, we instead perform a cluster bootstrap61 that assumes independence between (but not within) cyclones.
We base all our tests on bias-corrected and accelerated (BCa) bootstrap confidence intervals62. Further details of the statistical tests are given in Supplementary Information section A.6.
Local surface extremes evaluation
We evaluate GenCast and ENS on the task of predicting when surface weather variables exceed high (99.99th, 99.9th and 99th) and low (0.01st, 0.1st and 1st) climatological percentiles. These percentiles are computed per latitude–longitude location using seven recent years (2016–2022) of 6-hourly data, taken from the corresponding ground truth dataset for each model (ERA5 for GenCast and HRES-fc0 for ENS). For each latitude–longitude location, the 99.99th and 0.01st percentiles correspond to a return period of approximately 7 years.
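The quoted return period follows from a short back-of-the-envelope calculation:

```python
# 6-hourly data gives 4 samples per day. An exceedance probability of
# 0.01% (the 99.99th percentile) then implies roughly one exceedance
# every ~7 years at a given location.
samples_per_year = 365.25 * 4
exceedance_prob = 1.0 - 0.9999
exceedances_per_year = samples_per_year * exceedance_prob
return_period_years = 1.0 / exceedances_per_year  # about 6.8 years
```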
Tropical cyclone evaluation
We extract cyclone trajectories from ENS and GenCast forecasts using the same cyclone tracker, TempestExtremes, downsampling ENS forecasts from a 6-h to 12-h resolution for a fair comparison with GenCast. We also apply the same cyclone tracker to chunks of HRES-fc0 and ERA5 spanning the same time period as the forecast trajectories, generating ground truth cyclone tracks for each model and initialization. The ensemble cyclone forecast skill of each model is then evaluated against its own ground truth. We use established deterministic and probabilistic verification methods from the tropical cyclone literature, detailed below. See Supplementary Information section A.7 for a comparison between our two cyclone evaluations, the motivation behind our choice of ground truth and further cyclone tracker details.
Cyclone position error evaluation
We evaluate ensemble mean cyclone trajectory forecasts from GenCast and ENS using position error48. To be able to compare GenCast and ENS against the same cyclones (despite their being evaluated against different ground truths), we first associate the ERA5 and HRES-fc0 cyclone trajectories with named cyclones from the International Best Track Archive for Climate Stewardship (IBTrACS)63,64. Ground truth cyclones that are within 200 km (in geodesic distance) of an IBTrACS cyclone at lead time zero are retained and any others are removed (TempestExtremes and IBTrACS have different definitions of a cyclone, meaning that they do not necessarily identify exactly the same set of cyclones).
Next, for both models, each ensemble member cyclone trajectory is paired to a TempestExtremes named ground truth cyclone if it is within 100 km of that cyclone at lead time zero (otherwise it is removed). We then compute the ensemble mean cyclone location for each named cyclone as the cyclone progresses, until fewer than 50% of the ensemble member cyclones remain active, and compute the position error between each ensemble mean cyclone centre and its corresponding ground truth cyclone centre. To account for the 6-h offsets between GenCast and ENS initializations, we estimate the position error of ENS at the same 06/18 UTC initializations as GenCast by averaging the two position errors on either side of that initialization with the same lead time2. For a fair comparison, we evaluate GenCast and ENS against exactly the same cyclones and lead times by computing the average position error over the intersection of named cyclone and lead time pairs for which both a GenCast and an ENS ensemble mean track position error exists (Fig. 4b).
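The geodesic pairing criterion can be sketched with a haversine distance (a standard great-circle formula; the exact geodesic computation used in the evaluation may differ):

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def geodesic_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in km between two points given
    in degrees."""
    p1, p2 = np.deg2rad(lat1), np.deg2rad(lat2)
    dphi = p2 - p1
    dlmb = np.deg2rad(lon2 - lon1)
    h = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))

# A member track is paired to a ground truth cyclone only if it is
# within 100 km of it at lead time zero (illustrative coordinates).
paired = geodesic_km(25.0, -80.0, 25.5, -80.3) <= 100.0
```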
Cyclone track probability evaluation
To evaluate the probabilistic skill of ensemble cyclone tracks, we compute 1° resolution track probability heatmaps for each time step, in which the predicted probability in each 1° cell is the fraction of ensemble members predicting a cyclone centre within that cell. We choose 1° as it corresponds to 111 km at the equator, which is close to 120 km, a common radius used for defining cyclone track probability48. We convert the ground truth cyclone tracks from ERA5 and HRES-fc0 to binary ground truth maps for each initialization time and lead time. Finally, we follow ref. 34 in computing the REV of the track probability forecast of each model against their respective binary ground truth heatmaps.
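A minimal sketch of how such a heatmap can be computed from ensemble member cyclone-centre positions at one lead time; the exact binning convention is an assumption:

```python
import numpy as np

def track_probability(member_lats, member_lons, res_deg=1.0):
    """member_lats/lons: per-ensemble-member cyclone-centre positions at
    one lead time (NaN where a member has no active cyclone). Returns a
    lat-lon map of the fraction of members per cell."""
    n_members = len(member_lats)
    n_lat, n_lon = int(180 / res_deg), int(360 / res_deg)
    heat = np.zeros((n_lat, n_lon))
    for lat, lon in zip(member_lats, member_lons):
        if np.isnan(lat):
            continue
        i = min(int((lat + 90) / res_deg), n_lat - 1)
        j = min(int((lon % 360) / res_deg), n_lon - 1)
        heat[i, j] += 1
    return heat / n_members

# Example: 3 of 4 members place the centre in the same cell near 10N, 50E.
heat = track_probability([10.2, 10.3, 10.4, float("nan")],
                         [50.1, 50.2, 50.3, float("nan")])
```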
Unlike the paired position error analysis above, this track probability analysis does not restrict the ground truth TempestExtremes tracks to IBTrACS-named cyclones, nor does it evaluate GenCast and ENS against exactly the same cyclones. Owing to differences between HRES-fc0 and ERA5, the TempestExtremes cyclone tracker identifies 23% more cyclones in HRES-fc0 than in ERA5. However, REV accounts for this difference in base rates by virtue of its normalizations with respect to climatology and the perfect forecast (Supplementary Information section A.5.6), and is thus a fair metric to use when comparing methods evaluated against different ground truths. Furthermore, even when using HRES-fc0 as the ground truth of GenCast, which puts GenCast at a disadvantage, GenCast outperforms ENS beyond one-day lead times (Supplementary Information Fig. B12).
Cyclone tracker
To extract cyclone trajectories from gridded forecasts and analysis datasets, we use the TempestExtremes v2.1 cyclone tracking algorithm47. TempestExtremes is open-source on GitHub (https://github.com/ClimateGlobalChange/tempestextremes) and has been used in a wide range of cyclone studies47. The algorithm has two stages. The first stage, DetectNodes, finds candidate tropical cyclones where minima in mean sea level pressure are co-located with upper-level warm cores. The second stage, StitchNodes, stitches these locations together to form trajectories. Further details of how the tracker identifies cyclones and what is involved in each tracker stage are given in Supplementary Information section A.7.3, and readers are referred to refs.â47,65 for full details.
In their 2017 work, the authors of ref. 66 optimized the hyperparameters of TempestExtremes so that, when applied to 6-hourly reanalysis datasets, the resulting tracks closely match the observed tracks from the IBTrACS dataset63,64. We made two changes to the StitchNodes hyperparameters of the tracker (Supplementary Information section A.7.3) to account for the 12-hourly (instead of 6-hourly) temporal resolution of our evaluation, but otherwise left all tracker hyperparameters at their default values. We then used the same set of tracker hyperparameters for each model and each analysis dataset.
As TempestExtremes performs a global optimization when stitching nodes, the track results at a particular lead time depend on raw predictions at nearby lead times. We prepend 10 days of each model's respective ground truth (ERA5 or HRES-fc0) to each forecast before running the cyclone tracker. This avoids cyclones being dropped when a forecast is initialized close to the end of a cyclone's lifetime, in which case the cyclone's short duration within the forecast period would otherwise fail the tracker's criteria. Similarly, we report results only up to lead times of 9 days despite providing 15 days of predictions to the tracker, because the tracker may drop cyclones that begin close to the end of the forecast period.
Spatially pooled CRPS evaluation
To evaluate skill at forecasting spatial structure, we compute spatially pooled versions of CRPS. Our approach is an instance of neighbourhood verification39, adapted to the surface of a sphere. We define pool centres as the nodes of a k-times refined icosahedral mesh. Pooling regions are defined within a fixed geodesic distance of each pool centre, with radii set to the mean distance between mesh nodes. To capture performance at different spatial scales, we do this separately for six mesh refinement levels (k = 7, 6, …, 2), resulting in a wide range of pool sizes: 120 km, 241 km, 481 km, 962 km, 1,922 km and 3,828 km. We evaluate performance on two types of pooling aggregation: average pooling and max pooling. Forecasts and targets are first aggregated over pooling regions and then standard skill scores are computed on these pooled counterparts. For average pooling, the grid cells are weighted by their area. Finally, to account for slight non-uniformities in the distribution of pooling centres when computing the global average-pooled CRPS, we weight each pooling region by the area of the Voronoi cell of its pooling centre.
These metrics are computed for 2 m temperature, 10 m wind speed, 12-h accumulated precipitation and mean sea level pressure at 0.25° (Supplementary Figs. B13 and B14).
We also compute pooled CRPS scorecards for wind speed, geopotential, temperature and specific humidity at all pressure levels (Extended Data Figs. 6 and 7 and Supplementary Figs. B15 and B16). To reduce the computational cost of these pooled scorecard evaluations, which include all pressure levels, forecasts and targets were subsampled to 1° before pooling. In this case, we skipped the smallest pool size because 120 km corresponds to approximately 1° at the equator, making it similar to a univariate evaluation of the subsampled forecasts.
Supplementary Information section A.8 provides further motivation and details on the pooled metrics evaluation.
Regional wind power evaluation
For the regional wind power forecasting experiment, we use all 5,344 wind farm locations and their nominal capacities from the Global Power Plant Database (GPPD)44, which captures about 40% of all global wind farm capacity as of 2020 (ref. 44). We first bilinearly interpolate 10 m wind speed forecasts and analysis states at each wind farm location. We then map 10 m wind speed to load factor (the ratio between the actual wind turbine power output and the maximum power output) using an idealized International Electrotechnical Commission Class II 2 MW turbine power curve from the WIND Toolkit67. This power curve has a cut-in speed of 3 m s−1, maximum output at 14 m s−1 and curtailment at 25 m s−1 (Supplementary Fig. A1). The load factor is then multiplied by the nominal capacity to obtain idealized power generation in megawatts at each wind farm.
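A sketch of the wind-speed-to-load-factor mapping. The cubic ramp between cut-in and rated speed is an illustrative assumption; the actual evaluation uses the tabulated WIND Toolkit power curve.

```python
import numpy as np

def load_factor(wind_speed):
    """Idealized turbine power curve: zero output below the 3 m/s cut-in
    speed, rated (maximum) output from 14 m/s and zero output above the
    25 m/s curtailment speed. The cubic ramp in between is an
    illustrative assumption, not the tabulated IEC Class II curve."""
    v = np.asarray(wind_speed, dtype=float)
    ramp = ((v ** 3 - 3.0 ** 3) / (14.0 ** 3 - 3.0 ** 3)).clip(0.0, 1.0)
    return np.where((v < 3.0) | (v > 25.0), 0.0, ramp)

lf = load_factor([2.0, 10.0, 14.0, 20.0, 26.0])
power_mw = lf * 2.0  # idealized output of a nominal 2 MW turbine
```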
To generate arbitrary groupings of wind farms across the globe at a range of spatial scales, we use a similar procedure to the pooled evaluation. Pooling centres are defined on a 7-times refined icosahedral mesh and separate evaluations are performed using pool sizes of 120 km, 240 km and 480 km. The 120 km scale contains 3,648 groups with a mean capacity of 272 MW, the 240 km scale contains 7,759 groups with a mean capacity of 513 MW and the 480 km scale contains 15,913 groups with a mean capacity of 996 MW. The power output is summed over the wind farm sites in each group and CRPS is computed for this derived quantity. We then compute the average CRPS across all wind farm groups. By using power as the target variable, more weight is given in the global average CRPS to pools containing more wind farm capacity.
Accounting for assimilation windows
During our 2019 test period, ENS was initialized with analyses whose assimilation window had between 3 h and 5 h of look-ahead beyond the stated initialization time68. The 06/18 UTC ERA5 initializations of the ML models afford them only 3 h of look-ahead. The 00/12 UTC states of ERA5 have 9 h of look-ahead, which we show in Supplementary Fig. B20 translates into improved metrics on 00/12 UTC initializations over 06/18 UTC initializations. Overall, the difference in assimilation windows used in our evaluation leaves ENS with a small advantage of up to 2 h of additional look-ahead over the ML models, for all variables except sea surface temperature.
ENS initialization and evaluation times
As discussed above, we evaluate GenCast only on forecasts initialized at 06/18 UTC, as using 00/12 UTC-initialized forecasts would give GenCast an additional advantage because of the longer data-assimilation look-ahead. Ideally, we would compare all models at the same 06/18 UTC initialization times. However, ENS forecasts from 06/18 UTC are archived only up to 6-day lead times and are not free for public download. Hence, we evaluate ENS on forecasts initialized at 00/12 UTC. For globally averaged metrics, this should not matter; in fact, ref. 2 found that 00/12 UTC initialization tends to give a small RMSE advantage to the deterministic HRES forecast over 06/18 UTC initialization, and we expect a similar minor advantage to apply to ENS. However, the regional wind power evaluation is sensitive to the diurnal cycle because wind power capacity is sparsely and non-uniformly distributed around the world. In this case, it is therefore important to compare forecasts by ENS and GenCast at the same set of validity times. Accordingly, we evaluate ENS (initialized at 00/12 UTC) at the same 06/18 UTC targets as GenCast. However, GenCast produces 06/18 UTC forecasts at lead times of 12 h, 24 h, 36 h and so on, whereas for ENS we obtain 06/18 UTC forecasts only at lead times of 6 h, 18 h, 30 h and so on. To estimate the 06/18 UTC regional wind power CRPS of ENS at the same lead times as GenCast, we linearly interpolate the CRPS curve of ENS. In Supplementary Information section B.8.1, we validate this approach on 2018 data, for which we did have access to ENS 06/18 UTC initializations, showing that this lead-time interpolation overestimates the performance of ENS, particularly at short lead times.
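The lead-time interpolation amounts to a simple linear interpolation of the CRPS curve; the CRPS values below are illustrative stand-ins, not results from the paper.

```python
import numpy as np

# ENS provides 06/18 UTC forecasts at lead times 6 h, 18 h, 30 h, ...;
# GenCast provides them at 12 h, 24 h, 36 h, .... Linearly interpolate
# the ENS CRPS curve to GenCast's lead times.
ens_leads = np.arange(6, 96, 12)          # 6, 18, ..., 90 (hours)
ens_crps = 0.1 * np.sqrt(ens_leads)       # stand-in CRPS-vs-lead-time curve
gencast_leads = np.arange(12, 90, 12)     # 12, 24, ..., 84 (hours)
ens_crps_interp = np.interp(gencast_leads, ens_leads, ens_crps)
```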