Task definition and general approach
A general formulation of the task of probabilistic weather forecasting from the present time t = 0 into the future is to model the joint probability distribution \(P({\bar{{\bf{X}}}}^{0:T}| {{\bf{O}}}^{\le 0})\), where T is the forecast horizon, \({\bar{{\bf{X}}}}^{t}\) denotes the atmospheric state at time t and O≤0 are observations made up to the forecast initialization time t = 0. This joint distribution can be factored as
$$P({\bar{{\bf{X}}}}^{0:T}| {{\bf{O}}}^{\le 0})=\mathop{\underbrace{P({\bar{{\bf{X}}}}^{0}| {{\bf{O}}}^{\le 0})}}\limits_{{\rm{State}}\,{\rm{inference}}}\,\mathop{\underbrace{P({\bar{{\bf{X}}}}^{1:T}| {\bar{{\bf{X}}}}^{0})}}\limits_{{\rm{Forecast}}\,{\rm{model}}}$$
Our innovation in this work is an MLWP-based Forecast model, and we adopt a traditional NWP-based State inference approach. We make several approximations to the above general formulation, as follows.
Full atmospheric states are not directly observed, and so we approximate each \({\bar{{\bf{X}}}}^{t}\) with a best-estimate NWP-based analysis state Xt, which has been generated at finite resolution, using a window of observations in a process known as data assimilation. In our case, each Xt is an 84 × 720 × 1,440 array, which includes six surface variables and six atmospheric variables at each of 13 vertical pressure levels (Extended Data Table 1), on a 0.25° latitude–longitude grid. We generate 15-day forecasts, at 12-h steps, so T = 30.
As a first, standard approximation29, we use analysis X0:T as evaluation targets. This means we are in effect evaluating the forecast of each model as a predictive distribution,
$$P({{\bf{X}}}^{0:T}| {{\bf{O}}}^{\le 0}),$$
over sequences of future best-estimate NWP analyses.
Second, we wish to rely on a Markov assumption; however, although the underlying atmospheric state sequence \({\bar{{\bf{X}}}}^{1:T}\) is Markov, it is only partially observed in X1:T. In our models GenCast and GenCast-Perturbed, we therefore make a weaker second-order Markov approximation, under which we factorize
$$P({{\bf{X}}}^{-1},{{\bf{X}}}^{0:T}| {{\bf{O}}}^{\le 0})=P({{\bf{X}}}^{0},{{\bf{X}}}^{-1}| {{\bf{O}}}^{\le 0})\mathop{\prod }\limits_{t=1}^{T}P({{\bf{X}}}^{t}| {{\bf{X}}}^{t-1},{{\bf{X}}}^{t-2}).$$
We found that conditioning on two previous time steps works better than one.
For GenCast, the initialization P(X0, X−1|O≤0) is handled by fixing (X0, X−1) to their values obtained from two consecutive best-estimate analyses from the ERA5 dataset27. For GenCast-Perturbed, additional perturbations are added; see Supplementary Information section A.4. With initialization dealt with, the problem is reduced to modelling P(Xt|Xt−1, Xt−2), and samples of X1:T can be generated autoregressively.
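For illustration, the autoregressive generation implied by this second-order factorization can be sketched as follows. The `sample_step` function stands in for one draw from P(Xt|Xt−1, Xt−2); here it is a toy noisy update, not the actual diffusion sampler.

```python
import numpy as np

def rollout(sample_step, x_prev, x_prev2, num_steps, rng):
    """Autoregressively sample a trajectory X^1..X^T: each new state is
    drawn conditioned on the two most recent states, and the conditioning
    window then shifts forward by one step."""
    trajectory = []
    for _ in range(num_steps):
        x_new = sample_step(x_prev, x_prev2, rng)
        trajectory.append(x_new)
        x_prev, x_prev2 = x_new, x_prev
    return trajectory

# Toy stand-in sampler (purely illustrative, not the diffusion model).
def toy_sample_step(x1, x2, rng):
    return 0.9 * x1 + 0.1 * x2 + 0.01 * rng.standard_normal(x1.shape)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))   # stand-in for an 84 x 720 x 1440 state
xm1 = rng.standard_normal((4, 8))
traj = rollout(toy_sample_step, x0, xm1, num_steps=30, rng=rng)  # T = 30
```

With T = 30 steps of 12 h each, the rollout spans the 15-day forecast horizon.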
Diffusion model specification
Beyond image and video generation, diffusion models21,22,23 have also been applied in the geophysical domain, to tasks including data assimilation50, NWP ensemble emulation51 and climate downscaling52. In this work, we model P(Xt|Xt−1, Xt−2) with a diffusion model, which enables us to sample forecast trajectories.
Rather than sampling Xt directly, our approach is to sample a residual Zt with respect to the most recent weather state Xt−1, in which the residuals have been normalized to unit variance on a per-variable and per-level basis, as was done for GraphCast2. Xt is then computed as Xt = Xt−1 + SZt, where S is a diagonal matrix that inverts the normalization. The one exception to this is precipitation, for which we set Xt = SZt without adding the previous state.
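A minimal sketch of this un-normalization step, with illustrative shapes; S is represented by its diagonal of per-variable scales, and the `is_precip` mask is a hypothetical name introduced here only to mark the precipitation exception:

```python
import numpy as np

def next_state(x_prev, z, scale, is_precip):
    """x_prev, z: arrays of shape (num_vars, lat, lon); scale: diagonal of
    S (per-variable std devs); is_precip: boolean mask over variables.
    Residual variables are added to the previous state after inverting the
    normalization; precipitation is predicted directly."""
    delta = scale[:, None, None] * z
    return np.where(is_precip[:, None, None], delta, x_prev + delta)

# Tiny example with three variables, the last being "precipitation".
x_prev = np.ones((3, 2, 2))
z = np.full((3, 2, 2), 0.5)
scale = np.array([2.0, 4.0, 1.0])
is_precip = np.array([False, False, True])
x_next = next_state(x_prev, z, scale, is_precip)
```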
We broadly follow the diffusion framework presented in ref. 21, and refer the reader to their paper for a more rigorous introduction to diffusion, as well as a detailed treatment of the available modelling decisions. We adopt their choices of noise schedule, noise scaling, loss weighting by noise level and preconditioning. However, we make changes to the noise distribution and the training-time distribution of noise levels, and add additional loss weightings, all of which are described below. These changes improve performance on the task of probabilistic weather forecasting.
Sampling process
The sampling process begins by drawing an initial sample \({{\bf{Z}}}_{0}^{t}\) from a noise distribution on the sphere Pnoise(·|σ0), at a high initial noise level σ0. After N steps of transformation, we end up at \({{\bf{Z}}}_{N}^{t}:= {{\bf{Z}}}^{t}\), our sample from the target distribution at noise level σN = 0. To take us from one to the other, we apply an ODE solver to a probability flow ODE21,22. Each step of this solver is denoted by rθ (Fig. 1), with
$${{\bf{Z}}}_{i+1}^{t}={r}_{\theta }({{\bf{Z}}}_{i}^{t};{{\bf{X}}}^{t-1},{{\bf{X}}}^{t-2},{\sigma }_{i+1},{\sigma }_{i})$$
taking us from noise level σi to the next (lower) noise level σi+1, conditioned on (Xt−1, Xt−2).
We use the second-order DPMSolver++2S solver53, augmented with the stochastic churn (again making use of Pnoise) and noise inflation techniques used in ref. 21 to inject further stochasticity into the sampling process. In conditioning on previous time steps, we follow the conditional denoising estimator approach outlined and motivated in ref. 54.
Each step rθ of the solver makes use of a learned denoiser Dθ with parameters θ, described in detail below. We take N = 20 solver steps per generated forecast time step. As we are using a second-order solver, each step rθ requires two function evaluations of the denoiser Dθ (except the last step, which requires only a single evaluation). This results in 39 function evaluations in total. See Supplementary Information section A.2.1 for further details, including a full list of sampling hyperparameters.
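The function-evaluation count follows directly from the solver order:

```python
def num_function_evals(num_solver_steps):
    # Each second-order solver step costs two denoiser evaluations,
    # except the final step, which needs only one.
    return 2 * (num_solver_steps - 1) + 1

assert num_function_evals(20) == 39  # N = 20 steps per forecast time step
```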
Noise distribution on the sphere
At the core of a diffusion model is the addition and removal of noise, drawn from some distribution Pnoise(·|σ) parameterized by noise level σ. When using diffusion to generate natural images55, Pnoise is usually chosen to be independent and identically distributed (i.i.d.) Gaussian. However, we have found it beneficial to use a noise distribution that better respects the spherical geometry of global weather variables. Rather than sampling i.i.d. Gaussian noise on the latitude–longitude grid, we instead sample isotropic Gaussian white noise on the sphere, which is then projected onto the grid. This choice of Pnoise has the consequence that the noise has a flat spherical harmonic power spectrum in expectation. For motivation and details of these changes, see Supplementary Information section A.2.3.
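The flat-spectrum property can be illustrated directly in spectral space: drawing every spherical harmonic coefficient i.i.d. from a standard normal gives an expected power of 1 at every degree l. (Projecting such coefficients onto a latitude–longitude grid requires a spherical harmonic transform library, omitted here.)

```python
import numpy as np

# Draw i.i.d. standard-normal spherical harmonic coefficients a_lm and
# estimate the power per degree, C_l = mean over m of a_lm^2. In
# expectation this spectrum is flat: C_l = 1 for every degree l.
rng = np.random.default_rng(0)
l_max, n_draws = 32, 1000
power = np.empty((n_draws, l_max + 1))
for d in range(n_draws):
    for l in range(l_max + 1):
        a_lm = rng.standard_normal(2 * l + 1)  # coefficients for degree l
        power[d, l] = np.mean(a_lm ** 2)
mean_power = power.mean(axis=0)  # close to 1 at every degree
```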
Denoiser architecture
To recap, our diffusion sampling process involves taking several solver steps rθ, and each solver step calls a denoiser Dθ as part of its computation. Following ref. 21, we parameterize the denoiser Dθ as a preconditioned version of a neural network function fθ:
$${D}_{\theta }({{\bf{Z}}}_{\sigma }^{t};{{\bf{X}}}^{t-1},{{\bf{X}}}^{t-2},\sigma )\,:= \,{c}_{{\rm{skip}}}(\sigma )\cdot {{\bf{Z}}}_{\sigma }^{t}+{c}_{{\rm{out}}}(\sigma )\cdot {f}_{\theta }({c}_{{\rm{in}}}(\sigma ){{\bf{Z}}}_{\sigma }^{t};\,{{\bf{X}}}^{t-1},{{\bf{X}}}^{t-2},{c}_{{\rm{noise}}}(\sigma )).$$
Here \({{\bf{Z}}}_{\sigma }^{t}\) denotes a noise-corrupted version of the target Zt at noise level σ, and cin, cout, cskip and cnoise are preconditioning functions taken from table 1 in ref. 21, with σdata = 1 because of the normalization of the targets.
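With σdata = 1, the preconditioning functions from table 1 of ref. 21 take the simple forms sketched below; fθ is left abstract.

```python
import numpy as np

def preconditioning(sigma, sigma_data=1.0):
    """Preconditioning functions of table 1 in ref. 21; with
    sigma_data = 1 they reduce to simple functions of sigma alone."""
    c_skip = sigma_data ** 2 / (sigma ** 2 + sigma_data ** 2)
    c_out = sigma * sigma_data / np.sqrt(sigma ** 2 + sigma_data ** 2)
    c_in = 1.0 / np.sqrt(sigma ** 2 + sigma_data ** 2)
    c_noise = 0.25 * np.log(sigma)
    return c_skip, c_out, c_in, c_noise

def denoise(f, z_sigma, cond, sigma):
    # The preconditioned denoiser D_theta from the equation above;
    # `f` stands for the neural network f_theta.
    c_skip, c_out, c_in, c_noise = preconditioning(sigma)
    return c_skip * z_sigma + c_out * f(c_in * z_sigma, cond, c_noise)
```

As σ → 0, c_skip → 1 and c_out → 0, so the denoiser approaches the identity on its input; at large σ, the network output dominates.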
The architecture used for fθ is related to the GraphCast architecture2. To be precise, the Encoder and Decoder architectures stay the same, and the inputs to the Encoder corresponding to the previous two time steps are normalized to zero mean and unit variance in the same way. However, whereas GraphCast uses a message-passing GNN similar to those of the Encoder and Decoder for its Processor, in GenCast the Processor is a graph-transformer model operating on a spherical mesh that computes neighbourhood-based self-attention. Unlike the multimesh used in GraphCast, the mesh in GenCast is a six-times refined icosahedral mesh2, with 41,162 nodes and 246,960 edges. The Processor consists of 16 consecutive standard transformer blocks26,56, each with a feature dimension of 512. The four-head self-attention mechanism in each block is such that each node in the mesh attends to itself and to all other nodes within its 32-hop neighbourhood on the mesh.
To condition on previous time steps (Xt−1, Xt−2), we concatenate these along the channel dimension with the input to be denoised and feed this as input to the model. Conditioning on noise level σ is achieved by replacing all layer-norm layers in the architecture with conditional layer-norm57 layers. We transform log noise levels into a vector of sine–cosine Fourier features at 32 frequencies with base period 16 and pass them through a two-layer MLP to obtain 16-dimensional noise-level encodings. Each of the conditional layer-norm layers applies a further linear layer to output replacements for the standard scale and offset parameters of layer norm, conditioned on these noise-level encodings.
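A sketch of the noise-level encoding and conditional layer norm described above. The exact frequency spacing of the Fourier features is an assumption (harmonics of the base period), and the learned MLP and linear layers that produce the scale and offset are only indicated, not modelled:

```python
import numpy as np

def fourier_features(log_sigma, num_freqs=32, base_period=16.0):
    """Sine-cosine features of the log noise level at 32 frequencies with
    base period 16; harmonic frequency spacing is an assumption."""
    k = np.arange(1, num_freqs + 1)
    phase = 2.0 * np.pi * k * log_sigma / base_period
    return np.concatenate([np.sin(phase), np.cos(phase)])  # shape (64,)

def conditional_layer_norm(x, scale, offset, eps=1e-6):
    """Layer norm whose scale and offset would come from the noise-level
    encoding via a learned linear layer (not modelled here)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return scale * (x - mu) / np.sqrt(var + eps) + offset
```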
Training the denoiser
At training time, we apply the denoiser to a version of the target Zt that has been corrupted by adding noise ε ~ Pnoise(·|σ) at noise level σ:
$${{\bf{Y}}}^{t}={D}_{\theta }({{\bf{Z}}}^{t}+{\boldsymbol{\varepsilon }}\,;{{\bf{X}}}^{t-1},{{\bf{X}}}^{t-2},\sigma ).$$
We train its output, denoted Yt, to predict the expectation of the noise-free target Zt by minimizing the following mean-squared-error objective, weighted per variable and level and by latitude–longitude cell area,
$$\sum _{t\in {D}_{{\rm{train}}}}E\left[\lambda (\sigma )\frac{1}{| G| | \,J| }\sum _{i\in G}\sum _{j\in J}{w}_{j}{a}_{i}{({Y}_{i,j}^{t}-{Z}_{i,j}^{t})}^{2}\right],$$
where

- t indexes the different time steps in the training set Dtrain;
- j ∈ J indexes the variable, and for atmospheric variables the pressure level, that is, J = {z1000, z850, …, 2t, msl};
- i ∈ G indexes the location (latitude and longitude coordinates) in the grid;
- wj is the per-variable-level loss weight, set as in GraphCast2 with the additional sea surface temperature variable weighted at 0.1;
- ai is the area of the latitude–longitude grid cell, which varies with latitude and is normalized to unit mean over the grid;
- λ(σ) is the per-noise-level loss weight in ref. 21; and
- the expectation is taken over σ ~ Ptrain, ε ~ Pnoise(·; σ).
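The inner weighted mean-squared error can be sketched numerically as follows, with area weights proportional to cos(latitude) and normalized to unit mean; the λ(σ) factor and the expectation over noise levels are omitted for brevity:

```python
import numpy as np

def area_weights(lats_deg):
    # On an equiangular grid, cell area is proportional to cos(latitude);
    # normalize to unit mean over the grid, as in the text.
    w = np.cos(np.deg2rad(lats_deg))
    return w / w.mean()

def weighted_mse(y, z, var_weights, lats_deg):
    """y, z: (num_vars, num_lats, num_lons); var_weights: w_j per
    variable-level; area weights a_i vary with latitude only."""
    a = area_weights(lats_deg)[None, :, None]
    w = var_weights[:, None, None]
    return np.mean(w * a * (y - z) ** 2)

lats = np.linspace(-88.75, 88.75, 72)
y = np.zeros((2, 72, 144))
z = y - 1.0  # unit squared error everywhere
loss = weighted_mse(y, z, np.ones(2), lats)  # equals 1 by normalization
```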
Instead of using the log-normal distribution for Ptrain suggested in ref. 21, we construct a distribution whose quantiles match the noise-level schedule used for sample generation, assigning higher probability to noise levels that are closer together during sampling. Details are in Supplementary Information section A.2.2. As done by GraphCast2, we weight the squared error made at each latitude–longitude grid cell by a per-variable-level loss weight, as well as by the normalized area of that grid cell; this is also a departure from ref. 21.
Unlike GraphCast, which is fine-tuned by back-propagating gradients through 12-step trajectories (3 days at 6-h steps) produced by feeding the model its own predictions as inputs during training, GenCast is only ever trained using targets that consist of the next 12-h state, without ever being provided its own predictions on previous steps as inputs.
Resolution training schedule
The GenCast results reported in this paper were generated by a model that was trained in a two-stage process. Stage 1 was a pre-training stage, taking 2 million training steps. During this stage, the ground truth dataset was bilinearly downsampled from 0.25° to 1° and the denoiser architecture used a 5-refined icosahedral mesh. This training stage takes a little over 3.5 days using 32 TPUv5 instances. After this training phase was complete, stage 2 was conducted, fine-tuning the model to 0.25°, taking 64,000 further training steps. This takes just under 1.5 days using 32 TPUv5 instances. During stage 2, the ground truth data are kept at 0.25°, and the denoiser architecture is updated to take in 0.25° data, output 0.25° outputs and operate on a 6-refined icosahedral mesh. The GNN and graph-transformer architectures are such that the same model weights can operate on the higher data and mesh resolutions without any alterations. We do, however, make a minor modification before beginning the fine-tuning stage to decrease the shock to the model of operating on higher-resolution data. In the Encoder GNN, which performs message passing between the grid and mesh nodes, when the data resolution increases from 1° to 0.25°, the number of messages received by each mesh node increases by a factor of 16. To approximately preserve the scale of the incoming signal to all mesh nodes at the start of fine-tuning, we divide the sum of these message vectors by 16. The optimization hyperparameters used for both stages of training are detailed in Supplementary Information section A.3.
Training data
We trained GenCast on a dataset built from the ERA5 archive of ECMWF27, a large corpus of global reanalysis data. Our dataset contains the best-estimate analyses of ERA5 for a subset of the available variables, on 13 pressure levels (see Extended Data Table 1 for a complete list of variables and pressure levels), on a 0.25° equiangular grid. We also subsampled the temporal resolution from 1 h to 6 h, corresponding to 00:00, 06:00, 12:00 and 18:00 UTC each day. From this dataset, we extracted sequences at 12-h temporal resolution (sequences of 00/12 UTC or 06/18 UTC times) to train GenCast.
Although its temporal resolution is hourly, ERA5 only assimilates observations in 12-h windows, from 21 UTC to 09 UTC and from 09 UTC to 21 UTC. This means that steps taken within a single 12-h assimilation window have a different, less dispersed distribution than steps that jump from one window into the next. By choosing a 12-h time step, we avoid training on this bimodal distribution and ensure that our model always predicts a target from the next assimilation window.
For accumulated variables such as precipitation, instead of subsampling the data in time, we accumulated values over the 12-h period preceding each time.
Our dataset covers the period 1979–2019. During the development phase of GenCast, we used dates from 1979 to 2017 for training and validated results on 2018. Before starting the test phase, we froze all model and training choices, retrained the model on data from 1979 to 2018 and evaluated results on 2019.
GenCast-Perturbed training protocol
GenCast-Perturbed is trained by taking the GenCast architecture for fθ described above in the section 'Denoiser architecture', removing the conditioning on noise level and noisy targets, and training it at 0.25° resolution as a deterministic forecast model using the same training dataset. It takes (Xt−1, Xt−2) as inputs and outputs a single forecast of the normalized residual target Zt. It is trained to minimize the mean-squared error of its single-step 12-h forecasts. Specifically, we minimize
$$\sum _{t\in {D}_{{\rm{train}}}}{\rm{E}}\left[\frac{1}{| G| | \,J| }\sum _{i\in G}\sum _{j\in J}{w}_{j}{a}_{i}{({Y}_{i,j}^{t}-{Z}_{i,j}^{t})}^{2}\right],$$
where in this case Yt is the deterministic forecast rather than the output of a denoising step and \(t,j\in J,i\in G,{w}_{j},{a}_{i}\) are all defined as above. The optimization hyperparameters are detailed in Supplementary Information section A.3.
Statistical methods
We compare GenCast with ENS on several verification metrics (detailed in Supplementary Information section A.5) computed on our 2019 evaluation set. For each relevant metric (and where applicable at each lead time, level, quantile and cost/loss ratio), we test the null hypothesis of no difference in the metric between GenCast and ENS, against the two-sided alternative. Specifically, we are testing for differences in the values of the metrics that would be attained in the limit of infinite years of evaluation data, assuming the stationarity of the climate.
Most of our metrics are computed from time series of spatially aggregated values given at n = 730 12-hourly initialization times from 2019. For these metrics, we apply a paired-differences significance test based on the stationary block bootstrap58, which handles temporal dependence by resampling blocks of the time-series data from which the metric is computed. We use automatic block length selection59,60.
By contrast, deterministic cyclone position error is only obtained for a given cyclone at select times at which pairing criteria are met. For this metric, we instead perform a cluster bootstrap61 that assumes independence between (but not within) cyclones.
We base all our tests on bias-corrected and accelerated (BCa) bootstrap confidence intervals62. Further details of the statistical tests are given in Supplementary Information section A.6.
Local surface extremes evaluation
We evaluate GenCast and ENS on the task of predicting when surface weather variables exceed high (99.99th, 99.9th and 99th) and low (0.01st, 0.1st and 1st) climatological percentiles. These percentiles are computed per latitude–longitude location using seven recent years (2016–2022) of 6-hourly data, taken from the corresponding ground truth dataset for each model (ERA5 for GenCast and HRES-fc0 for ENS). For each latitude–longitude location, the 99.99th and 0.01st percentiles correspond to a return period of approximately 7 years.
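The quoted return period follows from a short back-of-the-envelope calculation:

```python
# 6-hourly data gives 4 samples per day. An exceedance probability of
# 0.01% (the 99.99th percentile) then implies roughly one exceedance
# every ~7 years at a given location.
samples_per_year = 365.25 * 4
exceedance_prob = 1.0 - 0.9999
exceedances_per_year = samples_per_year * exceedance_prob
return_period_years = 1.0 / exceedances_per_year  # about 6.8 years
```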
Tropical cyclone evaluation
We extract cyclone trajectories from ENS and GenCast forecasts using the same cyclone tracker, TempestExtremes, downsampling ENS forecasts from a 6-h to 12-h resolution for a fair comparison with GenCast. We also apply the same cyclone tracker to chunks of HRES-fc0 and ERA5 spanning the same time period as the forecast trajectories, generating ground truth cyclone tracks for each model and initialization. The ensemble cyclone forecast skill of each model is then evaluated against its own ground truth. We use established deterministic and probabilistic verification methods from the tropical cyclone literature, detailed below. See Supplementary Information section A.7 for a comparison between our two cyclone evaluations, the motivation behind our choice of ground truth and further cyclone tracker details.
Cyclone position error evaluation
We evaluate ensemble mean cyclone trajectory forecasts from GenCast and ENS using position error48. To be able to compare GenCast and ENS against the same cyclones (despite their being evaluated against different ground truths), we first associate the ERA5 and HRES-fc0 cyclone trajectories with named cyclones from the International Best Track Archive for Climate Stewardship (IBTrACS)63,64. Ground truth cyclones that are within 200 km (in geodesic distance) of an IBTrACS cyclone at lead time zero are retained and any others are removed (TempestExtremes and IBTrACS have different definitions of a cyclone, meaning that they do not necessarily identify exactly the same set of cyclones).
Next, for both models, each ensemble member cyclone trajectory is paired to a TempestExtremes named ground truth cyclone if it is within 100 km of that cyclone at lead time zero (otherwise it is removed). We then compute the ensemble mean cyclone location for each named cyclone as the cyclone progresses, until fewer than 50% of the ensemble member cyclones remain active, and compute the position error between each ensemble mean cyclone centre and its corresponding ground truth cyclone centre. To account for the 6-h offsets between GenCast and ENS initializations, we estimate the position error of ENS at the same 06/18 UTC initializations as GenCast by averaging the two position errors on either side of that initialization with the same lead time2. For a fair comparison, we evaluate GenCast and ENS against exactly the same cyclones and lead times by computing the average position error over the intersection of named cyclone and lead time pairs for which both a GenCast and an ENS ensemble mean track position error exists (Fig. 4b).
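The geodesic pairing criterion can be sketched with a haversine distance (a standard great-circle formula; the exact geodesic computation used in the evaluation may differ):

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def geodesic_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in km between two points given
    in degrees."""
    p1, p2 = np.deg2rad(lat1), np.deg2rad(lat2)
    dphi = p2 - p1
    dlmb = np.deg2rad(lon2 - lon1)
    h = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))

# A member track is paired to a ground truth cyclone only if it is
# within 100 km of it at lead time zero (illustrative coordinates).
paired = geodesic_km(25.0, -80.0, 25.5, -80.3) <= 100.0
```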
Cyclone track probability evaluation
To evaluate the probabilistic skill of ensemble cyclone tracks, we compute 1° resolution track probability heatmaps for each time step, in which the predicted probability in each 1° cell is the fraction of ensemble members predicting a cyclone centre within that cell. We choose 1° as it corresponds to 111 km at the equator, which is close to 120 km, a common radius used for defining cyclone track probability48. We convert the ground truth cyclone tracks from ERA5 and HRES-fc0 to binary ground truth maps for each initialization time and lead time. Finally, we follow ref. 34 in computing the REV of the track probability forecast of each model against their respective binary ground truth heatmaps.
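A minimal sketch of how such a heatmap can be computed from ensemble member cyclone-centre positions at one lead time; the exact binning convention is an assumption:

```python
import numpy as np

def track_probability(member_lats, member_lons, res_deg=1.0):
    """member_lats/lons: per-ensemble-member cyclone-centre positions at
    one lead time (NaN where a member has no active cyclone). Returns a
    lat-lon map of the fraction of members per cell."""
    n_members = len(member_lats)
    n_lat, n_lon = int(180 / res_deg), int(360 / res_deg)
    heat = np.zeros((n_lat, n_lon))
    for lat, lon in zip(member_lats, member_lons):
        if np.isnan(lat):
            continue
        i = min(int((lat + 90) / res_deg), n_lat - 1)
        j = min(int((lon % 360) / res_deg), n_lon - 1)
        heat[i, j] += 1
    return heat / n_members

# Example: 3 of 4 members place the centre in the same cell near 10N, 50E.
heat = track_probability([10.2, 10.3, 10.4, float("nan")],
                         [50.1, 50.2, 50.3, float("nan")])
```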
Unlike the paired position error analysis above, this track probability analysis does not restrict the ground truth TempestExtremes tracks to IBTrACS-named cyclones, nor does it evaluate GenCast and ENS against exactly the same cyclones. Owing to differences between HRES-fc0 and ERA5, the TempestExtremes cyclone tracker identifies 23% more cyclones in HRES-fc0 than in ERA5. However, REV accounts for this difference in base rates by virtue of its normalizations with respect to climatology and the perfect forecast (Supplementary Information section A.5.6), and is thus a fair metric to use when comparing methods evaluated against different ground truths. Furthermore, even when using HRES-fc0 as the ground truth of GenCast, which puts GenCast at a disadvantage, GenCast outperforms ENS beyond one-day lead times (Supplementary Information Fig. B12).
Cyclone tracker
To extract cyclone trajectories from gridded forecasts and analysis datasets, we use the TempestExtremes v2.1 cyclone tracking algorithm47. TempestExtremes is open-source on GitHub (https://github.com/ClimateGlobalChange/tempestextremes) and has been used in a wide range of cyclone studies47. The algorithm has two stages. The first stage, DetectNodes, finds candidate tropical cyclones where minima in mean sea level pressure are co-located with upper-level warm cores. The second stage, StitchNodes, stitches these locations together to form trajectories. Further details of how the tracker identifies cyclones and what is involved in each tracker stage are given in Supplementary Information section A.7.3, and readers are referred to refs.â47,65 for full details.
In their 2017 work, the authors of ref. 66 optimized the hyperparameters of TempestExtremes so that, when applied to 6-hourly reanalysis datasets, the resulting tracks closely match the observed tracks from the IBTrACS dataset63,64. We made two changes to the StitchNodes hyperparameters of the tracker (Supplementary Information section A.7.3) to account for the 12-hourly (instead of 6-hourly) temporal resolution of our evaluation, but otherwise left all tracker hyperparameters at their default values. We then used the same set of tracker hyperparameters for each model and each analysis dataset.
As TempestExtremes performs a global optimization when stitching nodes, the track results at a particular lead time depend on raw predictions at nearby lead times. We prepend 10 days of each model's respective ground truth (ERA5 or HRES-fc0) to each forecast before running the cyclone tracker. This avoids cyclones being dropped when a forecast is initialized close to the end of a cyclone's lifetime, in which case the cyclone's short duration within the forecast period would otherwise fail the tracker's criteria. Similarly, we report results only up to lead times of 9 days despite providing 15 days of predictions to the tracker, because the tracker may drop cyclones that begin close to the end of the forecast period.
Spatially pooled CRPS evaluation
To evaluate skill at forecasting spatial structure, we compute spatially pooled versions of CRPS. Our approach is an instance of neighbourhood verification39, adapted to the surface of a sphere. We define pool centres as the nodes of a k-times refined icosahedral mesh. Pooling regions are defined within a fixed geodesic distance of each pool centre, with radii set to the mean distance between mesh nodes. To capture performance at different spatial scales, we do this separately for six mesh refinement levels (k = 7, 6, …, 2), resulting in a wide range of pool sizes: 120 km, 241 km, 481 km, 962 km, 1,922 km and 3,828 km. We evaluate performance on two types of pooling aggregation: average pooling and max pooling. Forecasts and targets are first aggregated over pooling regions and then standard skill scores are computed on these pooled counterparts. For average pooling, the grid cells are weighted by their area. Finally, to account for slight non-uniformities in the distribution of pooling centres when computing the global average-pooled CRPS, we weight each pooling region by the area of the Voronoi cell of its pooling centre.
These metrics are computed for 2 m temperature, 10 m wind speed, 12-h accumulated precipitation and mean sea level pressure at 0.25° (Supplementary Figs. B13 and B14).
We also compute pooled CRPS scorecards for wind speed, geopotential, temperature and specific humidity at all pressure levels (Extended Data Figs. 6 and 7 and Supplementary Figs. B15 and B16). To reduce the computational cost of these pooled scorecard evaluations, which include all pressure levels, forecasts and targets were subsampled to 1° before pooling. In this case, we skipped the smallest pool size because 120 km corresponds to approximately 1° at the equator, making it similar to a univariate evaluation of the subsampled forecasts.
Supplementary Information section A.8 provides further motivation and details on the pooled metrics evaluation.
Regional wind power evaluation
For the regional wind power forecasting experiment, we use all 5,344 wind farm locations and their nominal capacities from the Global Power Plant Database (GPPD)44, which captures about 40% of all global wind farm capacity as of 2020 (ref. 44). We first bilinearly interpolate 10 m wind speed forecasts and analysis states at each wind farm location. We then map 10 m wind speed to load factor (the ratio between the actual wind turbine power output and the maximum power output) using an idealized International Electrotechnical Commission Class II 2 MW turbine power curve from the WIND Toolkit67. This power curve has a cut-in speed of 3 m s−1, maximum output at 14 m s−1 and curtailment at 25 m s−1 (Supplementary Fig. A1). The load factor is then multiplied by the nominal capacity to obtain idealized power generation in megawatts at each wind farm.
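A sketch of the wind-speed-to-load-factor mapping. The cubic ramp between cut-in and rated speed is an illustrative assumption; the actual evaluation uses the tabulated WIND Toolkit power curve.

```python
import numpy as np

def load_factor(wind_speed):
    """Idealized turbine power curve: zero output below the 3 m/s cut-in
    speed, rated (maximum) output from 14 m/s and zero output above the
    25 m/s curtailment speed. The cubic ramp in between is an
    illustrative assumption, not the tabulated IEC Class II curve."""
    v = np.asarray(wind_speed, dtype=float)
    ramp = ((v ** 3 - 3.0 ** 3) / (14.0 ** 3 - 3.0 ** 3)).clip(0.0, 1.0)
    return np.where((v < 3.0) | (v > 25.0), 0.0, ramp)

lf = load_factor([2.0, 10.0, 14.0, 20.0, 26.0])
power_mw = lf * 2.0  # idealized output of a nominal 2 MW turbine
```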
To generate arbitrary groupings of wind farms across the globe at a range of spatial scales, we use a similar procedure to the pooled evaluation. Pooling centres are defined on a 7-times refined icosahedral mesh and separate evaluations are performed using pool sizes of 120 km, 240 km and 480 km. The 120 km scale contains 3,648 groups with a mean capacity of 272 MW, the 240 km scale contains 7,759 groups with a mean capacity of 513 MW and the 480 km scale contains 15,913 groups with a mean capacity of 996 MW. The power output is summed over the wind farm sites in each group and CRPS is computed for this derived quantity. We then compute the average CRPS across all wind farm groups. By using power as the target variable, more weight is given in the global average CRPS to pools containing more wind farm capacity.
Accounting for assimilation windows
During our 2019 test period, ENS was initialized with analyses whose assimilation window had between 3 h and 5 h of look-ahead beyond the stated initialization time68. The 06/18 UTC ERA5 initializations of the ML models afford them only 3 h of look-ahead. The 00/12 UTC states of ERA5 have 9 h of look-ahead, which we show in Supplementary Fig. B20 translates into improved metrics on 00/12 UTC initializations over 06/18 UTC initializations. Overall, the difference in assimilation windows used in our evaluation leaves ENS with a small advantage of up to 2 h of additional look-ahead over the ML models, for all variables except sea surface temperature.
ENS initialization and evaluation times
As discussed above, we evaluate GenCast only on forecasts initialized at 06/18 UTC, as using 00/12 UTC-initialized forecasts would give GenCast an additional advantage because of the longer data-assimilation look-ahead. Ideally, we would compare all models at the same 06/18 UTC initialization times. However, ENS forecasts from 06/18 UTC are archived only up to 6-day lead times and are not free for public download. Hence, we evaluate ENS on forecasts initialized at 00/12 UTC. For globally averaged metrics, this should not matter; in fact, ref. 2 found that 00/12 UTC initialization tends to give a small RMSE advantage to the deterministic HRES forecast over 06/18 UTC initialization, and we expect a similar minor advantage to apply to ENS. However, the regional wind power evaluation is sensitive to the diurnal cycle because wind power capacity is sparsely and non-uniformly distributed around the world. In this case, it is therefore important to compare forecasts by ENS and GenCast at the same set of validity times. Accordingly, we evaluate ENS (initialized at 00/12 UTC) at the same 06/18 UTC targets as GenCast. However, GenCast produces 06/18 UTC forecasts at lead times of 12 h, 24 h, 36 h and so on, whereas for ENS we obtain 06/18 UTC forecasts only at lead times of 6 h, 18 h, 30 h and so on. To estimate the 06/18 UTC regional wind power CRPS of ENS at the same lead times as GenCast, we linearly interpolate the CRPS curve of ENS. In Supplementary Information section B.8.1, we validate this approach on 2018 data, for which we did have access to ENS 06/18 UTC initializations, showing that this lead-time interpolation overestimates the performance of ENS, particularly at short lead times.
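The lead-time interpolation amounts to a simple linear interpolation of the CRPS curve; the CRPS values below are illustrative stand-ins, not results from the paper.

```python
import numpy as np

# ENS provides 06/18 UTC forecasts at lead times 6 h, 18 h, 30 h, ...;
# GenCast provides them at 12 h, 24 h, 36 h, .... Linearly interpolate
# the ENS CRPS curve to GenCast's lead times.
ens_leads = np.arange(6, 96, 12)          # 6, 18, ..., 90 (hours)
ens_crps = 0.1 * np.sqrt(ens_leads)       # stand-in CRPS-vs-lead-time curve
gencast_leads = np.arange(12, 90, 12)     # 12, 24, ..., 84 (hours)
ens_crps_interp = np.interp(gencast_leads, ens_leads, ens_crps)
```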