Methods and scores used for verifying ensemble forecasts

Short cuts to:
Bias score
Equitable threat score
Root mean square error
Reliability diagram
Brier skill score
Relative operating characteristic (ROC)
Ranked probability skill score
Rank histogram
Relative value

Only the methods and scores explicitly used in the BoM EPS precipitation verification are included. These descriptions are taken from the WWRP-WGNE Joint Verification Working Group's verification web site and expanded upon in some cases using extracts lifted unashamedly from Stanski et al. (1989).

Verification of non-probabilistic forecasts

(a) Dichotomous forecasts

A dichotomous forecast says, "yes, an event will happen", or "no, the event will not happen". In the case of ensemble verification we usually verify the forecast occurrence of an event greater than a certain threshold (for example, daily rainfall of at least 1 mm/day). Because these apply to deterministic (non-probabilistic) forecasts, we use them to verify the individual ensemble members as well as the ensemble mean.

To verify this type of forecast we start with a contingency table that shows the frequency of "yes" and "no" forecasts and occurrences. The four combinations of forecasts (yes or no) and observations (yes or no), called the joint distribution, are:

hit - event forecast to occur, and did occur
miss - event forecast not to occur, but did occur
false alarm - event forecast to occur, but did not occur
correct negative - event forecast not to occur, and did not occur

The total numbers of forecast and observed occurrences and non-occurrences are given on the lower and right sides of the contingency table, and are called the marginal distribution.

                              Forecast
                     yes               no                  Total
 Observed   yes      hits              misses              observed yes
            no       false alarms      correct negatives   observed no
            Total    forecast yes      forecast no         total

The contingency table is a useful way to see what types of errors are being made. A perfect forecast system would produce only hits and correct negatives, and no misses or false alarms.

There are several categorical statistics that can be computed from the yes/no contingency table. The ones used in the EPS verification are:

Bias score

Answers the question: How does the forecast frequency of events compare to the actual (observed) frequency of events?

Range: 0 to infinity.  Perfect score: 1.

Characteristics: Indicates whether the forecast system has a tendency to underforecast (BIAS<1) or overforecast (BIAS>1) events. Does not measure how well the forecast corresponds to the observations (i.e., says nothing about accuracy), only measures relative frequencies.
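The bias (frequency bias) score is BIAS = (hits + false alarms) / (hits + misses), i.e., the number of "yes" forecasts divided by the number of "yes" observations. A minimal Python sketch of the contingency-table tally and the score (function names are illustrative, not from any particular library):

```python
def contingency_counts(forecasts, observations):
    """Tally hits, misses, false alarms and correct negatives from
    paired yes/no (True/False) forecasts and observations."""
    hits = misses = false_alarms = correct_negatives = 0
    for f, o in zip(forecasts, observations):
        if f and o:
            hits += 1
        elif not f and o:
            misses += 1
        elif f and not o:
            false_alarms += 1
        else:
            correct_negatives += 1
    return hits, misses, false_alarms, correct_negatives

def bias_score(hits, misses, false_alarms):
    """BIAS = (hits + false alarms) / (hits + misses)."""
    return (hits + false_alarms) / (hits + misses)

# Example: rain >= 1 mm/day forecast (True) vs. observed
fcst = [True, True, False, True, False, False, True, False]
obs  = [True, False, False, True, True, False, True, False]
h, m, fa, cn = contingency_counts(fcst, obs)
print(bias_score(h, m, fa))  # 4 "yes" forecasts, 4 "yes" observed -> 1.0
```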

Equitable threat score

ETS = (hits - hits_random) / (hits + misses + false alarms - hits_random), where hits_random = (hits + misses)(hits + false alarms) / total is the number of hits expected by chance.

Answers the question: How well did the forecast occurrence of events correspond to the actual (observed) occurrence of events?

Range: -1/3 to 1, 0 indicates no skill.   Perfect score: 1.

Characteristics: Measures the fraction of observed and/or forecast events that were correctly predicted, adjusted for hits associated with random chance. For example, it is easier to correctly forecast rain occurrence in a wet climate than in a dry climate. The ETS is often used in the verification of rainfall in NWP models because its "equitability" allows scores to be compared more fairly across different regimes. Because it penalises both misses and false alarms in the same way, it does not distinguish the source of forecast error.
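The chance-adjusted hit count and the score itself can be sketched as follows (an illustrative helper, taking the four contingency-table counts as input):

```python
def equitable_threat_score(hits, misses, false_alarms, correct_negatives):
    """ETS = (hits - hits_random) / (hits + misses + false_alarms - hits_random),
    where hits_random is the number of hits expected by chance."""
    total = hits + misses + false_alarms + correct_negatives
    hits_random = (hits + misses) * (hits + false_alarms) / total
    return (hits - hits_random) / (hits + misses + false_alarms - hits_random)

# Example contingency-table counts
print(equitable_threat_score(30, 10, 20, 40))  # hits_random = 20 -> 0.25
```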

(b) Continuous forecasts

Verification of forecasts of continuous variables measures how the values of the forecasts differ from the values of the observations.

Root mean square error

Answers the question: What is the magnitude of the forecast errors?

Range: 0 to infinity.  Perfect score: 0.

Characteristics: This simple and familiar score measures "average" error, weighted according to the square of the error.  Does not indicate the direction of the deviations. The root mean square error puts greater influence on large errors than smaller errors, which may be a good thing if large errors are especially undesirable. However, the emphasis on large errors may encourage conservative forecasting.
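The score is the familiar RMSE = sqrt((1/N) * sum of (forecast - observation)^2), sketched here in Python (function name is illustrative):

```python
import math

def rmse(forecasts, observations):
    """Root mean square error: square root of the mean squared error."""
    n = len(forecasts)
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / n)

# Example: errors of 1, 2 and 0 mm
print(rmse([2.0, 5.0, 1.0], [1.0, 3.0, 1.0]))
```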

Verification of probabilistic forecasts

A probabilistic forecast gives a probability of an event occurring, with a value between 0 and 1 (or 0 and 100%). It is impossible to verify a single probabilistic forecast using a single observation. Instead one must verify a set of probabilistic forecasts, p_i, using observations that those events either occurred (o_i = 1) or did not occur (o_i = 0).

A good probability forecast system has several attributes:

reliability - agreement between forecast probability and mean observed frequency; like a categorical bias
sharpness - tendency to forecast extreme values -- "climatology" is not sharp
resolution - ability of forecast to resolve the set of sample events into subsets with characteristically different frequencies

Reliability diagram (also called "attributes diagram")

Answers the question: How well do the predicted probabilities of an event correspond to their observed frequencies?

Perfect: Curve lines up on the diagonal.

The reliability diagram plots the observed frequency against the forecast probability, where the range of forecast probabilities is divided into K bins (for example, 0-5%, 5-15%, 15-25%, etc.). The diagonal line indicates perfect reliability (average observed frequency equal to predicted probability for each category), and the horizontal line represents the climatological frequency. Sometimes sample sizes are plotted either as a histogram, or as numbers next to the data points.
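The binning described above can be sketched as follows (a hypothetical helper using half-open bins; the last edge is set slightly above 1 so that probability 1.0 is included):

```python
def reliability_points(probs, outcomes, bin_edges):
    """For each probability bin, return the mean forecast probability,
    the observed relative frequency, and the sample size."""
    points = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        pairs = [(p, o) for p, o in zip(probs, outcomes) if lo <= p < hi]
        if pairs:  # skip empty bins rather than divide by zero
            mean_p = sum(p for p, _ in pairs) / len(pairs)
            obs_freq = sum(o for _, o in pairs) / len(pairs)
            points.append((mean_p, obs_freq, len(pairs)))
    return points

# Two bins: 0-0.5 and 0.5-1.0
print(reliability_points([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1], [0.0, 0.5, 1.01]))
```

Plotting observed frequency against mean forecast probability for each returned point, together with the diagonal, gives the reliability diagram; the sample sizes supply the histogram of forecast usage.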

Characteristics: The reliability diagram is in some ways analogous to a scatter plot, where the data are stratified by the forecast into K categories (K points). The reliability diagram thus represents stratification conditioned on the forecast and can be expected to give information on the real meaning of the forecast. Reliability is indicated by the proximity of the plotted curve to the diagonal. The deviation from the diagonal gives the conditional bias. If the curve lies below the line, this indicates overforecasting (probabilities too high); points above the line indicate underforecasting (probabilities too low). The flatter the curve in the reliability diagram, the less resolution it has. A forecast of climatology does not discriminate at all between events and non-events, and thus has no resolution.

The reliability diagram is conditioned on the forecasts (i.e., given that X was predicted, what was the outcome?). It is a good partner to the ROC, which is conditioned on the observations.

Brier score

Answers the question: What is the magnitude of the probability forecast errors?

Range: 0 to 1.  Perfect score: 0.

Characteristics: The Brier score is the mean squared error in probability space. It is sensitive to climatological frequency of the event. In the absence of any forecasting skill, the best strategy to optimise the Brier score is to forecast the climatological frequency. The more rare an event, the easier it is to get a good BS without having any real skill. For this reason, the Brier skill score (see below) is preferred because it references the score to climatology (sample or long-term).
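The score is BS = (1/N) * sum of (p_i - o_i)^2, using the p_i and o_i notation introduced above. A minimal sketch:

```python
def brier_score(probs, outcomes):
    """BS = (1/N) * sum (p_i - o_i)^2, with o_i = 1 if the event
    occurred and 0 otherwise."""
    n = len(probs)
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / n

# Two perfect forecasts and one forecast off by 0.3
print(brier_score([1.0, 0.0, 0.7], [1, 0, 1]))
```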

Murphy (1973) showed that the Brier score could be partitioned into three terms: (1) reliability, (2) resolution, and (3) uncertainty. These terms are sometimes shown separately to attribute sources of error.

Brier skill score

Answers the question: What is the relative skill of the probabilistic forecast over that of climatology, in terms of predicting whether or not an event occurred?

Range: minus infinity to 1, 0 indicates no skill when compared to the reference forecast. Perfect score: 1.

Characteristics: The Brier skill score measures the improvement of the probabilistic forecast relative to a reference forecast (usually the long-term or sample climatology), therefore taking climatological frequency into account. Because the denominator (the Brier score of the reference forecast) approaches 0 for rare events, this score can be unstable when applied to small data sets. This score should always be applied to a sufficiently large sample, one for which the sample climatology of the event is representative of the long-term climatology. The rarer the event, the larger the number of samples needed to stabilise the score. For best results the Brier skill score should be computed on the whole sample, i.e., the skill should be computed for an aggregated sample, not averaged over several samples.
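The skill score is BSS = 1 - BS / BS_ref. A sketch with the reference forecast taken as a constant climatological probability (a common but not universal choice):

```python
def brier_skill_score(probs, outcomes, p_clim):
    """BSS = 1 - BS / BS_ref, with the reference forecast a constant
    climatological probability p_clim."""
    def bs(ps):
        return sum((p - o) ** 2 for p, o in zip(ps, outcomes)) / len(outcomes)
    return 1.0 - bs(probs) / bs([p_clim] * len(outcomes))

# Sharp, reliable forecasts of an event with climatological frequency 0.5
print(brier_skill_score([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0], 0.5))
```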

Relative operating characteristic

Answers the question: What is the ability of the forecast to discriminate between events and non-events?

ROC: Perfect: Curve travels from bottom left to top left of diagram, then across to top right of diagram. Diagonal line indicates no skill.
ROC area:  Range: 0 to 1, 0.5 indicates no skill. Perfect score: 1

The ROC is created by plotting the probability of detection versus the false alarm rate (false alarms / observed no, also known as probability of false detection), using a set of increasing probability thresholds (for example, 0.05, 0.15, 0.25, etc.) to make the yes/no decision. The area under the ROC curve is frequently used as a score.
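The construction above can be sketched as follows (illustrative helpers; the area is integrated with the trapezoidal rule after appending the (0,0) and (1,1) endpoints):

```python
def roc_points(probs, outcomes, thresholds):
    """For each probability threshold, convert the probabilistic forecast
    to yes/no and return (false alarm rate, probability of detection)."""
    n_yes = sum(outcomes)
    n_no = len(outcomes) - n_yes
    points = []
    for t in thresholds:
        hits = sum(1 for p, o in zip(probs, outcomes) if p >= t and o)
        fas = sum(1 for p, o in zip(probs, outcomes) if p >= t and not o)
        points.append((fas / n_no, hits / n_yes))
    return points

def roc_area(points):
    """Trapezoidal area under the ROC curve."""
    pts = sorted(points + [(0.0, 0.0), (1.0, 1.0)])
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(pts[:-1], pts[1:]))

pts = roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0], [0.25, 0.5, 0.75])
print(pts, roc_area(pts))
```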

Characteristics: ROC measures the ability of the forecast to discriminate between two alternative outcomes, thus measuring resolution. A good ROC is indicated by a curve that goes close to the upper left corner (low false alarm rate, high probability of detection). It is not sensitive to bias in the forecast, so says nothing about reliability. A biased forecast may still have good resolution and produce a good ROC curve, which means that it may be possible to improve the forecast through calibration. The ROC can thus be considered as a measure of potential usefulness.

The ROC is conditioned on the observations (i.e., given that Y occurred, what was the corresponding forecast?). It is therefore a good companion to the reliability diagram, which is conditioned on the forecasts.

Ranked probability score

Answers the question: How well did the probability forecast predict the category that the observations fell into?

Range: 0 to 1.  Perfect score: 0.

This score is used to assess multi-category forecasts. It is defined as

RPS = 1/(M-1) * sum_{m=1}^{M} [ (sum_{k=1}^{m} p_k) - (sum_{k=1}^{m} o_k) ]^2

where M is the number of forecast categories (for example, rainfall bins: 0-1 mm, 1-5 mm, 5-10 mm, etc.), p_k is the predicted probability in forecast category k, and o_k is an indicator (0=no, 1=yes) for the observation in category k.

Characteristics: The ranked probability score measures the sum of squared differences in cumulative probability space for a multi-category probabilistic forecast. The RPS penalizes forecasts less severely when their probabilities are close to the true outcome, and more severely when their probabilities are further from the actual outcome. For two forecast categories the RPS is the same as the Brier Score.
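The sum of squared differences in cumulative probability space can be sketched as follows (an illustrative helper for a single forecast; outcome is the 0-based index of the observed category):

```python
def ranked_probability_score(probs, outcome):
    """RPS = 1/(M-1) * sum over categories of the squared difference
    between the forecast and observed cumulative distributions."""
    cum_p = cum_o = total = 0.0
    for k, p in enumerate(probs):
        cum_p += p                              # forecast CDF
        cum_o += 1.0 if k == outcome else 0.0   # observed (step) CDF
        total += (cum_p - cum_o) ** 2
    return total / (len(probs) - 1)

# Three rainfall categories; the middle one was observed
print(ranked_probability_score([0.2, 0.5, 0.3], 1))
```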

Ranked probability skill score

Answers the question: What is the relative skill of the probabilistic forecast over that of climatology, in terms of getting close to the actual outcome?

Range: minus infinity to 1, 0 indicates no skill when compared to the reference forecast. Perfect score: 1.

Characteristics: The RPSS measures the improvement of the multi-category probabilistic forecast relative to a reference forecast (usually the long-term or sample climatology). It is similar to the 2-category Brier skill score, in that it takes climatological frequency into account. Because the denominator (the ranked probability score of the reference forecast) approaches 0 for rare events, this score can be unstable when applied to small data sets. This score should always be applied to a sufficiently large sample, one for which the sample climatology of the event is representative of the long-term climatology. The rarer the event, the larger the number of samples needed to stabilise the score. For best results the ranked probability skill score should be computed on the whole sample, i.e., the skill should be computed for an aggregated sample, not averaged over several samples.
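As with the Brier skill score, RPSS = 1 - RPS / RPS_ref. A sketch where the reference forecast is a fixed climatological category distribution (an assumed, though common, choice):

```python
def rps(probs, outcome):
    """Ranked probability score of one multi-category forecast."""
    cum_p = cum_o = total = 0.0
    for k, p in enumerate(probs):
        cum_p += p
        cum_o += 1.0 if k == outcome else 0.0
        total += (cum_p - cum_o) ** 2
    return total / (len(probs) - 1)

def rpss(forecasts, outcomes, clim_probs):
    """RPSS = 1 - mean(RPS_forecast) / mean(RPS_climatology)."""
    rps_f = sum(rps(p, o) for p, o in zip(forecasts, outcomes)) / len(outcomes)
    rps_c = sum(rps(clim_probs, o) for o in outcomes) / len(outcomes)
    return 1.0 - rps_f / rps_c

# Two perfect 2-category forecasts, climatology 50/50
print(rpss([[1.0, 0.0], [0.0, 1.0]], [0, 1], [0.5, 0.5]))
```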

Rank histogram (Hamill, 2001)

Answers the question: How well does the ensemble spread of the forecast represent the true variability (uncertainty) of the observations?

Perfect: All bars the same height, i.e., flat histogram.

To construct a rank histogram, do the following:
1. At every observation (or analysis) point rank the N ensemble members from lowest to highest. This gives N+1 possible bins that the observation could fit into, including the two extremes.
2. Identify which bin the observation falls into at each point
3. Tally over many observations to create a histogram of rank.
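The three steps above can be sketched as follows (an illustrative helper; ties between the observation and an ensemble member are not handled specially here, though in practice tied ranks may be assigned randomly):

```python
def rank_histogram(ensembles, observations):
    """Tally which of the N+1 bins each observation falls into, where
    the bins are bounded by the N sorted ensemble members."""
    n_members = len(ensembles[0])
    counts = [0] * (n_members + 1)
    for members, obs in zip(ensembles, observations):
        # The observation's rank is its position among the sorted members
        rank = sum(1 for m in members if m < obs)
        counts[rank] += 1
    return counts

# Two points, 3-member ensembles; both observations fall outside the spread
print(rank_histogram([[1.0, 3.0, 5.0], [2.0, 4.0, 6.0]], [0.5, 7.0]))
```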

Characteristics: Also known as a "Talagrand diagram", this method checks where the verifying observation usually falls with respect to the ensemble forecast data, which is arranged in increasing order at each grid point. In an ensemble with perfect spread, each member represents an equally likely scenario, so the observation is equally likely to fall between any two members. Note that a flat rank histogram does not necessarily indicate a good forecast; it only measures whether the observed probability distribution is well represented by the ensemble.

Interpretation:
U-shaped - ensemble spread too small, many observations falling outside the extremes of the ensemble
Dome-shaped - ensemble spread too large, most observations falling near the center of the ensemble
Asymmetric - ensemble contains bias

Relative value (value score) (Wilks, 2001)

Answers the question: For a cost/loss ratio C/L for taking action based on a forecast, what is the relative improvement in economic value between climatological and perfect information?

Range: minus infinity to 1.  Perfect score: 1.

In computing the value score, the hits, misses, and false alarms are determined using the climatological probability Pclim as the yes/no decision threshold.

Characteristics: The relative value is a skill score of expected expense, with climatology as the reference forecast. Because the cost/loss ratio is different for different users of forecasts, the value is generally plotted as a function of C/L. Like ROC, it gives information that can be used in decision making, but unlike ROC, it is sensitive to bias in the forecast.
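Under the standard cost-loss framework (as in Wilks, 2001), V = (E_climate - E_forecast) / (E_climate - E_perfect), where E denotes the expected expense of acting on each information source. A sketch with expenses expressed per unit loss L (the function and its argument convention are illustrative):

```python
def relative_value(h, m, fa, cn, cost_loss_ratio):
    """V = (E_clim - E_forecast) / (E_clim - E_perfect), where h, m, fa, cn
    are the contingency-table entries as relative frequencies (sum to 1)
    and expenses are expressed per unit loss L, with a = C/L."""
    a = cost_loss_ratio
    s = h + m                      # climatological frequency of the event
    e_clim = min(a, s)             # better of: always protect / never protect
    e_perfect = s * a              # protect only when the event occurs
    e_forecast = (h + fa) * a + m  # protect on "yes" forecasts, lose on misses
    return (e_clim - e_forecast) / (e_clim - e_perfect)

# One user with C/L = 0.3; plotting V against a range of C/L values
# gives the usual relative value curve
print(relative_value(0.3, 0.1, 0.2, 0.4, 0.3))
```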

References

Hamill, T.M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550-560.
Murphy, A.H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595-600.
Stanski, H.R., L.J. Wilson, and W.R. Burrows, 1989: Survey of common verification methods in meteorology. World Weather Watch Tech. Rept. No. 8, WMO/TD No. 358, WMO, Geneva, 114 pp.
Wilks, D.S., 2001: A skill score based on economic value for probability forecasts. Meteorol. Appl., 8, 209-219.

Beth Ebert, BMRC Weather Forecasting Group, June 2003