Only the methods and scores explicitly used in
the BoM EPS precipitation verification are included. These descriptions
are taken from the WWRP-WGNE Joint Verification Working Group's verification
web site and expanded upon in some cases using extracts lifted unashamedly
from Stanski et al. (1989).
To verify this type of forecast we start with a contingency table that shows the frequency of "yes" and "no" forecasts and occurrences. The four combinations of forecasts (yes or no) and observations (yes or no), called the joint distribution, are:
hit - event forecast to occur, and did occur
miss - event forecast not to occur, but did occur
false alarm - event forecast to occur, but did not occur
correct negative - event forecast not to occur, and did not occur
The total numbers of forecast and observed occurrences and non-occurrences
are given on the lower and right sides of the contingency table, and are
called the marginal distribution.
                          Forecast
                   yes             no                   Total
Observed   yes     hits            misses               observed yes
           no      false alarms    correct negatives    observed no
Total              forecast yes    forecast no          total
The contingency table is a useful way to see what types of errors are being made. A perfect forecast system would produce only hits and correct negatives, and no misses or false alarms.
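As an illustration, the four cells can be tallied directly from paired yes/no forecasts and observations. The sketch below is hypothetical (the function name and interface are not from the source):

```python
def contingency_table(forecasts, observations):
    """Tally hits, misses, false alarms and correct negatives
    from paired boolean forecasts and observations."""
    hits = misses = false_alarms = correct_negatives = 0
    for f, o in zip(forecasts, observations):
        if f and o:
            hits += 1            # forecast yes, observed yes
        elif o:
            misses += 1          # forecast no, observed yes
        elif f:
            false_alarms += 1    # forecast yes, observed no
        else:
            correct_negatives += 1
    return hits, misses, false_alarms, correct_negatives
```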
There are several categorical statistics that can be computed from the
yes/no contingency table. The ones used in the EPS verification are:
Bias score (frequency bias) -
Answers the question: How does the forecast frequency of events compare to the actual (observed) frequency of events?
Range: 0 to infinity. Perfect score: 1.
Characteristics: Indicates whether the forecast system has a
tendency to underforecast (BIAS<1) or overforecast (BIAS>1)
events. Does not measure how well the forecast corresponds to the observations
(i.e., says nothing about accuracy), only measures relative frequencies.
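Using the contingency-table counts, the frequency bias is (hits + false alarms) / (hits + misses). A minimal sketch, with an illustrative function name:

```python
def frequency_bias(hits, misses, false_alarms):
    # BIAS = (hits + false alarms) / (hits + misses)
    # > 1 indicates overforecasting, < 1 underforecasting
    return (hits + false_alarms) / (hits + misses)
```

Note that a forecast that says "yes" twice as often as the event occurs scores BIAS = 2 regardless of whether those yeses were placed correctly.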
Equitable threat score (Gilbert skill score) -
Answers the question: How well did the forecast occurrence of events correspond to the actual (observed) occurrence of events?
Range: -1/3 to 1, 0 indicates no skill. Perfect score: 1.
Characteristics: Measures the fraction of observed and/or forecast
events that were correctly predicted, adjusted for hits associated with
random chance. For example, it is easier to correctly forecast rain occurrence
in a wet climate than in a dry climate. The ETS is often used in
the verification of rainfall in NWP models because its "equitability" allows
scores to be compared more fairly across different regimes. Because it
penalises both misses and false alarms in the same way, it does not distinguish
the source of forecast error.
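The standard ETS formula subtracts the hits expected from a random forecast with the same marginal totals, hits_random = (hits + misses)(hits + false alarms) / total. A sketch under that definition:

```python
def equitable_threat_score(hits, misses, false_alarms, correct_negatives):
    total = hits + misses + false_alarms + correct_negatives
    # hits expected from a random forecast with the same marginal frequencies
    hits_random = (hits + misses) * (hits + false_alarms) / total
    return (hits - hits_random) / (hits + misses + false_alarms - hits_random)
```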
Verification of forecasts of continuous variables measures how the values of the forecasts differ from the values of the observations.
Root mean square error -
Answers the question: What is the magnitude of the forecast errors?
Range: 0 to infinity. Perfect score: 0.
Characteristics: This simple and familiar score measures "average"
error, weighted according to the square of the error. Does not indicate
the direction of the deviations. The root mean square error puts greater
influence on large errors than smaller errors, which may be a good thing
if large errors are especially undesirable. However, the emphasis on large
errors may encourage conservative forecasting.
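For reference, the RMSE is the square root of the mean squared forecast error; a minimal sketch:

```python
import math

def rmse(forecasts, observations):
    # Root mean square error: sqrt of the mean of the squared errors
    n = len(forecasts)
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / n)
```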
A good probability forecast system has several attributes:
reliability - agreement between forecast probability and mean observed frequency; like a categorical bias
sharpness - tendency to forecast extreme values -- "climatology" is not sharp
resolution - ability of the forecast to resolve the set of sample events into subsets with characteristically different frequencies
Reliability diagram - (also called "attributes diagram").
Answers the question: How well do the predicted probabilities of an event correspond to their observed frequencies?
Perfect: Curve lines up on the diagonal.
The reliability diagram plots the observed frequency against the forecast probability, where the range of forecast probabilities is divided into K bins (for example, 0-5%, 5-15%, 15-25%, etc.). The diagonal line indicates perfect reliability (average observed frequency equal to predicted probability for each category), and the horizontal line represents the climatological frequency. Sometimes sample sizes are plotted either as a histogram, or as numbers next to the data points.
Characteristics: The reliability diagram is in some ways analogous to a scatter plot, where the data are stratified by the forecast into K categories (K points). The reliability diagram thus represents stratification conditioned on the forecast and can be expected to give information on the real meaning of the forecast. Reliability is indicated by the proximity of the plotted curve to the diagonal. The deviation from the diagonal gives the conditional bias. If the curve lies below the line, this indicates overforecasting (probabilities too high); points above the line indicate underforecasting (probabilities too low). The flatter the curve in the reliability diagram, the less resolution it has. A forecast of climatology does not discriminate at all between events and non-events, and thus has no resolution.
The reliability diagram is conditioned on the forecasts (i.e., given
that X was predicted, what was the outcome?). It is a good partner to the
ROC, which is conditioned on the observations.
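A hypothetical sketch of the binning behind a reliability diagram, using equal-width probability bins for simplicity (the description above uses uneven bins such as 0-5% and 5-15%):

```python
def reliability_points(probs, outcomes, n_bins=10):
    # Group forecasts into equal-width probability bins and compute, per bin,
    # the mean forecast probability, the observed relative frequency, and
    # the sample size (often shown as a histogram or labels on the diagram).
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        k = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the top bin
        bins[k].append((p, o))
    points = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            obs_freq = sum(o for _, o in b) / len(b)
            points.append((mean_p, obs_freq, len(b)))
    return points
```

Plotting obs_freq against mean_p and comparing with the diagonal gives the reliability curve.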
Brier score -
Answers the question: What is the magnitude of the probability forecast errors?
Range: 0 to 1. Perfect score: 0.
Characteristics: The Brier score is the mean squared error in probability space. It is sensitive to climatological frequency of the event. In the absence of any forecasting skill, the best strategy to optimise the Brier score is to forecast the climatological frequency. The more rare an event, the easier it is to get a good BS without having any real skill. For this reason, the Brier skill score (see below) is preferred because it references the score to climatology (sample or long-term).
Murphy (1973) showed that the Brier score
could be partitioned into three terms: (1) reliability, (2)
resolution,
and (3) uncertainty. These terms are sometimes shown separately
to attribute sources of error.
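The Brier score itself is simple to compute (the Murphy decomposition is omitted in this sketch):

```python
def brier_score(probs, outcomes):
    # Mean squared error in probability space:
    # BS = (1/N) * sum (p_i - o_i)^2, with o_i = 1 if the event occurred, else 0
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)
```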
Brier skill score -
Answers the question: What is the relative skill of the probabilistic forecast over that of climatology, in terms of predicting whether or not an event occurred?
Range: minus infinity to 1, 0 indicates no skill when compared to the reference forecast. Perfect score: 1.
Characteristics: The Brier skill score measures the improvement
of the probabilistic forecast relative to a reference forecast (usually
the long-term or sample climatology), therefore taking climatological frequency
into account. Because the denominator approaches 0 for a perfect forecast,
this score can be unstable when applied to small data sets. This score
should always be applied to a sufficiently large sample, one for which
the sample climatology of the event is representative of the long term
climatology. The rarer the event, the larger the number of samples needed
to stabilise the score. For best results the Brier skill score should be
computed on the whole sample, i.e., the skill should be computed for an
aggregated sample, not averaged for several samples.
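A sketch with the reference Brier score computed from a constant climatological probability, giving BSS = 1 - BS/BS_ref (illustrative function name):

```python
def brier_skill_score(probs, outcomes, p_clim):
    # BSS = 1 - BS / BS_ref, where the reference forecast is the
    # constant climatological probability p_clim
    n = len(probs)
    bs = sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / n
    bs_ref = sum((p_clim - o) ** 2 for o in outcomes) / n
    return 1.0 - bs / bs_ref
```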
Relative operating characteristic -
Answers the question: What is the ability of the forecast to discriminate between events and non-events?
ROC: Perfect: Curve travels from bottom left to top left
of diagram, then across to top right of diagram. Diagonal line indicates
no skill.
ROC area: Range: 0 to 1, 0.5 indicates no skill.
Perfect score: 1
The ROC is created by plotting the probability of detection versus the false alarm rate (false alarms / observed no, also known as probability of false detection), using a set of increasing probability thresholds (for example, 0.05, 0.15, 0.25, etc.) to make the yes/no decision. The area under the ROC curve is frequently used as a score.
Characteristics: ROC measures the ability of the forecast to discriminate between two alternative outcomes, thus measuring resolution. A good ROC is indicated by a curve that goes close to the upper left corner (low false alarm rate, high probability of detection). It is not sensitive to bias in the forecast, so says nothing about reliability. A biased forecast may still have good resolution and produce a good ROC curve, which means that it may be possible to improve the forecast through calibration. The ROC can thus be considered as a measure of potential usefulness.
The ROC is conditioned on the observations (i.e., given that
Y occurred, what was the corresponding forecast?). It is therefore
a good companion to the reliability diagram,
which is conditioned on the forecasts.
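A hypothetical sketch of the ROC construction described above, with the area computed by the trapezoidal rule:

```python
def roc_curve(probs, outcomes, thresholds):
    # For each probability threshold, convert the probabilistic forecast to
    # yes/no and compute POD (hit rate) and POFD (false alarm rate).
    points = []
    for t in thresholds:
        hits = misses = false_alarms = correct_negatives = 0
        for p, o in zip(probs, outcomes):
            if p >= t and o:
                hits += 1
            elif o:
                misses += 1
            elif p >= t:
                false_alarms += 1
            else:
                correct_negatives += 1
        pod = hits / (hits + misses)
        pofd = false_alarms / (false_alarms + correct_negatives)
        points.append((pofd, pod))
    return points

def roc_area(points):
    # Trapezoidal area under the curve; (0,0) and (1,1) anchor the endpoints.
    pts = sorted(points + [(0.0, 0.0), (1.0, 1.0)])
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```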
Ranked probability score -
Answers the question: How well did the probability forecast predict the category that the observations fell into?
Range: 0 to 1. Perfect score: 0.
This score is used to assess multi-category forecasts, where M is the number of forecast categories (for example, rainfall bins: 0-1 mm, 1-5 mm, 5-10 mm, etc.), p_{k} is the predicted probability in forecast category k, and o_{k} is an indicator (0=no, 1=yes) for the observation in category k.
Characteristics: The ranked probability score measures the sum of squared differences in cumulative probability space for a multi-category probabilistic forecast. The RPS penalizes forecasts less severely when their probabilities are close to the true outcome, and more severely when their probabilities are further from the actual outcome. For two forecast categories the RPS is the same as the Brier Score.
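In cumulative form the score is RPS = (1/(M-1)) * sum over m of [CDF_forecast(m) - CDF_observed(m)]^2; a sketch:

```python
def ranked_probability_score(probs, obs):
    # probs: forecast probabilities for the M categories (summing to 1)
    # obs: indicator vector, 1 for the observed category, 0 elsewhere
    m = len(probs)
    cum_p = cum_o = 0.0
    total = 0.0
    for p, o in zip(probs, obs):
        cum_p += p
        cum_o += o
        total += (cum_p - cum_o) ** 2
    return total / (m - 1)
```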
Ranked probability skill score -
Answers the question: What is the relative skill of the probabilistic forecast over that of climatology, in terms of getting close to the actual outcome?
Range: minus infinity to 1, 0 indicates no skill when compared to the reference forecast. Perfect score: 1.
Characteristics: The RPSS measures the improvement of the multi-category
probabilistic forecast relative to a reference forecast (usually the long-term
or sample climatology). It is similar to the 2-category Brier skill score,
in that it takes climatological frequency into account. Because the denominator
approaches 0 for a perfect forecast, this score can be unstable when applied
to small data sets. This score should always be applied to a sufficiently
large sample, one for which the sample climatology of the event is representative
of the long term climatology. The rarer the event, the larger the number
of samples needed to stabilise the score. For best results the ranked probability
skill score should be computed on the whole sample, i.e., the skill should
be computed for an aggregated sample, not averaged for several samples.
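A sketch pairing the forecast with a climatological reference distribution over the same categories (illustrative interface):

```python
def rpss(p_fcst, p_clim, obs):
    # RPSS = 1 - RPS_forecast / RPS_climatology
    def rps(probs):
        cum_p = cum_o = 0.0
        total = 0.0
        for p, o in zip(probs, obs):
            cum_p += p
            cum_o += o
            total += (cum_p - cum_o) ** 2
        return total / (len(probs) - 1)
    return 1.0 - rps(p_fcst) / rps(p_clim)
```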
Rank histogram (Hamill, 2001)
Answers the question: How well does the ensemble spread of the forecast represent the true variability (uncertainty) of the observations?
Perfect: All bars the same height, i.e., flat histogram.
To construct a rank histogram, do the following:
1. At every observation (or analysis) point, rank the N ensemble members from lowest to highest. This defines N+1 possible bins that the observation could fall into, including the two extremes.
2. Identify which bin the observation falls into at each point.
3. Tally over many observations to create a histogram of rank.
Characteristics: Also known as a "Talagrand diagram", this method checks where the verifying observation usually falls with respect to the ensemble forecast data, which is arranged in increasing order at each grid point. In an ensemble with perfect spread, each member represents an equally likely scenario, so the observation is equally likely to fall between any two members. Note that a flat rank histogram does not necessarily indicate a good forecast, it only measures whether the observed probability distribution is well represented by the ensemble.
Interpretation:
Flat - ensemble spread about right to represent forecast uncertainty.
U-shaped - ensemble spread too small, many observations falling outside
the extremes of the ensemble
Dome-shaped - ensemble spread too large, most observations falling
near the center of the ensemble
Asymmetric - ensemble contains bias
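The construction steps above can be sketched as follows (this illustration assumes no ties between ensemble members and the observation):

```python
def rank_histogram(ensembles, observations):
    # ensembles: one N-member forecast per case; observations: matching scalars
    n_members = len(ensembles[0])
    counts = [0] * (n_members + 1)  # N+1 rank bins, including the two extremes
    for members, obs in zip(ensembles, observations):
        # the observation's bin index = number of members below it
        rank = sum(1 for m in members if m < obs)
        counts[rank] += 1
    return counts
```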
Relative value (value score) (Wilks, 2001)
Answers the question: For a cost/loss ratio C/L for taking action based on a forecast, what is the relative improvement in economic value between climatological and perfect information?
Range: minus infinity to 1. Perfect score: 1.
In computing the value score, the hits, misses, and false alarms are counted using the climatological probability P_{clim} as the yes/no decision threshold.
Characteristics: The relative value is a skill score of expected
expense, with climatology as the reference forecast. Because the cost/loss
ratio is different for different users of forecasts, the value is generally
plotted as a function of C/L. Like ROC, it gives information that can be
used in decision making, but unlike ROC, it is sensitive to bias in the
forecast.
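The closed form below is an assumption (the source equation is not reproduced here); it follows the standard cost/loss expression in terms of hit rate, false alarm rate, base rate s, and cost/loss ratio a, e.g. as in Richardson (2000):

```python
def relative_value(hits, misses, false_alarms, correct_negatives, cost_loss):
    # Cost/loss value score; this closed form (Richardson 2000) is an
    # assumption, since the source equation is not shown.
    n = hits + misses + false_alarms + correct_negatives
    s = (hits + misses) / n                                 # base rate
    h = hits / (hits + misses)                              # hit rate
    f = false_alarms / (false_alarms + correct_negatives)   # false alarm rate
    a = cost_loss
    return (min(a, s) - f * a * (1 - s) + h * s * (1 - a) - s) / (min(a, s) - s * a)
```

In practice this is evaluated over a range of C/L values to produce the value curve described above.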
Beth Ebert, BMRC Weather Forecasting Group, June 2003