A colleague of mine used
to ask me, "Isn't there just one statistic that would give the accuracy
of the forecast?"

It would be nice if things
were that simple, but they rarely are! Typically we want to measure the
accuracy and skill of a forecast *system*, which means that we have
a large number of forecasts and observations covering a domain in time
and/or space (rather than a single forecast / observation pair). There
may be occasions when all that we care about is the average magnitude of
the difference between the forecast and observations (in which case the
mean absolute error would be the appropriate statistic), but usually we
are interested in knowing more. Questions that we might like to answer
include:

- Was the forecast magnitude correct?
- Was the forecast biased?
- How often did the forecast make an unacceptably large error?
- Did the distribution of forecast values resemble the distribution of observed values?
- Was the timing correct?
- Did the forecast put the event in the right place?
- Did the forecast event have the correct size and duration?
- Was the trend correct?

Some attributes of the forecast performance may be more important than others, depending on the application. For example, for daily maximum and minimum temperature forecasts it is important to get the magnitude right and to avoid large errors. Useful statistics in this case would be the mean difference (to measure bias), the mean absolute (or RMS) error, and perhaps a binary accuracy score based on a temperature error threshold.
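These statistics are straightforward to compute. A minimal sketch using NumPy, with made-up temperature values and an assumed 2-degree error threshold for the binary accuracy score:

```python
import numpy as np

# Hypothetical daily maximum-temperature forecasts and observations (deg C)
forecast = np.array([24.1, 26.5, 23.0, 28.2, 30.1, 27.4, 25.0])
observed = np.array([23.5, 27.0, 24.2, 27.8, 31.5, 26.9, 25.6])

errors = forecast - observed
mean_error = errors.mean()              # bias: average of signed errors
mae = np.abs(errors).mean()             # mean absolute error
rmse = np.sqrt((errors ** 2).mean())    # root-mean-square error

# Binary accuracy: fraction of forecasts within an assumed 2 deg C threshold
threshold = 2.0
accuracy = (np.abs(errors) <= threshold).mean()
```

Note that RMSE penalizes large errors more heavily than MAE, so comparing the two gives a rough sense of whether the error distribution has heavy tails.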

Another example is fog forecasts for airports, where the main issues are whether or not the fog occurs, and if so, when it begins and ends. Categorical statistics such as probability of detection and false alarm ratio are generally used to evaluate these forecasts. If the forecasts are issued as probabilities of the event occurring, then probabilistic verification methods such as reliability diagrams, ROC diagrams, and Brier scores can be used.
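These categorical and probabilistic scores come straight from the joint counts of forecasts and observations. A minimal sketch with invented fog outcomes, computing probability of detection and false alarm ratio from the contingency-table counts, plus a Brier score for a corresponding set of probability forecasts:

```python
import numpy as np

# Hypothetical yes/no fog forecasts and observations (1 = fog occurred)
forecast = np.array([1, 1, 0, 0, 1, 0, 1, 0, 0, 1])
observed = np.array([1, 0, 0, 1, 1, 0, 0, 0, 0, 1])

hits = np.sum((forecast == 1) & (observed == 1))
misses = np.sum((forecast == 0) & (observed == 1))
false_alarms = np.sum((forecast == 1) & (observed == 0))

pod = hits / (hits + misses)                 # probability of detection
far = false_alarms / (hits + false_alarms)   # false alarm ratio

# Brier score for probabilistic forecasts of the same events (lower is better)
prob_forecast = np.array([0.9, 0.6, 0.1, 0.3, 0.8, 0.2, 0.7, 0.1, 0.0, 0.9])
brier = np.mean((prob_forecast - observed) ** 2)
```

POD answers "what fraction of observed events were forecast?", while FAR answers "what fraction of forecast events failed to occur?"; both are needed, since either can be gamed on its own.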

A third example is a hurricane forecast, where the most critical predictions are for its intensity and the location and timing of landfall. Because a hurricane is a definable entity in time and space, simple matched-point statistics may not be very revealing. Statistics based on the properties of the forecast and observed entities are more appropriate. Examples would be the difference between the forecast and observed central pressure, the distance between the forecast and observed low pressure center, and the vector difference between predicted and observed storm velocity. Mean values of these differences are generally used to evaluate a set of hurricane forecasts.
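The entity-based differences mentioned above can be sketched as follows, with invented storm properties. The track error here is a great-circle (haversine) distance between the forecast and observed low centers; all numbers are illustrative:

```python
import numpy as np

# Hypothetical forecast vs. observed hurricane properties near landfall
fc_pressure, ob_pressure = 955.0, 948.0   # central pressure (hPa)
fc_center = np.array([27.5, -82.6])       # lat, lon (degrees)
ob_center = np.array([27.9, -82.4])

pressure_error = fc_pressure - ob_pressure

# Great-circle distance between forecast and observed centers (haversine)
R_EARTH_KM = 6371.0
lat1, lon1 = np.radians(fc_center)
lat2, lon2 = np.radians(ob_center)
a = (np.sin((lat2 - lat1) / 2) ** 2
     + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
track_error_km = 2 * R_EARTH_KM * np.arcsin(np.sqrt(a))

# Magnitude of the vector difference between predicted and observed
# storm velocity (km/h, eastward and northward components)
fc_velocity = np.array([12.0, 20.0])
ob_velocity = np.array([10.0, 24.0])
velocity_error = np.linalg.norm(fc_velocity - ob_velocity)
```

Averaging such errors over many storms gives the summary measures used to evaluate a set of hurricane forecasts.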

Sometimes an administrator insists on receiving only a summary score, in which case there are several options:

(a) argue for the use of a few key statistics instead of only one,

(b) try to select the single most important statistic,

(c) combine several statistics into one score, or

(d) combine several statistics into one diagram.

I believe that (a) and (d) are the best options, if they are allowable. Selecting only one statistic (option b) gives an incomplete picture of forecast performance, and can even be misleading. For example, a forecast with little bias can look great according to the mean error, but be completely useless if it does not capture the varying nature of the quantity being predicted.
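The "unbiased but useless" situation is easy to demonstrate numerically. In this sketch (synthetic data, assuming NumPy), a forecast that always issues the long-term mean has essentially zero mean error yet a large RMS error, because it never follows the variations being predicted:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic observations: a seasonal cycle plus noise
observed = 20 + 5 * np.sin(np.linspace(0, 4 * np.pi, 100)) + rng.normal(0, 1, 100)

# A "forecast" that always issues the overall mean value
forecast = np.full_like(observed, observed.mean())

mean_error = (forecast - observed).mean()               # ~0: looks unbiased
rmse = np.sqrt(((forecast - observed) ** 2).mean())     # large: useless forecast
```

The mean error alone would rate this constant forecast highly, which is exactly why a single statistic can mislead.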

A combined score (option c) is tempting, but is difficult to interpret. Does a good score mean that the forecast has "good" performance in all aspects, or great performance on some aspects and lousy performance on others? Should all components of a combined score receive equal weight, or should some get more emphasis than others? A combined score may be used by administrators for monitoring overall performance and setting performance targets (one example is the Met Office's NWP Index).

An example of option (d)
is the Taylor diagram (Taylor, 2001), which combines the RMS error, the
correlation coefficient, and the standard deviations of the forecasts and
observations on one diagram. Other types of diagrams are possible.

References:

Brooks, H.E. and C.A. Doswell
III, 1996: A comparison of measures-oriented and distributions-oriented
approaches to forecast verification. *Wea. Forecasting*, **11**,
288-303.

Taylor, K.E., 2001: Summarizing
multiple aspects of model performance in a single diagram. *J. Geophys.
Res.*,
**106** (D7), 7183-7192.

June 2003