Calculation of a verification score from a sample of forecasts and verification data should usually be only a first step. It should ideally be followed by some form of statistical inference. Even if the quality of forecasts remains constant, sampling variability means that a later sample of data will give a different value for the score, so the value of a score cannot be viewed in isolation, without some idea of its sampling variation. Most scores have an underlying "population" value, and the calculated score can be viewed as a (point) estimate of this population parameter. It is good practice, where possible, to find a confidence interval for this parameter, an interval that has a pre-specified high probability of including the true value of the parameter. To do so we need to know the sampling distribution of the sample score. Sometimes this can be approximated by a tractable distribution, such as a Gaussian distribution. On other occasions a non-parametric or resampling approach, such as the bootstrap, is needed.
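As a minimal sketch of the bootstrap approach, the following resamples a (hypothetical) set of absolute forecast errors with replacement and takes percentiles of the resampled means as a confidence interval. The error values, sample size, and number of resamples are all illustrative assumptions, not from the text above.

```python
import random
import statistics

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of a score sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Resample the sample with replacement, recomputing the score each time
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical sample of absolute forecast errors (e.g. degrees C)
errors = [1.2, 0.8, 2.5, 0.3, 1.9, 1.1, 0.6, 2.2, 1.4, 0.9,
          1.7, 0.5, 2.0, 1.3, 0.7, 1.6, 1.0, 2.4, 0.4, 1.5]
lo, hi = bootstrap_ci(errors)
print(f"mean absolute error: {statistics.fmean(errors):.2f}")
print(f"95% bootstrap CI: ({lo:.2f}, {hi:.2f})")
```

The width of the interval gives exactly the "idea of sampling variation" referred to above: a second sample of forecasts would be expected to yield a score somewhere in this range.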

It is important not to confuse the idea of a confidence interval for a population parameter with that of a prediction interval. The latter makes statements about likely values of a sample quantity, given assumptions about the underlying population; both can be useful in inference.

An alternative to interval estimation (constructing confidence intervals) is to test hypotheses. The most usual null hypotheses of interest are:

• The population value of a verification score for a forecasting system is that corresponding to some reference forecast and hence represents zero skill.
• The population values of a verification score are the same for two forecasting systems.

The alternative hypotheses are usually fairly obvious: the forecasting system has a population verification score better than that of the reference forecasts; a new system has a better population verification score than an old one.
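The second null hypothesis (equal population scores for two systems) can be tested without distributional assumptions using a permutation test on paired score differences. The sketch below is illustrative: the two sets of absolute errors are hypothetical, and the sign-flipping scheme assumes the two systems were scored on the same set of cases.

```python
import random
import statistics

def paired_permutation_test(scores_a, scores_b, n_perm=5000, seed=0):
    """One-sided paired permutation test.

    H0: the two systems have the same population score.
    H1: system A has a lower (better) mean error than system B.
    Under H0 the sign of each paired difference is arbitrary, so we
    compare the observed mean difference with its sign-flipped resamples.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = statistics.fmean(diffs)
    count = 0
    for _ in range(n_perm):
        perm_mean = statistics.fmean(d * rng.choice((1, -1)) for d in diffs)
        count += perm_mean <= observed
    return count / n_perm  # p-value

# Hypothetical absolute errors for a new and an old system on the same cases
new = [0.9, 1.1, 0.7, 1.3, 0.8, 1.0, 0.6, 1.2, 0.9, 1.1]
old = [1.4, 1.3, 1.1, 1.6, 1.2, 1.5, 1.0, 1.7, 1.3, 1.4]
p = paired_permutation_test(new, old)
print(f"p-value: {p:.4f}")
```

A small p-value here leads to rejecting the null hypothesis of equal population scores in favour of the new system.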

The idea of power is often forgotten in hypothesis testing. The probability of Type I error (rejecting the null hypothesis when it is true) is controlled to be a small number (for example 5%, 1%), but the power (the probability of correctly rejecting the null hypothesis when it is false) is frequently ignored. A test whose power is not much greater than its probability of Type I error is of little use. Power can be used to choose between competing tests of the same null hypothesis.
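To make the idea of power concrete, the sketch below computes the power of a one-sided z-test (normal approximation) that a hit rate exceeds a reference value, for a few sample sizes. The hit rates 0.5 and 0.6 and the sample sizes are illustrative assumptions only; the point is that power grows with sample size while the Type I error rate stays fixed at alpha.

```python
import math
from statistics import NormalDist

N = NormalDist()  # standard normal, for cdf and inverse cdf

def power_one_sided_z(p0, p1, n, alpha=0.05):
    """Power of a one-sided z-test of H0: hit rate = p0 vs H1: hit rate > p0,
    when the true hit rate is p1 (normal approximation to the binomial)."""
    z_alpha = N.inv_cdf(1 - alpha)
    se0 = math.sqrt(p0 * (1 - p0) / n)  # standard error under H0
    se1 = math.sqrt(p1 * (1 - p1) / n)  # standard error under the true rate
    crit = p0 + z_alpha * se0           # rejection threshold on the hit-rate scale
    return 1 - N.cdf((crit - p1) / se1)

for n in (50, 200, 800):
    print(n, round(power_one_sided_z(0.5, 0.6, n), 3))
```

With a small sample the test has little chance of detecting a genuinely better-than-reference hit rate, even though the Type I error is properly controlled; this is precisely the situation described above in which a test is "of little use".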

There is a close link between hypothesis testing, confidence intervals and prediction intervals in many circumstances. A null hypothesis will be rejected if and only if the null value of the population parameter lies outside a corresponding confidence interval, which in turn happens if and only if a sample score value lies outside a corresponding prediction interval.
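The test/confidence-interval duality can be checked numerically. In this sketch (hypothetical hit rate 0.62 over 300 forecasts), a one-sided z-test and a one-sided confidence bound are both built from the same estimate-based (Wald) standard error, so the test rejects H0: p = p0 exactly when p0 falls below the lower confidence bound; the duality holds exactly here because both constructions share that standard error.

```python
import math
from statistics import NormalDist

N = NormalDist()

def z_test_rejects(phat, p0, n, alpha=0.05):
    """One-sided z-test of H0: p = p0 vs H1: p > p0, using the Wald SE."""
    se = math.sqrt(phat * (1 - phat) / n)
    return (phat - p0) / se > N.inv_cdf(1 - alpha)

def ci_lower(phat, n, alpha=0.05):
    """Lower bound of the one-sided Wald confidence interval for p."""
    se = math.sqrt(phat * (1 - phat) / n)
    return phat - N.inv_cdf(1 - alpha) * se

# Duality: the test rejects exactly when p0 lies outside (below) the interval
for p0 in (0.50, 0.55, 0.60, 0.65):
    rejects = z_test_rejects(0.62, p0, 300)
    outside = p0 < ci_lower(0.62, 300)
    assert rejects == outside
    print(f"p0 = {p0:.2f}: reject = {rejects}")
```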

Ian Jolliffe, February 2003

A few other links for hypothesis testing:

Probability and Statistics for Biological Sciences: Introduction to Hypothesis Testing (David W. Sabo, British Columbia Institute of Technology)

WMO Climate Information and Prediction Services (CLIPS) curriculum - Link to Statistical Inference (Ian Jolliffe, University of Aberdeen)