International Verification Methods Workshop
 

Abstracts and Presentations

Click here to download an MS Word (.doc) version of the abstracts.
 

1.1 Estimation of uncertainty in verification measures   Download presentation (PDF)

Ian Jolliffe
Department of Meteorology, University of Reading

A verification measure on its own is of little use – it needs to be complemented by some measure of uncertainty. If the aim is to find limits for an underlying ‘population’ value of the measure, then a confidence interval is the obvious way to express the uncertainty. Various ways of constructing confidence intervals will be discussed – exact, asymptotic, bootstrap etc.

In some circumstances, so-called prediction intervals are more relevant – the difference between these and confidence intervals will be explained. Hypothesis testing may also be useful for assessing uncertainty in some circumstances, especially when scores for two operational systems or two time periods are to be compared – connections with confidence intervals will be discussed.
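
As an illustration of the bootstrap option mentioned above, the following sketch computes a percentile bootstrap confidence interval for a verification measure; the hit rate, the synthetic data and the 95% level are our own illustrative choices, not taken from the talk.

    import numpy as np

    rng = np.random.default_rng(42)

    def hit_rate(fcst, obs):
        # Fraction of observed events that were also forecast (POD).
        return np.sum((fcst == 1) & (obs == 1)) / np.sum(obs == 1)

    def bootstrap_ci(fcst, obs, stat, n_boot=10000, alpha=0.05):
        # Percentile bootstrap: resample forecast/observation pairs with
        # replacement and take quantiles of the resampled statistic.
        n = len(fcst)
        samples = [stat(fcst[idx], obs[idx])
                   for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
        return np.quantile(samples, [alpha / 2, 1 - alpha / 2])

    # Synthetic binary forecasts that match the observations 80% of the time.
    obs = rng.integers(0, 2, 500)
    fcst = np.where(rng.random(500) < 0.8, obs, 1 - obs)
    lo, hi = bootstrap_ci(fcst, obs, hit_rate)
    print(f"hit rate = {hit_rate(fcst, obs):.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")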
 

1.2 Use of cross validation in forecast verification   Download presentation (PDF)

Tressa L. Fowler
National Center for Atmospheric Research, Boulder, CO

Cross validation techniques are commonly used in statistics, especially in the development of statistical models. These techniques can also be used in forecast verification, though they may require some modification. Typically, cross validation is employed in forecast verification when the forecasts must be created and verified with the same observations. By using cross validation, a greater degree of independence between the forecasts and observations is achieved. However, complete independence may still not be attained: the observations may be biased or exhibit spatial and/or temporal dependence. Two examples of the use of cross validation techniques in forecast verification are presented. Some issues regarding bias and lack of spatial and temporal independence in the observations will be discussed, along with some potential mitigation strategies.
 

1.3 Experimentation with the LEPS Score: Comparison of local forecast errors in probability and measurement space   Download presentation (PDF)

Pertti Nurmi and Sigbritt Näsman
Finnish Meteorological Institute

The quality of forecasts of continuous weather parameters, like temperature, is typically examined by computing the Root Mean Square Error (RMSE) or the Mean Absolute Error (MAE), and the skill score(s) based on these measures. Another, but very scarcely used, method is to translate the forecast error in measurement space into probability space. Linear Error in Probability Space (LEPS) is defined as the mean absolute difference between the cumulative climatological frequency of the forecast and that of the observation, and is hence a “relative” of the MAE. The definition and computation of LEPS require knowledge of the (sample) cumulative climatological distribution at the relevant location(s). LEPS takes into account the variability of the predictand and does not depend on its scale. Further, LEPS encourages forecasting (and the forecaster) in the tails of the climatological distribution, as errors there are penalized less than errors of similar size in a more probable region of the distribution, close to the median. LEPS is claimed to be applicable to verify and compare forecasts at different locations, with different climatological frequency distributions. Given a reference forecast, e.g. the climatological median, a LEPS skill score can be defined in a manner identical to that used in measurement space.
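
A minimal sketch of this definition, assuming an empirical climatological CDF estimated from a historical sample (all numbers invented for illustration):

    import numpy as np

    def empirical_cdf(climatology):
        # Returns a function mapping a value to its cumulative climatological
        # frequency, estimated from the historical sample.
        sorted_clim = np.sort(climatology)
        def F(x):
            return np.searchsorted(sorted_clim, x, side="right") / len(sorted_clim)
        return F

    def leps(fcst, obs, climatology):
        # Mean absolute difference between forecast and observation
        # positions in probability space.
        F = empirical_cdf(climatology)
        return np.mean(np.abs(F(fcst) - F(obs)))

    # Illustrative data: 30 winters of minimum temperatures (climatology),
    # plus one month of forecast/observation pairs.
    rng = np.random.default_rng(0)
    clim = rng.normal(-10, 8, 30 * 90)
    obs = rng.normal(-12, 8, 31)
    fcst = obs + rng.normal(0, 3, 31)      # forecasts with ~3-degree errors
    print(f"LEPS = {leps(fcst, obs, clim):.3f}   MAE = {np.mean(np.abs(fcst - obs)):.2f}")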

Forecast performance based on the more traditional methods, as opposed to the LEPS approach, is studied utilizing ten years (1994-2003) of wintertime minimum temperature forecasts in a cold region of Finland, and, respectively, summertime maximum temperature forecasts in a warm region. Emphasis is thus on the locally most extreme cold vs. warm temperature regions of the Finnish climate, and forecasting there.
 

1.4 A comment on the ROC curve and the area under it as performance measures   Download presentation (PDF)

Caren Marzban
Center for Analysis and Prediction of Storms, University of Oklahoma, Norman OK and Department of Statistics, University of Washington, Seattle, WA

The Receiver Operating Characteristic (ROC) curve is a two-dimensional measure of classification performance. The area under the ROC curve (AUC) is a scalar measure gauging one facet of performance. In this note, five idealized models are utilized to relate the shape of the ROC curve, and the area under it, to features of the underlying distribution of forecasts. This allows for an interpretation of the former in terms of the latter. The analysis is pedagogical in that many of the findings are already known in more general (and more realistic) settings; however, the simplicity of the models considered here allows for a clear exposition of the relation. For example, although in general there are many reasons for an asymmetric ROC curve, the models considered here clearly illustrate that for symmetric distributions, an asymmetry in the ROC curve can be attributed to unequal widths of the distributions. Also, for bounded forecasts, e.g., probabilistic forecasts, any asymmetry in the ROC curve can be explained in terms of a simple combination of the means and widths of the distributions. Furthermore, it is shown that AUC discriminates well between “good” and “bad” models, but not between “good” models.
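
The flavor of such idealized models can be sketched as follows: Gaussian forecast distributions conditioned on event and non-event, with the ROC traced out by a decision threshold (a toy construction of our own; the abstract's five models are not reproduced here).

    import numpy as np
    from scipy.stats import norm

    def roc_points(mu0, sd0, mu1, sd1, thresholds):
        # Hit rate and false-alarm rate when a threshold is applied to
        # Gaussian forecast distributions for non-events (0) and events (1).
        pod = 1 - norm.cdf(thresholds, mu1, sd1)    # P(f > t | event)
        pofd = 1 - norm.cdf(thresholds, mu0, sd0)   # P(f > t | no event)
        return pofd, pod

    t = np.linspace(-10, 10, 2001)
    pofd_eq, pod_eq = roc_points(0, 1, 1.5, 1, t)   # equal widths: symmetric ROC
    pofd_ne, pod_ne = roc_points(0, 1, 1.5, 2, t)   # unequal widths: asymmetric ROC

    # For Gaussians, AUC = Phi(separation / sqrt(sd0^2 + sd1^2)).
    print(f"AUC (unequal widths) = {norm.cdf(1.5 / np.sqrt(1 + 4)):.3f}")
    x, y = pofd_ne[::-1], pod_ne[::-1]              # sort by increasing POFD
    print(f"numerical check      = {np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2):.3f}")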
 

1.5 Incorporating measurement error in skill assessment   Download presentation (PDF)

William Briggs
GIM, Weill Cornell Medical College, 525 E. 68th, Box 46, New York, NY 10021, wib2004@med.cornell.edu

Matt Pocernich
Research Applications Program, National Center for Atmospheric Research, Boulder, CO, pocernic@rap.ucar.edu

David Ruppert
School of Operations Research & Industrial Engineering, Rhodes Hall, Cornell University, Ithaca, NY 14853, dr24@cornell.edu

We present an extension to the skill score test developed in Briggs and Ruppert (BR; 2004) to account for possible measurement error in the meteorological observation.  Errors in observations can occur in, among other places, pilot reports of icing and tornado spotting.  It is desirable to account for measurement error so that the true skill of the forecast can be assessed; ignoring measurement error gives a misleading picture of the forecast's true performance.  This extension supposes a statistical measurement error model where "gold" standard data, or expert opinion, is available to characterize the measurement error characteristics of the observation.  These model parameters are then inserted into the BR skill score, for which a statistical test of significance can be performed.
 

1.6 Incompatibility of equitability and propriety for the Brier score   Download presentation (PDF)

Ian Jolliffe and David Stephenson
Department of Meteorology, University of Reading

The Brier score, and its corresponding skill score, are the most commonly used verification measures for probability forecasts of a binary event. They also form the basis of the much-used Ranked Probability Score for probability forecasts of more than two categories.

Recently published modifications of the Brier skill score have attempted to overcome a deficiency of the score related to its non-equitability. Although they improve matters in some respects, there are accompanying disadvantages, including the loss of propriety.

We examine the conditions needed for equitability and for propriety in the case of binary probability forecasts and show that in general the two requirements are incompatible. The case of deterministic forecasts for binary events is also investigated.
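
For reference (standard definitions, not material specific to this abstract), the Brier score for N probability forecasts p_i of binary outcomes o_i, and the associated skill score against a reference forecast such as climatology, are

    BS = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2, \qquad
    BSS = 1 - \frac{BS}{BS_{ref}}.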
 

1.7 The use of equitable skill scores in the U.S. National Weather Service   Download presentation (PDF)

Charles K. Kluepfel
NOAA/National Weather Service, Office of Climate, Water, and Weather Services, Silver Spring, Maryland

Momchil Georgiev
R.S. Information Systems, Silver Spring, Maryland

Building upon statistical methods discussed in the meteorological literature about a decade ago, the U.S. National Weather Service (NWS) computes an equitable skill score to assist in the evaluation of forecast performance of any element that is easily divided into n categories.  Using these categories, an n x n contingency table of forecast categories versus observation categories may be prepared, and skill scores may be computed from the contingency table.  A skill score is equitable when the scoring rules do not encourage a forecaster to favor forecasts of one or more events at the expense of the other events.  Several choices of equitable scores are available.  The Gandin, Murphy, and Gerrity (GMG) scores have the following attributes: (1) correct forecasts of rare events are rewarded more than correct forecasts of common events, and (2) the penalty assigned to incorrect forecasts increases as the size of the error increases.  Prior to GMG, most equitable scores only rewarded categorically correct forecasts, i.e., forecast category equals observed category, and treated all “incorrect” forecasts equally, regardless of the size of the error.  Hence, they did not have the second attribute.

The GMG method computes the score by multiplying each cell of the n x n contingency table by the corresponding cell of a scoring or reward/penalty matrix, which is based upon climatology.  Finding an appropriate climatology for all forecast elements has proven to be a nontrivial exercise.  Several approaches to building the scoring matrix have been tried and will be presented.  Some samples of results will also be presented.
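
The scoring step itself is a single sum over the table. In the sketch below, both the contingency table and the reward/penalty matrix are invented for illustration; an actual Gandin-Murphy/Gerrity matrix would be derived from the climatological category probabilities.

    import numpy as np

    # Relative-frequency contingency table P[i, j] = P(forecast=i, observed=j)
    # for three categories (invented numbers for illustration).
    P = np.array([[0.50, 0.08, 0.01],
                  [0.07, 0.20, 0.03],
                  [0.01, 0.04, 0.06]])

    # Illustrative reward/penalty matrix with the two GMG attributes: correct
    # rare-event forecasts score highest, and penalties grow with error size.
    S = np.array([[ 0.3, -0.2, -1.0],
                  [-0.2,  0.6, -0.4],
                  [-1.0, -0.4,  1.9]])

    # GMG-style score: cell-by-cell product of table and scoring matrix.
    print(f"score = {np.sum(P * S):.3f}")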

Acknowledgement: The authors wish to thank Dr. Robert E. Livezey for his valuable advice and encouragement on this project.
 

2.1 Verification of rare extreme events   Download presentation (PDF)

David B. Stephenson
University of Reading, Reading UK

Rare extreme events often lead to severe impacts/losses and therefore provide an important yet difficult challenge for operational weather forecasters.

This talk will define what is meant by an extreme event and will raise some of the issues that make verification of such events problematic.  A review of the most commonly used techniques will be presented and will be illustrated using Met Office mesoscale forecasts of 6-hourly precipitation totals observed at Eskdalemuir in Scotland.

Some recent asymptotic results will be presented that show that for regular ROC systems, most of the traditional scores tend to zero and become non-informative in the limit of vanishingly rare events. Some recent ideas from bivariate extreme value theory will be presented as an alternative for assessing the skill of forecasts of extreme events. It is hoped that this might trigger a stimulating debate as to whether or not we might have more skill at forecasting extremes than we do at forecasting more frequent low-intensity events.
 

2.2 We are surprisingly skillful, yet they call us liars. On accuracy versus skill in the weather forecast production process.   Download presentation (PDF)

Martin Göber
Basic Services, Deutscher Wetterdienst, Offenbach, Germany.

Given that we measure (almost) nothing, know (almost) nothing and program (almost) only bugs, weather forecasts are surprisingly skillful.  Yet weather forecasts have been perceived as very inaccurate, to the point of being joked about. The key to explaining these differing appreciations of weather forecasts lies in the difference between perceiving accuracy as a measure of the difference between forecasts and observations, and scientific skill as a measure of the ratio of the accuracies of different forecasts. In other words, accuracy tells us how bad the forecast was, whereas skill tells us how bad the forecast was given the difficulty of forecasting. The latter seems to be a "fairer" view, but is hardly ever used in public to judge the achievements of weather forecasting.
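
In the usual skill-score form (a standard definition, added here for reference), skill normalizes accuracy by the difficulty of the forecasting situation:

    SS = \frac{A_{fcst} - A_{ref}}{A_{perf} - A_{ref}},

where A_fcst, A_ref and A_perf are the accuracies of the forecast, of an unskilled reference such as climatology or persistence, and of a perfect forecast, respectively.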

Applying these two views to operational forecasts shows that while forecasts for more extreme events are less accurate than for normal events, forecasts for more extreme events are more skillful than for normal events. Using appropriate measures, this result is independent of the use of continuous or categorical statistics.

Measuring skill during the different stages of the weather forecast production process (climatology -> persistence -> numerical model -> MOS -> forecaster) can show where skill comes from and what problems exist in general. The concept above will be demonstrated with results from the verification of operational short and medium range forecasts, Terminal Aerodrome forecasts and road weather forecasts, as well as from a new system of weather warnings for counties.
 

2.3 A probability of event occurrence approach to performance estimate   Download presentation (PDF)

Phil Chadwick
Ontario Weather Center, Meteorological Service of Canada

The accurate measurement of program performance demands the careful definition of the event and a matched set of program messages and events. In remote areas, many events go undetected, making the estimate of program performance inaccurate.

A more realistic measure of performance can be estimated by including events detected by remote sensing data. The probability that an event actually occurred can be estimated from the strength and pattern of the event signature, as well as the number of different remote sensing platforms that identify the event. Performance measurement can then be completed using a series of probabilistic event datasets.  The 100 percent probability of event occurrence dataset would include only those events that have been ground-truthed and confirmed. The 50 percent probability of event occurrence dataset would include all of the confirmed events as well as those deemed to have occurred with a confidence of at least 50 percent. A continuum of performance estimates could be obtained by using the complete range of probabilistic event datasets, from the confirmed events to those that include events with only a low probability of occurrence. The performance could then be plotted versus the probability of event occurrence used in each dataset. The shape of such a curve will reveal much about the likely performance, as well as establish lower and upper limits on the actual program performance.
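
A hedged sketch of the proposed continuum of performance estimates, using an invented event list; `event_prob` stands in for the occurrence probability inferred from remote sensing, and `warned` marks events covered by a program message.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 400
    # Invented events: occurrence probability from remote sensing evidence,
    # and whether the warning program covered each event.
    event_prob = rng.uniform(0.1, 1.0, n)
    warned = rng.random(n) < 0.4 + 0.4 * event_prob  # better coverage of clearer events

    # Probability of detection on nested event datasets: each dataset keeps
    # only events whose probability of occurrence meets the threshold.
    for threshold in [0.95, 0.9, 0.75, 0.5, 0.25]:
        subset = event_prob >= threshold
        pod = warned[subset].mean()
        print(f"P(event) >= {threshold:4.2f}: {subset.sum():3d} events, POD = {pod:.2f}")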

It is suggested that for severe convection, the remote sensing data used could include radar, satellite and lightning data. Robust severe event signatures are already established for volume scan radar data. Such an approach would also encourage the quantification of event signatures for satellite and lightning data. This study would also yield more information on probable event distribution in time and space.
 

2.4 Verification package for R   Download presentation (PDF)

Matt Pocernich
Research Applications Program, National Center for Atmospheric Research, Boulder, CO

A new forecast verification package has been developed using the R programming language.  R is an open source statistical language that has been widely embraced by the statistical community.  Developed by statisticians around the world, more than 350 packages on a huge variety of topics have been contributed to the R library for general use. In the near future, the R verification package will also be included in this resource.

Some of the functions in the verification package are routine, such as receiver operating characteristic plots, attributes diagrams and reliability plots.  Other functions are more research oriented.  For example, Barbara Casati has contributed a spatial scale-intensity skill score function.  This function is used to verify spatial forecasts, taking into account the effects of scale on a skill score.  William Briggs (Cornell Medical College) has contributed an approach that accounts for measurement error in calculating a skill score.  While the verification package was primarily developed to study meteorological forecasts, it has been created in a generic way so that it can be useful across many disciplines.  The functions and plots are written to operate on various types of forecasts and observations, namely binary, continuous, probabilistic and distributional.

This package will soon be available via the R-project website.  We hope three outcomes will result from making it more generally available.  First, feedback from a larger base of users will make the functions more robust, and therefore benefit our analyses.  Second, by making verification routines more accessible, we hope to increase interest in the field of verification.  Finally, we hope to encourage others to share their verification algorithms with a larger community.  Allowing immediate access to a new verification method will increase the likelihood that it will be used.
 

3.1 Verifying probabilistic forecasts of continuous weather variables   Download presentation (PDF)

Tilmann Gneiting
Department of Statistics, University of Washington, Seattle WA

Probabilistic forecasts of continuous or mixed discrete-continuous weather variables ideally take the form of predictive probability density functions (PDFs) or predictive cumulative distribution functions (CDFs).  Then, how do we verify predictive CDFs?

The goal of probabilistic forecasting can be paraphrased as maximizing the sharpness of the predictive CDFs subject to calibration. Calibration refers to the statistical consistency between the forecasts and the verifications, and is a joint property of the predictions and the observations.  Sharpness refers to the concentration of the predictive CDFs and is a property of the forecasts only.  I will describe a game-theoretic framework and diagnostic tools for assessing calibration and sharpness, and I will review scoring rules, such as the ignorance score and the continuous ranked probability score, that assign numerical scores to forecasters.
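
One of the scoring rules mentioned, the continuous ranked probability score, has a closed form for a Gaussian predictive distribution (a standard result); the sketch below contrasts a calibrated predictive CDF with an overdispersed one on synthetic data.

    import numpy as np
    from scipy.stats import norm

    def crps_normal(mu, sigma, y):
        # Closed-form CRPS for the Gaussian predictive CDF N(mu, sigma^2)
        # and verifying observation y (smaller is better).
        z = (y - mu) / sigma
        return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z)
                        - 1 / np.sqrt(np.pi))

    rng = np.random.default_rng(3)
    y = rng.normal(0, 1, 1000)                     # observations
    print(f"calibrated, sharp N(0,1): {crps_normal(0.0, 1.0, y).mean():.3f}")
    print(f"overdispersed   N(0,9) : {crps_normal(0.0, 3.0, y).mean():.3f}")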

The talk closes with a case study on probabilistic forecasts of wind speed at the Stateline wind energy center in the US Pacific Northwest.

This presentation is based on joint work with Adrian Raftery and Fadoua Balabdaoui, both at the University of Washington, and Kristin Larson and Kenneth Westrick, both at 3Tier Environmental Forecast Group, Inc., Seattle.
 

3.2 Methods for verifying quantile forecasts   Download presentation (PDF)

John Bjørnar Bremnes
Norwegian Meteorological Institute, Research and Development Department
Oslo, Norway

Probabilistic forecasts of continuous univariate variables, such as wind speed, are ideally fully specified probability distributions. The focus in this presentation is on the slightly simpler case when only a few quantiles are forecast or available. Good quantile forecasts should possess certain properties, and verification approaches that quantify these properties are described here.

First, reliability, defined as the degree to which the fractions of observations below each quantile equal the quantile probabilities in the long run, is proposed to be assessed using the chi-square hypothesis test for multinomial data. This test can be applied separately to each quantile or simultaneously to all. In addition, it is discussed how to examine whether reliability is independent of the quantile values. Second, the average length of forecast intervals formed by pairs of quantiles is suggested as a natural measure of sharpness, although it is argued that it might be inadequate if multi-modal distributions are frequent. Third, the resolution, or degree of variation in the forecast quantiles (or in the lengths of the forecast intervals), can be quantified by statistics such as the standard deviation or simply the range. The presentation ends with a short discussion of how to rank quantile forecasting models.
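
A minimal sketch of the reliability check for a set of quantile forecasts (levels, data, and predictive model all invented): observations are binned by the forecast quantiles, and the bin counts are compared with the implied multinomial probabilities.

    import numpy as np
    from scipy.stats import chisquare, norm

    rng = np.random.default_rng(7)
    n = 500
    levels = np.array([0.1, 0.5, 0.9])            # forecast quantile levels

    # Invented example: predictive distribution N(mu, 1) with varying mu;
    # the issued quantiles come from the correct predictive distribution,
    # so the test should usually not reject.
    mu = rng.normal(0, 2, n)
    obs = mu + rng.normal(0, 1, n)
    q = mu[:, None] + norm.ppf(levels)[None, :]   # n x 3 quantile forecasts

    # Bin each observation by its own forecast quantiles, then compare the
    # bin counts with the multinomial probabilities implied by the levels.
    bins = (obs[:, None] > q).sum(axis=1)         # 0, 1, 2 or 3
    counts = np.bincount(bins, minlength=4)
    expected = n * np.diff(np.concatenate(([0.0], levels, [1.0])))
    stat, pval = chisquare(counts, expected)
    print(f"chi-square = {stat:.2f}, p = {pval:.3f}")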
 

3.3 Composite-based verification of warm season precipitation forecasts from a mesoscale model   Download presentation (PDF)

Jason Nachamkin
Naval Research Laboratory, Monterey, CA

Weather forecasts are issued as probabilities over general regions because our knowledge of the future is inexact.  Good forecasts allow for high probabilities over concentrated areas, whereas bad forecasts require lower probabilities over larger areas.  In this sense, “goodness” can be measured in terms of a probability density, or more simply, in terms of the expected distributions of observations and forecasts given that a subgroup of events is predicted or observed.   This approach is applied to heavy precipitation forecasts through the use of event composites.  Composite sample methods offer a simple way to collect and evaluate the probability distribution functions of the predicted and observed fields for specifically defined events.  False alarms and missed forecasts can be diagnosed using criteria based on average rain amount and intensity across the sample area.  Other diagnostics can be derived regarding the structure of the PDFs given that an event is either predicted or observed.

The composite method has been applied to the operational COAMPS™ forecasts on the 27 km grid for the warm season precipitation regime over the United States for 2003.  The initial results indicate that about 50% of the events sampled were false alarms or missed forecasts.  The remaining 50% were relatively “good” forecasts in that the predicted and observed PDFs had similar structures.  For these good forecasts, variability within the sample on any given day was quite high.  But in general a forecast of precipitation meeting the event criteria would have been correct within the sample collection area on these days.
 

4.1 Spatial and object-oriented verification   Download presentation (PDF)

Barbara Brown
National Center for Atmospheric Research, Boulder, CO

Beth Ebert
Bureau of Meteorology Research Centre, Melbourne, Australia

Automated spatial forecasts are usually computed on a grid and presented as maps. Verifying these forecasts requires that they be matched against observations, either by mapping the observations onto the same grid as the forecast or interpolating the forecast to the observation sites. If the field is not smooth then errors of representativity are introduced into the verification (in the case of numerical models these errors tend to be much smaller than the forecast errors). Standard categorical and continuous verification scores such as POD, FAR, RMSE, and correlation coefficient are often used to assess how well the forecast field represents the observed field. Human-generated forecasts, in the form of polygons, typically are treated in a similar manner, with the forecasts mapped to a grid.

Verification of mesoscale model forecasts presents a particular challenge since the higher resolution forecast and observation fields often show considerable spatial and temporal structure. Site-based observations may not capture the level of detail predicted by a model, making it difficult to assess whether the forecast is correct or not. A forecast may predict a particular weather feature at approximately the right place and time and be demonstrably useful to a forecaster or other user, yet it may score poorly according to standard verification statistics because of small offsets in position or timing. The traditional verification statistics are unable to diagnose these possible causes of the poor scores; essentially they treat all types of errors in the same manner.

In contrast, object-oriented verification approaches treat weather features as spatially connected entities rather than as a set of independent forecast/observation pairs. Examples include rain areas, convective complexes, hurricanes, low pressure centers, jet maxima - essentially any phenomenon that can be meaningfully defined by a closed contour on a map. Object-oriented verification assesses the properties of the forecast and observed entities, such as location, size, shape, mean and maximum intensity, and even whether the entity exists in both the forecast and the observations. This intuitive approach attempts to mimic the visual interpretation of the human while producing objective, quantitative, and hopefully more meaningful, output.

Approaches that are intuitive and which decompose the forecast error into diagnostic components are beginning to reach maturity. Each approach has its own focus and set of attributes; in many respects the available methods are complementary. A review of the approaches reveals how they may be applied in a variety of types of studies, and how they may lead to an enhanced understanding of forecast quality.
 

4.2 Object identification techniques for object-oriented verification   Download presentation (PDF)

Michael Baldwin
CIMMS, University of Oklahoma, Norman, OK

The topic of object-oriented verification has recently received much attention (e.g., Baldwin et al. (2002), Bullock et al. (2004), Chapman et al. (2004)). A critical aspect of this verification method is the use of automated object identification procedures.  Unfortunately, the ideal identification method will vary depending upon the user of the verification information, the type of variable analyzed, etc.

This paper will present results of object-oriented QPF verification using several object identification techniques, ranging from the simple thresholding technique proposed by Baldwin and Lakshmivarahan (2003), to agglomerative methods such as the one presented by Lakshmanan (2001). The sensitivity of verification results to the choice of object identification technique will be examined.
 

4.3 Recent progress on object-oriented verification   Download presentation (PDF)

Randy Bullock, Barbara Brown, Chris Davis, and Mike Chapman
National Center for Atmospheric Research,  Boulder, CO USA

This presentation will detail recent progress in the development of an object-oriented forecast verification system. Progress has been made in several areas, most notably with respect to the problem of matching forecast and observed objects, as well as merging objects in one field into composite objects.  These steps are handled using a  fuzzy-logic approach, by weighting various interest fields and combining them into a single indicator. This approach allows flexibility in combining disparate measures of object "closeness".

In addition, several new areas have been explored, such as incorporating the time dimension to create higher-dimensional objects that can be used for object tracking, and to keep track of objects splitting or coalescing over time.  Slicing these objects in various ways can produce results reminiscent of so-called Hovmoller diagrams.

Finally, a new software tool is being developed that will allow scientists and others to vary the fuzzy maps, interest fields and weights to see how the object matching and merging results change for single cases. This capability allows scientists to enter the design loop in creating fuzzy systems, rather than merely giving feedback to designers based on end results.
 

4.4 Verification of quantitative precipitation forecasts using Baddeley’s delta metric   Download presentation (PDF)

Thomas C.M. Lee
Colorado State University, Fort Collins, CO

Eric Gilleland, Barb Brown, and Randy Bullock
Research Applications Program, National Center for Atmospheric Research, Boulder, CO

Over the last several years many new methods have been proposed for evaluating the results of quantitative precipitation forecasts, partly because of the need for information that is more meaningful in an operational context than what traditional grid-based verification approaches provide. The development of alternative verification approaches for these forecasts has been an important focus of research in the verification community.

A major advantage of these new approaches is that they make it possible to diagnose the specific sources of errors in forecasts. One such approach uses a threshold convolution method to define objects that cover areas of forecast and observed precipitation (see presentation by Bullock). These objects are represented as binary images. This paper proposes a technique to match forecast objects with observed objects in order to subsequently evaluate the forecast image in a meaningful and accurate way.

Baddeley’s delta metric is used to rank object matches, as well as to decide if some objects should be merged. Ideally, all combinations of possible object mergings should be compared. However, if there are m forecast objects and n observed objects, the total number of combinations is 2^m · 2^n, which would generally be too computationally intensive to compare in practice. Therefore, the proposed technique only examines a reasonable subset of these combinations. Initial results of this matching method are promising. In addition, it appears that Baddeley’s delta metric will be useful for verification of the forecast images.
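
A sketch of Baddeley's delta metric for binary images, following its usual definition as a p-norm of the difference between cutoff-transformed distance maps (the images and parameters here are invented):

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def baddeley_delta(A, B, p=2, cutoff=50.0):
        # Baddeley's delta: p-norm of the pixelwise difference between the
        # (cutoff-capped) distance maps of the two binary images.
        # distance_transform_edt gives the distance to the nearest zero
        # pixel, so the complement yields distance to the nearest object.
        dA = np.minimum(distance_transform_edt(~A), cutoff)
        dB = np.minimum(distance_transform_edt(~B), cutoff)
        return np.mean(np.abs(dA - dB) ** p) ** (1.0 / p)

    # Two invented "precipitation objects": same shape, displaced 15 pixels.
    fcst = np.zeros((200, 200), dtype=bool)
    obsv = np.zeros((200, 200), dtype=bool)
    fcst[80:120, 60:100] = True
    obsv[80:120, 75:115] = True
    print(f"delta = {baddeley_delta(fcst, obsv):.2f}")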
 

4.5 Diagnostic verification measures associated with object-oriented verification approaches   Download presentation (PDF)

Barbara G. Brown, Chris Davis, Randy Bullock, Mike Chapman, and Kevin Manning
National Center for Atmospheric Research, Boulder, CO

As development of an object-oriented approach for verification of convective and quantitative precipitation forecasts is beginning to approach a level of maturity, development and implementation of measures that allow diagnostic evaluation of the forecast objects is advancing. Ideally, the measures identified will respond to the needs of forecast developers and users. To meet these needs, particular forecast features that are relevant for the developers and users must be identified and evaluated.

A primary set of features has been identified that may meet the basic needs of many users. Secondary attributes may also be identified that are specific to a particular set of users. The primary features relate to basic characteristics of the forecast objects, the observed objects, and their relationship (much as the Murphy and Winkler statistical framework for verification considers the forecasts, observations, and their relationship). Examples include the forecast area, the observed area, and their overlap or intersection; and the forecast location, observed location, and their displacement from each other. An example of a secondary feature might be the north-south extent of the object, which is a relevant feature for aviation traffic management. This feature would be measured for the forecast object and the observed object, and their difference would be computed. It is important to note that the same features must also be measured for those forecast and observed objects that do not have a match.

Once the object features and relationships have been measured for a set of forecasts, they can be summarized in any number of ways. In general, the attributes described by Murphy et al. for continuous variables should be taken into account. These attributes can be represented using scatterplots, conditional quantile plots, discrimination diagrams, box plots, and so on.

The diagnostic measures and displays are applied to a set of matched precipitation objects for the summer season in 2002. In this case precipitation forecasts from the Weather Research and Forecasting (WRF) model run on a national (U.S.) domain are compared to observations from the U.S. Stage IV precipitation analysis. The results demonstrate that valuable information about the quality of the precipitation forecasts can be obtained using this approach.
 

4.6 Statistical cluster analysis for verification of spatial fields   Download presentation (PDF)

Caren Marzban
University of Washington, marzban@stat.washington.edu

The verification of spatial fields (e.g. precipitation over some region) is a difficult and complex problem. One facet of the problem originates from the fact that grid-based reports and/or forecasts are spatially correlated. One approach to handling this problem is to perform the verification within an event-based or object-oriented framework. In this paper a method generally referred to as (statistical) cluster analysis is utilized for identifying such events or objects. Specifically, cluster analysis is applied to both observation and forecast precipitation fields, and then the two fields are verified against one another in terms of the various identified clusters. It is shown that the method generally works well in terms of allowing an objective and automated verification of spatial fields.
 

4.7 An error decomposition method: Application to Mediterranean SST simulations assessment   Download presentation (PDF)

Z. Ben Bouallegue, A. Alvera-Azcarate, J.-M. Beckers
GHER, University of Liege, Belgium

Fields composed of daily simulations provided by an OGCM of the Mediterranean Sea are compared to weekly satellite observations. The method used is inspired by the object-oriented verification procedure introduced into meteorological forecast assessment by Ebert et al. (2000). The Error Decomposition Method presented here aims to identify error sources.

The method is carried out within the framework of the MFSTEP hindcasts. The MFSTEP project is an international scientific collaboration program which aims to create an operational forecasting system for the Mediterranean Sea. The simulations provided at the basin scale are 10-day forecast fields of the 3-D ocean. The hydrodynamic model primitive equations are combined with the data assimilation scheme SOFA, applied every week. The data used for the comparison are weekly SST satellite observations and means of seven daily MFSTEP simulations (analyzed fields) for the equivalent weeks.

The original simulation is transformed until the total squared difference between the observed and hindcast fields is minimized. Successively, a new combination of seven consecutive daily simulations is produced, the new SST field is displaced horizontally, and the bias is suppressed. This allows a decomposition of the total error into four parts: a temporal shift error, a position error, an intensity error and a pattern error. This last element is the error remaining after the transformation of the simulation field and corresponds to the unexplained error.
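
Schematically (our notation, not the authors'), the decomposition can be written

    e_{total}^2 = e_{time}^2 + e_{position}^2 + e_{intensity}^2 + e_{pattern}^2,

where the first three terms are the reductions in squared error obtained by the optimal recombination of daily fields, the optimal horizontal displacement, and the bias removal, respectively, and the pattern term is the residual.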

The method is applied to different restricted areas of the Mediterranean basin. The predominant displacements in time and in space that minimize the error are discussed in terms of the physical processes taking place at each location. Moreover, the ratio between the different error components is analysed in terms of scale effects: the role of the application domain size is pointed out. Finally, the seasonal impact on the different results is commented upon.
 

4.8 An event oriented approach to the verification of summer monsoon rainfall over West Bengal, India

V. Mandal1, U.K. De1, B. K. Basu2
1Department of Physics, Jadavpur University, Kolkata 32, India,
2National Centre for Medium Range Weather Forecasting, New Delhi, India.

The Indian summer monsoon season is characterized by the formation and movement of low pressure systems across the country, giving rise to much needed widespread rainfall for cultivation. The prediction, as well as the verification, of the Indian summer monsoon is therefore considerably important. A global spectral model (T-80) is integrated operationally at the National Centre for Medium Range Weather Forecasting (NCMRWF), New Delhi, India, to produce forecasts up to 5 days ahead from the 00 UTC initial condition. The data assimilation procedure used at NCMRWF is very similar to that of the National Center for Environmental Prediction (NCEP), USA.

The present article describes an event oriented verification approach for the quantitative precipitation forecasts of the NCMRWF model over Gangetic West Bengal, India, which has almost uniform terrain, and over north West Bengal, part of the foothills of the Himalayan Mountains, with high topography. The observed precipitation data were collected from the Agricultural Department of West Bengal, India and also from the India Meteorological Department. All observations are rainfall accumulations at 0300 UTC, collected by standard manual rain gauges and measured to the first decimal place in mm.

Monsoon rainfall has been well characterized through rigorous research. The Indian summer monsoon has active and break periods. During break monsoon conditions, rainfall vanishes over Gangetic West Bengal but increases over the foothills of the Himalayas. In one season there may be a number of active and break spells of the monsoon. It has been found that zonally oriented cloud bands move northward from near-equatorial latitudes in the monsoon region. The fluctuation in cloud cover is strongly related to the active and break spells of the monsoon. These spells are usually of low frequency. In addition to the low frequency oscillation, shorter period fluctuations are also found in monsoon rainfall. The phase of this short period oscillation may or may not be temporally stable; temporal stability is assumed here to require persistence for three consecutive days.

The main goal of the article is to develop a distributive method to evaluate the stable nature of monsoon rainfall. First, rainfall is categorized into classes according to different threshold values, so that the total time series can be represented as a linear combination of independent states (j1, j2, j3, j4, j5, ...):

j = {j1, j2, j3, j4, j5, ...}

Each independent state is defined to have some integer value,

j1 = a, j2 = b, j3 = c, ...

where a, b, c, ... are particular integer values. The projection of the time series following this definition can easily distinguish the stable phase from other phases.

The objective analysis method always introduces some error into the observed values generated at grid points. Thus, when verifying the stable phase between forecasts and observations, one can define at least four different classes of model performance, depending upon the transition between two consecutive states.

A contingency table can be formed for two consecutive states j1 and j2. Let j1 and j2 be represented by an open circle and a solid circle, respectively, while higher states (e.g. j3, j4, ...) are represented by a star. The existence of three consecutive open or solid circles then represents a stable monsoon phase.
 

5.1 Scale separation in verification measures   Download presentation (PDF)

Barbara Casati
Meteorological Service of Canada

Weather phenomena on different spatial scales are often triggered by physical processes of a different nature. As an example, large-scale frontal systems are driven by the global circulation of the atmosphere, whereas small-scale showers are often generated by convection. Scale separation in verification measures can provide informative feedback on the nature of the forecast error and helps detect those physical processes in NWP systems that need further development.

This talk reviews some recently developed scale separation verification techniques. Each technique will be critically analyzed. Comparisons between the scale separation techniques and their role in the context of the other existing verification measures will be discussed.
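
As a crude illustration of the idea (a simple block-average decomposition of our own, not any of the specific techniques reviewed in the talk), an error field can be split into band-pass components whose variance indicates how much error lives at each scale:

    import numpy as np

    def block_smooth(field, size):
        # Mean over non-overlapping size x size blocks, replicated back to
        # the original grid (a crude low-pass filter).
        n = field.shape[0]
        f = field[: n - n % size, : n - n % size]
        coarse = f.reshape(n // size, size, n // size, size).mean(axis=(1, 3))
        return np.kron(coarse, np.ones((size, size)))

    rng = np.random.default_rng(2)
    err = rng.normal(0, 1, (64, 64)) \
        + 2 * np.sin(np.linspace(0, 4 * np.pi, 64))[:, None]

    # Band-pass components: differences of successive low-pass fields; their
    # sum reconstructs the (mean-removed) error field.
    scales = [1, 2, 4, 8, 16]
    lowpass = [block_smooth(err, s) for s in scales] + [np.full_like(err, err.mean())]
    for s, (fine, coarse) in zip(scales, zip(lowpass, lowpass[1:])):
        comp = fine - coarse
        print(f"scale {s:2d}-{2 * s:2d} px: error variance = {np.mean(comp ** 2):.3f}")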
 

5.2 On the use of high-resolution network observations to verify precipitation forecast   Download presentation (PDF)

Anna Ghelli
European Center for Medium Range Weather Forecasts, Reading, UK

The use of high-resolution observations to produce model-oriented verification has been investigated at ECMWF. High-resolution network data are collected and up-scaled to generate a gridded precipitation analysis that represents a better match to the areal precipitation forecast from models.

Results of deterministic and probabilistic verification for the European area will be presented, as well as verification of the ECMWF precipitation forecast over the USA.
 

5.3 Verification against precipitation observations of a high density network - What did we learn?   Download presentation (PDF)

Ulrich Damrath
Deutscher Wetterdienst, Offenbach, Germany, ulrich.damrath@dwd.de

A high density network for precipitation observations over Germany (around 3500 stations) was used to investigate the error characteristics of a limited area model (mesh width 7 km) and a global model (mesh width 60 km). Traditional cross sections show typical error structures related to the orientation of the mountains with respect to frequently occurring wind directions. Upscaling provides knowledge about the scale of predictability, especially for the limited area model. With pattern recognition methods, typical errors can be quantified. An attempt at fuzzy logic verification demonstrates an alternative look at observed and forecast precipitation amounts.
 
 

5.4 Scale sensitivities in model precipitation skill scores   Download presentation (PDF)

Andrew F. Loughe, Stephen S. Weygandt, Jennifer L. Mahoney, Stanley G. Benjamin
NOAA Forecast Systems Laboratory, Boulder, CO

Statistical measures that are traditionally used for the evaluation of precipitation forecast skill are influenced by variations in the scale of features in both the forecast and verification fields. This scale-dependence complicates the comparison of precipitation fields that contain varying degrees of detail, and is especially evident for warm season precipitation, which is dominated by convective storms. These storms produce precipitation patterns with significant small-scale variability, which is hard to predict accurately.  With the ever-increasing grid resolution of numerical models, forecast precipitation fields with small-scale detail, similar to what is actually observed, can now be generated. Traditional dichotomous skill scores, such as the equitable threat score, are frequently lower (worse) for these highly detailed forecasts than for forecasts with less small-scale detail. This is primarily due to the large number of “near misses” produced for precipitation maxima by these high-resolution models. The degree to which small-scale details should be retained in mesoscale models, and the pressing need for more sophisticated techniques for verifying these features, are important problems confronting the verification and mesoscale modeling communities.

In this study, we quantitatively document the scale sensitivities in precipitation skill for four numerical model formulations run in support of the International H2O Project (IHOP). The model comparisons include the operational 12-km Eta, operational 20-km RUC, experimental 10-km RUC, experimental 12-km LAPS/MM5, and experimental 12-km LAPS/WRF. Comparisons of the equitable threat score (ETS) and bias are made for each of these models, verified against the Stage IV precipitation analysis.  These results are computed on the native model grids, and on systematically coarsened grids.

In the first set of experiments, both the forecast and verification fields are upscaled (two-way smoothing), allowing for the assessment of scale impacts from forecasts with significantly different energy spectra.  By upscaling higher-resolution forecasts to coarser grids, we are able to isolate the impact of the smoothing on these skill scores.  The comparison of traditional scores is complemented by spectral analyses of the forecast and verification fields. In the second set of experiments, only the forecast fields are smoothed (one-way smoothing), allowing for evaluation of the usefulness of enhanced precipitation detail, as reflected by traditional skill scores.

The primary focus of this paper is the first set of experiments, involving two-way smoothing.  For these experiments, we document the skill-score dependence on the spectral characteristics and bias of the precipitation field, thus illustrating the impact of scale on these scores. Preliminary results from the second set of experiments, in which only the forecast fields are smoothed, will also be shown. Overall, these results support earlier research suggesting that it is difficult to show improvement in ETS for models with increasingly fine grid resolution.
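
The two-way smoothing experiment can be sketched as follows (toy fields and thresholds of our own invention): a displaced, noisier "high-resolution" forecast is verified against a smooth analysis on successively coarsened grids.

    import numpy as np

    def ets(fcst, obsv, threshold):
        # Equitable threat score for the event "value >= threshold".
        f, o = fcst >= threshold, obsv >= threshold
        hits = np.sum(f & o)
        hits_random = np.sum(f) * np.sum(o) / f.size   # expected chance hits
        return (hits - hits_random) / (np.sum(f) + np.sum(o) - hits - hits_random)

    def upscale(field, factor):
        # Block-average a 2-D field onto a grid coarsened by `factor`.
        ny, nx = field.shape
        trimmed = field[: ny - ny % factor, : nx - nx % factor]
        return trimmed.reshape(ny // factor, factor,
                               nx // factor, factor).mean(axis=(1, 3))

    # Smooth "analysis" blob vs. a displaced forecast with small-scale noise.
    rng = np.random.default_rng(5)
    x, y = np.meshgrid(np.linspace(0, 6, 240), np.linspace(0, 6, 240))
    obsv = 10 * np.exp(-((x - 3) ** 2 + (y - 3) ** 2))
    fcst = 10 * np.exp(-((x - 3.3) ** 2 + (y - 3) ** 2)) + rng.gamma(1.0, 0.5, x.shape)

    for factor in (1, 4, 8):
        f, o = upscale(fcst, factor), upscale(obsv, factor)
        print(f"coarsening x{factor}: ETS = {ets(f, o, 2.0):.3f}")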
 

5.5 Verification of mesoscale modeling for the severe rainfall event over southern Ontario in May 2000   Download presentation (PDF)

Zuohao Cao1, Pierre Pellerin2, and Harold Ritchie3
1Meteorological Service of Canada, Ontario, Canada
2Meteorological Service of Canada, Quebec, Canada
3Meteorological Service of Canada, Nova Scotia, Canada

A coupled atmospheric-hydrological model (CAHM) with high-resolution, self-nesting and one-way coupling capability is employed to simulate the severe rainfall event that led to a flood in May 2000 over southern Ontario. Three verification approaches are carried out to evaluate the performance of the atmospheric mesoscale model. The results show that the 48-h accumulated peak precipitation simulated by the mesoscale model successfully captures the observed peak rainfall recorded over a spatially dense raingauge network in southern Ontario. Furthermore, the quantitative evaluation of the model predicted precipitation demonstrates that there is a systematic improvement in accuracy and skill when the model resolution is increased. In addition, an independent verification comparing the CAHM simulated streamflow with the observed hourly streamflow shows excellent agreement between the simulations and the observations in terms of the magnitude and timing of peak streamflows, indicating that precipitation is well simulated by the atmospheric mesoscale model.
 

5.6 Evaluation of GFS model in predicting the daily rainfall over Ethiopia: A case study

Dawit Gezmu
National Meteorological Services Agency (NMSA), Addis Ababa, Ethiopia

The objective of this study is to evaluate the performance of the GFS model in predicting the daily rainfall over Ethiopia. The data used in this study include the daily spatial precipitation forecasts and the daily station observed precipitation values, which are converted into spatial precipitation using the SURFER software, for the months of April and May 2004. The station data were obtained from NMSA and the spatial precipitation forecasts were obtained from NCEP.

The methodology applied in this study is a combination of the subjective eye-ball method and the objective method of the multi-category contingency table. That is, first, the country is divided into grid-boxes of size 3° longitude by 3° latitude. Second, for both the actual and forecast spatial precipitation, the fraction of each grid-box covered by rain and the maximum rainfall value in each grid-box are determined. Third, for each grid-box the number of days with Isolated (<25% of the area), Scattered (25 - 50% of the area), Fairly Widespread (50 - 75% of the area) and Widespread (>75% of the area) rainfall are counted. Finally, multi-category contingency tables are prepared and some statistics are applied.
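
A sketch of the multi-category counting step under the assumptions above (the daily coverage fractions are invented; the four classes follow the area thresholds listed):

    import numpy as np

    rng = np.random.default_rng(4)
    # Invented daily rain-coverage fractions for one grid-box (forecast, observed).
    fc = rng.random(61)
    ob = np.clip(fc + rng.normal(0, 0.15, 61), 0, 1)

    # Isolated / Scattered / Fairly Widespread / Widespread class boundaries.
    edges = [0, 0.25, 0.5, 0.75, 1.01]
    f_cat = np.digitize(fc, edges) - 1
    o_cat = np.digitize(ob, edges) - 1

    # 4 x 4 contingency table: rows = forecast class, columns = observed class.
    table = np.zeros((4, 4), dtype=int)
    np.add.at(table, (f_cat, o_cat), 1)
    print(table)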

The results indicate that there is a good relationship between the forecast and what is actually observed.
 

5.7 Optimizing METAR network design to remove potential bias in verification of cloud ceiling height and visibility forecasts   Download presentation (PDF)

Eric Gilleland
Research Applications Program, National Center for Atmospheric Research, Boulder CO

Verification of cloud ceiling and visibility forecasts is performed based on data from surface METAR stations, which for some areas are densely located and for others only sparsely located. Forecasts, made over an entire grid, may be “penalized” multiple times for an incorrect forecast if there are many METAR stations situated closely together.  Conversely, correct forecasts in areas with numerous METAR stations may be too highly “rewarded” if the forecast misses in an area with fewer stations. A coverage design technique in conjunction with a percent agreement analysis is employed to find an “optimal” network design to be used to better score forecasts over densely observed regions. Two regions of interest are examined: one in northern California and the other in the New England region. Results indicate that thinning the network in California may not be appropriate, at least without accounting for nonstationarity, whereas results in the New England area suggest that substantial network thinning is appropriate for this region.
 

5.8 US NWS gridded marine verification: Derivation of ‘true’ wind and wave fields at NDFD grids using a geostatistical approach   Download presentation (PDF)

Matthew Jin
NWS Headquarters

The US NWS gridded marine verification requires that ‘true’ (or observed) values of the wind and significant wave height fields be available at National Digital Forecast Database (NDFD) grid points at the required verification times. Although observations are available from several sources such as fixed buoys and CMANs, drifting buoys, ships, and scatterometers on polar-orbiting satellites, such observations are generally scattered in space and time and coincide with neither the verification grids nor the desired verification times. Furthermore, the observations from different sources have varying reliabilities. To best incorporate the multiple sources of observations in deriving the ‘truth’ fields on the NDFD grids at all verification times, a class of geostatistical estimators is proposed for the estimation of the ‘truth’ fields. The proposed approach takes into account the distinctive precision and accuracy of the various measurement devices, as well as the spatial-temporal correlation structure of these fields, via the use of variograms. The performance of the proposed approach is evaluated and its advantages are discussed.
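
One ingredient of the proposed approach, the empirical variogram, can be sketched as follows (classical Matheron estimator; coordinates, units, and data invented):

    import numpy as np

    def empirical_variogram(coords, values, bin_edges):
        # Classical (Matheron) estimator: half the average squared difference
        # between observation pairs, binned by separation distance.
        n = len(values)
        i, j = np.triu_indices(n, k=1)
        dist = np.linalg.norm(coords[i] - coords[j], axis=1)
        sq_diff = 0.5 * (values[i] - values[j]) ** 2
        gamma = np.full(len(bin_edges) - 1, np.nan)
        for k in range(len(bin_edges) - 1):
            in_bin = (dist >= bin_edges[k]) & (dist < bin_edges[k + 1])
            if in_bin.any():
                gamma[k] = sq_diff[in_bin].mean()
        return gamma

    # Invented scattered wind-speed observations with a smooth spatial signal.
    rng = np.random.default_rng(11)
    coords = rng.uniform(0, 100, (300, 2))           # km
    values = 5 + 0.05 * coords[:, 0] + rng.normal(0, 1, 300)
    print(empirical_variogram(coords, values, np.linspace(0, 50, 6)).round(2))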
 

6.1 The economic value of weather forecasts   Download presentation (PDF)

Jeffrey K. Lazo
Research Applications Program/Environmental and Societal Impacts Group, NCAR, Boulder, CO

In this paper we discuss the basic approach to defining and measuring the value of weather forecast information from an economist’s perspective. The classical theory of value asserts that the value of an object or service is derived from a consumer’s marginal utility for that good. We thus first discuss the basic concepts of marginal utility theory. Weather forecasting by its nature involves uncertainty, and we thus extend the discussion to economic approaches to decision making under uncertainty. It is within this context of decision making under uncertainty that information, such as weather forecast information, has value. The problem of valuing weather information is compounded further by the fact that, in general, weather forecasts are not bought and sold in markets, and thus there is little direct market information on the value of forecasts. These “non-market” characteristics of weather forecasts lead to their being defined by economists as public goods or quasi-public goods. We thus discuss methods for measuring the value of non-market commodities (such as forecasts). We complete the discussion by identifying areas where additional information is likely to be needed in order to better link changes in the “quality” of weather forecasts to economic measures of the value of such changes.
 

6.2 Defining observations fields for verifying convective forecasts that are aligned with user interpretation of the forecast   Download presentation (PDF)

Jennifer Luppens Mahoney1, Barbara Brown2, Joan E. Hart1, and Mike Kay3
1NOAA Forecast Systems Laboratory, Boulder, CO
2National Center for Atmospheric Research, Boulder, CO
3Cooperative Institute for Research in Environmental Sciences (CIRES) University of Colorado/NOAA Research-Forecast Systems Laboratory, Boulder, Colorado

A critical challenge in evaluating the quality of convective forecasts that are geared toward aviation users is defining the observations so that they reflect the forecast attributes and characteristics, the spatial and temporal scale of the forecast, and, most importantly, portray the operational use and interpretation of the forecast.  Meeting this challenge is often difficult when various convective products are not directly geared toward a unique user or decision-making process, and provide slightly different information.

For example, the Collaborative Convective Forecast Product (CCFP) issued by the National Weather Service (NWS) provides information to air traffic managers regarding a forecast of convective coverage, for a minimum sized area, with echo tops reaching a predefined height.  The interpretation of these criteria indicates to air traffic managers the severity of convection and its potential impact on the flow of air traffic.  Thus, the observations used to evaluate the quality of the CCFP must specifically reflect these criteria.

Therefore, we will present our techniques for defining the observations so that they are aligned with user interpretation of the forecast.  In addition, we will show how differing user interpretation impacts the creation of the observational datasets and ultimately the statistical results that are used to evaluate the quality of the forecasts.
 

6.3 TAF verification in the U.S. National Weather Service   Download presentation (PDF)

Charles K. Kluepfel
NOAA/National Weather Service, Office of Climate, Water, and Weather Services, Silver Spring, Maryland

The United States National Weather Service (NWS) recently implemented a new terminal aerodrome forecast (TAF) verification program.  The program centralizes all data collection, data storage, computation, and data display.  It takes advantage of modern technology in terms of computer storage space and rapid data processing to help users evaluate the performance of the entire TAF.

The new system verifies all scheduled and amended TAFs issued by the NWS.  The following elements are verified: ceiling, visibility, wind direction, wind speed, wind gusts, and significant weather type.  To account for the frequent changes in weather that may impact aviation operations, each element of the TAF is compared to the latest surface observation once every five minutes.  Forecasts for temporary (TEMPO) and probabilistic (PROB) conditions, when included in the TAF, are also evaluated once every five minutes.  Verification statistics can be computed for all TAFs issued by a WFO or for a subset of desired terminals.  Each NWS forecaster may request verification statistics valid only for the TAFs he/she issued.  To protect each forecaster’s privacy, individually tailored statistics are protected with a password, so only the forecaster is able to see his/her results.

Users of the program access data through an internal NWS Web site, which utilizes the popular Stats on Demand format, where the user of the site specifies each data request with a set of initialization parameters that include the start and end dates for the period desired, the element desired, the type of forecast desired (e.g., prevailing forecast, TEMPO forecast), the issuance times desired, and the projection periods desired (e.g., 0-3 hours, 3-6 hours).  The program responds by retrieving the necessary forecasts and observations from the database, computing the verification statistics, and displaying the results online within one to two minutes.  Verification statistics for a large number of terminals (i.e., all terminals within the forecast areas of multiple WFOs) over relatively long time periods (i.e., three or more months) take longer to compute but are still available.
 

7.1 Design of operational verification systems   Download presentation (PDF)

Pertti Nurmi
Finnish Meteorological Institute, P.O. Box 503, 00101 Helsinki Finland

A posteriori verification, independent and separate from the operational daily forecast production chain, is of little practical benefit. An operational real-time (or online) verification software package, built seamlessly into the production chain, will provide immediate information on the general quality and the potential improvements inherent in any of the components of this chain. Features of such a verification package should cover, among other things, a human-friendly user interface, an exhaustive array of verification measures and statistics, means to aggregate and stratify the results, and coverage of all conceivable forecast and guidance products, like deterministic and probabilistic NWP output, their (statistical) interpretations and, at the end of the chain, the forecasters’ final forecasts. Since final forecasts are (still) generally produced, or at least touched, by human forecasters, who presumably exhibit individual forecasting behavior, a state-of-the-art verification system should include individual verification measures. Means to deliver personal feedback subtly to the individuals should be guaranteed. Additionally, individual (final) forecasts should be compared against the corresponding automated forecast output.

The author’s personal views and ideas on such an undertaking are introduced, supported by some very recent information on operational verification practices and methods in, among other countries, Australia, Germany, Hungary, Norway, and the USA.
 

7.2 Verification of Canadian public weather forecasts by an automated system   Download presentation (PDF)

Nelson Shum and Jeff Thatcher
Services Clients and Partners Directorate, Meteorological Service of Canada, Toronto, Ontario

An automated system for measuring the performance of the public weather forecast program in the Meteorological Service of Canada (MSC) became operational in April 2004. Currently, monthly verification scores for nine weather elements are generated by the system to help monitor and compare the performance of meteorologist-adjusted forecasts and corresponding model forecasts. These scores are calculated at three different levels of geographical coverage: forecast region level, forecast bulletin level, and provincial level. Users can access previously produced scores through a web-based interface and also generate a limited set of scores on demand.
Design of the system began in the fall of 2001 under the auspices of the Public Weather Forecast Performance Project. In the subsequent three years before the operational launch, the project faced many challenges, the most prominent being building a system from the ground up using technologies such as Linux clustering and PostgreSQL, all of which were relatively new to the MSC. With much of the required infrastructure in place, the system is now poised for further expansion. Near-future plans include defining scores suitable for the general public and developing metrics for measuring high-impact events.
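
The three-level scoring could look roughly like the following sketch, which pools forecast/observation pairs at each geographic level; the field names and values are invented for illustration.

```python
from collections import defaultdict

# Illustrative multi-level scoring: each forecast/observation pair is
# tagged with its region, bulletin, and province; RMSE is computed by
# pooling pairs at each level. All names and values are invented.
def rmse_by_level(pairs, level):
    groups = defaultdict(list)
    for p in pairs:
        groups[p[level]].append((p["forecast"] - p["observed"]) ** 2)
    return {k: (sum(v) / len(v)) ** 0.5 for k, v in groups.items()}

pairs = [{"region": "ONT-SW", "bulletin": "FPTO11", "province": "ON",
          "forecast": 21.0, "observed": 19.5}]
scores = {lvl: rmse_by_level(pairs, lvl)
          for lvl in ("region", "bulletin", "province")}
```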
 

7.3 The design and evaluation of a measure of forecast consistency for the Collaborative Convective Forecast Product   Download presentation (PDF)

Michael P. Kay
Cooperative Institute for Research in Environmental Sciences (CIRES), University of Colorado/NOAA Research-Forecast Systems Laboratory, Boulder, Colorado

Airline dispatchers and other groups, many of whom are not trained as meteorologists, are constantly forced to make decisions based upon their interpretations of the numerous guidance products and tools that have been created to aid the decision-making process. One product in particular, the Collaborative Convective Forecast Product (CCFP), was created to help unify airline operations around a cohesive set of forecasts that are well understood and well defined. Even though decision-makers may well be focusing on a limited set of forecast products like the CCFP, these products are updated every few hours. These updates present a great challenge: in particular, how to interpret situations in which forecast areas appear and disappear across the 2-, 4-, and 6-hour forecast lengths of the CCFP.

To this end, a procedure has been developed that assesses the spatial consistency of the CCFP as a forecaster would view it. Tools have been developed to present this information to users efficiently, allowing them to rapidly assess and visualize the consistency between a series of CCFP forecasts valid at the same time. Forecast verification information will be presented to determine whether or not forecasts that are more consistent are also more accurate.
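
One plausible way to quantify spatial consistency between successive forecasts valid at the same time, assuming the forecast areas have been rasterized onto a common grid, is an intersection-over-union ratio; this is an illustrative choice, not necessarily the procedure developed here.

```python
import numpy as np

# One plausible consistency measure (an assumption, not necessarily the
# CCFP procedure): the overlap ratio of two successive forecast areas,
# rasterized onto the same grid and valid at the same time.
def consistency(mask_earlier: np.ndarray, mask_later: np.ndarray) -> float:
    """Intersection-over-union of two boolean forecast-area grids."""
    union = np.logical_or(mask_earlier, mask_later).sum()
    if union == 0:
        return 1.0   # neither forecast drew an area: fully consistent
    inter = np.logical_and(mask_earlier, mask_later).sum()
    return float(inter) / float(union)
```

A value of 1 indicates the later forecast left the earlier area unchanged; values near 0 indicate the area moved, shrank, or disappeared between issuances.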

Recent work in the field of numerical weather prediction suggests that caution is warranted when associating forecast stationarity with accuracy. Discussion will focus on the utility of this new approach and the importance of not relying on forecast consistency alone when attempting to make decisions based upon the forecasts themselves.
 

7.4 A graphical technique for diagnosing significant error tendencies in meteorological forecast models   Download presentation (PDF)

Matthew C. Sittel
Lockheed Martin Information Technology, Inc., Offutt AFB, NE

Robert J. Craig
Air Force Weather Agency, Offutt AFB, NE

Global and regional meteorological forecast models are invaluable in the planning of worldwide day-to-day military operations.  The spatial coverage of such models makes it possible to determine forecast conditions for specific locations of interest, including over continuous paths where air or surface travel may take place.  The critical nature of these endeavors requires a determination of forecast utility for operational use.  Furthermore, military preparedness for worldwide deployment of troops mandates the ability to determine forecast utility anywhere within the domain of these models. The field user of such forecasts rarely has the time or training to quickly assimilate such data for all locations of interest, yet knowledge of a forecast’s potential to be incorrect can aid in the success and safety of military maneuvers.

The Quality Control Team at the Air Force Weather Agency (AFWA-QC) archives forecast errors for thousands of locations daily. These long-term records are the basis for a graphical product developed by AFWA-QC to depict model forecast accuracy for military field users. The technique summarizes forecast errors into two numbers whose values define the size and color of a symbol plotted on a map. These maps present long-term model performance in a manner that illustrates local as well as regional tendencies in forecast quality. The technique can easily be adjusted to depict error tendencies for any user-defined length of time or criterion. The data plots provide a simple yet informative method for a field user to ascertain forecast error tendencies and use this knowledge to make critical operational decisions.
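
The two-number encoding might be sketched as follows, with symbol size tied to error magnitude (e.g., RMSE) and color to signed bias; the specific encodings and station values are assumptions for illustration, not AFWA-QC's actual product.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sketch of a two-number error display: at each station, symbol size
# encodes error magnitude (here RMSE) and color encodes signed bias.
# Stations and values are invented for illustration.
lon = np.array([-95.9, -87.6, -104.9])
lat = np.array([41.3, 41.9, 39.7])
rmse = np.array([2.1, 3.4, 1.2])     # error magnitude, deg C
bias = np.array([-1.5, 0.8, 0.1])    # signed mean error, deg C

plt.scatter(lon, lat, s=80 * rmse, c=bias, cmap="coolwarm",
            vmin=-3, vmax=3, edgecolor="k")
plt.colorbar(label="mean error (bias)")
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.title("Forecast error tendencies by station (illustrative)")
plt.show()
```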
 

7.5 Methodology of the U.S. National Weather Service warning product verification   Download presentation (PDF)

Brenton W. MacAloney II
NOAA National Weather Service, Office of Climate, Water, and Weather Services, Silver Spring, MD

In 1995, the U.S. National Weather Service undertook a modernization of data collection for verification of severe thunderstorm, tornado, and flash flood warnings. Verification scores had been manually calculated since 1986, but this method was time-consuming and lacked the necessary quality control. The new process uses state-of-the-art data collection techniques and robust methods of quality control to ensure the verification statistics contain the most accurate data possible.

For event collection, an event databasing program named StormDat was created. For warning collection, software was developed to parse the information contained in warning text products into a database. Once a month, the warnings and events databases are matched to create a verification database. This database is then accessed through a web-based interface named Stats on Demand, which allows NWS employees to create custom verification reports in seconds via the World Wide Web.
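
A highly simplified sketch of the monthly matching step: an event verifies a warning when it falls within the warning's county and valid period, from which the usual probability of detection (POD) and false alarm ratio (FAR) follow. The matching rules and field names here are illustrative assumptions, not the operational logic.

```python
# Simplified warning/event matching: an event verifies a warning if it
# falls inside the warning's county and valid period. Field names and
# matching rules are illustrative assumptions.
def verify_warnings(warnings, events):
    def overlaps(w, e):
        return (w["county"] == e["county"]
                and w["start"] <= e["time"] <= w["end"])

    # events with at least one matching warning (hits)
    hits = sum(any(overlaps(w, e) for w in warnings) for e in events)
    # warnings with at least one matching event (verified warnings)
    verified = sum(any(overlaps(w, e) for e in events) for w in warnings)

    pod = hits / len(events) if events else float("nan")
    far = ((len(warnings) - verified) / len(warnings)
           if warnings else float("nan"))
    return pod, far
```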

Included in this presentation is an overview of the event/warning data collection process, demonstration of verification report generation, and an overview of verification scores calculated. Quality control procedures will also be discussed with emphasis on a real-time warning collection process. Plans for expansion of the program to other warning products will be mentioned.
 

7.6 Methodology for verification of mesoscale model predictions and analyses with atmospheric boundary layer profilers   Download presentation (PDF)

E. Astling (1), G. Dodd (2), and R-S. Sheu (3)
(1) Meteorology Division, West Desert Test Center, Salt Lake City, Utah
(2) H. E. Cramer Co., Salt Lake City, Utah
(3) Research Application Program, NCAR, Boulder, Colorado

A methodology was developed to use measurements from an array of 924-MHz boundary layer profilers (BLPs) with the Penn State/NCAR Mesoscale Model Version 5 (MM5) four-dimensional data assimilation (FDDA) system. Three BLPs were deployed in a triangular configuration in a mountain basin to continuously measure winds at 25-minute intervals between 120 and 2000 m AGL during three different seasons in 2003 and one season in 2000. Quality control procedures were applied to the measurements. FDDA output fields of horizontal wind components were bilinearly interpolated to the profiler locations. The total number of comparisons included more than 100,000 samples. The large sample size allowed for the development of subsets with respect to time of day and low-level atmospheric layers, from which bias scores, RMSE, correlation coefficients, and histograms of multiple-category forecasts were calculated for horizontal- and vertical-wind components. The results reveal large differences in verification statistics between layers and with respect to local time and season.
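
The interpolation and scoring steps might be sketched as follows; the grid handling is simplified (no edge cases) and all names are illustrative, not the FDDA system's actual code.

```python
import numpy as np

# Minimal sketch of the interpolation/scoring step: bilinearly interpolate
# a model wind component (on a regular, ascending lat/lon grid) to a
# profiler location, then compute bias and RMSE over the matched samples.
def bilinear(field, lats, lons, lat, lon):
    """Interpolate field[j, i] (lat x lon) to the point (lat, lon)."""
    j = np.searchsorted(lats, lat) - 1   # index of lower lat neighbor
    i = np.searchsorted(lons, lon) - 1   # index of lower lon neighbor
    wy = (lat - lats[j]) / (lats[j + 1] - lats[j])
    wx = (lon - lons[i]) / (lons[i + 1] - lons[i])
    return ((1 - wy) * (1 - wx) * field[j, i]
            + (1 - wy) * wx * field[j, i + 1]
            + wy * (1 - wx) * field[j + 1, i]
            + wy * wx * field[j + 1, i + 1])

def bias_and_rmse(forecasts, observations):
    """Mean error and root-mean-square error of paired samples."""
    err = np.asarray(forecasts) - np.asarray(observations)
    return err.mean(), np.sqrt((err ** 2).mean())
```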