Standard Error calculation on EMB verification pages

The estimate of verification metrics at each forecast length has an associated uncertainty due, in part, to the small number of runs going into the sample.

For profile plots, dieoff plots, diurnal plots, and time series plots with temporal averaging, we don’t see the statistics from the underlying individual runs (although they can be seen on the time series page if no averaging is used), but we can estimate the uncertainty in the mean statistic by considering the variations in the statistic over the time series of the matched model runs in the sample. The uncertainty of the mean (“standard error”) is estimated as:

Standard Error = σ/sqrt[(n-1)(1-φ)]

where σ is the standard deviation, n is the number of values in the sample, x is the set of statistics, and φ is the lag-one autocorrelation derived from the time series x. φ is derived empirically from the individual values with the approximation:

φ ≅ cor(x_(1 … (n - 1)), x_(2 … n))
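The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the code used by the verification pages: the function names and the sample series are hypothetical, and because the text does not say whether σ uses an n or an n - 1 denominator, the population form (`statistics.pstdev`) is assumed here.

```python
import math
import statistics


def lag1_autocorrelation(x):
    """Correlation between x_(1 .. n-1) and x_(2 .. n), used as the estimate of phi."""
    a, b = x[:-1], x[1:]
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    # Undefined (division by zero) if either sub-series is constant.
    return cov / math.sqrt(var_a * var_b)


def standard_error(x):
    """Standard error of the mean: sigma / sqrt((n - 1) * (1 - phi))."""
    n = len(x)
    sigma = statistics.pstdev(x)  # assumption: population standard deviation
    phi = lag1_autocorrelation(x)
    # Undefined if phi == 1 (a perfectly persistent series).
    return sigma / math.sqrt((n - 1) * (1 - phi))


# Hypothetical RMSE values, one per matched model run, at a single forecast length.
rmse_series = [1.0, 1.2, 1.1, 1.5, 1.6, 1.4, 1.9, 2.0]
print("phi =", lag1_autocorrelation(rmse_series))
print("standard error =", standard_error(rmse_series))
```

With a positively autocorrelated series such as this one, the (1 - φ) factor makes the standard error larger than the uncorrected value σ/sqrt(n - 1).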

The estimate of the standard error on the mean is distinct from the standard deviation of the sample. The standard error is an estimate of how well we understand the underlying, fundamental statistic at each forecast length. The standard deviation is an estimate of how far off the mean value any one statistic may be (Weatherhead et al., 1998). Thus, were our sample of runs to be repeated in a similar season and for a similar duration, we can say that the mean statistic has about a 68% likelihood of being within one standard error of our results, and a 95% likelihood of being within two (more precisely, 1.96) standard errors of our results.

This approach at least partially accommodates the fact that the individual statistics are autocorrelated.

Physically, this implies that weather situations, which can be more or less forecastable, can last for more than one day. Sequential forecast statistics are therefore not independent estimates but represent an over-sampling of the system, and the standard error equation above accounts for this. Note that the lag-1 autoregressive assumption generally refers to a 12-hour lag (for runs that are verified every 12 h) or a 24-hour lag (for 00Z-only or 12Z-only verification) as the most significant approximation to the autocorrelation. Note also that aggregating 00Z and 12Z results allows for a larger sample size, but may combine different physical causes of differences as well as different statistical properties of the time series.

We generate the errors on the paired differences in the same way. That is, for each forecast length, we generate a time series of paired differences, and calculate the mean and standard error for the differences using the methods described above.

In the figures generated, the boxes indicate +/- 2 (actually 1.96) standard errors from the mean. If, for instance, a difference curve is one half-box away from the zero line, that difference is statistically significant at the 95% confidence level.
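The paired-difference check can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names and the two model series are hypothetical, σ is assumed to be the population standard deviation, and the series are assumed to be already matched in time.

```python
import math
import statistics


def mean_and_standard_error(x):
    """Mean of x and its standard error, sigma / sqrt((n - 1) * (1 - phi))."""
    n = len(x)
    a, b = x[:-1], x[1:]
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    phi = cov / math.sqrt(var_a * var_b)  # lag-one autocorrelation estimate
    sigma = statistics.pstdev(x)  # assumption: population standard deviation
    return statistics.fmean(x), sigma / math.sqrt((n - 1) * (1 - phi))


def paired_difference_test(stat_a, stat_b, z=1.96):
    """Return (mean difference, half-box width, significant at 95%?)."""
    diffs = [a - b for a, b in zip(stat_a, stat_b)]  # time-matched pairs
    mean_diff, se = mean_and_standard_error(diffs)
    half_box = z * se  # the box spans mean +/- 1.96 standard errors
    return mean_diff, half_box, abs(mean_diff) > half_box


# Hypothetical per-run RMSE values for two models at one forecast length.
model_a = [2.30, 2.75, 2.15, 3.30, 2.48, 3.02]
model_b = [2.00, 2.50, 1.80, 3.00, 2.20, 2.70]
mean_diff, half_box, significant = paired_difference_test(model_a, model_b)
print(mean_diff, half_box, significant)
```

In this made-up case each series varies much more from run to run than their difference does, which is the situation in which paired differences are most useful.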

As an example, the figure below shows a difference curve that is significantly (95%) different from zero at most altitudes, but not at 150, 800, 300-150 hPa.

Also, in the figure below, the situation at 1000 hPa shows the power of paired differences. The error boxes on the red and blue curves at 1000 hPa overlap considerably, suggesting that the difference in the overall vector wind RMSE for the period is not significant (at the 95% confidence level). But the difference curve, which represents the mean and standard error of the time-matched paired differences, is significantly different from zero.

Some important details:

  1. The RMSE/bias curves themselves (such as the red and blue curves in the figure above) represent the overall RMSE, or bias, over the time period chosen. However, the difference curves are generated by comparing the statistics for each individual time period. So, the mean of the difference curve at each level is the mean of the differences of the curve statistics for each individual time period, whereas the values of the two curves being compared at each level are the overall values. For bias, these values are identical, but for RMSE they are not, because the overall RMSE is not necessarily equal to the average of the RMSEs at each individual time period. These differences are in general very small, but a careful user might notice them and be perplexed.
  2. If you choose ‘plot unmatched’, you can get difference curves, but not error bars, because the errors would not be accurate.
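The first point can be checked with a small numerical example. The error values below are made up for illustration, and each time period is assumed to contain the same number of forecast-observation pairs, which is the case in which the bias values agree exactly.

```python
import math
import statistics

# Hypothetical forecast-minus-observation errors, grouped by time period.
errors_by_time = [
    [1.0, -1.0, 2.0, 0.0],
    [3.0, 1.0, -2.0, 2.0],
    [0.5, -0.5, 0.5, 1.5],
]


def bias(errors):
    """Mean error."""
    return statistics.fmean(errors)


def rmse(errors):
    """Root-mean-square error."""
    return math.sqrt(statistics.fmean([e * e for e in errors]))


all_errors = [e for group in errors_by_time for e in group]

overall_bias = bias(all_errors)
mean_of_biases = statistics.fmean([bias(g) for g in errors_by_time])

overall_rmse = rmse(all_errors)
mean_of_rmses = statistics.fmean([rmse(g) for g in errors_by_time])

print("bias:", overall_bias, "vs", mean_of_biases)  # these agree
print("RMSE:", overall_rmse, "vs", mean_of_rmses)   # these differ
```

Because the square root is taken after averaging in one case and before averaging in the other, the overall RMSE and the mean of the per-period RMSEs generally differ, while the two bias values agree.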


Weatherhead, E. C., G. C. Reinsel, G. C. Tiao, X. Meng, D. Choi, W. Cheang, T. Keller, J. DeLuisi, D. J. Wuebbles, J. B. Kerr, A. J. Miller, S. J. Oltmans, and J. E. Frederick, 1998: Factors affecting the detection of trends: Statistical considerations and applications to environmental data. Journal of Geophysical Research, 103, 17149-