RUNNING_COVARIANCE

The RUNNING_COVARIANCE function computes the unbiased sample covariance and correlation between two arrays without overflow. The function can also combine previously computed values with new data to allow computing covariance and correlation on data sets that are too large to fit into memory.

RUNNING_COVARIANCE uses Welford's "online" algorithm to compute the covariance in a single pass through the data. The routine is more numerically stable and significantly faster than the CORRELATE function and, unlike CORRELATE, does not require any additional memory.
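As an illustration of the single-pass idea, the sketch below updates running means and running sums of deviation products one pair at a time. It is not the routine's actual implementation: the function name welford_cov_sketch and its variable names are hypothetical, and it ignores NaN handling, threading, and the PREVIOUS keyword.

```idl
; Hypothetical helper, not part of IDL: single-pass covariance of two vectors.
; Assumes x and y have the same length and contain at least two pairs.
function welford_cov_sketch, x, y
  compile_opt idl2

  n   = 0d   ; number of pairs seen so far
  mx  = 0d   ; running mean of x
  my  = 0d   ; running mean of y
  cxy = 0d   ; running sum of (x - mean x) * (y - mean y)
  m2x = 0d   ; running sum of squared deviations of x
  m2y = 0d   ; running sum of squared deviations of y

  for i = 0, n_elements(x) - 1 do begin
    n   += 1d
    dx   = x[i] - mx          ; deviation from the old mean of x
    dy   = y[i] - my          ; deviation from the old mean of y
    mx  += dx / n
    my  += dy / n
    cxy += dx * (y[i] - my)   ; old x deviation times new y deviation
    m2x += dx * (x[i] - mx)
    m2y += dy * (y[i] - my)
  endfor

  ; Unbiased (n - 1) normalization, in the same layout as the Result array
  cov  = cxy / (n - 1)
  varx = m2x / (n - 1)
  vary = m2y / (n - 1)
  return, [cov, cov / sqrt(varx * vary), mx, varx, my, vary, n]
end

; Main-level program: save the file and run it with .RUN
A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
B = [-11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
print, welford_cov_sketch(A, B), format='(7f12.5)'
end
```

For these two vectors, the sketch is expected to agree with the values shown in the Examples section, up to rounding. Because each update works with deviations from the current means rather than raw sums of products, the intermediate values stay small even when the data values are large.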

Examples

; Define two vectors of sample data:

IDL> A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

IDL> B = [-11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

; Compute the [covariance, correlation, meanX, varianceX, meanY, varianceY, count]:

IDL> result = RUNNING_COVARIANCE(A, B)

IDL> print, ['covariance','correlation','meanX','varianceX','meanY','varianceY','count'], format='(7a12)'

IDL> print, result, format='(7f12.5)'

IDL prints:

  covariance correlation       meanX   varianceX       meanY   varianceY       count

    20.16667     0.74673     5.50000     9.16667    13.30000    79.56667    10.00000
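As a quick sanity check, the first two elements of the result can be compared with the CORRELATE function. This is a minimal sketch, assuming CORRELATE's COVARIANCE and DOUBLE keywords and its N-1 normalization for the sample covariance; the two values printed on each line are expected to agree to within rounding.

```idl
A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
B = [-11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
result = RUNNING_COVARIANCE(A, B)

; Covariance from both routines
print, result[0], CORRELATE(A, B, /COVARIANCE, /DOUBLE)

; Correlation coefficient from both routines
print, result[1], CORRELATE(A, B, /DOUBLE)
```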

Syntax

Result = RUNNING_COVARIANCE( X, Y [, /NAN] [, PREVIOUS=value] )

Return Value

Returns a seven-element double-precision array of the form [covariance, correlation, meanX, varianceX, meanY, varianceY, count], where the covariance and variances are the unbiased sample estimates for the arrays X and Y.

Arguments

X

A vector or array of any numeric type other than complex or double complex.

Y

A vector or array of any numeric type other than complex or double complex.

Keywords

NAN

Set this keyword to cause the routine to check for occurrences of the IEEE floating-point values NaN or Infinity in the input data. Elements with the value NaN or Infinity are treated as missing data.

Note: Since the value NaN is treated as missing data, if you set /NAN and X and Y contain only NaN values, the routine will return NaN for all of the returned values, except for the count, which will be zero.
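The following sketch shows the NAN keyword in use. The sample values are made up for illustration. How the routine treats a pair in which only one of the two elements is missing is not stated here, so no output is shown for the first call; the second call exercises the all-NaN case described in the Note above.

```idl
missing = !VALUES.D_NAN

; Two vectors with missing values at different positions
x = [1d, 2d, missing, 4d, 5d]
y = [2d, 4d, 6d, missing, 10d]
print, RUNNING_COVARIANCE(x, y, /NAN)

; All-NaN input: per the Note above, every element of the result
; should be NaN except the count, which should be zero
allnan = REPLICATE(missing, 5)
print, RUNNING_COVARIANCE(allnan, allnan, /NAN)
```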

PREVIOUS

Set this keyword to a seven-element array containing the [covariance, correlation, meanX, varianceX, meanY, varianceY, count] from a previous calculation. These values will be combined with the new values computed from the input arrays. If this keyword is omitted or is set to all zeroes, then a new calculation is started.

Tip: See Additional Examples for examples of chaining together multiple calls to RUNNING_COVARIANCE using the PREVIOUS keyword.

Note: If the count from a previous calculation is zero, then a new calculation is started, regardless of the covariance or mean values.

Thread Pool Keywords

This routine is written to make use of IDL’s thread pool, which can increase execution speed on systems with multiple CPUs. The values stored in the !CPU system variable control whether IDL uses the thread pool for a given computation. In addition, you can use the thread pool keywords TPOOL_MAX_ELTS, TPOOL_MIN_ELTS, and TPOOL_NOTHREAD to override the defaults established by !CPU for a single invocation of this routine. See Thread Pool Keywords for details.

When computing the covariance for a large number of values, the results will depend upon the order in which the numbers are combined. Since the thread pool will combine values in a different order, you may obtain a different, but equally correct, result than that obtained using the standard non-threaded implementation. This effect occurs because RUNNING_COVARIANCE uses floating-point arithmetic, and the mantissa of a floating-point value has a fixed number of significant digits. For more information on floating-point numbers, see Accuracy and Floating Point Operations.
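One way to observe this effect is to run the same calculation with and without the thread pool and compare the results. The sketch below uses the TPOOL_NOTHREAD keyword named above. The array size is an arbitrary choice, and whether the first call is actually threaded depends on the current !CPU settings, so the printed difference may be exactly zero or a very small nonzero value.

```idl
; Random test data (array size chosen arbitrarily for illustration)
x = RANDOMU(seed, 1000000L, /DOUBLE)
y = RANDOMU(seed, 1000000L, /DOUBLE)

threaded = RUNNING_COVARIANCE(x, y)                   ; thread pool used if !CPU allows it
serial   = RUNNING_COVARIANCE(x, y, /TPOOL_NOTHREAD)  ; thread pool disabled for this call

; Difference between the two covariance estimates
print, threaded[0] - serial[0], format='(e12.3)'
```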

Additional Examples

; Define two arrays and compute the covariance in a single call:

IDL> x = [1:10]

IDL> y = [11:20]

IDL> print, RUNNING_COVARIANCE(x,y)

       9.1666667       1.0000000       5.5000000       9.1666667       15.500000       9.1666667       10.000000

; Compute the covariance of just the first half of each array:

IDL> cov1 = RUNNING_COVARIANCE(x[0:4], y[0:4])

IDL> print, cov1

       2.5000000       1.0000000       3.0000000       2.5000000       13.000000       2.5000000       5.0000000

; Now combine that covariance with the covariance of the second half of each array:

IDL> print, RUNNING_COVARIANCE(x[5:*], y[5:*], PREVIOUS=cov1)

       9.1666667       1.0000000       5.5000000       9.1666667       15.500000       9.1666667       10.000000
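The chained result matches the single-call result because two partial results can be merged from their means, variances, covariances, and counts alone. The sketch below shows one standard way to do such a merge (a pairwise update of the sums of deviation products). It is an assumption about how a merge of this kind can work, not a description of how the PREVIOUS keyword is implemented, and the function name merge_cov_sketch is hypothetical.

```idl
; Hypothetical helper, not part of IDL: merge two seven-element results laid out as
; [covariance, correlation, meanX, varianceX, meanY, varianceY, count].
; Assumes each non-empty part was computed from at least two pairs.
function merge_cov_sketch, a, b
  compile_opt idl2

  na = double(a[6])
  nb = double(b[6])
  if na eq 0 then return, double(b)   ; nothing to merge from a
  if nb eq 0 then return, double(a)   ; nothing to merge from b
  n = na + nb

  dx = b[2] - a[2]   ; difference of the X means
  dy = b[4] - a[4]   ; difference of the Y means

  ; Recover the sums of deviation products from the unbiased estimates, then
  ; add the cross term that accounts for the shift between the two means.
  cxy = a[0] * (na - 1) + b[0] * (nb - 1) + dx * dy * na * nb / n
  m2x = a[3] * (na - 1) + b[3] * (nb - 1) + dx * dx * na * nb / n
  m2y = a[5] * (na - 1) + b[5] * (nb - 1) + dy * dy * na * nb / n

  cov  = cxy / (n - 1)
  varx = m2x / (n - 1)
  vary = m2y / (n - 1)
  return, [cov, cov / sqrt(varx * vary), a[2] + dx * nb / n, varx, $
           a[4] + dy * nb / n, vary, n]
end

; Main-level program: save the file and run it with .RUN
x = [1:10]
y = [11:20]
cov1 = RUNNING_COVARIANCE(x[0:4], y[0:4])
cov2 = RUNNING_COVARIANCE(x[5:*], y[5:*])
print, merge_cov_sketch(cov1, cov2)
end
```

For these arrays, the merged values are expected to agree with the combined result printed above, up to rounding.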

; Use the PREVIOUS keyword to efficiently calculate the covariance of huge arrays, one chunk at a time:

IDL> covar = DBLARR(7)

IDL> seed1 = 1 & seed2 = 2

IDL> for i=0,99 do covar = RUNNING_COVARIANCE(randomu(seed1, 1e6), randomu(seed2, 1e6), PREVIOUS=covar)

IDL> print, covar

IDL prints:

   1.3501206e-05   0.00016202160      0.49997738     0.083330162      0.50005046     0.083329168   1.0000000e+08

Version History

8.8.3

Introduced