Proposals:Refactoring Statistics Framework 2007 Miscellaneous

From KitwarePublic


Miscellaneous Issues

Most of these issues were identified by Brad Davis.

CovarianceCalculator

Issues

The following material has been remixed from the Wikipedia pages above.

Given a sample of N multivariate vectors of dimension n, the entries of the sample mean are given by:

 \bar{x}_{i}=\frac{1}{N}\sum_{k=1}^{N}x_{ik},\quad i=1,\ldots,n.

The sample covariance entries are given by

 q_{ij}=\frac{1}{N-1}\sum_{k=1}^{N}\left(  x_{ik}-\bar{x}_{i}\right)  \left( x_{jk}-\bar{x}_{j}\right)

The sample mean and the sample covariance matrix are unbiased estimates of the mean and the covariance matrix of the random variable \mathbf{X}. The sample covariance matrix has N − 1 in the denominator rather than N essentially because the true mean is not known and is replaced by the sample mean \bar{x}. If the true mean \bar{X} is known, the analogous unbiased estimate uses N in the denominator:

 q_{ij}=\frac{1}{N}\sum_{k=1}^{N}\left(  x_{ik}-\bar{X}_{i}\right)  \left( x_{jk}-\bar{X}_{j}\right)
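The two denominators can be compared on a small example. The following is a minimal sketch in plain Python, not ITK code; the function names `sample_mean` and `covariance` and the `known_mean` parameter are assumptions made for illustration only.

```python
def sample_mean(samples):
    """Component-wise mean of a list of equal-length measurement vectors."""
    n = len(samples)
    dim = len(samples[0])
    return [sum(v[i] for v in samples) / n for i in range(dim)]


def covariance(samples, known_mean=None):
    """Covariance matrix of the samples.

    If known_mean is None, the mean is estimated from the samples and the
    sums are divided by N - 1. Otherwise known_mean is taken to be the true
    population mean and the sums are divided by N.
    """
    n = len(samples)
    if known_mean is None:
        mean = sample_mean(samples)
        denom = n - 1
    else:
        mean = known_mean
        denom = n
    dim = len(mean)
    return [
        [sum((v[i] - mean[i]) * (v[j] - mean[j]) for v in samples) / denom
         for j in range(dim)]
        for i in range(dim)
    ]


samples = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]
mean = sample_mean(samples)
cov_estimated = covariance(samples)                      # divides by N - 1
cov_known = covariance(samples, known_mean=[3.0, 4.0])   # divides by N
```

In this example the supplied mean happens to equal the sample mean, so the two results differ only by the factor (N − 1) / N.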

Recursive Computation

The current code of the Covariance Calculator takes advantage of the fact that the mean and covariance can be computed recursively, updating the running estimates with each new measurement.

Derivation

The recursive algorithm is derived below for the case of a univariate distribution.

First, the mean value of a sample of K elements can be computed as the sum of the K values weighted by their frequencies, and normalized by the total frequency.

 \bar{x}_K=\frac{ \sum_{i=1}^{K}x_i  f_i }{ \sum_{i=1}^{K}f_i }

where f_i is the frequency of occurrence of the value x_i.

This expression can be rewritten in terms of the previous estimate of the mean and the current measurement as

 \bar{x}_K = \frac{ \sum_{i=1}^{K-1}x_i  f_i }{ \sum_{i=1}^{K-1}f_i } \left[ \frac{ \sum_{i=1}^{K-1}f_i }{ \sum_{i=1}^{K}f_i } \right] + x_K \left[ \frac{f_K}{ \sum_{i=1}^{K}f_i } \right]
 \bar{x}_K = \bar{x}_{K-1} \left[ \frac{ \sum_{i=1}^{K-1}f_i }{ \sum_{i=1}^{K}f_i }\right] + x_K \left[ \frac{f_K}{ \sum_{i=1}^{K}f_i } \right]
 \bar{x}_K = \bar{x}_{K-1} \left[ \frac{ \sum_{i=1}^{K}f_i - f_K }{ \sum_{i=1}^{K}f_i }\right] + x_K \left[ \frac{f_K}{ \sum_{i=1}^{K}f_i } \right]
 \bar{x}_K = \bar{x}_{K-1}  - \bar{x}_{K-1} \left[ \frac{f_K }{ \sum_{i=1}^{K}f_i }\right] + x_K \left[ \frac{f_K}{ \sum_{i=1}^{K}f_i } \right]
 \bar{x}_K = \bar{x}_{K-1} + \left( x_K - \bar{x}_{K-1}  \right) \left[ \frac{f_K}{ \sum_{i=1}^{K}f_i } \right]
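The last line of the derivation translates directly into an update step. The sketch below is illustrative only; `update_mean` and its parameter names are hypothetical and do not correspond to any ITK method.

```python
def update_mean(prev_mean, prev_total_freq, x, f):
    """One step of the recursive weighted mean.

    prev_mean        mean of the first K-1 values
    prev_total_freq  sum of the first K-1 frequencies
    x, f             the K-th value and its frequency
    Returns (new_mean, new_total_freq).
    """
    total = prev_total_freq + f
    new_mean = prev_mean + (x - prev_mean) * (f / total)
    return new_mean, total


values = [2.0, 4.0, 7.0]
freqs = [1.0, 3.0, 2.0]

mean, total = 0.0, 0.0
for x, f in zip(values, freqs):
    mean, total = update_mean(mean, total, x, f)

# Direct evaluation of the weighted mean, for comparison.
direct = sum(x * f for x, f in zip(values, freqs)) / sum(freqs)
```

The initial value of the mean is irrelevant, because the first update multiplies the difference by f_1 / f_1 = 1 and therefore replaces it with x_1.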


Next, the covariance (which in the univariate case reduces to the variance) can be estimated as

 \bar{S}_K=\frac{ \sum_{i=1}^{K} {\left( x_i - \bar{x}_K \right) }^2  f_i }{ \sum_{i=1}^{K}f_i - 1 }

Where it is assumed that the frequencies are actually counts of values as opposed to normalized frequencies.

The numerator of this expression can then be written in terms of the previous mean estimate, by substituting the recursive expression for \bar{x}_K obtained above

 \bar{Q}_K= \sum_{i=1}^{K} {\left( x_i - \bar{x}_K \right) }^2  f_i = \sum_{i=1}^{K} {\left( \left( x_i - \bar{x}_{K-1} \right) - \left( x_K - \bar{x}_{K-1} \right) F  \right) }^2 f_i

Where

 F = \left[ \frac{f_K}{ \sum_{i=1}^{K}f_i } \right]

Therefore

 \bar{Q}_K= \sum_{i=1}^{K} {\left( x_i - \bar{x}_{K-1} \right)   }^2 f_i  - 2 F {\left( x_K - \bar{x}_{K-1} \right) } \sum_{i=1}^{K}{\left( x_i - \bar{x}_{K-1} \right)} f_i + F^2 {\left( x_K - \bar{x}_{K-1} \right) }^2 \sum_{i=1}^{K}  f_i

where the first sum can be expressed in terms of \bar{Q}_{K-1} by splitting off the i = K term

 \bar{Q}_K= \bar{Q}_{K-1} + {\left( x_K - \bar{x}_{K-1} \right)   }^2 f_K  - 2 F {\left( x_K - \bar{x}_{K-1} \right) } \sum_{i=1}^{K}{\left( x_i - \bar{x}_{K-1} \right)} f_i + F^2 {\left( x_K - \bar{x}_{K-1} \right) }^2 \sum_{i=1}^{K}  f_i

where the second and fourth terms share the factor {\left( x_K - \bar{x}_{K-1} \right) }^2 and can be grouped

 \bar{Q}_K= \bar{Q}_{K-1}  - 2 F {\left( x_K - \bar{x}_{K-1} \right) } \sum_{i=1}^{K}{\left( x_i - \bar{x}_{K-1} \right)} f_i + {\left( x_K - \bar{x}_{K-1} \right) }^2 { \left[ F^2 \sum_{i=1}^{K}  f_i + f_K \right] }

and replacing F

 \bar{Q}_K= \bar{Q}_{K-1}  - 2 \left[ \frac{f_K}{ \sum_{i=1}^{K}f_i } \right] {\left( x_K - \bar{x}_{K-1} \right) } \sum_{i=1}^{K}{\left( x_i - \bar{x}_{K-1} \right)} f_i + {\left( x_K - \bar{x}_{K-1} \right) }^2 { \left[ {\left[ \frac{f_K}{ \sum_{i=1}^{K}f_i } \right] } ^2 \sum_{i=1}^{K}  f_i + f_K \right] }


 \bar{Q}_K= \bar{Q}_{K-1}  - 2 \left[ \frac{f_K}{ \sum_{i=1}^{K}f_i } \right] {\left( x_K - \bar{x}_{K-1} \right) } \sum_{i=1}^{K}{\left( x_i - \bar{x}_{K-1} \right)} f_i + {\left( x_K - \bar{x}_{K-1} \right) }^2  { \left[ {\left[ \frac{ {f_K}^2 }{ \sum_{i=1}^{K}f_i } \right] } + f_K \right] }

Where the sum in the second term

 \sum_{i=1}^{K}{\left( x_i - \bar{x}_{K-1} \right)} f_i
 = \sum_{i=1}^{K}{ x_i f_i } -  \bar{x}_{K-1} \sum_{i=1}^{K} f_i
 = { x_K f_K } + \sum_{i=1}^{K-1}{ x_i f_i } -  \bar{x}_{K-1} \sum_{i=1}^{K} f_i
 = { x_K f_K } + \bar{x}_{K-1} \sum_{i=1}^{K-1}{ f_i } -  \bar{x}_{K-1} \sum_{i=1}^{K} f_i
 = { x_K f_K } - \bar{x}_{K-1} f_K =  { \left( x_K  - \bar{x}_{K-1} \right) } f_K

So the second term can be written as

 - 2 \left[ \frac{{f_K}^2}{ \sum_{i=1}^{K}f_i } \right] {\left( x_K - \bar{x}_{K-1} \right) }^2

and Q can be written as

 \bar{Q}_K= \bar{Q}_{K-1}   + {\left( x_K - \bar{x}_{K-1} \right) }^2  { \left[ { f_K - \left[ \frac{ {f_K}^2 }{\sum_{i=1}^{K}f_i } \right] } \right] }

and the last factor can be written as

 f_K - \left[ \frac{ {f_K}^2 }{\sum_{i=1}^{K}f_i } \right] =  \frac{ f_K \sum_{i=1}^{K}f_i  -  {f_K}^2 }{\sum_{i=1}^{K}f_i } =  \frac{ f_K \sum_{i=1}^{K-1}f_i + {f_K}^2 -  {f_K}^2 }{\sum_{i=1}^{K}f_i } = f_K \frac{\sum_{i=1}^{K-1}f_i }{\sum_{i=1}^{K}f_i }

So the numerator Q is

 \bar{Q}_K= \bar{Q}_{K-1}   + {\left( x_K - \bar{x}_{K-1} \right) }^2  f_K \left[ \frac{\sum_{i=1}^{K-1}f_i }{\sum_{i=1}^{K}f_i } \right]

and finally the Covariance can be computed as


 \bar{S}_K=\frac{ \bar{Q}_K }{ \sum_{i=1}^{K}f_i - 1 }
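As a numerical check of the derivation, the recursive updates for the mean and for Q can be compared against the direct formulas. This is a minimal sketch under the same assumption as above (frequencies are counts); the function names are hypothetical and this is not the ITK implementation.

```python
def update(mean, q, total, x, f):
    """One recursive step: fold the value x with frequency f into (mean, q, total)."""
    new_total = total + f
    delta = x - mean
    # Q_K = Q_{K-1} + (x_K - mean_{K-1})^2 * f_K * (sum_{K-1} f / sum_K f)
    new_q = q + delta * delta * f * (total / new_total)
    # mean_K = mean_{K-1} + (x_K - mean_{K-1}) * (f_K / sum_K f)
    new_mean = mean + delta * (f / new_total)
    return new_mean, new_q, new_total


def recursive_variance(values, freqs):
    mean, q, total = 0.0, 0.0, 0.0
    for x, f in zip(values, freqs):
        mean, q, total = update(mean, q, total, x, f)
    return mean, q / (total - 1.0)


def direct_variance(values, freqs):
    total = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / total
    q = sum((x - mean) ** 2 * f for x, f in zip(values, freqs))
    return mean, q / (total - 1.0)


values = [2.0, 4.0, 7.0]
freqs = [1.0, 3.0, 2.0]
rec_mean, rec_var = recursive_variance(values, freqs)
dir_mean, dir_var = direct_variance(values, freqs)
```

Note that Q must be updated using the previous mean, which is why `delta` is computed before the mean is overwritten.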


Proposal

API SetMean

The API of the Covariance Calculator should make clear when the value of Mean that was passed to the SetMean() method corresponds to

  • The real Mean of the population or
  • An estimation of the Mean computed from the sample itself.

Two options seem to be good candidates

  1. Deprecate the ambiguous SetMean() method and replace it with SetPopulationMean() and SetSampleMean() in order to make clear whether the mean is the real one or an estimated one
  2. Keep the ambiguous SetMean() method, and add a boolean method that the users can invoke to indicate whether the provided Mean is the real one or an estimated one.

Of these two options, the second has the lower API impact and the fewer backward compatibility implications. That is, add a boolean method that lets the user specify that the mean provided to the SetMean() method is the real mean of the population and not a value estimated from the sample.
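The second option could behave roughly as sketched below. This is a Python stand-in written only to illustrate the intended behavior; the class name `CovarianceCalculatorSketch`, the method `set_mean_is_population_mean`, and the default value of the flag are all assumptions, not part of the existing or proposed ITK API.

```python
class CovarianceCalculatorSketch:
    """Hypothetical stand-in for illustration; not the ITK class."""

    def __init__(self):
        self._mean = None
        # Assumed default: a supplied mean is treated as a sample estimate.
        self._mean_is_population_mean = False

    def set_mean(self, mean):
        """Ambiguous setter: by itself it does not say where the mean came from."""
        self._mean = list(mean)

    def set_mean_is_population_mean(self, flag):
        """Hypothetical boolean method (name assumed for this sketch)."""
        self._mean_is_population_mean = bool(flag)

    def compute(self, samples):
        n = len(samples)
        dim = len(samples[0])
        if self._mean is None:
            # Mean estimated internally: use N - 1.
            mean = [sum(v[i] for v in samples) / n for i in range(dim)]
            denom = n - 1
        else:
            mean = self._mean
            denom = n if self._mean_is_population_mean else n - 1
        return [
            [sum((v[i] - mean[i]) * (v[j] - mean[j]) for v in samples) / denom
             for j in range(dim)]
            for i in range(dim)
        ]


samples = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]

calc = CovarianceCalculatorSketch()
calc.set_mean([3.0, 4.0])
cov_default = calc.compute(samples)       # flag unset: divides by N - 1

calc.set_mean_is_population_mean(True)
cov_population = calc.compute(samples)    # flag set: divides by N
```

With this design, existing callers of the setter see no change in behavior unless they opt in through the new flag.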

Internal Mean Estimation

When the Mean is estimated by the Covariance Calculator itself, and then used for estimating the covariance, the unbiased estimation formula (with N − 1) is correctly used.

Distance Metric

  • Room to expand to other metrics (l_1, l_\infty, Mahalanobis)

Distance To Centroid Membership Function

  • Why is the l_2 distance hard-coded, as opposed to using a distance metric?
  • The title "centroid" is unnecessarily narrow
  • Why not a generic Distance Membership Function where metrics are plugged in?
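A generic membership function with a pluggable metric, as suggested in the last item, might look like the following sketch. All names here (`DistanceMembershipFunctionSketch`, `l1_distance`, `l2_distance`, `linf_distance`) are hypothetical and chosen for illustration; a Mahalanobis metric could presumably be plugged in the same way as a callable that carries its inverse covariance matrix.

```python
import math


def l1_distance(a, b):
    """Sum of absolute component differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def linf_distance(a, b):
    """Largest absolute component difference."""
    return max(abs(x - y) for x, y in zip(a, b))


class DistanceMembershipFunctionSketch:
    """Hypothetical membership function: distance to a reference point
    under whatever metric the caller plugs in."""

    def __init__(self, reference, metric=l2_distance):
        self.reference = list(reference)
        self.metric = metric

    def evaluate(self, measurement):
        return self.metric(self.reference, measurement)


point = [3.0, 4.0]
origin = [0.0, 0.0]
d1 = DistanceMembershipFunctionSketch(origin, l1_distance).evaluate(point)
d2 = DistanceMembershipFunctionSketch(origin).evaluate(point)
dinf = DistanceMembershipFunctionSketch(origin, linf_distance).evaluate(point)
```

Because the reference point is just a parameter, nothing in this sketch ties it to a centroid.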

Distances

  • itkEuclideanDistance
  • itkMahalanobisDistanceMembershipFunction

ImageToCoocurrenceListAdaptor

  • This is not really an adaptor, it generates new data.
  • Should it be called a Generator or a Filter?

ListSampleBase

  • Very lightweight, probably unnecessary
  • Only provides a Search() method, and this method probably shouldn't be there

NeighborhoodSampler

  • Samples with a spherical (l_2) neighborhood
  • Allow other metrics?

SampleAlgorithmBase

  • Very lightweight, no obvious benefit
  • Is it a candidate for deprecation?