Proposals:Refactoring Statistics Framework 2007 Miscellaneous
Miscellaneous Issues
Most of these issues were identified by Brad Davis.
CovarianceCalculator
Issues
- Is it redundant to have this class and the weighted covariance calculator ?
- SetMean(), GetMean() methods are not const-correct.
- Known Mean versus Estimated Mean (incorrect equation: N versus N-1)
- Concept of Covariance: http://en.wikipedia.org/wiki/Covariance
- Sample Mean and Covariance: http://en.wikipedia.org/wiki/Sample_mean_and_covariance
The following material has been remixed from the Wikipedia pages above.
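Given a sample of N multivariate vectors, where $x_{ij}$ denotes the j-th component of the i-th vector, the entries of the sample mean are given by:
$$\bar{x}_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}$$
The sample covariance entries are given by:
$$q_{jk} = \frac{1}{N-1} \sum_{i=1}^{N} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$$
The sample mean and the sample covariance matrix are unbiased estimates of the mean and the covariance matrix of the random variable $X$. The reason why the sample covariance matrix has N − 1 in the denominator rather than N is essentially that the mean is not known and is replaced by the sample mean $\bar{x}$. If the mean $\mu = E(X)$ is known, the analogous unbiased estimate has N in the denominator:
$$q_{jk} = \frac{1}{N} \sum_{i=1}^{N} (x_{ij} - \mu_j)(x_{ik} - \mu_k)$$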
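The difference between the two denominators can be illustrated with a small C++ sketch. It is not taken from the ITK sources: the function names `Mean`, `CovarianceEntryEstimatedMean` and `CovarianceEntryKnownMean` are hypothetical, and each component of the sample is assumed to be stored as a plain `std::vector<double>`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sample mean of one component (assumes a non-empty input).
double Mean(const std::vector<double>& x)
{
  assert(!x.empty());
  double sum = 0.0;
  for (double v : x) { sum += v; }
  return sum / static_cast<double>(x.size());
}

// Covariance entry between components j and k when the means are
// estimated from the same sample: the denominator is N - 1.
double CovarianceEntryEstimatedMean(const std::vector<double>& xj,
                                    const std::vector<double>& xk)
{
  assert(xj.size() == xk.size() && xj.size() > 1);
  const double meanJ = Mean(xj);
  const double meanK = Mean(xk);
  double q = 0.0;
  for (std::size_t i = 0; i < xj.size(); ++i)
  {
    q += (xj[i] - meanJ) * (xk[i] - meanK);
  }
  return q / static_cast<double>(xj.size() - 1);
}

// Covariance entry when the means are known a priori:
// the denominator is N.
double CovarianceEntryKnownMean(const std::vector<double>& xj,
                                const std::vector<double>& xk,
                                double muJ, double muK)
{
  assert(xj.size() == xk.size() && !xj.empty());
  double q = 0.0;
  for (std::size_t i = 0; i < xj.size(); ++i)
  {
    q += (xj[i] - muJ) * (xk[i] - muK);
  }
  return q / static_cast<double>(xj.size());
}
```

The two functions accumulate the same kind of sum of products and differ only in where the means come from and in the denominator, which is the distinction the calculator's API needs to expose.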
Recursive Computation
The current code of the Covariance Calculator takes advantage of the fact that the mean and covariance can be computed recursively.
Derivation
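The recursive algorithm is derived below for the case of a univariate distribution, where the covariance reduces to the variance.
First, the mean value of a sample of K elements can be computed as the sum of the K values weighted by their frequencies, and normalized by the total frequency:
$$\bar{x}_K = \frac{\sum_{i=1}^{K} f_i x_i}{\sum_{i=1}^{K} f_i}$$
where $f_i$ is the frequency of occurrence of the value $x_i$.
This expression can be rewritten in terms of the previous estimate of the mean and the current measurement as
$$\bar{x}_K = \bar{x}_{K-1} + \frac{f_K}{\sum_{i=1}^{K} f_i} (x_K - \bar{x}_{K-1})$$
Next, the covariance can be estimated as
$$\sigma_K^2 = \frac{\sum_{i=1}^{K} f_i (x_i - \bar{x}_K)^2}{\left( \sum_{i=1}^{K} f_i \right) - 1}$$
where it is assumed that the frequencies are actually counts of values as opposed to normalized frequencies.
The numerator of this expression, called $Q_K$ here, can then be written in terms of the previous mean estimate:
$$Q_K = \sum_{i=1}^{K} f_i \left( x_i - \bar{x}_{K-1} - \frac{f_K}{F_K} (x_K - \bar{x}_{K-1}) \right)^2$$
where
$$F_K = \sum_{i=1}^{K} f_i$$
Therefore, expanding the square,
$$Q_K = \sum_{i=1}^{K} f_i (x_i - \bar{x}_{K-1})^2 - 2 \frac{f_K}{F_K} (x_K - \bar{x}_{K-1}) \sum_{i=1}^{K} f_i (x_i - \bar{x}_{K-1}) + \frac{f_K^2}{F_K^2} (x_K - \bar{x}_{K-1})^2 \sum_{i=1}^{K} f_i$$
where the first sum can be expressed in terms of the Q for K-1:
$$Q_K = Q_{K-1} + f_K (x_K - \bar{x}_{K-1})^2 - 2 \frac{f_K}{F_K} (x_K - \bar{x}_{K-1}) \sum_{i=1}^{K} f_i (x_i - \bar{x}_{K-1}) + \frac{f_K^2}{F_K^2} (x_K - \bar{x}_{K-1})^2 \sum_{i=1}^{K} f_i$$
where the second and fourth terms can be factorized:
$$Q_K = Q_{K-1} + f_K (x_K - \bar{x}_{K-1})^2 \left( 1 + \frac{f_K}{F_K^2} \sum_{i=1}^{K} f_i \right) - 2 \frac{f_K}{F_K} (x_K - \bar{x}_{K-1}) \sum_{i=1}^{K} f_i (x_i - \bar{x}_{K-1})$$
and replacing the remaining sum of frequencies with $F_K$:
$$Q_K = Q_{K-1} + f_K (x_K - \bar{x}_{K-1})^2 \left( 1 + \frac{f_K}{F_K} \right) - 2 \frac{f_K}{F_K} (x_K - \bar{x}_{K-1}) \sum_{i=1}^{K} f_i (x_i - \bar{x}_{K-1})$$
The sum in the cross term simplifies, because the frequency-weighted deviations of the first K-1 values from $\bar{x}_{K-1}$ add up to zero:
$$\sum_{i=1}^{K} f_i (x_i - \bar{x}_{K-1}) = f_K (x_K - \bar{x}_{K-1})$$
So the cross term can be written as
$$- 2 \frac{f_K^2}{F_K} (x_K - \bar{x}_{K-1})^2$$
and Q can be written as
$$Q_K = Q_{K-1} + f_K (x_K - \bar{x}_{K-1})^2 \left( 1 - \frac{f_K}{F_K} \right)$$
and the last factor can be written as
$$1 - \frac{f_K}{F_K} = \frac{F_K - f_K}{F_K} = \frac{F_{K-1}}{F_K}$$
So the numerator Q is
$$Q_K = Q_{K-1} + f_K \frac{F_{K-1}}{F_K} (x_K - \bar{x}_{K-1})^2$$
and finally the covariance can be computed as
$$\sigma_K^2 = \frac{Q_K}{F_K - 1}$$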
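The recursion described in this section can be sketched in C++ as a small accumulator. This is only an illustration for the univariate case and not the ITK implementation: the class name `RunningWeightedVariance` and its methods are hypothetical, and frequencies are assumed to be raw counts rather than normalized weights.

```cpp
#include <cassert>

// Hypothetical accumulator (not an ITK class) that updates the mean and
// the numerator Q of the variance one measurement at a time.
class RunningWeightedVariance
{
public:
  // Adds the value x, observed `count` times (a raw count, > 0).
  void Add(double x, double count)
  {
    assert(count > 0.0);
    const double previousTotal = m_TotalFrequency;  // F_{K-1}
    m_TotalFrequency += count;                      // F_K
    const double delta = x - m_Mean;                // x_K - mean_{K-1}
    // mean_K = mean_{K-1} + (f_K / F_K) * (x_K - mean_{K-1})
    m_Mean += (count / m_TotalFrequency) * delta;
    // Q_K = Q_{K-1} + f_K * (F_{K-1} / F_K) * (x_K - mean_{K-1})^2
    m_Q += count * (previousTotal / m_TotalFrequency) * delta * delta;
  }

  double GetMean() const { return m_Mean; }

  // Unbiased estimate Q_K / (F_K - 1); needs a total count above one.
  double GetVariance() const
  {
    assert(m_TotalFrequency > 1.0);
    return m_Q / (m_TotalFrequency - 1.0);
  }

private:
  double m_TotalFrequency = 0.0;
  double m_Mean = 0.0;
  double m_Q = 0.0;
};
```

Each call to `Add()` touches only three running values, so the sample is visited once and no second pass around the final mean is required.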
Proposal
API SetMean
The API of the Covariance Calculator should make clear whether the mean passed to the SetMean() method is
- The real mean of the population, or
- An estimate of the mean computed from the sample itself.
Two options seem to be good candidates:
- Deprecate the ambiguous SetMean() method and replace it with SetPopulationMean() and SetSampleMean(), in order to make clear whether the mean is the real one or an estimated one.
- Keep the ambiguous SetMean() method, and add a boolean method that users can invoke to indicate whether the provided mean is the real one or an estimated one.
Of these two options, the second has the lower API impact and the fewer backward compatibility implications. That is: add a boolean method that lets the user specify that the mean provided to the SetMean() method is the real mean of the population and not a value estimated from the sample.
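A rough sketch of what the second option could look like is given below. It is a univariate stand-in, not the actual calculator: the class name `CovarianceCalculatorSketch`, the flag methods `SetMeanIsPopulationMean()` / `GetMeanIsPopulationMean()` and the `ComputeVariance()` method are all hypothetical names chosen for illustration. The flag is assumed to default to false so that callers who never set it keep the N − 1 denominator.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the calculator. Only SetMean() and GetMean()
// are named in the proposal; every other name is an assumption.
class CovarianceCalculatorSketch
{
public:
  void SetMean(double mean) { m_Mean = mean; }
  double GetMean() const { return m_Mean; }

  // true:  the mean is the real population mean (denominator N).
  // false: the mean was estimated from the sample (denominator N - 1).
  void SetMeanIsPopulationMean(bool flag) { m_MeanIsPopulationMean = flag; }
  bool GetMeanIsPopulationMean() const { return m_MeanIsPopulationMean; }

  double ComputeVariance(const std::vector<double>& sample) const
  {
    assert(sample.size() > 1);
    double q = 0.0;
    for (double v : sample)
    {
      const double d = v - m_Mean;
      q += d * d;
    }
    const double n = static_cast<double>(sample.size());
    return q / (m_MeanIsPopulationMean ? n : n - 1.0);
  }

private:
  double m_Mean = 0.0;
  bool m_MeanIsPopulationMean = false;  // assumed backward-compatible default
};
```

Existing calls to SetMean() would compile unchanged under this design; only users who know the population mean need the extra call.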
Internal Mean Estimation
When the Mean is estimated by the Covariance Calculator itself, and then used for estimating the covariance, the unbiased estimation formula (with N − 1) is correctly used.
Distance Metric
- Room to expand to other metrics (L1, L-infinity, Mahalanobis)
Distance To Centroid Membership Function
- Why is the L2 distance hard-coded, as opposed to using a distance metric ?
- The word "centroid" in the title is unnecessarily narrow
- Why not a generic Distance Membership Function where metrics are plugged in ?
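One way such a generic membership function might look is sketched below, with the metric supplied as a template parameter. None of these types are ITK classes: `MeasurementVector`, `ManhattanMetric`, `EuclideanMetric`, `ChebyshevMetric` and `DistanceMembershipFunction` are hypothetical names, and measurement vectors are assumed to be plain `std::vector<double>` of equal length.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using MeasurementVector = std::vector<double>;

struct ManhattanMetric  // L1: sum of absolute differences
{
  double operator()(const MeasurementVector& a, const MeasurementVector& b) const
  {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) { sum += std::fabs(a[i] - b[i]); }
    return sum;
  }
};

struct EuclideanMetric  // L2: square root of the sum of squared differences
{
  double operator()(const MeasurementVector& a, const MeasurementVector& b) const
  {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) { sum += (a[i] - b[i]) * (a[i] - b[i]); }
    return std::sqrt(sum);
  }
};

struct ChebyshevMetric  // L-infinity: largest absolute difference
{
  double operator()(const MeasurementVector& a, const MeasurementVector& b) const
  {
    assert(a.size() == b.size());
    double largest = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
      largest = std::max(largest, std::fabs(a[i] - b[i]));
    }
    return largest;
  }
};

// Membership function that delegates the distance to the plugged-in metric.
template <typename TMetric>
class DistanceMembershipFunction
{
public:
  explicit DistanceMembershipFunction(MeasurementVector reference,
                                      TMetric metric = TMetric())
    : m_Reference(std::move(reference)), m_Metric(metric) {}

  double Evaluate(const MeasurementVector& measurement) const
  {
    return m_Metric(measurement, m_Reference);
  }

private:
  MeasurementVector m_Reference;  // a centroid or any other reference point
  TMetric m_Metric;
};
```

With this arrangement the reference point no longer has to be a centroid, and a Mahalanobis metric could be added as another functor that carries its own covariance matrix.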
Distances
- itkEuclideanDistance
- itkMahalanobisDistanceMembershipFunction
ImageToCooccurrenceListAdaptor
- This is not really an adaptor, it generates new data.
- Should it be called Generator ? or Filter ?
ListSampleBase
- Very lightweight, probably unnecessary
- Only provides a Search() method, and this method probably shouldn't be there
NeighborhoodSampler
- Samples with a spherical (L2) neighborhood
- Allow other metrics ?
SampleAlgorithmBase
- Very lightweight, no obvious benefit
- Is it a candidate for deprecation ?