Research:Revision scoring as a service/SigClust
Note: This page is under construction and may contain errors.
In a 2008 paper [1], Liu et al. present the clustering algorithm SigClust. The algorithm takes as its framework the assumption that clusters are, by definition, generated by single multivariate Gaussian distributions. Thus the algorithm considers the null hypothesis that the data in question (viewed as the rows of a matrix $X$) come from a single multivariate Gaussian.
A translation and rotation invariant test statistic ($CL_{2}(\cdot)$, the $k$-means cluster index for $k=2$) is used, so we may assume without loss of generality that the null distribution is $D_{0} \sim N(0, \Sigma_{0})$, where $\Sigma_{0}$ is a diagonal matrix. This matrix is estimated from the data, taking into account symmetric (fixed variance $\sigma^{2}$), multivariate Gaussian background noise. Given $D_0$, one can estimate the distribution of the test statistic $CL_{2}(Z)$ over datasets $Z$ generated by $D_0$, typically by simulation. From there it is straightforward to compute the p-value of $CL_{2}(X)$ for the \emph{original} dataset with respect to this empirical distribution. When the p-value falls below some threshold $\alpha$, the null hypothesis is rejected; the data may then be clustered as desired, and in particular one may use the two clusters found when computing $CL_{2}(X)$.
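For concreteness, the 2-means cluster index is the within-cluster sum of squares relative to the total sum of squares about the overall mean. Writing $C_1, C_2$ for the two clusters, $\bar{x}^{(k)}$ for the mean of cluster $C_k$, and $\bar{x}$ for the overall mean (notation stated here for this page; the paper gives the same quantity), the statistic is

$$CL_{2}(X) \;=\; \frac{\sum_{k=1}^{2} \sum_{i \in C_{k}} \lVert x_{i} - \bar{x}^{(k)} \rVert^{2}}{\sum_{i=1}^{n} \lVert x_{i} - \bar{x} \rVert^{2}},$$

minimized over all 2-partitions of the rows of $X$. Small values indicate tight two-cluster structure, so the p-value is the probability, under the null, of a value at least this small.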
Here we present a short tour of SigClust. Since Liu et al. break the algorithm down into seven steps, we will devote a small section to each step, after making some initial definitions.
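Before walking through the individual steps, the overall procedure can be sketched in Python. This is a simplified, hypothetical sketch, not the paper's reference implementation: the `two_means` helper, the MAD-based background-noise estimate, the eigenvalue floor, and all parameter choices are illustrative assumptions.

```python
import numpy as np

def cluster_index(X, labels):
    """2-means cluster index: within-cluster sum of squares divided by
    the total sum of squares about the overall mean."""
    total = ((X - X.mean(axis=0)) ** 2).sum()
    within = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                 for k in np.unique(labels))
    return within / total

def two_means(X, n_restarts=10, seed=0):
    """Plain 2-means via Lloyd's algorithm with random restarts
    (hypothetical helper; any standard k=2 k-means would do)."""
    rng = np.random.default_rng(seed)
    best_ci, best_labels = np.inf, None
    for _ in range(n_restarts):
        centers = X[rng.choice(len(X), size=2, replace=False)]
        for _ in range(100):
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = dists.argmin(axis=1)
            new = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                            else centers[k] for k in range(2)])
            if np.allclose(new, centers):
                break
            centers = new
        ci = cluster_index(X, labels)
        if ci < best_ci:
            best_ci, best_labels = ci, labels
    return best_ci, best_labels

def sigclust_pvalue(X, n_sim=100, seed=0):
    """Simulation-based SigClust p-value sketch. Assumes a simple
    scheme: background noise variance sigma^2 is estimated from the
    MAD of all matrix entries, and sample covariance eigenvalues below
    sigma^2 are floored at sigma^2 to form the null diagonal covariance."""
    rng = np.random.default_rng(seed)
    # Background noise estimate (MAD scaled to the Gaussian case).
    sig2 = (np.median(np.abs(X - np.median(X))) / 0.6745) ** 2
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    eigvals = np.maximum(eigvals, sig2)
    ci_obs, _ = two_means(X)
    # Null: single Gaussian with the estimated diagonal covariance;
    # the p-value is the fraction of simulated indices at most ci_obs.
    sims = np.array([two_means(rng.normal(size=X.shape) * np.sqrt(eigvals))[0]
                     for _ in range(n_sim)])
    return (sims <= ci_obs).mean(), ci_obs
```

For example, on data drawn from two well-separated Gaussians, `sigclust_pvalue` returns a small p-value (the observed index is far below the simulated null indices), whereas on a single Gaussian sample the observed index is typical of the null draws.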
- ↑ Yufeng Liu, David Neil Hayes, Andrew Nobel, and J. S. Marron, \emph{Statistical Significance of Clustering for High-Dimension, Low-Sample Size Data}, Journal of the American Statistical Association, Vol. 103, No. 483 (Sep. 2008), pp. 1281–1293. See also http://arxiv.org/abs/1305.5879v2