PCA learning for sparse high-dimensional data
1 Department of Computer Science, University of Manchester, Kilburn Building, Oxford Rd., Manchester, M13 9PL, UK
2 Biochemistry Research Division, School of Biological Sciences, University of Manchester, Stopford Building, Oxford Rd., Manchester, M13 9PT, UK
Accepted: 20 January 2003
We study the performance of principal component analysis (PCA). In particular, we consider how many training pattern vectors are required to accurately represent the low-dimensional structure of the data. This problem is of particular relevance now that PCA is commonly applied to extremely high-dimensional (N up to 30000) real data sets produced by molecular-biology research projects. In these applications the number of patterns p is often orders of magnitude smaller than the data dimension N. We follow previous work and perform the analysis for p random patterns that are isotropically distributed except along a single symmetry-breaking direction. The standard mean-field theory for the performance of PCA is constructed by considering the thermodynamic limit N → ∞, with α = p/N fixed. For real data sets the strength of the symmetry breaking may increase with N, and therefore one must reconsider the accuracy of the mean-field theory. We show, using simulation results, that the mean-field theory remains accurate even when the strength of the symmetry breaking scales with N, and even for the small values of α that are more appropriate to real biological data sets.
PACS: 87.10.+e – General theory and mathematical aspects / 02.50.-r – Probability theory, stochastic processes, and statistics
© EDP Sciences, 2003
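The setting described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' code: patterns are taken to be zero-mean Gaussian with covariance I + A B Bᵀ, where the unit vector B is the symmetry-breaking direction and A its strength; no mean subtraction is applied; and the function names `simulate_overlap` and `large_n_overlap` are hypothetical. The comparison formula is the standard large-N result for this spiked-covariance model and is not quoted from the text above.

```python
import numpy as np


def simulate_overlap(N, p, A, seed=0):
    """Squared overlap between the leading principal component of p
    patterns in N dimensions and the true symmetry-breaking direction.

    Assumed model: x = z + (sqrt(1 + A) - 1) (z . B) B with z ~ N(0, I),
    which gives covariance I + A B B^T.
    """
    rng = np.random.default_rng(seed)

    # Random unit vector for the symmetry-breaking direction.
    B = rng.standard_normal(N)
    B /= np.linalg.norm(B)

    # Isotropic patterns, then stretch the component along B.
    Z = rng.standard_normal((p, N))
    proj = Z @ B
    X = Z + (np.sqrt(1.0 + A) - 1.0) * np.outer(proj, B)

    # For p << N the thin SVD is cheap; the first right singular vector
    # is the leading eigenvector of the sample covariance X^T X / p.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    J = Vt[0]
    return float((J @ B) ** 2)


def large_n_overlap(alpha, A):
    """Large-N squared overlap for the assumed spiked-covariance model,
    with alpha = p / N. It vanishes below the threshold alpha * A**2 = 1."""
    if alpha * A**2 <= 1.0:
        return 0.0
    return (alpha * A**2 - 1.0) / (A * (1.0 + alpha * A))


if __name__ == "__main__":
    N, p, A = 2000, 200, 20.0  # illustrative values, alpha = 0.1
    runs = [simulate_overlap(N, p, A, seed=s) for s in range(8)]
    print("simulated R^2 (mean over seeds):", np.mean(runs))
    print("large-N R^2:", large_n_overlap(p / N, A))
```

Averaging over several seeds reduces finite-size fluctuations, which are noticeable at these sizes. Repeating the comparison while letting A grow with N is one way to probe the regime discussed in the abstract.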