Volume 62, Number 1, April 2003
Page(s): 117 - 123
Section: Interdisciplinary physics and related areas of science and technology
Published online: 01 March 2003
PCA learning for sparse high-dimensional data
1 Department of Computer Science, University of Manchester, Kilburn Building, Oxford Rd., Manchester, M13 9PL, UK
2 Biochemistry Research Division, School of Biological Sciences, University of Manchester, Stopford Building, Oxford Rd., Manchester, M13 9PT, UK
Accepted: 20 January 2003
We study the performance of principal component analysis (PCA). In particular, we consider how many training pattern vectors are required to accurately represent the low-dimensional structure of the data. This problem is of particular relevance now that PCA is commonly applied to extremely high-dimensional real data sets (dimension N up to 30000) produced by molecular-biology research projects. In these applications the number of patterns p is often orders of magnitude smaller than the data dimension N. We follow previous work and perform the analysis for p random patterns that are isotropically distributed except along a single symmetry-breaking direction. The standard mean-field theory for the performance of PCA is constructed by considering the thermodynamic limit N → ∞, with α = p/N fixed. For real data sets the strength of the symmetry breaking may increase with N, and therefore one must reconsider the accuracy of the mean-field theory. We show, using simulation results, that the mean-field theory remains accurate even when the strength of the symmetry breaking scales with N, and even for the small values of α that are more appropriate to real biological data sets.
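The setting described in the abstract can be illustrated with a small numerical experiment. The following is a minimal sketch, not the authors' simulation code: it assumes Gaussian patterns with unit variance in every direction except one planted direction B, along which the variance is 1 + beta, and it measures the squared overlap between the leading principal component and B. The names `simulate_overlap`, `theory_overlap` and the parameter `beta` are hypothetical. The closed form used for comparison is the standard large-N result for this single-spike model, with α = p/N the ratio of patterns to dimensions; that it coincides with the mean-field theory referred to above is an assumption.

```python
import numpy as np


def simulate_overlap(N, p, beta, rng):
    """Squared overlap between the leading principal component and the
    planted direction B, for p Gaussian patterns in N dimensions.

    Assumed model: unit variance in all directions, plus an excess
    variance beta along the single symmetry-breaking direction B.
    """
    B = rng.standard_normal(N)
    B /= np.linalg.norm(B)                      # unit-length planted direction
    noise = rng.standard_normal((p, N))         # isotropic part of each pattern
    s = rng.standard_normal(p)                  # amplitude along B for each pattern
    X = noise + np.sqrt(beta) * s[:, None] * B[None, :]
    # The model has zero mean, so the patterns are not centred here.
    # The leading right singular vector of X is the top eigenvector of X^T X / p.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    J = Vt[0]
    return float(np.dot(J, B) ** 2)


def theory_overlap(alpha, beta):
    """Large-N squared overlap for the single-spike model (assumed to match
    the mean-field prediction): zero below the threshold alpha * beta**2 = 1."""
    if alpha * beta**2 <= 1.0:
        return 0.0
    return (alpha - 1.0 / beta**2) / (alpha + 1.0 / beta)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, p, beta = 500, 250, 4.0                  # illustrative values, alpha = 0.5
    print("simulated R^2:", simulate_overlap(N, p, beta, rng))
    print("large-N   R^2:", theory_overlap(p / N, beta))
```

Repeating the experiment with `beta` growing in proportion to N, at small α, gives a rough picture of the regime the abstract is concerned with.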
PACS: 87.10.+e – General theory and mathematical aspects / 02.50.-r – Probability theory, stochastic processes, and statistics
© EDP Sciences, 2003