Issue | Europhys. Lett., Volume 62, Number 1, April 2003
---|---
Page(s) | 117 - 123
Section | Interdisciplinary physics and related areas of science and technology
DOI | https://doi.org/10.1209/epl/i2003-00370-1
Published online | 01 March 2003
PCA learning for sparse high-dimensional data
1 Department of Computer Science, University of Manchester, Kilburn Building, Oxford Rd., Manchester, M13 9PL, UK
2 Biochemistry Research Division, School of Biological Sciences, University of Manchester, Stopford Building, Oxford Rd., Manchester, M13 9PT, UK
Received: 28 October 2002
Accepted: 20 January 2003
We study the performance of principal component analysis (PCA). In
particular, we consider the problem of how many training pattern
vectors are required to accurately represent the low-dimensional
structure of the data. This problem is of particular relevance
now that PCA is commonly applied to extremely high-dimensional
(N up to 30000) real data sets produced from
molecular-biology research projects. In these applications the
number of patterns p is often orders of magnitude less than the
data dimension N. We follow previous work and perform
the analysis in the context of p random patterns which are
isotropically distributed with the exception of a single
symmetry-breaking direction. The standard mean-field theory for
the performance of PCA is constructed by considering the
thermodynamic limit N → ∞, with α = p/N
fixed. For real data sets the strength of the symmetry breaking
may increase with N, and therefore one must reconsider the
accuracy of the mean-field theory. We show, using simulation
results, that the mean-field theory is still accurate even when
the strength of the symmetry breaking scales with N, and even
for small values of α that are more appropriate to real
biological data sets.
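The setting described in the abstract can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the authors' code: it assumes zero-mean Gaussian patterns, a Gaussian amplitude along the symmetry-breaking direction, and measures performance as the squared overlap between the leading principal component and that direction. The names `leading_pc_overlap`, `B`, and `strength` are hypothetical and do not come from the paper.

```python
import numpy as np


def leading_pc_overlap(N, p, strength, seed=0):
    """Squared overlap between the leading PC and a planted direction.

    Assumed model (an illustration, not taken from the paper):
    each pattern is isotropic Gaussian noise in N dimensions plus a
    Gaussian amplitude of standard deviation `strength` along a fixed
    unit vector B.
    """
    rng = np.random.default_rng(seed)

    # Planted unit-length symmetry-breaking direction.
    B = rng.standard_normal(N)
    B /= np.linalg.norm(B)

    # p patterns: isotropic noise plus a random amplitude along B.
    noise = rng.standard_normal((p, N))
    amplitude = strength * rng.standard_normal(p)
    X = noise + np.outer(amplitude, B)

    # The model is zero-mean by construction, so no centering is applied.
    # For p << N it is cheaper to diagonalise the p x p Gram matrix than
    # the N x N covariance matrix; they share their nonzero eigenvalues.
    gram = X @ X.T / p
    _, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    u = eigvecs[:, -1]                 # top eigenvector in pattern space

    # Map back to data space to obtain the leading principal component.
    v = X.T @ u
    v /= np.linalg.norm(v)

    return float((v @ B) ** 2)


if __name__ == "__main__":
    N, p = 2000, 100  # alpha = p / N = 0.05
    for strength in (0.0, 2.0, np.sqrt(N)):
        r2 = leading_pc_overlap(N, p, strength)
        print(f"strength={strength:8.2f}  squared overlap={r2:.3f}")
```

A squared overlap near zero means the leading component carries no information about the symmetry-breaking direction, while a value near one means it is recovered almost exactly. Letting `strength` grow with N, as in the last case of the loop, mimics the scaling regime the abstract refers to.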
PACS: 87.10.+e – General theory and mathematical aspects / 02.50.-r – Probability theory, stochastic processes, and statistics
© EDP Sciences, 2003