*Europhys. Lett.*, **62**(1), pp. 117-123 (2003)

## PCA learning for sparse high-dimensional data

^{1} Department of Computer Science, University of Manchester, Kilburn Building, Oxford Rd., Manchester, M13 9PL, UK

^{2} Biochemistry Research Division, School of Biological Sciences, University of Manchester, Stopford Building, Oxford Rd., Manchester, M13 9PT, UK

Received: 28 October 2002

Accepted: 20 January 2003

We study the performance of principal component analysis (PCA). In
particular, we consider the problem of how many training pattern
vectors are required to accurately represent the low-dimensional
structure of the data. This problem is of particular relevance
now that PCA is commonly applied to extremely high-dimensional
(*N* up to 30000) real data sets produced from
molecular-biology research projects. In these applications the
number of patterns *p* is often orders of magnitude less than the
data dimension *N*. We follow previous work and perform
the analysis in the context of *p* random patterns which are
isotropically distributed with the exception of a single
symmetry-breaking direction. The standard mean-field theory for
the performance of PCA is constructed by considering the
thermodynamic limit *N* → ∞, with *α* = *p*/*N*
fixed. For real data sets the strength of the symmetry breaking
may increase with *N*, and therefore one must reconsider the
accuracy of the mean-field theory. We show, using simulation
results, that the mean-field theory is still accurate even when
the strength of the symmetry breaking scales with *N*, and even
for small values of *α* that are more appropriate to real
biological data sets.
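
The setting described above can be illustrated with a small numerical experiment. The following is a minimal sketch, not the authors' simulation code: it assumes Gaussian patterns with unit variance in every direction and extra variance `strength` along one unit vector `B`, and it measures the squared overlap between the leading principal component and `B`. The function name `leading_overlap` and all parameter names are hypothetical, chosen for illustration only.

```python
import numpy as np

def leading_overlap(N, p, strength, seed=0):
    """Squared overlap between the leading principal direction and the
    symmetry-breaking direction B, for p patterns in N dimensions.

    Assumption (illustrative only): patterns are zero-mean Gaussian with
    covariance I + strength * B B^T, so no explicit centering is applied.
    """
    rng = np.random.default_rng(seed)

    # Random unit vector playing the role of the symmetry-breaking direction.
    B = rng.standard_normal(N)
    B /= np.linalg.norm(B)

    # Isotropic part of the patterns, shape (p, N).
    X = rng.standard_normal((p, N))

    # Add extra variance `strength` along B.
    s = rng.standard_normal(p)
    X += np.sqrt(strength) * s[:, None] * B[None, :]

    # For p << N it is cheaper to diagonalise the p x p Gram matrix
    # than the N x N sample covariance; both share nonzero eigenvalues.
    G = X @ X.T / N
    evals, evecs = np.linalg.eigh(G)  # eigenvalues in ascending order
    u = evecs[:, -1]                  # eigenvector of the largest eigenvalue

    # Map back to the N-dimensional leading principal direction.
    J = X.T @ u
    J /= np.linalg.norm(J)

    return float((J @ B) ** 2)

if __name__ == "__main__":
    N, p = 2000, 100  # alpha = p / N = 0.05
    for strength in (0.0, 10.0, 100.0, 1000.0):
        print(strength, leading_overlap(N, p, strength))
```

A squared overlap near 1 means the leading principal component has found the symmetry-breaking direction, while a value near 0 means it has not. Repeating the run with `strength` growing in proportion to `N` at fixed `p / N` gives a rough feel for the regime the abstract discusses.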

PACS: 87.10.+e – General theory and mathematical aspects / 02.50.-r – Probability theory, stochastic processes, and statistics

*© EDP Sciences, 2003*