PCA learning for sparse high-dimensional data

D. C. Hoyle; M. Rattray

doi:10.1209/epl/i2003-00370-1

Issue: Europhys. Lett., Volume 62, Number 1, April 2003
Page(s): 117-123
Section: Interdisciplinary physics and related areas of science and technology
DOI: https://doi.org/10.1209/epl/i2003-00370-1
Published online: 01 March 2003

D. C. Hoyle¹,² and M. Rattray¹

¹ Department of Computer Science, University of Manchester, Kilburn Building, Oxford Rd., Manchester, M13 9PL, UK
² Biochemistry Research Division, School of Biological Sciences, University of Manchester, Stopford Building, Oxford Rd., Manchester, M13 9PT, UK

Received: 28 October 2002
Accepted: 20 January 2003

Abstract

We study the performance of principal component analysis (PCA). In particular, we consider the problem of how many training pattern vectors are required to accurately represent the low-dimensional structure of the data. This problem is of particular relevance now that PCA is commonly applied to extremely high-dimensional ($N\simeq 5000$–$30000$) real data sets produced from molecular-biology research projects. In these applications the number of patterns p is often orders of magnitude less than the data dimension ($p\ll N$). We follow previous work and perform the analysis in the context of p random patterns which are isotropically distributed with the exception of a single symmetry-breaking direction. The standard mean-field theory for the performance of PCA is constructed by considering the thermodynamic limit $N\rightarrow\infty$, with $\alpha = p/N$ fixed. For real data sets the strength of the symmetry breaking may increase with N, and therefore one must reconsider the accuracy of the mean-field theory. We show, using simulation results, that the mean-field theory is still accurate even when the strength of the symmetry breaking scales with N, and even for small values of α that are more appropriate to real biological data sets.
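
The setting described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code; it is a minimal illustration under stated assumptions. It assumes Gaussian patterns of the form x = ξ + √(signal_var)·b·B, where ξ is isotropic unit-variance noise, b is a standard normal amplitude, and the symmetry-breaking direction B is taken to be the first coordinate axis. The names `sample_patterns`, `leading_component`, `squared_overlap`, `asymptotic_squared_overlap` and the parameter `signal_var` are hypothetical. The closed-form expression in `asymptotic_squared_overlap` is the large-N result commonly quoted for this single-direction model; it is included here as an assumption for comparison and is not taken from the abstract.

```python
import math
import random


def sample_patterns(n_dim, n_patterns, signal_var, rng):
    """Draw patterns x = xi + sqrt(signal_var) * b * B, with B = e_1 (assumed).

    xi is isotropic unit-variance Gaussian noise and b ~ N(0, 1), so the
    population covariance is I + signal_var * B B^T.
    """
    amp = math.sqrt(signal_var)
    patterns = []
    for _ in range(n_patterns):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_dim)]
        x[0] += amp * rng.gauss(0.0, 1.0)  # add the symmetry-breaking signal
        patterns.append(x)
    return patterns


def leading_component(patterns, rng, n_iter=200):
    """Leading eigenvector of X^T X by power iteration.

    The patterns are zero-mean by construction, so no centring is applied.
    """
    n_dim = len(patterns[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n_dim)]
    for _ in range(n_iter):
        # proj = X v, then w = X^T proj, without forming the N x N matrix
        proj = [sum(xi * vi for xi, vi in zip(x, v)) for x in patterns]
        w = [0.0] * n_dim
        for x, c in zip(patterns, proj):
            for i, xi in enumerate(x):
                w[i] += c * xi
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    return v


def squared_overlap(n_dim, alpha, signal_var, seed=0):
    """Squared overlap between the first principal component and B."""
    rng = random.Random(seed)
    n_patterns = max(1, round(alpha * n_dim))  # alpha = p / N
    patterns = sample_patterns(n_dim, n_patterns, signal_var, rng)
    v = leading_component(patterns, rng)
    return v[0] ** 2  # B = e_1, so the overlap is the first component


def asymptotic_squared_overlap(alpha, signal_var):
    """Commonly quoted large-N squared overlap for this model (assumption)."""
    if signal_var <= 0.0 or alpha * signal_var ** 2 <= 1.0:
        return 0.0  # below the threshold PCA does not find the direction
    return (alpha * signal_var ** 2 - 1.0) / (
        signal_var * (alpha * signal_var + 1.0)
    )


if __name__ == "__main__":
    for alpha in (0.25, 0.5, 1.0, 2.0):
        sim = squared_overlap(n_dim=40, alpha=alpha, signal_var=9.0, seed=1)
        theory = asymptotic_squared_overlap(alpha, 9.0)
        print(f"alpha={alpha:4.2f}  simulated={sim:.3f}  asymptotic={theory:.3f}")
```

To mimic the case where the strength of the symmetry breaking grows with N, `signal_var` can be passed as a function of `n_dim`. Which scaling to use is a modelling choice that the abstract does not specify. Pure-Python power iteration keeps the sketch dependency-free but is only practical for small N; a linear-algebra library would be the natural choice at the dimensions quoted above.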

PACS: 87.10.+e – General theory and mathematical aspects / 02.50.-r – Probability theory, stochastic processes, and statistics
