Computational Space Reduction and Parallelization of
a New Clustering Approach for Large Groups of Sequences
Trelles O.(1), Andrade M.A.(2), Valencia A.(2), Zapata E.L.(1), and Carazo
J.M.(1,3,*)
(1) Computer Architecture Department, University of Malaga, 29017 Malaga, Spain
(2) Protein Design Unit, Centro Nacional de Biotecnologia, 28049 Madrid, Spain
(3) Biocomputing Unit, Centro Nacional de Biotecnologia, 28049 Madrid, Spain
(*) To whom correspondence should be addressed
Fax +(34) 1 585 45 06
Phone +(34) 1 585 45 43
E-mail carazo@samba.cnb.uam.es
Running Title: Computational space reduction for sequence clustering
Key words: Clustering of sequences, Parallel Computing.
Abstract.
Motivation:
The explosive growth of biological sequence databases, stimulated by the
genome projects, has changed the framework of several applications in the
area of biological sequence analysis. In most cases the new scenario consists
of large sets of sequences, which call for effective and automatic methods
for clustering the sequences so that common reference schemes can be applied
to the resulting groups.
Results:
In this work we present a new strategy to reduce the computational cost
associated with clustering large sets of sequences that are expected to
contain several families. The strategy is based on grouping the sequences
into families by applying several thresholds to a pairwise sequence
similarity criterion. Routine clustering of whole databases can now be done
very efficiently, providing a compact representation of the existing
relationships within a database. The method developed here achieves a
computational space reduction of about an order of magnitude over more
traditional ones, while producing family groupings and alignments that
closely reproduce already accepted biological results. Our work includes a
parallel implementation for distributed-memory multiprocessors with a dynamic
scheduling strategy for performance optimization.
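As an illustration only, the idea of grouping sequences into families with
several thresholds on a pairwise similarity score can be sketched as follows.
This is a minimal sketch under stated assumptions, not the method implemented
in this work: the linkage rule (connected components of the "similar enough"
relation), the toy score, and the names group_by_threshold and toy_similarity
are all hypothetical.

```python
from itertools import combinations


def group_by_threshold(names, similarity, threshold):
    """Group items whose pairwise similarity reaches `threshold`.

    Assumption: a group is a connected component of the relation
    "similarity >= threshold" (single linkage). The abstract does not
    state the linkage rule actually used.
    """
    parent = {name: name for name in names}

    def find(x):
        # Walk up to the representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


def toy_similarity(a, b):
    """Toy score: fraction of identical positions in equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical toy data; real input would be scored by sequence alignment.
sequences = ["AAAA", "AAAT", "AATT", "GGTT", "GGGG"]
families = {t: group_by_threshold(sequences, toy_similarity, t)
            for t in (0.75, 0.5)}
```

In this toy case the stricter threshold (0.75) yields three groups, while the
looser one (0.5) merges all five sequences into a single group, which shows
how the choice of threshold controls the granularity of the families.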
Availability:
By anonymous ftp at ftp.ac.uma.es (/pub/ots/pCluster directory),
or from our web site at
http://www.cnb.uam.es/www/software
Contact: e-mail carazo@samba.cnb.uam.es