Computational Space Reduction and Parallelization of a New Clustering Approach for Large Groups of Sequences

Trelles O.(1), Andrade M.A.(2), Valencia A.(2), Zapata E.L.(1), and Carazo J.M.(1,3,*)

(1) Computer Architecture Department, University of Malaga, 29017 Malaga, Spain
(2) Protein Design Unit, Centro Nacional de Biotecnologia, 28049 Madrid, Spain
(3) Biocomputing Unit, Centro Nacional de Biotecnologia, 28049 Madrid, Spain
(*) To whom correspondence should be addressed
         Fax     +(34) 1 585 45 06
         Phone   +(34) 1 585 45 43
         E-mail  carazo@samba.cnb.uam.es

Running Title: Computational space reduction for sequence clustering
Key words: Clustering of sequences, Parallel Computing.

Abstract.

Motivation:

The explosive growth of biological sequence databases, stimulated by the genome projects, has changed the framework of several applications in biological sequence analysis. In most cases this new scenario involves large sets of sequences, which calls for effective, automatic methods for clustering sequences so that common reference schemes can be applied to the resulting groups.

Results:

In this work we present a new strategy to reduce the computational cost associated with clustering large sets of sequences that are expected to contain several families. The strategy is based on grouping the sequences into families by applying several thresholds to a pairwise sequence similarity criterion. Routine clustering of whole databases can now be done very efficiently, providing a very compact representation of the relationships existing within a database. The method developed here reduces the computational space by about an order of magnitude compared with more traditional methods, while producing family groupings and alignments that closely reproduce already accepted biological results. Our work includes a parallel implementation for distributed-memory multiprocessors with a dynamic scheduling strategy for performance optimization.
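The idea of grouping sequences into families with several similarity thresholds can be illustrated with a minimal Python sketch. It is not the implementation described in this work: the similarity function (fraction of identical positions), the single-linkage grouping rule, the threshold values, and the toy sequences are all assumptions chosen only to make the example self-contained.

```python
from itertools import combinations


def similarity(a, b):
    """Toy pairwise similarity (assumption): identical positions / longer length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def group_by_threshold(seqs, threshold):
    """Join sequences whose similarity is >= threshold (single linkage, assumption).

    Returns a sorted list of groups, each a list of sequence indices.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if similarity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical toy data: two small "families" of related sequences.
seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGHLKV", "MNPQRSTVWY", "MNPQRSTVWF"]

# Apply several thresholds, from strict to permissive.
families_by_threshold = {t: group_by_threshold(seqs, t) for t in (0.95, 0.9, 0.5)}

for t, families in families_by_threshold.items():
    print(t, families)
```

With these toy sequences, the strictest threshold leaves every sequence in its own group, while the two looser thresholds recover the two families. The sketch still compares all pairs of sequences and therefore does not show the computational space reduction itself.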

Availability:

By anonymous ftp at ftp.ac.uma.es (/pub/ots/pCluster directory),
or from our web-site http://www.cnb.uam.es/www/software

Contact: e-mail carazo@samba.cnb.uam.es