\name{cmp.cluster}
\alias{cmp.cluster}
\title{cluster compounds using a descriptor database}
\description{
	'cmp.cluster' uses compound descriptors in a database and clusters these
		compounds based on their pairwise distances. 'cmp.cluster' uses
		single linkage to measure distance between clusters when it
		merges clusters. 'cmp.cluster' accepts both a single cutoff and a
		cutoff vector. By using a cutoff vector, it can generate the
		same result as hierachical clustering.
}
\usage{
cmp.cluster(db, cutoff, is.similarity = TRUE, save.distances = FALSE,
        use.distances = NULL, quiet = FALSE, ...)
}
\arguments{
  \item{db}{The desciptor database, in the format returned by 'cmp.parse'.}
  \item{cutoff}{The clustering cutoff. Can be a single value or a vector. The
      cutoff gives the maximum distance between two compounds in order to
      group them in the same clsuter.}
  \item{is.similarity}{Set when the cutoff supplied is a similarity cutoff.
      This cutoff is the mimumum similarity value between two compounds such
      that they will be grouped in the same cluster.}
  \item{save.distances}{whether to save distance for future clustering. See
	  details below.}
  \item{use.distances}{Supply pre-computed distance matrix.}
  \item{quiet}{Whether to supress the progress information.}
  \item{...}{Further arguments to be passed to 'cmp.similarity' to calculate similarities if necessary.}
}
\details{
    'cmp.cluster' will compute distance on the fly if 'use.distances' is not set.
    Furthermore, if 'save.distances' is not set, the distance will never be
    stored and distance between any two compounds is guaranteed not to be
    computed twice. Using this method, 'cmp.cluster' can deal with large database,
    when a distance matrix in memory is not feasible. The speed of this cluster
    function should be slowed because of using this transient distance value.
	
    When 'save.distances' is set, 'cmp.cluster' will be forced to compute the
    distance matrix and save it in memory before doing clustering. This is
    useful when you need to do further clustering in the future and do not want
    the distance to be re-computed then. Set 'save.distances' to TRUE if you
    only want to force the clustering to use this 2-step approach; otherwise,
    set it to the filename under which you want the distance matrix to be
    saved. After you save it, when you need to reuse the distance matrix, you
    can 'load' it, and supply to 'cmp.cluster' via the 'use.distances' argument.

    'cmp.cluster' supports vector of cutoffs. When you have multiple cutoffs,
    'cmp.cluster' still guarantees that pairwise distances will never be
    recomputed, and no copy of distances is kept in memory. It is guaranteed to
    be as fast as calling 'cmp.cluster' with a single cutoff that results in the
    longest processing time, plus some small overhead linear in that processing
    time.
}
\value{
    Returns a data frame. Besides a variable giving compound ID, each of the
    other variables in the data frame will either give the cluster IDs of
    compounds under some clustering cutoff, or the size of clusters that the
    compounds belong to. When N cutoffs are given, in total 2*N+1 variables
    will be generated, with N of them giving the cluster ID of each compound
    under each of the N cutoffs, and the other N of them giving the cluster
    size under each of the N cutoffs. The rows are sorted by the cluster sizes.
}
\author{Y. Eddie Cao, Li-Chang Cheng}
\seealso{\code{\link{cmp.parse1}}, \code{\link{cmp.parse}}, \code{\link{cmp.search}}, \code{\link{cmp.similarity}}}
\examples{
## Load sample SD file
# data(sdfsample); sdfset <- sdfsample

## Generate atom pair descriptor database for searching
# apset <- sdf2ap(sdfset) 

## Loads same atom pair sample data set provided by library
data(apset) 
db <- apset

## cluster using multiple cutoffs
clusters <- cmp.cluster(db, cutoff=c(0.5, 0.85))

## or save the distance before clustering:
clusters <- cmp.cluster(db, cutoff=0.65, save.distances="distmat.rda")
# later, you can load the matrix and pass it to do clustering. Load will load
# the variable 'distmat' that contains the distance matrix
load("distmat.rda")
clusters <- cmp.cluster(db, cutoff=0.60, use.distances=distmat)
}
\keyword{utilities}