sClust R Package

The R a package contains a rather complete set of different classification algorithms.

Among these different functions you will find bi-parted or k-class or multi-level/hierarchical classification functions. You are free to choose how to use each of the functions. In clustering, no algorithm will be universally better, you will probably have to use several functions before you get the desired result.

Common information

Most of the functions in this package need a similarity matrix calculated from the data. If needed, in this package, is provided two functions:

compute.similarity.gaussian : this function computes the gaussian similarity
compute.similarity.ZP : this function computes the similarity according to the method of Zelnik and Perona

The resulting matrix is of size N x N with N = the number of observations of the input data.

Moreover, these matrices must be positive definite or at least positive semidefinite to be used by the algorithms. For this purpose another function is provided to perform this preliminary check:

checking.gram.similarityMatrix : This function will check if the matrix is correctly defined and if not it will multiply the matrix by its transpose in order to make it positive semi-defined

The outputs and inputs of the different functions have been harmonized in order to make the different functions as exchangeable as possible

Bi-parted methods

Bi-partition algorithms will split the datasets into 2 distinct classes. In this package two algorithms have been implemented: * Shi and Malik bi-parted spectral clustering * Perona and Freeman bi-parted clustering

These functions are then immediately efficient on datasets to be split in two. However, most of the time you would want to split into more clusters. It is then possible to use the algorithm recursion method which is provided in the package. Moreover, it is mentioned in the paper of Shi and Malik that it is better to use their algorithm recursively.

Shi and Malik bi-parted spectral clustering

This classification computes the Laplacian matrix. Then extracts the second smallest eigenvector. From this eigenvector, depending on the sign, the data will be classified into two different clusters.

This function takes 3 inputs arguments : * W : a similarity matrix computed from the data. (N*N matrix) * flagDiagZero : If true, sets the diagonal of the similarity matrix to 0 (Boolean) * verbose: If true, outputs in the console the different debugging information (Boolean)

This function returns a list containing : * cluster : a vector containing all the labels. (vector of size N) * eigenVect : a matrix containing the different eigenvectors resulting from the Laplacian matrix. * eigenVal : A vector containing the eigenvalues of the Laplacian matrix.

Perona and Freeman bi-parted clustering

This classification extracts the first smallest eigenvector and the largest eigenvalue from the similarity matrix. From this eigenvector, depending on the sign, the data will be classified into two different clusters.

This function takes 3 inputs arguments : * W : a similarity matrix computed from the data. (N*N matrix) * flagDiagZero : If true, sets the diagonal of the similarity matrix to 0 (Boolean) * verbose: If true, outputs in the console the different debugging information (Boolean)

This function returns a list containing : * cluster : a vector containing all the labels. (vector of size N) * eigenVect : a matrix containing the different eigenvectors resulting from the similarity matrix. * eigenVal : A vector containing the eigenvalues of the similarity matrix.

K clusters methods

The following algorithms will split the datasets into K distinct clusters.

It is therefore necessary to provide as input the number of class K that we wish to obtain. If you do not know it, two functions are available to determine it automatically.

The first function : compute.kclust. The first function : compute.kclust. This function determines the number of classes automatically as per Kelly Grassi's thesis.

The second function : compute.kclust2. This is an improved version of compute.kclust which allows to have an PEV method closer to reality but still with permissiveness if needed.

These functions have the same input variables: * eigenValues: the eigenvalues of the Laplacian matrix. (Vector) * method : the method that will be used. "default" to let the function choose the most appropriate method. "PEV" for the principal eigenvalues method. "GAP" for the GAP method. (String) * Kmax : the maximum number of clusters that is allowed. (Integer) * Tolerance : the tolerance allowed for the principal eigenvalue method. (Double) * threshold : the threshold of selection of the dominant eigenvalue for the GAP method. (Double) * verbose : If true, outputs in the console the different debugging information (Boolean)

And the same output: * the number of clusters K to be cut (Integer)

For the following clustering fuction, they have the same inputs and outuputs (except Spectral Partition Around Medoïds clustering which have a different output) :

The K clusters methods functions takes 4 inputs arguments : * W : a similarity matrix computed from the data. (N*N matrix) * K : the number of cluster to compute. (Integer) * flagDiagZero : If true, sets the diagonal of the similarity matrix to 0 (Boolean) * verbose: If true, outputs in the console the different debugging information (Boolean)

And these functions returns a list containing : * cluster : a vector containing all the labels. (vector of size N) * eigenVect : a matrix containing the different eigenvectors resulting from the Laplacian matrix. * eigenVal : A vector containing the eigenvalues of the Laplacian matrix.

Spectral Partition Around Medoïds clustering

This algorithm, like the others, compute the spectral space thanks to a laplacian matrix. After, in the eigenspace, the clustering is made with the k-medoids algorithm. This method try to minimize the distance between the points labelled as part of a cluster and a point designated to be the center of this cluster.

This function have a slightly different output, it's a list containing : * cluster : a vector containing all the labels. (vector of size N) * eigenVect : a matrix containing the different eigenvectors resulting from the Laplacian matrix. * eigenVal : A vector containing the eigenvalues of the Laplacian matrix. * eigenVectNorm : a matrix containing the normalized eigenvectors that are used for the clustering.

Unormalized clustering

This is the most classical spectral clustering method. It just consists in the projection of the data in the eigenspace formed by the eigenvectors of the non-normalized Laplacian matrix. Then the classification using the k-means algorithm.

Von Luxburg clustering

The clustering described by Von Luxburg applies a modified version of Shi and Malik's clustering generalized for K clusters.

Hierarchical spectral clustering

Hierarchical spectral clustering allows to apply the hierarchical clustering algorithm (ascending or descending) in the spectral space.

Multi-levels / hierarchical methods

Multi-level algorithms differ from other algorithms because they do not cluster on a single level. That is, instead of directly classifying the data : a first classification is done, then a second one, etc. This is done until a non-cut criterioa is reached or a cut criteria is no longer met.

The following functions takes 4 inputs arguments : * X : a dataframe containing all the dataset. (N x M dataframe) * levelMax : the maximum depth level. (Integer) * silMin : the minimum silhouette allowed for the criteria. (Double) * vois : the number of neighbors used for calculate the similarity of Zelnik and Perona (Integer) * flagDiagZero : if true, sets the diagonal of the similarity matrix to 0 (Boolean) * method : the method that will be used. "default" to let the function choose the most appropriate method. "PEV" for the principal eigenvalues method. "GAP" for the GAP method. (String) * Kmax : the maximum number of clusters that is allowed. (Integer) * tolerence : the tolerance allowed for the principal eigenvalue method. (Double) * threshold : the threshold of selection of the dominant eigenvalue for the GAP method. (Double) * minPoint : the minimum number of points allowed to compute a cluster. (Integer) * verbose: if true, outputs in the console the different debugging information (Boolean)

these functions returns : * a dataframe containing all the labels for each levels.

Multi-levels clustering

Fast multi-levels clustering

Other functions

The other functions that are presented in this part are simply functions that can be useful to go further in clustering.

Fast function

The Fast method allows to make any of the previously described algorithms (except the multi-level methods) faster for large datasets. To do so, a quantization using k-means is performed. This reduces the initial dataset. Then, points are added (from the initial data) in order to respect the density of the initial data. Then, clustering is performed on this reduced dataset. In order to apply this result to the initial data, the K-nearest Neighbors method is applied.

This function takes at least 6 inputs arguments : * dataFrame : a dataframe containing all the dataset. (N x M dataframe) * smplPoint : maximum of sample number for reduction. (Integer) * stopCriteria : criteria for minimizing intra-group distance and select final smplPoint. (Double) * neighbours : number of points that will be selected for the similarity computation. (Integer) * similarity : if True, will use the similarity matrix for the clustering function. (Boolean) * clustFunc : the clustering function to apply on data. (Function) * ... : other arguments that are useful for the clustering function.

This function returns a list containing : * results : clustering results. (List) * sample : dataframe containing the sample used (Dataframe) * quantLabels : quantization labels. (Vector) * clustLabels : results labels. (Vector) * kmeans : kmeans quantization results. (List)

Recursive function

The Recursive method allows to transform any of the previously described algorithms (except the multi-level methods) into a multi-level method. But this function doesn't implement a non-cut criteria. The function will cut until the maximum level (or until the PEV can't detect more than one new cluster to cut).

This function takes at least 6 inputs arguments : * dataFrame : a dataframe containing all the dataset. (N x M dataframe) * levelMax : the maximum depth level. (Integer) * clustFunc : the clustering function to apply on data. (Function) * similarity : if True, will use the similarity matrix for the clustering function. (Boolean) * vois : number of points that will be selected for the similarity computation. (Integer) * flagDiagZero : if true, sets the diagonal of the similarity matrix to 0 (Boolean) * biparted : if True, the function will not automatically choose the number of clusters to compute. (Boolean) * method : the method that will be used. "default" to let the function choose the most suitable method. "PEV" for the Principal EigenValue method. "GAP" for the GAP method. (String) * tolerence : the tolerance allowed for the Principal EigenValue method. * threshold : the threshold to select the dominant eigenvalue for the GAP method. * minPoint : the minimum number of points required to compute a cluster. * verbose : if true, outputs in the console the different debugging information (Boolean) * ... : other arguments that are useful for the clustering function.

This function returns a list containing : * results : clustering results. (List) * sample : dataframe containing the sample used (Dataframe) * quantLabels : quantization labels. (Vector) * clustLabels : results labels. (Vector) * kmeans : kmeans quantization results. (List)