Parameters

The clustering properties are set in an LMCLUSParameters instance, which is defined as follows:

type LMCLUSParameters
    min_dim::Int                     # Minimum cluster dimension
    max_dim::Int                     # Maximum cluster dimension
    number_of_clusters::Int          # Nominal number of resulting clusters
    hist_bin_size::Int               # Fixed number of bins for the distance histogram.
    min_cluster_size::Int            # Minimum cluster size
    best_bound::Float64              # Best bound
    error_bound::Float64             # Error bound
    max_bin_portion::Float64         # Maximum histogram bin size
    random_seed::Int64               # Random seed
    sampling_heuristic::Int          # Sampling heuristic
    sampling_factor::Float64         # Sampling factor
    histogram_sampling::Bool         # Sample points for distance histogram
    zero_d_search::Bool              # Enable zero-dimensional manifold search
    basis_alignment::Bool            # Manifold cluster basis alignment
    dim_adjustment::Bool             # Manifold dimensionality adjustment
    dim_adjustment_ratio::Float64    # Ratio of manifold principal subspace variance
    mdl::Bool                        # Enable MDL heuristic
    mdl_model_precision::Int         # MDL model precision encoding constant
    mdl_data_precision::Int          # MDL data precision encoding constant
    mdl_quant_error::Float64         # Quantization error of a bin size calculation
    mdl_compres_ratio::Float64       # Cluster compression ratio
    log_level::Int                   # Log level (0-5)
end

Here is a description of algorithm parameters and their default values:

name	description	default
min_dim	Low bound of a cluster manifold dimension.	`1`
max_dim	High bound of a cluster manifold dimension. It cannot be larger than the dimensionality of the dataset.
number_of_clusters	Expected number of clusters. Required for the sampling heuristics.	`10`
hist_bin_size	Number of bins for the distance histogram. If this parameter is set to zero, the number of bins in the distance histogram is determined by the `max_bin_portion` parameter.	`0`
min_cluster_size	Minimum size of a collection of data points to be considered as a proper cluster.	`20`
best_bound	The separation best bound is used to evaluate the goodness of separation, which is characterized by the discriminability and the depth between modes of a distance histogram.	`1.0`
error_bound	The sampling error bound determines the minimum number of samples required to correctly identify a linear manifold cluster.	`1e-4`
max_bin_portion	Maximum histogram bin size. It determines the number of bins in the distance histogram when `hist_bin_size` is set to zero. The value should be selected from the (0,1) range.	`0.1`
random_seed	Random number generator seed. If not specified, the RNG is reinitialized at every run.	`0`
sampling_heuristic	The choice of sampling heuristic: (1) a probabilistic heuristic that samples a quantity exponential in the `max_dim` and `number_of_clusters` parameters; (2) a fixed number of points; (3) the lesser of the previous two.	`3`
sampling_factor	Sampling factor used in the sampling heuristics (see above, options 2 & 3) to determine the number of samples as a percentage of the total dataset size.	`0.01`
histogram_sampling	Turns on sampling for the distance histogram. Instead of computing the distance histogram from the whole dataset, the algorithm draws a small sample for the histogram construction, thus improving its performance. This parameter depends on the `number_of_clusters` value.	`false`
zero_d_search	Turn on/off zero dimensional manifold search.	`false`
basis_alignment	Turn on/off the alignment of a manifold cluster basis. If it is on, the manifold basis of the generated cluster is aligned along the direction of the maximum variance (by performing PCA).	`false`
dim_adjustment	Turn on/off linear manifold cluster dimensionality detection, which looks at the portion of the variance associated with the principal components.	`false`
dim_adjustment_ratio	Ratio of manifold principal subspace variance.	`0.99`
mdl	Turn on/off minimum description length heuristic for a complexity validation of a generated cluster.	`false`
mdl_model_precision	MDL model precision encoding value.	`32`
mdl_data_precision	MDL data precision encoding value.	`16`
mdl_quant_error	Quantization error of the bin size calculation for the histogram that is used to determine the entropy value of the empirical distance distribution.	`1e-4`
mdl_compres_ratio	Compression threshold value for discarding candidate clusters.	`1.05`
log_level	Logging level (ranges from 0 to 5).	`0`
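
The interaction between `sampling_heuristic`, `sampling_factor`, `error_bound`, `max_dim` and `number_of_clusters` can be sketched as follows. This is a minimal illustration and not the package's own code: the function `sample_count` is hypothetical, and the probabilistic estimate assumes equally sized clusters, so that a draw of `max_dim + 1` points lands in a single cluster with probability `(1/number_of_clusters)^max_dim`.

```julia
# Hypothetical helper for illustration only; it is not part of the LMCLUS API.
function sample_count(heuristic::Int, max_dim::Int, number_of_clusters::Int,
                      error_bound::Float64, sampling_factor::Float64, n_points::Int)
    # Assumed model: clusters are equally sized, so after the first point is drawn,
    # each of the remaining max_dim points matches its cluster with probability 1/K.
    p = (1.0 / number_of_clusters)^max_dim
    # Trials needed so that the chance of never drawing such a sample is <= error_bound.
    # log1p(-p) computes log(1 - p) accurately when p is tiny.
    probabilistic = log(error_bound) / log1p(-p)
    # Fixed share of the dataset.
    fixed = sampling_factor * n_points
    if heuristic == 1
        return ceil(Int, probabilistic)
    elseif heuristic == 2
        return ceil(Int, fixed)
    elseif heuristic == 3
        return ceil(Int, min(probabilistic, fixed))
    end
    throw(ArgumentError("sampling_heuristic must be 1, 2 or 3"))
end

# Default values from the table above, with an assumed dataset of 1000 points.
for d in (2, 5, 10)
    counts = [sample_count(h, d, 10, 1e-4, 0.01, 1000) for h in 1:3]
    println("max_dim = $d: ", counts)
end
```

Under these assumptions, the count for heuristic 1 grows exponentially with `max_dim`, while heuristic 3 never exceeds the fixed share of the dataset.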

Suggestions

Particular settings can affect the performance of the algorithm:

If you want reproducible clustering results, fix the random_seed parameter. By default, the RNG is reinitialized every time the algorithm runs.
If the dimensionality of the data is low, histogram sampling can speed up calculations.
Value 1 of the sampling_heuristic parameter should not be used if the max_dim parameter is large, as it will generate a very large number of samples.
Increasing the value of the max_bin_portion parameter can improve the efficiency of the cluster partitioning, but it can also degrade the overall performance of the algorithm.

Parallelization

This implementation of the LMCLUS algorithm uses parallel computations during the manifold sampling stage. You need to add additional workers before executing the algorithm.
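
A minimal sketch of adding workers with Julia's standard `addprocs` function. The worker count of 2 is arbitrary, and the `using Distributed` line applies to Julia 0.7 and later (in earlier versions `addprocs` is available from Base). Loading the package on the workers is shown as a comment because it assumes the LMCLUS package is installed.

```julia
using Distributed   # needed on Julia 0.7+; omit on older versions

addprocs(2)         # start two local worker processes
println("workers available: ", nworkers())

# Make the package available on every process before clustering
# (assumes the LMCLUS package is installed):
# @everywhere using LMCLUS
```

Alternatively, workers can be requested at startup with `julia -p 2`.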