Parameters
The clustering properties are set in an LMCLUSParameters instance.
The type is defined as follows:

    type LMCLUSParameters
        min_dim::Int                  # Minimum cluster dimension
        max_dim::Int                  # Maximum cluster dimension
        number_of_clusters::Int       # Nominal number of resulting clusters
        hist_bin_size::Int            # Fixed number of bins for the distance histogram
        min_cluster_size::Int         # Minimum cluster size
        best_bound::Float64           # Best bound
        error_bound::Float64          # Error bound
        max_bin_portion::Float64      # Maximum histogram bin size
        random_seed::Int64            # Random seed
        sampling_heuristic::Int       # Sampling heuristic
        sampling_factor::Float64      # Sampling factor
        histogram_sampling::Bool      # Sample points for distance histogram
        zero_d_search::Bool           # Enable zero-dimensional manifold search
        basis_alignment::Bool         # Manifold cluster basis alignment
        dim_adjustment::Bool          # Manifold dimensionality adjustment
        dim_adjustment_ratio::Float64 # Ratio of manifold principal subspace variance
        mdl::Bool                     # Enable MDL heuristic
        mdl_model_precision::Int      # MDL model precision encoding constant
        mdl_data_precision::Int       # MDL data precision encoding constant
        mdl_quant_error::Float64      # Quantization error of a bin size calculation
        mdl_compres_ratio::Float64    # Cluster compression ratio
        log_level::Int                # Log level (0-5)
    end

Here is a description of the algorithm parameters and their default values:
name | description | default |
---|---|---|
min_dim | Lower bound of the cluster manifold dimension. | 1 |
max_dim | Upper bound of the cluster manifold dimension. It cannot be larger than the dimensionality of the dataset. | |
number_of_clusters | Expected number of clusters. Required for the sampling heuristics. | 10 |
hist_bin_size | Number of bins for the distance histogram. If this parameter is set to zero, the number of bins in the distance histogram is determined by the max_bin_portion parameter. | 0 |
min_cluster_size | Minimum number of data points for a collection to be considered a proper cluster. | 20 |
best_bound | Best bound of separation, used to evaluate the goodness of a separation, which is characterized by the discriminability and the depth between modes of the distance histogram. | 1.0 |
error_bound | Sampling error bound. It determines the minimal number of samples required to correctly identify a linear manifold cluster. | 1e-4 |
max_bin_portion | Maximum histogram bin size. It determines the number of bins in the distance histogram when hist_bin_size is zero. The value should be selected from the (0,1) range. | 0.1 |
random_seed | Random number generator seed. If not specified, the RNG is reinitialized at every run. | 0 |
sampling_heuristic | The choice of sampling heuristic method, selected by option number. | 3 |
sampling_factor | Sampling factor used in the sampling heuristics (see above, options 2 & 3) to determine the number of samples as a percentage of the total dataset size. | 0.01 |
histogram_sampling | Turns on sampling for the distance histogram. Instead of computing the distance histogram from the whole dataset, the algorithm draws a small sample for the histogram construction, thus improving its performance. This parameter depends on the number_of_clusters value. | false |
zero_d_search | Turn on/off the zero-dimensional manifold search. | false |
basis_alignment | Turn on/off the alignment of the manifold cluster basis. If it is on, the manifold basis of the generated cluster is aligned along the direction of maximum variance (by performing PCA). | false |
dim_adjustment | Turn on/off detection of the linear manifold cluster dimensionality by examining the portion of variance associated with the principal components. | false |
dim_adjustment_ratio | Ratio of manifold principal subspace variance. | 0.99 |
mdl | Turn on/off the minimum description length (MDL) heuristic for complexity validation of a generated cluster. | false |
mdl_model_precision | MDL model precision encoding value. | 32 |
mdl_data_precision | MDL data precision encoding value. | 16 |
mdl_quant_error | Quantization error of the bin size calculation for the histogram that is used to determine the entropy of the empirical distance distribution. | 1e-4 |
mdl_compres_ratio | Compression threshold value for discarding candidate clusters. | 1.05 |
log_level | Logging level (ranges from 0 to 5). | 0 |
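
To make the dim_adjustment and dim_adjustment_ratio rows more concrete, here is a minimal sketch of one way a variance-ratio rule can pick a dimension: take the smallest number of principal components whose cumulative share of variance reaches the ratio. It illustrates the idea only and is not the package's code. The helper name adjusted_dim is hypothetical, and the one-observation-per-column layout is an assumption of this sketch.

```julia
using LinearAlgebra, Statistics

# Hypothetical helper, not part of LMCLUS: smallest number of principal
# components whose cumulative share of variance reaches `ratio`.
# Assumes one observation per column of X.
function adjusted_dim(X::AbstractMatrix, ratio::Real)
    Xc = X .- mean(X, dims=2)       # center every coordinate
    v = svdvals(Xc) .^ 2            # proportional to principal variances, descending
    share = cumsum(v) ./ sum(v)     # cumulative share of the total variance
    return findfirst(s -> s >= ratio, share)
end

t = range(-1, stop=1, length=200)

# Points that lie almost exactly on a line in 3-D space.
line = [t'; 2 .* t'; 1e-3 .* sin.(50 .* t')]

# Points on a circle: two directions carry equal variance.
theta = range(0, stop=2pi, length=201)[1:200]
circle = [cos.(theta)'; sin.(theta)']

println("line:   ", adjusted_dim(line, 0.99))
println("circle: ", adjusted_dim(circle, 0.99))
```

With a ratio of 0.99, the near-line data needs a single component while the circle needs two, which is the kind of decision the dim_adjustment_ratio threshold controls.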
Suggestions
Particular settings can affect the performance of the algorithm:
- If you want reproducible clustering results, fix the random_seed parameter. By default, the RNG is reinitialized every time the algorithm runs.
- If the dimensionality of the data is low, histogram sampling could speed up calculations.
- Value 1 of the sampling_heuristic parameter should not be used if the max_dim parameter is large, as it will generate a very large number of samples.
- Increasing the value of the max_bin_portion parameter could improve the efficiency of the clustering partitioning, but could also degrade the overall performance of the algorithm.
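
The warning about value 1 of sampling_heuristic can be illustrated with a small calculation. The sketch below assumes a probabilistic bound of the form N = log(error_bound) / log(1 - (1 / number_of_clusters)^k), where k is the manifold dimension. Whether this implementation uses exactly this expression is an assumption, and heuristic1_samples is a hypothetical name, not a package function.

```julia
# Hypothetical helper, not part of LMCLUS. Assumed bound:
#   N = log(error_bound) / log(1 - (1 / number_of_clusters)^k)
function heuristic1_samples(k::Integer; number_of_clusters::Integer=10,
                            error_bound::Float64=1e-4)
    # Chance that k further points fall in the same cluster as the first one,
    # assuming all clusters are equally likely.
    P = (1 / number_of_clusters)^k
    return ceil(Int, log(error_bound) / log1p(-P))
end

# Sample counts for increasing manifold dimension, using the table defaults.
for k in 1:6
    println("max_dim = ", k, "  samples = ", heuristic1_samples(k))
end
```

Under this assumption and the default number_of_clusters and error_bound, each additional dimension multiplies the sample count by roughly ten, which is why a large max_dim becomes expensive.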
Parallelization
This implementation of the LMCLUS algorithm uses parallel computations during the manifold sampling stage. You need to add additional workers before executing the algorithm.
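
A minimal sketch of adding workers is shown below. On Julia 1.x, addprocs and nworkers come from the Distributed standard library. The worker count of 2 is an arbitrary choice for illustration, and the commented line assumes the package is installed and has to be loaded on every worker.

```julia
using Distributed

# Start two local worker processes (the count is an arbitrary choice here).
addprocs(2)

# Assumption: the package must be loaded on every worker before clustering.
# Uncomment once LMCLUS is installed:
# @everywhere using LMCLUS

println("workers available: ", nworkers())
```

Alternatively, starting Julia with the -p flag (for example `julia -p 2`) launches the session with workers already attached.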