Parameters
The clustering properties are set in an LMCLUSParameters instance, which is defined as follows:
    type LMCLUSParameters
        min_dim::Int                   # Minimum cluster dimension
        max_dim::Int                   # Maximum cluster dimension
        number_of_clusters::Int        # Nominal number of resulting clusters
        hist_bin_size::Int             # Fixed number of bins for the distance histogram.
        min_cluster_size::Int          # Minimum cluster size
        best_bound::Float64            # Best bound
        error_bound::Float64           # Error bound
        max_bin_portion::Float64       # Maximum histogram bin size
        random_seed::Int64             # Random seed
        sampling_heuristic::Int        # Sampling heuristic
        sampling_factor::Float64       # Sampling factor
        histogram_sampling::Bool       # Sample points for distance histogram
        zero_d_search::Bool            # Enable zero-dimensional manifold search
        basis_alignment::Bool          # Manifold cluster basis alignment
        dim_adjustment::Bool           # Manifold dimensionality adjustment
        dim_adjustment_ratio::Float64  # Ratio of manifold principal subspace variance
        mdl::Bool                      # Enable MDL heuristic
        mdl_model_precision::Int       # MDL model precision encoding constant
        mdl_data_precision::Int        # MDL data precision encoding constant
        mdl_quant_error::Float64       # Quantization error of a bin size calculation
        mdl_compres_ratio::Float64     # Cluster compression ratio
        log_level::Int                 # Log level (0-5)
    end
Here is a description of algorithm parameters and their default values:
| name | description | default |
|---|---|---|
| min_dim | Lower bound of a cluster manifold dimension. | 1 |
| max_dim | Upper bound of a cluster manifold dimension. It cannot be larger than the dimensionality of the dataset. | |
| number_of_clusters | Expected number of clusters. Required for the sampling heuristics. | 10 |
| hist_bin_size | Number of bins for the distance histogram. If this parameter is set to zero, the number of bins in the distance histogram is determined by the max_bin_portion parameter. | 0 |
| min_cluster_size | Minimum size of a collection of data points to be considered as a proper cluster. | 20 |
| best_bound | The separation best bound is used to evaluate the goodness of separation, which is characterized by the discriminability and the depth between modes of the distance histogram. | 1.0 |
| error_bound | The sampling error bound determines the minimal number of samples required to correctly identify a linear manifold cluster. | 1e-4 |
| max_bin_portion | Maximum histogram bin size. It determines the number of bins in the distance histogram when hist_bin_size is set to zero. Value should be selected from the (0,1) range. | 0.1 |
| random_seed | Random number generator seed. If not specified then RNG will be reinitialized at every run. | 0 |
| sampling_heuristic | The choice of sampling heuristic method, given as an option number. | 3 |
| sampling_factor | Sampling factor used by sampling heuristic options 2 and 3 to determine the number of samples as a percentage of the total dataset size. | 0.01 |
| histogram_sampling | Turns on sampling for the distance histogram. Instead of computing the distance histogram from the whole dataset, the algorithm draws a small sample for the histogram construction, thus improving its performance. This parameter depends on the number_of_clusters value. | false |
| zero_d_search | Turn on/off the zero-dimensional manifold search. | false |
| basis_alignment | Turn on/off the alignment of a manifold cluster basis. If it is on, the manifold basis of the generated cluster is aligned along the direction of maximum variance (by performing PCA). | false |
| dim_adjustment | Turn on/off linear manifold cluster dimensionality detection, which looks at the portion of variance associated with the principal components. | false |
| dim_adjustment_ratio | Ratio of manifold principal subspace variance. | 0.99 |
| mdl | Turn on/off minimum description length heuristic for a complexity validation of a generated cluster. | false |
| mdl_model_precision | MDL model precision encoding value. | 32 |
| mdl_data_precision | MDL data precision encoding value. | 16 |
| mdl_quant_error | Quantization error of the bin size calculation for the histogram that is used to determine the entropy value of the empirical distance distribution. | 1e-4 |
| mdl_compres_ratio | Compression threshold value for discarding candidate clusters. | 1.05 |
| log_level | Logging level (ranges from 0 to 5). | 0 |
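The defaults and constraints listed in the table can be sketched in a few lines of Julia. The `ParamsSketch` type and the `check_params` function below are hypothetical stand-ins written for illustration only; they mirror a handful of `LMCLUSParameters` fields so the snippet runs without the package installed, and they are not the package's own type or validation logic.

```julia
# Hypothetical stand-in mirroring a few LMCLUSParameters fields and the
# defaults from the table above; this is NOT the package's own type.
Base.@kwdef mutable struct ParamsSketch
    min_dim::Int = 1
    max_dim::Int                      # no default in the table, so required here
    number_of_clusters::Int = 10
    hist_bin_size::Int = 0
    max_bin_portion::Float64 = 0.1
    random_seed::Int64 = 0
    log_level::Int = 0
end

# Check the constraints stated in the table for a dataset with `data_dim` features.
function check_params(p::ParamsSketch, data_dim::Int)
    p.min_dim <= p.max_dim || error("min_dim must not exceed max_dim")
    p.max_dim <= data_dim || error("max_dim cannot exceed the dataset dimensionality")
    0 < p.max_bin_portion < 1 || error("max_bin_portion must lie in (0, 1)")
    0 <= p.log_level <= 5 || error("log_level must be between 0 and 5")
    return true
end

p = ParamsSketch(max_dim = 5)
p.random_seed = 42                    # fix the seed for repeatable runs
println(check_params(p, 10))
```

Checking the values before a long clustering run is a cheap way to catch an out-of-range `max_dim` or `max_bin_portion` early.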
Suggestions
Particular settings could impact the performance of the algorithm:
- If you want persistent clustering results, fix the random_seed parameter. By default, the RNG is reinitialized every time the algorithm runs.
- If the dimensionality of the data is low, histogram sampling could speed up calculations.
- Value 1 of the sampling_heuristic parameter should not be used if the max_dim parameter is large, as it will generate a very large number of samples.
- Increasing the value of the max_bin_portion parameter could improve the efficiency of the clustering partitioning, but could also degrade the overall performance of the algorithm.
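To see why a large max_dim can make a probabilistic sampling heuristic expensive, the sketch below evaluates one common sampling bound. This section does not show the package's actual formula, so the bound is an assumption: with S equally sized clusters, a draw of K+1 points lands entirely in one cluster with probability (1/S)^K, and the number of draws N is chosen so that the chance of never getting such a draw is at most the error bound. The function name `sample_count` is hypothetical.

```julia
# Assumed bound (not taken from the package source): a draw of K+1 points is
# "pure" (all from one of S equally sized clusters) with probability (1/S)^K.
# N draws contain no pure draw with probability (1 - (1/S)^K)^N; requiring
# this to be at most `error_bound` gives the N computed below.
function sample_count(number_of_clusters::Int, max_dim::Int, error_bound::Float64)
    p = (1 / number_of_clusters)^max_dim
    return ceil(Int, log(error_bound) / log1p(-p))
end

# Table defaults: number_of_clusters = 10, error_bound = 1e-4.
for d in 1:6
    println("max_dim = ", d, " -> samples = ", sample_count(10, d, 1e-4))
end
```

Under this assumption the sample count grows roughly tenfold with each extra dimension when ten clusters are expected, which is consistent with the advice to avoid option 1 for large max_dim.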
Parallelization
This implementation of the LMCLUS algorithm uses parallel computations during the manifold sampling stage. You need to add additional workers before executing the algorithm.
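A minimal sketch of adding workers, assuming a recent Julia version where the `Distributed` standard library provides `addprocs` and `nworkers` (in older releases these functions lived in Base). The worker count of 2 is illustrative, and the commented-out `@everywhere using LMCLUS` line is an assumption about how the package would be made available on each worker.

```julia
using Distributed

# Start two local worker processes (the count is illustrative).
addprocs(2)

# Confirm that the workers are available.
println("workers: ", nworkers())

# The package must then be loaded on every worker before clustering, e.g.:
# @everywhere using LMCLUS
```

Alternatively, Julia can be started with the `-p` command-line option (for example `julia -p 2`) so that the workers exist from the start of the session.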