Utility to calculate a self-organizing map of a set of features

Name

som - Utility to calculate a self-organizing map of a set of features

Synopsis

som [options] WIDTH HEIGHT DATASET MAPFILE

Description

Trains a Self-Organizing Map of the specified WIDTH and HEIGHT on the DATASET. The output is a set of files prefixed by MAPFILE. The options supported are as follows:

-b Use binary mode for reading and writing matrices (quicker and smaller, but not as flexible)

-d fn The distance function to use. Can be: basic; euclidean; cosine.

-p The first line of the text file is normally assumed to be a header containing the attributes of the matrix (i.e. the number of columns and rows). This option indicates that the file is a plain text file without the header and that the parameters should be deduced automatically.

-d fn The map initialisation method to use. Can be: sample; random; gradient. Sample fills the initial map with random samples from the dataset. Random fills the initial map with randomly created elements. The minimum and maximum values for each variable are used as limits, so the made-up elements are potentially representative. The gradient method places two dissimilar elements from the dataset in opposite corners of the map and fills the rest of the map with interpolated values.

-n Normalise the dataset on a per-variable basis before mapping. That is, each 'column' of the dataset is normalised independently from other columns. The values are normalised to lie between 0 and 1, using the minimum and maximum values as limits.

-s Summarise the clusters. This lists the location and size of sections in the DATASET which are from the same cluster. This is intended to be used for sequential data, such as time-series audio features. The results are output to MAPFILE.clumps.
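
As a rough illustration of the -n and -s options above, the following Python sketch shows per-column min-max normalisation and a run-length summary of consecutive cluster assignments. It is not taken from som.cpp: the function names, the treatment of constant columns, and the (start, length, cluster) tuple layout are assumptions made for illustration, and the actual MAPFILE.clumps format may differ.

```python
def normalise_columns(rows):
    """Scale each column independently to [0, 1] using its own min and max."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # Assumption: a constant column (max == min) maps to 0.0.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


def summarise_clumps(assignments):
    """Group consecutive identical cluster ids into (start, length, cluster)."""
    clumps = []
    for index, cluster in enumerate(assignments):
        if clumps and clumps[-1][2] == cluster:
            # Same cluster as the previous element: extend the current clump.
            start, length, _ = clumps[-1]
            clumps[-1] = (start, length + 1, cluster)
        else:
            # Cluster changed (or first element): start a new clump here.
            clumps.append((index, 1, cluster))
    return clumps


data = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
print(normalise_columns(data))
print(summarise_clumps([4, 4, 4, 7, 7, 4]))
```

Because each column is scaled with its own limits, a variable with a large numeric range does not dominate the distance calculation. The clump summary is only meaningful when the order of rows in the DATASET carries meaning, as with time-series features.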

Outputs

The following files are output as part of the mapping process.

MAPFILE.ca Cluster assignment for each element in the DATASET. This is always a text file.

MAPFILE.cap The population of each cluster, as assigned in the MAPFILE.ca file. This is always a text file.

MAPFILE.clumps Only output when the -s option is present. This lists the location and size of sections in the DATASET which are from the same cluster. This is intended to be used for sequential data, such as time-series audio features. This is always a text file.

MAPFILE.cp A component plane showing the distance between a sample from the DATASET and each point on the map. It is similar to the u-matrix in that lower values mean a greater similarity. The sample chosen is currently the first entry in the DATASET.

MAPFILE.som The exemplar vectors for each point on the map. This is currently always a text file.

MAPFILE.um A u-matrix of the map. This shows how similar each point on the map is to its neighbours. A low value means it is more similar. When visualised, ridges of higher values show boundaries of any clusters that might have formed.
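
The component plane and u-matrix described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the som.cpp implementation: the map is held as a nested list `som[y][x]` of exemplar vectors, distances are Euclidean, and each u-matrix value is the mean distance to the 4-connected neighbours. The real utility may use a different neighbourhood, distance function, or averaging rule.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def component_plane(som, sample):
    """Distance from one sample to every exemplar on the map."""
    return [[euclidean(cell, sample) for cell in row] for row in som]


def u_matrix(som):
    """Mean distance from each exemplar to its 4-connected neighbours."""
    height, width = len(som), len(som[0])
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            dists = [
                euclidean(som[y][x], som[ny][nx])
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                if 0 <= ny < height and 0 <= nx < width
            ]
            # A 1x1 map has no neighbours, so report 0.0 in that case.
            out_row.append(sum(dists) / len(dists) if dists else 0.0)
        result.append(out_row)
    return result


# A 1x3 map whose last exemplar is far from the first two.
som = [[[0.0, 0.0], [0.0, 1.0], [0.0, 5.0]]]
print(u_matrix(som))
print(component_plane(som, [0.0, 0.0]))
```

In this toy map the u-matrix values rise towards the third cell, which is the kind of ridge that marks a cluster boundary when the matrix is visualised.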

Remarks

Implemented by som.cpp.