Parallel K-Means Data Clustering for Large Data Sets

This software package, parallel-kmeans-int64.tar.gz (2.3 MB), extends the Parallel K-means Software to handle data sets with more than 2 billion data points. It uses the "long long" data type, instead of the "int" used in the previous release, to represent the number of data points. Note that this package contains only the MPI version and uses Parallel netCDF (PnetCDF) for its I/O. PnetCDF can be built and installed in your home directory; the PnetCDF release includes build recipes for various machine platforms.
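
The reason for the wider counter type can be illustrated with a minimal sketch. It assumes a platform where "int" is 32 bits, so INT_MAX is 2147483647. The helper names fits_in_int and report are hypothetical and are not part of this package.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Returns 1 if a data set with num_objs points can be counted with a
 * plain "int", 0 if it needs a wider type such as "long long".
 * Hypothetical helper, not taken from the package. */
int fits_in_int(long long num_objs)
{
    assert(num_objs >= 0);
    return num_objs <= (long long)INT_MAX;
}

/* Prints the verdict for one data set size. */
void report(long long num_objs)
{
    printf("%lld points: %s\n", num_objs,
           fits_in_int(num_objs) ? "int is enough" : "needs long long");
}
```

Calling report(17695) and report(3000000000LL) would show that the small example file fits in an "int", while a 3-billion-point data set does not on such a platform.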

To compile:

Edit the Makefile to set the following variables, then run the command "make".

MPICC -- set to the path of the MPI C compiler.
PNETCDF_DIR -- set to the path where Parallel netCDF is installed.
DATATYPE -- set to the data type of your input data. In this release, "short" is used.

To run:

The "make" command produces an executable file named "mpi_main".
Command-line arguments:

Usage: mpi_main [switches]
       -i filename    : input netCDF file containing data to be clustered
       -v var_name    : name of variable in the netCDF file to be clustered
       -c filename    : name of netCDF file that contains the initial cluster centers
                        if skipped, the same file from option "-i" is used
       -k var_name    : name of variable in the netCDF to be used as the initial cluster centers
                        if skipped, the variable name from the option "-v" is used
       -n num_clusters: number of clusters (K, must > 1)
       -t threshold   : threshold value (default 0.0010)
       -o             : output timing results (default no)
       -q             : quiet mode
       -d             : enable debug mode
       -h             : print this help information

Input file format:

This software release supports only the netCDF file format. A few example files are provided in the sub-directory ./Image_data. More information about the netCDF file format can be found at the links below.

netCDF is a portable and self-describing file format. http://www.unidata.ucar.edu/software/netcdf
Parallel netCDF (PnetCDF) is used to carry out parallel I/O. For further information about PnetCDF, see http://cucis.ece.northwestern.edu/projects/PnetCDF

Output file:

The output file is in netCDF format (CDF-5). If the command-line option "-o" is not used, the default output file name is the input file name with ".kmeans_out" appended, and the file extension ".nc" is preserved. For example, if the input file name is "input.nc", then the default output file name is "input.kmeans_out.nc".
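
The naming rule above can be sketched as a small C function. This is an illustration only, not the code used in the package; the function name default_output_name is hypothetical, and the behavior for input names that do not end in ".nc" (the sketch still appends ".nc") is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds the default output name by inserting ".kmeans_out" before the
 * ".nc" extension. Returns 0 on success, -1 if the result does not fit
 * in out. Hypothetical helper, not taken from the package. */
int default_output_name(const char *in, char *out, size_t out_size)
{
    const char *ext = ".nc";
    const char *tag = ".kmeans_out";
    size_t in_len, stem_len;
    int n;

    assert(in != NULL && out != NULL);
    in_len = strlen(in);
    stem_len = in_len;

    /* Strip a trailing ".nc", if present, so it can be re-appended last.
     * Assumption: names without ".nc" get the extension added anyway. */
    if (in_len >= strlen(ext) && strcmp(in + in_len - strlen(ext), ext) == 0)
        stem_len = in_len - strlen(ext);

    n = snprintf(out, out_size, "%.*s%s%s", (int)stem_len, in, tag, ext);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

With a buffer of sufficient size, passing "input.nc" yields "input.kmeans_out.nc", matching the example in the text.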

Output variables:

Coordinates of the cluster centers are stored in the variable named "clusters".
Membership of all data points to the clusters is stored in the variable named "membership".
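
How the two output variables relate can be shown with a minimal serial sketch. It assumes that each "membership" entry is the row index in "clusters" of the center nearest to that data point, that distance is squared Euclidean, and that both arrays are stored row by row. The function names are hypothetical, and the package's MPI code may differ in its details.

```c
#include <assert.h>

/* Squared Euclidean distance between two points with num_coords values. */
static float dist2(const float *a, const float *b, int num_coords)
{
    float sum = 0.0f;
    for (int i = 0; i < num_coords; i++) {
        float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

/* For each object, stores in membership[] the index of the nearest
 * cluster center. objects is num_objs x num_coords and clusters is
 * num_clusters x num_coords, both row-major (an assumption). */
void assign_membership(const float *objects, long long num_objs,
                       int num_coords, const float *clusters,
                       int num_clusters, long long *membership)
{
    assert(num_clusters > 0 && num_coords > 0);
    for (long long i = 0; i < num_objs; i++) {
        const float *obj = objects + i * num_coords;
        long long best = 0;
        float best_d = dist2(obj, clusters, num_coords);
        for (int k = 1; k < num_clusters; k++) {
            float d = dist2(obj, clusters + (long long)k * num_coords,
                            num_coords);
            if (d < best_d) {
                best_d = d;
                best = k;
            }
        }
        membership[i] = best;
    }
}
```

The object count and loop index use "long long" so that the sketch stays valid for data sets with more than 2 billion points.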

Examples: Here are the file headers of the input and output files from a run using the example file "Image_data/color17695.nc".

  % mpiexec -n 4 mpi_main -i Image_data/color17695.nc -v color17695 -n 4 -o output/out.nc
  Writing coordinates of K=4 cluster centers to file "output/out.nc"
  Writing membership of N=140737181790296 data objects to file "output/out.nc"
  Writing coordinates of K=4 cluster centers to file "output/out.nc"
  Writing membership of N=17695 data objects to file "output/out.nc"

  Performing **** Parallel Kmeans  (MPI) ****
  Num of processes = 4
  Input file       : Image_data/color17695.nc
  Output file      : output/out.nc
  numObjs          = 17695
  numCoords        = 9
  numClusters      = 4
  threshold        = 0.0010
  I/O time         =     0.0672 sec
  Computation time =     0.0463 sec



  % ncmpidump -h Image_data/color17695.nc
  netcdf color17695 {
  // file format: CDF-5 (big variables)
  dimensions:
          num_elements = 17695 ;
          num_coordinates = 9 ;
  variables:
          float color17695(num_elements, num_coordinates) ;
  }



  % ncmpidump -h output/out.nc
  netcdf out {
  // file format: CDF-5 (big variables)
  dimensions:
          num_clusters = 4 ;
          num_coordinates = 9 ;
          num_elements = 17695 ;
  variables:
          float clusters(num_clusters, num_coordinates) ;
          int64 membership(num_elements) ;
  }
