Parallel K-Means Data Clustering for Large Data Sets


This software package, parallel-kmeans-int64.tar.gz (2.3 MB), extends the Parallel K-means Software to handle data sets with more than 2 billion data points. It uses the "long long" data type to represent the number of data points, instead of the "int" used in the previous release. Note that this package contains only the MPI version and uses Parallel netCDF (PnetCDF) for its I/O. You can build and install PnetCDF in your home directory; the PnetCDF release includes build recipes for various machine platforms.
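
The reason for moving from "int" to "long long" can be illustrated with a short C sketch. It is not taken from the package: the helper names fits_in_int and data_bytes are hypothetical, and the sketch assumes the common case of a 32-bit "int" whose largest value is 2147483647.

```c
#include <limits.h>

/* Hypothetical helper: returns 1 if a data-point count can be stored
 * in an "int" on this platform, 0 otherwise. */
int fits_in_int(long long num_objs)
{
    return num_objs >= 0 && num_objs <= (long long)INT_MAX;
}

/* Hypothetical helper: size in bytes of a num_objs x num_coords array
 * of floats. Every operand is widened to long long before multiplying,
 * so the product does not overflow a 32-bit int. */
long long data_bytes(long long num_objs, int num_coords)
{
    return num_objs * (long long)num_coords * (long long)sizeof(float);
}
```

With a 32-bit "int", fits_in_int(17695) is 1, while fits_in_int(3000000000LL) is 0, which is the case this release is meant to cover.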

To compile:

Edit Makefile to make the following changes and then run command "make".

To run:

Usage: mpi_main [switches]
       -i filename    : input netCDF file containing data to be clustered
       -v var_name    : name of variable in the netCDF file to be clustered
       -c filename    : name of netCDF file that contains the initial cluster centers
                        if skipped, the same file from option "-i" is used
       -k var_name    : name of variable in the netCDF to be used as the initial cluster centers
                        if skipped, the variable name from the option "-v" is used
       -n num_clusters: number of clusters (K, must > 1)
       -t threshold   : threshold value (default 0.0010)
       -o             : output timing results (default no)
       -q             : quiet mode
       -d             : enable debug mode
       -h             : print this help information

Input file format:

Only the netCDF file format is supported in this software release. A few example files are provided in the sub-directory ./Image_data. Information about the netCDF file format can be found in the links below.

Output file:

The output file is in netCDF format (CDF-5). If the command-line option "-o" is not used, the default output file name is the input file name with ".kmeans_out" inserted before the file extension, so the extension ".nc" is preserved. For example, if the input file name is "input.nc", then the default output file name will be "input.kmeans_out.nc".
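
The default naming rule above can be sketched in C as follows. This is only an illustration and not the code used in the package: the function name default_output_name and its buffer-based interface are assumptions.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes "<stem>.kmeans_out<ext>" into out,
 * where <ext> is the last ".xxx" of the file name (if any).
 * Returns 0 on success, -1 if out_size is too small. */
int default_output_name(const char *in_name, char *out, size_t out_size)
{
    const char *slash = strrchr(in_name, '/');
    const char *dot   = strrchr(in_name, '.');
    size_t stem_len;
    const char *ext;
    int n;

    /* A dot that belongs to a directory name is not a file extension. */
    if (dot != NULL && slash != NULL && dot < slash)
        dot = NULL;

    stem_len = (dot != NULL) ? (size_t)(dot - in_name) : strlen(in_name);
    ext      = (dot != NULL) ? dot : "";

    n = snprintf(out, out_size, "%.*s.kmeans_out%s",
                 (int)stem_len, in_name, ext);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Called with "input.nc", the sketch produces "input.kmeans_out.nc", matching the example above.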


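Output variables:

The output file contains two variables: "clusters", the coordinates of the K cluster centers, and "membership", the cluster assignment of each data object.

Examples:

Here are the file headers of the input and output files from a run on the example file "Image_data/color17695.nc".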
  % mpiexec -n 4 mpi_main -i Image_data/color17695.nc -v color17695 -n 4 -o output/out.nc
  Writing coordinates of K=4 cluster centers to file "output/out.nc"
  Writing membership of N=140737181790296 data objects to file "output/out.nc"
  Writing coordinates of K=4 cluster centers to file "output/out.nc"
  Writing membership of N=17695 data objects to file "output/out.nc"

  Performing **** Parallel Kmeans  (MPI) ****
  Num of processes = 4
  Input file       : Image_data/color17695.nc
  Output file      : output/out.nc
  numObjs          = 17695
  numCoords        = 9
  numClusters      = 4
  threshold        = 0.0010
  I/O time         =     0.0672 sec
  Computation time =     0.0463 sec



  % ncmpidump -h Image_data/color17695.nc
  netcdf color17695 {
  // file format: CDF-5 (big variables)
  dimensions:
          num_elements = 17695 ;
          num_coordinates = 9 ;
  variables:
          float color17695(num_elements, num_coordinates) ;
  }



  % ncmpidump -h output/out.nc
  netcdf out {
  // file format: CDF-5 (big variables)
  dimensions:
          num_clusters = 4 ;
          num_coordinates = 9 ;
          num_elements = 17695 ;
  variables:
          float clusters(num_clusters, num_coordinates) ;
          int64 membership(num_elements) ;
  }
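
Once the "membership" variable has been read into memory (for example through PnetCDF), a post-processing step such as counting the objects in each cluster might look like the sketch below. It is not part of the package: the function name count_cluster_sizes is hypothetical, and the sketch assumes that membership values are 0-based cluster indices in the range 0 to num_clusters-1.

```c
/* Hypothetical helper: tallies how many objects belong to each cluster.
 * membership holds num_objs entries; sizes must have room for
 * num_clusters counters. The loop index and the counters are
 * "long long" so the tally also works for more than 2 billion objects.
 * Returns the number of entries whose cluster id was out of range. */
long long count_cluster_sizes(const long long *membership,
                              long long num_objs,
                              int num_clusters,
                              long long *sizes)
{
    long long i, invalid = 0;
    int k;

    for (k = 0; k < num_clusters; k++)
        sizes[k] = 0;

    for (i = 0; i < num_objs; i++) {
        long long id = membership[i];
        if (id >= 0 && id < num_clusters)
            sizes[id]++;
        else
            invalid++;   /* assumption violated: id is not 0..K-1 */
    }
    return invalid;
}
```

A non-zero return value would indicate that the 0-based indexing assumption does not hold for the file being examined.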

Related Links:


Wei-keng Liao
Electrical Engineering and Computer Science Department
Northwestern University
Please send comments to
Software available since Nov. 30, 2013.
Page last modified date: Nov. 30, 2013.