Initial Starting Point Analysis for K-Means Clustering: A Case Study

Authors:
Frank Robinson, Vanderbilt University
Amy Apon, University of Arkansas, Fayetteville
Denny Brewer, Acxiom Corporation
Larry Dowdy, Vanderbilt University
Doug Hoffman, Acxiom Corporation
Baochuan Lu, University of Arkansas, Fayetteville

Abstract- Workload characterization is an important part of systems
performance modeling. Clustering is a method used to find classes of
jobs within workloads. K-Means is one of the most popular clustering
algorithms. Initial starting point values are needed as input parameters
when performing k-means clustering. This paper shows that the results of
running the k-means algorithm on the same workload will vary
depending on the values chosen as initial starting points. Fourteen
methods of composing initial starting point values are compared in a case
study. The results indicate that a synthetic method, scrambled
midpoints, is an effective starting point method for k-means clustering.

1. Introduction

Most companies today rely on information technology (IT) to guide
business decisions, direct major overseas transactions, and manage their
day-to-day activities. These companies have Quality of Service (QoS)
requirements that, if not met, negatively impact the business's
profitability. In order to analyze IT systems and to ensure that QoS
requirements are met, a good understanding of the current and/or projected
workload is needed. This workload model should be representative of the
actual workload and be able to imitate and accurately predict its impact on
the performance of the specific IT system. Workload models, used in
conjunction with other performance models, including simulation models and
analytic queuing network models, can be used for capacity planning,
forecasting future workloads, predicting system performance, and analyzing
various "what if" performance scenarios [1].
Workload models are described by features that characterize the actual
workload, and these features become parameters of performance models.
Different systems are characterized by different features. Example features
include the number of jobs, the job type, memory requirements, network
bandwidth demands, processor needs, and software service requirements,
among others. Choosing which features to include in a representative
workload model depends on the nature of the specific system, what
parameters are actually available, and the purpose of the underlying
performance study [2]. The features that are chosen make up the feature
vector. After the features for the workload model are chosen, sample data
is collected. The actual workload is then described by the set of points
that include all of the selected features for each of the jobs in the
system. That is, each job is described by its feature vector.
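For illustration only (the job values below are hypothetical, not data from the case study), such a data set can be held as a matrix whose rows are jobs and whose columns are features:

    import numpy as np

    # Hypothetical workload sample: one row per job, one column per feature
    # (memory in MB, CPU time in seconds, network demand in KB/s).
    workload = np.array([
        [ 3.0,  20.0,  5.0],   # feature vector of job 1
        [ 1.5,   4.0,  0.5],   # feature vector of job 2
        [64.0, 300.0, 80.0],   # feature vector of job 3
    ])
    n_jobs, n_features = workload.shape   # 3 jobs, M = 3 features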
Because the number of jobs in a collected data set may be quite large, it
is often necessary to perform clustering analysis on the sample data.
Clustering analysis is a technique used to find representative classes, or
homogenous groups, within the raw workload job data set. These classes of
jobs determine the workload model.
There are several clustering techniques, including hierarchical
clustering, minimum spanning tree, and k-means clustering [3]. The k-means
clustering algorithm is used in the case study described in this paper.
This is one of the simplest and most popular clustering algorithms. The
algorithm finds k distinct classes of jobs, or clusters, when given a
specific workload data set. The algorithm finds the midpoint, or centroid,
of each cluster and assigns each job to its nearest centroid. The
algorithm is initialized by providing: 1) the number of desired clusters,
k, and 2) initial starting point estimates of the k centroids. There is no
commonly accepted or standard "best" way to determine either the number of
clusters or the initial starting point values. The resulting set of
clusters, both their number and their centroids, depends on the specified
choice of initial starting point values [4]. This case study shows that
the number of clusters and the final cluster centroid values differ based
on the initial starting values used when performing k-means clustering.
One common way of choosing the initial starting point values is to
randomly choose k of the actual sample data points and use these as the
initial starting values. This method, along with 13 other techniques, is
proposed and analyzed in this study. For brevity, six of these techniques
are presented and analyzed in this paper. These methods fall into two
categories: actual sample starting points and synthetic starting points.
In the case study reported here, the results indicate that the best
synthetic method performs better than the best actual sample method.
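As a minimal sketch of the random actual-sample method just described (the function name and parameters are illustrative assumptions, and the data set is assumed to be a NumPy array of feature vectors, one row per job):

    import numpy as np

    def random_sample_start(points, k, seed=None):
        """Choose k distinct jobs from the data set as initial centroid estimates."""
        rng = np.random.default_rng(seed)
        chosen = rng.choice(len(points), size=k, replace=False)
        return points[chosen].copy()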
The results of this case study indicate that the selection of initial
starting point values impacts the result of the k-means clustering
algorithm. The results also provide reasonable, rule-of-thumb heuristics
for selecting initial starting point values for the clustering algorithm.
The remainder of this paper is organized as follows. Section 2 overviews
the k-means clustering algorithm. Section 3 describes starting point
selection. Section 4 presents the details of the experimental case study.
Finally, Section 5 summarizes the impact of the results.

2. Overview of the K-Means Clustering Algorithm

Given a data set of workload samples, a desired number of clusters, k,
and a set of k initial starting points, the k-means clustering algorithm
finds the desired number of distinct clusters and their centroids. A
centroid is defined as the point whose coordinates are obtained by
computing the average of each of the coordinates (i.e., feature values) of
the points of the jobs assigned to the cluster [2]. Formally, the k-means
clustering algorithm proceeds through the following steps (a short code sketch of these steps appears after the list).
1. Choose a number of desired clusters, k.
2. Choose k starting points to be used as initial estimates of the
cluster centroids. These are the initial starting values.
3. Examine each point (i.e., job) in the workload data set and assign it
to the cluster whose centroid is nearest to it.
4. When each point is assigned to a cluster, recalculate the new k
centroids.
5. Repeat steps 3 and 4 until no point changes its cluster assignment, or
until a maximum number of passes through the data set is performed.
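A minimal Python sketch of steps 2 through 5 is given below. The names are illustrative, the data set and starting points are assumed to be NumPy arrays of (already scaled) floating-point feature vectors, and k is implied by the number of starting points; this is a sketch of the steps above, not the code used in the case study.

    import numpy as np

    def k_means(points, start, max_passes=100):
        """Iterate assignment (step 3) and centroid updates (step 4) until stable (step 5)."""
        centroids = start.copy()                # step 2: initial centroid estimates
        assignment = np.full(len(points), -1)   # no job assigned to a cluster yet
        for _ in range(max_passes):             # step 5: bound on passes through the data
            # Step 3: assign each job to the cluster whose centroid is nearest.
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            new_assignment = dists.argmin(axis=1)
            if np.array_equal(new_assignment, assignment):
                break                           # step 5: no point changed its cluster
            assignment = new_assignment
            # Step 4: recalculate each centroid as the mean of its assigned points.
            for c in range(len(centroids)):
                members = points[assignment == c]
                if len(members):
                    centroids[c] = members.mean(axis=0)
        return centroids, assignment

Combined with a starting point method such as random_sample_start above, k_means(points, random_sample_start(points, k)) returns the final centroids and each job's cluster assignment.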
Before the clustering algorithm can be applied, actual data samples
(i.e., jobs) are collected from observed workloads. The features that
describe each data sample in the workload are required a priori. The
values of these features make up a feature vector (Fi1, Fi2, ... , FiM),
where Fim is the value of the mth feature of the ith job. Each job is
described by its M features. For example, if job 1 requires 3MB of storage
and 20 seconds of CPU time, then (F11, F12) = (3, 20). The feature vector
can be thought of as a point in M-dimensional space. Like other clustering
algorithms, k-means requires that a distance metric between points be
defined [2]. This distance metric is used in step 3 of the algorithm given
above. A common distance metric is the Euclidean distance. Given two
sample points, pi and pj, each described by their feature vectors, pi =
(Fi1, Fi2, ... , FiM) and pj = (Fj1, Fj2, ... , FjM), the distance, dij,
between pi and pj is given by:
d_{ij} = \sqrt{ \sum_{m=1}^{M} (F_{im} - F_{jm})^2 }                    (1)
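As a small worked illustration of equation (1), with hypothetical values rather than data from the paper:

    import numpy as np

    def euclidean_distance(p_i, p_j):
        """Equation (1): square root of the summed squared feature differences."""
        return float(np.sqrt(np.sum((np.asarray(p_i) - np.asarray(p_j)) ** 2)))

    # Two hypothetical jobs described by (storage in MB, CPU seconds):
    d_12 = euclidean_distance([3.0, 20.0], [5.0, 24.0])   # sqrt(4 + 16) ~= 4.47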
If the different features being used in the feature vector have different
relative values and ranges, the distance computation may be distorted since
features with large absolute values tend to dominate the computation [2].
To mitigate this, it is common for the feature values to be first scaled in
order to minimize distortion. There are several different methods that can
be used to scale data. The method used in this paper is z-score scaling.
Z-score scaling expresses each value as the number of standard deviations
it lies from the feature's mean [5]. The z-score equation is
F^{*}_{im} = \frac{F_{im} - \mu_m}{\sigma_m}                    (2)
where Fim is the value of the mth feature of the ith job (i.e., the data
point), μm is the mean value of the mth feature, and σm is the standard
deviation of the mth feature. Thus, before the algorithm is applied, the
original data set is scaled, using the z-score scaling technique, where the
feature mean is subtracted from the feature value and then divided by the
standard deviation of that feature (i.e., Fim is replaced by its scaled
value F*im). This technique has the effect of normalizing the workload
features so that no single feature dominates in the clustering algorithm.
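A minimal sketch of this scaling step, assuming the workload is a jobs-by-features NumPy array (the zero-deviation guard is an added assumption for constant-valued features, not part of the paper):

    import numpy as np

    def z_score_scale(workload):
        """Equation (2): subtract each feature's mean, divide by its standard deviation."""
        mu = workload.mean(axis=0)       # per-feature means
        sigma = workload.std(axis=0)     # per-feature standard deviations
        sigma[sigma == 0] = 1.0          # guard: leave constant-valued features unscaled
        return (workload - mu) / sigma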
The number of clusters to be found and the initial starting point
values are specified as input parameters to the clustering algorithm. Given
the initial starting values, the distance from each (z-score scaled)
sample data point to each initial starting value is found using equation
(1). Each data point is then placed in the cluster associated with the
nearest starting point. New cluster centroids are calculated after all
data points have been assigned to a cluster. Suppose that Cim represents
the centroid o