
1.5 Agglomerative online clustering

So far we have discussed two issues: scale sensitivity (with KMeans) and embeddings (with Kohonen Feature Maps).

https://rumble.com/vcb5co-1.5-agglomerative-online-clustering.html

Wouldn't it be nice if there were an algorithm that built a representation while streaming data, was not sensitive to initial starting conditions, and was able to capture the scale of the data...

Agglomerative online clustering does just that; here is a link to my paper: https://www.cs.huji.ac.il/~werman/Papers/guedalia_etal99.pdf

So first, a better dataset that exhibits multi-scale structure. In this dataset the scales are very apparent: the top-right cluster is very tight (small), while the center-left cluster is spread widely over the space.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X is the DataFrame of data points from the previous sections.
# Build a second, tighter cluster: shrink the data by a factor of 4,
# shift it towards the top right, and keep only a tenth of the points.
X1 = (X / 4) + 2
X1 = X1.sample(X1.shape[0] // 10)

# Fit KMeans on the combined data (DataFrame.append is deprecated, so use pd.concat).
kmeans = KMeans(n_clusters=2, random_state=0).fit(pd.concat([X, X1]))
centroids_df = pd.DataFrame(kmeans.cluster_centers_)

# Count how many points each centroid represents, ordered by cluster label.
labels_df = pd.DataFrame(kmeans.labels_)
count_df = labels_df[0].value_counts().sort_index()

# Plot both clusters and the centroids, sized by the number of points they represent.
plt.close()
plt.scatter(X[0], X[1])
plt.scatter(X1[0], X1[1])
plt.scatter(centroids_df[0], centroids_df[1], s=count_df)


KMeans, as before, attempts to minimise the global distortion, i.e. to represent all the data elements as well as possible. Since the bulk of the data is in the wide center cluster, it biases towards the center.
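To make "global distortion" concrete, here is a quick sketch, assuming the X, X1 and kmeans objects from the code above: it computes the sum of squared distances from each point to its nearest centroid, which is the quantity KMeans minimises and which sklearn also reports as inertia_.

from sklearn.metrics.pairwise import euclidean_distances

# Global distortion: sum of squared distances from each point to its nearest centroid.
data = pd.concat([X, X1]).values
dists = euclidean_distances(data, kmeans.cluster_centers_)  # distance from every point to every centroid
distortion = (dists.min(axis=1) ** 2).sum()                 # squared distance to the nearest centroid, summed
print(distortion, kmeans.inertia_)                          # the two values should agree (up to floating point)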

Here is my simple agglomerative algorithm.


When a data element arrives, any free centroid is assigned to the new data point (the existing centroids are not moved). So the only question is: are there any free centroids? At first there are, but at some point a limit is reached; this limit is the algorithm's single parameter. However, unlike KMeans, that parameter does not define the final number of centroids with which to represent the data, but rather the scale of the analysis.

So what to do when the limit is reached? Consolidate memory, i.e. learn! That is step two of my algorithm: merge the two closest centroids. This frees up a memory allocation. More importantly, it creates a local abstraction of the original data, with the goal of minimising the local loss of information.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances


class Faddc:

    def __init__(self, n_cluster, feature_count):
        self.n_clusters  = n_cluster
        self.m_centroids = np.zeros((n_cluster, feature_count))
        self.m_count     = np.zeros((n_cluster, 1))

    def mergeCentroids(self):
        # Find the pair of centroids with the minimal distance between them.
        dist = euclidean_distances(self.m_centroids, self.m_centroids)
        dist = np.triu(dist, 1)    # keep each pair only once, drop the diagonal
        dist[dist == 0] = np.nan   # ignore the zeroed-out entries
        try:
            c1, c2 = np.unravel_index(np.nanargmin(dist), dist.shape)
        except ValueError:
            # First call: every centroid is still at the origin, so all
            # distances are nan; just hand back the first two slots.
            return (0, 1)

        # Merge c2 into c1 as a count-weighted average, so the merged centroid
        # is the mean of all the observations the two centroids represent.
        self.m_centroids[c1] = (
              (self.m_centroids[c2] * self.m_count[c2])
            + (self.m_centroids[c1] * self.m_count[c1])
        ) / (self.m_count[c2] + self.m_count[c1])
        self.m_centroids[c2] = np.nan    # c2 is now a free slot
        self.m_count[c1]    += self.m_count[c2]
        return (c2, c1)

    def faddcRule(self, newPoint):
        # Step two frees a slot by merging the two closest centroids,
        # then step one memorises the new observation in the freed slot.
        newCentroid, mergedCentroid   = self.mergeCentroids()
        self.m_centroids[newCentroid] = newPoint
        self.m_count[newCentroid]     = 1

    def fit(self, X):
        # Online: the data is seen one element at a time.
        for x in X:
            self.faddcRule(x)

    def predict(self, X):
        # Assign each point to its nearest centroid; also return the full distance matrix.
        y1_err = euclidean_distances(X, self.m_centroids)
        y1 = np.argmin(y1_err, axis=1)
        return (y1, y1_err)


So in this example, the four blue centroids have memorised the data observations seen so far, but now a fifth data element arrives. This forces a consolidation of memory, a learning phase, in which the two closest centroids are merged (into the brown centroid) and the freed centroid memorises the new observation.
This again results in four centroids, but now one of them is an average of two data observations, while the others continue to represent a single data observation each.
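Here is a minimal sketch of that single step, reusing the Faddc class above on five made-up 2-D points (the values are purely illustrative): once the fifth point arrives, one row of m_centroids becomes the average of the two closest earlier points and its count rises to two.

points = np.array([[0.1, 0.1],
                   [4.0, 4.0],
                   [4.5, 4.5],   # closest to the previous point, so these two get merged
                   [9.0, 1.0],
                   [4.2, 4.3]])  # the fifth arrival forces the consolidation

demo = Faddc(n_cluster=4, feature_count=2)
demo.fit(points)
print(demo.m_centroids)  # one row is the average of [4.0, 4.0] and [4.5, 4.5]
print(demo.m_count)      # that row has a count of 2, the other rows a count of 1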

Playing it forward with our multi-scale data and four centroids, you can see how nicely two centroids come to represent the two different clusters, with the excess memory being assigned to a few negligible data elements.

faddc = Faddc(n_cluster=4, feature_count=2)
faddc.fit(pd.concat([X, X1]).values)   # stream the combined data one element at a time

centroids_df = pd.DataFrame(faddc.m_centroids)

# Plot both clusters and the learned centroids, sized by how many observations each represents.
plt.close()
plt.scatter(X[0], X[1])
plt.scatter(X1[0], X1[1])
plt.scatter(centroids_df[0], centroids_df[1], s=faddc.m_count.ravel())
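The class also exposes predict, which assigns each point to its nearest learned centroid; a quick sketch of using it on the same combined data fitted above:

# Label every point with its nearest centroid and count the assignments.
labels, errors = faddc.predict(pd.concat([X, X1]).values)
print(np.bincount(labels, minlength=4))   # how many points each of the four centroids claims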




Just to complete the picture, for reference, here are the results of faddc on the previous dataset, where the two clusters of data have the same spatial scale but different quantities of points.
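For readers who want to reproduce that comparison, here is a hedged sketch; the exact construction of the earlier dataset is not repeated here, so two equally spread Gaussian blobs with roughly a 10:1 point ratio stand in for it.

# A stand-in for the previous dataset (an assumption, not the original data):
# two blobs with the same spread but very different numbers of points.
rng   = np.random.default_rng(0)
big   = pd.DataFrame(rng.normal(loc=[0, 0], scale=1.0, size=(1000, 2)))
small = pd.DataFrame(rng.normal(loc=[5, 5], scale=1.0, size=(100, 2)))

faddc_prev = Faddc(n_cluster=4, feature_count=2)
faddc_prev.fit(pd.concat([big, small]).values)

centroids_prev = pd.DataFrame(faddc_prev.m_centroids)
plt.close()
plt.scatter(big[0], big[1])
plt.scatter(small[0], small[1])
plt.scatter(centroids_prev[0], centroids_prev[1], s=faddc_prev.m_count.ravel())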










