
1.5 Agglomerative online clustering

So far we have discussed two issues: scale sensitivity (with KMeans) and embeddings (with Kohonen Feature Maps).

https://rumble.com/vcb5co-1.5-agglomerative-online-clustering.html

Wouldn't it be nice if there were an algorithm that builds a representation while streaming the data, is not sensitive to initial starting conditions, and is able to capture the scale of the data...

Agglomerative online clustering. Here is a link to my paper: https://www.cs.huji.ac.il/~werman/Papers/guedalia_etal99.pdf

So first, a better dataset that illustrates multi-scale data. In this dataset the scales are very apparent: the top-right cluster is very tight (small), while the center-left cluster is spread widely over the space.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X is the two-cluster dataset from the previous section;
# X1 is a shrunken, shifted copy of it: a much tighter cluster with far fewer points
X1 = (X / 4) + 2
X1 = X1.sample(X1.shape[0] // 10)

# fit KMeans on the combined data
kmeans = KMeans(n_clusters=2, random_state=0).fit(pd.concat([X, X1]))

centroids_df = pd.DataFrame(kmeans.cluster_centers_)

# count how many points were assigned to each centroid
labels_df = pd.DataFrame(kmeans.labels_)
count_df  = labels_df[0].value_counts().sort_index()

# plot the data and the centroids, sized by how many points each represents
plt.close()
plt.scatter(X[0], X[1])
plt.scatter(X1[0], X1[1])
plt.scatter(centroids_df[0], centroids_df[1], s=count_df)


KMeans, as before, attempts to minimise the global distortion, i.e. to represent all the data elements as well as possible.  Since the bulk of the data is in the center cluster, it biases towards the center.
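As a quick sanity check on that claim, sklearn exposes this global distortion as the fitted model's inertia_ (the sum of squared distances from each point to its assigned centroid). This small snippet just reuses the kmeans fit and count_df from above:

# global distortion: the quantity KMeans minimises
print('inertia:', kmeans.inertia_)

# how unevenly the points are split between the two centroids
print('points per centroid:', count_df.values)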

Here is my simple agglomerative algorithm.


When a data element arrives, a free centroid is assigned to the new data point (not moved towards it).  So the only question is: are there any free centroids?  At first there are, but at some point a limit is reached (this limit is the algorithm's single parameter).  However, unlike KMeans, that parameter does not define the final number of centroids with which to represent the data, but rather the scale of the analysis.

So what to do when the limit is reached?  Consolidate memory, i.e. learn!  That is step two of my algorithm: merge the two closest centroids.  This frees up a memory allocation.  More importantly, it creates a local abstraction of the original data, with the goal of minimising the local loss of information.
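To make step two concrete, here is a minimal numeric sketch of the count-weighted merge used in the class below (the numbers are made up purely for illustration):

import numpy as np

# two centroids, each remembering how many observations it stands for
c1, n1 = np.array([1.0, 1.0]), 3    # already an average of 3 observations
c2, n2 = np.array([2.0, 0.0]), 1    # a single memorised observation

# count-weighted average: every underlying observation keeps equal weight,
# which is what keeps the local loss of information small
merged = (c1 * n1 + c2 * n2) / (n1 + n2)
print(merged)        # [1.25 0.75]
print(n1 + n2)       # the merged centroid now stands for 4 observations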

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances


class Faddc:

    def __init__(self, n_cluster, feature_count):
        self.n_clusters  = n_cluster
        self.m_centroids = np.zeros((n_cluster, feature_count))
        self.m_count     = np.zeros((n_cluster, 1))

    def mergeCentroids(self):
        # find the pair of centroids with the minimum distance...
        dist = euclidean_distances(self.m_centroids, self.m_centroids)
        dist = np.triu(dist, 1)      # keep each pair once (upper triangle)
        dist[dist == 0] = np.nan     # ignore self-distances / identical centroids
        try:
            c1, c2 = np.unravel_index(np.nanargmin(dist), dist.shape)
        except ValueError:  # first case scenario: all centroids are still identical
            return (0, 1)

        # merge c2 into c1 as a count-weighted average, so the merged centroid
        # is the mean of all the observations the two centroids stood for
        self.m_centroids[c1] = (   (self.m_centroids[c2] * self.m_count[c2])
                                 + (self.m_centroids[c1] * self.m_count[c1])) \
                               / (self.m_count[c2] + self.m_count[c1])
        self.m_centroids[c2] = None      # c2 is now free
        self.m_count[c1]     += self.m_count[c2]
        return (c2, c1)

    def faddcRule(self, newPoint):
        # step two: consolidate (merge the two closest centroids),
        # then let the freed centroid memorise the new observation
        newCentroid, mergedCentroid   = self.mergeCentroids()
        self.m_centroids[newCentroid] = newPoint
        self.m_count[newCentroid]     = 1

    def fit(self, X):
        # online: one data element at a time
        for x in X:
            self.faddcRule(x)

    def predict(self, X):
        # assign each element to its nearest centroid
        y1_err = euclidean_distances(X, self.m_centroids)
        y1 = np.argmin(y1_err, axis=1)
        return (y1, y1_err)


So in this example, the four blue centroids have memorised the data observations so far, but now a fifth data element arrives.  This forces a consolidation of memory, a learning phase, where the two closest centroids are merged (into the brown centroid); the freed centroid then memorises the new observation.
This again results in four centroids, one of which is now an average of two data observations, while the others continue to represent a single data observation each.
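Here is a small trace of that scenario using the Faddc class above, with five hand-picked points chosen just for illustration (note that in this implementation a "free" centroid sits at the origin with a count of zero, so merging it away is effectively a no-op):

import numpy as np

# four centroids, four points: each point gets memorised by its own centroid
demo = Faddc(n_cluster=4, feature_count=2)
first_four = np.array([[0.5, 0.0],
                       [4.0, 0.0],
                       [0.0, 4.0],
                       [4.0, 4.0]])
demo.fit(first_four)
print(demo.m_count.ravel())    # [1. 1. 1. 1.] -- pure memorisation so far

# the fifth element forces a consolidation: the two closest centroids
# ([0.5, 0] and [4, 0]) are merged, and the freed slot memorises the new point
demo.faddcRule(np.array([0.6, 0.1]))
print(demo.m_centroids)        # one row is now [2.25, 0.0], the average of the merged pair
print(demo.m_count.ravel())    # [2. 1. 1. 1.]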

Playing it forward with our multi-scale data and four centroids, you can see how nicely two of the centroids come to represent the two different clusters, with the excess memory assigned to a few negligible data elements.

# four centroids: enough memory for the two clusters plus a little excess
faddc = Faddc(n_cluster=4, feature_count=2)
faddc.fit(pd.concat([X, X1]).values)

centroids_df = pd.DataFrame(faddc.m_centroids)

# plot the data and the centroids, sized by how many observations each has absorbed
plt.close()
plt.scatter(X[0], X[1])
plt.scatter(X1[0], X1[1])
plt.scatter(centroids_df[0], centroids_df[1], s=faddc.m_count.ravel())
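As a small follow-up (not shown in the original figure), faddc.predict can label each data element by its nearest centroid, which makes it easy to colour the points by cluster; a minimal sketch, reusing X, X1 and the fitted faddc from above:

combined = pd.concat([X, X1])

# label every data element by its nearest centroid
labels, _ = faddc.predict(combined.values)

plt.close()
plt.scatter(combined[0], combined[1], c=labels)
plt.scatter(centroids_df[0], centroids_df[1], s=faddc.m_count.ravel(), c='red')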




Just to complete the picture, for reference, here are the results from faddc on the previous data set, where the two clusters of data are at the same spatial scale but contain different quantities of data.
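A minimal sketch of how that run could be reproduced, assuming X from the previous section still holds that two-cluster data (faddc_prev and centroids_prev are names used just for this sketch):

faddc_prev = Faddc(n_cluster=4, feature_count=2)
faddc_prev.fit(X.values)

centroids_prev = pd.DataFrame(faddc_prev.m_centroids)

plt.close()
plt.scatter(X[0], X[1])
plt.scatter(centroids_prev[0], centroids_prev[1], s=faddc_prev.m_count.ravel())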










