1.3 KMeans as a means to describe failed learning

https://rumble.com/vc5zei-1.3-kmeans-fails-to-learn.html


So here is a little data:

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs

from sklearn.cluster import KMeans  # needed for the clustering steps further down


def make_simple_data():

    X, y = make_blobs(n_samples=100, centers=1, n_features=1, center_box=(-10.0, 10.0), random_state=0)

    X = pd.DataFrame(X)

    plt.hist(X.values)

    plt.title('hist of rand 1d data 100 samples')

    plt.savefig('hist_1d_100samples.png', dpi=300)

    plt.close()



Looks pretty Gaussian.   
I tried this with multiple centers and got a mess, so I ended up doing something a little uglier to create a multidimensional dataset with multiple centers.


def get_all_data():

    X, y = make_blobs(n_samples=500000, centers=1, n_features=1, center_box=(-10.0, 10.0), random_state=0)

    X_all = pd.DataFrame(X)

    X_all = X_all - X_all.mean()


    return X_all


def get_2d_data():

    X_all = get_all_data() 

    samples = 1000

    dimen = 2

    X = X_all.sample(n=samples).reset_index(drop=True)

    for d in range(1, dimen):

        X2 = X_all.sample(n=samples).reset_index(drop=True).rename(columns={0:d})

        X = pd.concat([X, X2], axis=1)


    return X


X = get_2d_data()


X.plot.scatter(x=0, y=1)  # the two columns are labelled 0 and 1

plt.savefig('2d_data.png', dpi=300)

plt.close()  # start the next plot on a fresh figure


kmeans = KMeans(n_clusters=1, random_state=0).fit(X)

centroids_df = pd.DataFrame(kmeans.cluster_centers_)


print(centroids_df.values)

print(X.mean().values)

# output:

# [[-0.02331018  0.00369451]]

# [-0.02331018  0.00369451]


KMeans finds the representation that minimizes the global distortion. If I can only communicate a single point, which message (which single point) best represents the data?  The average (mean) of all the data.  Why?  Because the mean is the point with the smallest total squared distance to all the data points, so it distorts the original data the least.

This holds for any data set, not only for Gaussian data. With a Gaussian distribution the mean is also where the data is densest, since the center is more populated than the edges.
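The claim that a single centroid lands on the mean can be checked numerically. The sketch below is not part of the original experiment: it uses an assumed stand-in data set (1000 standard-normal points in 2D) and a hypothetical helper `distortion`, and compares the mean against 200 randomly perturbed candidate centers.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed stand-in for the post's 2D data: 1000 points from a standard normal.
X = rng.normal(size=(1000, 2))


def distortion(points, center):
    """Sum of squared Euclidean distances from every point to one center."""
    return float(((points - center) ** 2).sum())


mean = X.mean(axis=0)
base = distortion(X, mean)

# Try 200 other candidate centers scattered around the mean.
candidates = mean + rng.normal(scale=0.5, size=(200, 2))
worse = [distortion(X, c) >= base for c in candidates]

print("distortion at the mean:", base)
print("every other candidate is at least as bad:", all(worse))
```

No candidate beats the mean, which is what the K=1 result above reflects: the one message KMeans sends is the average.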

What happens when I set K=2, i.e. when I can now communicate two messages, two points?  Which two should I choose?

# - two centroids 

kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

centroids_df = pd.DataFrame(kmeans.cluster_centers_)


# add count

labels_df = pd.DataFrame(kmeans.labels_)

count_df = labels_df[0].value_counts().sort_index()  # order counts by cluster label so sizes line up with the centroid rows

plt.scatter(X[0], X[1])

plt.scatter(centroids_df[0], centroids_df[1], s=count_df)

plt.savefig('2d_2centroids.png', dpi=300)

plt.close()  # keep later plots from drawing on this figure

Here the size of the centroid is proportional to the number of elements it represents.
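What the second message buys can also be read off as numbers rather than from the plot. This is a minimal sketch on an assumed stand-in data set (one standard-normal blob of 1000 points, not the data generated above). It uses the fitted model's `inertia_`, the sum of squared distances to the nearest centroid, as the distortion.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Assumed stand-in: a single Gaussian blob, so there is no real second group.
X = rng.normal(size=(1000, 2))

km1 = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X)
km2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Number of points assigned to each of the two centroids.
sizes = np.bincount(km2.labels_, minlength=2)

print("distortion with K=1:", km1.inertia_)
print("distortion with K=2:", km2.inertia_)
print("points per centroid:", sizes)
```

The distortion drops with K=2 even though the data has only one center: the blob gets cut in two simply because that lowers the global error.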

What happens when there are two sets of data?

X1 = X + 2


# - two centroids 

kmeans = KMeans(n_clusters=2, random_state=0).fit(pd.concat([X, X1]))  # DataFrame.append was removed in pandas 2.0

centroids_df = pd.DataFrame(kmeans.cluster_centers_)


# add count

labels_df = pd.DataFrame(kmeans.labels_)

count_df = labels_df[0].value_counts().sort_index()  # order counts by cluster label so sizes line up with the centroid rows

plt.scatter(X[0], X[1])

plt.scatter(X1[0], X1[1])

plt.scatter(centroids_df[0], centroids_df[1], s=count_df)

plt.savefig('2d_2centers_2centroids.png', dpi=300)

plt.close()  # keep later plots from drawing on this figure



Now comes a tricky problem: I get to communicate two messages, but the data is not evenly distributed between the two sets.

X1 = X + 2

X1 = X1.sample(X1.shape[0] // 10)


kmeans = KMeans(n_clusters=2, random_state=0).fit(pd.concat([X, X1]))  # DataFrame.append was removed in pandas 2.0

centroids_df = pd.DataFrame(kmeans.cluster_centers_)


labels_df = pd.DataFrame(kmeans.labels_)

count_df = labels_df[0].value_counts().sort_index()  # value_counts sorts by count, so re-order by cluster label to match the centroid rows in the plot

plt.scatter(X[0], X[1])

plt.scatter(X1[0], X1[1])

plt.scatter(centroids_df[0], centroids_df[1], s=count_df)

plt.savefig('2d_2uneven_close_centers_2centroids.png', dpi=300)

plt.close()


Here it can be clearly seen that KMeans chooses to communicate the average message, which dilutes the uniqueness of each independent element.
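The dilution can be put into numbers. The sketch below rebuilds the uneven setup from assumed stand-in data (1000 standard-normal points around (0, 0) and 100 around (2, 2), not the exact samples plotted above). It counts how many points of the large group end up under the centroid nearest the small group, and how far that centroid sits from the small group's own mean.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
big = rng.normal(size=(1000, 2))          # large group around (0, 0)
small = rng.normal(size=(100, 2)) + 2.0   # small group around (2, 2), 10x fewer points
data = np.vstack([big, small])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Label of the centroid closest to the small group's own mean.
small_mean = small.mean(axis=0)
dists = np.linalg.norm(km.cluster_centers_ - small_mean, axis=1)
near_small = int(np.argmin(dists))

# Rows 0..999 are the large group, rows 1000..1099 the small group.
borrowed = int((km.labels_[:1000] == near_small).sum())
kept = int((km.labels_[1000:] == near_small).sum())
shift = float(dists[near_small])

print("large-group points absorbed by that centroid:", borrowed)
print("small-group points it represents:", kept)
print("its distance from the small group's mean:", round(shift, 2))
```

Any large-group points counted under that centroid are the dilution in question: the second message no longer describes the small group alone.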

Global optimization is a failed strategy: it is not scale sensitive, since it treats the data as if there were a single global scale.

Another way to say this: I am trying to learn a particular distinction, but I am constrained not to forget the general idea.  Or: KMeans attempts to memorize the data, and if it can't memorize it in its entirety it will minimize the loss of memorization, not the loss of learning the data.

Memorization is a failed strategy.

---
Notes: 
Sparse Algorithms are not Stable: A No-free-lunch Theorem
Huan Xu, Constantine Caramanis, Member, IEEE and Shie Mannor, Senior Member, IEEE



