
1.3 KMeans as a means to describe failed learning


https://rumble.com/vc5zei-1.3-kmeans-fails-to-learn.html


So here is a little data:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans  # used further down

def make_simple_data():
    X, y = make_blobs(n_samples=100, centers=1, n_features=1, center_box=(-10.0, 10.0), random_state=0)
    X = pd.DataFrame(X)
    plt.hist(X.values)
    plt.title('hist of rand 1d data 100 samples')
    plt.savefig('hist_1d_100samples.png', dpi=300)
    plt.close()



Looks pretty Gaussian.
I tried this with multiple centers and got a mess, so I ended up doing something a little uglier to create a multi-dimensional dataset with multiple centers.


def get_all_data():
    X, y = make_blobs(n_samples=500000, centers=1, n_features=1, center_box=(-10.0, 10.0), random_state=0)
    X_all = pd.DataFrame(X)
    X_all = X_all - X_all.mean()
    return X_all


def get_2d_data():
    X_all = get_all_data()
    samples = 1000
    dimen = 2
    X = X_all.sample(n=samples).reset_index(drop=True)
    for d in range(1, dimen):
        X2 = X_all.sample(n=samples).reset_index(drop=True).rename(columns={0: d})
        X = pd.concat([X, X2], axis=1)
    return X


X = get_2d_data()

X.plot.scatter(x=0, y=1)
plt.savefig('2d_data.png', dpi=300)

kmeans = KMeans(n_clusters=1, random_state=0).fit(X)
centroids_df = pd.DataFrame(kmeans.cluster_centers_)


print(centroids_df.values)
print(X.mean().values)

[[-0.02331018  0.00369451]]
[-0.02331018  0.00369451]


KMeans is finding the representation that minimizes the global distortion.  In other words: if I can only communicate a single point, what is the message (the single point) that best represents the data?  The average (mean) of all the data.  Why?  Because the mean is the point whose total squared distance to all the other data points is smallest, so it minimizes the distortion of the original data.

This works especially well here because the data has a Gaussian distribution, so the center is more densely populated than the edges.
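The claim that the mean minimizes the distortion can be checked directly.  Here is a minimal sketch on a stand-in 1d Gaussian sample (freshly generated, not the X above, so the numbers are only illustrative): no candidate message has a smaller sum of squared distances than the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)  # stand-in 1d Gaussian sample

def distortion(point):
    """Sum of squared distances -- the KMeans objective for a single centroid."""
    return ((x - point) ** 2).sum()

# the mean beats (or ties) every other candidate message
candidates = [x.mean(), 0.0, 0.5, -1.0, x.max()]
best = min(candidates, key=distortion)
print(best == x.mean())  # -> True
```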

What happens when I set K=2, i.e. I can now communicate two messages, two points?  Which two should I choose?

# - two centroids
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
centroids_df = pd.DataFrame(kmeans.cluster_centers_)

# add count, ordered by cluster label so the sizes line up with the centroid rows
labels_df = pd.DataFrame(kmeans.labels_)
count_df = labels_df[0].value_counts().sort_index()

plt.scatter(X[0], X[1])
plt.scatter(centroids_df[0], centroids_df[1], s=count_df)
plt.savefig('2d_2centroids.png', dpi=300)

Here the size of each centroid marker is proportional to the number of points it represents.

What happens when there are two sets of data?

X1 = X + 2

# - two centroids (DataFrame.append was removed in pandas 2.0, so use pd.concat)
kmeans = KMeans(n_clusters=2, random_state=0).fit(pd.concat([X, X1]))
centroids_df = pd.DataFrame(kmeans.cluster_centers_)

# add count, ordered by cluster label
labels_df = pd.DataFrame(kmeans.labels_)
count_df = labels_df[0].value_counts().sort_index()

plt.scatter(X[0], X[1])
plt.scatter(X1[0], X1[1])
plt.scatter(centroids_df[0], centroids_df[1], s=count_df)
plt.savefig('2d_2centers_2centroids.png', dpi=300)



Now comes a tricky problem: I still get to communicate two messages, but the data is not evenly distributed.

X1 = X + 2
X1 = X1.sample(X1.shape[0] // 10)

kmeans = KMeans(n_clusters=2, random_state=0).fit(pd.concat([X, X1]))
centroids_df = pd.DataFrame(kmeans.cluster_centers_)

labels_df = pd.DataFrame(kmeans.labels_)
count_df = labels_df[0].value_counts().sort_index()  # order by cluster label so the sizes match the centroid rows for the plot

plt.scatter(X[0], X[1])
plt.scatter(X1[0], X1[1])
plt.scatter(centroids_df[0], centroids_df[1], s=count_df)
plt.savefig('2d_2uneven_close_centers_2centroids.png', dpi=300)


Here it can be clearly seen that KMeans chooses to communicate the average message, which dilutes the uniqueness of each independent cluster.

Globally minimizing the distortion is a failed strategy; it is not scale sensitive, since it treats the data as if there were a single global scale.
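One way to see the scale insensitivity: the KMeans objective is a single global sum of squared distances, so the dominant cluster's internal spread swamps the small cluster's contribution.  A minimal sketch, using assumed synthetic stand-in data with a 10-to-1 size ratio like the example above:

```python
import numpy as np

rng = np.random.default_rng(0)
big = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))   # the dominant cluster
small = rng.normal(loc=2.0, scale=1.0, size=(100, 2))  # 10x fewer points

def cluster_cost(points):
    """Within-cluster sum of squared distances to the cluster's own mean."""
    centroid = points.mean(axis=0)
    return ((points - centroid) ** 2).sum()

big_cost, small_cost = cluster_cost(big), cluster_cost(small)
share = big_cost / (big_cost + small_cost)
print(f"large cluster's share of the total distortion: {share:.0%}")
```

Because the objective is dominated by the large cluster, KMeans can lower it more by averaging or splitting across the large cluster than by dedicating a clean centroid to the small one.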

Another way to say this: I am trying to learn a particular distinction, but I am constrained not to forget the general idea.  Or: KMeans attempts to memorize the data, and if it can't memorize it in its entirety, it minimizes the loss of memorization, not the loss of learning the data.

Memorization is a failed strategy.

---
Notes:
Huan Xu, Constantine Caramanis, and Shie Mannor, "Sparse Algorithms are not Stable: A No-free-lunch Theorem."



