KMeans as a means to describe failed learning.
https://rumble.com/vc5zei-1.3-kmeans-fails-to-learn.html
So here is a little data:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

def make_simple_data():
    X, y = make_blobs(n_samples=100, centers=1, n_features=1,
                      center_box=(-10.0, 10.0), random_state=0)
    X = pd.DataFrame(X)
    plt.hist(X.values)
    plt.title('hist of rand 1d data 100 samples')
    plt.savefig('hist_1d_100samples.png', dpi=300)
    plt.close()
Looks pretty Gaussian. I tried this with multiple centers and got a mess, so I ended up doing something a little uglier to create a multi-dimensional dataset with multiple centers.
def get_all_data():
    X, y = make_blobs(n_samples=500000, centers=1, n_features=1,
                      center_box=(-10.0, 10.0), random_state=0)
    X_all = pd.DataFrame(X)
    X_all = X_all - X_all.mean()
    return X_all

def get_2d_data():
    X_all = get_all_data()
    samples = 1000
    dimen = 2
    X = X_all.sample(n=samples).reset_index(drop=True)
    for d in range(1, dimen):
        X2 = X_all.sample(n=samples).reset_index(drop=True).rename(columns={0: d})
        X = pd.concat([X, X2], axis=1)
    return X
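(As an aside, a roughly equivalent single-center 2-D dataset can also be drawn directly from make_blobs with n_features=2; a minimal sketch of that alternative, not the construction used above:)

```python
import pandas as pd
from sklearn.datasets import make_blobs

# Sketch: one isotropic Gaussian blob sampled directly in 2-D,
# then centered at the origin like get_all_data() does.
X_direct, _ = make_blobs(n_samples=1000, centers=1, n_features=2,
                         center_box=(-10.0, 10.0), random_state=0)
X_direct = pd.DataFrame(X_direct)
X_direct = X_direct - X_direct.mean()
print(X_direct.shape)  # (1000, 2)
```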
from sklearn.cluster import KMeans

X = get_2d_data()
X.plot.scatter(x=0, y=1)
plt.savefig('2d_data.png', dpi=300)
plt.close()

kmeans = KMeans(n_clusters=1, random_state=0).fit(X)
centroids_df = pd.DataFrame(kmeans.cluster_centers_)
In [49]: print(centroids_df.values)
...: print(X.mean().values)
[[-0.02331018 0.00369451]]
[-0.02331018 0.00369451]
KMeans is finding the representation that minimizes the global distortion, i.e. if I can only communicate a single point, what is the message (the single point) that best represents the data? The average (mean) of all the data. Why? Because the mean is the single point that minimizes the total squared distance to all the data points, so the distortion of the original data is minimized.
This holds for any distribution, and it is especially intuitive here since the data is Gaussian: the center is more densely populated than the edges.
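The claim that the mean minimizes squared distortion is easy to check numerically. A small sketch (my own brute-force search over candidate single-point messages, not from the video):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)  # a 1-D Gaussian sample

def distortion(c):
    # total squared distance from a single "message" c to all the data
    return np.sum((x - c) ** 2)

# brute-force search over a grid of candidate messages
candidates = np.linspace(x.min(), x.max(), 201)
best = candidates[np.argmin([distortion(c) for c in candidates])]
print(best, x.mean())  # the best single point sits at (nearly) the mean
```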
What happens when I set K=2, i.e. I can now communicate two messages, two points? Which two should I choose?
# - two centroids
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
centroids_df = pd.DataFrame(kmeans.cluster_centers_)
# add count, ordered by cluster label so sizes line up with the centroids
labels_df = pd.DataFrame(kmeans.labels_)
count_df = labels_df[0].value_counts().sort_index()
plt.scatter(X[0], X[1])
plt.scatter(centroids_df[0], centroids_df[1], s=count_df)
plt.savefig('2d_2centroids.png', dpi=300)
plt.close()
Here the size of each centroid marker is proportional to the number of points it represents.
What happens when there are two sets of data?

X1 = X + 2
# - two centroids
kmeans = KMeans(n_clusters=2, random_state=0).fit(pd.concat([X, X1], ignore_index=True))
centroids_df = pd.DataFrame(kmeans.cluster_centers_)
# add count, ordered by cluster label
labels_df = pd.DataFrame(kmeans.labels_)
count_df = labels_df[0].value_counts().sort_index()
plt.scatter(X[0], X[1])
plt.scatter(X1[0], X1[1])
plt.scatter(centroids_df[0], centroids_df[1], s=count_df)
plt.savefig('2d_2centers_2centroids.png', dpi=300)
plt.close()
Now comes a tricky problem: I still get to communicate two messages, but the data is not evenly distributed.
X1 = X + 2
X1 = X1.sample(X1.shape[0] // 10)
kmeans = KMeans(n_clusters=2, random_state=0).fit(pd.concat([X, X1], ignore_index=True))
centroids_df = pd.DataFrame(kmeans.cluster_centers_)
labels_df = pd.DataFrame(kmeans.labels_)
# strange little thing, need to order counts by cluster label for the plot
count_df = labels_df[0].value_counts().sort_index()
plt.scatter(X[0], X[1])
plt.scatter(X1[0], X1[1])
plt.scatter(centroids_df[0], centroids_df[1], s=count_df)
plt.savefig('2d_2uneven_close_centers_2centroids.png', dpi=300)
plt.close()
Here it can clearly be seen that KMeans chooses to communicate something close to the average message, which dilutes the uniqueness of each independent cluster.
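The trade-off can be made explicit by comparing the global distortion (what KMeans actually minimizes, its inertia) of the fitted centroids against the distortion of the "honest" per-cluster centers. A sketch with synthetic data standing in for the construction above:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
big = rng.normal(size=(1000, 2))        # large cluster at the origin
small = rng.normal(size=(100, 2)) + 2   # small cluster, 10x fewer points
data = np.vstack([big, small])

def total_distortion(points, centers):
    # sum of squared distances from each point to its nearest center
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
honest = np.array([[0.0, 0.0], [2.0, 2.0]])  # the true generating centers
print(total_distortion(data, km.cluster_centers_))  # what KMeans optimizes
print(total_distortion(data, honest))               # higher global distortion
```

KMeans wins on its own objective precisely by letting the large cluster pull the boundary, which is the dilution described above.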
Global distortion minimization is a failed strategy: it is not scale sensitive, since it treats the data as if there were a single global scale.
Another way to say this: I am trying to learn a particular distinction but am constrained not to forget the general idea. Or: KMeans attempts to memorize the data, and if it can't memorize it in its entirety it will minimize the loss of memorization, not the loss of learning the data.
Memorization is a failed strategy.