So far we have discussed two issues: scale sensitivity (with KMeans) and embeddings (with Kohonen Feature Maps).
https://rumble.com/vcb5co-1.5-agglomerative-online-clustering.html
Wouldn't it be nice if there were an algorithm that builds a representation while streaming the data, is not sensitive to initial starting conditions, and is able to capture the scale of the data...
Enter agglomerative online clustering; here is a link to my paper: https://www.cs.huji.ac.il/~werman/Papers/guedalia_etal99.pdf
So first, a better dataset that illustrates multi-scale data. Here the scales are very apparent: the top-right cluster is very tight (small), while the center-left cluster is spread wide over the space.
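The DataFrame X below comes from the earlier posts. If you are jumping in here, a hypothetical stand-in (my assumption, not the original data) is a wide two-dimensional Gaussian blob:

import numpy as np
import pandas as pd

# Hypothetical stand-in for X from the earlier posts: a wide 2-D blob centred near the origin.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(loc=0.0, scale=1.0, size=(1000, 2)))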
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Build the tight top-right cluster: shrink and shift X, then keep only 10% of the points.
X1 = (X / 4) + 2
X1 = X1.sample(X1.shape[0] // 10)

# Fit KMeans with two centroids on the combined data.
X_all = pd.concat([X, X1])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X_all)
centroids_df = pd.DataFrame(kmeans.cluster_centers_)

# Count how many points each centroid claims (ordered by label) and use the counts as marker sizes.
labels_df = pd.DataFrame(kmeans.labels_)
count_df = labels_df[0].value_counts().sort_index()

plt.close()
plt.scatter(X[0], X[1])
plt.scatter(X1[0], X1[1])
plt.scatter(centroids_df[0], centroids_df[1], s=count_df)
As before, KMeans attempts to minimise the global distortion, i.e. to represent all of the data elements as well as possible. Since the bulk of the data sits in the wide center cluster, the centroids are biased towards it.
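As a quick sanity check on that claim: the global distortion KMeans minimises is just the sum of squared distances from each point to its nearest centroid, which scikit-learn exposes as inertia_. A small sketch, reusing X_all and kmeans from above, computes it by hand and compares:

from sklearn.metrics.pairwise import euclidean_distances

# Distance from every point to every centroid; keep the squared distance to the nearest one.
d = euclidean_distances(X_all.values, kmeans.cluster_centers_)
distortion = (d.min(axis=1) ** 2).sum()
print(distortion, kmeans.inertia_)  # the two values agree up to floating-point error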
Here is my simple agglomerative algorithm.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

class Faddc:
    def __init__(self, n_cluster, feature_count):
        self.n_clusters = n_cluster
        self.m_centroids = np.zeros((n_cluster, feature_count))
        self.m_count = np.zeros((n_cluster, 1))

    def mergeCentroids(self):
        # Find the pair of centroids with the minimum distance...
        dist = euclidean_distances(self.m_centroids, self.m_centroids)
        dist = np.triu(dist, 1)      # keep each pair once (upper triangle)
        dist[dist == 0] = np.nan     # mask self-distances and identical slots
        try:
            c1, c2 = np.unravel_index(np.nanargmin(dist), dist.shape)
        except ValueError:
            # start-up case: every pairwise distance is masked, use the first two slots
            return (0, 1)
        # Merge c2 into c1 as the count-weighted mean of the two centroids.
        self.m_centroids[c1] = (
            (self.m_centroids[c2] * self.m_count[c2])
            + (self.m_centroids[c1] * self.m_count[c1])
        ) / (self.m_count[c2] + self.m_count[c1])
        self.m_centroids[c2] = None  # free the slot (it is overwritten right away)
        self.m_count[c1] += self.m_count[c2]
        return (c2, c1)

    def faddcRule(self, newPoint):
        # Free a slot by merging the two closest centroids, then hand it to the new point.
        newCentroid, mergedCentroid = self.mergeCentroids()
        self.m_centroids[newCentroid] = newPoint
        self.m_count[newCentroid] = 1

    def fit(self, X):
        for x in X:
            self.faddcRule(x)

    def predict(self, X):
        # Assign each point to its nearest centroid; also return the full distance matrix.
        y1_err = euclidean_distances(X, self.m_centroids)
        y1 = np.argmin(y1_err, axis=1)
        return (y1, y1_err)
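To see the merge rule in isolation: when a new point arrives and all slots are taken, the two closest centroids collapse into their count-weighted mean, and the freed slot is handed to the new point. A tiny numeric sketch with toy numbers of my own (not from the post):

# Two centroids at (0, 0) and (1, 0) with counts 3 and 1: merging them gives the
# count-weighted mean (0.25, 0) with count 4, and the freed slot takes the new point.
toy = Faddc(n_cluster=2, feature_count=2)
toy.m_centroids = np.array([[0.0, 0.0], [1.0, 0.0]])
toy.m_count = np.array([[3.0], [1.0]])
toy.faddcRule(np.array([5.0, 5.0]))
print(toy.m_centroids)  # [[0.25, 0.], [5., 5.]]
print(toy.m_count)      # [[4.], [1.]]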
Playing it forward with our multi-scale data and four centroids, you can see how nicely two of the centroids come to represent the two differently scaled clusters, while the excess memory ends up assigned to a few negligible data elements.
faddc = Faddc(n_cluster=4, feature_count=2)
faddc.fit(X_all.values)

centroids_df = pd.DataFrame(faddc.m_centroids)
plt.close()
plt.scatter(X[0], X[1])
plt.scatter(X1[0], X1[1])
plt.scatter(centroids_df[0], centroids_df[1], s=faddc.m_count.ravel())
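The predict method is not used above, but it makes it easy to check how the data was split: for each point it returns the index of the nearest centroid plus the full point-to-centroid distance matrix. A small sketch, reusing X_all from above:

labels, errors = faddc.predict(X_all.values)
# How many points land on each of the four centroids; the two dominant counts
# should correspond to the wide center cluster and the tight top-right one.
print(np.bincount(labels, minlength=4))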