So far we have discussed two issues: scale sensitivity (with KMeans) and embeddings (with Kohonen Feature Maps).

https://rumble.com/vcb5co-1.5-agglomerative-online-clustering.html

Wouldn't it be nice if there were an algorithm that created a representation while streaming data, was not sensitive to initial starting conditions, and was able to capture the scale of the data...

Agglomerative online clustering. Here is a link to my paper: https://www.cs.huji.ac.il/~werman/Papers/guedalia_etal99.pdf

First, a better dataset, one that exhibits multi-scale data. In this dataset the scales are very apparent: the top-right cluster is very tight (small), while the center-left cluster is spread widely over the space.

    # Build the tight cluster: shrink a copy of X by a factor of 4 and shift it by 2
    X1 = (X / 4) + 2
    # Keep a random 10% of the tight cluster's points
    X1 = X1.sample(X1.shape[0] // 10)
    # Fit KMeans on both clusters together (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    kmeans = KMeans(n_clusters=2, random_state=0).fit(pd.concat([X, X1]))
    centroids_df = pd.DataFrame(kmeans.cluster_centers_)
    labels_df = pd.DataFrame(kmeans.labels_)
    # Count how many points were assigned to each centroid
    count_df = labels_df[0].value_counts()
    count_df = count_df.reset_ind...