Too much data not enough information: a survival guide to the information age


Showing posts from December, 2020

1.5 Agglomerative online clustering

So far we have discussed two issues: scale sensitivity (with KMeans) and embeddings (with Kohonen Feature Maps).

https://rumble.com/vcb5co-1.5-agglomerative-online-clustering.html

Wouldn't it be nice if there were an algorithm that created a representation while streaming data, was not sensitive to initial starting conditions, and was able to capture the scale of the data? That is agglomerative online clustering; here is a link to my paper: https://www.cs.huji.ac.il/~werman/Papers/guedalia_etal99.pdf

First, a better dataset that describes multi-scale data. In this dataset the scales are very apparent: the top right cluster of data is very tight (small), while the center left cluster is spread widely over the space.

    X1 = (X / 4) + 2
    X1 = X1.sample(X1.shape[0] // 10)
    kmeans = KMeans(n_clusters=2, random_state=0).fit(pd.concat([X, X1]))
    centroids_df = pd.DataFrame(kmeans.cluster_centers_)
    labels_df = pd.DataFrame(kmeans.labels_)
    count_df = labels_df[0].value_counts()
    count_df = count_df.reset_ind...
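The multi-scale construction in the excerpt can be illustrated on its own. What follows is a minimal sketch, not the post's code: it assumes a 1-D Gaussian sample as a stand-in for `X` (the post's data appears to be multi-dimensional), and the names `make_multiscale`, `wide`, `shrunk` and `tight` are hypothetical. It only shows that dividing by 4 and shifting by 2 yields a cluster with a quarter of the spread, and that sampling a tenth of it makes that cluster sparse as well as tight.

```python
import random
import statistics

def make_multiscale(n=1000, seed=0):
    """Return (wide, shrunk, tight) lists of 1-D points at two different scales."""
    rng = random.Random(seed)
    # Wide cluster: unit-variance Gaussian samples (hypothetical stand-in for X).
    wide = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Same transform as in the post: shrink by 4, then shift by 2.
    shrunk = [x / 4 + 2 for x in wide]
    # Keep one tenth of the shrunk points, like X1.sample(X1.shape[0] // 10).
    tight = rng.sample(shrunk, n // 10)
    return wide, shrunk, tight

wide, shrunk, tight = make_multiscale()
ratio = statistics.pstdev(wide) / statistics.pstdev(shrunk)
print(f"spread ratio wide/shrunk: {ratio:.3f}")
print(f"points: wide={len(wide)}, tight={len(tight)}")
```

A clustering method that assumes a single scale sees one large, dense cluster and one small, sparse one, which is the situation the dataset is meant to expose.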

1.4 Kohonen feature maps (new embedded space) vs Histograms of activation

Kohonen Feature Maps are an interesting blend of two processes: simultaneously creating a model and imposing a new metric space on that model. They work by taking a simple KMeans learning function and blending it with a graph.

https://rumble.com/vc7ei8-1.4-kohonen-feature-map.html

So here is online KMeans in action. A centroid is placed in a metric space. When a data element is observed, the system calculates the distance between the closest centroid (in my example there is only a single centroid) and the data element, then moves the centroid in the direction of the data element. When a new data element is observed, the centroid moves towards the new data element.

Fast forward, and let's put two centroids in play. Now when a new data element is observed, the two centroids compete: the closest centroid is the 'winner', and only the closest centroid is moved in the direction of the new data element. This plays out very much like classic KMeans (if the data is non-stationary and stochas...
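The winner-take-all update described above can be sketched in a few lines. This is a minimal illustration under assumptions that are not from the post: a fixed learning rate `lr`, squared Euclidean distance, and synthetic 2-D data drawn around two hypothetical centers, (0, 0) and (5, 5). The function name `online_kmeans` is also hypothetical.

```python
import random

def online_kmeans(stream, centroids, lr=0.05):
    """Move only the closest centroid a fraction lr of the way toward each point."""
    centroids = [list(c) for c in centroids]
    for point in stream:
        # The centroids compete: the winner is the one nearest to the point.
        winner = min(
            range(len(centroids)),
            key=lambda i: sum((c - p) ** 2 for c, p in zip(centroids[i], point)),
        )
        # Only the winner moves in the direction of the observed point.
        centroids[winner] = [c + lr * (p - c) for c, p in zip(centroids[winner], point)]
    return centroids

rng = random.Random(0)
stream = []
for _ in range(2000):
    # Assumed data source: two Gaussian blobs with standard deviation 0.5.
    cx, cy = rng.choice([(0.0, 0.0), (5.0, 5.0)])
    stream.append((rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)))

final = sorted(online_kmeans(stream, centroids=[(1.0, 1.0), (4.0, 4.0)]))
print(final)
```

With a constant learning rate the centroids never fully settle; they keep jittering around the blob centers, which is also what lets them follow data that drifts over time.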

1.3 KMeans as a means to describe failed learning

https://rumble.com/vc5zei-1.3-kmeans-fails-to-learn.html

So here is a little data:

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_blobs

    def make_simple_data():
        # One 1-D blob of 100 samples, centered somewhere in [-10, 10].
        X, y = make_blobs(n_samples=100, centers=1, n_features=1, center_box=(-10.0, 10.0), random_state=0)
        X = pd.DataFrame(X)
        # Save a histogram of the samples to disk.
        plt.hist(X.values)
        plt.title('hist of rand 1d data 100 samples')
        plt.savefig('hist_1d_100samples.png', dpi=300)
        plt.close()

Looks pretty Gaussian. I tried this with multiple centers and got a mess, so I ended up doing something a little uglier to create a multi-dimensional dataset with multiple centers.

    def get_all_data():
        X, y = make_blobs(n_samples=500000, centers=1, n_features=1, center_box=(-10.0, 10.0), random_state=0)
        X_all = pd.DataFrame(X)
        ...

1.2 Data, Information and Meaning (communication and semiotic systems)

https://rumble.com/vbyj4q-data-information-and-meaning.html

Learning increases information in a system. Some basic definitions first.

Data: a recording of sensory perceptions; 'observations' might be a better word to describe a sensory perception
Sensory receptive field: the field of external stimuli that results in an observation by a sensor
Form: a simple elemental part, a data element
Symbol: a form that represents an object
Mapping: creating a relationship between parts
Sign: two symbols mapped to each other
System: [a closed system is] a set of signs (elements and their relationships)
Information: a measure of the signs (data and their relationships) in a system, that is, how many signs are present in a system; typically calculated as independent information (Shannon), dependent information (LZW complexity information), or prescriptive information (Kolmogorov information)

-=-=-=-

Why is communication necessary for learnin...
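The difference between the independent (Shannon) and dependent (LZW) measures named in the definition of Information can be made concrete with a small sketch. It is an illustration only: the function names `shannon_entropy` and `lzw_code_count` and the two test strings are assumptions, and the number of codes a textbook LZW pass emits is used here as a rough proxy for LZW complexity. Kolmogorov information is not computable in general, so it is left out.

```python
import math
import random
from collections import Counter

def shannon_entropy(symbols):
    """Bits per symbol, treating every symbol as independent of its neighbours."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def lzw_code_count(text):
    """Number of codes a basic LZW pass emits; fewer codes means more repeated structure."""
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    phrase = ""
    codes = 0
    for ch in text:
        candidate = phrase + ch
        if candidate in dictionary:
            phrase = candidate  # keep extending a phrase that is already known
        else:
            codes += 1  # emit the code for the known phrase
            dictionary[candidate] = len(dictionary)
            phrase = ch
    if phrase:
        codes += 1  # emit the final pending phrase
    return codes

periodic = "ab" * 1000
chars = list(periodic)
random.Random(0).shuffle(chars)
shuffled = "".join(chars)

for name, text in (("periodic", periodic), ("shuffled", shuffled)):
    print(name, shannon_entropy(text), lzw_code_count(text))
```

Both strings contain the same symbols in the same proportions, so their Shannon entropy is identical; only the LZW count notices that the periodic string has relationships between neighbouring symbols.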

1.1 Memorization, Learning and Classification

https://rumble.com/vbp5su-memorization-learning-and-classification.html

Memorization - Store the contents of a set of observations.
Learning - When a constraint is imposed together with a requirement to communicate outside the system, learning occurs: a new, more efficient representation of the observations becomes necessary, and abstractions occur.
Classification - A judgment statement (good/bad); a category is labeled with a quality.

While learning provides a better quantity (a more efficient representation), classification provides a quality.

Notes:
1. Short term memory is constrained and communicated to long term memory typically at night; most learning occurs at night
2. Maimonides in the Guide opens with the distinction between truth-falsity vs. good-bad
3. Prof. K. Smith, Human and non-human communication