
II) Too much data


Way back when, in the late 1990s, before the advent of Big Data, we had little computing power.  I was working with the Volcani Institute (the Ministry of Agriculture's research institute) on the problem of quality sorting of tomatoes.  The project, multi-sensor fusion for tomato quality control, required us to analyze tomatoes with vision, smell and touch sensors.  I was responsible for the vision sensor: setting up the lighting and the new digital camera.  Image processing was an art then, very different from the art of Deep Learning today.  I took pictures of several hundred tomatoes and stored them on CD-ROMs (I think) to analyze.  The basic image processing technique was to break the image up into small windows of 16x16 pixels, create a signature (feature vector) for each window based on texture and color, and then cluster those signatures into groups.  The idea was that sections of the tomato that look alike must also share similar qualities (bruised/sweet/healthy).
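To make the windowing step concrete, here is a minimal sketch in Python.  It is not the original project code.  I am assuming the image is a plain list of rows of (r, g, b) tuples, and the signature used here (mean color plus the spread of brightness as a crude stand-in for texture) is only an illustration, as is the name `window_signatures`.

```python
from statistics import mean, pstdev


def window_signatures(image, size=16):
    """Split an RGB image (rows of (r, g, b) tuples) into size x size windows
    and return one feature vector per full window.

    The features are an assumption for illustration only:
    (mean red, mean green, mean blue, std-dev of brightness).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    signatures = []
    # Step over the image in non-overlapping blocks; partial edge blocks are skipped.
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            pixels = [image[y][x]
                      for y in range(top, top + size)
                      for x in range(left, left + size)]
            reds = [p[0] for p in pixels]
            greens = [p[1] for p in pixels]
            blues = [p[2] for p in pixels]
            # Crude texture proxy: how much brightness varies inside the window.
            brightness = [sum(p) / 3 for p in pixels]
            signatures.append(
                (mean(reds), mean(greens), mean(blues), pstdev(brightness))
            )
    return signatures
```

Each signature is then a point in a small feature space, and the clustering step groups points that lie close together.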

Very quickly I ran into the problem of too much data.  The standard clustering techniques (much like today) assumed batch processing of the data.  That means: get all your data into a single room (on a single computer) and think.  It was a problem I had partly met earlier, when studying aerial photography for my master's, but now I had too much data and couldn't avoid it.  So how to process lots of data without storing it became my new problem.  This led to my doctoral thesis, a streaming clustering algorithm.
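What does processing data without storing it look like?  The sketch below is not the algorithm from my thesis.  It is a generic, textbook-style "leader" clustering with running-mean centroids, and the class name `StreamingClusters` and the `radius` parameter are assumptions made for this example.  Each point is seen once, folded into the nearest centroid if it is close enough, and then thrown away.

```python
import math


class StreamingClusters:
    """One-pass clustering: keeps only centroids and counts, never the points."""

    def __init__(self, radius):
        self.radius = radius   # assumed threshold: farther than this starts a new cluster
        self.centroids = []    # one list of coordinates per cluster
        self.counts = []       # number of points absorbed by each cluster

    def update(self, point):
        """Absorb one point and return the index of the cluster it joined."""
        best, best_dist = None, math.inf
        for i, centroid in enumerate(self.centroids):
            d = math.dist(point, centroid)
            if d < best_dist:
                best, best_dist = i, d
        if best is None or best_dist > self.radius:
            # Nothing close enough: this point becomes a new cluster.
            self.centroids.append(list(point))
            self.counts.append(1)
            return len(self.centroids) - 1
        # Running mean: shift the centroid toward the point by 1/n.
        self.counts[best] += 1
        n = self.counts[best]
        centroid = self.centroids[best]
        for k, value in enumerate(point):
            centroid[k] += (value - centroid[k]) / n
        return best
```

Feeding it the window signatures one at a time, for example `labels = [clusters.update(s) for s in signatures]`, keeps memory proportional to the number of clusters rather than the number of windows.  The price is that the result depends on the order in which the data arrives.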

It is interesting to pause here and think what too much data means.  What is data?

Data is recordings of sensory perceptions.  Do you like that definition?  Sometimes I find the word 'observations' better to describe a sensory perception; somehow it is less anthropomorphic.  The term 'sensory perception' and its brother term 'receptive field' are terms I first heard from Prof. Shaul Hochstein at Hebrew University.  A receptive field is the field which stimulates a neuron.  A location in space that reflects photons to a specific location on the retina excites a specific neuron at that location; a sound at a given frequency travels through the cochlea and excites a neuron.  There is something elegant about defining an abstract receptive field, independent of the type of stimulus.  Either way, it helps to separate the external source of the stimulus from the resultant stimulated sensor.

Data is then the recording of a sensor that observes an external event.

Well, storing the data is one problem, but what to do with the data once it is stored is another.

Batch processing, trying to make sense of all the data at the same time, has its logic.  Intuitively, we know that you can't understand something out of context.  Hence, it is best to reserve judgment until all the facts are gathered.  That is batch processing: gather all the data and then make a decision.

But what do we do when it is not possible to gather all the facts or, even more likely, when it is not possible to grasp all the facts, to process all the data?

Too much data.

The premise of my thesis is that learning occurs precisely when we have too much data.  I think we gravitate to the mistaken understanding that we learn things when we are exposed to them.  Oh, here is something new, let me understand it.  But this is fundamentally incorrect.  Learning requires a distance metric, and metrics require measures relative to previous knowledge; more on that later.  For now let me introduce this simple idea: we learn when we have too much data.  We learn when we need to forget something, not when we try to remember it.

Too much data forces us to move from memorization to learning.  And learning is a beautiful thing.




