
How does learning work, how much data is needed to learn, and don't cross the streams!

Learning is the process by which a model is constructed; the model describes a set of observations.  The more compact the model, the better the learning is considered to be.  This is manifest in the ability of the model to predict and generalize to out-of-sample data.  But let's not confuse learning with classification.  Again, the essence of learning is the construction of the model and the condensed representation of the observations.

So how many observations (data elements) are required to construct a model?

[Nassim Taleb addresses this question here: https://arxiv.org/pdf/1802.05495.pdf]

The typical answer, for a Gaussian/normal distribution, is 30 observations.  Simple: we construct a model of the mean and variance of the data by calculating the sample average and sample variance from our 30 observations.
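To make that concrete, here is a minimal sketch (Python with NumPy, entirely synthetic numbers): the "model" of the 30 observations is just two estimated parameters.

# A minimal sketch: "learning" a normal distribution from 30 observations
# by estimating its mean and variance.  The data here are made up.
import numpy as np

rng = np.random.default_rng(0)
observations = rng.normal(loc=5.0, scale=2.0, size=30)  # 30 sample observations

model = {
    "mean": observations.mean(),
    "variance": observations.var(ddof=1),  # unbiased sample variance
}
print(model)  # two numbers now stand in for all 30 observations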

Clearly this is not true in all cases; we do not always have simple normal distributions, and in more complex cases we would require more observations.  But let's assume the magic number of 30 holds.

So what happens when you increase the dimensionality of the data?  Well, the 'curse of dimensionality' takes over: the number of observations required grows exponentially with the number of dimensions, roughly 30^d under our rule of thumb.  So now a relatively simple model of two or three dimensions, for example the macronutrients Carbs/Protein/Fat, would require 900 to 27,000 observations.  This says that if we wanted to describe the effect a diet has on a person and we measured only the macronutrients, we would need a sample size in the tens of thousands.

Now what happens when we go to something like micronutrients?  Well, there are nine essential amino acids alone, and that gives us 30^9, a very large number (19,683,000,000,000).  So even in an ideal case where all other confounding variables were isolated, you would still need an enormous test population (actually two or three of them, since you need control/placebo groups as well).
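To see the numbers, here is a quick back-of-the-envelope sketch in Python; the only assumption is the rule of thumb above, roughly 30 observations per dimension, compounding exponentially.

# Rough sketch: if one dimension needs ~30 observations and the requirement
# grows exponentially with dimension, a d-dimensional problem needs ~30**d.
for d, label in [(1, "single variable"),
                 (2, "two macronutrients"),
                 (3, "Carbs/Protein/Fat"),
                 (9, "nine essential amino acids")]:
    print(f"{label} (d={d}): ~{30**d:,} observations")
# prints 30, 900, 27,000 and 19,683,000,000,000 respectively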

This is why Prof. John Ioannidis says: "Risk-conferring nutritional combinations may vary by an individual’s genetic background, metabolic profile, age, or environmental exposures. Disentangling the potential influence on health outcomes of a single dietary component from these other variables is challenging, if not impossible" [my emphasis] (John P. A. Ioannidis, MD, DSc, "The Challenge of Reforming Nutritional Epidemiologic Research").

What does that mean in practice?  It means that all research based on observational data done today vastly underestimates the amount of data it needs.  Yep, nothing published is good science.  Sugar is bad for you? Fat is good? Eggs? Vaccines?

But wait, you say, that can't be; I know some things work.  Gravity seems to be true, and it is based on observational data (at least at first it was).

There are two answers to this.  First, our subjective definition of truth ('gravity is true') is bolstered by a good argument: we perceive a fact as being true if we have confidence in it, and even bad data science provides confidence.

But the better answer is that when we analyze gravity we simplify the problem space: it ends up being a single-dimensional problem, and we don't need that many observations; thirty is enough.

Wait, gravity is complicated; just measuring the cannonball vs. the musket ball confused lots of people, including Galileo.  There are factors such as wind resistance that confound the variables and confuse the measurements.  But we simplified.
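As a toy illustration of that simplification (Python with NumPy, entirely made-up measurements): once falling is reduced to a one-dimensional relationship between drop height and fall time, ignoring air resistance, about thirty noisy observations recover g quite well.

# Hypothetical illustration: fit g from 30 simulated drop measurements.
import numpy as np

rng = np.random.default_rng(1)
g_true = 9.81
heights = rng.uniform(1.0, 20.0, size=30)                        # drop heights in metres
times = np.sqrt(2 * heights / g_true) + rng.normal(0, 0.02, 30)  # measured fall times (noisy)

# Model: h = 0.5 * g * t^2  ->  least-squares estimate of g
g_est = 2 * np.sum(heights * times**2) / np.sum(times**4)
print(f"estimated g ~ {g_est:.2f} m/s^2")  # close to 9.81 with only 30 observations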

The trick to simplification is abstraction.  Going up a level of representation....

 



