What Clever Hans can teach “Bayesians” about how science works and why data is not enough.

Anyone who’s ever taken an introductory psych course has probably heard the story of “Clever Hans,” the horse who, for a time, was credited with possessing intellectual gifts far above his pay grade. He was, to all appearances, able to perform arithmetic, guess at the composers of melodies, string together letters to make words and sentences, spell out the name of the painter of a picture, and so on, all by pawing at the ground with his hoof an appropriate number of times. (Letters were given number tags).

A natural suspicion that the whole performance was a hoax was challenged: first by the experience of a well-known zoologist, who proceeded to ask Hans the questions himself and still managed to elicit correct answers, and then by the investigation of a scientific commission set up by the German board of education, which reported that it could discern no signals, intentional or unintentional, by which Hans might be being alerted to the correct response.

Eventually, however, biologist/psychologist Oscar Pfungst and colleagues, on the basis of careful experiments, were able to demonstrate that nothing impossible was taking place: Hans was, indeed, receiving subtle, unintentional cues indicating the correct answers.

How did they achieve this? Not, in the first instance, by observing yet more instances. Their first move was to prove, by a simple test, that Hans was getting outside help in answering the questions.

Specifically, they showed that when the questioners themselves did not know the answers to the questions, or were screened off from the horse’s view, Hans’s performance plummeted:

With blinkers on, Hans’s performance was impaired, but he was still able to produce correct responses with some frequency.

When larger blinkers were employed, or a tent separating him from his questioner, performance collapsed. (The investigators had surmised, from Hans’s strenuous efforts to see his questioner, that the normal blinkers had allowed glimpses and hence some degree of success in responding, and so had proceeded to this refinement.)

Similarly, if the questioner did not know the answer, Hans’s performance was poor. And still, even in this case, it was sometimes better than it should have been if Hans were not figuring things out for himself. It turned out that for Hans to fail completely, no one in the room could know the right answers.

It thus became apparent that his questioners had, unbeknownst to themselves, been signaling the horse when to stop hoofing. Yet, so subtle were the clues, that even after this link between questioner and horse had been thus carefully demonstrated, clued-in, eagle-eyed spectators of the Hans show continued to insist that no such signs existed! (Hans’s trainer, meanwhile, at first stunned by the horse’s failures when unable to view his questioner, soon recovered his faith in the horse’s abilities and remained “as ardent an exponent of the belief in the horse’s intelligence as he had ever been.”)

In order to ascertain the source of Hans’s information, Pfungst conducted a careful series of experiments. Eschewing “natural conditions” (so popular in today’s vision science practice), he felt it necessary to set up special, controlled conditions:

The observations on the horse under ordinary conditions would have been quite insufficient for arriving at a decision as to the tenability of the several possible explanations. For this purpose experimentation with controlled conditions was necessary.

An unusually acute observer, Pfungst noticed that:

As soon as the experimenter had given a problem to the horse, he, involuntarily, bent his head and trunk slightly forward and the horse would then put the right foot forward and begin to tap, without, however, returning it each time to its original position. As soon as the desired number of taps was given, the questioner would make a slight upward jerk of the head. Thereupon the horse would immediately swing his foot in a wide circle, bringing it back to its original position. (This movement, which in the following exposition we shall designate as “the back step”, was never included in the count.). Now, after Hans had ceased tapping, the questioner would raise his head and trunk to their normal position. This second, far coarser movement was not the signal for the back-step, but always followed it. But whenever this second movement was omitted, Hans, who had already brought back his foot to the original position and had thereby put it out of commission, as it were, would give one more tap with his left foot.

These observations required unusual powers of discernment on the part of the observer, both because they were very minute, and because they were mixed in with others; in the case of the trainer, that very vivacious gentleman made sundry accompanying movements and was constantly moving back and forth. To abstract from these the essential and really effective movements was truly difficult. Other questioners had their own behavioral quirks. There was also the question of the timing of the cues – did they really come before, and not after, Hans made his decision to halt his count?

Again, the investigators understood that the observations needed corroboration via carefully controlled experiments, and that such corroboration entailed demonstrating a virtually perfect correlation between the hypothesized influences and the outcomes – and thus virtually perfect potential control, the ability to produce outcomes “at pleasure.”

If it was true that these movements of the questioner guided the horse in his tapping, then the following must be shown: First, that the same movements were observed in Mr. von Osten in every case of successful response; secondly, that they recurred in the same order or with only slight individual changes in the case of all who were able to obtain successful responses from the horse, and that they were absent or occurred at the wrong time in all cases of unsuccessful response. Furthermore, it was observed that it was possible to bring about unsuccessful reactions on the part of the horse as soon as the movements were voluntarily suppressed, and conversely, that by voluntarily giving the necessary signs the horse might be made to respond at pleasure; so that anyone who possessed the knowledge of the proper signs could thereby gain control over the process of response on the part of the horse. 

Pfungst was clearly not interested in merely observing outcomes and modelling their probability distributions, but in actively testing hypotheses that would lead to perfect predictability. He successfully carried out his program, achieving control (and thus predictive accuracy) over Hans’ performance.

It’s worth reading about how ingeniously and carefully the investigators explored, via a series of questions and answers, all the facets and variations of Hans’s performance. And it’s worth noting, finally, that the ultimately successful project hinged on a series of testable conjectures that step-by-step refuted any vestiges of the notion that Hans was answering the questions under his own steam. Even the (eventually) key question – what visual cues was Hans using? – did not emerge from the initial data, but was the product of insightful speculation and targeted tests.

How would a “Bayesian” have approached the problem of Hans?

How would a “Bayesian” have approached the problem of Hans’ intelligence? Would access to the very latest “Bayesian” techniques and automated packages have helped achieve, or even accelerate, Pfungst’s discoveries? Is discovery even part of the Bayesian program?

From my understanding of “Bayesianism” (and I would ask anyone to please correct me if I’m wrong), one would begin with a numerical estimate of one’s prior belief.

Belief in what? That horses are capable of doing sums? Of guessing composers? Etc? Or that Hans the specific individual horse is capable of such achievements? Or that he is not capable? Or that his trainer is feeding him the answers? Or that his trainer is feeding him the answers visually? The choice of which question to assign a prior to is perhaps obvious to a Bayesian, but not to me. So I’ll go with this: “How probable is it that Hans the individual horse is capable of doing sums in his head?”

We consult our feelings and prior information, and pick a number. This is our “prior.” We’ll assume it starts out pretty low. We might want to write it down so as not to forget it, because we’ll need to plug it in for our future calculations.

Now, according to Andrew Gelman, “Bayesian inference is conservative in that it goes with what is already known, unless the new data force a change.”

What is the new data in this case? It could only be the results of Hans’s public performances.

In these performances, Hans always answered the questions correctly. In other words, these data are undeviatingly positive. The more successful performances you were to observe, the more your belief in Hans’ abilities should increase, and the more your prior will tend to be “swamped” by the new data. (If you are a lazy Bayesian and forgo observation, your prior may be less labile.)
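For concreteness, the updating ritual just described can be sketched in a few lines of code. This is purely illustrative: the likelihood numbers are invented for the example, and the hypotheses are crude stand-ins.

```python
# Illustrative only: made-up likelihoods for a toy Bayesian update on
# H = "Hans can do sums" vs. not-H, given a string of correct public answers.

def update(prior, p_data_given_h, p_data_given_not_h):
    """One application of Bayes' rule to a single observation."""
    numerator = p_data_given_h * prior
    return numerator / (numerator + p_data_given_not_h * (1 - prior))

prior = 0.01                  # a skeptical starting belief (our "prior")
p_correct_if_smart = 0.95     # assumed: a sum-doing horse usually answers right
p_correct_if_not = 0.10       # assumed: mere hoofing rarely looks right

belief = prior
for _ in range(10):           # ten undeviatingly positive performances
    belief = update(belief, p_correct_if_smart, p_correct_if_not)

# The prior is duly "swamped": belief climbs toward 1. Note, however, that
# a cueing hypothesis assigns the same high likelihood (say, 0.95) to every
# correct answer, so the very same data leave the odds between "smart horse"
# and "cued horse" untouched, no matter how many shows you attend.
print(belief)
```

The point is not that the arithmetic is wrong, but that it cannot by itself distinguish the hypotheses that matter.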

Keeping an open mind, you continue to wait for, or actively seek out, new data. You go to more shows and read more news stories, checking to see whether Hans’s performance is holding up. You keep updating.

You read the Commission’s report. Positive, positive, positive. You update again. Your belief probability should be pretty high by now – unless you’re a skeptic (a complication that may present a computational challenge, which will not be dealt with here).

You look at Pfungst’s data, and Hans’ mixed success rate there. You factor this new data into your probability estimate of Hans’ ability to do sums, which previous positive data had raised up; your estimate should probably fall, but it’s your decision. You keep an open mind, waiting on any new data that might rain down at any moment, or actively searching it out near and far.

It’s also, I suppose, possible that your prior knowledge caused you to discount all of this data, so your probability belief stays low…But as a Bayesian, is it really your role to question or interpret the data, as opposed to simply summing it up? How does rational interpretation figure into Stan algorithms? These complications will be discussed more below, but they are not typically of concern to Bayesians…What is of concern is that a number label be attached to their beliefs based on the data that are chosen for consumption and distributional analysis. The number is the thing. Statistical packages are now available to assist those not comfortable with the necessary mathematical operations on the data. These operations will help all who wish to achieve a probability estimate with which they feel more or less comfortable. Because it’s subjective, you don’t have to justify it to anyone; and agreement with Stan’s output is also optional. We each have our own truth, and that’s ok.

What’s the problem?

It should be obvious, first, that bare-naked references to “data” are vapid. “Data” – in this case correct answers vs incorrect answers from Hans’ performances, from the investigations of the commission, from the various tests (under various conditions) conducted by Pfungst during his process of discovery, could pile up ad infinitum. Even though Bayesians claim not to be “frequentists,” they actually employ data in a frequentist fashion, except that they include an initial term that is made up by each individual, the “prior” probability of…whatever. These frequency distributions of bare-naked data are supposed to inform an individual’s current probability belief. But this belief, arrived at in this way, can be of no theoretical value, i.e. of no interest to anyone interested in actually solving the problem at hand.

The events (here, correct/incorrect answers) that count as “data” are never directly linked to the truth or falsity of a hypothesis – here, the hypothesis that Hans is (or, alternatively, is not) thinking for himself. The same data, in this case, are consistent with either view, and relative frequencies of outcomes have nothing to say about the relative truth of either one. The data are wholly contingent on conditions; conditions count, but not in a Bayesian/frequentist way. They count in a logical way. When Hans gets things wrong with blinkers on, we might surmise that they are acting as a distraction, especially as he was still performing relatively well. Or we might interpret this as an incomplete restriction on his view of cues, and try again with bigger blinkers. Is the application of bigger blinkers data-driven? How about in the case of the later tests, carefully designed and controlled to test various possibilities?

It should be clear that the positive data from the original public demonstrations of Hans’s abilities should not count in the same way that the negative data from certain of Pfungst’s experiments count. In other words, they cannot meaningfully be summed and organized into a probability distribution. The two kinds of data – “positive” and “negative” – should not be pitted against each other as though they were coin tosses. Their value depends on the creative and logical interpretation (Feynman’s “imagination in a straitjacket”) of previous-data-plus-conditions, leading to new theory-inspired tests and interpretations of new data-plus-conditions.

Anyone who chooses to calibrate their beliefs on the basis of simple summing of the bare-naked data, without references to conditions and their theoretical implications, will obviously never achieve the control over outcomes that was achieved by Pfungst in the case of Hans. What’s more, the probability belief achieved in this way would fail to reflect the distribution of the responses under an infinite number of conditions, including when the investigator undertook to completely obstruct Hans’ access to revealing signals. This product of Bayesian inference, in other words, like all products of Bayesian inference, would have no real-world value.

The next question is, would it be appropriate to call theory-based selection and control of conditions, “data-driven”? If not, then science is not “data-driven.”

According to Technopedia, “data driven” is “an adjective used to refer to a process or activity that is spurred on by data, as opposed to being driven by mere intuition or personal experience. In other words, the decision is made with hard empirical evidence and not speculation or gut feel.”

So, not theory, not “mere intuition…experience, speculation.” Just bare-naked data.

Without a theory, as Darwin observed, you might as well count the stones on Brighton Beach. Yet the “data-driven,” “data-mining,” blind correlation-seeking (multivariate analysis) framework has come to dominate (defending an intellectually vacant psychology unable to achieve reliable predictions) to the point that a “Bayesian” cottage industry has sprung up to turn the resulting confusion into a probability estimate the gullible can believe in. Wasn’t that a function of religion – to produce a (false) sense of control over things that were not under our control (because beyond our understanding)? Weird.

Short version: The probability of “the truth of a hypothesis” is supposed, by Bayesians, to be judged on the basis of the probability of certain events taken to signal the action of the hypothesized forces or principles. But the probability of any observable event (like Hans’s responses) is wholly contingent on conditions; the frequency of the events may be altered ad lib. Thus any reference to the probability of an event – a “datum” – must contain reference to the specific conditions under which it arose. Explanation, i.e. inference as to the factors dispositive to the outcome, requires speculating about what those factors might be and removing confounds, such that the outcome may be predicted with overwhelming confidence. Otherwise, the procedure is impotent in predicting outcomes, in controlling outcomes, and in using that control to generate new kinds of events (or series of events) with virtually zero probability of occurring prior to the selection and control of special conditions – and thereby in corroborating predictions and, provisionally, the assumptions that led to them.

Continuing random thoughts

When, for example, the orbit of Uranus was shown not to quite agree with Newton’s prediction, the procedure was not simply to say, well, I guess we have to adjust our probability estimates for the locations of Uranus, or for our belief in Newton’s laws, or whatever. The response was to speculate (!!) about what could be going on. Was Newton’s theory wrong…or was there another body influencing the motion in accordance with Newton’s theory…? On investigation in light of this speculation, Neptune was discovered. Was this data-driven?


How to do “science” without really trying (A lesson from the Harvard Vision Sciences Laboratory).

It’s long been clear to me that the standards of evidence in vision science (and not only vision science) have collapsed to the standards (at best) of mathematics. For many, this might appear to be an unequivocally good thing; aren’t we just talking about greater precision in our predictions? No; what I mean by adopting the standards of mathematics is adopting an attitude of indifference towards empirical fact.

As Richard Feynman observed: Mathematics is not a science from our point of view, in the sense that it is not a natural science. The test of its validity is not experiment.

In other words:

Math isn’t science, and science isn’t math. The mathematicians…do not really care about what they’re talking…or whether what they say is true. If you state the axioms, ‘if such and such is so, and such and such is so’, then the logic can be carried out without knowing what the such and such words mean…[but as a scientist] you have to have some understanding of the connection of the words with the real world… 

In vision science, not only do investigators pay no attention to whether “such and such is so” before drawing out the implications of their assumptions; they often don’t even seem to care that such and such is most definitely NOT so; they just carry on as though it were.

A relatively straightforward illustration of this “mathematical” approach to science may be seen in a 2001 Nature Reviews Neuroscience article by Holcombe and Cavanagh, titled Early binding of feature pairs for visual perception. It begins as follows:

If features such as color and orientation are processed separately by the brain at early stages (Zeki, 1978; Triesman and Gelade, 1980), how does the brain subsequently match the correct color and orientation? 

The natural interpretation of this introductory statement to an experimental study is that “given that it is the case that color and orientation are processed separately…how is something that purportedly depends on this fact accomplished?” The addition of the two citations* adds to our sense that, at least to the best of the investigators’ knowledge, color and orientation are processed separately at early stages. Still, the framing has a curiously non-committal quality.

In fact, it was known with certainty well before 2001 that the perception of color and orientation cannot be explained on the basis of the activities of neurons in the “early stages” of the visual process. Color perception is demonstrably mediated by global structural conditions and principles; it is not directly linked to local stimulation. A relatively recent and impressive demonstration of this fact is the colored cube by Purves and Lotto. (Lest there be any doubt: when I made this criticism on PubPeer, the first author agreed with me; more on that reply below.) So it would appear that the “if” framing of the opening lines is of the mathematical variety.

Except that the false claim is to be corroborated by experiment.

Maybe (incredibly) the authors didn’t realize the premise was false at the time they adopted it. No worries; won’t experiments falsify a false premise? A falsification would seem even more inevitable given that the data in these experiments are perceptual; observers are asked to report what they see. Such data are inherently incapable of tapping into “early stage” processes, because, as mentioned above, the perception of local color is contingent on the analysis of global conditions.

The investigators leap over this obstacle with an additional specious assumption, reporting that:

We found that spatially superimposed pairings of orientation with either color or luminance could be reported even for extremely high rates of presentation, which suggests that these features are coded in combination explicitly by early stages…

The breezy inference that the perceptual data for stimuli presented at high rates of presentation tap into processes at “early stages” is entirely without foundation. The authors are evidently at a loss to provide even cosmetic supporting citations. The casual, off-the-cuff style with which the claim is thrown out seems to imply that it is so well-established, or so obvious, as to need no further support. Again, this is not the case. It is an example of the “low-level vision” myth, derided by Teller (1984) as the “nothing mucks it up proviso,” although, as I discuss in “The miracle of spatial filters,” it was subsequently and fatuously embraced by Graham (e.g. 2016), who considered it reasonable to simply assert that, under certain conditions (e.g. brief presentations), the brain becomes “transparent” down to the lower levels. Holcombe and Cavanagh (2001) have clearly adopted this view.

According to Holcombe’s PubPeer reply,

Our mention of “color” at early stages should be read as processing of chromatic signals, such as differences in cone outputs, which is enough to differentiate the stimuli we used. 

Again, given that the data were perceptual, this reply can only be interpreted as indicating a continuing adherence to the “low-level vision myth.” The fact that the activity of the cones – aka the photoreceptors – underlies all features of perception makes the statement seem even sillier. Cones, unlike rods, are capable of producing colored percepts, but their relative activity levels are not directly linked to perceived color any more than they are linked to perceived shape or orientation.

Data interpreted in the light of such flimsy premises is obviously of no theoretical value.

The two authors’ easy-going attitude toward scientific practice continues to the end. Evidently, their experiments were not meant to narrow down possibilities, but simply to achieve consistency with one among who-knows-how-many of them:

The high pairing rates for spatially superimposed feature pairs suggest, among other possibilities, that some features may be assessed in combination from early levels.

Among other possibilities? Which possibilities? How should we choose among them? Clearly not on the basis of this experiment (even if it were validly premised).

The take-home lesson is that it is not difficult to produce results that you can claim to be in line with your assumptions, as long as you (and your reviewers) are not fussy about the empirical and logical validity of those assumptions. I would like to note here as well that the handling of the data is irrelevant to the fundamental problem; i.e., the problem is not statistical in nature. The problem has to do with the conceptual integrity of the reasoning on the basis of which the data was produced and interpreted, which is a different and more fundamental issue. This intellectual integrity is the only thing that distinguishes scientific interpretations of observables from pseudoscientific ones.

Continuing thoughts

In addition to the “Low-level vision” myth, Holcombe and Cavanagh (2001) implicitly adopt the spatial-filtering myth, reflected in their use of gratings as stimuli. Sine-wave gratings are the configurations that neurons in the early visual system are supposed to be “signal detectors” of in this mythology. The action of these detectors is discerned by investigators via the binary forced choice method, in which what observers may report to see is restricted to what the investigators want to hear (the putative “signal,” usually a grating). The method is (pseudo) legitimized by “Signal detection theory.” Because the investigators simply assert that sine-wave gratings are the special configuration that detectors magically organize and detect, they may be combined with other configurations, other variations in stimulation, without losing their signal character. Anything added to the “signal” configuration is termed “noise.” Thus, Holcombe and Cavanagh (2001) place their luminance grating on a “noise background.”

But this signal-noise distinction is artificial. When you “superimpose” two different configurations, the combination, perceptually, will not necessarily consist of one plus one. A completely new perceptual organization may arise. For the signal detection theorist, this doesn’t matter, because the detector magically decomposes the image into the components to which it is supposed to be “tuned.” But for the non-magical thinker, the distinction shows a fundamental misunderstanding of perception. The magical thinkers, furthermore, arbitrarily define the threshold of detection as the point where observers get 75% “correct” answers. Again, answers may, by design, only fall into “correct” and “incorrect” categories. The 75% level is supposed to be above the rate that would be achieved by pure guessing, but there is no indication that the value was tested for significance.
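As an aside, the significance check that goes unreported is trivial to run. A sketch, with hypothetical trial counts (none are given here): the chance of reaching a given number of “correct” answers by pure guessing in a two-alternative task is just a binomial tail probability.

```python
from math import comb

def guessing_tail(n_trials, n_correct, p_guess=0.5):
    """P(X >= n_correct) for X ~ Binomial(n_trials, p_guess):
    the chance of doing at least this well by pure guessing."""
    return sum(comb(n_trials, i) * p_guess**i * (1 - p_guess)**(n_trials - i)
               for i in range(n_correct, n_trials + 1))

# Hypothetical trial counts (not taken from the paper):
p_large = guessing_tail(100, 75)   # 75/100 correct: vanishingly unlikely by chance
p_small = guessing_tail(20, 15)    # 15/20 correct: p ~ 0.02, far less decisive
```

Whether “pure guessing” is even the right null is, of course, the deeper question: a “correct” answer may reflect any cue at all, not the putative detector.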

In addition, even “correct” answers don’t necessarily imply that the percept corresponded to the putative “signal”; observers may be making an inference based on some clue. But this is the beauty of the signal-detection myth: It doesn’t matter what you saw, exactly, because any perceptual experience is simply interpreted as reflecting the response of the putative detector. If the detector is supposed to detect gratings, and I flash one so quickly that all you see is a smudge (and you, the observer, read that as an indication of the fleeting presence of the grating), then that’s not a sign of a smudge detector, or of a process that grouped points according to some principles of figure-ground organization, but of a grating detector. The detector hypothesis has immunity from truth, and lets us keep the data simple.

The casual approach extends to Holcombe and Cavanagh’s (2001) additional experiments to test whether “features are processed together if they form part of a single object or group.” There is, of course, no discussion of the non-trivial problem of how perceptual groupings arise. The authors simply refer to studies that “suggest” features are processed together if…, and borrow random, unanalyzed stimuli second-hand from some of these studies. They don’t even bother to cite specific studies, but merely an in-press review. They generate some data, but it’s not worth discussing.

I would just add for completeness that at least one of only four observers was an author.

Kiorpes, Tang & Movshon (1999)

This old article provides another clear example of how normalized the practice of employing arbitrary assumptions to justify a method and guide data interpretation has become. The “experiments” involved are not experiments in the sense of being tests of assumptions, because they are not designed to falsify the assumptions even if they are false. They are simply measurements interpreted as though the assumptions were true:

“Noise masking paradigms have been used to investigate the mechanisms underlying detection performance. Barlow (1977), Pelli (1981, 1990), and others (e.g. Burgess, Wagner, Jennings & Barlow, 1981; Kersten, 1984) have shown that, under certain assumptions, the limitations on contrast detection in the presence of masking noise of varying contrast power can be partitioned into additive and non-additive components. In this scheme, overall visual efficiency (that is, the fidelity with which we detect signals in noise) results from the combined action of two stages…We have adopted this framework to try to understand the poor contrast sensitivity of amblyopes.”

The meaning of the passage is clear: There is a “scheme” that “has been used” entailing “certain assumptions,” and if we adopt this scheme, we can interpret our data in terms of “additive and non-additive components” and “the combined action of two stages.” It’s a simple choice, like what to put on in the morning.
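For readers unfamiliar with the “scheme,” here is a minimal sketch of the linear equivalent-input-noise model as it is usually presented (my reconstruction, with fabricated numbers, not the authors’ code): threshold contrast energy E is assumed to grow linearly with external noise power N, E = k·(N + N_eq), and fitting a line to the measurements yields the two “components,” the additive internal noise N_eq and the slope k (inversely related to “efficiency”).

```python
# Sketch of the assumed linear equivalent-input-noise scheme:
#     E(N) = k * (N + N_eq)
# Fitting a straight line to threshold-vs-noise data "partitions" performance
# into an additive component (N_eq) and a non-additive one (k).

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Fabricated thresholds generated from k = 2.0, N_eq = 0.5:
noise_power = [0.0, 0.5, 1.0, 2.0, 4.0]
threshold_energy = [2.0 * (N + 0.5) for N in noise_power]

k, intercept = fit_line(noise_power, threshold_energy)
n_eq = intercept / k   # the "equivalent input noise" the scheme reports
```

Note what the exercise does and does not do: any roughly linear data will yield some k and some N_eq, so the fit interprets the measurements in terms of the scheme; it does not test the scheme.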

*The two citations turn out, not unusually, to be purely cosmetic.

The miracle of spatial filters 2

The spatial filter story is contradicted by casual observation; this observation was also made by Solomon and Pelli (1994), in their Nature cover article, “The visual filter mediating letter identification.” (I’ve commented on this article in PubPeer). The observation, which is mentioned as an afterthought in a figure caption, flatly contradicts their main claim, but is treated as a minor clarification. Which is pretty extraordinary.

Solomon and Pelli’s (1994) claim is that the visual system possesses “parallel visual filters, each tuned to a band of spatial frequency.” The reference is to the rate of change between light and dark (post-Fourier analysis) on the retina. It is specified with respect to the retinal surface, e.g. in degrees of visual angle, or cycles per mm on the retina.

However, in the caption to Figure 1, the authors note that “changes in viewing distance, from 3 to 60 cm, hardly affect the visibility of any given letter, indicating that the channel scales with letter size.” So frequency, now, is not defined with respect to the retina, nor with respect to any objective geometric scale. Definitions such as “cycles per mm” are now null and void. What is the new definition? Cycles per LETTER.
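The arithmetic behind the contradiction is worth making explicit. A letter of fixed physical size viewed at 3 cm versus 60 cm subtends visual angles differing by roughly a factor of twenty, so any band fixed on the retina (cycles per degree or per mm) shifts by the same factor, while “cycles per letter” is invariant by construction. A sketch (the 1 cm letter height is a hypothetical value):

```python
from math import atan, degrees

def visual_angle_deg(size_cm, distance_cm):
    """Visual angle subtended by an object of given size at a given distance."""
    return degrees(2 * atan(size_cm / (2 * distance_cm)))

letter_cm = 1.0              # hypothetical letter height
near, far = 3.0, 60.0        # the viewing distances from the Figure 1 caption

angle_near = visual_angle_deg(letter_cm, near)   # roughly 19 degrees
angle_far = visual_angle_deg(letter_cm, far)     # roughly 1 degree

# Three "cycles per letter" expressed retinally (cycles per degree):
cpd_near = 3.0 / angle_near
cpd_far = 3.0 / angle_far

retinal_shift = cpd_far / cpd_near   # ~20x: a retinally defined channel
                                     # cannot cover both viewing distances
```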

That this definition is even more unhinged than the original concept should be obvious. There is no objective definition of letter; letters are just shapes, and their shapes are a matter of convention. (Thus we have, to begin with, a sampling problem; why should the results from an arbitrary selection of letter-shapes generalize?) These shapes-as-letters are learned. So the claim is that channels are constructed post-literacy, and, regardless of letter shape, they are “tuned” to 3 cycles per letter. This is an extraordinary claim; and it is testable.

We could, for example, use a control of non-letter shapes, or letters from an unfamiliar language. Perhaps the “filter” is the same for all shapes! This would be quite a discovery. For a wider shape, the preferred distances between light and dark would, of course, have to be wider than for narrow shapes, so that we could keep the three-cycles-per-letter “tuning.”

To summarize the proposal with its implications: retinal stimulation leads to shape perception; leads to letter recognition; leads to spontaneous, instantaneous construction of channel tuned to three-cycles-per-letter.

Even if our control experiment seemed to corroborate such assumptions, we would have strong reason to doubt it, since there is no conceivable reason why the visual system, having ascertained the shapes of the letters without the aid of these filters, would suddenly have to employ them in order to use the shapes to recognize the letters.

The way Solomon and Pelli (1994) use the traditional (though totally invalid) frequency concept and then conveniently and paradoxically fudge its definition to the point of non-existence, without anyone at Nature noticing, is quite impressive.

It should also be clear from the image in Figure 1 that the premise of any kind of frequency tuning is wholly arbitrary and contrived. The blobby presentation of letters, at any scale of blobbiness, is clearly less legible than it would be if there were no blobs, i.e. if the letters were solidly filled in. With no blobs, we would have zero intensity changes within a letter. How would our visual theoreticians describe this in spatial filter terms? Hard edges require an infinite series to analyze in Fourier terms (I’m just repeating what I’ve read); so maybe it’s impossible to perceive solidly filled-in letters… Or maybe the explanation will have to await future research.
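The Fourier point is standard textbook material and easy to verify: a square wave (a hard light/dark edge) contains only odd harmonics, with amplitudes falling off as 1/n, so no finite band of “channels” ever reproduces the edge, and partial sums always overshoot near it (the Gibbs phenomenon). A quick stdlib check:

```python
from math import sin, pi

def square_partial_sum(x, n_terms):
    """Partial Fourier series of a unit square wave:
    (4/pi) * sum over odd n of sin(n*x)/n."""
    return (4 / pi) * sum(sin((2 * k + 1) * x) / (2 * k + 1)
                          for k in range(n_terms))

# Harmonic amplitudes decay as 1/n: the edge needs the whole infinite series.
amplitudes = [(4 / pi) / n for n in (1, 3, 5, 7, 9)]

# On the plateau (x = pi/2) the partial sum converges toward 1...
plateau = square_partial_sum(pi / 2, 200)

# ...but near the edge it overshoots, approaching ~1.18 (Gibbs phenomenon)
# no matter how many terms are added; the peak just moves closer to the edge.
overshoot = square_partial_sum(pi / (2 * 200), 200)
```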

The contradiction inherent in the spatial frequency of contrast notion is equally evident in the figure labelled “sweep grating” on the NYU webpage on “spatial frequency channels” by Michael Landy. This figure is supposed to illustrate the role of spatial frequency in contrast perception, with frequency specified in terms of cycles per mm of retinal surface. The results of experiments with such figures are illustrated graphically in the next figure, showing a typical “contrast sensitivity function.” The graph shows that the middle frequencies (again, measured in terms of cycles per mm of retina) are visible at lower contrasts than the high or low ones. Casual inspection of the “sweep grating,” however, shows that, as with Solomon and Pelli’s (1994) letters, the frequency/contrast correlation isn’t robust: relatively large changes in viewing distance have little effect on the outcome.
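The viewing-distance point can be made concrete with a little geometry. The sketch below uses a simple pinhole model of the eye; the 17 mm nodal distance is a textbook approximation and the letter size is an invented example. It shows that moving the observer leaves cycles per letter fixed while changing cycles per mm of retina in proportion:

```python
import math

EYE_NODAL_DISTANCE_MM = 17.0  # approximate nodal distance of the human eye

def retinal_size_mm(object_size_m, viewing_distance_m):
    """Approximate retinal image size via the small-angle pinhole model."""
    angle_rad = 2 * math.atan(object_size_m / (2 * viewing_distance_m))
    return angle_rad * EYE_NODAL_DISTANCE_MM

def cycles_per_retinal_mm(cycles_in_pattern, object_size_m, viewing_distance_m):
    """Retinal spatial frequency of a pattern carrying a fixed cycle count."""
    return cycles_in_pattern / retinal_size_mm(object_size_m, viewing_distance_m)

# A letter 1 cm wide carrying 3 light/dark cycles, viewed from 0.5 m and 2 m:
for d in (0.5, 2.0):
    f_retina = cycles_per_retinal_mm(3, 0.01, d)
    print(f"at {d} m: 3 cycles/letter = {f_retina:.1f} cycles/mm of retina")
```

Quadrupling the viewing distance quadruples the retinal frequency while the cycles-per-letter figure stays at 3. The two “frequency” definitions therefore cannot both pick out the same fixed channel, which is the contradiction at issue.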

But maybe I’m the crazy one. The cycles-per-letter notion has been around since before Solomon and Pelli (1994), and is still going strong, for example in a recent article by Wang and Legge (2018), “Comparing the minimum spatial-frequency content for recognizing Chinese and alphabet characters.”

So: The visual image is sufficiently organized into figure and ground to produce perceived shapes, which are then recognized as letters, which are then Fourier-analyzed, which triggers the activity of “channels” with different spatial frequency preferences, and if the letter contains a component of three light/dark cycles, then that channel will get excited, and somehow this will help us to recognize the letter.


The “Bayesian” cover for poor scientific practice


I believe the so-called Bayesian practices represent a de facto attempt to legitimize ineffective scientific practice that is unable to reduce uncertainty (i.e. confounds in the data) to acceptable levels, but still aims to present a facade of progress. For robust findings with heuristic power, practitioners substitute endless discussions about how probable a result is and whose probability estimate should be believed; but since probability estimates are not testable, the discussion just goes on and on, generating heat but not light. These are discussions that, even at the highest level of rationality, led to a Humean dead end. I recently commented on a couple of papers authored or co-authored by Andrew Gelman. Below is an edited/expanded version of PubPeer comments I made on Gelman and Shalizi (2013), “Philosophy and the practice of Bayesian statistics,” as well as a shorter comment on Feldman (2017), “What are the ‘true’ statistics of the environment?”

Testable priors?

Gelman and Shalizi (2013) take a stand against “Bayesian” statistics’ “subjective prior,” which is good, as the subjective prior is ridiculous. However, their view of science as essentially a counting game – the key characteristic of the “Bayesian” school on behalf of which they advocate (albeit with caveats) – is a view that drains scientific research of its potential to add to knowledge of the world.

Statistical models have their place. However, the emphasis on statistics has become culturally conjoined with conceptual and logical laziness – with counting without giving too much consideration to what, exactly, is being counted; the results of such practice are given the honorary title of “data.”

Despite their rejection of the subjective prior, Gelman and Shalizi, as Bayesians, require a prior probability to plug into their formulas. That means they need to take into account all of the forces and all of the factors in the universe that led up to this moment, and given current conditions, take a position on the probability of outcome X. Obviously, there is no rational way to do this. But, they argue, it doesn’t matter much what prior you choose, because the prior is testable, and adjustable.

How is it testable? By taking the data already generated and running a sampling simulation, plugging the results into the formula and checking the match. But this means that all the factors that influenced the outcome remain implicit, unanalyzed, unacknowledged, possibly unknown, uncritically mashed together in the statistical blender. And it doesn’t correct for sampling error.
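For concreteness, the “sampling simulation” described above corresponds to what Gelman and Shalizi call a posterior predictive check. Here is a minimal toy sketch of the procedure, under assumptions entirely my own (a normal model with a conjugate prior, invented numbers); it also makes the circularity visible, since the same data are used both to fit the posterior and to judge it:

```python
import random
import statistics

random.seed(0)

# Toy "observed" data, assumed to come from Normal(mu, sigma=1).
observed = [random.gauss(2.0, 1.0) for _ in range(50)]
n = len(observed)

# Conjugate update for a Normal(mu0, tau0^2) prior on mu, with known sigma = 1.
mu0, tau0, sigma = 0.0, 10.0, 1.0
post_var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
post_mean = post_var * (mu0 / tau0**2 + sum(observed) / sigma**2)

# Posterior predictive check: simulate replicated datasets from the fitted
# model and compare a test statistic (here, the mean) to the observed one.
obs_stat = statistics.mean(observed)
reps = 2000
more_extreme = 0
for _ in range(reps):
    mu_draw = random.gauss(post_mean, post_var**0.5)
    replicate = [random.gauss(mu_draw, sigma) for _ in range(n)]
    if statistics.mean(replicate) >= obs_stat:
        more_extreme += 1

ppp = more_extreme / reps  # posterior predictive p-value
print(f"posterior predictive p-value for the mean: {ppp:.2f}")
```

Note what the check does and does not do: it asks whether the fitted model can reproduce a summary of the very data used to fit it, and nothing in it identifies, or even acknowledges, the confounds that produced those data in the first place.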

Here’s what would be a better test (assuming it was worth testing “priors,” which it isn’t): Test it on a new sample. How would this sample be chosen? We would have to specify conditions. How would we know what conditions to specify? The actual confounds in any situation are infinite; we would have to narrow them down. We would have to ask the original data collectors what we should control for. This information is not contained in the prior, which is just numbers, not qualities. Our description of our prior, in other words, would need to contain a footnote about the conditions to which it applies. (And for new data to apply to the previous estimate, we would have to ensure that conditions, implicit or explicit, remained unchanged).

What if we took a stab at replication with a new dataset, and the distribution of data reflecting the particular correlation of interest were compatible with our previous prior assumption? Would this mean that our prior was “correct?” Would this mean that the next sample we tested would also work this way? Would this mean that factors that weren’t involved in our present sample won’t radically affect future sampling distributions? How would we know? What if the sample turned up a different prior value? Would this mean that our previous selection was not correct, or that we got a different outcome by… chance? But we calculated the prior based on “data!” Does this mean anything at all? Will it allow us to predict any future events with any confidence?

Blindly taking events measured in a small slice of time and using that sample as the basis for a simulation doesn’t tell us why we got a particular distribution in the first place or why we should expect the next sample to have a similar character. So, in contrast to what Gelman and Shalizi are claiming, the choice of prior probability isn’t testable. It can only be inferred as a matter of induction, which the authors say they’re against.

Gambling on science

The problem here is that scientists use numbers to test theories, but theories aren’t about numbers. Theories consist of ideas, arguments, creative solutions to real-world problems, referring to real-world situations, that are as yet unresolved. Scientists don’t sit around taking stabs at untestable-in-principle “priors.” They actually predict what they expect with a very high degree of certainty to happen, not what they think might happen, with probability x. When they succeed, the scientific project may make one of its magical leaps forward.

In other words, a genuine finding in science happens when someone says, in this and this place, at this time, under these specific conditions, you will observe X. If you do observe X, then, assuming that X was otherwise very unlikely to be observed, and assuming that your grounds for expecting it (your arguments/theory) are rational and empirically sound, then we may use the arguments to generate new predictions about other things that we predict, in light of the theory, should definitely happen, and test them. This definiteness of prediction applies even in the context of (random) measurement error, the limits of which should keep the prediction well within the highly-unlikely-in-normal-circumstances range; and the degree of precision available may well improve with time.

If all science did was to estimate probabilities (on God knows what basis), then not observing X at that place at that time under those conditions would, unless the probability was assumed to be 1, be meaningless, whether or not a Bayesian somewhere chose to alter their belief or selection of prior.

The key to scientific discovery is thinking creatively about observable facts and going beyond them, making bold guesses about what lies beneath, and what this theoretical construct implies for what we may observe going forward given special, theoretically-based conditions; not looking backward at confounded data and fitting them to statistical distributions. This is why the predictions of science are high-risk in prospect, but, when successful, highly replicable/predictable/low-risk in retrospect. Whereas there is no risk in the first phase of atheoretical data-collection/analysis and choice of “prior,” but very high risk in the second, replication phase, which almost always fails.

Here is a simple example illustrating the difference: Let’s say I want to construct a projectile to intercept certain heavenly bodies that occasionally float across my sky. Reliable successes will presuppose many complex assumptions with practical implications; hits will validate those assumptions, at least for the time being. Even close calls will have value, because of the way they will reflect on those underlying, unobservable assumptions.

The current statistical approach to science is, conversely, risk-free; Bayesians would simply throw up a million projectiles at a time, count the hits, compare the distribution to their imagined priors, adjust the priors, whatever. They could then deliver a probability function describing how likely it was that the projectile hit the body. But this wouldn’t teach them anything, or help them aim their next projectile (in the way our scientist can) because hits occur randomly, not by design, not on the basis of a theory which they understand and test. This seems to be the nature of the current approach to science in many domains. It should be evident that no new knowledge, practical or otherwise, may come of it, except by isolated and uninterpretable luck. It’s no way to get to the moon.
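The contrast between the two approaches in the projectile analogy can be sketched in a few lines. Gravity aside, every number below is made up for illustration:

```python
import math
import random

G = 9.81          # gravity, m/s^2
TARGET_X = 100.0  # target distance, m (invented)
SPEED = 40.0      # launch speed, m/s (invented)

def range_of(angle_rad):
    """Ideal ballistic range on flat ground, no air resistance."""
    return SPEED**2 * math.sin(2 * angle_rad) / G

# Theory-driven: invert the range equation to aim at the target.
aim = 0.5 * math.asin(G * TARGET_X / SPEED**2)
print(f"theory predicts angle {math.degrees(aim):.1f} deg, "
      f"range {range_of(aim):.1f} m")

# "Risk-free" counting: launch at random angles and tally the hits.
random.seed(0)
hits = sum(
    abs(range_of(random.uniform(0, math.pi / 2)) - TARGET_X) < 1.0
    for _ in range(10_000)
)
print(f"random launching: hit rate {hits / 10_000:.3f}")
# The tally yields a hit probability, but no instruction on where to aim next.
```

The theory-driven shot lands on target because the model is understood and invertible; the tally, however large, only describes what happened and cannot improve the next throw.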

Red State Blue State

And here is an actual example of such a risk-free project, a study on “red states” vs “blue states” by Gelman et al (2010), cited by Gelman and Shalizi (2013). The goal of this study is described as follows:

“Income inequality in the United States has risen during the past several decades. Has this produced an increase in partisan voting differences between rich and poor?”

So, basically, do rich people vote differently than poor people?

It should be obvious that the question is very vague, regardless of approach. It should be obvious that answering it entails imagining and taking into account a great number of confounds.

In fact, the authors’ question is a little more specific than initially stated; it refers to the U.S., and to the two main political parties, and to the individual States, and to a particular time frame. Still, confounds are necessarily legion.

The question seems to imply an expectation; rich vote differently from poor. (After all, the authors probably wouldn’t ask, do brunettes vote differently from blondes?) So the authors think there might be a difference. On what basis do they think this? It isn’t particularly clear from the conceptually-vague, barely-there introduction. (They also seem to have misunderstood the current Democratic Party, and the interests it serves).

Strikingly, the authors are up-front about their lack of interest in forming an organized, control-enabling argument to justify going to the trouble to collect voting data (as well as readers’ trouble in reading it):

“We offer no sweeping story here; instead, we share some statistical observations on inequality and voting over time in the country as a whole and in the states…” (“Sweeping story” is apparently Bayesian for “hypothesis-which-allows-us-to-control-for-specific-factors-we-consider-relevant-and-thus-would-render-our-data-interpretable.”)

The results are highly predictable, “revealing patterns that suggest complex connections between inequality, geography, and partisan voting in the United States.”

Well, thanks for sharing, but I could have told you that the situation was complex without collecting or analyzing any data at all… If I had to put a number to it, I would say the chances of the situation being complex were 100%. Not much new knowledge there.

At the end, they conclude that:

“Income predicts vote choice about as well now as it did 30 years ago, but with a new geographic pattern. In poor states, income is associated with Republican voting much more than before, while in many rich states, the relation between income and vote choice is nearly zero.”

So, income never predicted vote choice very well, it still doesn’t, and we don’t know why, but here are the stats for this time and this place.

The short version of the story is that the authors made a prediction – there will be a general correlation between income inequality and voting patterns – that was not borne out, at least not in any interpretable way. This failure is due to conceptual laziness: instead of doing the work of considering the problem more deeply, they made a casual, crude prediction and produced a muddy, confounded, hardly informative “data” set, which they now generously offer us as a free gift.

Science is about thinking up rational stories about things we can’t observe, and that we can’t count, that enable successful predictions about things we can observe, if we know where and how to look. It’s not about counting and sharing blindly collected “data.” Who cares what the “prior” of a successful, or an unsuccessful, well-founded prediction is supposed to be? The point is to improve the theory so it makes definite predictions. We’re trying to radically reduce uncertainty, not engage in endless, fruitless, philosophical-in-the-worst-sense-of-the-word discussions about how to measure it.

Real scientists have always known theory comes first

People involved in science understand that you need theories to find the useful data to begin with. One of my favorite quotes is from Darwin, who said that without a theory you might as well count the stones on Brighton Beach. Similarly, Leonardo wrote: “He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast.” “Practice” here is data-collection with poor theoretical/methodological preparation.

“Statistical hypotheses” are hypotheses without a theory behind them. As such they’re just a crapshoot, as Leonardo and Darwin understood. Which is ironic, because Bayesianism passes itself off as no-risk; just collect more “data” and you’ll get nearer the probability-truth-number. As J. Gallant said in a recent PubPeer conversation, in science, “First, you measure…that’s what the government pays us for.” I suspect that the spread of “Bayesian” pseudoscientific techniques is correlated with the increased and suffocating control over science by risk-averse bureaucrats and businessmen.

I don’t think they’re getting what they’re paying for.

Speaking of Popper

The authors claim to tilt toward hypothesis-testing in the Popperian style:

“Popper tried to say how science ought to work…”

No, he tried to explain how science actually works, when it works. You can agree or disagree with his analysis, but the point was to explain the type of practice that actually produces progress.

“We have generally found Popper’s ideas on probability and statistics to be of little use and will not discuss them here.”

Have you discussed them elsewhere? To casually dismiss the views of a serious thinker like Popper – who spent an awful lot of time discussing probability – as though they were beneath consideration doesn’t seem credible. It leads one to wonder…

An old blog post by Gelman on Popper and Bayes indicates to me that he hasn’t grasped Popper’s insights. He says:

“Our progress in applied modeling has fit the Popperian pattern pretty well: we build a model out of available parts and drive it as far as it can take us, and then a little farther.”

By “available parts” I assume Gelman is talking about available facts; but as Popper well knew, scientific “models” go far beyond available facts. (As a Bayesian, he may not even be referring to facts about the natural world, but to disembodied numerical values drained of reference, what Bayesians and others often refer to as “data.”) The “Popperian pattern,” as understood by working scientists such as Feynman, is to make smart guesses that go well beyond the available information, though constrained by it (“imagination in a straitjacket”); Popper emphasized especially the value of bold guesses, more likely than not to fail but highly fruitful when successful. It was definitely not a philosophy of available parts (facts) being lumped together, but of creative leaps of faith based on rational arguments. (I need to reread Kuhn, but it’s my impression that the two, Popper and Kuhn, are not actually as different as people think. Failed hypotheses being replaced by fundamentally different ones are revolutionary moments in science; and it is also a fact that older hypotheses predicted many otherwise unsuspected facts, including the facts that were their downfall. This is one of the surprising things about the hypothetico-deductive process, and why Popper described fruitful hypotheses that fail – i.e. all hypotheses – as approximations to truth.)

The man who thinks he’s a poached egg…

That arguments about probabilities end up running around in circles in a dead-end cul-de-sac is illustrated by a recent article by Feldman (2017) “What are the “true” statistics of the environment?” He concludes:

“In Conventional Wisdom, cognitive agents can achieve optimal inference by adopting a statistical model that is as close to the true probabilities governing the environment as possible, and they are relentlessly driven by evolution toward such a model. In the subjectivist framework advocated here, distinct observers form an interconnected network of partially overlapping but distinguishable belief systems, none of whom has special claim to the truth. On this view—as in traditional Bayesian philosophy—“true” probabilities are not accessible and play no role. To speak of certain environmental probabilities as objectively true—no matter how accustomed many of us are to speaking that way—is a fallacy.”

Bayesians have been groping their way toward a Humean epistemology in the context of which, in the words of Bertrand Russell, the man who believes that he is a poached egg is to be condemned solely on the basis that he is in the minority.

Having read Feldman’s concluding remarks, what is the point of reading the arguments leading up to them? It’s simply a case of “Your guess is as good as mine.”

Science does have a special claim to truth, as evidenced by its success in controlling natural phenomena. Otherwise, why do we make a distinction between scientific and other beliefs? What’s the basis for this distinction?