Where to start? Obviously, the high end is scared to death of blind tests, given the energy it puts into circulating false claims about them.
One of the things that stood out in this article was the mention of John Atkinson's disappointment with blind tests, rooted in blind tests organized by Martin Colloms:
"But when you have taken part in a number of these blind tests and experienced how two amplifiers you know from personal experience to sound extremely different can still fail to be identified under blind conditions, then perhaps an alternative hypothesis is called for: that the very procedure of a blind listening test can conceal small but real subjective differences. Having taken part in quite a number of such blind tests, I have become convinced of the truth in this hypothesis. Over 10 years ago, for example, I failed to distinguish a Quad 405 from a Naim NAP250 or a TVA tube amplifier in such a blind test organized by Martin Colloms. Convinced by these results of the validity in the Consumer Reports philosophy, I consequently sold my exotic and expensive Lecson power amplifier with which I had been very happy and bought a much cheaper Quad 405—the biggest mistake of my audiophile career!
Read more at www.audiostream.com/content/blind-testing-golden-ears-and-envy-oh-my#2VCdyiPqi82mtw6Z.99"
So who is Martin Colloms?
He's this guy:
www.colloms.com/
Among other things, he is a technical advisor to Stereophile and other high-end magazines. He is a leading subjectivist who preaches that everything sounds different.
He's the guy who writes this sort of stuff:
www.colloms.com/pages/exerpts.aspx
What does a long-time advocate of science in audio say about Colloms' blind tests?
www.bostonaudiosociety.org/bas_speaker/wishful_thinking.htm
"In the November 1990 Stereophile, editor John Atkinson and staffer Will Hammond provide a good model for wishful-thinking analysis in discussing the results of their CD-tweak listening tests ("As We See It: Music, Fractals and Listening Tests"). In January of 1991 Martin Colloms reprises the original wishful-thinking paradigm in discussing his well-known 1986 amplifier comparisons ("As We See It: Working the Front Line").
I call their analyses wishful because they draw conclusions based on evidence that doesn't support such findings.
In the CD-tweak test Atkinson and Hammond conducted a 3222-trial single-blind listening experiment to determine whether CD tweaks (green ink, Armor-All, expensive transports) altered the sound of compact-disc playback. Subjects overall were able to identify tweaked vs untweaked CDs only 48.3% of the time, and the proportion that scored highly (five, six, or seven out of seven trials--Stereophile's definition of a keen-eared listener) was well within the range to be expected if subjects had been merely guessing.
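[My interjection, not Nousaine's: the chance baseline here is easy to check with nothing but the quoted figures. A rough sketch in Python; the exact correct-answer count isn't published, so rounding 48.3% of 3222 trials is an assumption on my part.]

```python
# Back-of-envelope check of the "merely guessing" claim, using only the
# figures quoted above. The exact success count isn't published, so 48.3%
# of 3222 trials is rounded to the nearest whole number (an assumption).
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n = 3222              # total trials, as quoted
k = round(0.483 * n)  # approximate correct identifications (~1556)

# Two-sided z-test against the 50% chance level (normal approximation).
z = (k - n / 2) / math.sqrt(n / 4)
p = 2 * (1 - normal_cdf(abs(z)))
print(f"{k}/{n} correct: z = {z:.2f}, two-sided p = {p:.3f}")

# How many "keen-eared listeners" (5+ of 7 correct) does guessing produce?
chance = sum(math.comb(7, i) for i in (5, 6, 7)) / 2**7
print(f"P(5 or more of 7 by guessing) = {chance:.3f}")  # ~0.227
```

[Note the overall score sits slightly below 50%, and pure guessers clear the "keen-eared" bar roughly 23% of the time, so a fair number of apparent golden ears are expected from chance alone.]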
Atkinson declared that there were "some listeners who could and did hear a difference." In response to several letters showing how the statistics didn't support this conclusion, Hammond insisted that "...the total of the tweaks used resulted in a sonic difference that was detected correctly well beyond the probability of it being a chance occurrence" (February 1991, p. 65).
Given the numbers published, this conclusion is simply not supported. However, there were analyses which seemed to support positive results. For example, an analysis of one musical selection, through all listening sessions, judged by males comparing different transports is shown as being "significant: p < .001," i.e., the probability of these scores occurring from chance alone is less than 0.1%.
Further analysis shows 71% (132/186) correct identifications when A and B were different and only 32% correct (62/194) when they were the same. The first proportion would be significant when compared with the 50% criterion: it exceeds 50% by more than chance would plausibly allow for a sample of this size. The difference between 71% and 32%, moreover, seems too great to be a chance happening.
So doesn't this support their conclusions? Nope: they used the wrong criterion for comparison. When the trials where B was different from A (A-B or B-A) are combined with the trials where A and B were the same (A-A and B-B), the combined score of 50.7% correct is not significantly different from what one would expect by chance. The data do suggest two important things, though. First, listeners are disposed to report differences even when there are none. This group in this example reported a difference 68% of the time when the second presentation was the same as the first. Second, one should have an equal number of same (A-A/B-B) and different (A-B/B-A) trials when the 50% criterion is employed. Otherwise the criterion score must be adjusted to account for response bias, the tendency of subjects to report differences even when a component is compared with itself. [There is an additional bias problem in the later trials if the subjects know that the number of same and different trials are equal. This is not a simple matter. Pub.]
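[Again my aside, not Nousaine's: his "wrong criterion" arithmetic is easy to reproduce from the subset counts quoted above. The article's pooled 50.7% evidently rests on slightly different totals that aren't given here, so the pooled figure below lands near 51%; the point is the same.]

```python
# Each subset looks dramatic on its own, but the correct comparison pools
# "same" and "different" trials. Counts are the ones quoted above.
import math

def z_vs_chance(k, n):
    """z-score of k-of-n correct against a 50% chance criterion."""
    return (k - n / 2) / math.sqrt(n / 4)

diff_k, diff_n = 132, 186  # correct when A and B really differed
same_k, same_n = 62, 194   # correct when A and B were identical

print(f"different trials: {diff_k/diff_n:.1%}, z = {z_vs_chance(diff_k, diff_n):.1f}")
print(f"same trials:      {same_k/same_n:.1%}, z = {z_vs_chance(same_k, same_n):.1f}")

# Pooled score: the criterion Nousaine says should have been used.
k, n = diff_k + same_k, diff_n + same_n
print(f"pooled:           {k/n:.1%}, z = {z_vs_chance(k, n):.2f}")  # ~51%, z ~ 0.4

# Response bias: how often a difference was reported when none existed.
print(f"'different' reported on identical trials: {(same_n - same_k)/same_n:.0%}")
```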
This sort of response bias was first seen in the blind amplifier tests staged by Quad in 1978. In those, the experimenters used the preference style of test: subjects were asked, "Do You Prefer A, Prefer B, or Have No Preference?" Subjects expressed a preference for either A or B 35% of the time when the amplifier was being compared with itself. They were, in other words, biased to prefer A or B (i.e., to report a difference) even when A = B. The people at Quad reported this bias correctly, concluding that based on the numbers these subjects were unable to identify amplifiers by sound alone.
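[One more aside from me: a toy simulation makes the Quad result concrete. Everything except the quoted 35% rate is hypothetical. A thousand listeners who genuinely hear nothing, comparing an amplifier with itself, still produce a steady stream of preferences.]

```python
# Null listeners facing "Prefer A, Prefer B, or No Preference" on A-vs-A
# comparisons. The 35% rate is the bias Quad reported; the trial count and
# the 50/50 split between "A" and "B" are my assumptions for illustration.
import random

random.seed(1)
TRIALS = 1000
PREF_RATE = 0.35  # probability of voicing *some* preference when A == B

prefer_a = prefer_b = no_pref = 0
for _ in range(TRIALS):
    if random.random() < PREF_RATE:
        # Bias strikes: the subject picks a side at random.
        if random.random() < 0.5:
            prefer_a += 1
        else:
            prefer_b += 1
    else:
        no_pref += 1

print(f"prefer A: {prefer_a}, prefer B: {prefer_b}, no preference: {no_pref}")
print(f"preferences voiced on identical units: {(prefer_a + prefer_b)/TRIALS:.0%}")
```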
Years later, in 1986, Martin Colloms claimed to have proved that amplifiers sound different with a 63% correct rate in a double-blind test report ("Amplifiers Do Sound Different," Hi-Fi News and Record Review, May 1986). In this case Colloms made large analytical errors. He ignored an unusually large part of his experiment (approximately 25% of the trials), a choice that may have introduced experimental bias. Colloms based his analysis only on the trials where the amplifiers were different, without compensating for the response bias already discussed. Listeners scored 63.3% correct during those trials where the amplifiers were different (95 of the 150 A-B/B-A trials). However, subjects scored correctly only 65% of the time when the amplifiers were the same (26 of 40 A-A/B-B trials). Another way of saying this is that subjects reported a difference 35% of the time (14/40 trials) when there could have been no difference.
There are two analytical ways to compensate: 1) compare the correct rate of the sames and the differents; 63.3% vs 65% is not a significant difference, and 2) adjust the criterion score. Because of response bias, we would expect a hypothetical 100-trial study in which differences were inaudible and which had all different comparisons to produce 67.5 correct responses: 35 correct responses because of bias plus 50% of the remaining 65 trials by guessing. Thus a 63.3% correct rate is below the 67.5% expected due to chance alone. [It seems to me that Nousaine is trying to have it both ways here: if a score in the neighborhood of 67% is to be expected on the A-B/B-A trials because of bias toward reporting a difference, then 65% correct is all the more significant on the A-A/B-B trials, where the subjects must overcome this bias in their answers. If you combine the "same" and "different" trials for the Colloms tests, as the author does for the CD-tweak tests, the results do appear significant. See the note at the end of the article. Pub.]
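[Aside from me again: both compensations take a few lines of arithmetic. The two-proportion z-test below is a standard textbook choice of mine, not a procedure the article names; Nousaine simply observes that 63.3% and 65% are not significantly apart.]

```python
# Both of Nousaine's compensations, using the Colloms counts quoted above.
import math

# (1) Compare correct rates on "different" vs "same" trials.
k1, n1 = 95, 150  # correct on A-B/B-A trials -> 63.3%
k2, n2 = 26, 40   # correct on A-A/B-B trials -> 65.0%
p1, p2 = k1 / n1, k2 / n2
pooled = (k1 + k2) / (n1 + n2)
se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
print(f"{p1:.1%} vs {p2:.1%}: z = {(p1 - p2) / se:.2f}")  # |z| << 1.96, no gap

# (2) Adjust the criterion for response bias in an all-"different" design.
bias = 14 / 40                      # differences reported when none existed
expected = bias + (1 - bias) * 0.5  # 35% from bias + coin flips on the rest
print(f"bias-adjusted chance level: {expected:.1%}")             # 67.5%
print(f"observed 63.3% beats adjusted chance: {p1 > expected}")  # False
```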
Note that the much-attacked ABX technique, where a forced choice is made, is free of this problem. In an ABX test a criterion of 50% due to chance is correct given a large enough sample size; however, most researchers recommend a 75%-correct criterion to eliminate the possibility that small bias errors will influence the results."
In short, in golden ear audio, one dirty hand washes the other.