
17 November 2022

There is only one test

Wikipedia lists a total of 104 statistical tests. How do you deal with all of that? Well, in 2011 Allen Downey wrote on his blog that there is really only one test:



In summary, don't worry about finding the "right" test. There's no such thing. Put your effort into choosing a test statistic (one that reflects the effect you are testing) and modeling the null hypothesis. Then simulate it, and count!

In 2016, Downey posted about the same subject again.
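Downey's recipe is easy to make concrete with a permutation test: choose a statistic, simulate the null hypothesis by shuffling group labels, and count how often the simulated statistic is at least as extreme as the observed one. A minimal Python sketch, with made-up data and a difference in means as the (assumed) test statistic:

```python
import numpy as np

def permutation_test(group_a, group_b, n_iter=10_000, seed=42):
    """One generic test: choose a statistic, model the null, simulate, count."""
    rng = np.random.default_rng(seed)
    observed = np.mean(group_a) - np.mean(group_b)      # the chosen test statistic
    pooled = np.concatenate([group_a, group_b])
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)                             # null model: group labels are exchangeable
        simulated = np.mean(pooled[:len(group_a)]) - np.mean(pooled[len(group_a):])
        extreme += abs(simulated) >= abs(observed)
    return observed, extreme / n_iter                   # p-value = share at least as extreme

# Made-up data for illustration
a = np.array([5.1, 4.8, 6.2, 5.9, 5.4])
b = np.array([4.6, 4.9, 5.0, 4.4, 4.7])
stat, p = permutation_test(a, b)
print(f"difference in means = {stat:.2f}, p = {p:.3f}")
```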

02 April 2019

Statistical significance

This is a discussion that matters to researchers: there is a movement to rethink the question of statistical significance, in other words, the famous 5% threshold of so many studies. A manifesto with more than 800 signatures draws attention to how the concept is misused:

Surveys of hundreds of articles have found that statistically non-significant results are interpreted as indicating "no difference" or "no effect" in around half of them (see "Wrong interpretations" and Supplementary Information). (...) We agree, and we call for the entire concept of statistical significance to be abandoned. (...)

Instead of this [banning p-values], and in line with many others over the decades, we are calling for a stop to the use of P values in the conventional, dichotomous way: to decide whether a result refutes or supports a scientific hypothesis.


Bogard is somewhat critical of the article's approach:

Although I agree with the sentiments of the rest of the Nature article, I am afraid the ideals expressed by the authors could be abused by others who want to escape the safeguards of scientific rigor or who do not fully understand the principles of statistical inference.

According to him, Gelman has voiced similar concerns. Bogard admits the subject is not trivial, even for him:

This is hard for PhDs who have spent their whole lives doing these things. It is hard for practitioners who have built their careers on it. It is hard for me.

A word of encouragement at the end:

The economist Noah Smith discussed the backlash against p-values a few years ago. He rightly stated that "if people are doing science correctly, these problems won't matter in the long run."

19 October 2017

The p-value question in scientific research

Scientific studies generally work with the p-value, usually at the 5% level. Recently, however, much criticism has been directed at this statistic. Some journals have simply banned the term. And many studies, when replicated, yield a p-value above 5%, which "fails to confirm" the conclusions of the published work.

Many suggestions try to solve this problem with more rigorous tests and replication of studies. Another solution would be to change the traditional 5% p-value threshold:

This inconsistency is typical of many scientific studies. It is particularly common for p-values around 0.05. This explains why such a high proportion of statistically significant results fail to replicate.

In September, my colleagues and I proposed a new idea: only P values below 0.005 should be considered statistically significant. P values between 0.005 and 0.05 should be called suggestive.

Under our proposal, statistically significant results are more likely to replicate, even after accounting for the low prior odds that typically apply to studies in the social, biological and medical sciences.

Moreover, we think statistical significance should not serve as a bright-line threshold for publication. Statistically suggestive results, or even results that are largely inconclusive, could also be published, based on whether or not they report important preliminary evidence about the possibility that a new theory may be true.


The authors' idea is based on Bayes' theorem.
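As a rough illustration of that Bayesian logic (with assumed numbers, not values from the authors' paper): when only a small fraction of the hypotheses being tested are true, a 0.05 threshold lets through far more false positives than a 0.005 threshold.

```python
def false_positive_risk(alpha, power, prior_true):
    """Share of 'significant' results that are false, via Bayes' theorem,
    given the prior probability that a tested effect is real."""
    true_positives = power * prior_true
    false_positives = alpha * (1 - prior_true)
    return false_positives / (true_positives + false_positives)

# Illustrative assumptions: 1 in 10 tested hypotheses is true, 80% power.
for alpha in (0.05, 0.005):
    risk = false_positive_risk(alpha, power=0.8, prior_true=0.10)
    print(f"alpha = {alpha}: {risk:.0%} of 'significant' findings are false")
```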

14 April 2017

Alan Smith: Why we should love statistics



Do you think you are good at guessing statistics? Whether or not we consider ourselves good at math, our ability to understand and work with numbers is really quite limited, says data visualization expert Alan Smith. In this delightful talk, Smith explores the relationship between what we know and what we think we know.

13 May 2016

False Positive

Apparently the CIA and the NSA are using "metadata" to estimate the probability that a person is a terrorist. Using the mobile phone network and machine learning algorithms, US intelligence agencies are working in Pakistan to curb the spread of terrorism. Gelman, citing Grothoff and Porup, shows that a false positive rate of 0.18% in a population of 55 million people (Pakistani mobile phone users) means that 99,000 innocent people will be labeled terrorists.
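The arithmetic behind that figure is plain base-rate math; a quick sketch with the numbers quoted above:

```python
population = 55_000_000        # Pakistani mobile phone users
false_positive_rate = 0.0018   # 0.18%

false_positives = population * false_positive_rate
print(f"{false_positives:,.0f} innocent people flagged")  # 99,000
```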

18 October 2015

Excel Is Inadequate for Statistical Analysis


Below are three papers showing why Excel is unsuitable for statistical analysis. In the authors' words:

No statistical procedure in Excel should be used until Microsoft documents that the procedure is correct; it is not safe to assume that Microsoft Excel’s statistical procedures give the correct answer. Persons who wish to conduct statistical analyses should use some other package.

Abstract:

The reliability of statistical procedures in Excel are assessed in three areas: estimation (both linear and nonlinear); random number generation; and statistical distributions (e.g., for calculating p-values). Excel's performance in all three areas is found to be inadequate. Persons desiring to conduct statistical analyses of data are advised not to use Excel.

On the accuracy of statistical procedures in Microsoft Excel 97. B.D. McCullough, Berry Wilson. Computational Statistics & Data Analysis, Elsevier, 28 July 1999.


Abstract

Some of the problems that rendered Excel 97, Excel 2000 and Excel 2002 unfit for use as a statistical package have been fixed in Excel 2003, though some have not. Additionally, in fixing some errors, Microsoft introduced other errors. Excel's new and improved random number generator, at default, is supposed to produce uniform numbers on the interval (0,1); but it also produces negative numbers. Excel 2003 is an improvement over previous versions, but not enough has been done that its use for statistical purposes can be recommended.

On the accuracy of statistical procedures in Microsoft Excel 2003. B.D. McCullough, Berry Wilson. Computational Statistics & Data Analysis, Elsevier, 15 June 2005.


Abstract

Excel 2007, like its predecessors, fails a standard set of intermediate-level accuracy tests in three areas: statistical distributions, random number generation, and estimation. Additional errors in specific Excel procedures are discussed. Microsoft’s continuing inability to correctly fix errors is discussed. No statistical procedure in Excel should be used until Microsoft documents that the procedure is correct; it is not safe to assume that Microsoft Excel’s statistical procedures give the correct answer. Persons who wish to conduct statistical analyses should use some other package.
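The checks these papers run amount to comparing a package's output against references of known accuracy (the papers use NIST reference datasets and documented generator properties). A purely illustrative sketch of that kind of test, using NumPy and SciPy as stand-in references:

```python
import numpy as np
from scipy import stats

# 1. A uniform(0,1) generator should never return values outside (0, 1),
#    unlike the Excel 2003 generator described above.
rng = np.random.default_rng(0)
draws = rng.random(1_000_000)
assert 0.0 < draws.min() and draws.max() < 1.0, "values outside (0,1)"

# 2. Tail probabilities should agree with a trusted reference to many digits.
reference = stats.norm.sf(5.0)        # upper-tail P(Z > 5)
print(f"P(Z > 5) = {reference:.4e}")  # ~2.8665e-07; compare against the package under test
```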







22 December 2014

Global Warming: Statistical Meltdown

The Global Warming Statistical Meltdown by Judith Curry

At the recent United Nations Climate Summit, Secretary-General Ban Ki-moon warned that “Without significant cuts in emissions by all countries, and in key sectors, the window of opportunity to stay within less than 2 degrees [of warming] will soon close forever.” Actually, this window of opportunity may remain open for quite some time. A growing body of evidence suggests that the climate is less sensitive to increases in carbon-dioxide emissions than policy makers generally assume—and that the need for reductions in such emissions is less urgent.

According to the U.N. Framework Convention on Climate Change, preventing “dangerous human interference” with the climate is defined, rather arbitrarily, as limiting warming to no more than 2 degrees Celsius (3.6 degrees Fahrenheit) above preindustrial temperatures. The Earth’s surface temperatures have already warmed about 0.8 degrees Celsius since 1850-1900. This leaves 1.2 degrees Celsius (about 2.2 degrees Fahrenheit) to go.

In its most optimistic projections, which assume a substantial decline in emissions, the Intergovernmental Panel on Climate Change (IPCC) projects that the “dangerous” level might never be reached. In its most extreme, pessimistic projections, which assume heavy use of coal and rapid population growth, the threshold could be exceeded as early as 2040. But these projections reflect the effects of rising emissions on temperatures simulated by climate models, which are being challenged by recent observations.

Human-caused warming depends not only on increases in greenhouse gases but also on how “sensitive” the climate is to these increases. Climate sensitivity is defined as the global surface warming that occurs when the concentration of carbon dioxide in the atmosphere doubles. If climate sensitivity is high, then we can expect substantial warming in the coming century as emissions continue to increase. If climate sensitivity is low, then future warming will be substantially lower, and it may be several generations before we reach what the U.N. considers a dangerous level, even with high emissions.

The IPCC’s latest report (published in 2013) concluded that the actual change in 70 years if carbon-dioxide concentrations double, called the transient climate response, is likely in the range of 1 to 2.5 degrees Celsius. Most climate models have transient climate response values exceeding 1.8 degrees Celsius. But the IPCC report notes the substantial discrepancy between recent observation-based estimates of climate sensitivity and estimates from climate models.

Nicholas Lewis and I have just published a study in Climate Dynamics that shows the best estimate for transient climate response is 1.33 degrees Celsius with a likely range of 1.05-1.80 degrees Celsius. Using an observation-based energy-balance approach, our calculations used the same data for the effects on the Earth’s energy balance of changes in greenhouse gases, aerosols and other drivers of climate change given by the IPCC’s latest report.

We also estimated what the long-term warming from a doubling of carbon-dioxide concentrations would be, once the deep ocean had warmed up. Our estimates of sensitivity, both over a 70-year time-frame and long term, are far lower than the average values of sensitivity determined from global climate models that are used for warming projections. Also our ranges are narrower, with far lower upper limits than reported by the IPCC’s latest report. Even our upper limits lie below the average values of climate models.

Our paper is not an outlier. More than a dozen other observation-based studies have found climate sensitivity values lower than those determined using global climate models, including recent papers published in Environmetrics (2012), Nature Geoscience (2013) and Earth Systems Dynamics (2014). These new climate sensitivity estimates add to the growing evidence that climate models are running “too hot.” Moreover, the estimates in these empirical studies are being borne out by the much-discussed “pause” or “hiatus” in global warming—the period since 1998 during which global average surface temperatures have not significantly increased.

This pause in warming is at odds with the 2007 IPCC report, which expected warming to increase at a rate of 0.2 degrees Celsius per decade in the early 21st century. The warming hiatus, combined with assessments that the climate-model sensitivities are too high, raises serious questions as to whether the climate-model projections of 21st century temperatures are fit for making public policy decisions.

The sensitivity of the climate to increasing concentrations of carbon dioxide is a central question in the debate on the appropriate policy response to increasing carbon dioxide in the atmosphere. Climate sensitivity and estimates of its uncertainty are key inputs into the economic models that drive cost-benefit analyses and estimates of the social cost of carbon.

Continuing to rely on climate-model warming projections based on high, model-derived values of climate sensitivity skews the cost-benefit analyses and estimates of the social cost of carbon. This can bias policy decisions. The implications of the lower values of climate sensitivity in our paper, as well as other similar recent studies, are that human-caused warming near the end of the 21st century should be less than the 2-degrees-Celsius “danger” level for all but the IPCC’s most extreme emission scenario.

This slower rate of warming—relative to climate model projections—means there is less urgency to phase out greenhouse gas emissions now, and more time to find ways to decarbonize the economy affordably. It also allows us the flexibility to revise our policies as further information becomes available.

First draft

I learned a lot about writing an op-ed through this process. Below is my first draft. This morphed into the final version based on input from Nic, another journalist and another person who is experienced in writing op-eds, plus input from the WSJ editors. All of the words in the final version have been approved by me, although the WSJ editors chose the title.

The challenge is to simplify the language, but not the argument, and keep it interesting and relevant while at the same not distorting the information. Below is my first draft:

Some insensitivity about climate change

At the recent UN Climate Summit, Secretary-General Ban Ki-moon stated: “Without significant cuts in emissions by all countries, and in key sectors, the window of opportunity to stay within less than 2 degrees will soon close forever.”

In the context of the UN Framework Convention on Climate Change, preventing ‘dangerous human interference’ with the climate has been defined – rather arbitrarily – as limiting warming to no more than 2°C above preindustrial temperatures. The Earth’s surface temperatures have already warmed about 0.8°C, leaving only 1.2°C before reaching allegedly ‘dangerous’ levels. Based upon global climate model simulations, the Intergovernmental Panel on Climate Change (IPCC) 5th Assessment Report (AR5; 2013) projects a further increase in global mean surface temperatures with continued emissions to exceed 1.2°C sometime within the 21st century, with the timing and magnitude of the exceedance depending on future emissions.

If and when we reach this dangerous level of human caused warming depends not only on how quickly emissions rise, but also on the sensitivity of the climate to greenhouse gas induced warming. If climate sensitivity is high, then we can expect substantial warming in the coming century if greenhouse gas emissions continue to increase. If climate sensitivity is low, then future warming will be substantially lower.

Climate sensitivity is the global surface warming that occurs when the concentration of carbon dioxide in the atmosphere doubles. Equilibrium climate sensitivity refers to the rise in temperature once the climate system has fully warmed up, a process taking centuries due to the enormous heat capacity of the ocean. Transient climate response is a shorter-term measure of sensitivity, over a 70 year timeframe during which carbon dioxide concentrations double.

The IPCC AR5 concluded that equilibrium climate sensitivity is likely in the range 1.5°C to 4.5°C and the transient climate response is likely in the range of 1.0°C to 2.5°C. Climate model simulations produce values in the upper region of these ranges, with most climate models having equilibrium climate sensitivity values exceeding 3.5°C and transient climate response values exceeding 1.8°C.

At the lower end of the sensitivity ranges reported by the IPCC AR5 are values of the climate sensitivity determined using an energy budget model approach that matches global surface temperatures with greenhouse gas concentrations and other forcings (such as solar variations and aerosol forcings) over the last century or so. I coauthored a paper recently published in Climate Dynamics that used this approach to determine climate sensitivity. Our calculations used the same forcing data given by the IPCC AR5, and we included a detailed accounting of the impact of uncertainties in the forcing data on our climate sensitivity estimates.

Our results show the best (median) estimate for equilibrium climate sensitivity is 1.64°C, with a likely (17–83% probability) range of 1.25–2.45°C. The median estimate for Transient Climate Response is 1.33°C with a likely range of 1.05–1.80°C. Most significantly, our new results support narrower likely ranges for climate sensitivity with far lower upper limits than reported by the IPCC AR5. Our upper limits lie below – for equilibrium climate sensitivity, substantially below – the average values of climate models used for warming projections. The true climate sensitivity may even be lower, since the energy budget model assumes that all climate change is forced, and does not account for the effects of decadal and century scale internal variability associated with long-term ocean oscillations.

These new climate sensitivity estimates add to the growing evidence that climate models are running ‘too hot.’ At the heart of the recent scientific debate on climate change is the ‘pause’ or ‘hiatus’ in global warming – the period since 1998 during which global average surface temperatures have not increased. This observed warming hiatus contrasts with the expectation from the 2007 IPCC Fourth Assessment Report that warming would proceed at a rate of 0.2°C per decade in the early decades of the 21st century. The warming hiatus combined with assessments that the climate model sensitivities are too high raises serious questions as to whether the climate model projections of 21st century temperatures have much utility for decision making.

The sensitivity of our climate to increasing concentrations of carbon dioxide is at the heart of the public debate on the appropriate policy response to increasing carbon dioxide in the atmosphere. Climate sensitivity and estimates of its uncertainty are key inputs into the economic models that drive cost-benefit analyses and estimates of the social cost of carbon.

Continuing to use the higher global climate model-derived values of climate sensitivity skews the cost-benefit analyses and estimates of the social cost of carbon. The implications of the lower values of climate sensitivity in our paper are that human caused warming near the end of the 21st century should be less than the 2°C ‘danger’ level for all but the most extreme emission scenario considered by the IPCC AR5. This delay in the warming – relative to climate model projections – relaxes the phase out period for greenhouse gas emissions, allowing more time to find ways to decarbonize the economy affordably and the flexibility to revise our policies as further information becomes available.
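For readers curious about the energy-budget approach the draft describes, here is a minimal sketch of the standard relations, with purely illustrative inputs rather than the values used in Lewis and Curry's paper:

```python
F_2X = 3.71  # W/m^2, assumed radiative forcing from a doubling of CO2

def transient_climate_response(d_temp, d_forcing):
    """Energy-budget estimate: TCR = F_2x * dT / dF."""
    return F_2X * d_temp / d_forcing

def equilibrium_climate_sensitivity(d_temp, d_forcing, d_heat_uptake):
    """Energy-budget estimate: ECS = F_2x * dT / (dF - dQ)."""
    return F_2X * d_temp / (d_forcing - d_heat_uptake)

# Illustrative changes in temperature (K), forcing and ocean heat uptake (W/m^2)
dT, dF, dQ = 0.75, 2.0, 0.4
print(f"TCR ~ {transient_climate_response(dT, dF):.2f} K")
print(f"ECS ~ {equilibrium_climate_sensitivity(dT, dF, dQ):.2f} K")
```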

10 December 2014

Methodological flaws in empirical accounting research

Some Methodological Deficiencies in Empirical Research Articles in Accounting. Accounting Horizons: September 2014
Abstract:

This paper uses a sample of the regression and behavioral papers published in The Accounting Review and the Journal of Accounting Research from September 2012 through May 2013. We argue first that the current research results reported in empirical regression papers fail adequately to justify the time period adopted for the study. Second, we maintain that the statistical analyses used in these papers as well as in the behavioral papers have produced flawed results. We further maintain that their tests of statistical significance are not appropriate and, more importantly, that these studies do not—and cannot—properly address the economic significance of the work. In other words, significance tests are not tests of the economic meaningfulness of the results. We suggest ways to avoid some but not all of these problems. We also argue that replication studies, which have been essentially abandoned by accounting researchers, can contribute to our search for truth, but few will be forthcoming unless the academic reward system is modified.

Keywords:  research methodology, statistical analysis

Received: September 2013; Accepted: May 2014; Published Online: May 2014

Thomas R. Dyckman and Stephen A. Zeff (2014) Some Methodological Deficiencies in Empirical Research Articles in Accounting. Accounting Horizons: September 2014, Vol. 28, No. 3, pp. 695-712.

 http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2324266


Thomas R. Dyckman is a Professor Emeritus at Cornell University and an Adjunct Professor at Florida Gulf Coast University, and Stephen A. Zeff is a Professor at Rice University.

The authors' recommendations:

In summary we have endeavored to make the following points:

First, authors must adequately defend their selection of the sample period by convincing the reader that the period is stable itself and in relation to periods in close proximity.

Second, the accounting academy should actively seek and reward replications as an essential element in its aspirations to be a scientific community.

Third, authors should attend to the economic significance as well as the statistical significance of their investigations.

Fourth, authors should respect the limitation of conventional hypothesis tests applied to their data, which implies enhanced caution when declaring results to be statistically significant.

Fifth, authors could consider reporting the use of statistical intervals as a way to mitigate the problems of determining the most likely alternative hypothesis and thereby the appropriate Type II error.

Sixth, authors need to be sure that, in their “Conclusions” section, they discuss the limitations of their research and how these limitations might be overcome, as well as suggest extensions for future research.

Seventh, authors should consider the use of descriptive statistics and other approaches as a means of, or support for, establishing the validity of their research objective.

Eighth, editors should consider requiring authors of accepted papers to provide a complete description of their methodology, including data collection, accuracy, and verification.

29 November 2014

Pianogram

The image below shows how many times each piano key is pressed, relative to the total, in a given piano piece. It is a pianogram of Chopin's Opus 10, No. 5.




There are others at this link.

13 November 2014

How to lie with international performance indices


“CROOKS already know these tricks. Honest men must learn them in self-defence,” wrote Darrell Huff in 1954 in “How to Lie With Statistics”, a guide to getting figures to say whatever you want them to. Sadly, Huff needs updating.

The latest way to bamboozle with numbers is the “performance index”, which weaves data on lots of measures into a single easy-to-understand international ranking. From human suffering to perceptions of corruption, from freedom to children’s happiness, nowadays no social problem or public policy lacks one (see article). Governments, think-tanks and campaigners love an index’s simplicity and clarity; when well done, it can illuminate failures, suggest solutions and goad the complacent into action. But there are problems. Competing indices jostle in the intellectual marketplace: the World Economic Forum’s Global Gender Gap ranking, published last week, goes head to head with the UN’s Gender Inequality Index, the Index of Women’s Power from Big Think, an internet forum—and even The Economist’s own Glass Ceiling Index. Worse, some indices are pointless or downright misleading.


As easy as 1, 2, 3

Which to trust, and which to ignore? In the spirit of Huff, here is our guide to concocting a spurious index. Use it to guard against guile—or follow it to shape public perceptions and government policies armed only with a catchy title, patchy data and an agenda.

First, banish pedantry and make life easier for yourself by using whatever figures are to hand, whether they are old, drawn from small or biased samples, or mixed and matched from wildly differing sources. Are figures for a country lacking? Use a “comparator”, no matter how dubious; one index of slavery, short of numbers for Ireland and Iceland, uses British figures for both (aren’t all island nations alike?). If the numbers point in the wrong direction, find tame academics and businessfolk to produce more convenient ones, and call their guesses “expert opinion”. If all that still fails to produce what you want, tweak the weighting of the elements to suit.

Get the presentation right. Leaving your methodology unpublished looks dodgy. Instead, bury a brief but baffling description in an obscure corner of your website, and reserve the home page for celebrity endorsements. Get headlines by hamming up small differences; minor year-on-year moves in the rankings may be statistical noise, but they make great copy.

Above all, remember that you can choose what to put in your index—so you define the problem and dictate the solution. Rankings of business-friendliness may favour countries with strict laws; don’t worry if they are never enforced. Measures of democracy that rely on turnout ignore the ability of autocrats to get out the vote. Indices of women’s status built on education levels forget that, in Saudi Arabia, women outnumber men in universities because they are allowed to do little else but study. If you want prostitution banned, count sex workers who cross borders illegally, but willingly, as “trafficking victims”. Criticism can always be dismissed as sour grapes and special pleading. The numbers, after all, are on your side. You’ve made sure of that.


Source: here
From the print edition: Leaders

03 November 2014

A large share of published findings in finance are false positives

Interview with Campbell Harvey: PhD from Chicago, professor at Duke University, former editor of the Journal of Finance, and vice president of the American Finance Association.




Q: Investors often rely on financial research when developing strategies. Your recent findings suggest they should be wary. What did you find?

Campbell Harvey: My paper is about how we conduct research as both academics and practitioners. I was inspired by a paper published in the biomedical field that argued that most scientific tests that are published in medicine are false. I then gathered information on 315 tests that were conducted in finance. After I corrected the test statistics, I found that about half the tests were false. That is, someone was claiming a discovery when there was no real discovery.

Q: What do you mean “correcting the tests”?

Campbell Harvey: The intuition is really simple. Suppose you are trying to predict something like the returns on a portfolio of stocks. Suppose you try 200 different variables. Just by pure chance, about 10 of these variables will be declared “significant” – yet they aren’t. In my paper, I show this by randomly generating 200 variables. The simulated data is just noise, yet a number of the variables predict the portfolio of stock returns. Again, this is what you expect by chance. The contribution of my paper is to show how to correct the tests. The picture above looks like an attractive and profitable investment. The picture below shows 200 random strategies (i.e. the data are made up). The profitable investment is just the best random strategy (denoted in dark red). Hence, it is not an attractive investment — its profitability is purely by chance!
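Harvey's demonstration is easy to reproduce: generate pure-noise "strategies", test each against a return series, and roughly 5% will look significant at the usual threshold. A sketch with entirely simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_months, n_strategies = 240, 200

returns = rng.normal(0, 0.04, n_months)               # portfolio returns: pure noise
signals = rng.normal(0, 1, (n_strategies, n_months))  # 200 random candidate predictors

significant = sum(
    stats.linregress(signal, returns).pvalue < 0.05   # t-test on the regression slope
    for signal in signals
)
print(f"{significant} of {n_strategies} useless predictors look 'significant' at 5%")
# About 10 by chance alone, as Harvey describes.
```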



Q: So you provide a new set of research guidelines?

Campbell Harvey: Exactly. Indeed, we go back in time and detail the false research findings. We then extrapolate our model out to 2032 to give researchers guidelines for the next 18 years.

Q: What are the practical implications of your research?

Campbell Harvey: The implications are provocative. Our data mainly focuses on academic research. However, our paper applies to any financial product that is sold to investors. A financial product is, for example, an investment fund that purports to beat some benchmark such as the S&P 500. Often a new product is proposed and there are claims that it outperformed when it is run on historical data (this is commonly called “backtesting” in the industry). The claim of outperformance is challenged in our paper. You can imagine researchers on Wall Street trying hundreds if not thousands of variables. When you try so many variables, you are bound to find something that looks good. But is it really good – or just luck?

Q: What do you hope people take away from your research?

Campbell Harvey: Investors need to realize that about half of the products they are sold are false – that is, there is expected to be no outperformance in the future; they were just lucky in their analysis of historical data.

Q: What reactions have Wall Street businesses had so far to your findings?

Campbell Harvey: A number of these firms have struggled with this problem. They knew it existed (some of their products “work” just by chance). It is in their own best interest to deliver on promises to their clients. Hence, my work has been embraced by the financial community rather than spurned.

Professor Harvey’s research papers, “Evaluating Trading Strategies“, “…and the Cross-Section of Expected Returns” and “Backtesting” are available at SSRN for free download.

11 October 2014

The Power of Bayesian Statistics


Statistics may not sound like the most heroic of pursuits. But if not for statisticians, a Long Island fisherman might have died in the Atlantic Ocean after falling off his boat early one morning last summer.

The man owes his life to a once obscure field known as Bayesian statistics — a set of mathematical rules for using new data to continuously update beliefs or existing knowledge.

The method was invented in the 18th century by an English Presbyterian minister named Thomas Bayes — by some accounts to calculate the probability of God’s existence. In this century, Bayesian statistics has grown vastly more useful because of the kind of advanced computing power that did not exist even 20 years ago.

It is proving especially useful in approaching complex problems, including searches like the one the Coast Guard used in 2013 to find the missing fisherman, John Aldridge (though not, so far, in the hunt for Malaysia Airlines Flight 370).

Now Bayesian statistics are rippling through everything from physics to cancer research, ecology to psychology. Enthusiasts say they are allowing scientists to solve problems that would have been considered impossible just 20 years ago. And lately, they have been thrust into an intense debate over the reliability of research results.

When people think of statistics, they may imagine lists of numbers — batting averages or life-insurance tables. But the current debate is about how scientists turn data into knowledge, evidence and predictions. Concern has been growing in recent years that some fields are not doing a very good job at this sort of inference. In 2012, for example, a team at the biotech company Amgen announced it had analyzed 53 cancer studies and found it could not replicate 47 of them.

Similar follow-up analyses have cast doubt on so many findings in fields such as neuroscience and social science that researchers talk about a “replication crisis”

Some statisticians and scientists are optimistic that Bayesian methods can improve the reliability of research by allowing scientists to crosscheck work done with the more traditional or “classical” approach, known as frequentist statistics. The two methods approach the same problems from different angles.

The essence of the frequentist technique is to apply probability to data. If you suspect your friend has a weighted coin, for example, and you observe that it came up heads nine times out of 10, a frequentist would calculate the probability of getting such a result with an unweighted coin. The answer (about 1 percent) is not a direct measure of the probability that the coin is weighted; it’s a measure of how improbable the nine-in-10 result is — a piece of information that can be useful in investigating your suspicion.

By contrast, Bayesian calculations go straight for the probability of the hypothesis, factoring in not just the data from the coin-toss experiment but any other relevant information — including whether you have previously seen your friend use a weighted coin.
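The coin example can be made concrete. Below, the frequentist side computes the probability of nine or more heads from a fair coin, and the Bayesian side computes the posterior probability that the coin is weighted, under an assumed prior and an assumed bias for a weighted coin (both numbers are illustrative, not from the article):

```python
from scipy import stats

heads, tosses = 9, 10

# Frequentist: how improbable is a result at least this extreme with a fair coin?
p_value = stats.binomtest(heads, tosses, p=0.5, alternative='greater').pvalue
print(f"p-value for 9+ heads out of 10 with a fair coin: {p_value:.3f}")  # ~0.011

# Bayesian: posterior probability that the coin is weighted, assuming a weighted
# coin lands heads 80% of the time and a 10% prior that your friend would use one.
prior_weighted = 0.10
lik_weighted = stats.binom.pmf(heads, tosses, 0.8)
lik_fair = stats.binom.pmf(heads, tosses, 0.5)
posterior = (lik_weighted * prior_weighted) / (
    lik_weighted * prior_weighted + lik_fair * (1 - prior_weighted))
print(f"posterior probability the coin is weighted: {posterior:.2f}")
```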

Scientists who have learned Bayesian statistics often marvel that it propels them through a different kind of scientific reasoning than they had experienced using classical methods.

“Statistics sounds like this dry, technical subject, but it draws on deep philosophical debates about the nature of reality,” said the Princeton University astrophysicist Edwin Turner, who has witnessed a widespread conversion to Bayesian thinking in his field over the last 15 years.

Countering Pure Objectivity

Frequentist statistics became the standard of the 20th century by promising just-the-facts objectivity, unsullied by beliefs or biases. In the 2003 statistics primer “Dicing With Death,” Stephen Senn traces the technique’s roots to 18th-century England, when a physician named John Arbuthnot set out to calculate the ratio of male to female births.

Arbuthnot gathered christening records from 1629 to 1710 and found that in London, a few more boys were recorded every year. He then calculated the odds that such an 82-year run could occur simply by chance, and found that it was one in trillions. This frequentist calculation can’t tell them what is causing the sex ratio to be skewed. Arbuthnot proposed that God skewed the birthrates to balance the higher mortality that had been observed among boys, but scientists today favor a biological explanation over a theological one.

Later in the 1700s, the mathematician and astronomer Daniel Bernoulli used a similar technique to investigate the curious geometry of the solar system, in which planets orbit the sun in a flat, pancake-shaped plane. If the orbital angles were purely random — with Earth, say, at zero degrees, Venus at 45 and Mars at 90 — the solar system would look more like a sphere than a pancake. But Bernoulli calculated that all the planets known at the time orbited within seven degrees of the plane, known as the ecliptic.

What were the odds of that? Bernoulli’s calculations put them at about one in 13 million. Today, this kind of number is called a p-value, the probability that an observed phenomenon or one more extreme could have occurred by chance. Results are usually considered “statistically significant” if the p-value is less than 5 percent.

But there is a danger in this tradition, said Andrew Gelman, a statistics professor at Columbia. Even if scientists always did the calculations correctly — and they don’t, he argues — accepting everything with a p-value of 5 percent means that one in 20 “statistically significant” results are nothing but random noise.

The proportion of wrong results published in prominent journals is probably even higher, he said, because such findings are often surprising and appealingly counterintuitive, said Dr. Gelman, an occasional contributor to Science Times.

Looking at Other Factors

Take, for instance, a study concluding that single women who were ovulating were 20 percent more likely to vote for President Obama in 2012 than those who were not. (In married women, the effect was reversed.)

Dr. Gelman re-evaluated the study using Bayesian statistics. That allowed him to look at probability not simply as a matter of results and sample sizes, but in the light of other information that could affect those results.

He factored in data showing that people rarely change their voting preference over an election cycle, let alone a menstrual cycle. When he did, the study’s statistical significance evaporated. (The paper’s lead author, Kristina M. Durante of the University of Texas, San Antonio, said she stood by the finding.)

Dr. Gelman said the results would not have been considered statistically significant had the researchers used the frequentist method properly. He suggests using Bayesian calculations not necessarily to replace classical statistics but to flag spurious results.

A famously counterintuitive puzzle that lends itself to a Bayesian approach is the Monty Hall problem, in which Mr. Hall, longtime host of the game show “Let’s Make a Deal,” hides a car behind one of three doors and a goat behind each of the other two. The contestant picks Door No. 1, but before opening it, Mr. Hall opens Door No. 2 to reveal a goat. Should the contestant stick with No. 1 or switch to No. 3, or does it matter?

A Bayesian calculation would start with one-third odds that any given door hides the car, then update that knowledge with the new data: Door No. 2 had a goat. The odds that the contestant guessed right — that the car is behind No. 1 — remain one in three. Thus, the odds that she guessed wrong are two in three. And if she guessed wrong, the car must be behind Door No. 3. So she should indeed switch.
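A short simulation confirms the reasoning above (a sketch, not tied to any particular source):

```python
import random

def monty_hall(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        # The host opens a door that is neither the contestant's pick nor the car.
        opened = next(door for door in range(3) if door != pick and door != car)
        if switch:
            pick = next(door for door in range(3) if door != pick and door != opened)
        wins += (pick == car)
    return wins / trials

print(f"stay:   {monty_hall(switch=False):.3f}")  # ~1/3
print(f"switch: {monty_hall(switch=True):.3f}")   # ~2/3
```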

In other fields, researchers are using Bayesian statistics to tackle problems of formidable complexity. The New York University astrophysicist David Hogg credits Bayesian statistics with narrowing down the age of the universe. As recently as the late 1990s, astronomers could say only that it was eight billion to 15 billion years; now, factoring in supernova explosions, the distribution of galaxies and patterns seen in radiation left over from the Big Bang, they have concluded with some confidence that the number is 13.8 billion years.

Bayesian reasoning combined with advanced computing power has also revolutionized the search for planets orbiting distant stars, said Dr. Turner, the Princeton astrophysicist.

In most cases, astronomers can’t see these planets; their light is drowned out by the much brighter stars they orbit. What the scientists can see are slight variations in starlight; from these glimmers, they can judge whether planets are passing in front of a star or causing it to wobble from their gravitational tug.

Making matters more complicated, the size of the apparent wobbles depends on whether astronomers are observing a planet’s orbit edge-on or from some other angle. But by factoring in data from a growing list of known planets, the scientists can deduce the most probable properties of new planets.

One downside of Bayesian statistics is that it requires prior information — and often scientists need to start with a guess or estimate. Assigning numbers to subjective judgments is “like fingernails on a chalkboard,” said physicist Kyle Cranmer, who helped develop a frequentist technique to identify the latest new subatomic particle — the Higgs boson.


Others say that in confronting the so-called replication crisis, the best cure for misleading findings is not Bayesian statistics, but good frequentist ones. It was frequentist statistics that allowed people to uncover all the problems with irreproducible research in the first place, said Deborah Mayo, a philosopher of science at Virginia Tech. The technique was developed to distinguish real effects from chance, and to prevent scientists from fooling themselves.

Uri Simonsohn, a psychologist at the University of Pennsylvania, agrees. Several years ago, he published a paper that exposed common statistical shenanigans in his field — logical leaps, unjustified conclusions, and various forms of unconscious and conscious cheating.

He said he had looked into Bayesian statistics and concluded that if people misused or misunderstood one system, they would do just as badly with the other. Bayesian statistics, in short, can’t save us from bad science.

At Times a Lifesaver

Despite its 18th-century origins, the technique is only now beginning to reveal its power with the advent of 21st-century computing speed.

Some historians say Bayes developed his technique to counter the philosopher David Hume’s contention that most so-called miracles were likely to be fakes or illusions. Bayes didn’t make much headway in that debate — at least not directly.

But even Hume might have been impressed last year, when the Coast Guard used Bayesian statistics to search for Mr. Aldridge, its computers continually updating and narrowing down his most probable locations.

The Coast Guard has been using Bayesian analysis since the 1970s. The approach lends itself well to problems like searches, which involve a single incident and many different kinds of relevant data, said Lawrence Stone, a statistician for Metron, a scientific consulting firm in Reston, Va., that works with the Coast Guard.

At first, all the Coast Guard knew about the fisherman was that he fell off his boat sometime from 9 p.m. on July 24 to 6 the next morning. The sparse information went into a program called Sarops, for Search and Rescue Optimal Planning System. Over the next few hours, searchers added new information — on prevailing currents, places the search helicopters had already flown and some additional clues found by the boat’s captain.

The system could not deduce exactly where Mr. Aldridge was drifting, but with more information, it continued to narrow down the most promising places to search.

Just before turning back to refuel, a searcher in a helicopter spotted a man clinging to two buoys he had tied together. He had been in the water for 12 hours; he was hypothermic and sunburned but alive.

Even in the jaded 21st century, it was considered something of a miracle.


Source: A version of this article appears in print on September 30, 2014, on page D1 of the New York edition with the headline: The Odds, Continually Updated.

18 March 2014

Confidence Interval

Many academic studies use statistical tests to check whether the stated hypothesis is confirmed or not. And alongside these tests comes the confidence interval. So, after collecting the data and feeding the values into statistical software, the results are interpreted using the confidence interval.

It sounds easy and foolproof: just read the software output. However, three researchers from the University of Groningen, together with one from Missouri, surveyed 120 researchers and 442 students, all in psychology. They presented six statements involving the interpretation of a given confidence interval and asked respondents to say whether each statement was true or false. All six statements were in fact false.

On average, respondents did poorly, getting more than three of the statements wrong. This suggests that researchers, and their students, do not know the correct interpretation of a confidence interval. That is obviously an alarming sign for the research being published around the world. So when you read about the results of a study from a renowned institution, be skeptical of the quality of the analysis: the researcher may have misinterpreted the results.
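What a 95% confidence interval does guarantee is coverage over repeated samples: about 95% of intervals constructed this way contain the true parameter, which says nothing definite about any single interval. A minimal simulation of that property, with illustrative parameters:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_mean, sigma, n, trials = 10.0, 2.0, 30, 10_000

covered = 0
for _ in range(trials):
    sample = rng.normal(true_mean, sigma, n)
    half_width = stats.t.ppf(0.975, df=n - 1) * sample.std(ddof=1) / np.sqrt(n)
    covered += (sample.mean() - half_width <= true_mean <= sample.mean() + half_width)

# Roughly 95% of the intervals contain the true mean.
print(f"coverage: {covered / trials:.3f}")
```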

Read more: HOEKSTRA, Rink et al. Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 2014.