Zak David expresses critical views of some published research in empirical quantitative finance

(First, I too will say that I don’t know how the problems in financial research compare to those in other fields in size or scope.)

The question is: with respect to quantitative investment research, how bad is the problem of data mining?

Having done a healthy share of paper replications over the past decade, and having been consistently disappointed when the models or techniques broke down on data shortly after (or even before) the authors’ sample periods, I’d say of course it’s a gigantic problem: spurious results are the norm. Over those same years, the granularity of market data available to researchers has become finer, and the ubiquity of machine learning software packages has placed predictive modeling in the hands of anyone with a pulse. As a result, I’ve found something more troubling than potential “false positives”: papers so riddled with coding errors and domain mistakes that the authors couldn’t possibly have found “a positive” in the first place (I hate neologisms, but until you come up with something better, these are called “impossitives”). So not only is it easier than ever to find spurious relationships, but there are also several new classes of fatal errors that the current publishing and peer review system is poorly equipped to identify.

That said, this should also be clear: despite its pejorative connotations, data mining itself is not malicious and does not cause false positives. In data-driven trading or quantitative investment research, data mining is a big part of what you have to do. But successfully employing those techniques requires a lot of attention to the nuances of the domain, critical evaluation of your assumptions, and a commitment to exploring the philosophies of statistics and science. When “pairs trading” was invented at Morgan Stanley in the early 1980s, the idea came from a computer science graduate from Columbia who was calculating relative weights and correlations for groups of stocks sorted by industry sector. In that same era, RenTech was getting started. Robert Frey, a former managing director, has said that their approach was “decidedly non-theoretical. Discipline is the key, not theory.” The founder, Jim Simons, tells an anecdote about sending people down to the NYFed building to hand-copy historical interest rate information that was kept in a written ledger. These people are perhaps the most successful data miners in all of finance.

An ever-growing source of catastrophic errors is academics writing code. A few years ago, one of the most famous results in empirical economics turned out to be a spreadsheet error. More recently, I discovered numerous invalidating bugs in the ad hoc backtesting software written for a paper published by the editor of the journal Quantitative Finance. From long-term factor model research to high-frequency order book analysis, researchers are writing more lines of code of ever-increasing complexity. An often-cited statistic holds that the professional software industry delivers 15-50 errors per 1,000 lines of code. Yet to the best of my knowledge, no journal has professional software developers review submitted code. Most journals still either lack or don’t enforce code submission and open publishing standards, an asinine state of affairs that both increases the uncertainty of the results and slows the progress of future research.

The other big problem, and a worsening trend (especially among lower-quality journals and lower-quality academics), is how completely blasé much of the research and publishing apparatus is about the difference in domain knowledge required to analyze large cross-sections of quarterly stock data versus high-frequency trade and order book data. Examples:

  • This paper casually invented a metric to measure high-frequency “liquidity takers”, but had the author asked anyone in industry, he would have realized he was actually counting market makers racing for queue position to provide liquidity.
  • This paper assumed stock volumes are normally distributed, so instead of finding evidence of high-frequency “quote stuffing” as the authors claim, they found evidence that stock volumes look more like this.
  • This paper’s authors didn’t know what times futures markets are open, and also applauded themselves for high predictive accuracy on a contract that no one trades.
  • This paper calculated closed PL from buying a futures contract with one expiration and selling a futures contract the next day with a different expiration.
  • This paper applied a denoising filter to the whole time series before predicting it, meaning that every point contained information from the future (see the first sketch after this list). The authors also added trading costs to their PL, instead of subtracting them.
  • This paper tried to predict 5-minute returns for US stocks but forgot that the market is closed overnight and on weekends (the second sketch below shows the session-aware fix), and the authors also admitted to trying a whole bunch of models until one looked good in sample.
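
To make the lookahead problem concrete, here is a minimal sketch in Python (assuming pandas and numpy, and using a centered moving average as a stand-in for whatever two-sided denoiser a given paper might use). Smoothing the entire series before splitting into train and test silently mixes future prices into every “denoised” point; the causal alternative uses only data available at each timestamp.

```python
import numpy as np
import pandas as pd

# Simulated price series; the filter below is a stand-in, not the one
# used in any particular paper.
rng = np.random.default_rng(0)
prices = pd.Series(100 + np.cumsum(rng.normal(0, 1, 2000)))

# Leaky: a centered (two-sided) window averages over future prices,
# so every "denoised" point already contains information from the future.
smoothed_leaky = prices.rolling(window=21, center=True).mean()

# Safer: a trailing (causal) window, or re-running the filter inside each
# walk-forward training fold, uses only data available at time t.
smoothed_causal = prices.rolling(window=21).mean()

# Sanity check: changes in the leaky smoother correlate with *future*
# returns of the raw series; the causal version should not (beyond noise).
future_ret = prices.shift(-10) - prices
print(smoothed_leaky.diff().corr(future_ret))   # clearly positive
print(smoothed_causal.diff().corr(future_ret))  # approximately zero
```

The same leak appears with wavelet or Kalman smoothers fit over the full sample; the filter has to be estimated and applied inside the walk-forward loop, not once up front.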

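And for the overnight/weekend problem, a second sketch (again Python with pandas, on synthetic 5-minute bars; the dates and trading hours are made up for illustration) showing the difference between naive bar-to-bar returns and session-aware returns that never span the close:

```python
import numpy as np
import pandas as pd

# Synthetic 5-minute bars over a week, restricted to regular trading hours.
idx = pd.date_range("2024-01-02 09:30", "2024-01-08 16:00",
                    freq="5min", tz="America/New_York")
idx = idx[idx.indexer_between_time("09:30", "16:00")]
idx = idx[idx.dayofweek < 5]  # drop Saturday/Sunday

rng = np.random.default_rng(1)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 1e-3, len(idx)))),
                  index=idx, name="close")

# Naive: the first bar of each day is compared against the previous day's
# last bar, so its "5-minute return" actually spans the overnight or
# weekend gap.
naive = close.pct_change()

# Session-aware: group by trading date so no return crosses the close;
# the first bar of each session is simply NaN.
intraday = close.groupby(close.index.date).pct_change()

first_bar = naive.index.time == pd.Timestamp("09:30").time()
print(naive[first_bar])     # spans the close (overnight/weekend)
print(intraday[first_bar])  # all NaN
```

In real data you would also want half-days and exchange holidays handled through a proper exchange calendar rather than a hard-coded 09:30-16:00 window.
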
Each of those papers has numerous other problems. When you take these problems to the authors and publishers, their first response is always “oh no! we should do something!”, but at some point I think they all go: “what has reality done for me lately? It certainly didn’t publish this paper.”