Quantoisseur

Announcement – March 2021

cfsmith — Thu, 04 Mar 2021 13:48:41 +0000

This site will not be updated or monitored going forward.

Risk-Neutral Probability Distributions: CLK2020

cfsmith — Mon, 24 Aug 2020 08:00:27 +0000

Risk-neutral probability distributions (RND) are used to compute the fair value of an asset as a discounted conditional expectation of its future payoff. In 1978, Breeden and Litzenberger presented a method to derive this distribution for an underlying asset from observable option prices [1]. The derivation of the relationship is well presented in A Simple and Reliable Way to Compute Option-Based Risk-Neutral Distributions by Allan Malz which is summarized below [2].

In the absence of arbitrage, the European call option value can be related to the discounted expected terminal value under the risk-neutral distribution.

where

Differentiating the call option value with respect to the strike price gives what Malz refers to as the “exercise-price delta”.

This result contains the risk-neutral cumulative distribution function , the probability that the underlying price at expiration will be or lower.

To arrive at the risk-neutral probability density function, one more derivative with respect to the strike price is needed.

Now we will present an overview of deriving these distributions numerically using the infamous May 2020 WTI Crude Oil contract that went negative in April. Albeit these are American options, the analysis is interesting nonetheless.

To approximate the partial derivatives, we need option prices for a fine grid of strike prices. In fact, Breeden and Litzenberger simply assume that options are traded at every positive strike price. The first step is thus to fit a smooth function to the Black-Scholes implied volatility smile of the option prices. We interpolate implied volatilities rather than prices because the former tend to be smoother and better behaved. To ensure that the prices and implied volatilities are clean, open interest can be used to weight or exclude certain strikes. We constructed the smile using out-of-the-money calls and puts as these options tend to have more open interest than their in-the-money equivalents. Numpy’s polyfit is then used to interpolate and extrapolate the smile as needed.

The second step is to calculate the call prices using the Black-Scholes formula for constant rates with your fitted implied volatility function, underlying price, time to expiration, and observed risk-free rate.

The third step, following the mathematical derivation above, is to calculate using the first derivative of the call price with respect to the strike price. Numpy’s diff function will numerically difference your call price function. This effectively corresponds to using a forward difference to approximate the partial derivative.

To compare these distributions across strike prices, using moneyness, , as the x-axis is helpful. The fourth and final step, is to calculate using the second derivative of the call price with respect to the strike price and scaling it by .

It can be difficult to generate clean RNDs and through this process it became clear how cumbersome it would be to generate these over an extended time series. To evaluate the quality of the fit, we need to check a few conditions.

– The cumulative distribution needs to be bounded between 0 and 1
– The probability density function needs to integrate to 1 and remain positive
– The exercise-price delta needs to be monotone and bounded by and 0
– The strike weighted probability density function needs to integrate to the underlying futures price

There are additional arbitrage conditions to consider on the fitted implied volatility smile but our distributions meet the above conditions nicely so will be sufficient for our analysis. In fact, the absence of arbitrage is one of the few assumptions needed for the above mathematical derivation to hold. Further implicit assumptions include constant interest rates, that the call option price is twice differentiable and that a (smooth) probability density function of the price of the underlying asset exists to start with. Importantly, there are no further restrictions on the probabilistic nature of the underlying asset price process.

The above technique was repeated on the CLK2020 option chain for the four dates shown in the figure below to see how the option implied volatility RNDs reacted to the developing macro landscape including Covid-19, OPEC+, and crude oil physical storage capacity. The analysis was not extended into April as the possibility of negative prices violate some of the fundamental assumptions used.

The resulting risk-neutral densities below tell quite the colorful story. In January, the distribution has a slight negative skew but high kurtosis. In February and March, the tail risk begins to emerge as the kurtosis decreases. Finally in mid-March the distribution flattens, presenting a very expressive RND.

The corresponding fitted implied volatility smiles are shown below.

At the time, all eyes were on the calendar spreads and bear spreading the futures, short a nearby contract expiration month and long a deferred contact expiration month, was a popular trade. Based on the RND, this trade made sense if you believed that the bulk of the tail risk was in the nearby contract. Using the QuikStrike Commitment of Traders tool on the CME website we can look at the positioning of traders through this event [3]. The plots below show the positioning of the two groups that make up the “Non-Commercial” group of traders which includes hedge funds and large speculators.

The “Managed Money” began the year with a larger spreading position (likely bear?) which decreased into February as a more concentrated short position grew. In April and May, they capitalized on the series of events, quickly building a fervent long position to ride the rally back up.

The “Other Reportables” held a large spreading positioning from the end of February until May, likely capturing the historical calendar spread move nicely.

It is important to emphasize that risk-neutral distributions are different from the real-world distributions which govern the likelihood of events in financial markets. A RND is an artificial probability distribution which allows the computation of option prices without specifying the risk aversion of investors (this insight is the reason for Scholes’ and Merton’s Nobel Prize in 1997). Thus, they are a very convenient mathematical tool but not a description of the real world.

Intuitively, RNDs merge real-world probabilities with investor’s attitudes toward risk. They tend to inflate the probability of economically bad states (which are rare but investors are very fearful of them) and understate the probability of booms. Hence, RNDs tend to be more negatively skewed than real-world distributions are.

In fact, the above idea can be formulated more rigorously. Let’s use a discrete setting for simplicity. Suppose we want to price an asset which generates a random cash flow of tomorrow (e.g. payoff of a call option). Finance theory suggests that today’s price, , should equal

where the random variable takes the associated risk into account. Riskier payoffs should yield lower prices. is called the stochastic discount factor. In a discrete setting, the expectation is just the sum of outcomes weighted by the corresponding probabilities. Thus,

where . Hence, and making them valid pseudo-(risk-neutral)-probabilities. Furthermore, denoting the risk-free rate by , and because , we have , explaining the second equality. The final line is simply the discrete analogue to the equation at the beginning of this blog post: prices are discounted risk-neutral expectations of future payoffs.

This short derivation illustrates how merging the risk aversion (in form of ) and the real life probabilities () yield the risk-neutral probabilities (). This however distorts and dilutes the probabilities. We can recover the risk-neutral probabilities from option prices, but this does not tell us what the real probabilities are because we do not know the right stochastic discount factor.

Changes in the RND can be attributed to changes in investor’s preferences, real-world probabilities, or both. A priori, it is impossible to know which of the three cases is true. Disentangling real-world probabilities from risk aversion typically requires a general equilibrium model of the economy and remains an open question for current research, see Ross (2015) [4] for a recent attempt.

References

[1] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2642349

[2] https://www.newyorkfed.org/medialibrary/media/research/staff_reports/sr677.pdf

[3] https://www.cmegroup.com/tools-information/quikstrike/commitment-of-traders-energy.html

[4] https://onlinelibrary.wiley.com/doi/abs/10.1111/jofi.12092

Combinatorial Purged Cross-Validation Explained

cfsmith — Tue, 05 Nov 2019 12:59:31 +0000

In this tutorial I explain how to adapt the traditional k-fold CV to financial applications with purging, embargoing, and combinatorial backtest paths.

Building a Basic Cross-Sectional Momentum Strategy – Python Tutorial

cfsmith — Tue, 08 Oct 2019 16:36:57 +0000

In this tutorial we utilize the free Alpha Vantage API to pull price data and build a basic momentum strategy that is rebalanced weekly. This approach can be adapted for any feature you’d like to explore. Let me know what you’d like to see in the next video!

Python & Data Science Tutorial – Analyzing a Random Dataset

cfsmith — Mon, 02 Sep 2019 22:27:58 +0000

Using the Dynamic Mode Decomposition (DMD) to Rotate Long-Short Exposure Between Stock Market Sectors

cfsmith — Tue, 19 Mar 2019 12:28:58 +0000

Co-Author: Eric Kammers

Part 1 – Theoretical Background

The Dynamic Mode Decomposition (DMD) was originally developed for its application in fluid dynamics where it could decompose complex flows into simpler low-rank spatio-temporal features. The power of this method lies in the fact that it does not depend on any principle equations of the dynamic system it is analyzing and is thus equation-free [1]. Also, unlike other low-rank reconstruction algorithms like the Singular Value Decomposition (SVD), the DMD can be used to make short-term future state predictions.

The algorithm is implemented as follows [2].

1. We begin with a x matrix, , containing data collected from sources over evenly spaced time periods, from the system of interest.

2. From this matrix two sub-matrices are constructed, and , which are defined below.

We can consider a Koopman operator such that and rewrite as

whose columns now are elements in a Krylov space.

3. The SVD decomposition of is computed.

Then based on the variance captured by the singular values and the application of the algorithm, the number of desired reconstructions ranks can be chosen.

4. The matrix is constructed such that it is the best mapping between the two sub-matrices.

can be approximated with from evaluating the expression

where , , and are the truncated matrices from the SVD reduction of . The eigenvalue problem associated with is

where is the rank of approximation that was chosen previously. The eigenvalues contain information on the time dynamics of our system and the eigenvectors can be used to construct the DMD modes.

5. The approximated solution for all future times, , can now be written as

where , is the initial amplitude of each mode, is the matrix whose columns are eigenvectors , and is the vector of coefficients. Finally, all that needs to be computed is the initial coefficient values which can be found by looking at time zero and solving for via a pseudo-inverse in the equation

To summarize the algorithm, we will “train” a matrix on a subset of the data whose eigenvalues and eigenvectors contain necessary information to make future state predictions for a given time horizon.

Part 2 – Basic Demonstration

We begin with a basic example to demonstrate how to use the pyDMD package. First, we construct a matrix where each row is a snapshot in time and each column can be thought of as a different location in our system being sampled.

Now we will attempt to predict the predict the 6th row using a future state prediction from the DMD fitted on the first 5 rows.

import numpy as np
from pydmd import DMD
df = np.array([[-2,6,1,1,-1],
               [-1,5,1,2,-1],
               [0,4,2,1,-1],
               [1,3,2,2,-1],
               [2,2,3,1,-1],
               [3,1,3,2,-1]])

dmd = DMD(svd_rank = 2) # Specify desired truncation
train = df[:5,:]
dmd.fit(train.T) # Fit the model on the first 5 rows
dmd.dmd_time['tend'] *= (1+1/6) # Predict one additional time step
recon = dmd.reconstructed_data.real.T # Make prediction

print('Actual :',df[5,:])
print('Predicted :',recon[5,:])

Two SVD ranks were used for the reconstruction and the result is pleasantly accurate for how easily it was implemented.

Part 3 – Sector Rotation Strategy

We will now attempt to model the stock market as a dynamic system broken down by sectors and use the DMD to predict which sectors to be long and short in over time. This is commonly known as a sector rotation strategy. To ensure that we have adequate historical data we will use 9 sector ETFs: XLY, XLP, XLE, XLF, XLV, XLI, XLB, XLK, and XLU from 2000-2019 and rebalance monthly. The strategy is implemented as follows:

Fit a DMD model using the last N months of monthly returns. The SVD rank reconstruction number can be chosen as desired.
Use the DMD model to predict the next month’s snapshot which are the returns of each ETF.
Construct the portfolio by taking long positions in the top 5 ETFs and short positions in the bottom 4 ETFs. Thus, we are remaining very close to market neutral.
Continue this over time by refitting the model monthly and making a new prediction for the next month.

Though the results are quite sensitive to changes in the model parameters, some of the best parameters achieve Sharpe ratios superior to the long only portfolio while remaining roughly market neutral which is very encouraging and warrants further exploration with a proper, robust backtest procedure.

The code and functions used to produce this plot can found here. There are also many additional features of the pyDMD package that we did not explore that could potentially improve the results. If you have any questions, feel free to reach out by email at coltonsmith321@gmail.com

References

[1] N. Kutz, S. Brunton, B. Brunton, and J. Proctor, Dynamic Mode Decomposition: Data-Driven Modeling of Complex Systems. 2016.

[2] Mann, Jordan & Nathan Kutz, J. Dynamic Mode Decomposition for Financial Trading Strategies. Quantitative Finance. 16. 10.1080/14697688.2016.1170194. 2015.

Quantifying the Impact of the Number of Decks and Depth of Penetration While Counting Blackjack

cfsmith — Wed, 25 Apr 2018 12:23:00 +0000

Counting Blackjack has become an interest of mine over the past few months. After learning the basics of the Hi-Lo counting strategy I thought it would be beneficial to analyze how much time I should expect to have the advantage during my trips.

The Hi-Lo Count is the most used and discussed counting strategy for Blackjack because of its simplicity and effectiveness. Each card is given a value of either -1, 0, or +1. The low cards (2-6) are given values of +1. The neutral cards (7-9) are given values of 0. The high cards (10-Ace) are given values of -1. At any point in the shoe, the running count is the summation of card values dealt up until that point. The running count at the beginning of the shoe is zero and if all cards of the shoe were dealt, it would end at zero. To calculate the true count, the running count is divided by the number of decks remaining in the shoe. This standardizes the true count, so it is comparable at any point of the shoe. The true count represents the player vs the house edge and instructs the player to increase their bet when the advantage is in their favor. The player’s advantage increases as the true count increases because there are more high cards left in the shoe, which then gives the player better hands and causes the dealer to bust more frequently. When playing perfect basic strategy, the player begins to have the advantage when the true count (TC) becomes larger than +1. If you’re interested in the details of basic strategy, card counting, and expected value, I encourage you to check out the available resources online that cover these topics thoroughly.

This advantage from counting can vary based on the number of decks being used and the depth of penetration (how far into the shoe the dealer places the shuffle card). It is widely preached that the player has a larger positive expected value when fewer decks are used with deeper penetration. Therefore, many casinos deal 6+ decks with penetration as low as 60% to make it more difficult for players to get an advantage.

To quantify the impact of penetration and the number of decks on the advantage from counting cards, I ran rudimentary simulations that dealt through shoes, one card at a time, while keeping track of the true count at each point. To start, I simply plotted the different paths of true counts for different shoes. In Figure 1 below, the paths of 1000 simulations with a shoe of 5 decks and poor penetration are shown. The true count never passes +/-10.

Figure 1: True Count Paths w/ Low Penetration

Since the true count depends on the number of decks remaining in the shoe, having deeper penetration can result in more extreme true counts. In Figure 2 below, the same amount of decks are used but now the shoe is dealt entirely through.

Figure 2: True Count Paths w/ Full Penetration

In all simulations the true count must end at zero since the running count of an entire shoe is zero. These situations when the true count is high and reverts back to zero are where the advantage from counting pays dividends. It is also clear from the simulations that the variance of the true count increases as the penetration becomes deeper which gives rise to more of these profitable periods. These periods also will return to neutral quicker though. A +5 TC could return to zero in one hand (5 cards) if there is only one deck remaining in the shoe. A +5 TC with 3 decks left would take 15 face cards to return to zero.

Next, to visualize how these simulations compare across different combinations of decks and penetration, I needed to choose what variables mattered. The average number of points per simulation where the TC > 2 quantifies the amount of points with a player advantage. Since this is dependent on the shoe size, these values were divided by the shoe size to give the percent of shoe with the advantage. Additionally, the variance of the true count was monitored because as we saw previously, a larger variance may amplify the advantage. The results are presented in Figure 3 below.

Figure 3: Heatmaps of Outcomes Based on Different Combinations of the Number of Decks and Depth of Penetration

These findings are consistent with previous analysis of Blackjack. The player can expect to spend the most time with the advantage in games using fewer decks and deeper penetration. The value of this advantage time is also improved by a larger variance. The variance is more dependent on the depth of penetration than the number of decks, but both affect its value. If we look at a standard game available at many casinos of 6 decks with 65% penetration, the player will have the TC > 2 for 13% of the shoe on average with a TC standard deviation of 1.3. If we were to find a 2-deck game with 90% penetration, the player will have the TC > 2 for 28% of the shoe on average with a TC standard deviation of 4.1. If these games have the same rules, the player will gain more of an edge with fewer decks and deeper penetration.

The code for generating this analysis can be found on Github here. An interesting topic to explore next would if the path that the TC takes when reverting back to zero affects expected value and if so, how?

Acknowledgements: Thank you to James Sweetman for helping me better understand the Blackjack concepts.

Constructing Continuous Futures Price Series

cfsmith — Mon, 05 Feb 2018 13:22:03 +0000

Welcome! If you enjoy these posts, please follow this blog via email and check out my Twitter feed located on the sidebar.

All of my previous analysis has focused on US equities, but today we begin the journey into another asset class, futures. Futures are traded via contracts where two parties agree to exchange a quantity of an asset for a price decided today and delivered at a specified date in the future. The expiration dates of the contracts vary based on the underlying asset and range from monthly to quarterly. To properly evaluate the profitability of trading strategies with historical futures contract data, it is necessary to combine these contracts into a continuous price series. This isn’t entirely straightforward because contango and backwardation factors cause contracts of the same underlying asset with different expiration dates to be priced differently. It is initially unclear how to best concatenate these price series, so I want to explore a few of the basic methods and their advantages. I’m interested in exploring futures strategies, so this was a necessary first step since Quandl’s free continuous futures data is of insufficient quality, but they provide high quality individual contract data. Becoming comfortable with the contract data while creating flexible, testable continuous price series is a valuable exercise. Additionally, I decided to use Python because I have not done a project with it and this is a useful applied problem to build some Python skills.

For this example, we will construct a variety of continuous price series for the commodity wheat. The first step is to pull the contract data from the Quandl API and store it appropriately (see the included code). To begin, let’s plot all the contracts’ prices to observe the behavior of the price data. As seen in Figure 1 below, although there is some consistency between the contracts, there is a significant amount of variance.

Figure 1: Futures Prices of All Wheat Contracts

Ideally, to make this a backtest-ready series, we need to be trading a single contract at each point in time (or possibly a combination of contracts). The further we are from a contract’s expiration; the more price speculation is embedded into the price. The front or nearest month contract refers to the contract which has the soonest expiration date and thus has the least amount of speculation. Generally, front month contracts have the most trading activity, as measured by open interest. When expiration approaches, traders will roll their positions over to the next contract or let them expire. A basic approach to construct a continuous series would be to always use the front month contract’s price and when the current front month contract expires, switch to the new front month contract. There is one caveat, the price of the contracts when you rollover may not be the same, and in general, won’t be the same. These gaps will create artificial, untradeable price movements in the continuous series. To create a smooth transition between contracts, we can adjust them in such a way so that there won’t be a gap. We’ll refer to the size of this gap as the adjustment factor. Forward adjusting would shift the next contract to eliminate the gap by subtracting the adjustment factor from the next contract’s price series. Backward adjusting would shift the previous contract to eliminate the gap by adding the adjustment factor to the previous contract’s price series. Figure 2 below shows an example of these adjustments for an actual rollover.

Figure 2: Backward/Forward Adjusting Example

Now, when this approach is extended over multiple contracts the adjustment factors will simply cumulate so that prices for every contract are appropriately adjusted. The quality of the data is the same whether you backward or forward adjust. The difference is what needs to be recalculated with each new contract and what the values represent. The backward adjusted series’ current values represent the actual market values thus the historical data needs to be recalculated when a new contract is added to the series. The forward adjusted series does not require recalculating historical data but since each new contract that is added to the series needs to be adjusted, the new prices will not represent the actual market values. Figure 3 below shows the fully adjusted wheat series. Notice that the difference between the forward and backward adjusted series remains constant. This difference is the total adjustment factor.

Figure 3: Expiration Adjusted Wheat Series

A point that becomes apparent, here, is that we are adjusting the price series, not the returns. The daily returns of the forward and backward adjusted series differ. When creating continuous prices, you are forced to choose between either correct P&L or correct returns. To adjust for correct returns, one would need to work with the daily log returns series of the contracts and then construct a usable price series from those. Dr. Ernest Chan’s second book covers this concept thoroughly on pg. 12-16.

Another approach to construct a continuous series is the perpetual method, which smooths the transitions between contracts by taking a weighted average of the contracts’ prices during the transition period. This can be weighted on time left to expiration, open interest, or other properties of the contracts. For this example, we will begin the transition to the next contract once its open interest becomes greater than the current contract and weight the prices during the transition based on open interest. As seen in Figure 4 below, this happens prior to the expiration of the contracts.

Figure 4: 2014 Wheat Contracts’ Open Interest

Like the previous example, one could also forward/backward adjust using the open interest crossover date which is more realistic because of better liquidity. This option is available in the attached code. In our case, after this crossover date, we transition to the next contract over the next 5 days (the number of days is adjustable) based on open interest. Figure 5 below shows the slightly smoother perpetual adjusted series.

Figure 5: Perpetual Adjusted Wheat Series

This smoothed price series may be advantageous for statistical research since it reduces noise in longer term signals but it contains prices that are not directly tradable. To trade the price during the transition period, one would have to rebalance their percentage of the current and next contract each day, which would incur transaction costs.

There are a variety of other adjustment methods, but the examples shown here provide a strong and sufficient foundation. A paper that I found very helpful and one that covers additional methods is available here. The Python code accompanying this post can be found here. I hope you found these examples helpful. In my next post, I am going to use these continuous series as I analyze futures trading strategies. Thanks for reading!

Cointegration, Correlation, and Log Returns

cfsmith — Mon, 06 Nov 2017 12:44:43 +0000

Co-Author: Eric Kammers

I recently created a Twitter account for the blog where I will curate and comment on content I find interesting related to finance, data science, and data visualization. Please follow me at @Quantoisseur (see the embedded stream on the sidebar). Enjoy the post!

The differences between correlation and cointegration can often be confusing. While there are some helpful explanations online, I wasn’t satisfied with the visual examples. When looking at a plot of an actual pair of symbols where the correlation and cointegration test results differ, it can be difficult to pinpoint which portions of the time series are responsible for these separate properties. To solve this, I decided to produce some basic examples with sinusoidal functions so I could solidify my understanding of these concepts.

First, let’s highlight the difference between cointegration and correlation. Correlation is more familiar to most of us, especially outside of the financial industry. Correlation is a measure of how well two variables move in tandem together over time. Two common correlation measures are Pearson’s product-moment coefficient and Spearman’s ranks-order coefficient. Both coefficients range from -1, perfect negative correlation, to 0, no correlation, to 1, perfect positive correlation. Positive correlation means that the variables move in tandem in the same direction while negative correlation means that they move in tandem but in opposite directions. When calculating correlation, we look at returns rather than price because returns are normalized across differently priced assets. The main difference between the two correlation coefficients is that the Spearman coefficient measures the monotonic relationship between two variables, while the Pearson coefficient measures their linear relationship. Figure 1 below shows how the different coefficients behave when two variables exhibit either a linear or nonlinear relationship. Notice how the Spearman coefficient remains 1 for both scenarios since the relationship in both cases is perfectly monotonic.

Figure 1: Pearson vs Spearman for Nonlinear and Linear Functions

Based on the distributions of the data, these coefficients can behave differently which I will explore with additional examples later in this post. Here are some resources for further clarification on the Pearson and Spearman coefficients.

Now, cointegration tests do not measure how well two variables move together, but rather whether the difference between their means remains constant. Often, variables with high correlation will also be cointegrated, and vice versa, but this isn’t always the case. In contrast to correlation, when testing for cointegration we use prices rather than returns since we’re more interested in the trend between the variables’ means over time than in the individual price movements. There are multiple cointegration tests, but in this case, I’ll be using the Augmented Dicky-Fuller test to evaluate the stationarity of the residuals from the linear model created with the pair’s price series.

Second, using log returns for financial calculations is, in many cases, preferable to using simple returns. There are many resources online explaining the advantages and disadvantages of using log returns. We will not dive into this topic too much, but some of the advantages are due to assuming a log normal distribution which makes them easier to work with and gives them convenient properties like time-additivity. Figure 2 below shows the relationship between log and simple returns.

Figure 2: Relationship between Simple and Log Returns

Furthermore, correlation is a second moment calculation meaning that it is only appropriate if higher moments are insignificant. Using log returns is better so we can ensure the higher moments are negligible and avoid having to use copulas.

Now with this framework, we can introduce some visual examples. Figure 3 below will be our baseline example which we will adjust in a variety of ways to examine how the values in the table react. In this figure, the red and green series are identical but are oscillating around different mean prices. The difference between the means of the variables is static over time which is why ADF test confirms their cointegration. The price, simple returns, and log returns correlations are all 1, perfectly positively correlated.

Figure 3: Baseline Example, Perfect Cointegration and Correlation

By phase shifting the green price series as seen in Figure 4 below, all the correlation coefficients now indicate a lack of correlation between the series. As expected, the pair remains cointegrated.

Figure 4: Perfect Cointegration but No Correlation

I now put the pair back in sync and the red series is adjusted as seen in Figure 5. The pair isn’t cointegrated anymore since the difference between their means fluctuates over time. The returns correlation coefficients agree that the series are strongly correlated while the price only supports a weak correlation.

Figure 5: No Cointegration but Strong Returns Correlation

In the above example, the Pearson and Spearman coefficients begin to diverge but now we’ll look at an example where they differ significantly. Since the Spearman coefficient is based on the rank-order of the variables and not the actual distance between them, it is known to be more resilient to large deviations and outliers. We can test this by adding an anomaly, possibly a data outage, to the top series by randomly choosing a period of 25 data points to set equal to 1. The effect can be observed in the table accompanying Figure 6 below. The Spearman coefficient supports strong positive correlation while the Pearson coefficient claims there is little to no correlation.

Figure 6: Outliers Effect on Pearson and Spearman Coefficient Calculations

The final example we will look at it is a situation where the returns are not strongly correlated but the prices are. Instinctively, I think I would side with the returns correlation results in Figure 7.

Figure 7: High Price Correlation but Low Returns Correlation

One aspect of these correlation tests we have been overlooking, is the distributions of the variables. In these sinusoidal examples, neither simple nor log returns are normally distributed. It is often advertised that the Pearson correlation coefficient requires the data to be normally distributed. One counter argument is the distribution only needs to be symmetric, not necessarily normal. The Spearman coefficient is a nonparametric statistic and thus does not require a normal distribution. In many of the previous examples, the two coefficients are functionally the same despite the odd distribution of the log returns. In Figure 8 below, we take our basic series and add random noise to one of them which creates a more normal distribution. The normality of these log returns are tested with the Shapiro-Wilk normality test. As seen in the right histogram, our basic sinusoidal wave’s log returns reject the null hypothesis that they are normally distributed. In the left histogram, the noisy wave’s log returns fail to reject the null hypothesis.

Figure 8: Distributions and Price Series when Noise is Added to One Series

Despite changing one variable’s distribution, the Pearson and Spearman coefficients remain about the same. Additionally, as seen in Figure 9 below, normalizing both variable’s distributions does not cause the coefficients to differ.

Figure 9: Noise Added to Both Price Series

These distribution examples do not fully support a side of the debate but I’m not convinced that the Pearson coefficient strictly requires normality.

Playing around with these examples was very helpful for my understanding of cointegration, correlation, and log returns. It is now very clear to me why returns, particularly log returns, are used when calculating correlation and why price is used to test for cointegration. The choice between using the Pearson or Spearman correlation coefficient is slightly more difficult but it can’t hurt to look at both and see how it impacts your data decisions!

The code to generate all the figures in this post can be found here.

Eric Kammers is a recent graduate of the University of Washington (2017) where he studied Industrial & Systems Engineering. He is actively seeking opportunities that will add value to his current skill-set. He is a strong-willed, self-driven individual who has the urge for life-time learning. He loves mathematics and statistics, especially applying their methods to practical problems in data science and engineering. LinkedIn: https://www.linkedin.com/in/ekammers/

Redesigning Data Visualization Atrocities #1

cfsmith — Mon, 18 Sep 2017 00:49:29 +0000

Hello all, for a while I’ve been wanting to diversify the content on my blog to include general data science and visualization samples. I finally got a burst of motivation this weekend when I was casually studying for the GRE and came across a figure in a practice test that, in my opinion, is an example of a failed visualization. Figure 1 below shows the original figure which was followed by a series of basic data interpretation questions.

Figure 1: Original Figure

Now before we dive into this, I’m sure the GRE purposefully uses unhelpful visualizations to ensure test takers understand the components of a graph (legend, axes, etc.) well enough to extract the necessary information via brute force inspection. Either way, we’re going to make this figure more accessible and visualizing assisting. There are 3 serious issues that I have with it.

The date ranges (top series) and individual months (bottom series) are seemingly unrelated but are plotted on the same graph using a broken y-axis with a discontinuous scale. The two parts of the graph are pretty well distinguished but to avoid associating their patterns/trends with each other, it would be recommended to separate these plots.
The only way to understand the time period for each line is by looking at the markers and legend. Besides their filling, these markers have no connection to their corresponding time periods.
The x-axis contains the different charitable causes where their order is irrelevant. By using a line plot with markers, it gives you the impression that order is important and that the upward trend, for example, could matter.

To solve the first problem, the plots are separated and placed side-by-side in Figure 2 below. Though not a huge improvement, the y-axes are now appropriate and the date ranges are clearly separated from the individual months.

Figure 2: Enhanced Figure – Y-Axes Fixed

Now to fix the second problem, we drop the meaningless markers and adopt a grayscale where more recent dates are darker than earlier dates. It still requires you to refer to the legend but once you understand the direction of the grayscale, the figure becomes much more digestible as seen below.

Figure 3: Enhanced Figure – Markers Fixed

Based on the dates in the figure, I assume adding color was not an option at the time of its creation but we’re going to go ahead and spice it up with some color now. Finally, to fix the third problem, the x-axis is now time, increasing left to right, and the individual lines represent the different charitable causes. The figure is immediately more intuitive as now trends along the x-axis have meaning. As seen in Figure 4 below, in the date range plot, your attention is immediately drawn to the increase in disaster relief between the 2^nd and 3^rd date range. In the monthly plot, the uniqueness of the disaster relief and child safety lines compared to the others is quickly realized.

Figure 4: Enhanced Figure – Independent Variable

Now that we have fixed my issues with the original figure, we have a clean visualization. There are still some improvements we can make though. Having to track the individual lines and values can be visually straining. It is much easier on our eyes to compare 2-D areas so by switching to a stacked bar chart, Figure 5 becomes even more accessible. The significant changes in private donations to disaster relief are still very prominent in the new figure. Unfortunately, by switching chart types, we lose information on the exact donation values for each cause in exchange for gaining information on the total amount of private donations.

Figure 5: Enhanced Figure – Stacked Bar Chart

Depending on the purpose of the figure, the trade-off between individual donation values and the total donation values may or may not be worth it. A somewhat happy medium when working with a stacked bar chart, with this many categories, is to use a radar chart instead. As seen in Figure 6 below, the individual donation values are available and by considering the area of the series, the total donation values can be extrapolated. Additionally, a monochromatic scale can be used to highlight time component of the series.

Figure 6: Enhanced Figure – Radar Chart

I hope this example highlighted the importance of choosing an appropriate visualization based on your audience and the aspects of the data that you want emphasized. Especially when working with a small data set like this, the details matter. Thanks for reading!