Historical correlations

This tutorial on correlation was inspired by a joke between us about the relationship between changes in Bitcoin prices and the S&P 500 American stock index.

Preliminary instructions:

  1. Install Python. Pyenv is highly recommended for managing Python versions.
  2. It's best practice to create a Python virtual environment, but the simplest way to install the dependencies is to run pip install numpy pandas plotly (a quick sketch follows this list).
  3. Download all historical data from Yahoo Finance for SPY and XLF, the S&P 500 ETF and the S&P Finance ETF (which tracks a subset of the whole S&P 500 index). Yahoo will let you download a CSV file containing data since the 1990s.
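If you do want a virtual environment, a minimal sketch (the environment name corr-env here is arbitrary): run python -m venv corr-env, activate it (source corr-env/bin/activate on Unix-like shells, corr-env\Scripts\activate on Windows), then pip install numpy pandas plotly inside it.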

Questions or comments? Contact Ahmed Fasih or open an issue on GitHub.

Let's use Pandas, a big but very well-designed and well-documented library, to load the two spreadsheets, merge them, and store their one-day percent changes in a dataframe called df under df.spy1 and df.xlf1.

In [1]:
import pandas as pd

# load each ETF's price history, parsing the Date column as datetimes
spy = pd.read_csv('SPY.csv', parse_dates=['Date'])
xlf = pd.read_csv('XLF.csv', parse_dates=['Date'])

# keep just the date and the closing price, renamed per ETF
xlf['xlf'] = xlf['Close']
spy['spy'] = spy['Close']

xlf = xlf.loc[:, ['Date', 'xlf']]
spy = spy.loc[:, ['Date', 'spy']]

xlf = xlf.set_index('Date')
spy = spy.set_index('Date')

# align the two on date, drop days lacking data for either ETF,
# and compute one-day percent changes
df = pd.concat([spy, xlf], axis=1).dropna()
df['spy1'] = df.spy.pct_change()*100
df['xlf1'] = df.xlf.pct_change()*100

Let's look at the two time series (i.e., vectors of data). I like Plotly because it outputs interactive graphs for the browser, unlike dead pixels output by Matplotlib.

In [2]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(go.Scatter(x=df.index, y=df.spy, name='SPY', line=dict(width=4)))
fig.add_trace(go.Scatter(x=df.index, y=df.xlf, name='XLF'), secondary_y=True)
fig.update_layout(title_text='SPY (left) and XLF (right), {} to {} (data: Yahoo Finance)'.format(str(df.index[0])[:10], str(df.index[-1])[:10]))
fig.update_xaxes(title_text="Date")
fig.update_yaxes(title_text="SPY", secondary_y=False)
fig.update_yaxes(title_text="XLF", secondary_y=True)
fig.update_layout(hovermode="x")
fig.show()

Let's look at a scatter plot of the one-day percent returns of SPY and XLF (for the dates where we have prices for both).

In [3]:
fig = go.Figure(data=[go.Scatter(
    x=df.spy1,
    y=df.xlf1,
    mode='markers',
    marker=dict(
        color=list(map(lambda x: x.year + x.month/12, df.index.to_pydatetime())),
        colorscale='Viridis',
        showscale=True
    ),
    hovertemplate='SPY: %{x:.2f}%, XLF: %{y:.2f}%<br>(%{text})',
    text=df.index,
)])
fig.update_layout(
    title='{} to {} (data: Yahoo Finance)'.format(str(df.index[0])[:10], str(df.index[-1])[:10]),
    xaxis_title="SPY 1 day % change",
    yaxis_title="XLF 1 day % change",
)
fig.show()

Correlation quantifies how close the relationship between two series of numbers is to being perfectly linear. A scatter plot of two perfectly correlated variables would show a straight line (the slope doesn't matter, so long as it's nonzero).

These two sets of numbers, SPY's percent change on the x axis and XLF's on the y axis, are highly correlated, though not perfectly so. Wikipedia gives a handy equation for the correlation coefficient, specifically the Pearson coefficient:
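Writing $m(\cdot)$ for the sample mean (as the code below does), the Pearson coefficient of two series $x$ and $y$ is

$$\rho_{x,y} = \frac{m(xy) - m(x)\,m(y)}{\sqrt{\left(m(x^2) - m(x)^2\right)\left(m(y^2) - m(y)^2\right)}},$$

i.e., the covariance normalized by the product of the two standard deviations. We can implement it directly: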

In [4]:
import numpy as np
def mycorrcoef(x, y):
    # Pearson correlation per the equation above: covariance normalized by
    # the product of standard deviations, via mean-of-products identities
    m = np.mean
    value = m(x * y) - m(x) * m(y)
    normalize = np.sqrt( (m(x*x) - m(x)**2) * (m(y*y) - m(y)**2) )
    return value / normalize

print(mycorrcoef(df.dropna().spy1, df.dropna().xlf1))
0.838648257148808

Perfect correlation would yield 1.0 or -1.0 (depending on whether the slope was positive or negative); we're pretty close to that at 0.84. Recall that XLF is a subset of SPY: the financial firms, which SPY holds alongside roughly ten other sectors.
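As a quick sanity check of those ±1.0 endpoints, here's mycorrcoef on some made-up, perfectly linear data (the values are arbitrary, just for illustration):

x = np.arange(100.0)
print(mycorrcoef(x, 3 * x + 2))     # positive slope: prints ~1.0
print(mycorrcoef(x, -0.5 * x + 7))  # negative slope: prints ~-1.0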

Here are a couple of other ways to compute that same correlation, using Numpy and Pandas:

In [5]:
# using Numpy
print(np.corrcoef(df.dropna().spy1, df.dropna().xlf1)[0, 1])

# using pandas
print(df.corr().loc['spy1','xlf1'])
0.8386482571488082
0.8386482571488041

(Those two printed values disagree only in their last few digits: floating-point roundoff from different orders of operations.) Thing is, this collapses twenty-two years of history into a single number. I wonder: what's the correlation between daily returns of SPY vs XLF in the first week of this dataset? The first year? The first five years?

Pandas can answer this question with one of its magically helpful functions, expanding: it constructs an expanding window over the dataframe and evaluates the correlation at each step of the expansion, i.e., over rows 0 through n, for each n.
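Before pointing expanding at the real data, here's a toy sketch of its behavior on made-up numbers:

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
t = pd.Series([1.0, 2.0, 2.0, 5.0])
# At each position n, correlate s[0..n] with t[0..n]; positions whose
# window holds fewer than 2 observations yield NaN.
print(s.expanding(2).corr(t))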

In [6]:
fig = go.Figure(data=go.Scatter(x=df.index, y=df.spy1.expanding(5).corr(df.xlf1), mode='lines'),
                layout=dict(
                    title='SPY vs XLF daily % change, correlation in data window starting from {}'.format(str(df.index[0])[:10]),
                    xaxis_title="Data window end date",
                    yaxis_title="correlation",
            ))
fig.show()

To confirm your intuition: the rightmost value in this chart represents the correlation over the entire dataset, and corresponds to that 0.84 we computed above.

Similarly, the minimum of this graph, 0.68 in December 2000, is the correlation between the two series from 1998 (the start of the data) through December 2000.

In other words, this graph shows what correlations your mycorrcoef function would have printed over history.
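You can verify that rightmost endpoint directly; both of these should print the same ~0.8386 (the variable name exp_corr is just for this sketch):

exp_corr = df.spy1.expanding(5).corr(df.xlf1)
print(exp_corr.iloc[-1])                               # rightmost point of the chart above
print(mycorrcoef(df.dropna().spy1, df.dropna().xlf1))  # global correlation from before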

But we can see that the correlation between these two series has changed over time. It was roughly 0.8 (on the y axis) before the 2000 crash that accompanied the popping of the Tech Bubble, and for several years thereafter XLF's daily returns were less correlated with SPY's.

I'm also curious: what are the correlations over the last week, the last month, the last year of the data? That is, over time windows of varying lengths, all ending on the data's final date?

In [7]:
fig = go.Figure(data=go.Scatter(x=df.index[::-1], y=df.spy1[::-1].expanding(5).corr(df.xlf1[::-1]), mode='lines'),
                layout=dict(
                    title='SPY vs XLF daily % change, correlation in data window ending {}'.format(str(df.index[-1])[:10]),
                    xaxis_title="Data window <i>start</i> date",
                    yaxis_title="correlation",
                ))
fig.show()

As a sanity check, you can confirm that the leftmost point in the graph here corresponds to the global correlation we calculated above, 0.84.

Both the above graphs show the correlation over expanding windows of time: from 1998 onwards, or from 2020 backwards. Suppose I wanted to make a service that printed the trailing correlation over a fixed window: what has the correlation between SPY and XLF daily returns been over the last month, the last six months, the last year?

Pandas has more magic, called rolling, to answer this question.

In [8]:
fig = go.Figure(data=[go.Scatter(
    x=df.index,
    y=df.spy1.rolling(win).corr(df.xlf1),
    mode='lines',
    name='{} sessions'.format(win),
    line=dict(width=width),
    opacity=0.75
) for win, width in zip([240, 120, 20], [4, 2, 1])], 
                layout=dict(
                    title='Trailing correlations for several time horizons: SPY vs XLF daily percent returns',
                    xaxis_title='Window end date',
                    yaxis_title='correlation',
                    hovermode='x'
                ))
fig.show()

To confirm your understanding: if you published a newsletter that included the most recent month's correlation between SPY's and XLF's daily returns, the thin "20 sessions" curve above gives the number you would have published on any given date.
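For instance, the number for the newsletter dated on this dataset's final day would be the last element of that rolling series:

# trailing 20-session (~one month) correlation as of the last date in the data
print(df.spy1.rolling(20).corr(df.xlf1).iloc[-1])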

This is most interesting. Even though the correlation of these two time series over all time was 0.84, we see that over shorter windows the correlation moves around a lot.

It almost seems a periodic cycle: the trailing monthly correlation hits near 1.0, meaning SPY's and XLF's daily returns nearly moved in lock-step (though the correlation removes the absolute slope between the two). And with apparently just as much regularity, the one-month correlation drops substantially.

We see that the Tech Bubble saw significant decorrelation between financials and the broader S&P 500, possibly because SPY was heavily tech-weighted and those stocks saw the steepest declines. In contrast, we can't really detect the 2008 Great Financial Crisis in these three curves: the declines in the broader index seem to have been highly correlated with those in financials, since by then SPY was heavily weighted towards financials.

So we can go the full distance: we can make a chart that shows the correlation between these two time series between any start and end dates.

In [9]:
# 500 window lengths, from one week to (nearly) the whole dataset
wins = np.round(np.linspace(5, len(df) - 1, 500)).astype(int)
p = pd.DataFrame([df.spy1.rolling(win).corr(df.xlf1) for win in wins],
                 index=wins)
# Going down `p`'s rows changes the length of the correlation window.
# Going across `p`'s columns changes the end date of the window.
# Now change the rows from *window size* to *start date*: for a fixed end
# date (column), a longer window just means an earlier start date, and the
# finite values occupy a prefix of each column (short windows become valid
# first), so flipping that prefix reorders the rows by start date.
for i in range(p.values.shape[1]):
    n = np.sum(np.isfinite(p.values[:, i]))
    p.values[:n, i] = np.flip(p.values[:n, i])
# relabel the row index from window size to (approximate) start date
p.index = df.index[wins]

Above we make a new Pandas dataframe to contain the correlations between any start and end date. Let's plot it below with Plotly as a heatmap.

In [10]:
fig = go.Figure(data=go.Heatmap(
                    z=p,
                    x=p.columns,
                    y=p.index,
                    colorscale='Viridis',
                    zmin=0.5,
                    zmax=.95,
                    hovertemplate='%{y}—%{x} corr: %{z:.2f}',
                ),
                layout=dict(
                    title='Correlation over all time windows: SPY vs XLF daily percent returns',
                    xaxis_title='Window end date',
                    yaxis_title='Window start date',
                ))
fig.show()

As you can verify by hovering over the heatmap, the 0.84 overall correlation corresponds to the lower-right pixel, i.e., the correlation between 1998 and recently. If you published a heatmap like this in a newsletter, the first edition would contain just the part of the triangle with window end dates (the x axis) up to that day's date; over the years you'd publish successively larger triangles as the x axis extended to the right.
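You can also pull that pixel straight out of p: after the flip above, the first row holds the earliest start date and the last column the latest end date, so this should print roughly 0.84:

# full-history correlation: earliest start (first row), latest end (last column)
print(p.iloc[0, -1])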

This lets us see in full detail what that 0.84 overall correlation between SPY and XLF daily returns really means. There are several epochs into which we can break down the last twenty-two years:

  1. The dark triangle in the lower-left corresponds to the period of low-ish correlation between SPY and XLF daily returns during the popping of the Tech Bubble.
  2. From roughly mid-2000 to mid-2008, SPY and XLF daily returns were very highly correlated: all time horizons that started and ended within this band saw correlations of roughly 0.9. This is the band of yellow in the lower-left (to the right of the darker wedge). The transition from the Tech Bubble to this period is quite obvious in this graph.
  3. However, another major transition is quite visible starting around September 2008. Recall that September to October 2008 saw a trapdoor open under SPY, which lost roughly 30% in a month (after already being around 20% off its October 2007 peak; sorry, my Plotly skills aren't advanced enough to give you an interactive percent-change version of the first plot above to verify this right now). This means every time horizon that began before the September 2008 inflection point and ended after it falls in that big rectangle of smooth green, with correlation roughly 0.85, while time horizons falling entirely on either side of the inflection point have more opportunity to vary from 0.85. To understand this, scroll back up to the scatter plot of daily percent changes and note how all the dots that ring the central line, like a halo, come from this mid-2008 period. Those deviations from linearity form an envelope inside which all other points lie, so the correlation over any time horizon that includes the halo is forced to be close to 0.85.
  4. Other bands, both vertical and horizontal, are visible in the data after the Great Financial Crisis.

Hopefully this tutorial gives you the basics of using correlation to break down a pair of time series. The statistics and signal processing of such data are very rich, so there are many more directions you can take this analysis.

Potentially interesting reading: