This tutorial on correlation was inspired by us joking about the relationship between changes in Bitcoin prices and the S&P 500 American stock index.
Preliminary instructions:
pip install numpy pandas plotly
Questions? Contact Ahmed Fasih or open an issue on GitHub.
Let's use Pandas, a big but very well-designed and well-documented library, to load the two spreadsheets, merge them, and save their one-day percent changes in a dataframe called df, under df.spy1 and df.xlf1.
import pandas as pd

# Load the two Yahoo Finance CSVs, parsing the Date column as real datetimes
spy = pd.read_csv('SPY.csv', parse_dates=['Date'])
xlf = pd.read_csv('XLF.csv', parse_dates=['Date'])

# Keep only the closing price of each, renamed to 'spy' and 'xlf'
xlf['xlf'] = xlf['Close']
spy['spy'] = spy['Close']
xlf = xlf.loc[:, ['Date', 'xlf']]
spy = spy.loc[:, ['Date', 'spy']]

# Index both by date, align them side by side, and drop dates missing either price
xlf = xlf.set_index('Date')
spy = spy.set_index('Date')
df = pd.concat([spy, xlf], axis=1).dropna()

# One-day percent changes
df['spy1'] = df.spy.pct_change() * 100
df['xlf1'] = df.xlf.pct_change() * 100
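If you want to eyeball what we just built (an optional sanity check, not part of the original recipe): df should now have four columns, and its very first row's spy1 and xlf1 will be NaN, since pct_change has no previous day to compare against.
print(df.columns.tolist())  # ['spy', 'xlf', 'spy1', 'xlf1']
print(df.head())            # the first row's spy1/xlf1 are NaN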
Let's look at the two time series (i.e., vectors of data). I like Plotly because it outputs interactive graphs for the browser, unlike dead pixels output by Matplotlib.
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Two y axes: SPY prices on the left axis, XLF prices on the right
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(go.Scatter(x=df.index, y=df.spy, name='SPY', line=dict(width=4)))
fig.add_trace(go.Scatter(x=df.index, y=df.xlf, name='XLF'), secondary_y=True)
fig.update_layout(title_text='SPY (left) and XLF (right), {} to {} (data: Yahoo Finance)'.format(str(df.index[0])[:10], str(df.index[-1])[:10]))
fig.update_xaxes(title_text="Date")
fig.update_yaxes(title_text="SPY", secondary_y=False)
fig.update_yaxes(title_text="XLF", secondary_y=True)
fig.update_layout(hovermode="x")
fig.show()
Let's look at a scatter plot of the one-day percent returns of SPY and XLF (for the dates where we have prices for both).
fig = go.Figure(data=[go.Scatter(
    x=df.spy1,
    y=df.xlf1,
    mode='markers',
    # color each point by its date (as a fractional year) so the passage of time is visible
    marker=dict(
        color=list(map(lambda x: x.year + x.month / 12, df.index.to_pydatetime())),
        colorscale='Viridis',
        showscale=True
    ),
    hovertemplate='SPY: %{x:.2f}%, XLF: %{y:.2f}%<br>(%{text})',
    text=df.index,
)])
fig.update_layout(
    title='{} to {} (data: Yahoo Finance)'.format(str(df.index[0])[:10], str(df.index[-1])[:10]),
    xaxis_title="SPY 1 day % change",
    yaxis_title="XLF 1 day % change",
)
fig.show()
Correlation quantifies how close two series of numbers come to a perfectly linear relationship. A scatter plot of two perfectly correlated variables would show a straight line (the slope doesn't matter).
These two sets of numbers, SPY's percent change on the x axis and XLF's on the y axis, are highly correlated, though not perfectly so. Wikipedia gives a handy equation for the correlation coefficient, specifically the Pearson coefficient:
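In terms of expectations, the Pearson coefficient of two series x and y is (this is what mycorrcoef below computes, with each expectation E[·] replaced by a sample mean):

$$\rho_{x,y} = \frac{\mathrm{E}[xy] - \mathrm{E}[x]\,\mathrm{E}[y]}{\sqrt{\mathrm{E}[x^2] - \mathrm{E}[x]^2}\,\sqrt{\mathrm{E}[y^2] - \mathrm{E}[y]^2}}.$$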
import numpy as np

def mycorrcoef(x, y):
    m = np.mean
    # numerator: E[xy] - E[x] E[y], i.e., the covariance
    value = m(x * y) - m(x) * m(y)
    # denominator: product of the two standard deviations
    normalize = np.sqrt((m(x * x) - m(x)**2) * (m(y * y) - m(y)**2))
    return value / normalize
print(mycorrcoef(df.dropna().spy1, df.dropna().xlf1))
Perfect correlation would yield 1.0 or -1.0 (depending on whether the slope was positive or negative). We're pretty close to that: 0.84. Recall that XLF is a subset of SPY: the financial firms, which SPY holds along with ten other industries.
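A quick sanity check on mycorrcoef with synthetic data (nothing to do with SPY or XLF): any two exactly linear series should give ±1, whatever the slope.
x = np.arange(10.0)
print(mycorrcoef(x, 3 * x + 2))   # exactly linear, positive slope: 1.0 (up to floating point)
print(mycorrcoef(x, -0.5 * x))    # exactly linear, negative slope: -1.0 (up to floating point)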
Here are a couple of other ways to compute that same correlation, using Numpy and Pandas:
# using Numpy
print(np.corrcoef(df.dropna().spy1, df.dropna().xlf1)[0, 1])
# using pandas
print(df.corr().loc['spy1','xlf1'])
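And if you also happen to have SciPy installed (it's not in the pip install line above), scipy.stats.pearsonr reports the same coefficient, along with a p-value:
from scipy.stats import pearsonr

res = pearsonr(df.dropna().spy1, df.dropna().xlf1)
print(res)  # same 0.84 correlation as above, plus a p-value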
Thing is, this collapses twenty-two years of history into a single number. I wonder: what's the correlation between daily returns of SPY vs XLF in just the first week of this dataset? The first year? The first five years?
Pandas can answer this question with one of its magically helpful functions, expanding: it constructs an expanding window over the dataframe and evaluates the correlation at each point along the expansion.
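To see what an expanding window does, here's a toy example using mean instead of corr so the numbers are easy to verify by hand (the 2 is min_periods, the smallest window that produces a value):
s = pd.Series([1.0, 2.0, 3.0, 4.0])
print(s.expanding(2).mean())
# NaN (only one observation so far), then mean(1,2)=1.5, mean(1,2,3)=2.0, mean(1,2,3,4)=2.5
Back to SPY and XLF: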
fig = go.Figure(
    data=go.Scatter(x=df.index, y=df.spy1.expanding(5).corr(df.xlf1), mode='lines'),
    layout=dict(
        title='SPY vs XLF daily % change, correlation in data window starting from {}'.format(str(df.index[0])[:10]),
        xaxis_title="Data window end date",
        yaxis_title="correlation",
    ))
fig.show()
To confirm your intuition: the rightmost value in this chart is the correlation over the entire dataset, and it matches the 0.84 we computed above.
Similarly, the minimum of this graph, 0.68 in December 2000, is the correlation between the two series over the window from 1998 (the start of the data) through December 2000.
In other words, this graph shows what correlations your mycorrcoef function would have printed over history.
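As a spot check (the row position 500 below is arbitrary, chosen just for illustration), the expanding curve's value at a given date should agree, up to floating point, with mycorrcoef run on all the data up to and including that date:
t = df.index[500]                      # some date partway through the history
sub = df.loc[:t].dropna()              # all rows up to and including t
print(mycorrcoef(sub.spy1, sub.xlf1))
print(df.spy1.expanding(5).corr(df.xlf1).loc[t])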
But we can see that the correlation between these two series has changed over time. It was roughly 80% (0.8 on the y axis) before the 2000 crash that accompanied the popping of the Tech Bubble, and for several years thereafter XLF's daily returns were less correlated with SPY's.
I'm also curious—what are the correlations over the last week, last month, last year of the data? That is, for the time windows of varying lengths ending today?
fig = go.Figure(
    data=go.Scatter(x=df.index[::-1], y=df.spy1[::-1].expanding(5).corr(df.xlf1[::-1]), mode='lines'),
    layout=dict(
        title='SPY vs XLF daily % change, correlation in data window ending {}'.format(str(df.index[-1])[:10]),
        xaxis_title="Data window <i>start</i> date",
        yaxis_title="correlation",
    ))
fig.show()
As a sanity check, you can confirm that the leftmost point in the graph here corresponds to the global correlation we calculated above, 0.84.
Both the above graphs show the correlation over expanding windows of time: from 1998 onwards or from 2020 backwards. Suppose instead I wanted to make a service that printed the trailing correlation over a fixed window: what has the correlation between SPY and XLF daily returns been over the last month, the last six months, the last year?
Pandas has more magic, called rolling, to answer this question.
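The difference from expanding is that a rolling window has a fixed length and slides along, dropping the oldest observation as each new one arrives. The same toy series as before, again with mean for easy checking:
s = pd.Series([1.0, 2.0, 3.0, 4.0])
print(s.rolling(2).mean())
# NaN (window not yet full), then mean(1,2)=1.5, mean(2,3)=2.5, mean(3,4)=3.5
Back to SPY and XLF, with trailing windows of 240, 120, and 20 trading sessions (roughly a year, six months, and a month):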
# One trace per window length, with wider lines for the longer windows so they're easy to tell apart
fig = go.Figure(
    data=[go.Scatter(
        x=df.index,
        y=df.spy1.rolling(win).corr(df.xlf1),
        mode='lines',
        name='{} sessions'.format(win),
        line=dict(width=width),
        opacity=0.75,
    ) for win, width in zip([240, 120, 20], [4, 2, 1])],
    layout=dict(
        title='Trailing correlations for several time horizons: SPY vs XLF daily percent returns',
        xaxis_title='Window end date',
        yaxis_title='correlation',
        hovermode='x',
    ))
fig.show()
To confirm your understanding: if you published a newsletter that included the most recent month's correlations between SPY and XLF's daily returns, the thin "20 sessions" curve above gives the number you would have published on any given date.
This is most interesting. Even though the correlation of these two time series over all time was 0.84, we see that over shorter windows, the correlations move around a lot.
It almost seems like a periodic cycle: the trailing monthly correlation hits near 1.0, meaning SPY's and XLF's daily returns nearly moved in lock-step (though correlation ignores the slope between the two), and then, with apparently just as much regularity, the one-month correlation drops substantially.
We see that the Tech Bubble saw significant decorrelation between financials and the broader S&P 500, possibly because SPY was heavily tech-weighted and those stocks saw the most massive declines. In contrast, we can't really detect the 2008 Great Financial Crisis in these three curves: the declines in the broader index seem to have been quite correlated with those in financials, since by then SPY was heavily weighted towards financials.
So we can go the full distance: we can make a chart that shows the correlation between these two time series for any pair of start and end dates.
wins = np.round(np.linspace(5, len(df) - 1, 500)).astype(int)
p = pd.DataFrame([df.spy1.rolling(win).corr(df.xlf1) for win in wins],
                 index=wins)
# Going down `p`'s rows changes the length of the window of the correlation.
# Going across `p`'s columns changes the end date of the window of the correlation.

# Now change this from *window size* to *start date*.
for i in range(p.values.shape[1]):
    n = np.sum(np.isfinite(p.values[:, i]))
    p.values[:n, i] = np.flip(p.values[:n, i])
# update the index (the value as you run down columns) from window size to start date
p.index = df.index[wins]
Above we make a new Pandas dataframe to contain the correlations between any start and end date. Let's plot it below with Plotly as a heatmap.
fig = go.Figure(
    data=go.Heatmap(
        z=p,
        x=p.columns,
        y=p.index,
        colorscale='Viridis',
        zmin=0.5,
        zmax=0.95,
        hovertemplate='%{y}—%{x} corr: %{z:.2f}',
    ),
    layout=dict(
        title='Correlation over all time windows: SPY vs XLF daily percent returns',
        xaxis_title='Window end date',
        yaxis_title='Window start date',
    ))
fig.show()