EOF analysis with scikit-learn
In this notebook I give a very simple (and rather uncommented) example of how to use scikit-learn to perform an Empirical Orthogonal Function decomposition (EOF analysis, often referred to as well as Principal Component Analysis or PCA) of a climate field, in this case the monthly Sea Surface Temperature (SST) anomalies in the Pacific.
import pandas as pd
import numpy as np
from numpy import ma
from matplotlib import pyplot as plt
from mpl_toolkits.basemap import Basemap as bm
%matplotlib inline
A small function to plot a 2D field on a map with basemap
def plot_field(m, X, lats, lons, vmin, vmax, step, cmap=plt.get_cmap('jet'), \
ax=False, title=False, grid=False):
if not ax:
f, ax = plt.subplots(figsize=(8, (X.shape[0] / float(X.shape[1])) * 8))
m.ax = ax
im = m.contourf(lons, lats, X, np.arange(vmin, vmax+step, step), \
latlon=True, cmap=cmap, extend='both', ax=ax)
m.drawcoastlines()
if grid:
m.drawmeridians(np.arange(0, 360, 30), labels=[0,0,0,1])
m.drawparallels(np.arange(-80, 80, 20), labels=[1,0,0,0])
m.colorbar(im)
if title:
ax.set_title(title)
load the SST data¶
I am using xray for reading / writing netcdf files and manipulating the N-Dimensional labelled arrays
contained in them, to know more about xray, refer to my previous blog post
import xray
The SST data comes from the NOAA ERSST V3b Dataset. The original data was downloaded from ftp://ftp.ncdc.noaa.gov/pub/data/cmb/ersst/v3b/netcdf/ and I made a concaneted version of the originally yearly files available here
dset = xray.open_dataset('/Users/nicolasf/data/SST/ER_SST/ersst.realtime.nc')
dset
Selects the period 1950 - 2013 and the tropical Pacific domain¶
dsub = dset.sel(time=slice('1950','2013'), zlev=0, lat=slice(-40,40), lon=slice(120,290))
lat = dsub['lat'].values
lon = dsub['lon'].values
sst = dsub['anom'].values
lons, lats = np.meshgrid(lon, lat)
sst.shape
reshape in 2D (time, space)¶
X = np.reshape(sst, (sst.shape[0], len(lat) * len(lon)), order='F')
np.any(np.isnan(X))
Mask the land points¶
type(X)
X = ma.masked_array(X, np.isnan(X))
type(X)
land = X.sum(0).mask
ocean = -land
keep only oceanic grid-points¶
X = X[:,ocean]
Standardize SST using the fit and transform methods of the sklearn.preprocessing.scaler.StandardScaler
¶
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler_sst = scaler.fit(X)
Once the scaler object has been 'trained' on the data, we can save it as a pickle object¶
from sklearn.externals import joblib
joblib.dump(scaler_sst, './scaler_sst.pkl', compress=9)
scaler_sst = joblib.load('./scaler_sst.pkl')
scales: use the transform
method of the scaler object¶
X = scaler_sst.transform(X)
verify that mean = 0 and std = 1¶
X.mean()
X.std()
X.shape
EOF decomposition¶
from sklearn.decomposition import pca
instantiates the PCA object¶
skpca = pca.PCA()
fit¶
skpca.fit(X)
Now saves the (fitted) PCA object for reuse in operations¶
joblib.dump(skpca, '../EOF.pkl', compress=9)
f, ax = plt.subplots(figsize=(5,5))
ax.plot(skpca.explained_variance_ratio_[0:10]*100)
ax.plot(skpca.explained_variance_ratio_[0:10]*100,'ro')
ax.set_title("% of variance explained", fontsize=14)
ax.grid()
keep number of PC sufficient to explain 70 % of the original variance¶
ipc = np.where(skpca.explained_variance_ratio_.cumsum() >= 0.70)[0][0]
ipc
The Principal Components (PCs) are obtained by using the transform
method of the pca
object (skpca
)¶
PCs = skpca.transform(X)
PCs = PCs[:,:ipc]
The Empirical Orthogonal Functions (EOFs) are contained in the components_
attribute of the pca
object (skpca
)¶
EOFs = skpca.components_
EOFs = EOFs[:ipc,:]
EOFs.shape
we can the reconstruct the 2D fields (maps)¶
EOF_recons = np.ones((ipc, len(lat) * len(lon))) * -999.
for i in range(ipc):
EOF_recons[i,ocean] = EOFs[i,:]
EOF_recons = ma.masked_values(np.reshape(EOF_recons, (ipc, len(lat), len(lon)), order='F'), -999.)
EOF_recons.shape
m = bm(projection='cyl',llcrnrlat=lat.min(),urcrnrlat=lat.max(),\
llcrnrlon=lon.min(),urcrnrlon=lon.max(),\
lat_ts=0,resolution='c')
type(EOF_recons)
EOF_recons *= 100
plot_field(m, EOF_recons[0,:,:], lats, lons, -5, 5, 0.1, grid=True, cmap=plt.get_cmap('RdBu_r'))
scale the Principal Components¶
from sklearn.preprocessing import StandardScaler
scaler_PCs = StandardScaler()
scaler_PCs.fit(PCs)
PCs_std = scaler_PCs.transform(PCs)
joblib.dump(scaler_PCs, './scaler_PCs.pkl')
PCdf = pd.DataFrame(PCs_std, index = dsub['time'], \
columns = ["EOF%s" % (x) for x in range(1, PCs_std.shape[1] +1)])
PCdf.head()
PCdf.to_csv('./EOF_ERSST_PCs.csv')
from scipy.signal import detrend
f, ax = plt.subplots(figsize=(10,4))
PCdf.ix[:,0].plot(ax=ax, color='k', label='PC1')
ax.axhline(0, c='0.8')
#ax.set_xlabel('period', fontsize=18)
ax.plot(PCdf.index, detrend(PCdf.ix[:,0].values), 'r', label='PC1 (trend removed)')
ax.grid('off')
ax.legend(loc=1);