EOF analysis with scikit-learn


In this notebook I give a simple example of how to use scikit-learn to perform an Empirical Orthogonal Function decomposition (EOF analysis, often also referred to as Principal Component Analysis, or PCA) of a climate field, in this case the monthly Sea Surface Temperature (SST) anomalies in the Pacific.

In [1]:
import pandas as pd
import numpy as np
from numpy import ma
from matplotlib import pyplot as plt
from mpl_toolkits.basemap import Basemap as bm
In [2]:
%matplotlib inline


A small function to plot a 2D field on a map with basemap

In [3]:
def plot_field(m, X, lats, lons, vmin, vmax, step, cmap=plt.get_cmap('jet'), \
               ax=None, title=None, grid=False):
    # create a figure and axes if no axes instance is passed in
    if ax is None:
        f, ax = plt.subplots(figsize=(8, (X.shape[0] / float(X.shape[1])) * 8))
    m.ax = ax
    im = m.contourf(lons, lats, X, np.arange(vmin, vmax+step, step), \
                    latlon=True, cmap=cmap, extend='both', ax=ax)
    m.drawcoastlines()
    if grid:
        m.drawmeridians(np.arange(0, 360, 30), labels=[0,0,0,1])
        m.drawparallels(np.arange(-80, 80, 20), labels=[1,0,0,0])
    m.colorbar(im)
    if title:
        ax.set_title(title)
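As a quick usage sketch (not part of the original notebook; m_demo, dummy, lons_d and lats_d are made-up names for illustration), the function can be exercised on a random field over the Pacific domain used later:

m_demo = bm(projection='cyl', llcrnrlat=-40, urcrnrlat=40,
            llcrnrlon=120, urcrnrlon=290, resolution='c')
lons_d, lats_d = np.meshgrid(np.arange(120, 292, 2), np.arange(-40, 42, 2))
dummy = np.random.randn(41, 86)  # hypothetical field on the same 2-degree grid
plot_field(m_demo, dummy, lats_d, lons_d, -3, 3, 0.5, title='dummy field')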


load the SST data


I am using xray for reading / writing netcdf files and manipulating the N-dimensional labelled arrays they contain; to learn more about xray, see my previous blog post

In [4]:
import xray


The SST data comes from the NOAA ERSST V3b Dataset. The original data was downloaded from ftp://ftp.ncdc.noaa.gov/pub/data/cmb/ersst/v3b/netcdf/ and I made a concatenated version of the original yearly files available here
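The history attribute of the file (see below) shows the concatenation was actually done with NCO's ncrcat, but the same operation can be sketched in xray, assuming a version providing xray.concat and to_netcdf, and a hypothetical local file pattern:

import glob
files = sorted(glob.glob('ersst.*_ft.nc'))  # hypothetical file pattern
combined = xray.concat([xray.open_dataset(f) for f in files], dim='time')
combined.to_netcdf('ersst.realtime.nc')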

In [5]:
dset = xray.open_dataset('/Users/nicolasf/data/SST/ER_SST/ersst.realtime.nc')
In [6]:
dset
Out[6]:
<xray.Dataset>
Dimensions:  (lat: 89, lon: 180, time: 800, zlev: 1)
Coordinates:
  * lon      (lon) float32 0.0 2.0 4.0 6.0 8.0 10.0 12.0 14.0 16.0 18.0 20.0 ...
  * lat      (lat) float32 -88.0 -86.0 -84.0 -82.0 -80.0 -78.0 -76.0 -74.0 ...
  * time     (time) datetime64[ns] 1948-02-15 1948-03-15 1948-04-15 ...
  * zlev     (zlev) float32 0.0
Data variables:
    err      (time, zlev, lat, lon) float64 nan nan nan nan nan nan nan nan ...
    anom     (time, zlev, lat, lon) float64 nan nan nan nan nan nan nan nan ...
    sst      (time, zlev, lat, lon) float64 nan nan nan nan nan nan nan nan ...
Attributes:
    Conventions: CF-1.0
    creation_date: 2010-03-23
    nco_openmp_thread_number: 1
    description: Smith, T.M., and R.W. Reynolds, 2003: Extended Reconstruction of Global Sea Surface Temperatures Based on COADS Data (1854-1997). Journal of Climate, 16, 1495-1510. A vailable at http://www.ncdc.noaa.gov/img/climate/research/sst/ersst.pdf  Climatology is based on 1971-2000 OI.v2 SST, In situ data: ICOADS2.4 before 2006 and NCEP in situ data from 2007 to present
    source: NOAA/National Climatic Data Center
    title: ERSST V3b in situ only
    history: Sun Oct  5 09:00:07 2014: ncrcat -O ersst.194802_ft.nc ersst.194803_ft.nc ersst.194804_ft.nc ersst.194805_ft.nc ersst.194806_ft.nc ersst.194807_ft.nc ersst.194808_ft.nc ersst.194809_ft.nc ersst.194810_ft.nc ersst.194811_ft.nc ersst.194812_ft.nc ersst.194901_ft.nc ersst.194902_ft.nc ersst.194903_ft.nc ersst.194904_ft.nc ersst.194905_ft.nc ersst.194906_ft.nc ersst.194907_ft.nc ersst.194908_ft.nc ersst.194909_ft.nc ersst.194910_ft.nc ersst.194911_ft.nc ersst.194912_ft.nc ersst.195001_ft.nc ersst...
    contact: Dick Reynolds, email: Richard.W.Reynolds@noaa.gov & Chunying Liu, email: Chunying.liu@noaa.gov


Select the period 1950 - 2013 and the tropical Pacific domain

In [7]:
dsub = dset.sel(time=slice('1950','2013'), zlev=0, lat=slice(-40,40), lon=slice(120,290))
In [8]:
lat = dsub['lat'].values
lon = dsub['lon'].values
sst = dsub['anom'].values
In [9]:
lons, lats = np.meshgrid(lon, lat)
In [10]:
sst.shape
Out[10]:
(768, 41, 86)


reshape to 2D (time, space)

In [11]:
X = np.reshape(sst, (sst.shape[0], len(lat) * len(lon)), order='F')
In [12]:
np.any(np.isnan(X))
Out[12]:
True


Mask the land points

In [13]:
type(X)
Out[13]:
numpy.ndarray
In [14]:
X = ma.masked_array(X, np.isnan(X))
In [15]:
type(X)
Out[15]:
numpy.ma.core.MaskedArray
In [16]:
# grid-points that are masked (land) at any time step
land = X.sum(0).mask
In [17]:
# invert the land mask to get the oceanic grid-points
ocean = ~land


keep only oceanic grid-points

In [18]:
X = X[:,ocean]


Standardize the SST data using the fit and transform methods of sklearn.preprocessing.StandardScaler

In [19]:
from sklearn import preprocessing
scaler  = preprocessing.StandardScaler()
In [20]:
scaler_sst = scaler.fit(X)


Once the scaler object has been 'trained' on the data, we can save it as a pickle object

In [21]:
from sklearn.externals import joblib
In [22]:
joblib.dump(scaler_sst, './scaler_sst.pkl', compress=9)
Out[22]:
['./scaler_sst.pkl']
In [23]:
scaler_sst = joblib.load('./scaler_sst.pkl')


scale the data using the transform method of the scaler object

In [24]:
X = scaler_sst.transform(X)


verify that mean = 0 and std = 1

In [25]:
X.mean()
Out[25]:
-8.4161496440623899e-19
In [26]:
X.std()
Out[26]:
1.0000000000000002
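Note that StandardScaler standardizes each feature (here each oceanic grid-point) independently, so a stricter check is per column; a quick sketch (not in the original notebook):

# every column (grid-point) should have ~zero mean and unit standard deviation
print(np.allclose(X.mean(axis=0), 0, atol=1e-10))
print(np.allclose(X.std(axis=0), 1))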
In [27]:
X.shape
Out[27]:
(768, 3133)


EOF decomposition

In [28]:
from sklearn.decomposition import PCA


instantiate the PCA object
In [29]:
skpca = PCA()


fit
In [30]:
skpca.fit(X)
Out[30]:
PCA(copy=True, n_components=None, whiten=False)
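As an aside, depending on the scikit-learn version (this is an assumption about the API, not what the notebook uses), n_components can also be passed as a float between 0 and 1, in which case PCA keeps just enough components to explain that fraction of the variance:

skpca70 = PCA(n_components=0.70)  # hypothetical alternative, not used below
skpca70.fit(X)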


Now save the fitted PCA object for later reuse in operations

In [31]:
joblib.dump(skpca, '../EOF.pkl', compress=9)
Out[31]:
['../EOF.pkl']
In [32]:
f, ax = plt.subplots(figsize=(5,5))
ax.plot(skpca.explained_variance_ratio_[0:10]*100)
ax.plot(skpca.explained_variance_ratio_[0:10]*100,'ro')
ax.set_title("% of variance explained", fontsize=14)
ax.grid()


keep the number of PCs sufficient to explain 70% of the original variance

In [33]:
# index of the first PC at which the cumulative explained variance reaches 70%
ipc = np.where(skpca.explained_variance_ratio_.cumsum() >= 0.70)[0][0]
In [34]:
ipc
Out[34]:
9
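One subtlety, worth noting although it does not change anything below: ipc is a zero-based index, so keeping PCs[:, :ipc] retains the components whose cumulative variance is just under 70%; ipc + 1 components would be needed to actually cross the threshold. A quick check (a sketch, not in the original notebook):

print(skpca.explained_variance_ratio_.cumsum()[ipc - 1])  # just under 0.70
print(skpca.explained_variance_ratio_.cumsum()[ipc])      # first value >= 0.70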


The Principal Components (PCs) are obtained by using the transform method of the PCA object (skpca)

In [35]:
PCs = skpca.transform(X)
In [36]:
PCs = PCs[:,:ipc]


The Empirical Orthogonal Functions (EOFs) are contained in the components_ attribute of the PCA object (skpca)

In [37]:
EOFs = skpca.components_
In [38]:
EOFs = EOFs[:ipc,:]
In [39]:
EOFs.shape
Out[39]:
(9, 3133)


we can then reconstruct the 2D fields (maps)

In [40]:
# initialise with a missing-value flag (-999.)
EOF_recons = np.ones((ipc, len(lat) * len(lon))) * -999.
In [41]:
# fill the oceanic grid-points with the EOF loadings
for i in range(ipc): 
    EOF_recons[i,ocean] = EOFs[i,:]
In [42]:
# reshape back to (PC, lat, lon) maps and mask the flagged land points
EOF_recons = ma.masked_values(np.reshape(EOF_recons, (ipc, len(lat), len(lon)), order='F'), -999.)
In [43]:
EOF_recons.shape
Out[43]:
(9, 41, 86)
In [44]:
m = bm(projection='cyl',llcrnrlat=lat.min(),urcrnrlat=lat.max(),\
            llcrnrlon=lon.min(),urcrnrlon=lon.max(),\
            lat_ts=0,resolution='c')
In [45]:
type(EOF_recons)
Out[45]:
numpy.ma.core.MaskedArray
In [46]:
EOF_recons *= 100
In [47]:
plot_field(m, EOF_recons[0,:,:], lats, lons, -5, 5, 0.1, grid=True, cmap=plt.get_cmap('RdBu_r'))
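The same call can be wrapped in a loop to look at the next few patterns (a quick sketch, not in the original notebook):

for i in range(3):
    f, ax = plt.subplots(figsize=(8, 4))
    plot_field(m, EOF_recons[i,:,:], lats, lons, -5, 5, 0.1, grid=True,
               cmap=plt.get_cmap('RdBu_r'), ax=ax, title='EOF %s' % (i + 1))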


scale the Principal Components

In [48]:
from sklearn.preprocessing import StandardScaler
In [49]:
scaler_PCs = StandardScaler()
In [50]:
scaler_PCs.fit(PCs)
Out[50]:
StandardScaler(copy=True, with_mean=True, with_std=True)
In [51]:
PCs_std = scaler_PCs.transform(PCs)
In [52]:
joblib.dump(scaler_PCs, './scaler_PCs.pkl')
Out[52]:
['./scaler_PCs.pkl', './scaler_PCs.pkl_01.npy', './scaler_PCs.pkl_02.npy']
In [53]:
PCdf = pd.DataFrame(PCs_std, index = dsub['time'], \
                    columns = ["EOF%s" % (x) for x in range(1, PCs_std.shape[1] +1)])
In [54]:
PCdf.head()
Out[54]:
EOF1 EOF2 EOF3 EOF4 EOF5 EOF6 EOF7 EOF8 EOF9
time
1950-01-15 -1.123907 -1.755836 -0.452906 -0.712406 2.069823 -2.100420 0.717192 -0.657735 0.274798
1950-02-15 -1.249496 -1.710119 -0.638251 -0.524621 1.085331 -2.025163 -0.233940 -0.988423 0.049161
1950-03-15 -1.092109 -1.315710 -0.655874 0.620964 1.108923 -0.767232 -0.514578 -1.483388 0.898097
1950-04-15 -1.212728 -1.132280 -0.407089 0.413160 0.271240 -0.201163 -0.498882 -0.545291 1.621271
1950-05-15 -1.616118 -1.008179 -0.620920 -0.114747 0.097757 0.207640 -1.446340 -1.372091 1.284412
In [55]:
PCdf.to_csv('./EOF_ERSST_PCs.csv')
In [56]:
from scipy.signal import detrend
In [57]:
f, ax = plt.subplots(figsize=(10,4))
PCdf.iloc[:,0].plot(ax=ax, color='k', label='PC1')
ax.axhline(0, c='0.8')
ax.plot(PCdf.index, detrend(PCdf.iloc[:,0].values), 'r',  label='PC1 (trend removed)')
ax.grid(False)
ax.legend(loc=1); 
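Finally, since PCA is a linear decomposition, the truncated PCs and EOFs can be used to reconstruct an approximation of the scaled SST field, which makes for a simple sanity check (a sketch, not in the original notebook):

# rank-ipc approximation of the scaled anomalies: PCs . EOFs + mean
X_approx = np.dot(PCs, EOFs) + skpca.mean_
print(np.corrcoef(X_approx.ravel(), X.ravel())[0, 1])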