xray
xray has been originally developed by scientists and engineers working at the Climate Corporation
xray is a Python package that allows to define and manipulate N-Dimensional labelled arrays. In a nutschell, whenever you've got data that is defined over more than 2 dimensions, and to each point along those dimensions can be associated a label (e.g. a latitute, a longitude, a timestamp, a depth, etc) then you definitely need to have a look at xray.
If this data model reminds you of the data structures introduced by the widely used pandas library, this is not a coincidence, coming right from the xray documentation is this disclaimer: "xray is an open source project and Python package that aims to bring the labeled data power of pandas to the physical sciences, by providing N-dimensional variants of the core pandas data structures."
In this post I will give a few reasons why I think that xray is destined to become a core Python package for people working with multi-dimensional arrays, especially - but not only - in the geosciences, before illustrating the power of xray with a few examples of how it simplifies considerably common operations on climate datasets.
Why is xray so great ?¶
Below is a list of reasons why I think that i) xray is great ii) it is destined to become an indispensable Python package for people who deal with multidimensional data:
-
xray is open-source (of course !) and has a non-restrictive license
-
It is now part of the anaconda Python distribution, which is currently probably the easiest way to install (and manage) the Python Scientific ecosystem on your machine, whatever your platform is. So to install xray all you have to do is to run:
conda install xray
-
It plays well with the rest of the Scientific ecosystem, notably:
-
It is based on Pandas, arguably one of the most popular Python library in the so-called 'data sciences'. Because xray tries and follow as much as possible the design choices and the API of Pandas, it means that if you are already a Pandas user, xray will feel extremely familiar. Stephan Hoyer, the lead developer of xray, is also contributing to Pandas, which ensures further integration with Pandas itself.
-
It also plays well with the native Python data structures, one killer feature for me: you can define a xray.Dataset (in a nutschell a container of labelled N-D arrays) from a Python dictionnary
-
There is support for out-of-core computation through dask: see this post from Stephan Hoyer on the continuum analytics blog.
-
xray data structures can be saved into (and read from) the efficient NetCDF file format, which is widely used in the earth sciences
-
It has a great documentation
For more on xray, I also suggest having a look at these talks from Stephan Hoyer:
- Introducing xray: extended arrays for scientific datasets at Pydata Silicon Valley
- xray: N D Labeled Arrays and Datasets at Scipy 2015.
There's also a nice post from Daniel Rothenberg touching upon xray and why he migrated from iris to xray.
In the remaining of this notebook, I will give a few examples of some features of xray, again head to the documentation for a comprehensive view of xray.
xray in action¶
the usual imports first
%matplotlib inline
import os
from datetime import datetime
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from mpl_toolkits.basemap import Basemap as bm
importing xray, here the latest stable version, installed via conda install xray
import xray
print(xray.__version__)
defining a little function to plot a map with basemap
def plot_field(m, X, lats, lons, vmin, vmax, step, cmap=plt.get_cmap('jet'), \
ax=False, title=False, grid=False):
if not ax:
f, ax = plt.subplots(figsize=(8, (X.shape[0] / float(X.shape[1])) * 8))
m.ax = ax
im = m.contourf(lons, lats, X, np.arange(vmin, vmax+step, step), \
latlon=True, cmap=cmap, extend='both', ax=ax)
m.drawcoastlines()
if grid:
m.drawmeridians(np.arange(0, 360, 30), labels=[0,0,0,1])
m.drawparallels(np.arange(-80, 80, 20), labels=[1,0,0,0])
m.colorbar(im)
if title:
ax.set_title(title)
Ingesting a necdf dataset¶
To know more about the netcdf format and specifications, I suggest heading here and here
I am gonna use the NOAA Interpolated Monthly Outgoing Longwave Radiation Dataset which I saved locally
dset = xray.open_dataset(os.path.join(os.environ['HOME'], 'data/olr.mon.mean.nc'))
dset is a xray.Dataset: "It is a dict-like container of labeled arrays (DataArray objects) with aligned dimensions. It is designed as an in-memory representation of the data model from the netCDF file format."
dset
Accessing data and coordinates¶
To access data and coordinates vectors of a xray.Dataset
, you use a simple dict-like syntax
lat = dset['lat']
type(lat)
You can access the underlying Numpy array through the data
attribute of the lat
object
lat.data
Selecting along dimensions¶
A killer feature of xray is that it allows to select data along its dimensions using label-based indexing, see more here.
It means notably - something very handy for climate / ocean datasets - that you can select a specific date / time or period using the string representation of these dates / times
dset.sel(time=('1998-1-1'))
You can also use slices, and combine selection over several dimensions without having to stick to the dimension-ordering of the data, i.e even if the data in this specific dataset is defined over [time, lat, lon]
, you can pass the dimensions in any order to the sel
method, as long as you use the correct dimension name.
A word of warning however, when selecting a slice
, you need to follow the order of the labels, i.e. in this case the first latitude is the northermost latitude, so when slicing you need to go from North to South.
dset.sel(lat=slice(40,-40), time=slice('1998-1-1','2000-12-1'))
For the rest of the examples I am going to extract the Tropical Pacific Ocean, because I am particularly interested into what's happening there
dset = dset.sel(lat=slice(40,-40), lon=slice(120,300))
dset
lat = dset['lat']
lon = dset['lon']
despite the variables lat
and lon
being of type DataArray
, numpy's function are usually
able to deal directly with them, without having to pass them the underlying numpy array
lons, lats = np.meshgrid(lon, lat)
Plotting the mean Outgoing Longwave Radiation field for January 1998, i.e. during the core of the 1997/1998 El Niño event
m = bm(projection='cyl',llcrnrlat=lat.data.min(),urcrnrlat=lat.data.max(),\
llcrnrlon=lon.data.min(),urcrnrlon=lon.data.max(),\
lat_ts=0,resolution='c')
plot_field(m, dset.sel(time=('1998-1-1'))['olr'], lats, lons, 80, 300, 10, grid=True)
How does that compare with the climatology ? well, xray implements the groupby method of Pandas, making it
trivial to calculate a climatology
Calculates a monthly climatology using the groupby machinery¶
clim = dset.groupby('time.month').mean('time')
clim
Lets' plot the monthly climatology above, using the calendar
module to display the month's names
from calendar import month_name
f, axes = plt.subplots(nrows=4,ncols=3, figsize=(14,10))
f.subplots_adjust(hspace=0.1, wspace=0.1)
axes = axes.flatten()
for i, month in enumerate(range(1,13)):
ax = axes[i]
plot_field(m, clim['olr'][i,:,:], lats, lons, 150, 300, 5, ax=ax, title=month_name[month])
xray supports the “derived” datetime components implemented by pandas, including “year”, “month”, “day”, “hour”, “minute”, “second”, “dayofyear”, “week”, “dayofweek”, “weekday” and “quarter”, it also adds season to the list of datetime components, which defines the standard climatological seasons (December - February, March - May, June - August and September - November)
seas_clim = dset.groupby('time.season').mean('time')
f, axes = plt.subplots(nrows=2,ncols=2, figsize=(10,5))
f.subplots_adjust(hspace=0.2, wspace=0.2)
axes = axes.flatten('F')
for i, seas in enumerate(seas_clim['season'].values):
ax = axes[i]
plot_field(m, seas_clim['olr'][i,:,:], lats, lons, 150, 300, 5, ax=ax, title=seas, grid=True)
Similarly to Pandas, instead of using the standard methods that can be applied to a groupby object, you can define
your own function, and pass it to the groupby object via the apply
method.
For example, you can easily calculate the anomalies with respect to a given climatological period in two steps:
def demean(x):
return x - x.sel(time=slice('1981-1-1','2010-12-1')).mean('time')
olr_anoms = dset.groupby('time.month').apply(demean)
olr_anoms
Let's plot the anomalies for January 1998 (El Niño again)
plot_field(m, olr_anoms['olr'].sel(time='1998-1-1'), lats, lons, -80, 80, 5, title='', \
grid=True, cmap=plt.get_cmap('RdBu_r'))
dumps a xray.Dataset
object into a netcdf (Version 4) file¶
It is very easy to save a xray.Dataset
into a netcdf file with the to_netcdf
method.
Note: it works only with a xray.Dataset
object, if you have a Dataarray
, you can cast it to a Dataset
, an example below where I:
1) extract the Dataarray
holding the olr
variable
2) cast it again as a Dataset
olr = olr_anoms['olr']
type(olr)
olr = olr.to_dataset()
type(olr)
olr.to_netcdf('./olr_anomalies.nc')
Creates a xray dataset object from numpy arrays using a Python dictionnary¶
One of the killer feature of xray for me is the ability to create a xray.Dataset
from scratch
(and saves it to disk in netcdf)
Below I create the coordinates variables arrays: here latitudes, longitudes, level and time, i.e like in a
standard climate dataset
lon = np.linspace(0, 357.5, 144, endpoint=True)
lat = np.linspace(-90,90, 73, endpoint=True)
lons, lats = np.meshgrid(lon,lat)
lev = np.array([1000,925,850])
time = pd.date_range(start='2015-1-1',end='2015-1-3')
Then the data variable array
arr = np.random.randn(3,3,73,144)
To create a xray.Dataset
, you first create a Python dictionnary:
- The dictionnary keys are the names of the variables contained in the Dataset.
- The Dictionnary values are tuples, with first the (or the list of) dimension(s) over which the array varies, then the array itself (defined above)
d = {}
d['time'] = ('time',time)
d['latitudes'] = ('latitudes',lat)
d['longitudes'] = ('longitudes', lon)
d['level'] = ('level', lev)
d['var'] = (['time','level','latitudes','longitudes'], arr)
Then you pass this dictionnary to the Dataset
function
dset_from_dict = xray.Dataset(d)
You can easily add global and variables attributes
dset_from_dict.attrs['creation_date'] = datetime.utcnow().strftime("%Y-%m-%d")
dset_from_dict.longitudes.attrs['units'] = 'degrees_east'
dset_from_dict.latitudes.attrs['units'] = 'degrees_north'
Then you can save to disk as seen above
dset_from_dict.to_netcdf('./dset_from_dict.nc')
!ncdump -h ./dset_from_dict.nc
This was a very brief introduction to the power of xray for manipulating labelled N-dimensional arrays, again head
to the excellent xray documentation to learn more