0
votes

I have a large (5GB) temperature netCDF file. The file is has 4 dimensions: time, pressure level, latitude, longitude.

The dataset has 31 timepoints and I am only interested in 5 pressure levels.

My parameter is temperature t:

from netCDF4._netCDF4 import Dataset
# Load the dataset
dataset = Dataset(path)
factor = dataset.variables['t']

To extract a 'cube' of temperature data from my factor variable around a centre cell, I would simply do subsetting, like this:

radius = 5 
# +1 because the subsetting does not include last index
lats_bounds = [nearest_latitude_index-radius,nearest_latitude_index+radius + 1] 
lons_bounds = [nearest_longitude_index-radius,nearest_longitude_index+radius +1]

#all timepoints
times_bounds = [0, len(times)] 

#just the last 5 pressure levels
pressure_level_bounds = [len(levels)-5, len(levels)] 

results = factor[times_bounds[0]:times_bounds[1],pressure_level_bounds[0]:pressure_level_bounds[1], lats_bounds[0]:lats_bounds[1],lons_bounds[0]:lons_bounds[1]]

The problem is that results will now of type ndarray with shape (31,5,11,11) and size 18755 (31*5*11*11) where every index holds just a single value.

I need the values from results, but for each value, I also need its corresponding timepoint, pressure level, latitude and longitude.

Ideally, I'd like to do subsetting as I have done, but my final result would be an array of tuples... Something like this:

# corresponding timestamp, pressure level, latitude, longitude
# and the temperature value extracted.
final = [
(2342342, 1000, 24.532, 53.531, 277),
(2342342, 1000, 74.453, 26.123, 351),
(2342342, 1000, 80.311, 56,345, 131),
...
]

How can I achieve this?

2

2 Answers

1
votes

Check out xarray's isel. Translating the syntax from netCDF4 would look something like this:

ds = xr.open_dataset(path)
factor = ds['t']

# note that levels/lon/lat are the names of dimensions in your Dataset
subset = factor.isel(levels=slice(-5, None),
                     lon=[1, 18, 48, 99], lat=[16, 28, 33, 35])
stacked = subset.stack(points=('time', 'levels', 'lon', 'lat'))

# This subset can be converted to a `pandas.Series`:
data = stacked.to_pandas()

# or it can be converted to a list of tuples
df = data.reset_index()
final = [tuple(row[1].values) for row in df.iterrows()]

Xarray also supports label based indexers (i.e. lat=[29.3, 42.3]) but for that, you should use the sel method instead of isel.

-2
votes

I'd use Pandas for this task. But since you only have 35 times and 5 pressure levels, I'd first simplify your approach and figure out how to do a single time and pressure level and a single lat,lon. Then figure out how to loop thru those indices to get your tuples. Something like:

for i in range(0, len(times)):
   for j in range(0, len(levels):
     print( results[i, j, nearest_lat_idx, nearest_lon_idx) )

Of course you could also add loops for the lat and lon as well but it's a bit ugly.