NetCDF with tidync

Michael D. Sumner

2019-11-06

The goal of tidync is to ease exploring the contents of a NetCDF source and to simplify the process of data extraction.

Array dimension expressions virtually slice (i.e. index) into data arrays with immediate feedback, and ease programming for automating data extraction. No data is read until explicitly requested, and can be pulled directly as an array, or in long-form as a data frame.

NetCDF is Network Common Data Form, a data model and an API.

This is a very common, and very general way to store and work with scientific array-based data. NetCDF is defined and provided by Unidata.

Here we introduce traditional concepts of NetCDF, and show examples with built-in files and online sources to demonstrate tidync functionality.

NetCDF

NetCDF is a very widely used file format for storing array-based data as variables. The space occupied by a variable is defined by its dimensions and their metadata. Dimensions are by definition one-dimensional (i.e. an atomic vector in R of length 1 or more), an array with coordinate metadata, units, type and interpretation. The space of a variable is defined as one or more of the dimensions in the file. A given variable won’t necessarily use all the available dimensions and no dimensions are mandatory or particularly special.

Some conventions exist to define usage and minimal standards for metadata for particular file schemas, but these are many and varied, and not always adhered to. Tidync does not provide a data interpretation as it is intended for general use, including by tools that create data formats.

The R community is not particularly strong with use of NetCDF, though it is common and widely used it pales compared to use in general climate science work, and there the most used tool is the CDO Climate Data Operators. In R the most common tools used are ncdf4 and raster (which uses ncdf4).

Both the RNetCDF and ncdf4 packages provide a traditional summary format, familiar to many NetCDF users as the output of the command line program ncdump.

ice_file <- system.file("extdata", "ifremer", "20171002.nc", package = "tidync", mustWork = TRUE)
library(RNetCDF)
print.nc(open.nc(ice_file))
#> netcdf classic {
#> dimensions:
#>  ni = 632 ;
#>  nj = 664 ;
#>  time = 1 ;
#> variables:
#>  NC_INT time(time) ;
#>      NC_CHAR time:long_name = "time" ;
#>      NC_CHAR time:units = "hours since 1900-1-1 0:0:0" ;
#>  NC_BYTE concentration(ni, nj, time) ;
#>      NC_CHAR concentration:long_name = "sea-ice concentration" ;
#>      NC_CHAR concentration:units = "percent" ;
#>      NC_DOUBLE concentration:scale_factor = 1 ;
#>      NC_DOUBLE concentration:add_offset = 0 ;
#>      NC_BYTE concentration:missing_value = -128 ;
#>      NC_BYTE concentration:_FillValue = -128 ;
#>  NC_BYTE quality_flag(ni, nj, time) ;
#>      NC_CHAR quality_flag:long_name = "quality_flag" ;
#>      NC_CHAR quality_flag:units = "n/a" ;
#>      NC_DOUBLE quality_flag:scale_factor = 1 ;
#>      NC_DOUBLE quality_flag:add_offset = 0 ;
#>      NC_BYTE quality_flag:missing_value = -128 ;
#>      NC_BYTE quality_flag:_FillValue = -128 ;
#> 
#> // global attributes:
#>      NC_CHAR :CONVENTIONS = "COARDS" ;
#>      NC_CHAR :long_name = "Sea-ice concentration as observed by SSM/I" ;
#>      NC_CHAR :short_name = "PSI-F18-Concentration" ;
#>      NC_CHAR :producer_agency = "IFREMER" ;
#>      NC_CHAR :producer_institution = "CERSAT" ;
#>      NC_CHAR :netcdf_version_id = "3.4" ;
#>      NC_CHAR :product_version = "2.0" ;
#>      NC_CHAR :creation_time = "2017-280T08:26:34.000" ;
#>      NC_CHAR :time_resolution = "daily" ;
#>      NC_CHAR :grid = "NSIDC" ;
#>      NC_CHAR :pole = "south" ;
#>      NC_CHAR :spatial_resolution = "12.5 km" ;
#>      NC_CHAR :platform_id = "F18" ;
#>      NC_CHAR :instrument = "SSM/I" ;
#> }

Using ncdump at the command line on a suitable system would yield very similar output to the print above.

ncdump -h /path/to/extdata/ifremer/20171002.nc

With the ncdf4 package it’s a slightly different approach, but gives the same result.

print(ncdf4::nc_open(ice_file))

Notice how the listing above is organized by dimension and then by variable. It’s not particularly obvious that some variables are defined within the same set of dimensions as others.

A NetCDF file is a container for simple array based data structures. There is limited capacity 1 in the formal API for accessing data randomly within a variable, the primary mechanism is to define offset and stride (start and count) hyperslab indexes.

tidync

Tidync aims to ease exploration of the contents of a NetCDF file and provides methods extract arbitrary hyperslabs. These can be used directly in array contexts, or in “long form” database contexts.

On first contact with the file, the available variables are classified by grid and dimension. The “active” grid is the one that queries may be made against, and may be changed with the activate function.

Here we see variables are clearly grouped by the grid they exist in, where grid is a specific (and ordered!) set of dimensions. This allows us to see the set of variables that implicitly co-exist, they have the same shape. The first grid “D0,D1,D2” has two variables, concentration and quality_flag, and the second “D2” has only one variable time. There are no general rules here, a file might have any number of dimensions and variables, and any variable might be defined by one or more dimensions.

In this case the D2 grid has only one variable in its single dimension, and it happens to be a special kind of variable - a “coordinate dimension”, as indicated by the coord_dim flag. In the traditional ncdump summary above it’s easy to see there’s only really one data grid, in ni,nj,time that it holds two variables, and that time is a special coordinate dimension - in contrast neither ni or nj have an explicit 1-dimension variable. When there are many dimensions and/or many variables those patterns are not easy to see.

A particular grid was chosen by default, this is the “D0,D1,D2” grid composed of 3 dimensions, generally the largest grid will be chosen as that is usually the target we would be after. To choose a different grid we may nominate it by name, or by member variable.

By name we choose a grid composed of only one dimension, and the summary print makes a distinction based on which dimensions are active.

It is also possible to choose a grid by variable name, and in tidyverse style here we do not need to quote the name. If we choose a variable in the default grid, then it’s no different to the first case.

tidync(ice_file) %>% activate(time)
#> 
#> Data Source (1): 20171002.nc ...
#> 
#> Grids (2) <dimension family> : <associated variables> 
#> 
#> [1]   D0,D1,D2 : concentration, quality_flag
#> [2]   D2       : time    **ACTIVE GRID** ( 1  value per variable)
#> 
#> Dimensions 3 (1 active): 
#>   
#>   dim   name  length     min     max start count    dmin    dmax unlim 
#>   <chr> <chr>  <dbl>   <dbl>   <dbl> <int> <int>   <dbl>   <dbl> <lgl> 
#> 1 D2    time       1 1032204 1032204     1     1 1032204 1032204 FALSE 
#> # … with 1 more variable: coord_dim <lgl> 
#>   
#> Inactive dimensions:
#>   
#>   dim   name  length   min   max unlim coord_dim 
#>   <chr> <chr>  <dbl> <dbl> <dbl> <lgl> <lgl>     
#> 1 D0    ni       632     1   632 FALSE FALSE     
#> 2 D1    nj       664     1   664 FALSE FALSE

## choose grid by variable name, which happens to be the default grid here
tidync(ice_file) %>% activate(quality_flag)
#> 
#> Data Source (1): 20171002.nc ...
#> 
#> Grids (2) <dimension family> : <associated variables> 
#> 
#> [1]   D0,D1,D2 : concentration, quality_flag    **ACTIVE GRID** ( 419648  values per variable)
#> [2]   D2       : time
#> 
#> Dimensions 3 (all active): 
#>   
#>   dim   name  length     min     max start count    dmin    dmax unlim 
#>   <chr> <chr>  <dbl>   <dbl>   <dbl> <int> <int>   <dbl>   <dbl> <lgl> 
#> 1 D0    ni       632       1     632     1   632       1     632 FALSE 
#> 2 D1    nj       664       1     664     1   664       1     664 FALSE 
#> 3 D2    time       1 1032204 1032204     1     1 1032204 1032204 FALSE 
#> # … with 1 more variable: coord_dim <lgl>

## same as the default
tidync(ice_file) %>% activate("D0,D1,D2")
#> 
#> Data Source (1): 20171002.nc ...
#> 
#> Grids (2) <dimension family> : <associated variables> 
#> 
#> [1]   D0,D1,D2 : concentration, quality_flag    **ACTIVE GRID** ( 419648  values per variable)
#> [2]   D2       : time
#> 
#> Dimensions 3 (all active): 
#>   
#>   dim   name  length     min     max start count    dmin    dmax unlim 
#>   <chr> <chr>  <dbl>   <dbl>   <dbl> <int> <int>   <dbl>   <dbl> <lgl> 
#> 1 D0    ni       632       1     632     1   632       1     632 FALSE 
#> 2 D1    nj       664       1     664     1   664       1     664 FALSE 
#> 3 D2    time       1 1032204 1032204     1     1 1032204 1032204 FALSE 
#> # … with 1 more variable: coord_dim <lgl>

The data variables available on a grid can be expanded out as a single data frame, which all the coordinates copied out - this is not efficient(!) but if we craft our queries sensibly to read only what we need, it’s a very easy way to explore the data in a file.

The ‘hyper_filter’ function allows specification of expressions to subset a variable based on each dimension’s coordinate values.

If no expressions are included we are presented with a table containing a row for each dimension, its extent in coordinates and its length. For convenience we also assign the activate form to an R variable, though we could just chain the entire operation without this.

By specifying inequality expressions we see an implicit subsetting of the array. Everything so far is implicit to delay any file-based computation required to actually interact with the file and read from it.

Notice that these are “name = expr” paired expressions, because the right hand side may be quite general we need the left hand side name to be assured of the name of the dimension referred to.

We can also use the special internal variable ‘index’, which will test against position in the dimension elements ‘1:length’ rather than the values. It’s not different in this case because ni and nj are just position dimensions anyway. The special ‘dplyr’ adverbs like ‘between’ will work.

Data extraction

How to use these idioms to extract actual data?

We can now exercise these variable choice and dimension filters to return actual data, either in by slicing out a “slab” in array-form, or as a data frame.

hf <- concentration %>% 
  hyper_filter(ni = index > 150, 
               nj = dplyr::between(index, 30, 100))

## as an array
arr <- hf %>% hyper_array()
str(arr)
#> List of 2
#>  $ concentration: int [1:482, 1:71] NA NA NA NA NA NA NA NA NA NA ...
#>  $ quality_flag : int [1:482, 1:71] 8 8 8 8 8 8 8 8 8 8 ...
#>  - attr(*, "transforms")=List of 3
#>   ..$ ni  :Classes 'tbl_df', 'tbl' and 'data.frame': 632 obs. of  6 variables:
#>   .. ..$ ni       : int [1:632] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. ..$ index    : int [1:632] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. ..$ id       : int [1:632] 0 0 0 0 0 0 0 0 0 0 ...
#>   .. ..$ name     : chr [1:632] "ni" "ni" "ni" "ni" ...
#>   .. ..$ coord_dim: logi [1:632] FALSE FALSE FALSE FALSE FALSE FALSE ...
#>   .. ..$ selected : logi [1:632] FALSE FALSE FALSE FALSE FALSE FALSE ...
#>   ..$ nj  :Classes 'tbl_df', 'tbl' and 'data.frame': 664 obs. of  6 variables:
#>   .. ..$ nj       : int [1:664] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. ..$ index    : int [1:664] 1 2 3 4 5 6 7 8 9 10 ...
#>   .. ..$ id       : int [1:664] 1 1 1 1 1 1 1 1 1 1 ...
#>   .. ..$ name     : chr [1:664] "nj" "nj" "nj" "nj" ...
#>   .. ..$ coord_dim: logi [1:664] FALSE FALSE FALSE FALSE FALSE FALSE ...
#>   .. ..$ selected : logi [1:664] FALSE FALSE FALSE FALSE FALSE FALSE ...
#>   ..$ time:Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of  6 variables:
#>   .. ..$ time     : num 1032204
#>   .. ..$ index    : int 1
#>   .. ..$ id       : int 2
#>   .. ..$ name     : chr "time"
#>   .. ..$ coord_dim: logi TRUE
#>   .. ..$ selected : logi TRUE
#>  - attr(*, "source")=Classes 'tbl_df', 'tbl' and 'data.frame':   1 obs. of  2 variables:
#>   ..$ access: POSIXct[1:1], format: "2019-11-06 22:30:18"
#>   ..$ source: chr "/rdsi/PRIVATE/raad/dev/__tmp_mike/RtmpDmmu4r/Rinst50021da4435/tidync/extdata/ifremer/20171002.nc"
#>  - attr(*, "class")= chr "tidync_data"

## as a data frame
hf %>% 
  hyper_tibble() %>% 
  dplyr::filter(!is.na(concentration)) %>% dplyr::distinct(concentration, quality_flag)
#> # A tibble: 105 x 2
#>    concentration quality_flag
#>            <int>        <int>
#>  1             0            0
#>  2             0           12
#>  3            42            0
#>  4            20            0
#>  5            21            0
#>  6            55            0
#>  7            17            0
#>  8            64            0
#>  9            25            0
#> 10            23            0
#> # … with 95 more rows

Interrogating data by dimensions

The connection object ‘hf’ is available for determining what is available and how we might cut into it. ‘time’ interestingly is of length 1, so perhaps adds nothing to the information about this otherwise 2D data set. If we were to open this file in ncdf4 or RNetCDF and wanted to take a subset of the file out, we would have to specify and start of 1 and a count of ` even though it’s completely redundant.

The ‘start’ and ‘count’ values reported are directly useable by the traditional API tools, and in particular by the functions ncdf4::ncvar_get() (varid, start, count), its counterpart in RNetCDF::var.get.nc() and command line tools like CDO.

But, it’s a pain to have to know the dimension of the variable and specify every slot. Code in the traditional API that looks like this

Obviously, it should be reasonable to specify only the count of any dimensions that we don’t want the entirety of. This problem manifests exactly in R arrays generally, we can’t provide information only about the dimensions we want, they have to be specified explicitly - even if we mean all of it.

It does not matter what we include in the filter query, it can be all, some or none of the available dimensions, in any order.

The other requirement we have is the ability to automate these task, and so far we have only interacted with the dimensionality information from the print out. For programming, there are functions hyper_vars(), hyper_dims() and hyper_grids() to report on elements in the source. The value of hyper_vars() and hyper_dims() is restricted to the active grid. The function hyper_grids() reports all available grids, with the currently active one indicated. The name of the current grid is also available via active().

Transforms

Under the hood tidync manages the relationship between dimensions and coordinates via transforms tables. This is a 1-dimensional mapping between dimension index and its coordinate value (if there’s no explicit value it is assumed to be equal to the 1-based index). These tables are available via the hyper_transforms() function.

Each table has columns <dimension-name>, index, id, name, coord_dim (whether the dimension has explicit coordinates, in the <dimension-name> column), and selected. The selected column records which of the dimension elements is currently requested by a hyper_filter query, and by default is set to TRUE. Expressions in the filter function work by updating this column.

Future improvements


  1. I.e. it’s not possible to query a file for an arbitrary sparse set of values, without constructing a degenerate hyperslab query for each point or reading a hyperslab containing cells not in the set. Do you know different? Please let me know!