1.3.4. Creating Surface Datasets

1.3.4.1. mksurfdata_esmf purpose

This tool is intended to generate fsurdat files (surface datasets) and landuse files for the CTSM. It can generate global, regional, and single-point fsurdat files, as long as a mesh file is available for the grid.

The subset_data tool allows users to make fsurdat files from existing fsurdat files when a mesh file is unavailable. Generally, users are encouraged to use the subset_data tool for generating regional and single-point fsurdat files.

1.3.4.2. Build Requirements

mksurfdata_esmf is a distributed-memory parallel program that uses MPI (Message Passing Interface). It uses ESMF (Earth System Modeling Framework) for regridding, and PIO (Parallel I/O) and NetCDF for output. As such, the following libraries must be built:

  1. MPI

  2. NetCDF

  3. PIO

  4. ESMF

In addition, the build requires Python, a bash shell, CMake, and GNU Make.

These libraries need to be built such that they can all work together in the same executable. Hence, you may need to build them in the order listed above.

The CTSM submodules cime and ccs_config are required, and the steps below show how to bring them in. A Python environment that includes particular packages is also required. We demonstrate how to use the ctsm_pylib environment that we support in CTSM.

Note, PNETCDF is an optional library that can be used, but is NOT required.
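
As a quick pre-build check, the following sketch reports which of the command-line tools named above are missing from PATH. The command names (`python3`, `cmake`, `make`) are assumptions about how these tools are installed on your system, and the `check_tools` helper is hypothetical, not part of CTSM. It does not check the MPI, NetCDF, PIO, or ESMF libraries.

```shell
# Hypothetical helper (not part of CTSM): print each tool that is not on PATH.
# Returns 0 when every tool is found and 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Assumed command names; adjust them to match your system.
check_tools python3 bash cmake make || echo "Install or load the tools listed above before building."
```

On machines that use environment modules, a tool reported as missing may only need its module loaded.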

Use cime to manage the build requirements

Important

CURRENTLY WORKS ONLY ON DERECHO IN CTSM (not CESM) CHECKOUTS

On cime machines, you can use the build script to build the tool. On other machines, you will first need to port cime to that machine and describe how to build on it, as covered in the cime documentation. You will also need to make some modifications to the build script.

https://github.com/ESMCI/cime/wiki/Porting-Overview

Machines that already run CTSM or CESM have been ported to cime. So if you can run the model on your machine, you will be able to build the tool there.

To get a list of the machines that have been ported to cime:

# Assuming pwd is your CTSM or CESM checkout
cd cime/scripts
./query_config --machines

Note

In addition to having a port to cime, the machine also needs a PIO build that can be referenced through the PIO environment variable, which needs to be set in the porting instructions for the machine. An independent PIO library is available on supported CESM machines.
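
A minimal sketch of checking the PIO environment variable before building is shown here. It assumes that PIO holds the path of a directory containing the PIO installation, which is a guess about the convention rather than something stated above; the `check_pio` helper is hypothetical.

```shell
# Hypothetical helper (not part of CTSM or cime).
# Assumption: the PIO variable holds the path of a PIO installation directory.
check_pio() {
    if [ -z "${PIO:-}" ]; then
        echo "PIO is not set"
        return 1
    fi
    if [ ! -d "$PIO" ]; then
        echo "PIO is set to '$PIO', which is not a directory"
        return 1
    fi
    echo "PIO=$PIO"
}

check_pio || echo "Check the cime porting instructions for this machine."
```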

Important

Currently, we have run and tested mksurfdata_esmf on Derecho. Please see this GitHub issue about mksurfdata_esmf on other CESM machines:

https://github.com/ESCOMP/CTSM/issues/2341

1.3.4.3. The complete process

If you have read the previous section, you are ready to proceed. The $CTSMROOT/tools/README.md goes through the complete process for creating input files needed to run CLM. The $CTSMROOT/tools/mksurfdata_esmf/README.md specifically goes through the complete process of generating surface and landuse datasets. We repeat those files here:

# CTSM Tools for Preprocessing of Input Datasets or Postprocessing of History Output
#### $CTSMROOT/tools/README.md

CTSM tools for analysis of CTSM history files -- or for creation or
modification of CTSM input files.

I.  General directory structure:

    `$CTSMROOT/tools`
        mksurfdata_esmf -- Create surface datasets.

        crop_calendars --- Regrid and process GGCMI sowing and harvest date files for use in CTSM.

        site_and_regional  Scripts for handling input datasets for site and regional cases.
                           These scripts both help with creation of datasets using the
                           standard process as well as subsetting existing datasets and overwriting
                           some aspects for a specific case.

        modify_input_files Scripts to modify CTSM input files. Specifically modifying the surface
                           datasets and mesh files.

        contrib ---------- Miscellaneous tools for pre or post processing of CTSM.
                           Typically these are contributed by anyone who has something
                           they think might be helpful to the community. They may not
                           be as well tested or supported as other tools.

II. Notes on building/running for each of the above tools:

    mksurfdata_esmf has a cime configure and CMake based build using the following files:

        gen_mksurfdata_build ---- Build mksurfdata_esmf
        src/CMakeLists.txt ------ Tells CMake how to build the source code
        Makefile ---------------- GNU makefile to link the program together
        cmake ------------------- CMake macros for finding libraries

    mkmapgrids and site_and_regional only contain scripts that do not need build files.

    Some tools have copies of files from other directories -- see the README.filecopies.md
    file for more information on this.

    Tools may also have files with the directory name followed by namelist to provide sample namelists.

        <directory>.namelist ------ Namelist to create a global file.

    These files are also used by the test scripts to test the tools (see the
    README.testing.md file).

> [!NOTE]
> Be sure to change the path of the datasets referenced by these namelists to
> point to where you have exported your CESM inputdata datasets.

III. Process sequence to create input datasets needed to run CTSM

    1. Create ESMF MESH grid files (if needed)

       a. For standard resolutions these files will already be created. (done)

       b. Run `tools/site_and_regional/subset_data point` to create single-point datasets

       This creates just the fsurdat file as MESH files are NOT needed for single-point cases.

       c. Run `tools/site_and_regional/subset_data region` to create regional datasets subset from a global dataset

       This creates both the fsurdat file and MESH file needed to run.

       d. General custom grid

        You'll need to convert or create MESH grid files on your own (using scripts
        or other tools) for the general case where you have an unstructured grid, or
        a grid that is not regular in latitude and longitude, and that grid is custom
        and not merely subset from one of the global grids.

    2. Create surface datasets with mksurfdata_esmf on Derecho
        (See mksurfdata_esmf/README.md for more help on doing this)

       - gen_mksurfdata_build to build
       - gen_mksurfdata_namelist to build the namelist
       - gen_mksurfdata_jobscript_single to build a batch script to run on Derecho
       - Submit the batch script just created above

       - This step uses the results of step (1) entered into the XML database.
       - If datasets were NOT entered into the XML database, set the resolution
         by entering the mesh file using the options: --model-mesh --model-mesh-nx --model-mesh-ny

       Example: for 0.9x1.25 resolution for 1850

``` shell
       # On Derecho
       cd mksurfdata_esmf
       ./gen_mksurfdata_build
       ./gen_mksurfdata_namelist --res 0.9x1.25 --start-year 1850 --end-year 1850
       ./gen_mksurfdata_jobscript_single --number-of-nodes 2 --tasks-per-node 128 --namelist-file target.namelist
       qsub mksurfdata_jobscript_single.sh
```

    3. Add new files to XML data or using user_nl_clm (optional)

       See notes on doing this in step (1) above.
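
For the case in step 2 where the mesh was NOT entered into the XML database, the sketch below assembles a `gen_mksurfdata_namelist` command line from the three mesh options named there and prints it for review instead of running it. The mesh path and the grid dimensions are hypothetical placeholders, and the `mesh_namelist_cmd` helper is not part of CTSM. Whether other options are also required alongside the mesh options is best confirmed with `./gen_mksurfdata_namelist --help`.

```shell
# Hypothetical helper (not part of CTSM): print, but do not run, a namelist
# command that sets the resolution through a mesh file.
#   $1 = mesh file, $2 = value for --model-mesh-nx, $3 = value for --model-mesh-ny
mesh_namelist_cmd() {
    printf './gen_mksurfdata_namelist --model-mesh %s --model-mesh-nx %s --model-mesh-ny %s --start-year 1850 --end-year 1850\n' "$1" "$2" "$3"
}

# Placeholder path and dimensions; replace them with your own mesh file and grid size.
mesh_namelist_cmd /path/to/my_mesh.nc 288 192
```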

IV.  Notes on which input datasets are needed for CTSM

       global or regional grids
         - need fsurdat
         - need mesh files in env_run.xml ATM_DOMAIN_MESH and LND_DOMAIN_MESH

       single-point grids
         - Just need fsurdat
# Instructions for Using mksurfdata_esmf to Create Surface Datasets
#### $CTSMROOT/tools/mksurfdata_esmf/README.md

## Table of contents
1. Purpose NOW IN THE USER'S GUIDE https://escomp.github.io/CTSM/users_guide/using-clm-tools/creating-surface-datasets.html#mksurfdata-esmf-purpose
2. Build Requirements NOW IN THE USER'S GUIDE https://escomp.github.io/CTSM/users_guide/using-clm-tools/creating-surface-datasets.html#build-requirements
3. [Building the executable](#building-the-executable)
4. [Running a Single Submission](#running-for-a-single-submission)
5. [Running for Multiple Datasets](#running-for-the-generation-of-multiple-datasets)
6. [Notes](#notes)

<!-- ================== -->
### Building the executable
<!-- ================== -->

 Before starting, be sure that you have run

``` shell
# Assuming pwd is the top level of the CTSM/CESM checkout
 ./bin/git-fleximod update
```

This brings in CIME and ccs_config, which are required for building.

``` shell
# Assuming pwd is the tools/mksurfdata_esmf directory
setenv DEBUG TRUE  # only if debugging and your shell is tcsh (in bash use: export DEBUG=TRUE)
 ./gen_mksurfdata_build         # For machines with a cime build
```

 Note: The pio_iotype value gets set and written to a simple .txt file
 by this build script. The value depends on your machine. If not running
 on derecho, casper, or izumi, you may need to update this, though
 a default value does get set for other machines.

<!-- ========================= -->
## Running for a single submission
<!-- ========================= -->

### Setup ctsm_pylib
 Work in the ctsm_pylib environment, which requires the following steps when
 running on Derecho. On other machines the steps will be similar, but getting
 conda in your path and activating the ctsm_pylib environment may differ.

``` shell
# Assuming pwd is the tools/mksurfdata_esmf directory
 module load conda
 cd ../..  # or ../../../.. for a CESM checkout
 ./py_env_create    # Assuming at the top level of the CTSM/CESM checkout
 conda activate ctsm_pylib
```
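
Before running the scripts that follow, it can help to confirm that the environment is active. This sketch relies on the `CONDA_DEFAULT_ENV` variable that conda sets on activation; it assumes the environment was activated by name, since conda may report a full path instead for environments stored outside its default locations. The `in_ctsm_pylib` helper is hypothetical.

```shell
# Hypothetical helper (not part of CTSM): succeed only when conda reports
# ctsm_pylib as the active environment.
# Assumption: the environment was activated by name, not by path.
in_ctsm_pylib() {
    [ "${CONDA_DEFAULT_ENV:-}" = "ctsm_pylib" ]
}

if in_ctsm_pylib; then
    echo "ctsm_pylib is active"
else
    echo "ctsm_pylib is not active; run: conda activate ctsm_pylib"
fi
```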

To generate your target namelist:

``` shell
# Assuming pwd is the tools/mksurfdata_esmf directory
 ./gen_mksurfdata_namelist --help
```

For example, try --res 1.9x2.5 --start-year 1850 --end-year 1850:

``` shell
# Assuming pwd is the tools/mksurfdata_esmf directory
 ./gen_mksurfdata_namelist --res <resolution> --start-year <year1> --end-year <year2>
```

> [!TIP]
> **IF FILES ARE MISSING FROM** /inputdata, a target namelist will be generated,
> but with a generic name and a warning to run `./download_input_data` next.
> **IF A SMALLER SET OF FILES IS STILL MISSING AFTER RUNNING** `./download_input_data`
> and rerunning `./gen_mksurfdata_namelist`, then repeat: rerun
> `./gen_mksurfdata_namelist` with the options you need, then rerun
> `./download_input_data`, until `./gen_mksurfdata_namelist` finds all files.

 For example, to generate your target jobscript (again, use --help for instructions):

``` shell
# Assuming pwd is the tools/mksurfdata_esmf directory
 ./gen_mksurfdata_jobscript_single --number-of-nodes 2 --tasks-per-node 128 --namelist-file target.namelist
 qsub mksurfdata_jobscript_single.sh
```

 Read the note about regional grids at the end.

<!-- ========================================= -->
## Running for the generation of multiple datasets
<!-- ========================================= -->
 Work in the ctsm_pylib environment, as explained in the earlier section.
 gen_mksurfdata_jobscript_multi runs `./gen_mksurfdata_namelist` for you.

``` shell
# Assuming pwd is the tools/mksurfdata_esmf directory
 ./gen_mksurfdata_jobscript_multi --number-of-nodes 2 --scenario global-present
 qsub mksurfdata_jobscript_multi.sh
```

 If you are looking to generate all (or a large number of) the datasets or the
 single-point (1x1) datasets, you are best off using the Makefile. For example:

``` shell
# Assuming pwd is the tools/mksurfdata_esmf directory
 make all  # ...or
 make all-subset
```

 As of 2024/9/12 one needs to generate NEON and PLUMBER2 fsurdat files by
 running ./neon_surf_wrapper and ./plumber2_surf_wrapper manually in the
 /tools/site_and_regional directory.

<!-- = -->
## NOTES
<!-- = -->

# Guidelines for input datasets to mksurfdata_esmf

> [!TIP]
> ALL raw datasets \*.nc **FILES MUST NOT BE NetCDF4**.

Example to convert to CDF5:

``` shell
nccopy -k cdf5 oldfile newfile
```
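
To avoid converting files that are already acceptable, one option is to inspect the format first with `ncdump -k`, which prints the kind of a NetCDF file. The sketch below treats any kind string beginning with `netCDF-4` as NetCDF4, which also covers the "netCDF-4 classic model" variant; that interpretation of the rule above is an assumption. Both helper functions and the `_cdf5.nc` output naming are hypothetical.

```shell
# Hypothetical helper: succeed when a kind string printed by `ncdump -k`
# denotes a NetCDF4 variant ("netCDF-4" or "netCDF-4 classic model").
is_netcdf4_kind() {
    case "$1" in
        netCDF-4*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Hypothetical helper: convert one file to CDF5 only when it is NetCDF4.
# The output name (<input>_cdf5.nc) is an assumption made for illustration.
# Requires the ncdump and nccopy utilities from the NetCDF distribution.
convert_if_netcdf4() {
    kind=$(ncdump -k "$1") || return 1
    if is_netcdf4_kind "$kind"; then
        nccopy -k cdf5 "$1" "${1%.nc}_cdf5.nc" && echo "converted: $1"
    else
        echo "no conversion needed ($kind): $1"
    fi
}
```

A typical use would be calling `convert_if_netcdf4 rawfile.nc` for each raw dataset before generating the namelist.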

> [!TIP]
> The LAI raw dataset \*.nc **FILE MUST HAVE** an "unlimited" time dimension

Example to change time to an unlimited dimension using the NCO operator ncks:

``` shell
ncks --mk_rec_dmn time file_with_time_equals_12.nc -o file_with_time_unlimited.nc
```
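
Whether a file already has an unlimited time dimension can be read from its header, where `ncdump -h` marks such a dimension as `UNLIMITED`. The sketch below scans header text on standard input for that marker; it assumes the dimension is named `time`, as in the ncks command above, and the `has_unlimited_time` helper is hypothetical. The demonstration feeds it a sample header line rather than a real file.

```shell
# Hypothetical helper: read `ncdump -h` output on stdin and succeed when the
# dimension named "time" is declared UNLIMITED.
has_unlimited_time() {
    grep -Eq '^[[:space:]]*time = UNLIMITED'
}

# Demonstration on a sample header line of the kind ncdump -h prints.
if printf '\ttime = UNLIMITED ; // (12 currently)\n' | has_unlimited_time; then
    echo "time is unlimited"
else
    echo "time is fixed; see the ncks command above"
fi
```

With the NetCDF utilities installed, the same check could be applied to a real file with `ncdump -h file.nc | has_unlimited_time`.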

### IMPORTANT: THERE HAVE BEEN PROBLEMS with REGIONAL grids!!

> [!CAUTION]
> See
>
> https://github.com/ESCOMP/CTSM/issues/2430

In general we recommend using subset_data and/or fsurdat_modifier
for regional grids.