Debugging techniques
Start with the CAM wiki.
CAM-SIMA-specific debugging
Build errors
Debugging tips if you get build errors:
- If the output indicates that the error message or failure is coming from somewhere within $CAM-SIMA/ccpp_framework:
  - If you're getting a clear error message, it's likely that something is wrong with your metadata
  - If the error message indicates that something is breaking in the framework code itself (something went uncaught), consult the AMP SEs
- If the error happens during the atm build, you can see the full output of the atm build in the build log here:
bld/atm.bldlog.*
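Build logs can be long, so it may help to jump straight to the lines that look like errors. The following is a minimal sketch, not part of CAM-SIMA: the function name show_atm_build_errors is hypothetical, and it assumes that the compiler puts the word "error" in its diagnostics, which may not hold for every compiler. It reads the newest atm.bldlog.* file in the given directory and also copes with a .gz suffix in case the log has been compressed.

```shell
# Print likely error lines (with context) from the newest atm build log.
# Assumptions: logs are named <bld_dir>/atm.bldlog.*, and compiler
# diagnostics contain the word "error" (matched case-insensitively).
show_atm_build_errors() {
    local bld_dir="${1:-bld}"
    local newest
    # ls -t sorts by modification time, newest first.
    newest=$(ls -t "$bld_dir"/atm.bldlog.* 2>/dev/null | head -n 1)
    if [ -z "$newest" ]; then
        echo "no atm.bldlog.* found in $bld_dir" >&2
        return 1
    fi
    echo "== $newest =="
    # Decompress on the fly if the log was gzipped.
    case "$newest" in
        *.gz) gzip -dc "$newest" ;;
        *)    cat "$newest" ;;
    esac | grep -n -i -B 2 -A 5 'error'
}
```

Called from the case directory as show_atm_build_errors bld, it prints each match with two lines of context before and five after. A non-zero return status means either that no log was found or that no line matched.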
Run-time errors
- Start with the atm.log* - if the issue occurred during the execution of the CAM code, it will hopefully have a clear and concise error message
- Move to the cesm.log* - it will hopefully include a stack trace for the error in question; if the error did not occur in the CAM code (or CAM did not properly trap the error), it will help you identify the source of the issue.
- If neither log file contains helpful information, a few first steps:
  - Resubmit the case; it could be a machine hiccup
  - Turn on DEBUG mode (if it's not on already) and rebuild/rerun
  - Look in your run directory for any log files called PETXXX; if there was an issue on the ESMF side of things, it will show up in one of these (there will be one PET file per processor)
  - Try a different compiler; it may give you a more helpful error message
  - Set NTASKS=1 (./xmlchange NTASKS=1), do a clean rebuild (as instructed), and run again; running in serial may identify the error
  - Look for the ***************** HISTORY FIELD LIST ****************** banner in the atm.log* file; if it's not there, the error occurred at init time
    - If the error occurred during init time, try a new case with a different grid and/or dycore
  - If the model ran for a few timesteps before dying (look for the CAM-SIMA time step advanced message in the atm.log* file), it's likely that one or more variables that you introduced or modified have gone off the rails (the value has become very large, very small, or zero)
    - Update your user_nl_cam to output all suspect variables to a history file shortly before the model dies, then inspect the output to see if any are obviously wrong
  - If the model completed all timesteps, try running a shorter case to see if the problem persists; if so, it's an error during model finalization
- Run the TotalView debugger on izumi
- Use the old standard - print statements - to narrow down where the code is stopping
- Ask for help!
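The two atm.log markers mentioned in the list above can be checked mechanically. The sketch below is an illustration under stated assumptions rather than an official tool: the function name classify_run_failure and its output labels are hypothetical, and it assumes the log file is uncompressed and contains the marker strings exactly as quoted above.

```shell
# Classify how far a run got, using the atm.log markers from the text.
# Prints "init" if the history field list banner is missing, otherwise
# "run:N" where N is the number of completed time step messages.
classify_run_failure() {
    local log="$1"
    local n
    if ! grep -q 'HISTORY FIELD LIST' "$log"; then
        echo "init"
        return 0
    fi
    # grep -c prints 0 (and returns non-zero) when nothing matches.
    n=$(grep -c 'CAM-SIMA time step advanced' "$log") || true
    echo "run:$n"
}
```

For example, classify_run_failure atm.log.<LID> printing init points at the init-time advice above, while a result such as run:12 gives a rough idea of how many steps completed, which helps in choosing when to write the suspect variables to a history file.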
Unexpected answer changes
- Two paths here:
  - You're getting unexpected DIFFs from the regression testing
    - Consult with a scientist about whether differences are expected and for which configurations (compsets, resolutions, namelist parameters, etc.)
    - If the differences are very small (look like round-off), consult with the other AMP SEs on whether we're ok with this
    - If the differences are indeed unexpected and larger than round-off, create a case using the code from the head of development and:
      - place print statements in both code bases (your development branch and the head of development) to identify where the numbers are going awry OR
      - run the TotalView debugger OR
      - use the comparison tool described below ($CAM-SIMA/tools/find_max_nonzero_index.F90)
  - You're getting unexpected answer changes compared with CAM
    - Consult with other AMP SEs about whether the differences appear to be due to round-off error
    - Use the comparison tool (LINK ONCE IT EXISTS): $CAM-SIMA/tools/find_max_nonzero_index.F90
      - This tool can help you narrow down where the issue begins by printing out values at a specific index and comparing those with the "truth" (from CAM)
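For the print-statement approach, one way to find where two runs start to diverge is to collect the printed values from each run into a text file and report the first line that differs. The sketch below is only an illustration and is unrelated to find_max_nonzero_index.F90: the function name first_diff_line is hypothetical, and it assumes both runs print the same quantities in the same order, one per line. The comparison is an exact string match, so round-off-level differences are reported too.

```shell
# Report the first line at which two text files differ.
# $1: printed values from your development branch
# $2: printed values from the head of development
# Prints "identical" and returns 0 if no difference is found;
# otherwise prints the line number and both lines, and returns 1.
first_diff_line() {
    awk -v other="$2" '
        {
            # Read the matching line from the second file.
            have = ((getline theirs < other) > 0)
            if (!have || $0 != theirs) {
                printf "line %d\n  dev:  %s\n  head: %s\n", FNR, $0, (have ? theirs : "<end of file>")
                found = 1
                exit
            }
        }
        END {
            # The second file may still have lines left over.
            if (!found && (getline theirs < other) > 0) {
                printf "line %d\n  dev:  <end of file>\n  head: %s\n", NR + 1, theirs
                found = 1
            }
            if (!found) print "identical"
            exit (found ? 1 : 0)
        }' "$1"
}
```

The two input files could be produced with, for instance, grep on a tag that you include in each print statement; the tag and file names are up to you.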
TotalView
- Grab an interactive node. You can do this by copying the following commands into a .csh script:
#! /bin/csh -f
#PBS -q long
# Number of nodes (CHANGE THIS if needed)
#PBS -l walltime=6:00:00,nodes=1:ppn=16
# output file base name
#PBS -N test_dr
# Put standard error and standard out in same file
#PBS -j oe
# Export all Environment variables
#PBS -V
then run:
qsub -X -I <script>.csh
- Create and configure a new case (using gnu and only 1 task)
./create_newcase --pecount 1 --case <CASEDIR> --compset <COMPSET> --res <RESOLUTION> --compiler gnu --run-unsupported
- Turn on debug in the case
./xmlchange DEBUG=True
- Build the case (./case.build)
- Run the command bash to switch to bash (if you are not already using it)
- Run the following commands:
np=1
nthreads=1
source .env_mach_specific.sh
RUNDIR=`./xmlquery RUNDIR --value`
EXEROOT=`./xmlquery EXEROOT --value`
LID=`date '+%y%m%d-%H%M%S'`
cd $RUNDIR
mkdir timing
mkdir timing/checkpoints
echo `pwd`
export OMP_NUM_THREADS=$nthreads
totalview ${EXEROOT}/cesm.exe
- Run exit to exit the TotalView window and give up the node