Reduce the memory usage that is important for ne1024 simulation#665
Reduce the memory usage that is important for ne1024 simulation#665sjsprecious wants to merge 5 commits into
Conversation
|
@mvertens - If you have a chance, I'd like to hear any thoughts you have on this change. (I have just skimmed it so far - haven't had a chance to look closely yet.) |
billsacks
left a comment
There was a problem hiding this comment.
The overall changes here look good to me. I appreciate your plugging in the use of lnd2rof_map, which seems like it previously wasn't hooked up. And I think I see why you needed to introduce a separate lnd2rof_fmap (though see my comments asking for this to be made more explicit in some documentation).
I do have a few requests... many about editing some comments, but a couple slightly more substantial... but still overall minor - for the most part this looks good to me.
Once you make the final changes, I'd like to see that at least one or two tests have been run with baseline comparisons to verify that these changes work and are bit-for-bit in out-of-the-box configurations. One test that I think would cover all of these changes would be a B compset test with the aoflux grid set to ogrid (the setting of ogrid is needed to cover the changes in med_phases_aofluxes_mod, and should still cover the other changes here). I'd like to see that run with comparisons against a baseline.
|
@sjsprecious and @billsacks I was also thinking about a test to show that the offline and inline are identical but in the past I've run into roundoff errors in the way the model calculates the mesh compared to offline. If that's the case here you may want to implement some type of tolerance. Maybe a nonissue if they are identical in this case. Other than that I think with the current set of updates the code looks good to me. |
|
Thanks @jtruesdal for your comments. I think one thing that can contribute to the difference here is that users may use different compiler/ESMF versions when generating the offline mesh. Once Bill provides the instructions about running the tests he suggested, I will let you know at least whether the code changes affect the current behavior. |
billsacks
left a comment
There was a problem hiding this comment.
The latest set of changes looks good - thank you @sjsprecious for making those changes.
A piece I am less confident about is whether the assignment of aoflux_mesh to lmesh (as opposed to doing an ESMF_MeshCreate there) is always safe. I am asking the ESMF group about that, along with the similar change you made in CDEPS.
Regarding testing, a good test would be SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid, since this covers the change to med_phases_aofluxes_mod in addition to your other changes. This is in the prebeta test list. It hasn't been run on recent versions of the code, so you'll need to generate baselines and then run with your latest changes with comparisons against baselines.
It would also be good to run aux_cime_baselines, or at least keep a careful eye on aux_cime_baselines when they are run in their nightly run following the merge of this PR. As with the associated CDEPS PR, @fischer-ncar could provide some input here on testing.
Regarding @jtruesdal 's point: I wasn't thinking of something as rigorous as testing behavior with online vs. offline generation of mappings - I'm mainly wanting to confirm that the changes here don't break anything or change answers for the case where we're still using online mapping.
|
I talked with @mvertens today. The good news is that she feels that this is NOT a problem:
However, @mvertens raised a question / concern about having two different mapping files: LND2ROF_FRAC_FMAPNAME and LND2ROF_FMAPNAME. I think I see why this was done, but @mvertens questioned whether this was necessary. Our initial question on this is: What is the difference in how these mapping files are generated? My sense is that LND2ROF_FMAPNAME should be generated with This discussion led me to find these related issues:
The summary of those issues is: Even though we currently have different mapping types for different lnd2rof mappings - and that's what you're supporting in the changes in this PR - this is considered to be a bug rather than a feature, and in fact we want to be using the same mapping type for all the mappings. So now I'm not sure what the right path forward is. I can see a few paths forward, which I lay out below... this probably needs some more discussion to figure out how we want to proceed. (1) Keep the current changes and merge this as is Pros:
Cons:
(2) Make an initial change where we change all of the lnd2rof mapping to be consd instead of consf, then change this PR to use a single LND2ROF mapping file and merge it Pros:
Cons:
(3) Don't change online mapping behavior for now, but use the same mapping file for the lnd2rof frac mapping as for the other lnd2rof mappings... this relies on the assumption that the differences between consf and consd mapping are inconsequential (which #408 suggests, but I'm not sure it applies in all cases). Bring this to main in this way. Pros:
Cons:
(4) Defer bringing in these changes until post CESM3 Pros:
Cons:
(5) Hybrid between (1) and (4): bring in the changes on this branch that do NOT rely on the new mapping file (LND2ROF_FRAC_FMAPNAME); leave those additional changes on a branch for now. |
|
Thanks @billsacks .
Yes,
Most fields between these two mapping files using different
I appreciate that you list all the available options here and provide detailed explanations. My two cents is that if my changes pass the regression tests that you suggested, it means that they do not change the current CMEPS behavior and should be safe to bring into CESM. I understand that providing two similar mapping files may be confusing, but if those XML variables are unset, the online mapping is used and the user may not even notice the existence of those XML variables. Regarding the option to use the same |
This PR reduces the memory usage that is critical for ultra-high resolution simulation such as ne1024. All the code changes are done by Claude under my supervisory.
The goal is to let the
lnd→rofconservative coupling maps be read from offline weight files instead of being computed online, which OOM-kills the job atne1024duringDataInitialize (ESMF_FieldRegridStore → GeomRendezvous → Zoltan_RCB). This is the implementation of thelnd2rof_consfOOM fix. There are two distinctlnd→rofmaps handled, plus anaofluxmemory optimization.In main, the
lnd2rof_mapattribute (driven by the pre-existingLND2ROF_FMAPNAMEXML var) was already read into thelnd2rof_mapvariable — but the fiveaddmap_fromcalls for the runoff fields (Flrl_rofsur, Flrl_rofi, Flrl_rofgwl, Flrl_rofsub, Flrl_irrig) hardcodedunset, so the file was silently ignored and the map was always built online.This is a separate map (
destarea, notfracarea) used duringmed_fraction_init, with no pre-existing namelist hook. The fix adds full new plumbing:LND2ROF_FRAC_FMAPNAME(default unset, env_run.xml, group run_domain) in config_component.xml.lnd2rof_fmap(modify_via_xml="LND2ROF_FRAC_FMAPNAME") innamelist_definition_drv.xml.lnd2rof_fmapattribute viaNUOPC_CompAttributeGet; if it is present, set, and the file exists on disk (inquire), it callsmed_map_routehandles_initwithmapfile=to read offline weights; otherwise it falls back to the original online path — no behavior change when unset. Adds ause NUOPCimport and local vars (isPresent/isSet/lexist, lnd2rof_fmap).optional, intent(in) :: mapfileargument throughmed_map_routehandles_initfrom_fieldbundle, forwarding it to the field-level routine (which already accepted mapfile) when present. Backward-compatible.Replaces
is_local%wrap%aoflux_mesh = ESMF_MeshCreate(lmesh, rc=rc)withis_local%wrap%aoflux_mesh = lmesh— reuses the ocean field's existing mesh handle instead of allocating a duplicate full ESMF mesh per rank at ne1024pg2. Safe because lmesh persists inFBArea(compocn)andaoflux_meshis never destroyed.