Gromacs engines#
This notebook showcases the use of the python classes used to steer gromacs from python. It will only work if the gromacs executables are available (e.g. in your $PATH variable).
There are two main classes you will use together:
asyncmd.gromacs.MDP, a python class which parses a gromacs molecular dynamics parameter file (.mdp) and makes its content available via a dictionary-like interfacethe
asyncmd.gromacs.GmxEngineor theasyncmd.gromacs.SlurmGmxEngine, both share a common interface and areasync/awaitenabled python wrappers to run gromacs locally or via the SLURM workload manager, respectively
Imports and some basic checks that everything is available#
%%bash
# if using the module system to make gromacs and friends available:
# check that they are loaded!
module list
Currently Loaded Modulefiles:
1) anaconda/3/2021.11 3) git-lfs/3.4 5) impi/2021.9 <aL>
2) git/2.43 4) gcc/12 6) gromacs/2023.5
Key:
<module-tag> <aL>=auto-loaded
%%bash
# unix only, check that gmx is available
which gmx
/mpcdf/soft/SLE_15/packages/skylake/gromacs/gcc_12-12.1.0-cuda_12.2-12.2.2-impi_2021.9-2021.9.0/2023.5/bin/gmx
%matplotlib inline
import os
import asyncio
import asyncmd
import asyncmd.gromacs as asyncgmx
Optional: setup logging to be more verbose#
loglevel = "WARN"
loglevel = "INFO" # comment this line if you want more logging
%config Application.log_level=loglevel
import logging
l = logging.getLogger("asyncmd")
l.setLevel(loglevel)
Setup working directory and the number of gromacs simulations to run in parallel#
n_engines = 4
scratch_dir = "slurm_gmx_engine_wdirs"
wdirs = [os.path.join(scratch_dir, f"engine_{i}") for i in range(n_engines)]
for d in wdirs:
if not os.path.isdir(d):
os.makedirs(d)
asyncmd.gromacs.MDP#
The MDP is a dictionary-like interface to a parsed gromacs molecular dynamics parameter file .mdp file to enable easy inspection and modification from python code. Most of the values are automatically cast to their respective types, e.g. nsteps will always be an int and ref-t will always be a list of float. The default for unknown parameters is a list of str to allow for the highest flexibility possible.
The class supports writing of its (possibly changed) content to a new .mdp file by using its .write() method and also knows if its content has been changed since parsing the original .mdp file. It even supports the (undocumented) keyformat CHARMM-GUI uses in which all - are replaced by _.
# one MDP object per engine, in principal we could use the same object but this way is more customizable,
# e.g. we could want to modify our setup have the engines run at a different temperatures
mdps = [asyncgmx.MDP("../../resources/gromacs/capped_alanine_dipeptide/md.mdp") for _ in range(n_engines)]
# lets have a look at what is inside
print("MDP has been changed since parsing: ", mdps[0].changed)
print("Parsed content:")
print("---------------")
for key, val in mdps[0].items():
print(key, " : ", val)
MDP has been changed since parsing: False
Parsed content:
---------------
integrator : md
dt : 0.002
nsteps : -1
nstxout : 20
nstvout : 20
nstlog : 20
nstxout-compressed : 0
nstlist : 50
ns-type : grid
cutoff-scheme : Verlet
rlist : 1.1
coulombtype : PME
rcoulomb : 1.1
rvdw : 1.1
Tcoupl : v-rescale
tc-grps : ['Protein', 'SOL']
tau-t : [0.5, 0.5]
ref-t : [300.0, 300.0]
Pcoupl : C-rescale
tau-p : 1.0
compressibility : [4.5e-05]
ref-p : [1.0]
gen-vel : no
constraints : h-bonds
NOTE: “Pcoupl = C-rescale” needs gromacs version >= 2021#
I.e. if you are running a recent version of gromacs (>= 2021) you should comment the line below in which we set “Pcoupl = Berendsen”.
# lets set the xtc output frequency to 0 in all MDPs, we will use the trr anyways
# we will also increase the trr output frequency by a bit and add the `continuation` parameter
nstout = 200
for i, mdp in enumerate(mdps):
if i == 0:
# Use something different for the first mdp
# Note that the nstout options all have the correct datatype (i.e. integer)
# so we can do inplace multiplication on it
mdp["nstvout"] *= 50
mdp["nstxout"] *= 50
mdp["nstlog"] *= 50
else:
mdp['nstvout'] = nstout
mdp["nstxout"] = nstout
mdp["nstlog"] = nstout
mdp["nstenergy"] = nstout
mdp["nstxout-compressed"] = 0
mdp["continuation"] = "yes" # dont apply constraints to the initial configuration
#mdp["Pcoupl"] = "Berendsen"
# have a look again
print("MDP has been changed since parsing: ", mdps[0].changed)
print("Parsed content:")
print("---------------")
for key, val in mdps[0].items():
print(key, " : ", val)
MDP has been changed since parsing: True
Parsed content:
---------------
integrator : md
dt : 0.002
nsteps : -1
nstxout : 1000
nstvout : 1000
nstlog : 1000
nstxout-compressed : 0
nstlist : 50
ns-type : grid
cutoff-scheme : Verlet
rlist : 1.1
coulombtype : PME
rcoulomb : 1.1
rvdw : 1.1
Tcoupl : v-rescale
tc-grps : ['Protein', 'SOL']
tau-t : [0.5, 0.5]
ref-t : [300.0, 300.0]
Pcoupl : C-rescale
tau-p : 1.0
compressibility : [4.5e-05]
ref-p : [1.0]
gen-vel : no
constraints : h-bonds
nstenergy : 200
continuation : yes
asyncmd.gromacs.GmxEngine (and asyncmd.gromacs.SlurmGmxEngine)#
Both provide the functionality of the gromacs grompp and mdrun executables in one class, i.e. given molecular dynamics parameters and possibly an initial configuration they will setup and steer a MD run. Their interfaces differ only in the additional sbatch_script that the slurm engine requires at initialization time, they can otherwise be used interchangeably. Both engines need the gromacs executables to be available, specifically gmx grompp and gmx mdrun (gmx_mpi mdrun for the SlurmGmxEngine). The SlurmGmxEngine naturally also must have access to the slurm executables, specifically sbatch, sacct and scancel. However all of these can be set either at initialization time via keyword arguments or globally as attributes to the uninitialized class.
Each engine has a prepare() method (which will call grompp) and multiple methods to then run the simulation, namely run(), run_walltime() and run_nsteps(). The additional prepare_from_files() method can be used to continue a previous MD run from given deffnm and workdir (assuming all files/parts are there), note that it will (currently) not call grompp again and therefore assumes that the portable run input file (.tpr) allows for the continuation (i.e. has no or a sufficiently large integration step limit).
Since we will be using the SlurmGmxEngine here, we need an additional slurm submission script (sbatch_script). A minimal example is included with the examples and printed below. The string “{mdrun_cmd}” will be replaced by asyncmd with the specific gromacs command to generate a requested trajectory.
Note that you most likely want to adapt at least the partition to the cluster you are running on. When using asyncmd for different molecular systems, naturally different CPU/GPU-resources are needed to run most efficient. As always: benchmark your system and then choose the most efficient settings for your cluster-setup and molecular-system combination.
# this is just to have a look at the file content of the slurm submission script
from pygments import highlight
from pygments.lexers import BashLexer
from pygments.formatters import HtmlFormatter
import IPython
with open('../../resources/gromacs/capped_alanine_dipeptide/mdrun.slurm') as f:
code = f.read()
formatter = HtmlFormatter()
IPython.display.HTML('<style type="text/css">{}</style>{}'.format(
formatter.get_style_defs('.highlight'),
highlight(code, BashLexer(), formatter)))
#!/bin/bash -l
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --mem=4500
### Things you might want to set to run resource-efficient (non-exhaustive)
##SBATCH --partition=
##SBATCH --time=
##SBATCH --nodes=
# Note: make sure that you activate the correct environment, preferably the same you run asyncmd from
source ~/asyncmd_workshop_test/source_modules_phys.sh
srun {mdrun_cmd}
# Let us create a list of identical engines to showcase the power of concurrent execution :)
engines = [asyncgmx.SlurmGmxEngine(mdconfig=mdp,
gro_file="../../resources/gromacs/capped_alanine_dipeptide/conf.gro", # required
top_file="../../resources/gromacs/capped_alanine_dipeptide/topol_amber99sbildn.top", # required
# NOTE this is the only additional thing needed for using the SlurmGmxEnigne w.r.t. GmxEngine
sbatch_script="../../resources/gromacs/capped_alanine_dipeptide/mdrun.slurm", # required!
# optional (can be omitted or None), however naturally without an index file
# you can not reference custom groups in the .mdp-file or MDP object
ndx_file="../../resources/gromacs/capped_alanine_dipeptide/index.ndx",
output_traj_type="trr", # optional, defaults to xtc so we need to specify it to get the trr traj we set above
# NOTE: you can use mdrun_extra_args to pass additional arguments to gmx mdrun,
# e.g. to set the number of threads and/or shift additional calculations to the gpu
#mdrun_extra_args=,
)
for mdp in mdps]
After setting the molecular dynamics parameters we can prepare a gromacs MD run.#
The gromacs engines prepare() method will call grompp, as with grompp you can use a specific starting configuration (the grompp -t option) or start from the structure file (.gro) the engine got at initialization.
Lets prepare the first engine without a starting structure:#
e0 = engines[0] # get it out of the list so tab-help/completion works
# the prepare method is an async def function (a coroutine) and must be awaited
await e0.prepare(starting_configuration=None, workdir=wdirs[0], deffnm="test")
Lets prepare all other engines at once with the same initial configuration#
We can use asyncio.gather to run all coroutines concurrently, for prepare this does not make a big difference (since it is fast), but the same mechanism enables us to run all 4 gromacs engines in parallel later.
# create an asyncmd.Trajectory of the initial configuration
init_conf = asyncmd.Trajectory(trajectory_files="../../resources/gromacs/capped_alanine_dipeptide/conf_in_alphaR.trr",
structure_file="../../resources/gromacs/capped_alanine_dipeptide/conf.gro",
)
# and prepare the engines (the return value of prepare is None)
await asyncio.gather(*(e.prepare(starting_configuration=init_conf, workdir=wdir, deffnm="test")
for e, wdir in zip(engines[1:], wdirs[1:])
)
)
[None, None, None]
Run the engines for a number of steps each.#
We will first run the last engine in the list alone and then all 4 concurrently for the same number of steps to show off the power of the concurrent execution of the gromacs subprocesses.
import time # import time to be able to show off ;)
nsteps = 100000
# run one engine and time it
start = time.time()
# the engine will return an asyncmd.Trajectory with the produced trajectory (part)
traj = await engines[-1].run_steps(nsteps=nsteps, steps_per_part=True)
end = time.time()
print(f"Running one engine for {nsteps} integration steps took {round(end - start, 4)} seconds.")
print(f"The produced trajectory ({traj}) has a length of {len(traj)} frames.")
print(f"This length is the number of steps divided by the engines output frequency (={engines[-1].nstout}).")
print("Note, that we are off by plus one because the initial configuration is in the trajectory for gromacs.")
print("Note also that this is only true when explicitly passing nsteps to the `run` methods, ")
print("unfortunately the real relation between frames and steps done is a bit more involved...")
print("See the docstring for `GmxEngine.steps_done` if you are brave and want to know more ;)")
Running one engine for 100000 integration steps took 105.54 seconds.
The produced trajectory (Trajectory(trajectory_files=slurm_gmx_engine_wdirs/engine_3/test.part0001.trr, structure_file=slurm_gmx_engine_wdirs/engine_3/test.tpr)) has a length of 501 frames.
This length is the number of steps divided by the engines output frequency (=200).
Note, that we are off by plus one because the initial configuration is in the trajectory for gromacs.
Note also that this is only true when explicitly passing nsteps to the `run` methods,
unfortunately the real relation between frames and steps done is a bit more involved...
See the docstring for `GmxEngine.steps_done` if you are brave and want to know more ;)
# run all engines at once and time it
start = time.time()
# Now each engine will return asyncmd.Trajectory with the produced trajectory (part)
# i.e. trajs will be a list of trajectories (in the same order as the engines in the list)
trajs = await asyncio.gather(*(e.run_steps(nsteps=nsteps, steps_per_part=True) for e in engines))
end = time.time()
print(f"Running all engines for {nsteps} integration steps took {round(end - start, 4)} seconds.")
print(f"But now we have a list of {len(trajs)} trajectories with {nsteps} steps each...")
for t in trajs:
print(t, f"with length: {len(t)}")
Running all engines for 100000 integration steps took 391.1852 seconds.
But now we have a list of 4 trajectories with 100000 steps each...
Trajectory(trajectory_files=slurm_gmx_engine_wdirs/engine_0/test.part0001.trr, structure_file=slurm_gmx_engine_wdirs/engine_0/test.tpr) with length: 101
Trajectory(trajectory_files=slurm_gmx_engine_wdirs/engine_1/test.part0001.trr, structure_file=slurm_gmx_engine_wdirs/engine_1/test.tpr) with length: 501
Trajectory(trajectory_files=slurm_gmx_engine_wdirs/engine_2/test.part0001.trr, structure_file=slurm_gmx_engine_wdirs/engine_2/test.tpr) with length: 501
Trajectory(trajectory_files=slurm_gmx_engine_wdirs/engine_3/test.part0002.trr, structure_file=slurm_gmx_engine_wdirs/engine_3/test.tpr) with length: 501
Note how the first engine has produced a different number of frames in the trajectory?
The reason is the different output frequency (nstout), so the same number of integration steps will result in a different number of frames.
Use prepare_from_files to initialize new engines and pick up where we left off with the ‘old’ ones.#
# create the engines
new_engines = [asyncgmx.SlurmGmxEngine(mdconfig=mdp,
gro_file="../../resources/gromacs/capped_alanine_dipeptide/conf.gro",
top_file="../../resources/gromacs/capped_alanine_dipeptide/topol_amber99sbildn.top",
sbatch_script="../../resources/gromacs/capped_alanine_dipeptide/mdrun.slurm", # required!
ndx_file="../../resources/gromacs/capped_alanine_dipeptide/index.ndx",
output_traj_type="trr", # optional, defaults to xtc
)
for mdp in mdps]
e0 = new_engines[0] # get one out for the autocomplete
# and initialize with prepare_from_files
await e0.prepare_from_files(workdir=wdirs[0], deffnm="test")
# and the others concurrent in one go
await asyncio.gather(*(e.prepare_from_files(workdir=wdir, deffnm="test")
for e, wdir in zip(new_engines[1:], wdirs[1:])
)
)
[None, None, None]
Now we can do another round of MD in all engines in parallel#
Note that the partnums indicate that we picked up exactly where we left of. We could additionally check using the trajectories .last_step and .first_step properties, compare and observe that the last step in the previous MD runs will be the first step in these here.
# run all engines at once and time it
start = time.time()
trajs = await asyncio.gather(*(e.run_steps(nsteps=nsteps, steps_per_part=True) for e in new_engines))
end = time.time()
print(f"Running all engines for {nsteps} integration steps took {round(end - start, 4)} seconds.")
print(f"But now we have a list of {len(trajs)} trajectories with {nsteps} steps each...")
for t in trajs:
print(t, f"with length: {len(t)}")
Running all engines for 100000 integration steps took 421.2215 seconds.
But now we have a list of 4 trajectories with 100000 steps each...
Trajectory(trajectory_files=slurm_gmx_engine_wdirs/engine_0/test.part0002.trr, structure_file=slurm_gmx_engine_wdirs/engine_0/test.tpr) with length: 101
Trajectory(trajectory_files=slurm_gmx_engine_wdirs/engine_1/test.part0002.trr, structure_file=slurm_gmx_engine_wdirs/engine_1/test.tpr) with length: 501
Trajectory(trajectory_files=slurm_gmx_engine_wdirs/engine_2/test.part0002.trr, structure_file=slurm_gmx_engine_wdirs/engine_2/test.tpr) with length: 501
Trajectory(trajectory_files=slurm_gmx_engine_wdirs/engine_3/test.part0003.trr, structure_file=slurm_gmx_engine_wdirs/engine_3/test.tpr) with length: 501
Run for specified walltime#
walltime = 0.02 # 0.01 h = 36 s
# run all engines at once and time it
start = time.time()
trajs = await asyncio.gather(*(e.run_walltime(walltime) for e in new_engines))
end = time.time()
print(f"Running all engines for {walltime} h (={walltime*60*60} s) took {round(end - start, 4)} seconds.")
Running all engines for 0.02 h (=72.0 s) took 301.0246 seconds.
Run for specified walltime or number of steps (depending on what is reached first)#
We can also use the generic run() method which takes one or both of the walltime and nsteps arguments, it will finish as soon as one of the conditions is fulfilled. As the run_steps() method it also accepts the steps_per_part argument making it particularly useful to run in chunks (of length walltime) but for a fixed total number of steps.
Note that we can either check if engine.steps_done < n_steps_desired (as we do below) or call the engine.run(nsteps=n_steps_desired) method until it returns None instead of a trajectory object, which indicates that the total number of steps done in that engine is exactly the requested number of total steps.
print([e.steps_done for e in new_engines])
print([e.steps_done < (max([e.steps_done for e in new_engines]) + 20000) for e in new_engines])
[269000, 269600, 269200, 369600]
[True, True, True, True]
walltime = 0.01 # 0.01 h = 36 s
nsteps = max([e.steps_done for e in new_engines]) + 20000
print(f"Original nsteps = {nsteps}")
# make sure that nsteps is a multiple of nstout for all engines
# (this is enforced when you run for a fixed number of steps to avoid stupid one-off errors that might happen
# because the frames and steps are not multiples of each other)
perfect_nsteps = all([nsteps % e.nstout == 0 for e in engines])
while not perfect_nsteps:
for e in engines:
if nsteps % e.nstout != 0:
nsteps += e.nstout - (nsteps % e.nstout)
perfect_nsteps = all([nsteps % e.nstout == 0 for e in engines])
print(f"Will run for {nsteps} steps in all engines!")
# now the actual trajectory generation
all_trajs = []
all_times = []
while any([e.steps_done < nsteps for e in new_engines]):
# run all engines at once and time it
start = time.time()
trajs = await asyncio.gather(*(e.run(walltime=walltime, nsteps=nsteps, steps_per_part=False)
for e in new_engines
)
)
end = time.time()
all_trajs.append(trajs)
all_times.append(end-start)
print(f"Ran for a total of {len(all_times)} loops. It took us {round(sum(all_times), 4)} seconds.")
Original nsteps = 389600
Will run for 390000 steps in all engines!
Ran for a total of 4 loops. It took us 452.6167 seconds.
# the last engine could already have produced a `None` instead of a trajectory in the last iteration
# (since it is some steps ahead of the others because we ran it alone at the beginning of the notebook)
all_trajs[-1]
[Trajectory(trajectory_files=slurm_gmx_engine_wdirs/engine_0/test.part0007.trr, structure_file=slurm_gmx_engine_wdirs/engine_0/test.tpr),
Trajectory(trajectory_files=slurm_gmx_engine_wdirs/engine_1/test.part0007.trr, structure_file=slurm_gmx_engine_wdirs/engine_1/test.tpr),
Trajectory(trajectory_files=slurm_gmx_engine_wdirs/engine_2/test.part0007.trr, structure_file=slurm_gmx_engine_wdirs/engine_2/test.tpr),
None]