simulatr on a cluster or cloud
example-remote.Rmd
If you have access to a distributed computing platform (e.g. a
computer cluster provided by your university or a cloud computing
service like AWS), then you can easily move your simulatr
simulation from your laptop to this platform.
1-3. Create a working simulatr specifier object
The same simulatr specifier object you created on your laptop can be
used on your distributed computing platform! Please read the “simulatr
on your laptop” article (linked above) if you have not done so already.
Once you have a simulatr specifier object you are happy with, save it to
disk using a command like the following:
saveRDS(object = simulatr_spec, file = simulatr_spec_filename)
4. Run the simulation on your distributed computing platform
Running simulatr on a distributed computing platform is facilitated by
a Nextflow pipeline, Katsevich-Lab/simulatr-pipeline.
This pipeline takes as inputs the simulatr specifier file created in
steps 1-3, as well as the desired limits on the memory and time usage
of each individual process. It then adaptively parallelizes the
simulation tasks accordingly, splitting the replicates for each
combination of method and parameter setting into one or more processes.
Below are instructions for running this pipeline on your distributed computing platform.
A. Install and configure Nextflow
Installing Nextflow is easy; follow the instructions here. Next, you must configure Nextflow to work with your specific distributed computing platform. To get familiar with Nextflow configuration, you can read this tutorial, see this list of configuration files at other institutions, read 5 Nextflow tips for HPC users and 5 more Nextflow tips for HPC users, and finally, consult the Nextflow documentation.
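As a starting point for the configuration step, the sketch below writes a minimal nextflow.config file into the directory from which you launch the pipeline. It assumes your cluster uses the SLURM scheduler, and the partition name normal is purely hypothetical; substitute the executor and queue that apply to your platform.

```shell
# Write a minimal nextflow.config in the current (launch) directory.
# Note: this overwrites any existing nextflow.config in that directory.
# Assumptions: the scheduler is SLURM, and "normal" is a hypothetical
# partition name; replace both with values valid on your cluster.
cat > nextflow.config <<'EOF'
process {
    executor = 'slurm'
    queue    = 'normal'
}
EOF
```

Nextflow reads a nextflow.config file from the launch directory automatically, so no extra command-line flag is needed for this file to take effect.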
B. Download the simulatr pipeline
To make sure you have the latest version of the simulatr pipeline,
download it using the shell command
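Nextflow's pull subcommand downloads a pipeline, or updates the local copy to the latest version, so the download step is presumably along the following lines. The pipeline name is taken from the run command in step C; the check that nextflow is on your PATH is an added convenience rather than part of the original instructions.

```shell
# Pipeline name as it appears in the run command in step C.
PIPELINE="katsevich-lab/simulatr-pipeline"

# Only attempt the download if Nextflow is installed and on the PATH.
if command -v nextflow >/dev/null 2>&1; then
  nextflow pull "$PIPELINE"
else
  echo "nextflow not found on PATH; complete step A first" >&2
fi
```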
C. Run the simulatr pipeline
You can run the simulatr pipeline using a command like the following:
nextflow run katsevich-lab/simulatr-pipeline \
--simulatr_specifier_fp /path/to/simspec/obj \
--result_dir directory/for/results \
--result_file_name "simulatr_result.rds" \
--B_check 5 \
--B 100 \
--max_gb 8 \
--max_hours 4
The command-line arguments are described below. Note that only the
first (simulatr_specifier_fp) is required; the rest have sensible
defaults, included in the following descriptions.
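Because only simulatr_specifier_fp is required, a short trial run can rely on the defaults and override just B. The sketch below wraps such a call in a small shell function; the function name run_trial is hypothetical and is not part of the pipeline.

```shell
# Hypothetical helper (not part of the pipeline) for a quick trial run.
# $1 = path to the saved simulatr specifier object
# $2 = number of replicates to use for the trial (passed as --B)
run_trial() {
  nextflow run katsevich-lab/simulatr-pipeline \
    --simulatr_specifier_fp "$1" \
    --B "$2"
}
```

For example, run_trial /path/to/simspec/obj 10 would launch the pipeline with 10 replicates and default values for every other argument.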
- simulatr_specifier_fp: (Required) The path to the simulatr specifier object saved at the end of steps 1-3 above.
- result_dir: (Optional) The directory in which to write the output file. Defaults to the current working directory.
- result_file_name: (Optional) The name of the output file. Defaults to "simulatr_result.rds".
- B_check: (Optional) The number of initial simulation replicates to run for each combination of method and parameter setting in order to benchmark the memory and time required for adaptive parallelization. Defaults to 5.
- B: (Optional) The number of simulation replicates to run for the full simulation. Defaults to the value in the simulatr specifier object. You may want to set B to a small number during an initial trial run.
- max_gb: (Optional) The maximum number of GB each process has available. This number is used in the adaptive parallelization scheme. Defaults to 8.
- max_hours: (Optional) The maximum number of hours each process can run for. This number is used in the adaptive parallelization scheme. Defaults to 4.
- time_fudge_factor: A positive real number indicating how liberal or conservative we should be in estimating the number of processors to use for a given method-grid row pair. A number in the interval (0, 1) indicates that we should be conservative (i.e., request more processors than is probably necessary), while a number in the interval (1, ∞) indicates that we should be liberal (i.e., request fewer processors than is probably necessary). The default is 0.8.
- mem_fudge_factor: Similar to time_fudge_factor, but for memory.
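To make the roles of B_check, max_hours, and time_fudge_factor more concrete, the sketch below shows one way benchmark timings could be turned into a number of processes. The function estimate_processes and its formula are illustrative assumptions, not the pipeline's actual code: it scales the benchmark time up to B replicates and divides by the per-process time budget multiplied by the fudge factor, rounding up.

```shell
# Hypothetical illustration only; the real pipeline may compute this differently.
# $1 = seconds taken by the B_check benchmark replicates
# $2 = B_check, $3 = B, $4 = max_hours, $5 = time_fudge_factor
estimate_processes() {
  awk -v t="$1" -v bc="$2" -v b="$3" -v mh="$4" -v f="$5" 'BEGIN {
    total  = (t / bc) * b        # projected seconds for all B replicates
    budget = mh * 3600 * f       # usable seconds per process
    n = int(total / budget)      # whole processes that fit...
    if (n * budget < total) n++  # ...rounded up to cover the remainder
    if (n < 1) n = 1             # always use at least one process
    print n
  }'
}

# 5 benchmark replicates took 600 s; full run has 1000 replicates,
# a 4-hour limit per process, and the default fudge factor of 0.8.
estimate_processes 600 5 1000 4 0.8   # prints 11
```

Under this sketch, a fudge factor of 1 would give 9 processes and a factor of 2 would give 5, matching the description above: values below 1 request more processes, values above 1 request fewer.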
Upon successful completion of the pipeline, the results will be written
to disk in RDS format at the location requested. The results are in the
same format as returned by check_simulatr_specifier_object(); see
“simulatr on your laptop”.
5. Summarize and/or visualize the results
Read the results from disk using readRDS(). The rest is the same as in
“simulatr on your laptop”.