Generation
Introduction
Generating studies is one of the main functionalities of the study-da package. To understand how this is done, we will play with a dummy example configuration. In this case, we choose to use only two generations, as this is enough to illustrate the main concepts, but you can have as many generations as you want.
Creating the study
Template scripts
Generation 1
First, let's define the scripts from which we would like to generate the jobs. We will start with something very simple: a script that just adds two parameters from the configuration file.
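As a rough illustration of what such a template could look like, here is a minimal sketch. The helpers load_dic_from_path and write_dic_to_path, as well as the parameter names z and result, are assumptions made for this example; only set_item_in_dic is discussed further down in this tutorial.

from study_da.utils import load_dic_from_path, set_item_in_dic, write_dic_to_path

# Placeholders filled in by study-da at generation time (see below)
dict_mutated_parameters = {} ###---parameters---###
path_configuration = "{} ###---main_configuration---###"

# Load the configuration written by the generation above
configuration, ryaml = load_dic_from_path(path_configuration)

# Mutate the scanned parameters (here, x)
for key, value in dict_mutated_parameters.items():
    set_item_in_dic(configuration, key, value)

# Add two parameters from the configuration and store the result
result = configuration["x"] + configuration["z"]
set_item_in_dic(configuration, "result", result)

# Write the updated configuration to the disk, in the current generation
write_dic_to_path(configuration, path_configuration.split("/")[-1], ryaml)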
Jinja2 placeholders
The following part of the script is not really made to be used as is, but to be filled in by the study-da package:
dict_mutated_parameters = {} ###---parameters---###
path_configuration = "{} ###---main_configuration---###""
The {} ###--- and ---### markers indicate placeholders for the actual values that will be filled in by the study-da package, using jinja2 under the hood. In practice, as will be shown later in this tutorial, {} ###---parameters---### will be replaced by a dictionary of parameters (the ones being scanned), and {} ###---main_configuration---### will be replaced by the path to the main configuration file. This makes it possible to selectively mutate some of the parameters in the configuration file, and to write the modified configuration back to the disk. The parameter values are specific to each generated job.
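To make the mechanism more concrete, here is a small, self-contained illustration of how jinja2 can fill such placeholders when configured with custom delimiters. This is only a sketch of the idea; the exact delimiters and rendering code used internally by study-da may differ.

from jinja2 import Environment

# Use "{} ###---" and "---###" as variable delimiters instead of the default "{{" and "}}"
env = Environment(variable_start_string="{} ###---", variable_end_string="---###")

template = env.from_string('dict_mutated_parameters = {} ###---parameters---###')
print(template.render(parameters={"x": 1}))
# -> dict_mutated_parameters = {'x': 1}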
Why these placeholders?
You may wonder why we use these weird {} ###--- and ---### placeholders, instead of the usual {{ and }} from jinja2. The reason is that we want to keep the template script a valid Python executable, and this choice of placeholders allows us to do so. It is, however, quite arbitrary.
If you don't understand, no worries, it will get clearer as we go along and actually generate some jobs.
Understanding the script
The script is quite simple. It loads the main configuration file from a path that is not explicitly given yet, mutates the parameters in the configuration, adds two of them together, and writes the modified configuration back to the disk.
It is assumed that the main configuration is always loaded from the above generation, and written back in the current generation. Therefore, each generation relies on the previous one. This is a simple way to chain the generations, and update the configuration with the mutated parameters every time.
Be careful with parameter names
As you can see in the script, parameters are accessed only by their names. No key path is provided, even though the corresponding YAML file might have a nested structure.
This is because the set_item_in_dic function is used to set the value of the parameter. This function searches the whole configuration for the parameter name and sets its value. Consequently, if two parameters have the same name but live in different parts of the configuration file, the script will not work as expected.
This is the price of making the package as simple as possible. If you happen to have two parameters with the same name, you will have to rename one of them in the configuration file.
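As a small illustration of this behaviour (the import path and exact signature of set_item_in_dic are assumed here and may differ in your version of the package):

from study_da.utils import set_item_in_dic  # import path assumed

config = {"section_1": {"x": 0}, "section_2": {"y": 0}}

# The parameter is looked up by name, wherever it is nested in the dictionary
set_item_in_dic(config, "x", 5)
print(config)  # {'section_1': {'x': 5}, 'section_2': {'y': 0}}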
Why don't you generate all the configuration files directly?
You might very legitimately wonder why we don't produce the configuration files for all generations up front, mutating them appropriately, and only then run the jobs. This is because, in a generic workflow, the scripts of each generation will modify the configuration file, and the next generation will depend on the modified configuration file. Therefore, it can't be created in advance.
Generation 2
The second generation script is just as simple:
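It reuses the same placeholders as the first one; only the computation changes. Continuing the sketch above (same hypothetical helpers and parameter names):

from study_da.utils import load_dic_from_path, set_item_in_dic, write_dic_to_path

dict_mutated_parameters = {} ###---parameters---###
path_configuration = "{} ###---main_configuration---###"

# Load the configuration written by generation 1 (in the folder above)
configuration, ryaml = load_dic_from_path(path_configuration)

# Mutate the scanned parameters (here, y)
for key, value in dict_mutated_parameters.items():
    set_item_in_dic(configuration, key, value)

# Multiply the result computed by generation 1 by y and write it to a text file
product = configuration["result"] * configuration["y"]
with open("result.txt", "w") as fid:
    fid.write(str(product))

# Write the updated configuration to the disk, in the current generation
write_dic_to_path(configuration, path_configuration.split("/")[-1], ryaml)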
As you can see, this script multiplies the result of the previous script (stored in the configuration, in the above generation) by a new parameter y, and writes the result to a text file.
Template configuration
The base configuration is always the same for all generations, although it does get modified (mutated) by the scripts. There is nothing special about it; it's just a simple YAML file:
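For instance, a minimal configuration consistent with the sketches above could be (the actual config_dummy.yaml used in this example may differ):

# config_dummy.yaml (hypothetical content)
x: 0        # mutated by the scan in generation 1
y: 0        # mutated by the scan in generation 2
z: 3        # fixed parameter added to x by generation 1
result: 0   # filled in by generation 1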
Scan configuration
The scan configuration is an essential part of the study generation. It defines what are the parameters that will be scanned, and the values they will take. Here is a possible scan configuration for our dummy example:
name: example_dummy
# List all useful files that will be used by executable in generations below
# These files are placed at the root of the study
dependencies:
main_configuration: custom_files/config_dummy.yaml
# others:
# - custom_files/other_file.yaml
# - custom_files/module.py
structure:
# First generation is always at the root of the study
# such that config_dummy.yaml is accessible as ../config_dummy.yaml
generation_1:
executable: custom_files/generation_1_dummy.py
scans:
x:
# Values taken by x using a list
list: [1, 2]
# Second generation depends on the config from the first generation
generation_2:
executable: custom_files/generation_2_dummy.py
scans:
y:
# Values taken by y using a linspace
linspace: [1, 100, 3]
Let's explain the different fields in the scan configuration:
- name: the name of the study, which will also correspond to the name of the root folder of the study.
- dependencies: a list of files that are needed by the executable scripts. These files will be copied to the root of the study, so that they can be accessed by the scripts. Note that some configuration files are already provided by the package, and can be used directly (see e.g. 1_simple_collider and 2_tune_scan for examples). Understanding the dependencies is not so straightforward, and we will come back to it in the advanced tutorial.
- structure: the structure of the study, with the different generations:
    - Each generation has an executable field, which is the path to the script that will be executed. These paths can correspond to local files (as here), or to predefined templates (as, for instance, in the case studies, in which case the name of the template is enough).
    - The scans field is a dictionary of the parameters that will be scanned, and the values they will take. The values can be given as a list, a linspace or a logspace. Other possibilities (e.g. scanning generated string names, using nested variables, defining parameters in terms of others through mathematical expressions) are all presented in the Case studies.
By default (if no specific keyword is provided), the Cartesian product of all the parameter values will be considered to generate the jobs. This means that the number of jobs will be the product of the number of values for each parameter. In the example above, the number of jobs will be 6 (2 values for x and 3 values for y).
Conversely, one can decide to scan two parameters at the same time (useful, for instance, when scanning the tune diagonal in a collider) using the concomitant keyword. This is also used in the tune scan case study.
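The exact syntax is shown in the tune scan case study; schematically, a concomitant scan could look like the following sketch (hypothetical parameters and values, to be checked against the case study):

generation_2:
  executable: custom_files/generation_2_dummy.py
  scans:
    qx:
      linspace: [62.305, 62.330, 11]
    qy:
      linspace: [60.305, 60.330, 11]
      # qy varies together with qx (11 jobs), instead of forming an 11x11 grid
      concomitant: [qx]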
Generating the study
Everything is now in place to generate our dummy study. A one-line command is enough to do it for us:
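For instance, assuming the scan configuration above has been saved as config_scan.yaml:

from study_da import create  # import path assumed

# Generate the study; the two returned strings are, presumably, the path of the
# generated tree file and the name of the main configuration file
path_tree, name_main_configuration = create(path_config_scan="config_scan.yaml", force_overwrite=False)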
You should now see the corresponding study folder appear in the current directory.
Arguments of the create function
We will detail the structure soon, but let's first have a look at the other possible arguments of the create function. The full signature of the function is:
def create(
path_config_scan: str = "config_scan.yaml",
force_overwrite: bool = False,
dic_parameter_all_gen: Optional[dict[str, dict[str, Any]]] = None,
dic_parameter_all_gen_naming: Optional[dict[str, dict[str, Any]]] = None,
add_prefix_to_folder_names: bool = False,
) -> tuple[str, str]:
Let's detail the arguments:
- path_config_scan is the path to the scan configuration file described above.
- force_overwrite can be useful if you want to overwrite an existing study. However, in most cases, we submit the study in the same script, meaning that we might have to run the script several times. In this case, the force_overwrite argument must be set to False, otherwise the study will be overwritten at each run, and you will lose whatever has already been computed.
- dic_parameter_all_gen is a dictionary that lets you specify the parameters that will be scanned for each generation, instead of defining them in the scan configuration file (an illustrative sketch is given after this list). This doesn't free you from defining the structure of the study in the scan configuration file! This can be useful when the way to define your parameters is more complex than a simple list, linspace or logspace, or if your parameters are functions of each other. This is explained in one of the case studies.
- dic_parameter_all_gen_naming is similar to dic_parameter_all_gen, but lets you specify both the parameters that will be scanned for each generation and the way they will be named in the study folder. This is useful when you want a specific naming for your parameters, or if parameters are nested. This is also explained in the same case study.
- add_prefix_to_folder_names is a boolean that adds a prefix to the folder names (each folder corresponding to a given job) of the study. This can sometimes help when browsing the study, especially when the number of jobs is large.
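As announced in the list above, here is a sketch of how dic_parameter_all_gen could be used; the layout of the dictionary (keyed by generation name, then by parameter name) is an assumption based on the type hint of the function:

import numpy as np
from study_da import create  # import path assumed

# Same parameter space as in the scan configuration, but defined programmatically
dic_parameter_all_gen = {
    "generation_1": {"x": [1, 2]},
    "generation_2": {"y": np.linspace(1, 100, 3)},
}

path_tree, name_main_configuration = create(
    path_config_scan="config_scan.yaml",
    dic_parameter_all_gen=dic_parameter_all_gen,
    add_prefix_to_folder_names=True,
)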
The tree and study structure
The following study structure should have been generated (not all files are shown):
📁 example_dummy/
├── 📁 x_1/
│   ├── 📁 y_1.0/
│   ├── 📁 y_50.5/
│   ├── 📁 y_100.0/
│   │   └── 📄 generation_2.py
│   └── 📄 generation_1.py
├── 📁 x_2/
├── 📄 tree.yaml
└── 📄 config_dummy.yaml
And similarly, in the tree.yaml file:
x_1:
  generation_1:
    file: example_dummy/x_1/generation_1.py
  y_1.0:
    generation_2:
      file: example_dummy/x_1/y_1.0/generation_2.py
...
As you can observe, by default, each folder corresponds to a given generation, and is named after the parameter value it corresponds to. In each folder, an executable script (a .py file) has been created, along with potential subgenerations.
If you open a given script, you will see that the placeholders have been replaced by the actual values of the parameters. For instance, the parameter definitions in the generation_1.py script in the x_1 folder should now look similar to the following (the exact formatting may differ):
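dict_mutated_parameters = {"x": 1}
path_configuration = "../config_dummy.yaml"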
This type of structure is very useful to keep track of the different jobs that have been generated, and to easily access the results of each job. Each job has its own executable, and only depends on the previous generation. Therefore, the jobs can be run independently, or in parallel, and can be individually debugged.
The tree file will be very useful to keep track of the state of each job, as illustrated in the second part of this tutorial.