Running multiple pygwb_pipe jobs

In practice, one will probably want to run pygwb on long stretches of data. This is achieved most easily by splitting the large data set into smaller chunks of data. These can then be analyzed individually and combined after the analysis to form one overall result for the whole data set. To this end, pygwb comes with two scripts: pygwb_dag and pygwb_combine. The former allows the user to run pygwb_pipe (for which a tutorial can be found here) simultaneously on shorter stretches of data, whereas the latter allows the user to combine the output of the individual runs into an overall result for the whole data set.

The pygwb_dag script

1. Script parameters

To be able to run multiple pygwb_pipe jobs simultaneously, pygwb relies on Condor. This requires a dag file, which contains information about all the jobs, i.e., the individual pygwb_pipe runs on the different stretches of data. In pygwb, this file can be created by using the pygwb_dag script. To visualize the expected arguments of the script, one can call:

pygwb_dag --help

This will display the available parameters, together with a short description:

--subfile SUBFILE     Submission file.
--jobfile JOBFILE     Job file with start and end times and duration for each job.
--flag FLAG           Flag that is searched for in the DQSegDB.
--t0 T0               Begin time of analysed data, will query the DQSegDB. If used with jobfile, it is an optional argument if one does not wish to analyse the whole job
                      file
--tf TF               End time of analysed data, will query the DQSegDB. If used with jobfile, it is an optional argument if one does not wish to analyse the whole job
                      file
--parentdir PARENTDIR
                      Starting folder.
--param_file PARAM_FILE
                      Path to parameters.ini file.
--dag_name DAG_NAME   Dag file name.
--apply_dsc APPLY_DSC
                      Apply delta-sigma cut flag for pygwb_pipe.
--pickle_out PICKLE_OUT
                      Pickle output Baseline of the analysis.
--wipe_ifo WIPE_IFO   Wipe interferometer data to reduce size of pickled Baseline.
--calc_pt_est CALC_PT_EST
                      Calculate omega point estimate and sigma from data.

An important argument of the script is the path to the job file, passed through --jobfile. The job file is a simple .txt file that contains the different jobs, i.e., the different stretches of data to run the analysis on. For concreteness, consider the case where one wants to run pygwb on 12000 seconds of data, split into smaller jobs. The job file could then look as follows:

1 0 4000  4000
1 4000  9000  5000
1 9000 12000 3000

The first column does not play a role; the second and third columns indicate the start and end time of the job, respectively, whereas the last column gives the duration of the job, i.e., the difference between end and start time. The job file therefore tells the script on which stretches of data to run. In case one wants to run on a subset of the jobs in the job file, one can pass an additional start and end time to the script through the --t0 and --tf arguments.
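
For reference, a job file of this form could be generated with a short Python snippet along the following lines; the file name, time span, and job duration below are arbitrary choices for this illustration and not tied to any real analysis:

# Sketch: write a job file splitting [t0, tf] into jobs of at most job_dur seconds.
# Column layout: flag, start time, end time, duration.
t0, tf, job_dur = 0, 12000, 4000  # illustrative values

with open("jobfile.txt", "w") as f:
    start = t0
    while start < tf:
        end = min(start + job_dur, tf)
        f.write(f"1 {start} {end} {end - start}\n")
        start = end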

The --parentdir argument takes the full path to the run directory, and --param_file should point to the parameter file to be used by pygwb_pipe.

See also

For more information about pygwb_pipe and the usage of a parameter file, we refer the user to the tutorial here.

For the remainder of the arguments, we refer the user to the pygwb_pipe tutorial, as the dag file passes the relevant arguments to pygwb_pipe behind the scenes, e.g., the parameter file and the apply_dsc flag.

Note that an additional argument should be passed to the script, namely the submission file. This file provides the information Condor needs to run the pygwb jobs on the cluster or server the user is working on.

Warning

The Condor submission file, passed through --subfile, is not included in the pygwb package. Its specific implementation will depend on the server or cluster where the user runs the analysis. More information about Condor, together with inspiration for the submission file, can be found here.
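
As a rough illustration only, a submission file typically has a structure similar to the sketch below. The executable path, requested resources, and the way arguments are forwarded to pygwb_pipe (here through a generic $(ARGS) macro) are placeholders that must be adapted to your cluster and to the dag file:

universe       = vanilla
executable     = /path/to/pygwb_pipe
arguments      = $(ARGS)
getenv         = true
output         = condor/$(Cluster).out
error          = condor/$(Cluster).err
log            = condor/$(Cluster).log
request_memory = 8 GB
queue 1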

2. Running the script

The arguments described above can be passed to the script through the following command:

pygwb_dag --subfile {full_path_to_subfile} --jobfile {full_path_to_jobfile} --parentdir {full_path_to_parent_dir} --param_file {full_path_to_param_file} --dag_name {your-dag-file.dag}

Note

If no dag name is specified through the --dag_name argument, the default name dag_name.dag is used.

The dag file is now created in the {full_path_to_parent_dir}/output folder. To submit the jobs to Condor and actually run them, navigate to that folder and run the following command:

condor_submit_dag {your-dag-file.dag}

To check the status of the jobs, one can execute the command:

condor_q

For additional information on Condor jobs, we refer the user to the Condor documentation.

3. Output of the script

Once all the jobs submitted through Condor and the dag file finish running, the output folder should contain files similar to the ones already discussed in the pygwb_pipe tutorial here. However, there will be many more files than for a single run, as pygwb_pipe was run for each of the jobs and therefore produced output for each of them. We refrain from repeating the information about the output of pygwb_pipe here and refer to the previous tutorial for more details.

Combining runs with pygwb_combine

The pygwb_dag script described above runs multiple pygwb_pipe jobs on stretches of data. For each of these runs, the usual pygwb_pipe output is produced (see here for more information on the output of the pygwb_pipe script). However, the user is usually interested in an overall result for the whole data set. This is where pygwb_combine comes in: it allows the user to combine the separate results into an overall result. For example, all separate point estimate and variance spectra are combined into one overall spectrum for the whole data set. More information on this procedure can be found in the pygwb paper.
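
Schematically, and assuming the individual runs are statistically independent, the combination amounts to an inverse-variance weighted average of the per-run results, as sketched below with purely illustrative numbers; the exact procedure used by the script is detailed in the pygwb paper:

import numpy as np

# Hypothetical per-run results (point estimates and sigmas of the individual
# pygwb_pipe runs); the values are purely illustrative.
pt_ests = np.array([1.2e-8, 0.8e-8, 1.1e-8])
sigmas = np.array([2.0e-8, 2.5e-8, 1.8e-8])

# Inverse-variance weighted combination
weights = 1 / sigmas**2
combined_pt_est = (weights * pt_ests).sum() / weights.sum()
combined_sigma = 1 / np.sqrt(weights.sum())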

1. Script parameters

The available arguments of the pygwb_combine script can be displayed through:

pygwb_combine -h

This shows the following arguments with a short description:

--data_path DATA_PATH [DATA_PATH ...]
                      Path to data files or folder.
--alpha ALPHA         Spectral index alpha to use for spectral re-weighting.
--fref FREF           Reference frequency to use when presenting results.
--param_file PARAM_FILE
                      Parameter file
--h0 H0               Value of h0 to use. Default is pygwb.constants.h0.
--combine_coherence COMBINE_COHERENCE
                      Calculate combined coherence over all available data.
--coherence_path COHERENCE_PATH [COHERENCE_PATH ...]
                      Path to coherence data files, if individual files are
                      passed.
--out_path OUT_PATH   Output path.
--file_tag FILE_TAG   File naming tag. By default, reads in first and last
                      time in dataset.

2. Running the script

To run the script, one executes the following command:

pygwb_combine --data_path {my_pygwb_output_folder} --alpha {my_spectral_index} --fref {my_fref} --param_file {my_parameter_file_path} --out_path {my_combine_folder}

Note that not all of the arguments listed above are required to run the script.

Warning

The --combine_coherence functionality is not supported when combining runs produced by the pygwb_dag script.

3. Output of the script

As mentioned above, the output of the pygwb_combine script is one overall point estimate and variance (spectrum). The directory passed through the --out_path argument should then contain a file with a name of the following form:

point_estimate_sigma_spectra_alpha_0.0_fref_25_t0-tf.npz

This file contains the combined spectra. The file name indicates the analysis was run with a spectral index of 0 and a reference frequency of 25 Hz; t0 and tf are replaced by the actual start and end times of the analysis, respectively.

The keys of this npz file are:

['point_estimate', 'sigma', 'point_estimate_spectrum', 'sigma_spectrum',
'frequencies', 'frequency_mask', 'point_estimates_seg_UW', 'sigmas_seg_UW']

The value associated with a given key can be accessed from the npz file as follows:

import numpy

npzfile = numpy.load("point_estimate_sigma_spectra_alpha_0.0_fref_25_t0-tf.npz")
variable = npzfile["key"]

One obtains the value of the overall point estimate and its standard deviation through the point_estimate and sigma keys, respectively. The corresponding spectra are found by using the point_estimate_spectrum and sigma_spectrum keys. The frequencies of these spectra can be retrieved through the frequencies key, and the frequency_mask key contains the frequency mask resulting from the notching procedure. For more information about notching, check the demo here or the API of the notch module here. Lastly, one can also access the unweighted point estimates, i.e., without reweighting of the spectral index, and their standard deviations for every segment in the analysis. These are labeled with _UW at the end of the keys.
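
As an illustration, using the file name shown above as a stand-in for the one in your output directory, the overall results and the spectra could be accessed as follows:

import numpy as np

npzfile = np.load("point_estimate_sigma_spectra_alpha_0.0_fref_25_t0-tf.npz")

point_estimate = npzfile["point_estimate"]  # overall point estimate
sigma = npzfile["sigma"]                    # overall standard deviation

# Restrict the spectra to the unnotched frequencies; this assumes the mask is
# True for frequencies that are kept after notching (check on your own output).
mask = npzfile["frequency_mask"].astype(bool)
freqs = npzfile["frequencies"][mask]
pt_est_spectrum = npzfile["point_estimate_spectrum"][mask]
sigma_spectrum = npzfile["sigma_spectrum"][mask]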

Tip

Not sure what exactly is in the .npz file? Load in the file and print out all its keys as shown here.
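
For example, with numpy:

import numpy as np

npzfile = np.load("point_estimate_sigma_spectra_alpha_0.0_fref_25_t0-tf.npz")
print(npzfile.files)  # lists all keys stored in the file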

If the pygwb_pipe analyses were run with the delta sigma cut turned on, a file delta_sigma_cut_t0-tf.npz should be present in the output directory as well. This file contains the following keys:

['naive_sigma_values', 'slide_sigma_values', 'delta_sigma_values',
'badGPStimes', 'delta_sigma_times', 'ifo_1_gates', 'ifo_2_gates',
'ifo_1_gate_pad', 'ifo_2_gate_pad']

The times flagged by the delta sigma cut and excluded from the analysis can be retrieved with the 'badGPStimes' key. The alphas used for the delta sigma cut are stored in the 'delta_sigma_alphas' key, the times in 'delta_sigma_times', and the actual values of the delta sigmas in 'delta_sigma_values'. The delta sigma cut computes both the naive and sliding sigma values, which are stored in the 'naive_sigma_values' and 'slide_sigma_values' keys.

If gating is turned on, the gated times are saved in 'ifo_{i}_gates', where i denotes the first or second interferometer used in the analysis. The 'ifo_{i}_gate_pad' key refers to the value of the parameter gate_tpad used during the analysis.
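
As an illustration, these quantities can be read back from the file in the same way as before; replace the file name by the one present in your output directory:

import numpy as np

dsc_file = np.load("delta_sigma_cut_t0-tf.npz")

bad_gps_times = dsc_file["badGPStimes"]        # times excluded by the delta sigma cut
delta_sigmas = dsc_file["delta_sigma_values"]  # delta sigma value per segment
ifo_1_gates = dsc_file["ifo_1_gates"]          # gated times of the first interferometer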