Using pygwb_pipe: a quickstart manual

The various modules of the pygwb package can be combined into a pipeline, as done in the pygwb_pipe script. This script takes data as input and outputs an estimate of the gravitational-wave background (GWB) for these data, in the form of a point estimate and its variance. More information on how the various modules interact and are combined into a pipeline can be found in the pygwb paper.

Note

The proposed pygwb_pipe pipeline is only one of many ways to assemble the pygwb modules, and users should feel free to create their own pipeline that addresses their needs.

1. Script parameters

The parameters of the pygwb_pipe script can be visualized by running the following command:

pygwb_pipe --help

This will display the following set of parameters, which can be passed to the pipeline:

--param_file PARAM_FILE
                      Parameter file to use for analysis.
--output_path OUTPUT_PATH
                      Location to save output to.
--calc_coh CALC_COH
                      Calculate coherence spectrum from data.
--calc_pt_est CALC_PT_EST
                      Calculate omega point estimate and sigma from data.
--apply_dsc APPLY_DSC
                      Apply delta sigma cut when calculating final output.
--pickle_out PICKLE_OUT
                      Pickle output Baseline of the analysis.
--wipe_ifo WIPE_IFO   Wipe interferometer data to reduce size of pickled Baseline.
--t0 T0               Initial time.
--tf TF               Final time.
--data_type DATA_TYPE
                      Type of data to access/download; options are private,
                      public, local. Default is public.
--channel CHANNEL     Channel name; needs to match an existing channel. Default is
                      "GWOSC-16KHZ_R1_STRAIN"
--new_sample_rate NEW_SAMPLE_RATE
                      Sample rate to use when downsampling the data (Hz). Default
                      is 4096 Hz.
--input_sample_rate INPUT_SAMPLE_RATE
                      Sample rate of the read data (Hz). Default is 16384 Hz.
--cutoff_frequency CUTOFF_FREQUENCY
                      Lower frequency cutoff; applied in filtering in
                      preprocessing (Hz). Default is 11 Hz.
--segment_duration SEGMENT_DURATION
                      Duration of the individual segments to analyse (seconds).
                      Default is 192 seconds.
--number_cropped_seconds NUMBER_CROPPED_SECONDS
                      Number of seconds to crop at the start and end of the
                      analysed data (seconds). Default is 2 seconds.
--window_downsampling WINDOW_DOWNSAMPLING
                      Type of window to use in preprocessing. Default is "hamming"
--ftype FTYPE         Type of filter to use in downsampling. Default is "fir"
--frequency_resolution FREQUENCY_RESOLUTION
                      Frequency resolution of the final output spectrum (Hz).
                      Default is 1/32 Hz.
--polarization POLARIZATION
                      Polarization type for the overlap reduction function calculation; options are scalar, vector, tensor. Default is tensor.
--alpha ALPHA         Spectral index to filter the data for. Default is 0.
--fref FREF           Reference frequency to filter the data at (Hz). Default is 25 Hz.
--flow FLOW           Lower frequency to include in the analysis (Hz). Default is 20 Hz.
--fhigh FHIGH         Higher frequency to include in the analysis (Hz). Default is 1726 Hz.
--coarse_grain COARSE_GRAIN
                      Whether to apply coarse graining to the spectra. Default is 0.
--interferometer_list INTERFEROMETER_LIST [INTERFEROMETER_LIST ...]
                      List of interferometers to run the analysis with. Default is
                      ["H1", "L1"]
--local_data_path LOCAL_DATA_PATH
                      Path(s) to local data, if the local data option is chosen.
                      Default is empty.
--notch_list_path NOTCH_LIST_PATH
                      Path to the notch list file. Default is empty.
--N_average_segments_welch_psd N_AVERAGE_SEGMENTS_WELCH_PSD
                      Number of segments to average over when calculating the psd
                      with Welch method. Default is 2.
--window_fft_dict WINDOW_FFT_DICT
                      Dictionary containing name and parameters relative to which
                      window to use when producing fftgrams for psds and csds.
                      Default is "hann".
--calibration_epsilon CALIBRATION_EPSILON
                      Calibration coefficient. Default is 0.
--overlap_factor OVERLAP_FACTOR
                      Factor by which to overlap consecutive segments for
                      analysis. Default is 0.5 (50% overlap).
--zeropad_csd ZEROPAD_CSD
                      Whether to zeropad the csd or not. Default is True.
--delta_sigma_cut DELTA_SIGMA_CUT
                      Cutoff value for the delta sigma cut. Default is 0.2.
--alphas_delta_sigma_cut ALPHAS_DELTA_SIGMA_CUT [ALPHAS_DELTA_SIGMA_CUT ...]
                      List of spectral indexes to use in delta sigma cut
                      calculation. Default is [-5, 0, 3].
--save_data_type SAVE_DATA_TYPE
                      Suffix for the output data file. Options are hdf5, npz,
                      json, pickle. Default is json.
--time_shift TIME_SHIFT
                      Seconds to timeshift the data by in preprocessing. Default
                      is 0.
--gate_data GATE_DATA
                      Whether to apply self-gating to the data in preprocessing.
                      Default is False.
--gate_tzero GATE_TZERO
                      Gate tzero. Default is 1.0.
--gate_tpad GATE_TPAD
                      Gate tpad. Default is 0.5.
--gate_threshold GATE_THRESHOLD
                      Gate threshold. Default is 50.
--cluster_window CLUSTER_WINDOW
                      Cluster window. Default is 0.5.
--gate_whiten GATE_WHITEN
                      Whether to whiten when gating. Default is True.
--tag TAG             Hint for the read_data function to retrieve one specific
                      type of data, e.g.: C00, C01
--return_naive_and_averaged_sigmas RETURN_NAIVE_AND_AVERAGED_SIGMAS
                      Option to return the naive and sliding sigmas from the
                      delta sigma cut. Default is False.

As can be seen, each of the parameters above comes with a brief description, which should help the user identify its functionality. In particular, we note that the above parameters are the ones defined in the pygwb.parameters module. For more details, one can have a look at the pygwb paper.

Tip

Feeling overwhelmed by the number of parameters? Make sure to have a look at the pygwb.parameters documentation.

Note

The current default for the notch_list_path is an empty string, which means no notches are applied. If notching should be applied, a path to a notch list file can be added to these parameters. An example for such a notch list can be downloaded here. This particular notch list was used in the analysis for the third observing run of the LIGO-Virgo-KAGRA network. This file can also be found in the pygwb/pygwb_pipe folder.

2. Running the script

Although all of the parameters shown above can be passed to the script, we start by running pygwb_pipe without passing any optional parameters directly. The only required argument is a path to a parameter file, which contains the parameter values to use for the analysis. As an example, one can run the script with the parameters.ini file provided in the pygwb_pipe directory of the repository. To test the pipeline, run the command:

pygwb_pipe --param_file pygwb_pipe/parameters.ini --apply_dsc False

The output of the command above should be:

2023-02-21 14:43:40.817 | SUCCESS  | __main__:main:160 - Ran stochastic search over times 1247644138-1247645038
2023-02-24 16:35:25.625 | SUCCESS  | __main__:main:163 - POINT ESTIMATE: -6.496991e-06
2023-02-24 16:35:25.625 | SUCCESS  | __main__:main:164 - SIGMA: 2.688128e-06
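For reference, the parameter file passed above is a plain .ini file. A minimal sketch of what it might contain is shown below; the section names and exact layout here are illustrative assumptions, so consult the parameters.ini file shipped in pygwb_pipe for the actual structure and the full set of options.

```ini
; Minimal sketch of a parameter file. The section names are illustrative
; assumptions; see pygwb_pipe/parameters.ini for the actual layout.
[data_specs]
t0 = 1247644138
tf = 1247645038
data_type = public
channel = GWOSC-16KHZ_R1_STRAIN

[preprocessing]
new_sample_rate = 4096
cutoff_frequency = 11
segment_duration = 192
```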

To run with different parameters, one can either modify the parameters.ini file or pass the parameters as arguments to the script directly. For example:

pygwb_pipe --param_file {path_to_param_file} --apply_dsc True --gate_data True

Warning

Passing any parameters through the command line overwrites the value in the parameters.ini file.

Note: detector-specific parameters

It is possible to pass detector-specific parameters, both in the .ini file and through the command line. The syntax is:

param: IFO1:val1,IFO2:val2

For example, if passing different channel names for LIGO Hanford and LIGO Livingston:

channel: H1:GWOSC-16KHZ_R1_STRAIN,L1:PYGWB-SIMULATED_STRAIN

The same syntax applies when passing the parameter through the command line:

--channel H1:GWOSC-16KHZ_R1_STRAIN,L1:PYGWB-SIMULATED_STRAIN
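To make the syntax concrete, the snippet below sketches how such a string can be split into per-detector values. This is an illustrative helper, not pygwb's actual parser; pygwb handles this parsing internally.

```python
def parse_ifo_params(value: str) -> dict:
    """Split a string like 'H1:val1,L1:val2' into {'H1': 'val1', 'L1': 'val2'}.

    Illustrative only -- pygwb performs this parsing internally.
    """
    out = {}
    for entry in value.split(","):
        # Split on the first ':' only, so values containing '-' or digits are kept intact.
        ifo, _, val = entry.partition(":")
        out[ifo.strip()] = val.strip()
    return out

channels = parse_ifo_params("H1:GWOSC-16KHZ_R1_STRAIN,L1:PYGWB-SIMULATED_STRAIN")
```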

3. Output of the script

As mentioned previously, the purpose of the pygwb analysis package is to compute an estimate of the GWB, through the computation of a point estimate and variance spectrum, which can be combined into a single point estimate and variance. By default, the output of the analysis will be saved in the ./output folder of your run directory, unless otherwise specified through the --output_path argument of the script.

A few files can be found in this directory, including a version of the parameter file used for the analysis. Note that this file takes into account any parameters that were modified through the command line. It follows the naming convention parameters_{t0}_{length_of_job}_final.ini.

Additionally, the power-spectral densities (PSDs) and cross-spectral densities (CSDs) are saved in a file with naming convention:

psds_csds_{start_time_of_job}_{job_duration}.npz

Tip

Not sure what exactly is in a file? Load the file and print out all its keys, as shown below.

Loading the file and printing its keys displays the following:

import numpy
npzfile = numpy.load("psds_csds_{start_time_of_job}_{job_duration}.npz")
print(list(npzfile.keys()))

['freqs', 'avg_freqs', 'csd', 'avg_csd', 'psd_1', 'psd_2', 'avg_psd_1', 'avg_psd_2',
 'csd_times', 'avg_csd_times', 'psd_times', 'avg_psd_times',
 'coherence', 'psd_1_coh', 'psd_2_coh', 'csd_coh', 'n_segs_coh']

Each of the above keys of the .npz file has data associated with it, which can be read using:

variable = npzfile['{key}']

More specifically, the frequencies for the naive estimates can be accessed through the 'freqs' key, whereas those for the averaged estimates of the spectral densities can be accessed through the 'avg_freqs' key. Additionally, the CSD can be read using the 'csd' key, and the average CSD can be found under the 'avg_csd' key. Analogously, one can load the PSDs of the interferometers. One can also read the times associated with these spectral densities using the keys '{insert_spectral_density}_times'. If the --calc_coh argument was set to True during the analysis, the coherence information will also be stored in this file under the 'coherence' key, together with the PSDs, CSD, and number of segments used to compute the coherence.
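The access pattern above can be illustrated with a self-contained example that writes a small stand-in .npz file and reads it back. The arrays below are synthetic placeholders, not real pygwb output.

```python
import numpy as np

# Write a stand-in file with a subset of the keys listed above.
freqs = np.arange(20.0, 25.0, 1 / 32)  # synthetic frequency array at 1/32 Hz resolution
csd = np.random.randn(len(freqs)) + 1j * np.random.randn(len(freqs))  # synthetic CSD
np.savez("psds_csds_example.npz", freqs=freqs, csd=csd)

# Read the file back the same way one would read real pygwb output.
npzfile = np.load("psds_csds_example.npz")
loaded_freqs = npzfile["freqs"]
loaded_csd = npzfile["csd"]
```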

Note

Depending on the parameters used to run pygwb_pipe, some keys above might not have a value associated with them.

A second file contains the actual point estimate spectrum, variance spectrum, point estimate and variance. These can be found in:

point_estimate_sigma_{start_time_of_job}_{job_duration}.npz

This file can be read in similarly to the previous file, and has the following keys:

['frequencies', 'frequency_mask', 'point_estimate_spectrum', 'sigma_spectrum',
'point_estimate', 'sigma', 'point_estimate_spectrogram', 'sigma_spectrogram',
'badGPStimes', 'delta_sigma_alphas', 'delta_sigma_times', 'delta_sigma_values',
'naive_sigma_values', 'slide_sigma_values', 'ifo_1_gates', 'ifo_1_gate_pad',
'ifo_2_gates', 'ifo_2_gate_pad']

Note

Depending on the parameters used to run pygwb_pipe, some keys above might not have a value associated with them, in particular the ones related to gating and the delta sigma cut.

The file and associated keys can be read in via the same code as shown above. The 'frequencies' key contains the frequencies corresponding to those of the 'point_estimate_spectrum', which can in turn be read using the key of the same name. The spectrograms are read in analogously, but with spectrogram at the end of the key name instead of spectrum. The key 'frequency_mask' indicates which frequencies were notched, i.e., not used, in the analysis. The overall point estimate and its standard deviation can be loaded using the 'point_estimate' and 'sigma' keys.
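For intuition, the overall point estimate and sigma follow from the spectra by an inverse-variance weighted average over the unnotched frequencies. The sketch below uses synthetic numbers and the standard weighting used in stochastic searches; it is an illustration, not code taken from pygwb.

```python
import numpy as np

# Synthetic stand-ins for the arrays stored in the output file.
point_estimate_spectrum = np.array([1.0e-6, 2.0e-6, -1.0e-6, 3.0e-6])
sigma_spectrum = np.array([1.0e-6, 2.0e-6, 1.0e-6, 4.0e-6])
frequency_mask = np.array([True, True, False, True])  # False = notched, excluded

# Inverse-variance weights over the unnotched frequencies.
w = 1.0 / sigma_spectrum[frequency_mask] ** 2

# Weighted average gives the overall point estimate; the combined sigma
# follows from the sum of the weights.
point_estimate = np.sum(w * point_estimate_spectrum[frequency_mask]) / np.sum(w)
sigma = np.sqrt(1.0 / np.sum(w))
```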

The output of the data quality checks in pygwb is also saved in the same file. The output of the delta sigma cut is stored across several keys. First, one can find the times that are not allowed in the analysis, i.e., the times that do not pass the cut, using the key 'badGPStimes'. The spectral indices used for the delta sigma cut are stored in 'delta_sigma_alphas', the corresponding times in 'delta_sigma_times', and the actual values of the computed delta sigmas can be found through the 'delta_sigma_values' key. The cut computes both the naive and sliding sigma values, which are also stored, in the keys 'naive_sigma_values' and 'slide_sigma_values'.
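As a rough illustration of the idea behind the cut, the snippet below compares naive and sliding sigma values per segment and flags segments whose relative difference exceeds the cutoff. The exact definition used by pygwb may differ in detail, and the arrays here are synthetic.

```python
import numpy as np

delta_sigma_cut = 0.2  # default cutoff value of the pipeline

# Synthetic per-segment sigma values and segment start times.
naive_sigma_values = np.array([1.00e-6, 1.05e-6, 3.00e-6])
slide_sigma_values = np.array([1.02e-6, 1.04e-6, 1.00e-6])
segment_times = np.array([0.0, 192.0, 384.0])

# Relative difference between the two sigma estimates for each segment.
delta_sigma_values = np.abs(naive_sigma_values - slide_sigma_values) / slide_sigma_values

# Segments exceeding the cutoff fail the cut and are flagged as bad times.
bad = delta_sigma_values > delta_sigma_cut
badGPStimes = segment_times[bad]
```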

If gating was applied during the analysis, the gated times are saved in 'ifo_{i}_gates' where i can be 1 or 2, labeling the interferometer. The 'ifo_{i}_gate_pad' refers to the value of the parameter gate_tpad during the analysis.

To conclude, if the script was run with --pickle_out True, a pickle file will be present in the output directory, containing a pickled version of the Baseline object. This contains all the information present in the other two npz files, but allows the user to recreate a Baseline object from the pickle file. More information about how to create a Baseline from such a file can be found here.
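Such a file can be loaded with the standard library pickle module, as sketched below with a generic stand-in object; in an actual analysis, the loaded object would be the Baseline.

```python
import pickle

# Stand-in for a pickled Baseline: any picklable object round-trips the same way.
obj = {"name": "H1L1", "point_estimate": -6.5e-6}
with open("baseline_example.pickle", "wb") as f:
    pickle.dump(obj, f)

# Loading recovers the original object.
with open("baseline_example.pickle", "rb") as f:
    baseline = pickle.load(f)
```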

Warning

Saving pickle files can take up a lot of disk space. Furthermore, loading a Baseline from a pickle file can take quite some time. Working with npz files is therefore recommended when possible.

Note

Depending on the parameters used to run pygwb_pipe, the output of the script and amount of files might differ from the one described here.

This tutorial provides a brief overview of the pygwb_pipe script and how to run it for one job, i.e., a small stretch of data. In practice, however, one probably wants to analyze months, if not years, of data. To address this need, pygwb_pipe can be run on multiple jobs, i.e., different stretches of data, through parallelization using Condor (more information about Condor can be found here). The concrete implementation within the pygwb package is outlined in the following tutorial.