Tutorial Step 3: Working with Data Quality

In this tutorial, we will take a look at how data quality information is stored in LIGO data files. If you are not already comfortable with using Python to read a LIGO data file, you may want to take a look at Step 2 of this tutorial.
What's Data Quality?
For a more detailed explanation of the meaning of various flags, see the data documentation.
How is data quality information stored?

Data quality information is stored in LIGO data files as a 1 Hz time series for each category. Notice this is a different sampling rate than the 4096 Hz rate used to store strain data. So, for example, the first sample in a data quality channel applies to the first 4096 samples in the corresponding strain channel.
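This mapping between sampling rates can be made explicit with NumPy: a 1 Hz quality array can be expanded to the 4096 Hz strain rate with np.repeat, so each quality sample lines up with its 4096 strain samples. The array below is made up purely for illustration:

```python
import numpy as np

# Hypothetical 1 Hz data quality samples covering 3 seconds of data
dq_1hz = np.array([1, 0, 1])

# Expand to the 4096 Hz strain sampling rate: each 1 Hz sample
# applies to 4096 consecutive strain samples
dq_4096hz = np.repeat(dq_1hz, 4096)

print(len(dq_4096hz))  # 3 * 4096 = 12288
print(dq_4096hz[0], dq_4096hz[4096], dq_4096hz[8192])  # 1 0 1
```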
In the S5 data set, there are 18 data quality categories, as well as 5 injection categories, each represented as a 1 Hz time series. Let's print out a list of the S5 data quality channel names from the LIGO data file.
In the same directory as the data file you downloaded in Step One, create a file named dq.py and place the following code in it:

```python
import numpy as np
import matplotlib.pyplot as plt
import h5py

filename = 'H-H1_LOSC_4_V1-815411200-4096.hdf5'
dataFile = h5py.File(filename, 'r')
gpsStart = dataFile['meta']['GPSstart'][()]
dqInfo = dataFile['quality']['simple']
bitnameList = dqInfo['DQShortnames'][()]
nbits = len(bitnameList)

for bit in range(nbits):
    print(bit, bitnameList[bit])
```

Try running this code in the Python interpreter, either with run dq if you are using Canopy or IPython, or with import dq if you are not (use importlib.reload(dq) to re-run after the first import). You should see a list of data quality category names. For an explanation of each category, see the data set documentation.
All the data quality categories are combined into a bit mask and stored as an array with 4096 entries (one entry for each second of data). In the LIGO data file, this array is named DQmask, and you can extract it by adding this code to dq.py:

```python
qmask = dqInfo['DQmask'][()]
```

Each sample in this bit mask encodes all 18 data quality categories as a different digit in a binary number. A one in a particular digit means the corresponding flag is "passed", so the data is "good" at that category level, and a zero means the flag is off, so the data is "bad" at that category level. For example, a DQmask value of 000000001000000001 in S5 data would mean that the detector data is available (DATA, bit 0) and that the data passes the BURST_CAT1 criteria (bit 9), but fails all other data quality conditions.
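To see which categories a given mask sample passes, you can test each bit with a shift and a bitwise AND, just as with any integer bit mask. Here is a minimal sketch using a made-up mask value and the bit positions used later in this tutorial (DATA in bit 0, BURST_CAT1 in bit 9):

```python
# Made-up mask sample: only bits 0 (DATA) and 9 (BURST_CAT1) are set
qmask_sample = 0b1000000001

data_passed   = (qmask_sample >> 0) & 1   # DATA category: bit 0
burst1_passed = (qmask_sample >> 9) & 1   # BURST_CAT1 category: bit 9
next_bit      = (qmask_sample >> 10) & 1  # the next bit up is not set

print(data_passed, burst1_passed, next_bit)  # 1 1 0
```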
This is a compact way to store a lot of information, but to work with data quality, we'll want to put things in a more convenient form. Let's try to unpack some of this data quality information.
In most cases, you will not want to keep track of every data quality category, but only the few that are important for your search. For example, to search for transient gravitational wave sources, such as supernovae or compact object mergers, you may want to keep track of the DATA category, as well as all of the BURST categories. As an example, let's try unpacking just the DATA and BURST_CAT1 categories. Try adding this to dq.py:

```python
sci = (qmask >> 0) & 1
burst1 = (qmask >> 9) & 1
```

Now, the variable sci will be 0 everywhere the data fails the DATA category, and 1 everywhere the data passes it. Since BURST_CAT1 is stored in bit 9, the burst1 variable will correspond to that category in the same way.
In a typical case, you only want to use data where all of the
categories of interest are "passed", signaling relatively clean data.
To accomplish this, just take the logical, element-wise AND of
all the channels you need:
```python
goodData_1hz = sci & burst1
```

To confirm that goodData_1hz is on only when both the DATA and BURST_CAT1 categories are passed, we can make a plot of these three DQ channels. Add the following to dq.py:
```python
plt.plot(goodData_1hz + 4, label='Good_Data')
plt.plot(burst1 + 2, label='BURST_CAT1')
plt.plot(sci, label='DATA')
plt.axis([0, 4096, -1, 8])
plt.legend(loc=1)
plt.xlabel('Time (s)')
plt.show()  # display the figure when running as a script
```

When you run dq.py, you should see a plot of these three channels that looks like this:
Here are some things to notice in the plot:
- The channels are offset vertically so they can all be seen; each channel is either 0 or 1 at any time.
- A 1 means the category is passed, and so that second of data is good to use in your analysis.
- In this example, the "Good_Data" flag is the logical AND of the other two.
- The BURST_CAT1 flag is 0 for a short time around 1600 s into the file. You may need to zoom in on your plot to see this. So, there are 3 segments of "good data" in this file.
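Rather than zooming in on the plot, you can locate the seconds where a flag is off directly with np.where. A sketch on a made-up 1 Hz channel (the real values come from the data file):

```python
import numpy as np

# Made-up 1 Hz flag: mostly on, with a short hypothetical dropout
flag = np.ones(4096, dtype=int)
flag[1600:1604] = 0  # 4 "bad" seconds, for illustration only

# Indices (i.e. seconds into the file) where the flag is off
off_seconds = np.where(flag == 0)[0]
print(off_seconds)  # [1600 1601 1602 1603]
```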
In some cases, it is convenient to convert a data quality channel into a list of segments: the (start, stop) GPS times of each stretch of good data. The code below builds such a list from goodData_1hz by masking out the bad seconds and finding the contiguous unmasked runs:

```python
dummy = np.zeros(goodData_1hz.shape)
masked_dummy = np.ma.masked_array(dummy, np.logical_not(goodData_1hz))
segments = np.ma.flatnotmasked_contiguous(masked_dummy)
segList = [(int(seg.start + gpsStart), int(seg.stop + gpsStart)) for seg in segments]
```

We'll see examples of how to use segment lists in the next step.
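Once you have a segment list of (start, stop) GPS pairs, simple bookkeeping follows directly. For example, here is a sketch of computing the total amount of good data; the segment values below are made up for illustration:

```python
# Hypothetical segment list in the same (start, stop) GPS format as segList
segList = [(815411200, 815412000), (815412100, 815413000)]

# Duration of each segment, and total good time in seconds
durations = [stop - start for (start, stop) in segList]
total_good = sum(durations)

print(durations)   # [800, 900]
print(total_good)  # 1700
```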
You can see all the code described in this tutorial as dq.py here.
- When you are comfortable with data quality information, you can go to the next step of this tutorial: Step 4: Using the example API