Tutorial: Download Bulk-release Strain Files Programmatically

In the previous tutorial we learned how to download strain files through the website. In this tutorial we will learn how to download strain files with a Python script and GWOSC's public API.

To learn more about the API, visit the API documentation page.

Automating data discovery

Suppose we want to investigate a number of precisely timed astrophysical events and run a search in the minutes around each one. All we need are the GPS times of the events. As an example, we can take the times of a few gamma-ray bursts (GRBs) that occurred during the O1 run.

Here are the first few, written as a Python list of (GPS time, GRB name) tuples:

event_list = [
  (1126089475, "150912A"),
  (1126103088, "150912600"),
  (1126730615, "150919A"),
  (1126935466, "150922A"),
  (1126991509, "150922883"),
  (1127027273, "150923297"),
  (1127038714, "150923429"),
  (1127189385, "150925A"),
  (1127722852, "151001348"),
  (1128160518, "151006A"),
  ...
]

If the list is long, we will want to process it with a Python script. Below we take the list above and query the GWOSC API for data surrounding each event.

Fetching Segment Information: Is There Good Data?

First we will want to find out which detectors were operating at each time, or more specifically, when they were operating with good data quality. You can query this programmatically with the JSON version of the Timeline tool. The code below makes this decision: it asks for the available data segments (the DATA timeline) for H1 in the 128 seconds surrounding each event time.

import requests


def fetch_timeline(dataset, detector, timeline_id, gps_start, gps_end):
    # Query the Timeline JSON API for the segments of a data-quality
    # timeline (e.g. H1_DATA) between gps_start and gps_end
    url = (
        f"https://gwosc.org/timeline/segments/json/"
        f"{dataset}/{detector}_{timeline_id}/{gps_start}/{gps_end - gps_start}/"
    )
    response = requests.get(url)
    response.raise_for_status()
    return response.json()


def main():
    # event_list is the list of (GPS time, GRB name) tuples defined above
    for gps_time, event in event_list:
        detector = "H1"  # One of the detectors we are interested in
        timeline_data = fetch_timeline("O1", detector, "DATA", gps_time - 64, gps_time + 64)
        segments = timeline_data["segments"]
        print(f"Data segments around {gps_time} (GRB {event}): {segments}")


if __name__ == "__main__":
    main()

Thus we can select for further study only those GRBs with good data coverage from all detectors, as sketched below.
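
Here is a minimal sketch of that selection (the is_covered helper and the choice of H1 and L1 are illustrative, not part of the API; we assume each segment in the Timeline response is a [start, stop] pair):

def is_covered(segments, gps_start, gps_end):
    # True if the union of the returned segments spans the whole window
    covered = sum(
        max(0, min(seg_stop, gps_end) - max(seg_start, gps_start))
        for seg_start, seg_stop in segments
    )
    return covered >= gps_end - gps_start


good_events = []
for gps_time, event in event_list:
    start, end = gps_time - 64, gps_time + 64
    if all(
        is_covered(fetch_timeline("O1", det, "DATA", start, end)["segments"], start, end)
        for det in ("H1", "L1")
    ):
        good_events.append((gps_time, event))

print(f"{len(good_events)} of {len(event_list)} GRBs have full H1 and L1 coverage")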

Fetching the Strain Data

For those GRBs where there is good data, we can download the 4096-second HDF5 files of strain data. These files can be up to 120 MB in size.

To download a file, we first query the archive for its URL and then stream the download. Below is an example.

import requests


def fetch_strain(gps_time, detector, dataset, file_format="hdf5"):
    # Query the archive for the list of strain files covering gps_time
    fetch_url = (
        f"https://gwosc.org/archive/links/"
        f"{dataset}/{detector}/{gps_time}/{gps_time}/json/"
    )
    response = requests.get(fetch_url)
    response.raise_for_status()
    json_response = response.json()
    # Pick the file that matches the requested detector and format
    for strain_file in json_response["strain"]:
        if strain_file["detector"] == detector and strain_file["format"] == file_format:
            download_url = strain_file["url"]
            filename = download_url[download_url.rfind("/") + 1 :]
            break
    else:
        raise ValueError(f"Strain url not found for detector {detector}.")

    # Stream the file to disk in chunks to avoid holding it all in memory
    print(f"Downloading {filename}")
    with requests.get(download_url, stream=True) as r:
        r.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)


def main():
    # GPS time of a GRB
    gps_time = 1126089475
    detector = "H1"
    fetch_strain(gps_time, detector, "O1")


if __name__ == "__main__":
    main()

Now that we have the data files, we can go back to the earlier tutorials, read in the data, and run a search at the times of the GRBs.
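
For instance, here is a minimal sketch of reading one of the downloaded files, assuming the standard GWOSC HDF5 layout in which the strain vector lives in the strain/Strain dataset with Xstart and Xspacing attributes:

import h5py
import numpy as np


def read_strain(filename):
    # Read the strain time series and build the matching GPS time axis
    with h5py.File(filename, "r") as f:
        dset = f["strain/Strain"]
        strain = dset[:]
        gps_start = dset.attrs["Xstart"]  # GPS time of the first sample
        dt = dset.attrs["Xspacing"]  # sample spacing in seconds
    times = gps_start + dt * np.arange(len(strain))
    return times, strain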

Fetching Strain Statistics

For each of the 4096-second data files, statistics are available about the proportion of time (the duty cycle) that the various data quality flags are on, similar to the 128-second windows queried above from the Timeline tool. There are also gross statistics about the strain vector over the span of the file: the minimum, maximum, mean, and standard deviation, as well as band-limited RMS values in two bands (??-200 Hz and 200-1000 Hz).

import requests


def fetch_stats(gps_time, detector, dataset="O1"):
    # The per-file statistics come back in the same archive links response
    # used above to find the download URLs
    url = f"https://gwosc.org/archive/links/{dataset}/{detector}/{gps_time}/{gps_time}/json/"
    response = requests.get(url)
    response.raise_for_status()
    json_response = response.json()
    return json_response["strain"][0]


def main():
    t = 1126089475  # GPS time of a GRB
    detector = "H1"
    stats = fetch_stats(t, detector)
    print("Stats dictionary:")
    print(stats)


if __name__ == "__main__":
    main()
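
The exact field names in this dictionary are easiest to discover by inspecting a response, so the closing sketch below simply fetches the statistics for each selected event and prints whatever fields come back (good_events is the filtered list built in the coverage sketch above):

for gps_time, event in good_events:
    stats = fetch_stats(gps_time, "H1")
    print(f"GRB {event} (GPS {gps_time}):")
    for key, value in sorted(stats.items()):
        print(f"  {key}: {value}")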