Understanding what images are available for a given patient via BigQuery API

We had the following inquiry outside of the forum asking for help navigating certain aspects of IDC data. I thought the response could be of interest to the broader IDC user community.

“For a given patient, what images are there for that patient, over time? How are they related? What is the most recent one?”

To answer this kind of question, one should use BigQuery SQL against the IDC curated metadata for the hosted collections. Of course, this will require understanding of the underlying data model, and can be confusing and intimidating at first. Thus the following background before we get to the actual queries.

Background

Images are associated with a specific patient using the PatientID DICOM attribute (each attribute is extracted into the column named correspondingly in the IDC dicom_all table - see more in the documentation here). Although DICOM does not prescribe PatientID to be globally unique, in practice, for the data hosted by IDC, those identifiers should be unique.

To understand the “what images” part of the question, you need to understand the hierarchical organization of DICOM data. Patient, defined by PatientID, has one or more imaging studies, uniquely identified by the unique identifier in the StudyInstanceUID attribute. Think of the imaging study as all the images collected for a given patient during the imaging exam (between the time patients lies down on the scanner table and getting off that table).

Individual DICOM studies consist of one or more imaging series. For example, a CT imaging exam typically would include a scout image, followed by one or more cross-sectional scans and potentially contrast-enhanced scan. Each of those would typically show up as a separate DICOM series, with the series uniquely identified by the identifier in the SeriesInstanceUID attribute.

Finally, individual series for cross-sectional imaging modalities (CT, MR, PET studies) typically consist of multiple instances, where one DICOM instance corresponds to a single two-dimensional slice, which is saved as a separate file. Those instances are uniquely identified by value in the SOPInstanceUID attribute.

StudyDate and SeriesDate can be used to infer temporal aspect of the data. Note that:

  1. The dates you will find for IDC data have been shifted to reduce the risk of patient de-identification. Those are not real dates. But the intervals between the dates are expected to be preserved.
  2. In some cases you may find data that has dates completely removed. Do not be surprised by that.
  3. Under certain circumstances, you may find individual series within the study having SeriesDate different from StudyDate. This can happen when annotations for images in a given study were added at a point in time after the image acquisition. It also appears that in the pathology imaging community the convention is to assign longitudinally collected pathology slides to the same DICOM study, which is different from the conventions used in radiology (@hackermd can clarify this further I am sure!).

Queries

The easiest way to experiment with the following queries is by running them from the BigQuery console.

Count the number of studies for each distinct PatientID value, include the collection this patient belongs to

SELECT
  PatientID,
  COUNT(DISTINCT(StudyInstanceUID)) AS num_studies,
  STRING_AGG(DISTINCT(collection_id)) AS collection_id
FROM
  `bigquery-public-data.idc_current.dicom_all`
GROUP BY
  PatientID
ORDER BY
  num_studies desc

Get the list of unique study identifiers for a given patient, together with the list of modalities included, ordered by StudyDate, starting from the most recent one

SELECT
  StudyInstanceUID,
  STRING_AGG(DISTINCT(SAFE_CAST(StudyDate AS STRING))) as study_dates,
  STRING_AGG(DISTINCT(Modality)) as modalities_in_study
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  PatientID = "A860070"
GROUP BY
  StudyInstanceUID
ORDER BY
  study_dates DESC

Get the list of series identifiers and series modalities for a given study in the list above

SELECT
  DISTINCT(SeriesInstanceUID),
  Modality
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  StudyInstanceUID = "1.3.6.1.4.1.14519.5.2.1.99.1071.31546385885949769813488655621452"

Given the identifier of the series, get individual instances and locations of the corresponding files in IDC storage buckets (in this particular case, the series has DX (Digital Radiography) modality, and consists of a single instance - in the general case you can have more than one instance/file per series)

SELECT 
  gcs_url
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  SeriesInstanceUID = "1.3.6.1.4.1.14519.5.2.1.99.1071.76557368585346779276989447427568"

Now that we have the gs:// URIs, the files can be fetched using gsutil or s5cmd, as discussed in this documentation article.

If you want to see more examples and learn more about searching IDC data, you should consider going over this notebook: IDC-Examples/getting_started.ipynb at master · ImagingDataCommons/IDC-Examples · GitHub.

Please reply in this thread with any questions or comments to the above!

FYI @ACharbonneau @david.pot @dclunie

3 Likes