We had the following inquiry outside of the forum asking for help navigating certain aspects of IDC data. I thought the response could be of interest to the broader IDC user community.
“For a given patient, what images are there for that patient, over time? How are they related? What is the most recent one?”
To answer this kind of question, one should use BigQuery SQL against the IDC curated metadata for the hosted collections. Of course, this will require understanding of the underlying data model, and can be confusing and intimidating at first. Thus the following background before we get to the actual queries.
Background
Images are associated with a specific patient using the PatientID
DICOM attribute (each attribute is extracted into the column named correspondingly in the IDC dicom_all
table - see more in the documentation here). Although DICOM does not prescribe PatientID
to be globally unique, in practice, for the data hosted by IDC, those identifiers should be unique.
To understand the “what images” part of the question, you need to understand the hierarchical organization of DICOM data. Patient, defined by PatientID
, has one or more imaging studies, uniquely identified by the unique identifier in the StudyInstanceUID
attribute. Think of the imaging study as all the images collected for a given patient during the imaging exam (between the time patients lies down on the scanner table and getting off that table).
Individual DICOM studies consist of one or more imaging series. For example, a CT imaging exam typically would include a scout image, followed by one or more cross-sectional scans and potentially contrast-enhanced scan. Each of those would typically show up as a separate DICOM series, with the series uniquely identified by the identifier in the SeriesInstanceUID
attribute.
Finally, individual series for cross-sectional imaging modalities (CT, MR, PET studies) typically consist of multiple instances, where one DICOM instance corresponds to a single two-dimensional slice, which is saved as a separate file. Those instances are uniquely identified by value in the SOPInstanceUID
attribute.
StudyDate
and SeriesDate
can be used to infer temporal aspect of the data. Note that:
- The dates you will find for IDC data have been shifted to reduce the risk of patient de-identification. Those are not real dates. But the intervals between the dates are expected to be preserved.
- In some cases you may find data that has dates completely removed. Do not be surprised by that.
- Under certain circumstances, you may find individual series within the study having
SeriesDate
different fromStudyDate
. This can happen when annotations for images in a given study were added at a point in time after the image acquisition. It also appears that in the pathology imaging community the convention is to assign longitudinally collected pathology slides to the same DICOM study, which is different from the conventions used in radiology (@hackermd can clarify this further I am sure!).
Queries
The easiest way to experiment with the following queries is by running them from the BigQuery console.
Count the number of studies for each distinct PatientID
value, include the collection this patient belongs to
SELECT
PatientID,
COUNT(DISTINCT(StudyInstanceUID)) AS num_studies,
STRING_AGG(DISTINCT(collection_id)) AS collection_id
FROM
`bigquery-public-data.idc_current.dicom_all`
GROUP BY
PatientID
ORDER BY
num_studies desc
Get the list of unique study identifiers for a given patient, together with the list of modalities included, ordered by StudyDate
, starting from the most recent one
SELECT
StudyInstanceUID,
STRING_AGG(DISTINCT(SAFE_CAST(StudyDate AS STRING))) as study_dates,
STRING_AGG(DISTINCT(Modality)) as modalities_in_study
FROM
`bigquery-public-data.idc_current.dicom_all`
WHERE
PatientID = "A860070"
GROUP BY
StudyInstanceUID
ORDER BY
study_dates DESC
Get the list of series identifiers and series modalities for a given study in the list above
SELECT
DISTINCT(SeriesInstanceUID),
Modality
FROM
`bigquery-public-data.idc_current.dicom_all`
WHERE
StudyInstanceUID = "1.3.6.1.4.1.14519.5.2.1.99.1071.31546385885949769813488655621452"
Given the identifier of the series, get individual instances and locations of the corresponding files in IDC storage buckets (in this particular case, the series has DX (Digital Radiography) modality, and consists of a single instance - in the general case you can have more than one instance/file per series)
SELECT
gcs_url
FROM
`bigquery-public-data.idc_current.dicom_all`
WHERE
SeriesInstanceUID = "1.3.6.1.4.1.14519.5.2.1.99.1071.76557368585346779276989447427568"
Now that we have the gs://
URIs, the files can be fetched using gsutil
or s5cmd
, as discussed in this documentation article.
If you want to see more examples and learn more about searching IDC data, you should consider going over this notebook: https://github.com/ImagingDataCommons/IDC-Examples/blob/master/notebooks/getting_started.ipynb.
Please reply in this thread with any questions or comments to the above!