We had the following inquiry outside of the forum asking for help navigating certain aspects of IDC data. I thought the response could be of interest to the broader IDC user community.
“For a given patient, what images are there for that patient, over time? How are they related? What is the most recent one?”
To answer this kind of question, one should use BigQuery SQL against the IDC curated metadata for the hosted collections. Of course, this will require understanding of the underlying data model, and can be confusing and intimidating at first. Thus the following background before we get to the actual queries.
Images are associated with a specific patient using the
PatientID DICOM attribute (each attribute is extracted into the column named correspondingly in the IDC
dicom_all table - see more in the documentation here). Although DICOM does not prescribe
PatientID to be globally unique, in practice, for the data hosted by IDC, those identifiers should be unique.
To understand the “what images” part of the question, you need to understand the hierarchical organization of DICOM data. Patient, defined by
PatientID, has one or more imaging studies, uniquely identified by the unique identifier in the
StudyInstanceUID attribute. Think of the imaging study as all the images collected for a given patient during the imaging exam (between the time patients lies down on the scanner table and getting off that table).
Individual DICOM studies consist of one or more imaging series. For example, a CT imaging exam typically would include a scout image, followed by one or more cross-sectional scans and potentially contrast-enhanced scan. Each of those would typically show up as a separate DICOM series, with the series uniquely identified by the identifier in the
Finally, individual series for cross-sectional imaging modalities (CT, MR, PET studies) typically consist of multiple instances, where one DICOM instance corresponds to a single two-dimensional slice, which is saved as a separate file. Those instances are uniquely identified by value in the
SeriesDate can be used to infer temporal aspect of the data. Note that:
- The dates you will find for IDC data have been shifted to reduce the risk of patient de-identification. Those are not real dates. But the intervals between the dates are expected to be preserved.
- In some cases you may find data that has dates completely removed. Do not be surprised by that.
- Under certain circumstances, you may find individual series within the study having
StudyDate. This can happen when annotations for images in a given study were added at a point in time after the image acquisition. It also appears that in the pathology imaging community the convention is to assign longitudinally collected pathology slides to the same DICOM study, which is different from the conventions used in radiology (@hackermd can clarify this further I am sure!).
The easiest way to experiment with the following queries is by running them from the BigQuery console.
Count the number of studies for each distinct
PatientID value, include the collection this patient belongs to
SELECT PatientID, COUNT(DISTINCT(StudyInstanceUID)) AS num_studies, STRING_AGG(DISTINCT(collection_id)) AS collection_id FROM `bigquery-public-data.idc_current.dicom_all` GROUP BY PatientID ORDER BY num_studies desc
Get the list of unique study identifiers for a given patient, together with the list of modalities included, ordered by
StudyDate, starting from the most recent one
SELECT StudyInstanceUID, STRING_AGG(DISTINCT(SAFE_CAST(StudyDate AS STRING))) as study_dates, STRING_AGG(DISTINCT(Modality)) as modalities_in_study FROM `bigquery-public-data.idc_current.dicom_all` WHERE PatientID = "A860070" GROUP BY StudyInstanceUID ORDER BY study_dates DESC
Get the list of series identifiers and series modalities for a given study in the list above
SELECT DISTINCT(SeriesInstanceUID), Modality FROM `bigquery-public-data.idc_current.dicom_all` WHERE StudyInstanceUID = "220.127.116.11.4.1.14518.104.22.168.99.1071.31546385885949769813488655621452"
Given the identifier of the series, get individual instances and locations of the corresponding files in IDC storage buckets (in this particular case, the series has DX (Digital Radiography) modality, and consists of a single instance - in the general case you can have more than one instance/file per series)
SELECT gcs_url FROM `bigquery-public-data.idc_current.dicom_all` WHERE SeriesInstanceUID = "22.214.171.124.4.1.145126.96.36.199.99.1071.76557368585346779276989447427568"
Now that we have the
gs:// URIs, the files can be fetched using
s5cmd, as discussed in this documentation article.
If you want to see more examples and learn more about searching IDC data, you should consider going over this notebook: IDC-Examples/getting_started.ipynb at master · ImagingDataCommons/IDC-Examples · GitHub.
Please reply in this thread with any questions or comments to the above!