Understanding what images are available for a given patient via BigQuery API

We had the following inquiry outside of the forum asking for help navigating certain aspects of IDC data. I thought the response could be of interest to the broader IDC user community.

“For a given patient, what images are there for that patient, over time? How are they related? What is the most recent one?”

To answer this kind of question, one should use BigQuery SQL against the IDC curated metadata for the hosted collections. Of course, this will require understanding of the underlying data model, and can be confusing and intimidating at first. Thus the following background before we get to the actual queries.

Background

Images are associated with a specific patient using the PatientID DICOM attribute (each attribute is extracted into the column named correspondingly in the IDC dicom_all table - see more in the documentation here). Although DICOM does not prescribe PatientID to be globally unique, in practice, for the data hosted by IDC, those identifiers should be unique.

To understand the “what images” part of the question, you need to understand the hierarchical organization of DICOM data. Patient, defined by PatientID, has one or more imaging studies, uniquely identified by the unique identifier in the StudyInstanceUID attribute. Think of the imaging study as all the images collected for a given patient during the imaging exam (between the time patients lies down on the scanner table and getting off that table).

Individual DICOM studies consist of one or more imaging series. For example, a CT imaging exam typically would include a scout image, followed by one or more cross-sectional scans and potentially contrast-enhanced scan. Each of those would typically show up as a separate DICOM series, with the series uniquely identified by the identifier in the SeriesInstanceUID attribute.

Finally, individual series for cross-sectional imaging modalities (CT, MR, PET studies) typically consist of multiple instances, where one DICOM instance corresponds to a single two-dimensional slice, which is saved as a separate file. Those instances are uniquely identified by value in the SOPInstanceUID attribute.

StudyDate and SeriesDate can be used to infer temporal aspect of the data. Note that:

  1. The dates you will find for IDC data have been shifted to reduce the risk of patient de-identification. Those are not real dates. But the intervals between the dates are expected to be preserved.
  2. In some cases you may find data that has dates completely removed. Do not be surprised by that.
  3. Under certain circumstances, you may find individual series within the study having SeriesDate different from StudyDate. This can happen when annotations for images in a given study were added at a point in time after the image acquisition. It also appears that in the pathology imaging community the convention is to assign longitudinally collected pathology slides to the same DICOM study, which is different from the conventions used in radiology (@hackermd can clarify this further I am sure!).

Queries

The easiest way to experiment with the following queries is by running them from the BigQuery console.

Count the number of studies for each distinct PatientID value, include the collection this patient belongs to

SELECT
  PatientID,
  COUNT(DISTINCT(StudyInstanceUID)) AS num_studies,
  STRING_AGG(DISTINCT(collection_id)) AS collection_id
FROM
  `bigquery-public-data.idc_current.dicom_all`
GROUP BY
  PatientID
ORDER BY
  num_studies desc

Get the list of unique study identifiers for a given patient, together with the list of modalities included, ordered by StudyDate, starting from the most recent one

SELECT
  StudyInstanceUID,
  STRING_AGG(DISTINCT(SAFE_CAST(StudyDate AS STRING))) as study_dates,
  STRING_AGG(DISTINCT(Modality)) as modalities_in_study
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  PatientID = "A860070"
GROUP BY
  StudyInstanceUID
ORDER BY
  study_dates DESC

Get the list of series identifiers and series modalities for a given study in the list above

SELECT
  DISTINCT(SeriesInstanceUID),
  Modality
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  StudyInstanceUID = "1.3.6.1.4.1.14519.5.2.1.99.1071.31546385885949769813488655621452"

Given the identifier of the series, get individual instances and locations of the corresponding files in IDC storage buckets (in this particular case, the series has DX (Digital Radiography) modality, and consists of a single instance - in the general case you can have more than one instance/file per series)

SELECT 
  gcs_url
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  SeriesInstanceUID = "1.3.6.1.4.1.14519.5.2.1.99.1071.76557368585346779276989447427568"

Now that we have the gs:// URIs, the files can be fetched using gsutil or s5cmd, as discussed in this documentation article.

If you want to see more examples and learn more about searching IDC data, you should consider going over this notebook: https://github.com/ImagingDataCommons/IDC-Examples/blob/master/notebooks/getting_started.ipynb.

Please reply in this thread with any questions or comments to the above!

FYI @ACharbonneau @david.pot @dclunie

2 Likes

@fedorov I suspect this question is in the same topic space, but please feel free to fork as needed.

Regarding understanding/identifying what images are available over time (i.e. across releases), I am trying to make certain I understand my notes. I’ll summarize what I think now; please flag anything I got wrong, missed or misrepresented.


  • With respect to the hierarchy of data structures associated with a given patient: patients have attached studies (summaries of observations), which contain series (files grouped by a particular method, workflow or analysis), which contain instances (individual files).
  • Each of these data structures has (at least) two IDs within any given IDC data release: (patientID, studyInstanceUID, seriesInstanceUID and SOPInstanceUID) provided by the submitter, and (idc_case_id, crdc_study_uuid, crdc_series_uuid and crdc_instance_uuid) automatically assigned during ingest to IDC.
  • With respect to the submitter-provided IDs for these data objects: these IDs can change across IDC releases, reflecting changes in/updates to object metadata, at the submitter's discretion. So if object-associated metadata is updated, then affected submitter-provided IDs may or may not change depending on the submitter's local policy governing when and when not to update identifiers (e.g. the change was too minor to bother).
  • The IDC-assigned IDs, on the other hand, are guaranteed to change if any updates are made to metadata associated with the attached objects across releases (because DICOM bundles metadata and data and these IDs reflect/represent specific object contents, similarly to MD5 hashes for files).

I’ll stop there – was that a fair summary?

Upshot/hypothesis: there is currently no way to track the “same” series, study, etc. across successive releases using only information present in the identifiers, because data/metadata updates can trigger the issuance of new IDs.

FYR @david.pot [i’d include Amanda, but there’s a new-user restriction on how many users I can mention in a post]

Your definitions of studies/series/instances are not how they are defined in the DICOM standard (I don’t have the official definitions of those at my fingertips, so I will need to get back to you on this!), but for the most part it is a fair summary.

This will depend on the nature of change. It is highly unlikely, if not impossible, for all identifiers to change. For example, if a new series is added to an existing study, StudyInstanceUID will remain the same, and there will be a new series identified by SeriesInstanceUID, which will also have StudyInstanceUID of the parent study among its metadata.

If a series is changed, and the submitted decided that a new SeriesInstanceUID is warranted, then indeed (to my knowledge) there is no information to establish correspondence of the new newly introduced series to its predecessor.

I have to emphasize that this kind of changes, where UIDs for existing studies/series are changed are highly infrequent (I am aware of only one occasion of when that happened), although I do not have the exact number. I would say those are corner cases.

1 Like