DICOM metadata query assistance

Hey there, I’m trying to build a report containing a variety of DICOM metadata about the collections from TCIA. I haven’t gotten to the part about trying to exclude the datasets that aren’t actually in TCIA yet because I’m stuck just getting the basic query to work. Any thoughts on what’s going wrong in the SQL below? The goal is to have a series-level report that contains all of these fields in it for each series/row. If you can also show me how to exclude the stuff that isn’t TCIA data that would be great too. I’m assuming it would be to add some filter on the DOI URL to ignore things in Zenodo? Thanks!

SELECT
    SeriesInstanceUID,
    STRING_AGG(DISTINCT Modality, ', ') AS Modality,
    STRING_AGG(DISTINCT Laterality, ', ') AS Laterality,
    STRING_AGG(DISTINCT ImageLaterality, ', ') AS ImageLaterality,
    STRING_AGG(DISTINCT SeriesDescription, ', ') AS SeriesDescription,
    ARRAY_TO_STRING(ImageType, ', ') AS ImageType,
    STRING_AGG(DISTINCT AcquisitionType, ', ') AS AcquisitionType,
    STRING_AGG(DISTINCT AcquisitionNumber, ', ') AS AcquisitionNumber,
    STRING_AGG(DISTINCT ContrastBolusAgent, ', ') AS ContrastBolusAgent,
    STRING_AGG(DISTINCT ContrastBolusTotalDose, ', ') AS ContrastBolusTotalDose,
    ARRAY_TO_STRING(ConvolutionKernel, ', ') AS ConvolutionKernel,
    STRING_AGG(DISTINCT StudyDescription, ', ') AS StudyDescription,
    STRING_AGG(DISTINCT BodyPartExamined, ', ') AS BodyPartExamined,
    STRING_AGG(DISTINCT SliceThickness, ', ') AS SliceThickness,
    STRING_AGG(DISTINCT PatientPosition, ', ') AS PatientPosition,
    ARRAY_TO_STRING(ARRAY(SELECT ARRAY_TO_STRING(arr, ', ') FROM UNNEST(ImageOrientationPatient) AS arr), ', ') AS ImageOrientationPatient,
    STRING_AGG(DISTINCT Source_DOI, ', ') AS Source_DOI,
    STRING_AGG(DISTINCT collection_name, ', ') AS collection_name
FROM
    `bigquery-public-data.idc_current.dicom_all`
GROUP BY
    SeriesInstanceUID

1 Like

Justin, I recommend using BigQuery console to debug the queries: https://console.cloud.google.com/bigquery.

You were quite close - I did a couple of small changes, and this query works:

SELECT
  SeriesInstanceUID,
  STRING_AGG(DISTINCT Modality, ', ') AS Modality,
  STRING_AGG(DISTINCT Laterality, ', ') AS Laterality,
  STRING_AGG(DISTINCT ImageLaterality, ', ') AS ImageLaterality,
  STRING_AGG(DISTINCT SeriesDescription, ', ') AS SeriesDescription,
  STRING_AGG(DISTINCT(ARRAY_TO_STRING(ImageType, ', '))) AS ImageType,
  STRING_AGG(DISTINCT AcquisitionType, ', ') AS AcquisitionType,
  STRING_AGG(DISTINCT AcquisitionNumber, ', ') AS AcquisitionNumber,
  STRING_AGG(DISTINCT ContrastBolusAgent, ', ') AS ContrastBolusAgent,
  STRING_AGG(DISTINCT ContrastBolusTotalDose, ', ') AS ContrastBolusTotalDose,
  STRING_AGG(DISTINCT(ARRAY_TO_STRING(ConvolutionKernel, ', '))) AS ConvolutionKernel,
  STRING_AGG(DISTINCT StudyDescription, ', ') AS StudyDescription,
  STRING_AGG(DISTINCT BodyPartExamined, ', ') AS BodyPartExamined,
  STRING_AGG(DISTINCT SliceThickness, ', ') AS SliceThickness,
  STRING_AGG(DISTINCT PatientPosition, ', ') AS PatientPosition,
  STRING_AGG(DISTINCT(ARRAY_TO_STRING(ImageOrientationPatient,'/'))) AS ImageOrientationPatient,
  STRING_AGG(DISTINCT Source_DOI, ', ') AS Source_DOI,
  STRING_AGG(DISTINCT collection_name, ', ') AS collection_name
FROM
  `bigquery-public-data.idc_current.dicom_all`
GROUP BY
  SeriesInstanceUID

You can definitely have an approximation with the approach you mention, by doing a join on DOI (and perhaps discarding slide microscopy SM modality), but since TCIA data and DOIs are not versioned, and IDC mirrors TCIA on the release cycle, while TCIA does updates intermittently, I don’t think there is an easy way to guaranteed the metadata you extract from IDC will match the content of TCIA at any given time.

I think the only way to do this relatively confidently is by comparing the series-level hash returned by TCIA API and the one available in IDC BigQuery index in series_hash.

Thanks! I was working in the BigQuery console, but sadly even with the error messages it was providing I couldn’t figure out how to fix the issues I was running into.

but sadly even with the error messages it was providing I couldn’t figure out how to fix the issues I was running into

I feel your pain, and I agree! Those messages often are rather cryptic, but it’s the best that is available…

Glad this works for you, and let us know if you have any further questions!