I’m trying to run a query on idc-index based on an older version of the IDC dataset, using a query that works perfectly in BigQuery (see below). However, when I attempt to run it through idc-index, I’m getting numerous SQL errors that I don’t encounter in BigQuery.
Here’s the query I’m using as a reference:
WITH
idc_instances_per_series AS (
SELECT
SeriesInstanceUID,
COUNT(DISTINCT(SOPInstanceUID)) AS num_instances,
COUNT(DISTINCT(ARRAY_TO_STRING(ImagePositionPatient, "/"))) AS position_count,
MAX(SAFE_CAST(SliceThickness AS float64)) AS max_SliceThickness,
MIN(SAFE_CAST(SliceThickness AS float64)) AS min_SliceThickness,
STRING_AGG(DISTINCT(SAFE_CAST("LOCALIZER" IN UNNEST(ImageType) AS string)), "") AS has_localizer
FROM
`bigquery-public-data.idc_v14.dicom_all`
WHERE
Modality = "CT" AND
access = "Public"
GROUP BY
SeriesInstanceUID
)
SELECT
dicom_all.SeriesInstanceUID,
ANY_VALUE(dicom_all.collection_id) AS collection_id,
ANY_VALUE(dicom_all.PatientID) AS PatientID,
ANY_VALUE(idc_instances_per_series.num_instances) AS num_instances,
ANY_VALUE(CONCAT("https://viewer.imaging.datacommons.cancer.gov/viewer/", dicom_all.StudyInstanceUID, "?seriesInstanceUID=", dicom_all.SeriesInstanceUID)) AS idc_url
FROM
`bigquery-public-data.idc_v14.dicom_all` AS dicom_all
JOIN
idc_instances_per_series ON dicom_all.SeriesInstanceUID = idc_instances_per_series.SeriesInstanceUID
WHERE
idc_instances_per_series.min_SliceThickness >= 1
AND idc_instances_per_series.max_SliceThickness <= 5
AND idc_instances_per_series.num_instances > 50
AND idc_instances_per_series.num_instances / idc_instances_per_series.position_count = 1
AND has_localizer = "false"
GROUP BY
SeriesInstanceUID
My questions are:
- How do I index an old version on idc-index to achieve the same result as in BigQuery?
- What adjustments might be needed in the SQL syntax to avoid errors on idc-index?
Any insights on achieving parity between the two environments would be greatly appreciated!