How to Get SQL running on BigQuery to work with idc-index package?

I’m trying to run a query on idc-index based on an older version of the IDC dataset, using a query that works perfectly in BigQuery (see below). However, when I attempt to run it through idc-index, I’m getting numerous SQL errors that I don’t encounter in BigQuery.

Here’s the query I’m using as a reference:

WITH
  idc_instances_per_series AS (
    SELECT
      SeriesInstanceUID,
      COUNT(DISTINCT(SOPInstanceUID)) AS num_instances,
      COUNT(DISTINCT(ARRAY_TO_STRING(ImagePositionPatient, "/"))) AS position_count,
      MAX(SAFE_CAST(SliceThickness AS float64)) AS max_SliceThickness,
      MIN(SAFE_CAST(SliceThickness AS float64)) AS min_SliceThickness,
      STRING_AGG(DISTINCT(SAFE_CAST("LOCALIZER" IN UNNEST(ImageType) AS string)), "") AS has_localizer
    FROM
      `bigquery-public-data.idc_v14.dicom_all`
    WHERE
      Modality = "CT" AND
      access = "Public"
    GROUP BY
      SeriesInstanceUID
  )
SELECT
  dicom_all.SeriesInstanceUID,
  ANY_VALUE(dicom_all.collection_id) AS collection_id,
  ANY_VALUE(dicom_all.PatientID) AS PatientID,
  ANY_VALUE(idc_instances_per_series.num_instances) AS num_instances,
  ANY_VALUE(CONCAT("https://viewer.imaging.datacommons.cancer.gov/viewer/", dicom_all.StudyInstanceUID, "?seriesInstanceUID=", dicom_all.SeriesInstanceUID)) AS idc_url
FROM
  `bigquery-public-data.idc_v14.dicom_all` AS dicom_all
JOIN
  idc_instances_per_series ON dicom_all.SeriesInstanceUID = idc_instances_per_series.SeriesInstanceUID
WHERE
  idc_instances_per_series.min_SliceThickness >= 1
  AND idc_instances_per_series.max_SliceThickness <= 5
  AND idc_instances_per_series.num_instances > 50
  AND idc_instances_per_series.num_instances / idc_instances_per_series.position_count = 1
  AND has_localizer = "false"
GROUP BY
  SeriesInstanceUID

My questions are:

  1. How do I index an old version on idc-index to achieve the same result as in BigQuery?
  2. What adjustments might be needed in the SQL syntax to avoid errors on idc-index?

Any insights on achieving parity between the two environments would be greatly appreciated!

1 Like

@surajpai thank you for reaching out with this question!

Achieving parity between BigQuery and idc-index in terms of running queries is not possible.

IDC BigQuery datasets contain numerous tables, many of which are very large (e.g., dicom_all table, as of IDC v20, contains over 45M rows, >900 columns, and has the size of over 200GB!). Querying such large tables is often not practical on a laptop.

idc-index purpose is to support basic search operations by packaging a table containing a small subset of columns and aggregating metadata at the series level (to reduce the size of the table) for the main index.

If you want to run BigQuery queries, you will need to use Google BigQuery console (after completing the prerequisites in this tutorial). You can also run BigQuery queries from Python, as discussed in this tutorial.

If there are certain columns that you would like to see in idc-index, please let us know and we can discuss it. But we try to keep the tables in idc-index small to manage the size of the package, install time, and query complexity.

I realized I did not answer the specific questions you asked! :smiley:

idc-index will always correspond to the latest version of IDC. You cannot search metadata for a specific past version. The reason for this is again that we need to keep the package small, and support only the limited basic search operations. If you want to search past versions, you will need to use BigQuery.

You can, however, download files from the past versions, as defined by the manifest. If you saved manifest using IDC Portal, and the referenced DICOM series were modified later, downloading the content referenced in the manifest will give you the files for the specific version.

The tables that are packaged by idc-index are documented here: https://idc-index.readthedocs.io/en/latest/column_descriptions.html. The queries that you can run through idc-index can only use those tables in the FROM clause, and can only access the columns from those tables!