IDC May 2023 release

Today we are celebrating the v14 release of IDC data! :sparkler:

Although this release does not include any new data, it has been in the works for many months, and is indeed very special.

We are very excited to announce the new partnership of IDC with the AWS Open Data Sponsorship program! Through this partnership, we are making versioned IDC data available from AWS S3 buckets; see the IDC entry in the Registry of Open Data on AWS: National Cancer Institute Imaging Data Commons (IDC) Collections. You can now choose to download IDC data from either a GCP or an AWS storage location, depending on your compute needs and preferences. We updated our download instructions and “Getting started” tutorials to explain how to choose the download source. We thank our partners at the AWS Open Data Sponsorship program, and especially Erin Chu, for the support and collaboration :pray: ! We look forward to further expanding our collaboration and partnership with AWS to benefit cancer imaging researchers!

We re-organized the files in our storage buckets into a series-level folder hierarchy. This seemingly inconsequential change has many implications.

By introducing this hierarchy, we are now able to support data manifests that are defined as a list of series, which cuts the size of the manifest by two orders of magnitude for the most common imaging modalities.

Because the manifests defining IDC cohorts are now much smaller, we are able to generate those manifests directly from the IDC Portal. If you create and save a cohort in the IDC Portal, you now have the option to export the cohort as an s5cmd manifest (and to choose whether the data should be downloaded from GCP or AWS). You can feed that manifest directly to the s5cmd command and download the files corresponding to your cohort very efficiently. See IDC download instructions for details!
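To make the "one line per series" idea concrete, here is a minimal Python sketch of what a series-level s5cmd manifest could look like. The function name `s5cmd_manifest` is hypothetical, and the exact manifest that the IDC Portal exports may differ. The sketch assumes that s5cmd addresses objects with the `s3://` scheme even when the data is served from GCP, so `gs://` folder URLs are rewritten.

```python
def s5cmd_manifest(series_folders, dest="."):
    """Build s5cmd commands, one `cp` line per series folder.

    Hypothetical helper for illustration; not the IDC Portal's exporter.
    """
    lines = []
    for folder in series_folders:
        scheme, sep, rest = folder.partition("://")
        if not sep or scheme not in ("s3", "gs"):
            raise ValueError(f"unexpected series folder URL: {folder}")
        # Assumption: s5cmd expects s3:// URLs, so gs:// is rewritten here
        # and the storage provider is selected via the endpoint instead.
        lines.append(f"cp s3://{rest.rstrip('/')}/* {dest}")
    return "".join(line + "\n" for line in lines)


manifest = s5cmd_manifest(
    ["gs://public-datasets-idc/1be5c250-96bc-420e-a7e0-50bd23ef4e93"]
)
print(manifest, end="")
```

A manifest like this would typically be saved to a file and passed to `s5cmd run`, with the endpoint and credentials options set for the chosen provider. Please check the IDC download instructions for the exact command line rather than relying on this sketch.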

You might remember we discussed integration of the VolView viewer with IDC in the March 2023 release announcement. The process of invoking VolView was quite awkward, since the user was expected to create a manifest listing the files corresponding to all instances of a series. With the introduction of series-level folders, VolView can be pointed at the series folder, and you can generate that URL with a query. Kudos to Forrest Li at Kitware for the quick implementation of the feature allowing parameterization of VolView with the bucket folder in the URL!

For example, the following query will return the path to the series folder in the AWS bucket (and you can replace aws_url with gcp_url, if you prefer to fetch from GCP Storage) corresponding to a random CT series from the NLST collection:

SELECT
  DISTINCT(REGEXP_EXTRACT(aws_url,r'..:\/\/[a-z-]*\/[0-9a-z-]*'))
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  collection_id = "nlst"
  AND Modality = "CT"
ORDER BY
  RAND()
LIMIT
  1

For me, the gcp_url variant of this query returned gs://public-datasets-idc/1be5c250-96bc-420e-a7e0-50bd23ef4e93. Next, you can pass this folder location via the urls parameter to the VolView application to see a volume rendering of this series: https://volview.kitware.app/?urls=gs://public-datasets-idc/1be5c250-96bc-420e-a7e0-50bd23ef4e93.
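If you already have the URL of a single file, the same two steps can be sketched in Python: apply the regular expression from the query to get the series folder, then append it to the VolView address. The helper names `series_folder` and `volview_link` and the file name `example.dcm` are made up for illustration; only the pattern and the `urls` parameter come from this post.

```python
import re

# Same pattern as in the query: scheme, bucket name, then the series folder.
SERIES_FOLDER = re.compile(r"..:\/\/[a-z-]*\/[0-9a-z-]*")


def series_folder(instance_url):
    """Return the series-level folder part of an instance URL."""
    match = SERIES_FOLDER.match(instance_url)
    if match is None:
        raise ValueError(f"not a recognized bucket URL: {instance_url}")
    return match.group(0)


def volview_link(folder):
    """Build a VolView link that passes the folder via the urls parameter."""
    return "https://volview.kitware.app/?urls=" + folder


# Hypothetical instance URL; "example.dcm" is not a real IDC file name.
url = "gs://public-datasets-idc/1be5c250-96bc-420e-a7e0-50bd23ef4e93/example.dcm"
folder = series_folder(url)
print(folder)
print(volview_link(folder))
```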

If you are wondering why we did not reorganize the files into a patient/study/series hierarchy instead of just series-level folders, the answer is that IDC data is versioned. Currently, a change to any instance within a series triggers a new versioned series folder relative to the previous release. With a patient/study/series folder hierarchy, a change to a single instance would require replicating the entire patient folder, which would increase data duplication and storage costs.
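A toy calculation illustrates the difference. The patient below and its instance counts are entirely hypothetical, and the sketch assumes that a new version stores a full copy of the affected folder.

```python
# Hypothetical patient: number of instances per series, grouped by study.
patient = {
    "study-1": {"series-a": 120, "series-b": 300},
    "study-2": {"series-c": 80},
}


def copied_with_series_folders(patient, study, series):
    """Instances re-stored when only the changed series folder is versioned."""
    return patient[study][series]


def copied_with_patient_folders(patient):
    """Instances re-stored when the whole patient folder is versioned."""
    return sum(count for study in patient.values() for count in study.values())


# One instance changes in series-a of study-1.
print(copied_with_series_folders(patient, "study-1", "series-a"))  # 120
print(copied_with_patient_folders(patient))  # 500
```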

Let us know what you think about this update, and what features you would like to see prioritized in IDC!

Reminders

  • If you have any questions about IDC, you can email them to support@canceridc.dev or start a new thread in the IDC forum.
  • Please drop by IDC Office Hours to ask any questions about IDC: every Tuesday 16:30-17:30 (New York) and Wednesday 10:30-11:30 (New York) via Google Meet at https://meet.google.com/xyt-vody-tvb.
  • Free cloud credits are available for those who want to explore features of Google Cloud not included in the free tier (e.g., Cloud Compute Engine, Vertex AI, using Healthcare API for your data); apply here: Application for Imaging Data Commons pilot cloud credits.

Congratulations to the IDC team on these major accomplishments!