Thanks to the hard work of a superb team, IDC is now in production release!
Thanks to @farahank for announcing the production release!
The main highlights of the production release:
- the amount of data available increased from ~1TB in the initial pilot release to >16TB
- support for digital pathology added
- introduction of versioning to support reproducible science
- examples of use cases released to the community
- API for programmatic access to cohorts
Details on the major milestones and improvements that were accomplished by @IDC_team in less than 12 months since the initial introduction of the IDC pilot:
- the number of collections available in IDC increased from 27 (~1TB) to 113 (>16TB of data)
- we added support for DICOM digital pathology
- we added support for IDC data versioning, which means you will always be able to access the precise set of files you used in your analysis, as defined by DICOM SOPInstanceUIDs that are unique and resolvable within IDC, or by CRDC Globally Unique Identifiers (GUIDs), even if the collection(s) containing those files have been updated
- a number of analysis use cases have been developed, and are now available as Colab Notebooks demonstrating examples of how IDC data can be analyzed on the cloud
  - DeepPrognosis use case - replication study, 2 year survival score of NSCLC patients
  - Lung Nodules segmentation and prognosis use case - NSCLC patients nodules segmentation (nnU-Net) and prognosis (DeepPrognosis)
  - Thoracic Organs at Risk segmentation use case - NSCLC patients thoracic OAR segmentation (nnU-Net)
  - Tissue classification in slide microscopy images - this tutorial builds on the publication “Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning” (Coudray et al., 2018), one of the most cited pathomics publications in recent years
- IDC API is now live, enabling programmatic access to the functionality available through the IDC portal, including authenticated operations with IDC cohorts. IDC API complements Google BigQuery and Cloud Storage APIs that are available to query IDC metadata tables and retrieve files hosted in IDC
- numerous features and bug fixes were implemented in IDC Portal; most prominently, cohort support was enhanced by integrating it with IDC data versioning: no matter which version of IDC data you used to form your cohort, you will always be able to export the manifest or apply the cohort filter to the current IDC data version
- we added various examples demonstrating how Google Cloud tools can be used to enable exploration and analysis of IDC data, and reproducibility in AI research. You can learn how to:
  - set up a GCP Compute Engine VM with the desktop interface to 3D Slicer
  - use Google DataStudio to build a highly customizable dashboard to explore metadata related to your cohort beyond what can be done with IDC portal, see live dashboard here
  - use BigQuery and SQL to get quick access to all of the DICOM metadata extracted from the 36+ million (and counting) DICOM instances available in IDC
- IDC implemented security controls and gained Authority to Operate at the Federal Information Security Modernization Act (FISMA) Low level
- our launch of the IDC pilot cloud credit program was successful, with a growing number of community members onboarded and using their IDC credit allocation for their research (you can see some highlights of this work presented by IDC users in the recording of the “Infrastructure and Standards” session at the SIIM Conference on Machine Intelligence in Medical Imaging 2021)
- we had numerous presentations and tutorials at such venues as MICCAI, RSNA, ASTRO, AAPM, SIIM CMIMI, NCI Imaging Community webinar
- we published an open access manuscript with the overview and vision for IDC's role in the community, accompanied by demonstration videos highlighting some of the key functions of the system
We need input from YOU to guide our development!
Please give IDC a try: we have free cloud credits to help you get started. We welcome you to join our community and help us build this resource to benefit cancer research.
This looks like a lot of great additions to IDC. As a person working on CDA, I had one quick question regarding data versioning. We currently have a script that directly queries the BigQuery tables. Am I correct in my understanding that each data version will be stored under a different dataset (idc_v2, idc_v3, idc_v4, etc.)?
Thank you @Donovan_Ruth!
Yes, this is correct. In addition, we maintain the idc_current view, which can be used as an alias for the latest version.
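For a script that queries the BigQuery tables directly, the naming scheme described in this thread could be captured in a small helper like the one below. This is only a sketch: the project name `bigquery-public-data`, the table name `dicom_all`, and both function names are assumptions made for illustration and are not confirmed anywhere in this thread, so please substitute whatever your script already uses.

```python
# Assumed names (not confirmed in this thread): adjust to match your setup.
ASSUMED_PROJECT = "bigquery-public-data"
ASSUMED_TABLE = "dicom_all"


def _is_positive_int(value):
    """True for integers >= 1; bool is rejected even though it subclasses int."""
    return isinstance(value, int) and not isinstance(value, bool) and value >= 1


def idc_dataset(version=None):
    """Return the dataset name for an IDC data version.

    None maps to the idc_current alias; an integer N maps to idc_vN.
    """
    if version is None:
        return "idc_current"
    if not _is_positive_int(version):
        raise ValueError(f"version must be a positive integer, got {version!r}")
    return f"idc_v{version}"


def build_query(version=None, limit=10):
    """Build a SQL string that lists SOPInstanceUIDs from one IDC data version."""
    if not _is_positive_int(limit):
        raise ValueError(f"limit must be a positive integer, got {limit!r}")
    table = f"{ASSUMED_PROJECT}.{idc_dataset(version)}.{ASSUMED_TABLE}"
    return f"SELECT SOPInstanceUID FROM `{table}` LIMIT {limit}"


if __name__ == "__main__":
    print(build_query(3))   # pinned to one version, for reproducible results
    print(build_query())    # follows the latest version through the alias
```

Passing an explicit version keeps the results of an analysis stable across IDC releases, while omitting it follows the latest data through idc_current. The resulting string can be handed to whichever BigQuery client the script already uses.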