NCI Imaging Data Commons data release v16 is out, and here’s a summary of new developments we would like to highlight.
While you can see all of the new collections in our release notes, one new slide microscopy (SM) collection deserves special mention: RMS-Mutation-Prediction. It contains whole slide images in DICOM-TIFF format accompanying a recent publication from the group led by Dr. Javed Khan at NCI:
D. Milewski et al., “Predicting molecular subtype and survival of rhabdomyosarcoma patients using deep learning of H&E images: A report from the Children’s Oncology Group,” Clin. Cancer Res., vol. 29, no. 2, pp. 364–378, Jan. 2023, doi: 10.1158/1078-0432.CCR-22-1663.
Rhabdomyosarcoma (RMS) is the most common soft tissue tumor in children and adolescents/young adults, accounting for 3% of all pediatric cancers. This collection contains SM images for 403 patients enrolled in IRB-approved clinical trials or tissue banking studies from the Children’s Oncology Group. Along with the images we include clinical data covering the diagnosis, the primary site of the tumor and its histological classification.
While the images included in the collection can be explored in IDC Portal, the portal does not provide access to the collection-specific clinical information. To help you appreciate the clinical information that accompanies the images in this collection, and to demonstrate how collection-specific clinical data can be integrated into an interactive interface, we used Google DataStudio (since rebranded by Google as “Looker Studio”) to prepare a dashboard focused on RMS-Mutation-Prediction (it will take a bit of time to load, please be patient!). A dashboard like this one can be set up rather easily for any collection in IDC, allowing users who do not have expertise in SQL to explore any of the metadata attributes interactively (here is a brief tutorial on how to get started with using DataStudio).
As always, SQL-savvy users can follow our clinical data tutorial to learn how to merge clinical data with the imaging metadata.
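For readers who prefer to start from code, here is a minimal sketch of what such a merge could look like. It only builds the SQL text, so it runs without any cloud credentials. The imaging table `bigquery-public-data.idc_current.dicom_all` is the IDC public BigQuery table, but the clinical table name, the `dicom_patient_id` join column and the `rms_mutation_prediction` collection identifier used here are assumptions made for illustration; please check them against the clinical data tutorial before running the query.

```python
# Sketch only: builds a BigQuery SQL string that joins IDC imaging metadata
# with a collection-specific clinical table. Names marked ASSUMPTION are
# illustrative and should be verified against the clinical data tutorial.

IMAGING_TABLE = "bigquery-public-data.idc_current.dicom_all"

# ASSUMPTION: hypothetical clinical table name for this collection.
CLINICAL_TABLE = (
    "bigquery-public-data.idc_current_clinical."
    "rms_mutation_prediction_demographics"
)

# ASSUMPTION: collection identifier as stored in the collection_id column.
COLLECTION_ID = "rms_mutation_prediction"


def build_clinical_join_query(clinical_table: str, collection_id: str) -> str:
    """Return SQL that attaches clinical rows to series-level imaging metadata."""
    # dicom_all holds one row per DICOM instance, so DISTINCT collapses the
    # result to one row per series.
    # ASSUMPTION: the clinical table exposes the patient identifier in a
    # column named dicom_patient_id.
    return f"""
SELECT DISTINCT
  dicom.PatientID,
  dicom.StudyInstanceUID,
  dicom.SeriesInstanceUID,
  clinical.*
FROM `{IMAGING_TABLE}` AS dicom
JOIN `{clinical_table}` AS clinical
  ON dicom.PatientID = clinical.dicom_patient_id
WHERE dicom.collection_id = '{collection_id}'
""".strip()


if __name__ == "__main__":
    print(build_clinical_join_query(CLINICAL_TABLE, COLLECTION_ID))
```

To execute the query, you could paste the printed SQL into the BigQuery console, or pass it to the `google-cloud-bigquery` Python client, for example `bigquery.Client().query(sql).to_dataframe()`, which requires a Google Cloud project and the client library installed.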
Interactive live dashboard is here!
Collections hosted in IDC can be broadly categorized in three groups:
- Data that IDC replicates from an existing repository, without performing any transformation to the original representation of the data. This applies when the data is already in DICOM format in the original repository. All of the images in the radiology collections IDC replicates from TCIA belong to this category.
- Data that IDC replicates from an existing repository, after harmonization of the data representation into the DICOM format. As an example, all of the whole slide imaging collections in the TCGA and CPTAC programs are available in Genomics Data Commons and TCIA, respectively, but are stored there in vendor-specific formats, while IDC contains those images converted to the DICOM-TIFF format. Other collections in this group are those in the Human Tumor Atlas Network (HTAN) program and the Visible Human Dataset.
- Data that does not exist in any other external repository, such as RMS-Mutation-Prediction and nnU-Net-BPR-annotations.
Data in the first group often has DOIs that we reference from IDC. Referencing the DOI of the original dataset for collections in the second group is, however, misleading, since the files hosted by IDC are different from the original representation. And, of course, data from the collections in the third group will not have a DOI assigned.
To remedy this situation, we are developing a process where all datasets in the third group received by IDC will have a DOI assigned. In the future, we also plan to review all of the collections in the second group and assign DOIs to those collections that contain conversion results. This will make it easier to acknowledge the contributors of the data, and easier to link various items related to the dataset (e.g., code used in the conversion, a manuscript describing the process of creation of the dataset or its analysis, etc.).
With this release we are happy to announce that we established a new NCI Imaging Data Commons Zenodo community that will be used to maintain collections in groups 2 and 3, with Zenodo provisioning DOIs for those data descriptors.
The RMS-Mutation-Prediction collection mentioned earlier is one of the first items in that community:
Clunie, D., Khan, J., Milewski, D., Jung, H., Bowen, J., Lisle, C., Brown, T., Liu, Y., Collins, J., Linardic, C. M., Hawkins, D. S., Venkatramani, R., Clifford, W., Pot, D., Wagner, U., Farahani, K., Kim, E. & Fedorov, A. DICOM converted whole slide hematoxylin and eosin images of rhabdomyosarcoma from Children’s Oncology Group trials. (2023). doi:10.5281/zenodo.8225132.
IDC Zenodo data descriptors include either the files corresponding to the collection, if the total size of the compressed files is below 50 GB (as is the case for the nnU-Net-BPR-annotations Zenodo descriptor), or the manifests that can be used to download the corresponding files from IDC (as is the case for the RMS-Mutation-Prediction descriptor). In either case, you can interact with the data via IDC Portal or BigQuery SQL to explore, visualize and combine it with other collections.
We hope this mechanism will help us better recognize contributors of data to IDC, make those datasets even more discoverable, and improve their provenance!
NCI Imaging Data Commons Zenodo community is here!
We are continuously working to simplify the process of downloading data from IDC. With this release we introduce new UI elements that give you the command for downloading a specific series, or the manifest for a specific DICOM study. All you need is to have s5cmd installed and in your system path. You can see a demo of this functionality in the screencast below.
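If you would rather script such downloads, the sketch below shows one way a manifest for s5cmd could be assembled and turned into a command line. It is an illustration under stated assumptions, not the exact output of the portal: the helper names are hypothetical, the bucket URLs in the example are placeholders rather than real IDC locations, and the default endpoint assumes the files are read from a public AWS bucket. The `cp` and `run` commands and the `--no-sign-request` and `--endpoint-url` flags are standard s5cmd options.

```python
# Sketch only: build an s5cmd manifest (one "cp" command per series) and the
# command line that would execute it. Helper names are hypothetical.


def build_s5cmd_manifest(series_urls, dest_dir="./idc_download"):
    """Return manifest text with one 'cp <series>/* <dest>/' line per series URL."""
    # s5cmd expects a directory as the target when the source has a wildcard.
    dest = dest_dir.rstrip("/") + "/"
    lines = [f"cp {url.rstrip('/')}/* {dest}" for url in series_urls]
    return "\n".join(lines) + "\n"


def build_s5cmd_command(manifest_path, endpoint_url="https://s3.amazonaws.com"):
    """Return the argument list that runs every command in the manifest."""
    # ASSUMPTION: the data sits in a public bucket, so unsigned requests
    # (--no-sign-request) are sufficient; adjust the endpoint for other clouds.
    return [
        "s5cmd",
        "--no-sign-request",
        "--endpoint-url",
        endpoint_url,
        "run",
        manifest_path,
    ]


if __name__ == "__main__":
    # Placeholder URLs; real ones come from the IDC Portal or a BigQuery query.
    example_urls = [
        "s3://example-bucket/series-a",
        "s3://example-bucket/series-b",
    ]
    print(build_s5cmd_manifest(example_urls), end="")
    print(" ".join(build_s5cmd_command("manifest.s5cmd")))
```

To actually download, you would write the manifest text to a file and pass the argument list to `subprocess.run(..., check=True)`, with s5cmd available on your system path.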
- If you have any questions about IDC, you can email them to firstname.lastname@example.org or start a new thread in IDC forum.
- Please drop by IDC Office Hours to ask any questions about IDC: every Tuesday 16:30 – 17:30 (New York) and Wednesday 10:30-11:30 (New York) via Google Meet at https://meet.google.com/xyt-vody-tvb .
- Free cloud credits are available for those who want to explore features of Google Cloud not included in the free tier (e.g., Cloud Compute Engine, Vertex AI, using Healthcare API for your data): apply here
(as always, the live dashboard for the screenshot above is available here)