Dataset Size Discrepancies (CancerImageArchives vs. IDC)

amasquelin · September 20, 2024, 4:07pm

Hi,

Our lab is interested in downloading and mirroring some IDC datasets onto our clusters (NLST and LIDC-IDRI). However, there is some confusion about how much space the data will take. For example, the NLST is listed as having 11.14TB of CT data on CancerImageArchives, while the IDC platform shows the NLST taking 23.95TB.

Would you be able to clarify which is more accurate?

fedorov · September 20, 2024, 4:25pm

(For full disclosure, @amasquelin asked this question by email yesterday, and I encouraged Axel to post it to the forum - it definitely took me more than 15 minutes to prepare the response.)

Both numbers are accurate!

IDC and TCIA are distinct entities. While IDC hosts all of the TCIA public DICOM collections, for many of those collections IDC contains additional data not available in TCIA (mostly analysis/AI-based curation results). The IDC NLST collection contains a lot more than the CT images available in TCIA, which explains the much larger size.

In addition to the CT images you see in the TCIA NLST collection, the IDC NLST collection currently includes DICOM slide microscopy images collected by the trial, and the analysis results listed below, contributed by various initiatives. It is likely that IDC will have more analysis results for NLST in the future - this list is not expected to be static!

TotalSegmentator-CT-Segmentations: segmentations of the anatomical regions and organs for most of the NLST CT images, produced using TotalSegmentator, plus first-order and shape radiomics features for each segment
BAMF-AIMI-Annotations: AI-based volumetric segmentations for a small subset of the CT images
nnU-Net-BPR-annotations: AI-based annotation of the body part and volumetric segmentations of chest organs, produced using the nnU-Net package

You can still download just the CT images!

I prepared a (short) notebook that explains IDC NLST collection in more detail, and shows how to explore it and access individual components.

In a nutshell, if you just want to know how to do this (I encourage you to go through the notebook though), it goes like this:

Install the idc-index Python package, which simplifies access to IDC data:

pip install --upgrade idc-index

Select just the CT images from the NLST collection and download them:

from idc_index import IDCClient

# instantiate the client
client = IDCClient()

# select CT DICOM series from the NLST collection
nlst_ct_series = client.index[(client.index['collection_id'] == 'nlst') & (client.index['Modality'] == 'CT')]

# download files for the selected series
client.download_from_selection(
    seriesInstanceUID=list(nlst_ct_series["SeriesInstanceUID"]),
    downloadDir=".",
)
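Before mirroring anything, it can help to check how much of a collection's total size each modality accounts for, since that is where the TCIA and IDC numbers diverge. The following is only a minimal sketch: the helper `size_by_modality_tb` and the toy rows are hypothetical, and it assumes each index row carries a `series_size_MB` value alongside `collection_id` and `Modality` (please verify that column name against the columns of `client.index` in your installed version).

```python
from collections import defaultdict


def size_by_modality_tb(rows, collection_id):
    """Sum series sizes (in TB) per modality for one collection.

    `rows` is any iterable of dicts with the keys 'collection_id',
    'Modality' and 'series_size_MB'. The 'series_size_MB' key is an
    assumption about the index layout, not a confirmed column name.
    """
    totals_mb = defaultdict(float)
    for row in rows:
        if row["collection_id"] == collection_id:
            totals_mb[row["Modality"]] += row["series_size_MB"]
    # Convert MB to TB using decimal units (1 TB = 1,000,000 MB).
    return {modality: mb / 1e6 for modality, mb in totals_mb.items()}


# Toy rows standing in for the real index; all sizes are made up.
toy_rows = [
    {"collection_id": "nlst", "Modality": "CT", "series_size_MB": 150.0},
    {"collection_id": "nlst", "Modality": "CT", "series_size_MB": 250.0},
    {"collection_id": "nlst", "Modality": "SEG", "series_size_MB": 100.0},
    {"collection_id": "other", "Modality": "CT", "series_size_MB": 500.0},
]

print(size_by_modality_tb(toy_rows, "nlst"))
```

With the real data, the rows could come from `client.index.to_dict("records")`, again assuming the size column is present under that name. The CT entry of the result is then the number to budget for if you only mirror the CT images.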

@amasquelin you can follow the same approach with the LIDC-IDRI collection, and please post here if you need help, or to report problems with the aforementioned notebook!
