Dataset Size Discrepancies (TCIA vs. IDC)

Hi,

Our lab is interested in downloading and mirroring some IDC datasets (NLST and LIDC-IDRI) onto our clusters. However, there is some confusion about how much space the data will take. For example, NLST is listed as having 11.14TB of CT data on The Cancer Imaging Archive (TCIA), while the IDC platform shows NLST taking 23.95TB.

Would you be able to clarify which is more accurate?

(For full disclosure, @amasquelin asked this question by email yesterday, and I encouraged Axel to post it to the forum. It definitely took me more than 15 minutes to prepare the response :smiley:)

Both numbers are accurate!

IDC and TCIA are distinct entities. While IDC hosts all of the TCIA public DICOM collections, for many of those collections IDC contains additional data not available in TCIA (mostly analysis and AI-based curation results). The IDC NLST collection contains a lot more than the CT images available in TCIA, which explains the much larger size.

In addition to the CT images you see in the TCIA NLST collection, the IDC NLST collection currently includes DICOM slide microscopy images collected by the trial, and analysis results contributed by various initiatives. It is likely that IDC will have more analysis results for NLST in the future, so this list is not expected to be static!
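If you want to see where the extra terabytes come from, here is a minimal sketch that sums series sizes per modality for one collection. It assumes the idc-index table exposes a `series_size_MB` column next to `collection_id` and `Modality`, and that sizes are in decimal megabytes; please check both against your installed version. The helper names `size_by_modality` and `load_idc_rows` are just for illustration, they are not part of idc-index.

```python
from collections import defaultdict


def size_by_modality(rows, collection_id):
    """Sum series sizes per Modality for one collection, in TB, largest first.

    `rows` is an iterable of dicts with the keys 'collection_id',
    'Modality' and 'series_size_MB' (column names assumed, see above).
    """
    totals_mb = defaultdict(float)
    for row in rows:
        if row["collection_id"] == collection_id:
            totals_mb[row["Modality"]] += row["series_size_MB"]
    # Sort by total size, descending, and convert decimal MB to TB.
    ordered = sorted(totals_mb.items(), key=lambda kv: kv[1], reverse=True)
    return {modality: mb / 1e6 for modality, mb in ordered}


def load_idc_rows():
    """Hypothetical helper: pull the three needed columns from idc-index."""
    from idc_index import IDCClient  # imported lazily; requires idc-index

    client = IDCClient()
    columns = ["collection_id", "Modality", "series_size_MB"]
    return client.index[columns].to_dict("records")
```

Calling `size_by_modality(load_idc_rows(), "nlst")` should then give one total per modality, which you can compare against the CT-only figure reported by TCIA.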

You can still download just the CT images!

I prepared a (short) notebook that explains the IDC NLST collection in more detail, and shows how to explore it and access individual components.

In a nutshell, if you just want to know how to do this (I encourage you to go through the notebook though), it goes like this:

Install the idc-index Python package, which simplifies access to IDC data:

pip install --upgrade idc-index

Select just the CT images from the NLST collection and download them:

from idc_index import IDCClient

# instantiate the client
client = IDCClient()

# select CT DICOM series from the NLST collection
# (keep the full rows here; the SeriesInstanceUID column is extracted below)
index = client.index
nlst_ct_series = index[(index["collection_id"] == "nlst") & (index["Modality"] == "CT")]

# download files for the selected series into the current directory
client.download_from_selection(
    seriesInstanceUID=nlst_ct_series["SeriesInstanceUID"].tolist(),
    downloadDir=".",
)
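Since the original question was about disk space on the cluster, it may be worth checking that the selection fits before starting a multi-terabyte download. Below is a minimal sketch using only the standard library. `fits_on_disk` and its `headroom` parameter are hypothetical names, and the size is assumed to be in decimal megabytes.

```python
import shutil


def fits_on_disk(required_mb, path=".", headroom=0.10):
    """Return True if `required_mb` plus a safety margin fits at `path`.

    `required_mb` is the selection size in decimal MB (assumed unit);
    `headroom` is the extra fraction of space to keep free.
    """
    free_bytes = shutil.disk_usage(path).free
    needed_bytes = required_mb * 1_000_000 * (1 + headroom)
    return needed_bytes <= free_bytes
```

To use it, sum the per-series sizes of the series you selected (for example the `series_size_MB` column, if your idc-index version provides it) and pass the total as `required_mb`, with `path` set to your download directory.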

@amasquelin you can follow the same approach with the LIDC-IDRI collection, and please post here if you need help, or to report problems with the aforementioned notebook!
