Help with the download of CCDI images while filtering by diagnosis/embedding medium

Hello,

My name is Charlie and I’m a pathology resident at the University of Colorado Anschutz. I am hoping to get some help with downloading whole slide images for an upcoming research project.

I unfortunately have very little coding experience and am a bit confused about some of the GitHub instructions.

We are looking to download images from the CCDI study and hoping to filter these slide images by diagnosis and by embedding medium (Paraffin wax/tissue freezing medium).

I'm hoping I can get some help with this, thanks!

Best,
Charlie

Charlie, I am happy to help with this. Before we proceed, do you mind if I move your question to the public section of the user support forum, so that other users of IDC could benefit from the answer?

Sure, thanks a lot!


@charlie what you need is relatively easy to do, though it takes a little bit of Python programming. Here are the steps. I also put them into this Colab notebook for your convenience.

Install prerequisites

Assuming you have Python >= 3.10 and pip available on your system, this will install idc-index, the helper Python package that provides functionality to search and download from IDC.

pip install --upgrade idc-index

Initialize IDC client

IDCClient provides the API for search and download. This code will also fetch sm_index, a supporting table that contains metadata for slide microscopy.

from idc_index import IDCClient

idc_client = IDCClient()
idc_client.fetch_index('sm_index')

Get metadata for the CCDI-MCI collection

This retrieves the metadata needed to answer your question using an SQL query. It is also possible to do this with the pandas interface, but in my view SQL is easier to understand.

# join metadata for the CCDI-MCI collection between general purpose "index" table
# and slide microscopy specific "sm_index" table

query = """
SELECT sm_index.*
FROM sm_index
JOIN index
ON sm_index.SeriesInstanceUID = index.SeriesInstanceUID
WHERE index.collection_id = 'ccdi_mci'
"""

# select specific columns from the resulting pandas DataFrame to simplify presentation
result = idc_client.sql_query(query)[['SeriesInstanceUID',
                                      'embeddingMedium_CodeMeaning',
                                      'tissueFixative_CodeMeaning',
                                      'primaryAnatomicStructure_CodeMeaning',
                                      'primaryAnatomicStructureModifier_CodeMeaning',
                                      'admittingDiagnosis_CodeMeaning'
                                      ]]

Once you have that result dataframe, you can check the distinct values of the attributes you are interested in like this:

import numpy as np

# some columns hold arrays, which are not hashable; convert them to tuples before calling unique()
distinct_embedding = result['embeddingMedium_CodeMeaning'].apply(lambda a: tuple(a) if isinstance(a, np.ndarray) else a).unique()
distinct_fixative = result['tissueFixative_CodeMeaning'].apply(lambda a: tuple(a) if isinstance(a, np.ndarray) else a).unique()

distinct_diagnoses = result['admittingDiagnosis_CodeMeaning'].unique()
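
Before choosing a filter, it can help to see how many slides fall under each value. The sketch below is only an illustration: it uses a small made-up DataFrame standing in for result (the UIDs and diagnosis labels are invented), and normalize is a hypothetical helper. Unlike the tuple conversion above, it joins array values into a single string, purely so the counts are easier to read.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `result`; UIDs and diagnosis labels are invented for illustration.
toy = pd.DataFrame({
    "SeriesInstanceUID": ["1.2.3.1", "1.2.3.2", "1.2.3.3"],
    "embeddingMedium_CodeMeaning": [
        np.array(["Paraffin wax"]),
        np.array(["Tissue freezing medium"]),
        np.array(["Paraffin wax"]),
    ],
    "admittingDiagnosis_CodeMeaning": ["Diagnosis A", "Diagnosis B", "Diagnosis A"],
})

def normalize(value):
    """Hypothetical helper: join array-valued cells into one readable string."""
    if isinstance(value, np.ndarray):
        return ", ".join(str(v) for v in value)
    return value

# Number of series (slides) per embedding medium and per diagnosis
embedding_counts = toy["embeddingMedium_CodeMeaning"].apply(normalize).value_counts()
diagnosis_counts = toy["admittingDiagnosis_CodeMeaning"].value_counts()

print(embedding_counts)
print(diagnosis_counts)
```

With the real result dataframe, the same two value_counts() lines should apply once toy is replaced by result.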

Each column that ends in _CodeMeaning holds the human-readable meaning of a coded value. If you are interested, you can also get the actual codes in the source dataframe and match those to terminologies to establish standard nomenclature (let me know if you want to know more about this).

Download the corresponding files

You can use any of the attributes in the dataframe to select slides based on the specific criteria.

# as an example, download only the slides corresponding to a specific diagnosis to the current directory
selected = result[result['admittingDiagnosis_CodeMeaning'] == 'Papillary microcarcinoma']
idc_client.download_dicom_series(
    seriesInstanceUID=selected['SeriesInstanceUID'].tolist(),
    downloadDir='.',
)
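
Since the original question was about filtering by both diagnosis and embedding medium, here is a minimal sketch of combining the two criteria. It runs on a made-up DataFrame standing in for result (UIDs and diagnosis labels are invented), and has_value is a hypothetical helper that handles the array-valued embedding medium cells. The exact spelling of the embedding medium values is an assumption; check it against the distinct values in your own dataframe.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `result`; UIDs and diagnosis labels are invented for illustration.
toy = pd.DataFrame({
    "SeriesInstanceUID": ["1.2.3.1", "1.2.3.2", "1.2.3.3", "1.2.3.4"],
    "embeddingMedium_CodeMeaning": [
        np.array(["Paraffin wax"]),
        np.array(["Tissue freezing medium"]),
        np.array(["Paraffin wax"]),
        None,  # metadata may be missing for some slides
    ],
    "admittingDiagnosis_CodeMeaning": [
        "Diagnosis A", "Diagnosis A", "Diagnosis B", "Diagnosis A",
    ],
})

def has_value(cell, wanted):
    """Hypothetical helper: True if `wanted` appears in an array-valued cell
    or equals a scalar cell; False for missing values."""
    if isinstance(cell, (np.ndarray, list, tuple)):
        return wanted in list(cell)
    return cell == wanted

# One boolean mask per criterion, combined with & (logical AND)
is_paraffin = toy["embeddingMedium_CodeMeaning"].apply(
    lambda cell: has_value(cell, "Paraffin wax")
)
is_diagnosis = toy["admittingDiagnosis_CodeMeaning"] == "Diagnosis A"

selected_uids = toy.loc[is_paraffin & is_diagnosis, "SeriesInstanceUID"].tolist()
print(selected_uids)
```

The resulting list could then be passed as seriesInstanceUID to download_dicom_series, in the same way as in the download example above.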

The downloaded files will be organized into the collection > patient > DICOM study > DICOM series hierarchy, as shown below for the example above:

.
└── ccdi_mci
    ├── PBBUZR
    │   └── 2.25.284501970215674986517591459630441617620
    │       └── SM_1.3.6.1.4.1.5962.99.1.3842171162.1504937859.1760483827994.4.0
    │           ├── 78823b4c-6a9c-42d1-9a93-21e4a874ebc1.dcm
    │           ├── b592b1d9-3549-4e34-8c54-0595368896de.dcm
    │           ├── e03aa23e-b3e7-4d26-a9b2-4c28768d16e7.dcm
    │           ├── ee2a0121-9034-4e53-af1f-12bb05907a01.dcm
    │           └── f9d3a87d-d50e-4a50-9e2c-56c0333ea36b.dcm
    └── PBCVHK
        └── 2.25.110173333371564063542056355496660362981
            └── SM_1.3.6.1.4.1.5962.99.1.3098578027.9979142.1738265398379.4.0
                ├── 25de43e3-de6d-4b05-a41a-38bc37121ed4.dcm
                ├── 3a6d9d72-636d-4603-b950-9891df567f28.dcm
                ├── 72d8ecd1-442c-4fa8-a4f5-52de6399f54a.dcm
                ├── b85d9f7c-9d20-4cce-9ed5-7b7982dd4453.dcm
                └── ecff8e9a-d32a-4c0b-83c0-c615acd47896.dcm

7 directories, 10 files
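
To check what was downloaded, one option is to walk that hierarchy and count the files in each series folder. This is a sketch using only the Python standard library; count_files_per_series is a hypothetical helper, and it assumes the four-level collection/patient/study/series layout shown above. The demonstration runs on a temporary directory with made-up names rather than on real data.

```python
from pathlib import Path
import tempfile

def count_files_per_series(download_dir):
    """Hypothetical helper: map 'collection/patient/study/series' to the
    number of .dcm files found in that series folder."""
    root = Path(download_dir)
    counts = {}
    # Series folders sit four levels below the download directory
    for series_dir in sorted(root.glob("*/*/*/*")):
        if series_dir.is_dir():
            key = series_dir.relative_to(root).as_posix()
            counts[key] = len(list(series_dir.glob("*.dcm")))
    return counts

# Demonstration on a throwaway directory with made-up names
with tempfile.TemporaryDirectory() as tmp:
    series = Path(tmp) / "ccdi_mci" / "PATIENT1" / "study1" / "SM_series1"
    series.mkdir(parents=True)
    for name in ("a.dcm", "b.dcm"):
        (series / name).touch()
    demo_counts = count_files_per_series(tmp)
    print(demo_counts)
```

For a real download, calling count_files_per_series('.') from the download directory should list each series with its file count.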

Please let us know if this answers your question!