Downloading the histopathology slides

Respected support team, I hope this message finds you well.

We have initiated a project on triple-negative breast cancer.

We will train an AI foundation model on the histopathology slides to predict the patient’s chemotherapy response.

We have filtered the data down to 114 triple-negative breast cancer patients.

My queries are:

1) There are approximately 5 slides for each patient in the database. I don’t know which one I should select for chemotherapy response training.

2) When we download the histopathology slides (5 slides), we get approximately 40-45 DICOM files. Are the slides in the database split into smaller data files when they are downloaded?

Here is the link to one of the patient histopathology slides.

https://viewer.imaging.datacommons.cancer.gov/slim/studies/2.25.264603219603655743685261578937554222193/series/1.3.6.1.4.1.5962.99.1.1236651128.1276575975.1637619223672.2.0

Hi, and welcome to IDC!

The main thing to know upfront: IDC histopathology slides are stored in DICOM Slide Microscopy (SM) format, and each slide downloads as a folder of .dcm files rather than a single SVS or TIFF. This is expected — the files represent the different resolution levels of the image pyramid. Tools like wsidicom, TIAToolbox (DICOMWSIReader), and QuPath (v0.4+) all handle this natively by pointing them at the folder.
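
To see this layout on disk after a download, a minimal sketch like the one below counts the .dcm files in each folder. It uses only the Python standard library; the function name is hypothetical, and it assumes the download put each series in its own subfolder (which depends on the directory template you choose).

```python
from collections import Counter
from pathlib import Path


def count_dicom_files_per_folder(download_dir: str) -> dict[str, int]:
    """Count the .dcm files found in each folder below download_dir.

    Keys are folder paths relative to download_dir; files that sit
    directly in download_dir are reported under ".".
    """
    root = Path(download_dir)
    counts = Counter(
        str(path.parent.relative_to(root)) for path in root.rglob("*.dcm")
    )
    return dict(sorted(counts.items()))


# Usage (assumes slides were downloaded into ./slides):
# for folder, n_files in count_dicom_files_per_folder("./slides").items():
#     print(f"{n_files:3d} files  {folder}")
```

If 5 slides produce roughly 40-45 files, you would expect to see about 8-9 files per slide folder, which is consistent with one file per pyramid level plus a few auxiliary images.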

To find and download slides, the idc-index Python package is the easiest starting point (no authentication needed):

from idc_index import IDCClient
client = IDCClient()

# Find slide microscopy series (here: TCGA-LUAD collection)
slides = client.sql_query("""
    SELECT i.PatientID, i.SeriesInstanceUID, i.series_size_MB
    FROM index i
    WHERE i.collection_id = 'tcga_luad' AND i.Modality = 'SM'
""")

# Download — each series lands in its own subfolder
client.download_from_selection(
    seriesInstanceUID=slides["SeriesInstanceUID"].tolist(),
    downloadDir="./slides",
    dirTemplate="%collection_id/%PatientID/%SeriesInstanceUID"
)

Note that slides can be several GB each, so it’s worth checking total size (SUM(series_size_MB)) before kicking off a large download.
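
As a rough sketch of that size check, the helper below builds the aggregate query for one collection. The function name and the validation rule are assumptions made for illustration; the table and column names are the ones used in the query above.

```python
def build_size_query(collection_id: str) -> str:
    """Return SQL that totals the slide microscopy download size of a collection."""
    # Guard against accidental quoting problems: collection IDs such as
    # 'tcga_luad' are assumed to contain only letters, digits and underscores.
    if not collection_id.replace("_", "").isalnum():
        raise ValueError(f"unexpected collection_id: {collection_id!r}")
    return f"""
        SELECT COUNT(*) AS n_series,
               SUM(series_size_MB) AS total_MB
        FROM index
        WHERE collection_id = '{collection_id}' AND Modality = 'SM'
    """


# Usage (needs idc-index installed and network access):
# from idc_index import IDCClient
# client = IDCClient()
# print(client.sql_query(build_size_query("tcga_luad")))
```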

For a fuller introduction to searching and working with IDC pathology data, the Getting started with digital pathology notebook (runnable for free on Colab) is a good next step. You may also be interested in this recent post: Get started with TIAToolbox to analyze IDC pathology images.

Finally, probably the easiest way to navigate IDC and learn how to use its content is the Claude skill we announced in this post: Imaging Data Commons Claude skill launched!

Feel free to follow up here with any questions!

Thank you so much, sir. I have a few more questions regarding the slides: Does TSA always indicate a tumor slide in TCGA-BRCA? And is the DX1 slide also considered a primary tumor diagnostic slide?

Short Answer

  • No, the TSA suffix does not always indicate a tumor slide. It indicates a physical slide type (Top Slide, variant A), not tissue origin (TCGA Barcode - GDC Docs). In TCGA-BRCA, 91 out of 635 TSA slides are from normal tissue (verified via idc-index sm_index query — see evidence below).
  • DX1 slides are FFPE diagnostic slides (Janowczyk, 2016). In TCGA-BRCA, all 1,061 DX1 slides come from tumor samples (1,060 primary tumor + 1 metastatic), but this reflects the TCGA collection protocol — not a universal rule that DX means tumor.

Understanding the TCGA Slide Barcode

A TCGA slide barcode like TCGA-A2-A0T2-01Z-00-DX1 has two independent encodings (TCGA Barcode - GDC Docs):

| Barcode segment | Position | Encodes | Determines tissue type? |
| --- | --- | --- | --- |
| Sample type code | Positions 14-15 (e.g., 01) | Tissue origin: 01 = Primary Solid Tumor, 11 = Solid Tissue Normal (TCGA Sample Type Codes) | Yes |
| Slide suffix | Final segment (e.g., DX1) | Physical slide preparation type (TCGA Barcode - GDC Docs) | No |

The sample type code (not the slide suffix) determines whether a slide is tumor or normal.
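
A minimal sketch of reading the sample type code directly from a barcode string is shown below. The function names are hypothetical, the second barcode in the usage lines is made up for illustration, and the tumor/normal ranges (01-09 tumor, 10-19 normal) follow the TCGA Sample Type Codes convention.

```python
def sample_type_code(barcode: str) -> str:
    """Return the two-digit sample type code of a TCGA barcode.

    The code is the first two characters of the fourth dash-separated
    segment, e.g. '01' in 'TCGA-A2-A0T2-01Z-00-DX1'.
    """
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TCGA":
        raise ValueError(f"not a TCGA sample-level barcode: {barcode!r}")
    code = parts[3][:2]
    if len(code) != 2 or not code.isdigit():
        raise ValueError(f"no sample type code in: {barcode!r}")
    return code


def tissue_category(barcode: str) -> str:
    """Map the sample type code to 'tumor', 'normal' or 'other'."""
    code = int(sample_type_code(barcode))
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    return "other"


print(tissue_category("TCGA-A2-A0T2-01Z-00-DX1"))  # tumor
print(tissue_category("TCGA-XX-0000-11A-01-TSA"))  # normal (hypothetical barcode)
```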

Slide Suffix Meanings

The slide suffix encodes the physical preparation method and position, not the tissue origin (TCGA Barcode - GDC Docs):

| Prefix | Full Name | Preparation | Description |
| --- | --- | --- | --- |
| TS | Top Slide | Frozen section | Cut from the top of a tissue portion during surgery; adjacent to tissue used for genomic analysis (TCIA TCGA Guide) |
| BS | Bottom Slide | Frozen section | Cut from the bottom of a tissue portion (TCGA Barcode - GDC Docs) |
| MS | Middle Slide | Frozen section | Cut from the middle of a tissue portion (TCGA Barcode - GDC Docs) |
| DX | Diagnostic Slide | FFPE (formalin-fixed paraffin-embedded) | Permanent diagnostic-quality slide from clinical pathology workflow (Janowczyk, 2016) |

The letter or number after the prefix (e.g., A in TSA, 1 in DX1) indicates the slide order within that type (TCGA Barcode - GDC Docs).
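
The suffix convention can be decoded with a small lookup, sketched below. The function and dictionary names are hypothetical, and the regular expression assumes the suffix is always one of the four prefixes above followed by letters or digits.

```python
import re

# Prefix -> (full name, preparation), as listed in the table above.
PREPARATION = {
    "TS": ("Top Slide", "frozen section"),
    "BS": ("Bottom Slide", "frozen section"),
    "MS": ("Middle Slide", "frozen section"),
    "DX": ("Diagnostic Slide", "FFPE"),
}

SUFFIX_RE = re.compile(r"^(TS|BS|MS|DX)([A-Z0-9]+)$")


def decode_slide_suffix(barcode: str) -> dict[str, str]:
    """Split the final barcode segment into slide prefix and order."""
    suffix = barcode.rsplit("-", 1)[-1]
    match = SUFFIX_RE.match(suffix)
    if match is None:
        raise ValueError(f"unrecognised slide suffix: {suffix!r}")
    prefix, order = match.groups()
    name, preparation = PREPARATION[prefix]
    return {"prefix": prefix, "name": name, "preparation": preparation, "order": order}


print(decode_slide_suffix("TCGA-A2-A0T2-01Z-00-DX1"))
# {'prefix': 'DX', 'name': 'Diagnostic Slide', 'preparation': 'FFPE', 'order': '1'}
```

Note that the result describes only how the slide was prepared; it deliberately says nothing about tumor versus normal tissue.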

Evidence from TCGA-BRCA Data in IDC

The following counts were obtained by querying TCGA-BRCA slides in IDC using sm_index.ContainerIdentifier and sm_index.primaryAnatomicStructureModifier_CodeMeaning via idc-index 0.11.10.

Note: The ContainerIdentifier column was added to sm_index in idc-index 0.11.10. If you are using an older version, upgrade first: pip install --upgrade idc-index

TSA appears on both tumor AND normal slides

Sample type 01 (Primary Solid Tumor):  542 TSA slides
Sample type 06 (Metastatic):             2 TSA slides
Sample type 11 (Solid Tissue Normal):   91 TSA slides

TSA is not a tumor indicator. It appears on 91 normal tissue slides in TCGA-BRCA.

DX slides appear only on tumor samples (in TCGA-BRCA)

Sample type 01 (Primary Solid Tumor): 1060 DX1 slides, 67 DX2, 4 DX3, 1 DX4
Sample type 06 (Metastatic):             1 DX1 slide
Sample type 11 (Normal):                 0 DX slides

DX slides in TCGA-BRCA are exclusively from tumor samples. However, this reflects the TCGA collection protocol (diagnostic slides were prepared for tumor specimens), not a universal rule that DX = tumor.

Normal tissue slides are predominantly TS/BS types

Normal slides by type: TSA (91), TSB (93), TSC (54), TSD (28), TS1 (27), TS2 (19),
                       TS3 (18), BSA (18), BSB (9), TS4 (9), TSE (8), TS5 (6), ...

Normal tissue in TCGA-BRCA was primarily submitted as frozen sections (TS/BS), not diagnostic slides (DX).

Correct Way to Identify Tumor vs Normal

Always use the sample type code or structured DICOM metadata — never rely on slide suffix alone.

Approach 1: Use primaryAnatomicStructureModifier_CodeMeaning from sm_index

This column contains structured tissue type from DICOM specimen metadata (idc-index indices reference).

from idc_index import IDCClient
client = IDCClient()
client.fetch_index("sm_index")

# Count TCGA-BRCA slides and patients per tissue type
client.sql_query("""
    SELECT
        s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
        COUNT(*) as slide_count,
        COUNT(DISTINCT i.PatientID) as patient_count
    FROM sm_index s
    JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
    WHERE i.collection_id = 'tcga_brca'
    GROUP BY tissue_type
    ORDER BY slide_count DESC
""")
# Returns: Neoplasm, Primary (2704), Normal (399), None (8 metastatic)

Approach 2: Parse the TCGA barcode sample type code from ContainerIdentifier

The ContainerIdentifier column in sm_index stores the TCGA slide barcode (idc-index indices reference). Sample type codes are defined in the TCGA Sample Type Codes table: 01-09 = tumor, 10-19 = normal.

# Extract sample type code from barcode
client.sql_query("""
    SELECT
        SUBSTRING(SPLIT_PART(s.ContainerIdentifier, '-', 4), 1, 2) as sample_type_code,
        s.primaryAnatomicStructureModifier_CodeMeaning as tissue_type,
        COUNT(*) as slide_count
    FROM sm_index s
    JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
    WHERE i.collection_id = 'tcga_brca'
    GROUP BY sample_type_code, tissue_type
    ORDER BY sample_type_code
""")
# Returns: 01 → Neoplasm, Primary (2704), 06 → None (8), 11 → Normal (399)
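
If you decide to restrict training to FFPE diagnostic (DX) slides from primary tumor samples, the two conditions can be combined in one query. The sketch below only builds the SQL text. The function name is hypothetical, and it assumes that PatientID values follow the TCGA-XX-XXXX pattern and that ContainerIdentifier holds the full six-segment slide barcode. Whether DX slides are the right choice for your model is a project decision, not something the barcode settles.

```python
import re

PATIENT_ID_RE = re.compile(r"^TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}$")


def build_dx_primary_query(patient_ids) -> str:
    """Return SQL listing DX slides of primary tumor samples (code 01)."""
    ids = list(patient_ids)
    if not ids:
        raise ValueError("patient_ids must not be empty")
    bad = [p for p in ids if not PATIENT_ID_RE.match(p)]
    if bad:
        raise ValueError(f"unexpected patient IDs: {bad}")
    id_list = ", ".join(f"'{p}'" for p in ids)
    return f"""
        SELECT i.PatientID, s.ContainerIdentifier,
               i.SeriesInstanceUID, i.series_size_MB
        FROM sm_index s
        JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
        WHERE i.collection_id = 'tcga_brca'
          AND i.PatientID IN ({id_list})
          AND SUBSTRING(SPLIT_PART(s.ContainerIdentifier, '-', 4), 1, 2) = '01'
          AND SPLIT_PART(s.ContainerIdentifier, '-', 6) LIKE 'DX%'
        ORDER BY i.PatientID, s.ContainerIdentifier
    """


# Usage (needs idc-index installed and network access):
# from idc_index import IDCClient
# client = IDCClient()
# client.fetch_index("sm_index")
# candidates = client.sql_query(build_dx_primary_query(["TCGA-A2-A0T2"]))
```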


This response was prepared using the imaging-data-commons Claude Code skill and verified against IDC data version v23 with idc-index 0.11.10.

The DICOM attributes and coded values can be used to determine this.

If you don’t want to use the DICOM information and prefer to interpret the TCGA identifiers instead, you need to understand the TCGA “barcode” convention. The TCGA documentation only describes frozen sections (“top slide”, “bottom slide”, etc.); for FFPE slides you need to know that DX means FFPE (not frozen). Andrey J explains it here. See also our conversion notes, where you will find a descriptive example of the DICOM attributes.

As for the tissue type (tumor or not), this is described in the part of the barcode that is the “sample type”, documented by GDC here. We use that code to select an anatomy modifier that is a child of PrimaryAnatomicStructureSequence in the DICOM attributes.