IDC December 2023 release: more annotations and new viewer integrations

Just in time for the holidays, we are happy to share the news about Imaging Data Commons data release v17.

Our best wishes for peace, love and joy in the New Year! :christmas_tree: :tada: :dove:

New data: BAMF AIMI Annotations

As of this release, IDC brings you >50 TB (>43M image files) of publicly available image data!

Back in February, we shared with you the funding opportunity announcement soliciting applications to generate annotations for the imaging collections in IDC. Leidos Biomedical Research awarded that contract to a research group at BAMF Health, and the IDC v17 release brings you the annotations produced under that contract!

The BAMF group, led by Jeff Van Oss, selected a subset of IDC images covering a range of modalities and cancer types, trained a range of nnU-Net-based models, applied those models to the selected cases, and completed quality checks for a subset of the AI-annotated cases, correcting the results when necessary. You can use the direct link to the annotated cases to see the results for yourself: https://portal.imaging.datacommons.cancer.gov/explore/filters/?analysis_results_id=BAMF-AIMI-Annotations. The details of how the analysis was done are described in this preprint, and all of the code and weights are publicly available under the BAMF GitHub organization, accompanied by Colab notebooks! We also prepared a custom dashboard specifically for this new collection, which is available here and is demonstrated in the video below.

New features: OHIF v3 and VolView are integrated with the IDC Portal

You can now choose from several viewers when looking at the radiology images in IDC. In addition to OHIF Viewer v2, which has been around since the initial release of IDC, we now also provide direct integration with OHIF Viewer v3 and Kitware VolView. We plan to gradually phase out OHIF Viewer v2, since it is no longer supported. We are actively working with the core OHIF team to implement the currently missing v2 features in v3; you can see the status of this work in this GitHub Project board.

See the demo of how you can choose a specific viewer in the video below. If you see any regressions, please open a thread in this forum so we can investigate!

Reminders

  • If you have any questions about IDC, you can start a new thread in the IDC forum (preferred) or email them to support@canceridc.dev.
  • Please drop by IDC Office Hours to ask any questions about IDC: every Tuesday 16:30-17:30 (New York) and Wednesday 10:30-11:30 (New York) via Google Meet at https://meet.google.com/xyt-vody-tvb.

Summary

(as always, the live dashboard for the screenshot above is available here)

Congratulations to the IDC and the BAMF teams and all the best wishes for the New Year!

I forgot to include the link to the data release notes: there are quite a few new collections added in this release!

See https://learn.canceridc.dev/data/data-release-notes#v17-december-2023; the list is also copied below for convenience.


New radiology collections

  1. CMB-AML
  2. CT-Phantom4Radiomics
  3. EA1141
  4. ReMIND
  5. Vestibular-Schwannoma-MC-RC

New analysis results

  1. BAMF-AIMI-Annotations

    Collections analyzed:

    1. ACRIN-NSCLC-FDG-PET
    2. Anti-PD-1-Lung
    3. LUNG-PET-CT-Dx
    4. NSCLC Radiogenomics
    5. ProstateX
    6. QIN-Breast
    7. RIDER Lung PET-CT
    8. TCGA-KIRC
    9. TCGA-LIHC
    10. TCGA-LUAD
    11. TCGA-LUSC
  2. Prostate-MRI-US-Biopsy-DICOM-Annotations

    Collections analyzed:

    1. Prostate-MRI-US-Biopsy

Revised radiology collections

  1. Prostate-MRI-US-Biopsy
  2. CMB-CRC
  3. CMB-GEC
  4. CMB-LCA
  5. CMB-MEL
  6. CMB-MML
  7. CMB-PCA
  8. CPTAC-CCRCC
  9. CPTAC-PDA

New clinical metadata tables

  1. ea1141_demographics
  2. ea1141_mri
  3. ea1141_risk_model
  4. ea1141_screening
  5. ea1141_status_12mo
  6. ea1141_status_6mo
  7. ea1141_tomosynthesis
  8. htan_ohsu_demographics
  9. htan_vanderbilt_demographics
  10. htan_vanderbilt_diagnosis
  11. htan_vanderbilt_exposure
  12. htan_vanderbilt_familyhistory
  13. htan_vanderbilt_followup
  14. htan_vanderbilt_moleculartest
  15. htan_vanderbilt_therapy
  16. remind_clinical

Congrats! Regarding that dashboard, would it be possible to extend it with another section showing which DICOM elements are populated most frequently? That would help users understand which ones are useful to query and which are likely to be empty. Or, if you think most others wouldn’t find this useful, is there a query we could run to determine that? cc: @JohnFreymann

Justin, this is an interesting idea, but it is a project in itself.

There are certain attributes that are expected to always be non-empty, such as PatientID, SOPInstanceUID, SeriesInstanceUID.

The IDC BigQuery index is a superset of all attributes across all of the files in IDC. For a given DICOM instance, the elements that are expected to be populated, or that might be populated, depend on the SOPClass/Modality. Listing the attributes that are non-empty across all classes may not be very helpful, since those counts will be biased by the number of instances of the individual DICOM object classes.
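
To illustrate that bias, here is a minimal Python sketch comparing the pooled populated fraction with the per-class fractions. The counts are made up purely for illustration (they are hypothetical, not IDC statistics), and the helper name is mine:

```python
# Hypothetical instance counts, made up for illustration (NOT IDC statistics).
# Each entry: object class -> (instances with the attribute populated, total instances)
counts = {
    "CT Image Storage": (0, 900),
    "MR Image Storage": (90, 100),
}


def populated_fraction(populated, total):
    """Fraction of instances in which the attribute is non-empty."""
    return populated / total if total else 0.0


# Pooled over all classes: dominated by whichever class has the most instances.
pooled_populated = sum(p for p, _ in counts.values())
pooled_total = sum(t for _, t in counts.values())
pooled = populated_fraction(pooled_populated, pooled_total)

# Per class: shows where the attribute is actually populated.
per_class = {name: populated_fraction(p, t) for name, (p, t) in counts.items()}

print(f"pooled: {pooled:.2f}")
for name, frac in per_class.items():
    print(f"{name}: {frac:.2f}")
```

With these made-up counts, the pooled figure makes the attribute look almost always empty, even though it is populated in most instances of the one class where it applies.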

It might make more sense to list the attributes that are non-empty for each object class/modality, but I am not sure how useful that would be. I would expect the exploration to go in the other direction: given a specific question about the data, I would first identify the attributes that may help answer it, and then check which of those are not empty.

Checking empty vs non-empty counts is rather easy with BigQuery SQL (help from Perplexity.AI to write the query is gratefully acknowledged). Here’s the query for Modality, which, not surprisingly, is always populated:

-- Count instances where Modality is empty (NULL) vs. populated
SELECT
    COUNT(CASE WHEN Modality IS NULL THEN 1 END) AS null_count,
    COUNT(CASE WHEN Modality IS NOT NULL THEN 1 END) AS not_null_count
FROM `bigquery-public-data.idc_current.dicom_all`

Then if we look, for example, into EchoTime, which is specific to MR imaging, it becomes a bit more interesting:

-- Same counts for EchoTime, restricted to instances with Modality = 'MR'
SELECT
    COUNT(CASE WHEN EchoTime IS NULL THEN 1 END) AS null_count,
    COUNT(CASE WHEN EchoTime IS NOT NULL THEN 1 END) AS not_null_count
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE Modality = 'MR'

Modality is not always a reliable filter, since images derived from MR may keep the same modality without necessarily containing acquisition parameters. We can filter on SOPClassUID instead (1.2.840.10008.5.1.4.1.1.4 is MR Image Storage). This brings null_count a bit lower, but it is still quite large:

-- Same counts, selecting instances by SOP Class rather than by Modality
SELECT
    COUNT(CASE WHEN EchoTime IS NULL THEN 1 END) AS null_count,
    COUNT(CASE WHEN EchoTime IS NOT NULL THEN 1 END) AS not_null_count
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE SOPClassUID = '1.2.840.10008.5.1.4.1.1.4'

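If you want to try variations of these queries for several attributes at once, broken down per SOP Class, one option is to generate the SQL with a small script. This is only a rough sketch, not something IDC provides: the function name build_population_query is hypothetical, it just builds the query text (you would still paste it into the BigQuery console or run it with a client of your choice), and it assumes the attributes are plain scalar columns of dicom_all, since an IS NULL check would not be meaningful for repeated fields.

```python
import re

# Table used in the queries above.
TABLE = "bigquery-public-data.idc_current.dicom_all"


def build_population_query(attributes, table=TABLE):
    """Return BigQuery SQL counting NULL / non-NULL values of each attribute per SOPClassUID."""
    if not attributes:
        raise ValueError("need at least one attribute")
    select_parts = ["SOPClassUID", "COUNT(*) AS instance_count"]
    for attr in attributes:
        # Only plain column names are accepted, since they are pasted into the SQL text.
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", attr):
            raise ValueError(f"not a simple column name: {attr!r}")
        select_parts.append(f"COUNTIF({attr} IS NULL) AS {attr}_null_count")
        select_parts.append(f"COUNTIF({attr} IS NOT NULL) AS {attr}_not_null_count")
    select_clause = ",\n    ".join(select_parts)
    return (
        f"SELECT\n    {select_clause}\n"
        f"FROM `{table}`\n"
        "GROUP BY SOPClassUID\n"
        "ORDER BY instance_count DESC"
    )


# Example: the two attributes discussed in this thread.
query = build_population_query(["Modality", "EchoTime"])
print(query)
```

Grouping by SOPClassUID gives one row per object class, so the empty vs. non-empty counts are not dominated by the classes with the most instances.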

Are there specific questions you wanted to answer with this information, @kirbyju?

Thanks Andrey, these are great points. @JohnFreymann raised this question on one of our recent TCIA calls. It would probably be easiest to have him explain his motivations on one of our upcoming IRCoCo calls, but I agree that the answers he’s looking for probably need to be assessed on a per-modality or per-SOP-Class basis. I’ll put a reminder on the agenda and play around with some variations on these queries in the meantime.