TCGA-GBM tutorial notebook

Hi,
We’re working on a tutorial notebook that will focus on a deep learning task using the TCGA-GBM (and probably LGG) cohorts, and we’re planning on having it ready soon.

In the meantime, I’ve created a temporary one that can be found here.
The rationale behind this notebook is to select a subset of studies from a cohort using the study date as the criterion. Specifically, I wanted to select the first study for each subject, since I’m interested only in the pre-surgery ones.
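
As a rough illustration of that selection step, here is a minimal Python sketch. It is not taken from the notebook: it assumes the study metadata is already available as a list of records with `PatientID`, `StudyInstanceUID` and `StudyDate` fields, and the function name and example values are made up for illustration.

```python
from datetime import date


def earliest_study_per_patient(rows):
    """Map each PatientID to the row with the earliest StudyDate.

    If a patient has several studies on the same date, the first one
    encountered in `rows` is kept.
    """
    earliest = {}
    for row in rows:
        pid = row["PatientID"]
        if pid not in earliest or row["StudyDate"] < earliest[pid]["StudyDate"]:
            earliest[pid] = row
    return earliest


# Made-up example records (hypothetical IDs, UIDs and dates).
studies = [
    {"PatientID": "P-001", "StudyInstanceUID": "1.2.3.1", "StudyDate": date(2001, 5, 14)},
    {"PatientID": "P-001", "StudyInstanceUID": "1.2.3.2", "StudyDate": date(2001, 3, 2)},
    {"PatientID": "P-002", "StudyInstanceUID": "1.2.3.3", "StudyDate": date(1999, 11, 30)},
]

first_studies = earliest_study_per_patient(studies)
for pid, row in sorted(first_studies.items()):
    print(pid, row["StudyInstanceUID"], row["StudyDate"])
```

The same idea extends to other criteria by changing the comparison inside the loop.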

Some questions and comments:

  • Is there a better way to do it, for example using BigQuery? I’m a SQL novice; I’ve looked around but couldn’t find a way to select only the earliest study for each subject using BigQuery. If not, should we keep this notebook as a quick reference for selecting studies based on the study date? (I guess it could be extended to other criteria.)
  • Is there a way to order the studies in the IDC portal using the study date? If not, I think it could be a useful feature for the portal.
  • After selecting the example studies in the notebook, I found some extra DICOM files that are visible neither in the IDC portal list nor in the OHIF viewer (see attachment). I tried opening those files with ITK-SNAP but got an error message (see attachment). What are these files?

Thanks!

Here’s a query that returns the earliest StudyDate for each PatientID in the tcga_gbm collection:

SELECT
  PatientID,
  COUNT(DISTINCT(StudyDate)) AS studies_cnt,
  ARRAY_AGG(DISTINCT(StudyDate) ORDER BY StudyDate ASC)[OFFSET (0)] AS earliest_study_date,
  # full sorted list of StudyDate, for error checking
  #ARRAY_AGG(DISTINCT(StudyDate) ORDER BY StudyDate ASC) AS all_studies_dates
FROM
  `canceridc-data.idc_views.dicom_all`
WHERE
  collection_id = "tcga_gbm"
GROUP BY
  PatientID
ORDER BY
  studies_cnt DESC


No, ordering the studies by date in the portal is not currently possible. As you can see, there is no date in the studies table. I think we discussed showing the date in the list, but decided to wait for a use case, which we now have (thank you!). I submitted a feature request, which you can track as issue #604 in ImagingDataCommons/IDC-WebApp on GitHub ("Support sorting by date in studies/series tables"). I am not sure when it will be prioritized, but I agree it is a good feature to have.

The extra files are puzzling, since I do not see those series in the explorer page, and I also cannot get them via the direct BigQuery query below:

SELECT
  DISTINCT(SeriesDescription)
FROM
  `canceridc-data.idc_views.dicom_all`
WHERE
  collection_id = "tcga_gbm"
  AND StudyInstanceUID LIKE "%837912142545"


Those are probably segmentations, stored as DICOM Segmentation objects, but they come from a separate analysis collection ("DICOM-SEG Conversions for TCGA-LGG and TCGA-GBM Segmentation Datasets", listed on the Cancer Imaging Archive Wiki) and should not have been included. I will need to look into your notebook to see how you got those files.

ITK-SNAP does not support DICOM SEG. I do not know if the documentation is sufficient, but you can read more about this format, and the tools that support it, in the "Derived objects" section of the IDC User Guide.

Thanks! I wouldn’t have gotten there with my current understanding of SQL.

Could it depend on the way I’ve built the URL? This is the link I’ve used:
gs://idc-tcia-tcga-gbm/dicom/1.3.6.1.4.1.14519.5.2.1.2783.4001.867649433395377496837912142545

I appended the StudyInstanceUID to the fixed part of the URL.

Thanks!


Oh yes - this explains it! We did replicate those analysis results in IDC, but for some reason (which no one remembers!) we had to exclude those from the data release. They will be included in a future data release.

For anything persistent, you should not rely on accessing the data by scanning the storage bucket. On this note, in the upcoming data release all DICOM files will be in a flat namespace in the buckets, so the assumption that an entire study can be retrieved as a storage bucket folder will no longer hold.

Given your use case needs, you can probably work with those segmentations using this backdoor channel, and then switch to the official location once we include those segmentations in the data release (before publicizing the notebook).

Please let me know if you need further help working with those segmentations!


Does it mean that in the future we won’t be able to download the data by specifying the StudyInstanceUID?

I’d like to build the notebook so that it is robust to future changes in IDC. What would be the recommended way to access the data, and how does my current approach differ from the ideal one?

Thanks!

You won’t be able to download the entire study by recursively copying a folder defined by StudyInstanceUID, because in the revised storage bucket layout that will be used in the upcoming data release there will be no hierarchical study/series/instance organization.

The recommended approach is to resolve the location of the files corresponding to the individual DICOM instances via BigQuery, and copy each instance separately instead of copying the study-level folder.
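
To make this more concrete, here is a minimal Python sketch of the per-instance approach. It only builds the query text and writes a manifest of file URLs; actually running the query is left to whichever BigQuery client you use. The column name `gcs_url` is an assumption (check the table schema for the column that holds the file location), the helper names are hypothetical, and the example URLs are placeholders rather than real IDC paths.

```python
import os
import tempfile

TABLE = "canceridc-data.idc_views.dicom_all"  # table used earlier in this thread
URL_COLUMN = "gcs_url"  # ASSUMPTION: column holding each instance's gs:// URL


def build_instance_url_query(study_instance_uid):
    """Build a query that lists one file URL per DICOM instance in a study."""
    return (
        f"SELECT {URL_COLUMN}\n"
        f"FROM `{TABLE}`\n"
        f"WHERE StudyInstanceUID = '{study_instance_uid}'"
    )


def write_manifest(urls, path):
    """Write one URL per line, de-duplicated; return the number of URLs written."""
    unique_urls = sorted(set(urls))
    with open(path, "w") as handle:
        for url in unique_urls:
            handle.write(url + "\n")
    return len(unique_urls)


study_uid = "1.3.6.1.4.1.14519.5.2.1.2783.4001.867649433395377496837912142545"
query = build_instance_url_query(study_uid)
print(query)

# Placeholder URLs standing in for the query result (not real IDC paths).
example_urls = [
    "gs://example-bucket/instance-a.dcm",
    "gs://example-bucket/instance-b.dcm",
    "gs://example-bucket/instance-a.dcm",
]
manifest_path = os.path.join(tempfile.mkdtemp(), "manifest.txt")
n_written = write_manifest(example_urls, manifest_path)
print(n_written, "URLs written to", manifest_path)
```

A manifest like this can then be piped to `gsutil -m cp -I <destination>`, which reads the list of URLs to copy from standard input, so each instance is copied individually and nothing depends on the folder layout of the bucket.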
