TCGA-GBM tutorial notebook

giemmecci · April 26, 2021, 12:51am

Hi,
We’re working on a tutorial notebook that will focus on a deep learning task using the TCGA-GBM (and probably LGG) cohorts, and we’re planning on having it ready soon.

In the meantime, I’ve created a temporary one that can be found here.
The rationale behind this notebook is to select a subset of studies from a cohort using the study date as the criterion. Specifically, I wanted to select the first study for each subject, since I’m interested only in the pre-surgery ones.

Some questions and comments:

Is there a better way to do it, for example using BigQuery? I’m a SQL novice; I’ve tried looking around but couldn’t find a way to select only the earliest study for each subject using BigQuery. If not, should we keep this notebook as a quick reference for selecting studies based on the study date? (I guess it could be extended to other criteria.)
Is there a way to order the studies in the IDC portal by study date? If not, I think it could be a useful feature for the portal.
After selecting the example studies in the notebook, I found some extra DICOM files that are visible neither in the IDC portal list nor in the OHIF viewer (see attachment). I’ve tried opening those files with ITK-SNAP but got an error message (see attachment). What are these files?

Thanks!

fedorov · April 26, 2021, 3:00am

Here’s a query that returns the earliest StudyDate for each PatientID in the tcga_gbm collection:

SELECT
  PatientID,
  COUNT(DISTINCT(StudyDate)) AS studies_cnt,
  ARRAY_AGG(DISTINCT(StudyDate) ORDER BY StudyDate ASC)[OFFSET (0)] AS earliest_study_date,
  # full sorted list of StudyDate, for error checking
  #ARRAY_AGG(DISTINCT(StudyDate) ORDER BY StudyDate ASC) AS all_studies_dates
FROM
  `canceridc-data.idc_views.dicom_all`
WHERE
  collection_id = "tcga_gbm"
GROUP BY
  PatientID
ORDER BY
  studies_cnt DESC

Result: (screenshot of the query output not preserved)
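
For readers more comfortable with Python than SQL, the grouping logic of the query above can be mimicked on a handful of in-memory rows. This is only an illustrative sketch: the patient IDs and dates are made up, `earliest_study_per_patient` is a hypothetical helper, and nothing here talks to BigQuery.

```python
from collections import defaultdict
from datetime import date

# Made-up (PatientID, StudyDate) pairs. Real rows would come from dicom_all,
# where one study contributes many rows (one per DICOM instance).
ROWS = [
    ("PATIENT-A", date(2001, 3, 14)),
    ("PATIENT-A", date(2001, 3, 14)),  # second instance of the same study
    ("PATIENT-A", date(2000, 11, 2)),
    ("PATIENT-B", date(1999, 7, 30)),
]


def earliest_study_per_patient(rows):
    """Return (PatientID, studies_cnt, earliest_study_date) tuples."""
    dates_by_patient = defaultdict(set)
    for patient_id, study_date in rows:
        # The set plays the role of DISTINCT(StudyDate).
        dates_by_patient[patient_id].add(study_date)
    summary = [
        (patient_id, len(dates), min(dates))
        for patient_id, dates in dates_by_patient.items()
    ]
    # Mirrors ORDER BY studies_cnt DESC; ties are broken by PatientID here
    # only to make the output deterministic.
    return sorted(summary, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for patient_id, studies_cnt, earliest in earliest_study_per_patient(ROWS):
        print(patient_id, studies_cnt, earliest.isoformat())
```

Like the query, this counts distinct dates rather than distinct studies, so two studies acquired on the same day would be counted once.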

Regarding sorting studies by date in the portal: no, this is not possible. As you can see, there is no date in the studies table. I think we discussed showing the date in the list, but decided to wait for a use case, which we now have (thank you!). I submitted a feature request, and you can track it here: Support sorting by date in studies/series tables · Issue #604 · ImagingDataCommons/IDC-WebApp · GitHub. I am not sure when it will be prioritized, but I agree it is a good feature to have.

Regarding the extra DICOM files: this is puzzling, since I do not see those series in the explorer page, and I also cannot get those series with the direct BigQuery query below:

SELECT
  DISTINCT(SeriesDescription)
FROM
  `canceridc-data.idc_views.dicom_all`
WHERE
  collection_id = "tcga_gbm"
  AND StudyInstanceUID LIKE "%837912142545"

Result: (screenshot of the query output not preserved)

Those are probably segmentations, stored as DICOM Segmentation objects, but they are from a separate analysis collection (here: https://wiki.cancerimagingarchive.net/display/DOI/DICOM-SEG+Conversions+for+TCGA-LGG+and+TCGA-GBM+Segmentation+Datasets), and should not have been included. I will need to look into your notebook to see how you got those files.

ITK-SNAP does not support DICOM SEG. I do not know if this documentation is sufficient, but you can read more about this format, and the tools that support it, here: Derived objects - IDC User Guide.

giemmecci · April 26, 2021, 5:06am

Thanks! I wouldn’t have gotten there with my current understanding of SQL.

Could it depend on the way I built the URL? This is the link I used:
gs://idc-tcia-tcga-gbm/dicom/1.3.6.1.4.1.14519.5.2.1.2783.4001.867649433395377496837912142545

I appended the StudyInstanceUID to the fixed part of the URL.

Thanks!

fedorov · April 26, 2021, 2:13pm

Oh yes - this explains it! We did replicate those analysis results in IDC, but for some reason (which no one remembers!) we had to exclude those from the data release. They will be included in a future data release.

For anything persistent, you should not rely on being able to access the data by scanning the storage bucket (on this note, in the upcoming data release, all DICOM files will be in a flat namespace in the buckets, so this assumption of being able to get the entire study as a storage bucket folder will not work).

Given your use case needs, you can probably work with those segmentations using this backdoor channel, and then switch to the official location once we include those segmentations in the data release (before publicizing the notebook).

Please let me know if you need further help working with those segmentations!

giemmecci · May 10, 2021, 12:43am

Does it mean that in the future we won’t be able to download the data specifying the StudyInstanceUID?

I’d like to build the notebook so that it would be robust to future changes of the IDC; what would be the recommended way to access the data? And how is the current approach I’m using different from the ideal one?

Thanks!

fedorov · May 10, 2021, 1:49pm

You won’t be able to download the entire study by recursively copying a folder defined by StudyInstanceUID, because in the revised storage bucket layout that will be used in the upcoming data release there will be no hierarchical study/series/instance organization.

The recommended approach is to resolve the location of the files corresponding to the individual DICOM instances via BigQuery, and copy each instance separately instead of copying the study-level folder.
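
A rough sketch of that workflow in Python, under a few assumptions: that `dicom_all` exposes a `gcs_url` column holding the per-instance file location, that you run the generated query yourself (in the BigQuery console or with a client library), and that you feed the resulting list to `gsutil -m cp -I`, which reads the URLs to copy from stdin. `build_gcs_url_query` and `manifest_text` are hypothetical helper names, and the example bucket URLs are placeholders.

```python
import re

# Table name taken from the queries earlier in this thread.
DICOM_ALL = "canceridc-data.idc_views.dicom_all"
# DICOM UIDs consist of dot-separated groups of digits.
UID_PATTERN = re.compile(r"[0-9]+(\.[0-9]+)*")


def build_gcs_url_query(study_instance_uid, table=DICOM_ALL):
    """Return SQL listing the file location of every instance in one study."""
    # Reject anything that is not a plain UID before putting it into SQL.
    if not UID_PATTERN.fullmatch(study_instance_uid):
        raise ValueError(f"not a valid DICOM UID: {study_instance_uid!r}")
    # Assumption: the table has a `gcs_url` column (one row per instance).
    return (
        "SELECT gcs_url\n"
        f"FROM `{table}`\n"
        f'WHERE StudyInstanceUID = "{study_instance_uid}"'
    )


def manifest_text(gcs_urls):
    """Return one gs:// URL per line, de-duplicated and sorted."""
    urls = sorted({url for url in gcs_urls if url.startswith("gs://")})
    return "".join(f"{url}\n" for url in urls)


if __name__ == "__main__":
    uid = "1.3.6.1.4.1.14519.5.2.1.2783.4001.867649433395377496837912142545"
    print(build_gcs_url_query(uid))
    # Placeholder URLs standing in for the query result.
    print(manifest_text(["gs://example-bucket/a.dcm",
                         "gs://example-bucket/b.dcm"]), end="")
```

With the manifest saved as `manifest.txt`, the copy itself could then be something like `cat manifest.txt | gsutil -m cp -I ./downloads`. Because each instance is addressed individually, this does not depend on any folder layout inside the bucket.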

giemmecci · June 2, 2021, 2:37pm

Hi, here is a notebook that uses the TCGA-GBM cohort; it is still in development but it addresses some of the issues encountered during data cleaning/quality control.

This is the CSV used to load the genomic information (I will insert the link directly in the notebook once it is finalized).

GCP-optimized
Colab-optimized

I’ll update the links with the fully functional notebooks as soon as they are ready, thanks!

fedorov · June 2, 2021, 3:56pm

@david.pot is this an opportunity to explore the utility of CDA instead of relying on a CSV of unknown provenance?

david.pot · June 2, 2021, 7:18pm

Yes it is. Let me introduce CDA, the “Cancer Data Aggregator”, to you @giemmecci. Please take a look first at: CDA on CRDC website. This gives you context on what the CDA does and where it fits. Next, please try out CDA’s “alpha” release (released in March of this year) by exploring the CDA MVP testing guide.

The use case is that CDA can be queried with the TCGA identifiers you have found with the IDC GBM images, and CDA can return the related clinical data. We would love to have your feedback and hear about your experiences with the CDA.

giemmecci · June 2, 2021, 9:37pm

Thanks! I’ll start testing the CDA for my notebook.

giemmecci · June 4, 2021, 4:40pm

@fedorov, Benjamin Yan (@Interion ) just joined our lab as an intern, and since he will be working with publicly available data, we thought it could be a great opportunity to have him work on the IDC.

Would it be possible to add him to our current Google Cloud project? He has already been in the lab in the past, so he has previous experience working with imaging and cloud resources.

Thanks!

fedorov · June 4, 2021, 4:43pm

Sure, no worries! @Interion can you please fill out the form here, and mention in the comments that you would like to be added to Brad’s @bje project?

fedorov · June 4, 2021, 10:00pm

Gian Marco, thank you for sharing this. The issue you identified is something we have been thinking about earlier. It is very important to be able to identify specific kinds of series, especially for feeding existing analysis tools (this was the motivation that triggered our earlier discussions on this topic), and you identified additional aspects of data that need to be captured for the benefit of subsequent use.

What we had in mind was to start with additional public BigQuery table(s) that would contain series- and potentially study-level annotations that would be available to the users. As we gain more experience with what needs to be captured, how it should be captured, and as we have more data annotated in this way, we could explore various options for a more persistent location for those attributes (what @dclunie and I discussed so far was storing those as attributes in legacy enhanced converted objects for MR/CT/PT, or in a separate Structured Report object within the same study).

Based on our discussions so far, and your suggestions, we could consider capturing the following at the level of series:

inherent imaging contrast (e.g., T1/T2/FLAIR for MR)
acquisition plane (i.e., axial, coronal, sagittal)
presence of contrast, or timing with respect to contrast (potentially, also administration route for the contrast)
presence of artifacts or some kind of quality assessment (this of course can become rather controversial!)
timing with respect to some event (e.g., pre- or post-surgical scan)

At least initially, this would be a manual process, and complexity will likely depend on the specific collection and task.

I also think that to do it right, we would probably (eventually!) need to have some mechanism of attribution (both in order to give credit to the contributor, but also to help with quality control), and some level of documentation of how this annotation was done. In a sense, such annotation is just another type of image-derived data, not unlike segmentation!
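
To make the idea concrete, one row of such a series-level annotation table might look like the sketch below. The field names and value vocabularies are assumptions for illustration only, not an agreed IDC schema, and the UID and annotator are placeholders.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class SeriesAnnotation:
    """Hypothetical series-level annotation record (not an IDC schema)."""
    SeriesInstanceUID: str
    imaging_contrast: str   # e.g. "T1", "T2", "FLAIR"
    acquisition_plane: str  # "axial", "coronal" or "sagittal"
    contrast_agent: str     # e.g. "none", "post-contrast"
    artifacts: str          # free-text quality note, empty if none seen
    timing: str             # e.g. "pre-surgery", "post-surgery"
    annotator: str          # attribution: who made the annotation
    method: str             # how it was done, e.g. "manual review"


def to_csv(annotations):
    """Serialize annotations to CSV text, one row per series."""
    buffer = io.StringIO()
    names = [f.name for f in fields(SeriesAnnotation)]
    writer = csv.DictWriter(buffer, fieldnames=names)
    writer.writeheader()
    for annotation in annotations:
        writer.writerow(asdict(annotation))
    return buffer.getvalue()


if __name__ == "__main__":
    example = SeriesAnnotation(
        SeriesInstanceUID="1.2.3.4",  # placeholder, not a real series
        imaging_contrast="T1",
        acquisition_plane="axial",
        contrast_agent="none",
        artifacts="",
        timing="pre-surgery",
        annotator="annotator-1",
        method="manual review",
    )
    print(to_csv([example]), end="")
```

A flat CSV like this could be loaded into a BigQuery table and joined to the image metadata on SeriesInstanceUID, while the annotator and method columns carry the attribution and documentation mentioned above.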

If you have any annotations like the above that you have already done and are willing to share this, please let me know - this will help us get started. Would be great to hear your thoughts.

giemmecci · June 7, 2021, 7:19pm

Thanks!
Yes, I did some basic data cleaning for the notebook that I can share: imaging contrast, pre/post surgery, and presence of artifacts (although it is limited to the 10 exams I picked for the tutorial).

What would be the best way to share this information with you? I have no experience in working with legacy enhanced converted objects or Structured Report.

Thanks!

PS
I’m not sure who the “ideal” target user of the IDC is, but, for example, it took me some time to learn how to obtain the earliest exam for each subject of a given cohort using BigQuery, since I had no previous experience with SQL.
I think it could be beneficial, if feasible, to allow users to select “only earliest” or “only latest” exams for a given cohort without the need to code it in BigQuery, since I think these would be common requests; from a “technical” point of view, would it be possible to have these options available directly in the IDC search configuration when building the cohort? (the same way we can check the box for specific “Modality”, “Primary site location”, etc.). And if so, do you think it could be a useful enough feature to justify the effort to make it available?

Thanks!

fedorov · June 7, 2021, 7:36pm

You can just share the link to the spreadsheet. Do you plan to keep working on it?

I submitted an issue, we will discuss and evaluate whether it can be prioritized.

fedorov · June 7, 2021, 7:45pm

Forgot to include the issue pointer: Allow selection of the latest/earliest study when multiple studies are present · Issue #640 · ImagingDataCommons/IDC-WebApp · GitHub …

giemmecci · June 7, 2021, 8:06pm

Thanks for submitting the issue!

If you mean working on the notebook, yes (I’m sorry it’s taking some time, but I need to stretch it between other projects); if instead you mean performing quality control for the whole TCGA-GBM cohort, I wasn’t planning to do it.

I’ve created this spreadsheet as an example to report issues. If it looks good to you, I can complete it with the issues found for the studies I evaluated; otherwise, let me know if I should capture more/different information.

Thanks!

fedorov · June 7, 2021, 8:31pm

Great, thanks! I actually meant whether you plan to work on the curation spreadsheet. Thank you for sharing what you have.

On a separate note - do you insist on the GPL license for your contribution here GitHub - giemmecci/IDC: Examples of use cases of the IDC portal ? We definitely prefer non-restrictive open source licenses to maximize reuse for everyone.

giemmecci · June 7, 2021, 9:16pm

Yes, I will definitely be interested in working on it; does the spreadsheet I shared reflect what you are looking for? I will modify it to include all the issues I found with the other cases I’ve analyzed so that it will include other scenarios (series with missing description, duplicates, etc.).

No, not really; I can change it to a different license. What would work best for you?

Thanks!

wlongabaugh · June 9, 2021, 12:26am

@fedorov, @giemmecci: @Interion has been added to the project.
