TCGA-GBM tutorial notebook

Yes, it is. Let me introduce the CDA, the “Cancer Data Aggregator”, to you, @giemmecci. Please first take a look at: CDA on CRDC website. This gives you context on what the CDA is and what it does. Next, please try out CDA’s “alpha” release (released in March of this year) by exploring the CDA MVP testing guide.

The use case is that CDA can be queried with the TCGA identifiers you found for the IDC GBM images, and CDA returns the related clinical data. We would love to hear your feedback and experiences with the CDA, so please let us know what they are.


Thanks! I’ll start testing the CDA for my notebook.

@fedorov, Benjamin Yan (@Interion) just joined our lab as an intern, and since he will be working with publicly available data, we thought it could be a great opportunity to have him work on the IDC.

Would it be possible to add him to our current Google Cloud project? He has already been in the lab in the past, so he has previous experience working with imaging and cloud resources.


Sure, no worries! @Interion can you please fill out the form here, and mention in the comments that you would like to be added to Brad’s @bje project?

Gian Marco, thank you for sharing this. The issue you identified is something we have been thinking about for some time. It is very important to be able to identify specific kinds of series, especially for feeding existing analysis tools (this was the motivation that triggered our earlier discussions on this topic), and you identified additional aspects of the data that need to be captured for the benefit of subsequent use.

What we had in mind was to start with additional public BigQuery table(s) that would contain series- and potentially study-level annotations that would be available to the users. As we gain more experience with what needs to be captured, how it should be captured, and as we have more data annotated in this way, we could explore various options for a more persistent location for those attributes (what @dclunie and I discussed so far was storing those as attributes in legacy enhanced converted objects for MR/CT/PT, or in a separate Structured Report object within the same study).

Based on our discussions so far, and your suggestions, we could consider capturing the following at the level of series:

  • inherent imaging contrast (e.g., T1/T2/FLAIR for MR)
  • acquisition plane (e.g., axial, coronal, sagittal)
  • presence of contrast, or timing with respect to contrast (potentially, also administration route for the contrast)
  • presence of artifacts or some kind of quality assessment (this of course can become rather controversial!)
  • timing with respect to some event (e.g., pre- or post-surgical scan)

At least initially, this would be a manual process, and complexity will likely depend on the specific collection and task.

I also think that to do it right, we would probably (eventually!) need to have some mechanism of attribution (both in order to give credit to the contributor, but also to help with quality control), and some level of documentation of how this annotation was done. In a sense, such annotation is just another type of image-derived data, not unlike segmentation!
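To make the idea more concrete, here is a minimal sketch of what one row of such a series-level annotation table could look like, including the attribution and documentation fields mentioned above. All field names and allowed values are hypothetical and are not an existing IDC schema; the only standard item is the DICOM SeriesInstanceUID used as the key, and the UID in the example is made up.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional

# Hypothetical controlled vocabularies, for illustration only.
IMAGING_CONTRASTS = {"T1", "T2", "FLAIR"}
ACQUISITION_PLANES = {"axial", "coronal", "sagittal"}
EVENT_TIMINGS = {"pre-surgery", "post-surgery"}


@dataclass
class SeriesAnnotation:
    """One manually curated annotation row, keyed by DICOM SeriesInstanceUID."""
    series_instance_uid: str
    imaging_contrast: Optional[str] = None        # e.g. "T1"
    acquisition_plane: Optional[str] = None       # e.g. "axial"
    contrast_agent_present: Optional[bool] = None
    has_artifacts: Optional[bool] = None
    timing_vs_event: Optional[str] = None         # e.g. "pre-surgery"
    annotator: Optional[str] = None               # attribution / credit
    method_note: Optional[str] = None             # how the annotation was done

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the row looks consistent."""
        problems = []
        if not self.series_instance_uid:
            problems.append("series_instance_uid is required")
        checks = [
            ("imaging_contrast", self.imaging_contrast, IMAGING_CONTRASTS),
            ("acquisition_plane", self.acquisition_plane, ACQUISITION_PLANES),
            ("timing_vs_event", self.timing_vs_event, EVENT_TIMINGS),
        ]
        for name, value, allowed in checks:
            # None means "not annotated", which is allowed.
            if value is not None and value not in allowed:
                problems.append(f"{name}={value!r} is not one of {sorted(allowed)}")
        return problems


# Made-up example row (the UID is a placeholder, not a real series).
example = SeriesAnnotation(
    series_instance_uid="1.2.3.4",
    imaging_contrast="FLAIR",
    acquisition_plane="axial",
    contrast_agent_present=False,
    has_artifacts=False,
    timing_vs_event="pre-surgery",
    annotator="hypothetical-user",
    method_note="visual review of the series",
)
example_row = asdict(example)  # plain dict, ready to export as a table row
```

Rows like this could be exported to a spreadsheet or loaded into a table; leaving a field as None distinguishes "not annotated" from an explicit negative answer.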

If you have already made any annotations like the above and are willing to share them, please let me know, as this will help us get started. It would be great to hear your thoughts.

Yes, I did some basic data cleaning for the notebook that I can share: defining the imaging contrast, pre/post surgery status, and presence of artifacts (although this is limited to the 10 exams I picked for the tutorial).

What would be the best way to share this information with you? I have no experience in working with legacy enhanced converted objects or Structured Report.


I’m not sure who the “ideal” target user of the IDC is, but, for example, it took me some time to learn how to obtain the earliest exam for each subject of a given cohort using BigQuery, since I had no previous experience with SQL.
I think it would be beneficial, if feasible, to let users select “only earliest” or “only latest” exams for a given cohort without having to code it in BigQuery, since I expect these to be common requests. From a technical point of view, would it be possible to offer these options directly in the IDC search configuration when building the cohort (the same way we can check the box for a specific “Modality”, “Primary site location”, etc.)? And if so, do you think the feature would be useful enough to justify the effort of making it available?
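For anyone in the same situation, the "earliest/latest exam per subject" selection can also be sketched in plain Python once the study rows have been fetched. This is only a sketch under assumptions: the rows are assumed to be dicts carrying the DICOM attributes PatientID, StudyInstanceUID and StudyDate (as a `datetime.date`), the function name is hypothetical, and the example rows are made up.

```python
from datetime import date


def select_study_per_patient(rows, which="earliest"):
    """Keep one study per patient: the one with the earliest or latest StudyDate.

    rows: iterable of dicts with keys PatientID, StudyInstanceUID, StudyDate.
    Returns a dict mapping PatientID to the selected row.
    """
    if which not in ("earliest", "latest"):
        raise ValueError("which must be 'earliest' or 'latest'")
    chosen = {}
    for row in rows:
        best = chosen.get(row["PatientID"])
        if best is None:
            # First study seen for this patient.
            chosen[row["PatientID"]] = row
        elif which == "earliest" and row["StudyDate"] < best["StudyDate"]:
            chosen[row["PatientID"]] = row
        elif which == "latest" and row["StudyDate"] > best["StudyDate"]:
            chosen[row["PatientID"]] = row
    return chosen


# Made-up rows for illustration; these are not real TCGA-GBM identifiers.
example_rows = [
    {"PatientID": "PAT-A", "StudyInstanceUID": "1.1", "StudyDate": date(2001, 5, 2)},
    {"PatientID": "PAT-A", "StudyInstanceUID": "1.2", "StudyDate": date(2001, 3, 9)},
    {"PatientID": "PAT-B", "StudyInstanceUID": "2.1", "StudyDate": date(1999, 7, 1)},
    {"PatientID": "PAT-A", "StudyInstanceUID": "1.3", "StudyDate": date(2002, 1, 15)},
]

earliest = select_study_per_patient(example_rows, "earliest")
latest = select_study_per_patient(example_rows, "latest")
```

When two studies of the same patient share a date, this sketch keeps whichever one it saw first; a real cohort may need an explicit tie-breaking rule.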


You can just share the link to the spreadsheet. Do you plan to keep working on it?

I submitted an issue, we will discuss and evaluate whether it can be prioritized.

Forgot to include the issue pointer: Allow selection of the latest/earliest study when multiple studies are present · Issue #640 · ImagingDataCommons/IDC-WebApp · GitHub

Thanks for submitting the issue!

If you mean working on the notebook, yes (I’m sorry it’s taking some time, but I need to fit it in between other projects 🙂); if instead you mean performing quality control for the whole TCGA-GBM cohort, I wasn’t planning to do that.

I’ve created this spreadsheet as an example of how to report issues. If it looks good to you, I can complete it with the issues found in the studies I evaluated; otherwise, let me know if I should capture more or different information.


Great, thanks! I actually meant whether you plan to work on the curation spreadsheet. Thank you for sharing what you have.

On a separate note: do you insist on the GPL license for your contribution here: GitHub - giemmecci/IDC: Examples of use cases of the IDC portal? We definitely prefer non-restrictive open source licenses to maximize reuse for everyone.

Yes, I will definitely be interested in working on it; does the spreadsheet I shared reflect what you are looking for? I will modify it to include all the issues I found with the other cases I’ve analyzed so that it will include other scenarios (series with missing description, duplicates, etc.).

No, not really; I can change it to a different license. What would work best for you?


@fedorov, @giemmecci: @Interion has been added to the project.

I’m not sure if it contains information that can be useful for the portal, but a dataset cleaning effort was performed on the TCGA-GBM/LGG cohorts, described here: Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features - PubMed

Differentiation of pre vs. post-surgery cases was one of the aims; some metadata are included here and they point to a separate TCIA cohort that contains only the pre-surgery sub-cohort here

@giemmecci yes, I am aware of that publication, and we were looking at it.

For some reason, I can’t edit the original message, so a brief update on the Colab version:

@fedorov pointed out that Docker is not supported in Colab, and the segmentation tool used in the notebook needs Docker. So I’ll try to find a way around this issue.


Hi, I’ve been trying to access my VM for a while, but I keep receiving this error message:

How can I solve this issue?

EDIT: it seems to be working now. I had the same issue in the past, but it resolved within 5-10 minutes; this time it took much longer. What causes this difference?


Thanks for your work with the CDA.
Unfortunately, I wasn’t able to retrieve the information I’m looking for.

What I’d like to achieve is: given a cohort (e.g., TCGA-GBM), I’d like the list of all the subjects that show a mutation in a specific gene (e.g., IDH1).

The same search using the GDC platform gives back this:

None of the fields available in cdapython seem to report this information. Am I missing something? Thanks!

I ran into this myself in the past, and usually after some time the resources become available. You can also try changing the zone for your VM, as discussed in this post.

Thanks for giving CDA a try! I will let @david.pot respond as to whether this behavior is expected, and if there are plans to revise it in the future.

Good morning @giemmecci. You are not missing anything: this functionality is not in CDA at this time. We will be looking into it, as this data comes from parsing and indexing open access MAF (mutation) files, rather than from the APIs available from GDC. Your use case is very valuable to CDA, and thank you for your efforts using it! Do you have other suggestions or issues with CDA? You have exactly the background and research needs that CDA is trying to serve. David.
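Until this is available in CDA, a local workaround can be sketched as follows, assuming you have already downloaded an open access MAF file for the cohort. This is not a CDA or GDC API call: the function name is hypothetical and the example content is made up. It relies only on the standard MAF columns Hugo_Symbol and Tumor_Sample_Barcode, and on the TCGA convention that the first 12 characters of a barcode identify the patient.

```python
import csv


def patients_with_mutation(maf_text, gene):
    """Return the sorted TCGA patient IDs with at least one record for `gene`.

    maf_text: the full content of a tab-separated MAF file as a string.
    """
    # MAF files may begin with comment lines starting with '#'; skip them.
    lines = [ln for ln in maf_text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t")
    patients = set()
    for record in reader:
        if record["Hugo_Symbol"] == gene:
            # The first 12 characters of a TCGA barcode are the patient ID.
            patients.add(record["Tumor_Sample_Barcode"][:12])
    return sorted(patients)


# Made-up MAF content for illustration; the barcodes are not real samples.
EXAMPLE_MAF = (
    "#version 2.4\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\tVariant_Classification\n"
    "IDH1\tTCGA-00-0001-01A-11D-0000-08\tMissense_Mutation\n"
    "TP53\tTCGA-00-0002-01A-11D-0000-08\tMissense_Mutation\n"
    "IDH1\tTCGA-00-0003-01A-11D-0000-08\tMissense_Mutation\n"
    "IDH1\tTCGA-00-0001-01A-11D-0000-08\tSilent\n"
)

idh1_patients = patients_with_mutation(EXAMPLE_MAF, "IDH1")
```

Note that this counts every record for the gene, including silent variants; filtering on Variant_Classification would be needed to restrict the list to specific mutation types.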

7 posts were split to a new topic: Unable to re-run the notebook