How to modify an existing IDC cohort

giemmecci · March 30, 2021, 4:33pm

Hi,

Hopefully, I didn’t miss this in the documentation: how can I modify a cohort I created in the IDC portal down to the series level? For example, if a patient has multiple studies and I want to get rid of one study (or one or more series in a given study) because of motion artifacts, is there a way to do it directly in the portal (for example, using the OHIF viewer)?

Related to the question, is there a way to modify an existing IDC portal cohort from a notebook instance (e.g., Colab, or a VM instance from Google Cloud Platform)?
For example, let’s say I generated a cohort using the IDC portal (Cohort A, containing multiple studies for each patient); I then load this cohort on a notebook instance using BigQuery, do some programmatic actions on the cohort (e.g., select only one study per patient based on some criteria), and generate Cohort B, a subset of Cohort A.
Is there a way to modify the original IDC Cohort A in the portal to match the content of Cohort B?

Thanks!

fedorov · March 30, 2021, 6:25pm

Thank you for these comments, and you found just the right place to submit this request!

The ability to define a cohort at the level of individual cases/studies/series was discussed, and we plan to have it in the portal, but it is not yet on the roadmap. Your voice adds to the importance of that feature. We have a related issue in the portal issue tracker here:

As for modifying an existing portal cohort from a notebook: no, this is not possible. We would need to discuss if and how this could be supported. It is probably most expedient to store the result of your cohort modification as a list of SOPInstanceUIDs in a BigQuery table. This way you can access and share it, but it would not help you access the cohort from the portal. I agree you raise a good point, and I have also thought for some time that this feature would be important to users.
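
The workaround of keeping the modified cohort as a list of SOPInstanceUIDs can be sketched as follows. This is a minimal illustration, assuming the rows of Cohort A have already been pulled into the notebook as dictionaries keyed by the DICOM attributes StudyInstanceUID, SeriesInstanceUID and SOPInstanceUID. The helper names `filter_cohort` and `write_uid_list`, the sample identifiers and the one-column CSV layout are hypothetical; uploading the resulting file to a BigQuery table is left out.

```python
import csv
import io


def filter_cohort(rows, excluded_studies=(), excluded_series=()):
    """Return the rows of Cohort A that survive the exclusions (Cohort B)."""
    excluded_studies = set(excluded_studies)
    excluded_series = set(excluded_series)
    return [
        row for row in rows
        if row["StudyInstanceUID"] not in excluded_studies
        and row["SeriesInstanceUID"] not in excluded_series
    ]


def write_uid_list(rows, handle):
    """Write a header plus one unique SOPInstanceUID per line to a text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["SOPInstanceUID"])
    for uid in sorted({row["SOPInstanceUID"] for row in rows}):
        writer.writerow([uid])


# Made-up identifiers standing in for real DICOM UIDs.
cohort_a = [
    {"PatientID": "P1", "StudyInstanceUID": "st1", "SeriesInstanceUID": "se1", "SOPInstanceUID": "i1"},
    {"PatientID": "P1", "StudyInstanceUID": "st1", "SeriesInstanceUID": "se2", "SOPInstanceUID": "i2"},
    {"PatientID": "P1", "StudyInstanceUID": "st2", "SeriesInstanceUID": "se3", "SOPInstanceUID": "i3"},
]

# Drop one whole study and one series from another study.
cohort_b = filter_cohort(cohort_a, excluded_studies=["st2"], excluded_series=["se2"])

# An in-memory buffer is used here; an opened file works the same way.
buffer = io.StringIO()
write_uid_list(cohort_b, buffer)
```

The exclusion lists are the only manual input, so the same two lists can be shared alongside the original cohort to let someone else reproduce Cohort B.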

wlongabaugh · March 30, 2021, 10:54pm

The IDC API (still in test, not yet in production) does provide a way to retrieve and manipulate cohorts programmatically from a notebook. But we do not yet have the filtering ability to exclude specific series or studies. Stay tuned!

giemmecci · April 27, 2021, 8:35pm

To add an example to the importance of this feature: the main issue I had in using a “cloud-only” approach (meaning, I don’t download data to my device) is that it is tough to perform quality control and data cleaning.

Take, for example, the case TCGA-08-0522 from the TCGA-GBM cohort, where the T2 acquisition was repeated 4 times (due to motion artefacts); although one could come up with a programmatic way to deal with repeated exams (like using the metadata to pick the latest exam, which is probably the one with the best quality), the only way to be sure that we are using meaningful data is to actually open the scan (unless a quality control tool like this will prove to be a viable solution).
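
The metadata heuristic mentioned above (keep the latest of several repeated acquisitions) might look like the sketch below. It assumes one dictionary per series carrying the DICOM attributes PatientID, StudyInstanceUID, SeriesDescription, SeriesTime and SeriesNumber, that repeats share the same SeriesDescription, and that SeriesTime values are zero-padded HHMMSS strings from the same day, so they compare correctly as text. The function name `pick_latest_series` and the sample values are made up, and the selection would still need visual confirmation.

```python
from collections import defaultdict


def pick_latest_series(series_rows):
    """Keep the latest series per (PatientID, StudyInstanceUID, SeriesDescription)."""
    groups = defaultdict(list)
    for row in series_rows:
        key = (row["PatientID"], row["StudyInstanceUID"], row["SeriesDescription"])
        groups[key].append(row)
    # Latest SeriesTime wins; SeriesNumber breaks ties between equal times.
    return [
        max(group, key=lambda r: (r["SeriesTime"], r["SeriesNumber"]))
        for group in groups.values()
    ]


# Made-up example: a T2 acquisition repeated four times, plus a single T1.
series = [
    {"PatientID": "P1", "StudyInstanceUID": "st1", "SeriesInstanceUID": "se1",
     "SeriesDescription": "T2", "SeriesTime": "101500", "SeriesNumber": 3},
    {"PatientID": "P1", "StudyInstanceUID": "st1", "SeriesInstanceUID": "se2",
     "SeriesDescription": "T2", "SeriesTime": "101900", "SeriesNumber": 4},
    {"PatientID": "P1", "StudyInstanceUID": "st1", "SeriesInstanceUID": "se3",
     "SeriesDescription": "T2", "SeriesTime": "102400", "SeriesNumber": 5},
    {"PatientID": "P1", "StudyInstanceUID": "st1", "SeriesInstanceUID": "se4",
     "SeriesDescription": "T2", "SeriesTime": "103000", "SeriesNumber": 6},
    {"PatientID": "P1", "StudyInstanceUID": "st1", "SeriesInstanceUID": "se5",
     "SeriesDescription": "T1", "SeriesTime": "100200", "SeriesNumber": 2},
]

kept = pick_latest_series(series)
kept_uids = {row["SeriesInstanceUID"] for row in kept}
```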

In this regard, I think it’s very important to give the users a tool that could easily allow performing quality control on a single series level.

Related to this point: let’s say I take care of “cleaning” a given cohort (getting rid of bad quality exams, or post-surgery exams, etc.). Is there a way to make “my” version of that cohort easily sharable with others so that they can retrieve the “filtered” version of the cohort from the IDC portal? (like a modified version of a manifest file).

Thanks!

pieper · April 27, 2021, 8:51pm

Yes, I agree it would be good to be able to iterate through the cohort, look at them in the viewer, and mark the series you want to use.

fedorov · April 27, 2021, 9:35pm

@giemmecci I completely agree with you about the importance of this feature. In fact, just earlier today we discussed this topic with @wlongabaugh @spaquett and @george.white, and I am very happy you independently added another use case for justifying the effort to support more granular definition of the cohort and modification of the cohort.

To give you the background as to why we spend so much time discussing whether this is important, here are a few points, and I will let other folks comment on this further:

The concept of the portal is to support quick facet counting, and the backend implementation limits the size of the filter it can support. When we define the filter as a set of identifiers (at the series or study level), the filter grows very large and pushes those limits. It is also my understanding that it may not be possible to combine facet selection with a cohort defined by a set of identifiers.
There are storage implications to maintaining large cohorts defined as lists (whereas a cohort definition is very compact when defined as a facet selection).
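
The size difference between the two kinds of cohort definition can be illustrated with a rough sketch. Everything in it is an assumption made for illustration: the facet names and values, the JSON encoding, the number of series, and the UID root `1.2.840.99999`, which is made up. Real DICOM UIDs can be up to 64 characters long, so the gap may be larger in practice.

```python
import json

# Hypothetical facet-style definition: a handful of attribute/value selections.
facet_filter = {"collection_id": ["tcga_gbm"], "Modality": ["MR"]}

# Hypothetical identifier-style definition: one made-up UID per series.
series_uids = [f"1.2.840.99999.{i:06d}" for i in range(10_000)]

# Compare the serialized size of each definition in bytes.
facet_bytes = len(json.dumps(facet_filter).encode("utf-8"))
uid_bytes = len(json.dumps(series_uids).encode("utf-8"))

print(f"facet definition: {facet_bytes} bytes")
print(f"identifier list:  {uid_bytes} bytes")
```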

It is very attractive from a developer’s standpoint to think that cohorts can always be defined by selecting certain facet values. But for anyone familiar with the imaging domain, this approach becomes very limiting very quickly.

Independently of supporting the ability to define a cohort as a set of identifiers (and more specifically, at the level of SOPInstanceUID granularity), I believe it is absolutely critical to support persistence of cohorts at the level of identifiers, so that analyses are reproducible. Of course, the user defining the cohort can always maintain that final list of identifiers in BQ, but I think it is desirable to be able to maintain that list somewhere in IDC and not defer it to the user.

Would you be open at some point to join a meeting where we brainstorm the approaches (and limitations of those) related to granular definition of the cohort? I believe your voice and perspective could be very valuable.

giemmecci · April 27, 2021, 10:10pm

Thanks for the insights, and yes, I’d definitely be interested in joining a meeting, although I’m not sure what I can bring to the table: I had to Google “facet counting” (I found this link very useful for my understanding: The Definitive Guide to the Difference Between Filters and Facets).

Thanks!

wlongabaugh · April 27, 2021, 10:50pm

Thanks for the link, though I don’t think Andrey was using “facet counting” in the same specific meaning as that article implies. In IDC, we tend to think about homogeneous (filter) and heterogeneous (facet) search spaces, but generally use those terms interchangeably. Maybe it’s time to start using those terms in that fashion.

giemmecci · April 27, 2021, 10:52pm

Thanks for the feedback! I suspected I’d probably end up looking for the wrong thing.

fedorov · April 28, 2021, 1:50pm

All I would expect you to bring to the table is your experience to help explain the use case to developers, and get your feedback about the possible tradeoffs in the implementation of the related features.
