Cohort manifest content

fedorov · August 10, 2020, 8:50pm

We currently have neither a “Save cohort” feature nor a way to download the manifest corresponding to the selected cohort, but we will need to define what goes into that manifest (or at least its initial version) before the release.

@spaquett and I discussed this briefly, and although it would be possible to dereference everything based on SOPInstanceUID, or SeriesInstanceUID (that’s what TCIA is using in their manifest) and the version of the dataset, it might be helpful to include other items, such as:

definition of the filter
collection/study/series/instance UIDs
location of the storage buckets/BQ tables/DICOM store in the preamble
cohort identification (it sounded like there will be name/ID for the cohort defined by the user, maintained by us)
IndexD details

Related questions are:

what should be the format of the manifest to support its interoperability with other resources of CRDC?
should we offer the option to download a manifest in the TCIA format, for those users of IDC that just want to download the data?

david.pot · August 11, 2020, 1:28pm

As I think I saw Steve reply to the ticket, this is a Post MVP activity in my opinion, given we are in feature freeze for the MVP.

fedorov · August 11, 2020, 2:23pm

@david.pot do you mean it is post-MVP to have cohort manifest at all?

I am pretty sure the capability to define and save a cohort has been on the MVP list for a long time now - that was my feeling from discussions with @spaquett and @wlongabaugh. We also have the following statement in our MVP scope document:

IDC MVP will support the following tasks:

Form cohorts based on the defined search criteria and store those cohorts in a persistent manner for subsequent use;

If we do have the capability to save a cohort, we should then have the capability to get the manifest. Otherwise, what is the point of saving the cohorts?

I will let Suzanne and Bill comment on MVP vs post-MVP.

david.pot · August 11, 2020, 2:35pm

Agreeing, it is Suzanne and Bill that will let you know the reality of what is possible.

pieper · August 11, 2020, 3:45pm

@fedorov and I chatted about this earlier. Seemed to me that a BigQuery table to define the cohort makes the most sense. So the user just needs to provide a destination for us to populate. The user can then export to CSV, JSON, or just access the table directly as needed for their analysis.

wlongabaugh · August 12, 2020, 9:05pm

Yes, cohort saving and manifest download are part of the MVP. @spaquett is finishing up that feature. I think the format is still up for grabs.

Per @pieper’s suggestion, saving to BQ would be good; that means the WebApp service account would need BQ roles in the user’s project. We have that infrastructure in ISB-CGC, but it does have security implications…

fedorov · August 12, 2020, 9:30pm

For the MVP, I suggest we define the export-to-file format and support download as a TSV file. It would then be trivial for the user to import that into BQ, and we would avoid the security implications. We can refine this process post-MVP based on user feedback.

How about the following for the content of the file:

collection ID, which (hopefully!) matches the bucket name in the storage under canceridc-data
PatientID
StudyInstanceUID
SeriesInstanceUID (potentially, can skip this, since we do not filter at the series level?)

We could also include in the header:

filter definition
location of the storage buckets
DICOMweb endpoint
cohort identifier
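
A minimal sketch of what such a file could look like, assuming the header items are written as `# key: value` comment lines ahead of the TSV table. The comment-line convention, the column name `collection_id`, and every value below are hypothetical placeholders for illustration, not an agreed format:

```python
import csv
import io

# Column names follow the list above; "collection_id" is an assumed name.
COLUMNS = ["collection_id", "PatientID", "StudyInstanceUID", "SeriesInstanceUID"]


def write_manifest(rows, header, out):
    """Write '# key: value' header lines, then the cohort rows as TSV."""
    for key, value in header.items():
        out.write(f"# {key}: {value}\n")
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical header and row values, for illustration only.
header = {
    "cohort_id": "42",
    "filter": "collection_id=example_collection",
    "storage_bucket": "gs://example-bucket",
    "dicomweb_endpoint": "https://example.org/dicomweb",
}
rows = [
    {
        "collection_id": "example_collection",
        "PatientID": "PATIENT-001",
        "StudyInstanceUID": "1.2.3.4",
        "SeriesInstanceUID": "1.2.3.4.5",
    },
]

buf = io.StringIO()
write_manifest(rows, header, buf)
manifest_text = buf.getvalue()
```

If we do skip SeriesInstanceUID, it is only a matter of dropping that column from `COLUMNS`.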

pieper · August 12, 2020, 9:40pm

@wlongabaugh wouldn’t it make more sense to write to the user’s project’s BQ under the user’s credentials?

@fedorov using a TSV file is fine, but it kind of breaks the “stay in the cloud” mantra. Of course I’m happy with whatever expedites the MVP.

For the TSV fields, I’d also suggest:

user id of the person who generated it
manifest format version number
date / time the manifest was generated
maybe a version of IDC that was used (e.g. dev vs test vs production etc).
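
To show how a consumer might use those fields, here is a rough sketch of reading a manifest back, assuming the metadata is stored as `# key: value` comment lines before the TSV table. That layout, the key names (`user`, `manifest_format_version`, `generated_at`, `idc_version`), and the sample values are all assumptions made up for this example:

```python
import csv


def read_manifest(text):
    """Split a manifest into a header dict and a list of row dicts."""
    header = {}
    body_lines = []
    for line in text.splitlines():
        if line.startswith("#"):
            # Split on the first colon only, so timestamps keep their colons.
            key, _, value = line[1:].partition(":")
            header[key.strip()] = value.strip()
        elif line.strip():
            body_lines.append(line)
    rows = list(csv.DictReader(body_lines, delimiter="\t"))
    return header, rows


# Hypothetical manifest content, for illustration only.
sample = (
    "# user: someone@example.com\n"
    "# manifest_format_version: 1\n"
    "# generated_at: 2020-08-12T21:40:00Z\n"
    "# idc_version: dev\n"
    "PatientID\tStudyInstanceUID\n"
    "PATIENT-001\t1.2.3.4\n"
)

header, rows = read_manifest(sample)
```

With a format version in the header, a reader like this can check `header["manifest_format_version"]` before trusting the column layout.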

wlongabaugh · August 12, 2020, 10:01pm

@pieper Yes, true, that is probably the way to do it. Though we would probably need to provide the option to the user when they sign in using their Google ID, as to what scopes they are approving. I personally would not provide those permissions to any web site.

pieper · August 12, 2020, 10:29pm

Interesting. Is there a way to only grant enough scope to write to a project-specific BQ table, or would you need to grant it for your whole account?

wlongabaugh · August 12, 2020, 10:42pm

I am not enough of an OAuth2 expert to know for sure, but all I have ever seen are granted scopes across all resources. To limit access, I could allow WebApp SA write access to a given BQ dataset.

rkikinis · August 12, 2020, 11:59pm

The arthrosis archive on nda stores the search in the system, linked to the user. I was a beta tester when they were setting it up a few years back.

fedorov · August 13, 2020, 1:54am

I agree export to BQ would be great (if possible). But I also think we should not lock the user to a specific workflow. It is reasonable to expect use cases that would require export into file, and there’s nothing that prevents us from supporting multiple export options.

spaquett · August 13, 2020, 3:04am

As far as I know, scopes in OAuth2 aren’t that specific. You grant permission to anything in that scope. Looking at Google’s OAuth2 documentation for BigQuery, the scopes are all broad:

https://developers.google.com/identity/protocols/oauth2/scopes#bigquery

I can think of a way to do something more lightweight than what we do on ISB-CGC, which as a resource has a much heavier use case. However, it would still involve the user taking a service account of ours and granting it BigQuery write access to a given dataset of theirs. If everyone thinks that might be a valuable use case, we can consider it for post-MVP.

spaquett · August 13, 2020, 5:17pm

@fedorov That sounds fine - @pieper I can add in User ID but there’s not much point, since it’s just a number relevant to Django. Unless you mean their login, aka the email?

pieper · August 13, 2020, 5:31pm

Yes, in this case the Google account of the user that generated the manifest.

Perhaps also we should allow a set of user-defined tags, plus maybe a free text description field.

fedorov · August 13, 2020, 6:55pm

I think those features fit better into the UI for the cohort storage, instead of prompting the user to populate them at export time. Then, at export time, the user can decide which attributes should be included.

pieper · August 13, 2020, 7:55pm

We were collecting the list of fields to include in the cohort manifest and I think these should be included.

fedorov · August 13, 2020, 8:00pm

Yes, definitely. But now that we have identified them, we also need a plan for where they will come from and when, right?

pieper · August 13, 2020, 8:23pm

Yes, it makes sense to populate the metadata like tags and descriptions in the main UI with the search terms. Then export is just serializing what’s in the UI.

I would think the metadata would be auto-saved in a cache, like a recent-queries list, and then you should be able to restore and export those.

Getting back to the topic of manifest content, it would also be good to have a unique ID for the cohort and maybe a reference to a parent cohort if you started with a previous query and edited the search terms.
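
Something like the following is what I have in mind, as a rough sketch only. The `Cohort` class, its field names, and the use of random UUIDs as identifiers are hypothetical choices for illustration, not how the WebApp stores cohorts:

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cohort:
    filters: dict                     # search terms that define the cohort
    tags: list = field(default_factory=list)
    description: str = ""
    parent_id: Optional[str] = None   # ID of the cohort this one was edited from
    cohort_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def derive(self, **changed_filters):
        """Return a new cohort with edited search terms, linked to this one."""
        return Cohort(
            filters={**self.filters, **changed_filters},
            tags=list(self.tags),
            description=self.description,
            parent_id=self.cohort_id,
        )


# Hypothetical search terms, for illustration only.
parent = Cohort(filters={"collection_id": "example_collection"}, tags=["demo"])
child = parent.derive(Modality="CT")
```

Serializing `cohort_id` and `parent_id` into the manifest would let someone trace an exported cohort back to the query it was edited from.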
