Cohort manifest content

We currently have neither the “Save cohort” feature nor a way to download the manifest for the selected cohort, but we need to define what will go into that manifest (or at least its initial version) before the release.

@spaquett and I discussed this briefly, and although it would be possible to dereference everything based on SOPInstanceUID, or SeriesInstanceUID (that’s what TCIA is using in their manifest) and the version of the dataset, it might be helpful to include other items, such as:

  • definition of the filter
  • collection/study/series/instance UIDs
  • location of the storage buckets/BQ tables/DICOM store in the preamble
  • cohort identification (it sounded like there will be name/ID for the cohort defined by the user, maintained by us)
  • IndexD details

Related questions are:

  • what should be the format of the manifest to support its interoperability with other resources of CRDC?
  • should we offer the option to download a manifest in the TCIA format, for those users of IDC that just want to download the data?

As I think Steve replied on the ticket, this is a post-MVP activity in my opinion, given that we are in feature freeze for the MVP.

@david.pot do you mean it is post-MVP to have cohort manifest at all?

I am pretty sure the capability to define and save a cohort has been on the MVP list for a long time now - that was my feeling from discussions with @spaquett and @wlongabaugh. We also have the following statement in our MVP scope document:

IDC MVP will support the following tasks:

  • Form cohorts based on the defined search criteria and store those cohorts in a persistent manner for subsequent use;

If we do have the capability to save a cohort, we should also have the capability to get the manifest; otherwise, what is the point of saving cohorts?

I will let Suzanne and Bill comment on MVP vs post-MVP.

Agreed, Suzanne and Bill are the ones who can tell you what is realistically possible.

@fedorov and I chatted about this earlier. It seemed to me that a BigQuery table to define the cohort makes the most sense, so the user just needs to provide a destination for us to populate. The user can then export to CSV or JSON, or just access the table directly as needed for their analysis.

Yes, cohort saving and manifest download are part of the MVP. @spaquett is finishing up that feature. I think the format is still up for grabs.

Per @pieper’s suggestion, saving to BQ would be good, but that means the WebApp service account would need BQ roles in the user’s project. We have that infrastructure in ISB-CGC, but it does have security implications…

I suggest that for the MVP we define the export-to-file format and support download as a TSV file. It would then be trivial for the user to import that into BQ, and we would avoid the security implications. We can refine this process post-MVP based on user feedback.

How about the following for the content of the file:

  • collection ID, which (hopefully!) matches the bucket name in the storage under canceridc-data
  • PatientID
  • StudyInstanceUID
  • SeriesInstanceUID (potentially, can skip this, since we do not filter at the series level?)

We could also include in the header:

  • filter definition
  • location of the storage buckets
  • DICOMweb endpoint
  • cohort identifier
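A minimal sketch of how such a file could be laid out, assuming the header items go into `#`-prefixed lines ahead of the column row. The function name, the `#` convention, the key names, and all sample values are hypothetical and only illustrate the proposal above; none of this is an agreed format.

```python
import csv
import io

# Column names follow the fields proposed above; "collection_id" is an assumed spelling.
COLUMNS = ["collection_id", "PatientID", "StudyInstanceUID", "SeriesInstanceUID"]


def write_manifest(rows, header, out):
    """Write header items as '# key: value' lines, then the rows as a TSV table."""
    for key, value in header.items():
        out.write(f"# {key}: {value}\n")
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical sample values, for illustration only.
header = {
    "filter": "collection_id=example_collection",
    "storage_bucket": "example-bucket",
    "dicomweb_endpoint": "https://example.org/dicomweb",
    "cohort_id": "1",
}
rows = [
    {
        "collection_id": "example_collection",
        "PatientID": "P-0001",
        "StudyInstanceUID": "1.2.3.4",
        "SeriesInstanceUID": "1.2.3.4.5",
    }
]

buffer = io.StringIO()
write_manifest(rows, header, buffer)
manifest_text = buffer.getvalue()
```

With this layout, anyone importing the file into a table would need to skip the `#` lines first, which is one argument for keeping the header short.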

@wlongabaugh wouldn’t it make more sense to write to the user’s project’s BQ under the user’s credentials?

@fedorov using a TSV file is fine, but it kind of breaks the “stay in the cloud” mantra. Of course I’m happy with whatever expedites the MVP.

For the TSV fields, I’d also suggest:

  • user id of the person who generated it
  • manifest format version number
  • date / time the manifest was generated
  • maybe a version of IDC that was used (e.g. dev vs test vs production etc).
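As a rough sketch, these provenance fields could be collected in one place at export time. The key names, the version string, and the `idc_tier` values are assumptions made for illustration, not anything that has been decided.

```python
from datetime import datetime, timezone

MANIFEST_FORMAT_VERSION = "1.0"  # hypothetical version string


def provenance_header(user_id, idc_tier, now=None):
    """Collect the suggested provenance fields for a manifest header."""
    # Use UTC so the timestamp does not depend on the server's local time zone.
    now = now or datetime.now(timezone.utc)
    return {
        "generated_by": user_id,
        "manifest_format_version": MANIFEST_FORMAT_VERSION,
        "generated_at": now.isoformat(),
        "idc_tier": idc_tier,  # e.g. "dev", "test" or "production"
    }


example = provenance_header(
    "user@example.org",
    "dev",
    now=datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
```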

@pieper Yes, true, that is probably the way to do it. Though we would probably need to provide the option to the user when they sign in using their Google ID, as to what scopes they are approving. I personally would not provide those permissions to any web site.

Interesting. Is there a way to only grant enough scope to write to a project-specific BQ table, or would you need to grant it for your whole account?

I am not enough of an OAuth2 expert to know for sure, but all I have ever seen are scopes granted across all resources. To limit access, I could give the WebApp SA write access to a given BQ dataset.

The arthrosis archive on NDA stores the search in the system, linked to the user. I was a beta tester when they were setting it up a few years back.

I agree export to BQ would be great (if possible). But I also think we should not lock the user to a specific workflow. It is reasonable to expect use cases that would require export into file, and there’s nothing that prevents us from supporting multiple export options.

As far as I know, scopes in OAuth2 aren’t that specific. You grant permission to anything in that scope. Looking at Google’s OAuth2 documentation for BigQuery, the scopes are all broad:

https://developers.google.com/identity/protocols/oauth2/scopes#bigquery

I can think of a way to do something more lightweight than what we do on ISB-CGC, which as a resource has a much heavier use case. However, it would still involve the user taking a service account of ours and granting it BigQuery write access to a given dataset of theirs. If everyone thinks that might be a valuable use case, we can consider it for post-MVP.

@fedorov That sounds fine - @pieper I can add in User ID but there’s not much point, since it’s just a number relevant to Django. Unless you mean their login, aka the email?

Yes, in this case the Google account of the user that generated the manifest.

Perhaps also we should allow a set of user-defined tags, plus maybe a free text description field.

I think those features fit better into the UI for cohort storage than into a prompt at export time. Then, at export time, the user can decide which attributes should be included.

We were collecting the list of fields to include in the cohort manifest and I think these should be included.

Yes, definitely. But now that we have identified them, we also need a plan for where they will come from and when, right?

Yes, it makes sense to populate the metadata like tags and descriptions in the main UI with the search terms. Then export is just serializing what’s in the UI.

I would think the metadata would be auto-saved in a cache, like a recent-queries list, and then you should be able to restore and export those.

Getting back to the topic of manifest content, it would also be good to have a unique ID for the cohort, and maybe a reference to a parent cohort if you started with a previous query and edited the search terms.
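To make the parent reference concrete, here is a small sketch of what such a record might look like. The class, its field names, and the use of a random UUID as the identifier are all assumptions for illustration; the actual cohort ID would presumably come from whatever the WebApp maintains.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class CohortRecord:
    """Hypothetical cohort record: filters, a unique ID, and an optional parent."""

    filters: dict
    parent_id: Optional[str] = None
    # A random UUID stands in for whatever identifier the system would assign.
    cohort_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def derive(self, **changes):
        """Return a new cohort with edited filters that points back to this one."""
        return CohortRecord(filters={**self.filters, **changes}, parent_id=self.cohort_id)


# Hypothetical filter names and values.
original = CohortRecord(filters={"collection_id": "example_collection"})
edited = original.derive(Modality="CT")
```

Here `edited.parent_id` equals `original.cohort_id`, so a manifest exported from the edited cohort could record where it came from.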