Cohort manifest content

Since we don’t currently have infrastructure in place for tags, those would have to be post-MVP. A text description field is already available, and could be included.

@spaquett I think we should also include the DOI. I don’t know where it will be in the final layout of the tables, but at the moment this would be the Source_DOI from idc-dev-etl:idc_tcia_views_mvp_wave0.dicom_all.

cc: @bill.clifford

A post was split to a new topic: TCIA manifest for the IDC cohorts

@spaquett I now think that we should include the full gs:// URL to each individual file. This would be most convenient for users who want to replicate the data on a VM, since the file list can be piped to gsutil.
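To illustrate the piping idea, here is a minimal sketch in Python. It assumes the manifest is a plain-text file with one gs:// URL per line, which is not a settled format. The function names, the manifest path, and the bucket names in the usage note are hypothetical. It relies on `gsutil -m cp -I`, which reads the list of objects to copy from stdin.

```python
import subprocess
from typing import Iterable, List


def extract_gcs_urls(manifest_lines: Iterable[str]) -> List[str]:
    """Return the gs:// URLs found in the manifest, one per line.

    Blank lines and any line that does not start with gs:// are skipped.
    """
    urls = []
    for line in manifest_lines:
        entry = line.strip()
        if entry.startswith("gs://"):
            urls.append(entry)
    return urls


def download_manifest(manifest_path: str, dest_dir: str) -> None:
    """Copy every file listed in the manifest into dest_dir.

    Assumes gsutil is installed and authenticated on the VM.
    """
    with open(manifest_path) as handle:
        urls = extract_gcs_urls(handle)
    # -m enables parallel transfers; cp -I reads object URLs from stdin.
    subprocess.run(
        ["gsutil", "-m", "cp", "-I", dest_dir],
        input="\n".join(urls) + "\n",
        text=True,
        check=True,
    )
```

On a VM this could be called as `download_manifest("manifest.txt", "./data")`. For a manifest that already contains only URLs, the shell equivalent would be `cat manifest.txt | gsutil -m cp -I ./data`.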

Finally getting around to this thread…
To me the basic question is: what are the manifest use cases? In particular, is a manifest something that a user will use once and then throw away? Or is a manifest something that a user will squirrel away?
Remember that the user can always get a new copy of the manifest from IDC.

A throwaway manifest would be based on current (versioned) URLs, while a long-lived manifest should be based on DOIs. I think it is debatable whether a manifest should include both. Perhaps the best option would be to provide only DOI-based manifests, along with a tool for easily obtaining the corresponding files. That way there is no chance of a manifest becoming “stale”.

A DOI-based manifest could also be more concise than a GCS URL-based manifest. Note that a GCS URL-based manifest must include a URL for every instance in the cohort, because we can only version GCS entities at the instance level (without data duplication; there are no symlinks in GCS). However, a DRS DOI representing a series or study will be version-specific, so a manifest based on DOIs can capture the data in a cohort more efficiently.
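To make the size difference concrete, here is a rough sketch that counts how many manifest entries a cohort would need at each level. The tuple layout, the function name, and the identifiers in the usage note are assumptions for illustration only, not the actual table schema.

```python
from typing import Dict, Iterable, Tuple


def manifest_entry_counts(
    instances: Iterable[Tuple[str, str, str]]
) -> Dict[str, int]:
    """Count the manifest entries needed at each granularity.

    Each input item is a hypothetical (study_id, series_id, instance_url)
    tuple describing one instance in the cohort.
    """
    studies, series, urls = set(), set(), set()
    for study_id, series_id, instance_url in instances:
        studies.add(study_id)
        series.add(series_id)
        urls.add(instance_url)
    return {
        "instance_urls": len(urls),  # one line per file in a URL manifest
        "series": len(series),       # one line per series-level identifier
        "studies": len(studies),     # one line per study-level identifier
    }
```

For example, a cohort of three instances spread over two series in a single study needs three URL entries, but only two series-level or one study-level entry.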

I think for the MVP we should address the most immediate need: efficiently getting the data corresponding to the manifest onto a VM. For that, we should use GCS URLs in the manifest.

Are there tools that would be equivalent in performance to parallel download with gsutil -m but work with DRS DOIs?

What is used in the manifests in ISB-CGC? What are the manifest use cases that proved to be important to the ISB-CGC users?

Yes, for the MVP, we should go with GCS URLs. I recently added a column of GCS URLs (with version suffix) to dicom_all.
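As a sketch of how a manifest with both the URL and the DOI might be pulled from that table, the helper below only builds the query text. The name of the new GCS URL column is not given in this thread, so it is a required parameter, and the `gcs_url` value in the usage note is hypothetical. The table reference is the one mentioned above, rewritten in standard SQL form.

```python
def build_manifest_query(
    url_column: str,
    table: str = "idc-dev-etl.idc_tcia_views_mvp_wave0.dicom_all",
    doi_column: str = "Source_DOI",
) -> str:
    """Build a BigQuery standard SQL query for a URL + DOI manifest.

    url_column must be the real name of the GCS URL column in the table;
    this sketch does not know it and does not validate it.
    """
    return (
        f"SELECT DISTINCT {url_column} AS file_url, {doi_column} AS source_doi\n"
        f"FROM `{table}`\n"
        "ORDER BY file_url"
    )
```

Calling `build_manifest_query("gcs_url")` returns query text that could be pasted into the BigQuery console. A real cohort manifest would also need a WHERE clause for the cohort filters, which is left out here.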

I am not aware of any tools that work directly with DRS objects; that is probably something we will have to create.
