Since we don’t currently have infrastructure in place for tags, those would have to be post-MVP. A text description field is already available, and could be included.
@spaquett I think we should also include the DOI. I don’t know where it will be in the final layout of the tables, but at the moment this would be the Source_DOI from idc-dev-etl:idc_tcia_views_mvp_wave0.dicom_all.
cc: @bill.clifford
@spaquett I now think that we should include the full gs:// URL to the individual files. This would be most convenient for users who want to replicate the data on a VM, and the file list can be piped to gsutil.
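As a rough sketch of that piping step, assuming the manifest is a plain text file with one gs:// URL per line (the final manifest format is not settled, and the file and directory names below are placeholders), the URL list can be fed to gsutil cp -I, which reads its sources from stdin:

```python
import subprocess
from typing import Iterable, List


def gcs_urls(lines: Iterable[str]) -> List[str]:
    """Keep only the lines that look like gs:// URLs (assumed manifest layout)."""
    urls = []
    for line in lines:
        line = line.strip()
        if line.startswith("gs://"):
            urls.append(line)
    return urls


def download(urls: List[str], dest_dir: str) -> None:
    """Pipe the URL list to `gsutil -m cp -I`, which reads sources from stdin.

    -m enables parallel transfers; dest_dir must already exist on the VM.
    """
    subprocess.run(
        ["gsutil", "-m", "cp", "-I", dest_dir],
        input="\n".join(urls) + "\n",
        text=True,
        check=True,
    )


# Hypothetical usage on a VM with gsutil installed and authenticated:
#   with open("manifest.txt") as fh:
#       download(gcs_urls(fh), "./data")
```

The filter is separated from the download so that header or comment lines in a manifest would not be passed to gsutil.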
Finally getting around to this thread…
To me the basic question is what are manifest use cases? In particular, is a manifest something that a user will use once and then throw away? Or is a manifest something that a user will squirrel away?
Remember that the user can always get a new copy of the manifest from IDC.
A throwaway manifest would be based on current (versioned) URLs, while a long-lived manifest should be based on DOIs. I think it is debatable whether a manifest should include both. Perhaps it would be best to provide only DOI-based manifests, along with a tool for easily obtaining the corresponding files. That way there is no chance of a manifest becoming “stale”.
A DOI-based manifest could also be more concise than a GCS URL-based manifest. Note that a GCS URL-based manifest must include a URL for every instance in the cohort because we can only version GCS entities at the instance level (without data duplication…no symlinks in GCS). However, a DRS DOI representing a series or study will be version-specific, so a DOI-based manifest can capture the data in a cohort more efficiently.
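To illustrate the size difference, here is a toy sketch. The series identifiers and bucket paths are made up, and a real DOI-based manifest would carry DRS DOIs rather than these placeholder strings:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_series(rows: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group instance URLs under the series they belong to."""
    grouped = defaultdict(list)
    for series_id, url in rows:
        grouped[series_id].append(url)
    return dict(grouped)


# Hypothetical cohort: (series identifier, instance URL) pairs.
rows = [
    ("series-A", "gs://example-bucket/a1.dcm"),
    ("series-A", "gs://example-bucket/a2.dcm"),
    ("series-A", "gs://example-bucket/a3.dcm"),
    ("series-B", "gs://example-bucket/b1.dcm"),
]

by_series = group_by_series(rows)

# URL-based manifest: one entry per instance.
instance_manifest = [url for urls in by_series.values() for url in urls]

# Series-level manifest: one entry per series.
series_manifest = sorted(by_series)

print(len(instance_manifest), len(series_manifest))
```

For a real cohort, where a series can hold hundreds of instances, the gap between the two counts would be correspondingly larger.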
I think for the MVP we should address the most immediate need: efficiently getting the data corresponding to the manifest onto a VM. For that, we should use GCS URLs in the manifest.
Are there tools that would be equivalent in performance to parallel download with gsutil -m but work with DRS DOIs?
What is used in the manifests in ISB-CGC? What are the manifest use cases that proved to be important to the ISB-CGC users?
Yes, for the MVP, we should go with GCS URLs. I recently added a column of GCS URLs (with version suffix) to dicom_all.
I am not aware of any tools that work directly with DRS objects; that is probably something we would have to create.