I was wondering if there is documentation on how IDC datasets are created from raw data (the initial download of DICOM files and annotations into structured datasets in GCP).
The etl_flow repo seems to provide a wide range of tools, but not so much a recipe for transforming a specific data drop into an IDC dataset.
Meanwhile, do you have an automated process for updating Zenodo entries once a new version of a dataset is published?
@Marijn_Lems1 we don’t have documentation on this topic. I will add this to my notes regarding revisions of the documentation to provide at least some overview of the process.
But for now, maybe the following brief description will help you.
We rely on the Google Healthcare API for managing DICOM stores with the data and for extracting DICOM metadata. Once you have the DICOM files at hand, you can create a Google Healthcare DICOM store and import those files into it using the instructions here: Importing and exporting DICOM data using Cloud Storage | Cloud Healthcare API | Google Cloud.
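To make that step a bit more concrete, here is a minimal Python sketch (not our production code) using the discovery-based Healthcare API client. It assumes the Healthcare API dataset already exists, and the project, location, dataset, store, and bucket names are placeholders you would replace with your own:

```python
# Minimal sketch: create a DICOM store and import DICOM files from a
# Cloud Storage bucket via the Healthcare API. All names are placeholders,
# and the Healthcare API dataset is assumed to already exist.
from googleapiclient import discovery

project_id = "my-project"                      # placeholder
location = "us-central1"                       # placeholder
dataset_id = "my-healthcare-dataset"           # placeholder
dicom_store_id = "my-dicom-store"              # placeholder
gcs_uri = "gs://my-bucket/dicom/**.dcm"        # placeholder wildcard path

client = discovery.build("healthcare", "v1")
dataset_name = f"projects/{project_id}/locations/{location}/datasets/{dataset_id}"

# Create the DICOM store.
client.projects().locations().datasets().dicomStores().create(
    parent=dataset_name, dicomStoreId=dicom_store_id, body={}
).execute()

# Import DICOM instances from Cloud Storage (this starts a long-running operation).
store_name = f"{dataset_name}/dicomStores/{dicom_store_id}"
client.projects().locations().datasets().dicomStores().import_(
    name=store_name, body={"gcsSource": {"uri": gcs_uri}}
).execute()
```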
Once the DICOM store is populated with content, exporting the DICOM metadata into BigQuery is straightforward: Exporting DICOM metadata to BigQuery | Cloud Healthcare API | Google Cloud.
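Continuing the sketch above, the export call looks roughly like this; the BigQuery dataset and table names are again placeholders:

```python
# Export the store's DICOM metadata to a BigQuery table. writeDisposition
# controls whether an existing table is overwritten.
bq_table_uri = f"bq://{project_id}.my_bq_dataset.dicom_metadata"  # placeholder

client.projects().locations().datasets().dicomStores().export(
    name=store_name,
    body={
        "bigqueryDestination": {
            "tableUri": bq_table_uri,
            "writeDisposition": "WRITE_TRUNCATE",
        }
    },
).execute()
```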
The IDC ETL process covers a lot more than that, since we need to join DICOM metadata with collection-level metadata, assign additional UUIDs, manage versions of the data, and extract subsets of metadata into separate tables/views to simplify access. Those and many other topics are handled by the etl_flow repo you mentioned. A simplified illustration of that last point follows below.
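As an illustration of extracting a metadata subset into a view (this is not etl_flow code, just a toy example), the exported table uses DICOM keyword column names, so a series-level summary view could be defined like this, with the BigQuery dataset, table, and view names as placeholders:

```python
# Illustrative only: derive a series-level summary view from the exported
# DICOM metadata table. Dataset/table/view names are placeholders.
from google.cloud import bigquery

bq = bigquery.Client(project=project_id)
view = bigquery.Table(f"{project_id}.my_bq_dataset.dicom_series_summary")
view.view_query = f"""
SELECT
  PatientID,
  StudyInstanceUID,
  SeriesInstanceUID,
  Modality,
  COUNT(DISTINCT SOPInstanceUID) AS instance_count
FROM `{project_id}.my_bq_dataset.dicom_metadata`
GROUP BY PatientID, StudyInstanceUID, SeriesInstanceUID, Modality
"""
bq.create_table(view, exists_ok=True)
```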
Hope this helps.