I was wondering if there is documentation on how IDC datasets are created from raw data (the initial download of DICOM files and annotations into structured datasets in GCP).
The etl_flow repo seems to provide a wide range of tools, but not so much a recipe for transforming a specific data drop into an IDC dataset.
Meanwhile, do you have an automated process for updating Zenodo entries once a new version of a dataset is published?
@Marijn_Lems1 we don’t have documentation on this topic. I will add this to my notes regarding revisions of the documentation to provide at least some overview of the process.
But for now, maybe the following brief description will help you.
We rely on the Google Healthcare API for managing DICOM stores with the data and for extracting DICOM metadata. Once you have the DICOM files at hand, you can create a Google Healthcare DICOM store and import those files into it using the instructions here: Importing and exporting DICOM data using Cloud Storage | Cloud Healthcare API | Google Cloud.
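To make that step a bit more concrete, here is a minimal Python sketch (not our production code) using the discovery-based Healthcare API client. It assumes the Healthcare API dataset already exists, and the project, location, dataset, store, and bucket names are placeholders you would replace with your own:

```python
# Minimal sketch: create a DICOM store and import DICOM files from a
# Cloud Storage bucket via the Healthcare API. All names are placeholders,
# and the Healthcare API dataset is assumed to already exist.
from googleapiclient import discovery

project_id = "my-project"                      # placeholder
location = "us-central1"                       # placeholder
dataset_id = "my-healthcare-dataset"           # placeholder
dicom_store_id = "my-dicom-store"              # placeholder
gcs_uri = "gs://my-bucket/dicom/**.dcm"        # placeholder wildcard path

client = discovery.build("healthcare", "v1")
dataset_name = f"projects/{project_id}/locations/{location}/datasets/{dataset_id}"

# Create the DICOM store.
client.projects().locations().datasets().dicomStores().create(
    parent=dataset_name, dicomStoreId=dicom_store_id, body={}
).execute()

# Import DICOM instances from Cloud Storage (this starts a long-running operation).
store_name = f"{dataset_name}/dicomStores/{dicom_store_id}"
client.projects().locations().datasets().dicomStores().import_(
    name=store_name, body={"gcsSource": {"uri": gcs_uri}}
).execute()
```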
Once the DICOM store is populated with content, exporting the DICOM metadata into BigQuery is straightforward: Exporting DICOM metadata to BigQuery | Cloud Healthcare API | Google Cloud.
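Continuing the sketch above, the export call looks roughly like this; the BigQuery dataset and table names are again placeholders:

```python
# Export the store's DICOM metadata to a BigQuery table. writeDisposition
# controls whether an existing table is overwritten.
bq_table_uri = f"bq://{project_id}.my_bq_dataset.dicom_metadata"  # placeholder

client.projects().locations().datasets().dicomStores().export(
    name=store_name,
    body={
        "bigqueryDestination": {
            "tableUri": bq_table_uri,
            "writeDisposition": "WRITE_TRUNCATE",
        }
    },
).execute()
```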
The IDC ETL process covers a lot more than that, since we need to join DICOM metadata with collection-level metadata, assign additional UUIDs, manage versions of the data, and extract subsets of metadata into separate tables/views to simplify access. Those and many other topics are handled by the etl_flow repo you mentioned. A simplified illustration of that last point follows below.
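As an illustration of extracting a metadata subset into a view (this is not etl_flow code, just a toy example), the exported table uses DICOM keyword column names, so a series-level summary view could be defined like this, with the BigQuery dataset, table, and view names as placeholders:

```python
# Illustrative only: derive a series-level summary view from the exported
# DICOM metadata table. Dataset/table/view names are placeholders.
from google.cloud import bigquery

bq = bigquery.Client(project=project_id)
view = bigquery.Table(f"{project_id}.my_bq_dataset.dicom_series_summary")
view.view_query = f"""
SELECT
  PatientID,
  StudyInstanceUID,
  SeriesInstanceUID,
  Modality,
  COUNT(DISTINCT SOPInstanceUID) AS instance_count
FROM `{project_id}.my_bq_dataset.dicom_metadata`
GROUP BY PatientID, StudyInstanceUID, SeriesInstanceUID, Modality
"""
bq.create_table(view, exists_ok=True)
```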
Hope this helps.