Recently, one of the IDC users asked how to download whole slide images (stored in IDC as DICOM Slide Microscopy objects) for specific collections. Since this question is of general interest, I decided to post the recipe in the forum.
As a prerequisite, you will need to install s5cmd on your system, and note the location of the s5cmd executable, or add it to your path. You will need this tool to download files from IDC.
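If you prefer to check the prerequisite from a script, here is a minimal Python sketch. It only assumes that the executable is named s5cmd; the helper name find_s5cmd is mine and not part of any IDC tooling.

```python
import shutil


def find_s5cmd(name="s5cmd"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


path = find_s5cmd()
if path is None:
    # Not on PATH: either add it, or use the full path to the executable
    print("s5cmd was not found on PATH")
else:
    print(f"Using s5cmd at {path}")
```

If this reports that s5cmd was not found, you can still use the tool by giving the full path to the executable in the download commands.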
Next, you will need to select the files that you want to download. You can do this using the IDC Portal or BigQuery SQL.
IDC Portal
- Select “Slide microscopy” in the Modality filter group on the left. You can use additional filters to, for example, select a specific collection.
- Click the “Download images” button in the upper right corner.
- Save the manifest as prompted, and run the s5cmd command as shown in the popup to download all of the images corresponding to your filter criteria:
s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com run file_manifest_gcs.s5cmd
It is important to remember that manifests generated from filters in the IDC Portal are defined at the level of DICOM studies, not DICOM series. For example, if your filter selects the Slide Microscopy modality, the manifest will include all images from all studies that include Slide Microscopy. In general, a DICOM study can include more than one modality.
BigQuery SQL
At the moment, it is not possible to use the IDC Portal to download specific individual series, or only a subset of the images corresponding to your filter definition. To achieve this, you will need to use BigQuery SQL. This brings a couple of extra prerequisites, which are easy to satisfy: 1) you will need to have a Google Account; 2) you will need to create a Google Cloud project. We have a tutorial on how to complete those prerequisites here.
Assuming you have completed the prerequisites, you should be able to open the BigQuery SQL console, where you can enter and run queries.
The following query will generate a manifest that contains all of the Slide Microscopy image series. You can run it using the BigQuery SQL console.
SELECT
ANY_VALUE(CONCAT("cp s3",REGEXP_SUBSTR(gcs_url, "(://.*)/"),"/* .")) as s5cmd_command
FROM
`bigquery-public-data.idc_current.dicom_all`
WHERE
modality = "SM"
GROUP BY
SeriesInstanceUID
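To make the CONCAT/REGEXP_SUBSTR expression less cryptic, here is a small Python sketch that applies the same regular expression to a single URL. The example URL is made up for illustration (real IDC bucket and folder names will differ), and the function name is mine. The query appears to rely on all files of a series sharing one folder, so that a single wildcard copy command per series is enough.

```python
import re


def s5cmd_copy_command(gcs_url):
    """Mimic the SQL expression for one URL.

    The pattern captures everything from "://" up to (not including)
    the last "/", i.e. the bucket and folder without the file name.
    """
    match = re.search(r"(://.*)/", gcs_url)
    if match is None:
        raise ValueError(f"unexpected URL format: {gcs_url}")
    # Swap the scheme for s3 and copy every file in the folder to "."
    return "cp s3" + match.group(1) + "/* ."


# Hypothetical URL, for illustration only
example_url = "gs://example-bucket/series-folder/instance.dcm"
print(s5cmd_copy_command(example_url))
# prints: cp s3://example-bucket/series-folder/* .
```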
Once you run the query, save the result as a local CSV file directly from the BigQuery console.
Then run the same command as you did for the manifest downloaded using the portal (assuming you saved the query execution result as file_manifest_gcs.s5cmd):
s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com run file_manifest_gcs.s5cmd
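One thing to watch for, which I state as an assumption rather than verified behavior for every export path: a CSV export will usually begin with a header row containing the column name (s5cmd_command), and values may be wrapped in quotes. Neither is a valid s5cmd command. The sketch below turns such a CSV into a clean manifest; the file name bq-results.csv and the function names are hypothetical.

```python
import csv
import io


def csv_to_manifest(csv_text, column="s5cmd_command"):
    """Extract the command column from CSV text, dropping the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column].strip() for row in reader if row.get(column)]


def convert_file(csv_path, manifest_path, column="s5cmd_command"):
    """Read an exported CSV and write one s5cmd command per line."""
    with open(csv_path, newline="") as src:
        commands = csv_to_manifest(src.read(), column)
    with open(manifest_path, "w") as dst:
        dst.write("\n".join(commands) + "\n")
    return len(commands)


# Example usage (file names are placeholders):
# convert_file("bq-results.csv", "file_manifest_gcs.s5cmd")
```

The resulting file contains only the cp lines, so it can be passed to s5cmd run as shown above.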
If you want to download from AWS instead, replace gcs_url with aws_url in the query, and replace the https://storage.googleapis.com endpoint in the download command with https://s3.amazonaws.com.
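If you switch between the two storage providers often, a small wrapper can keep the endpoint and the manifest in sync. This is only a sketch: the "gcs"/"aws" labels, the function name, and the manifest file names are my own choices, while the endpoints and flags are the ones used in the commands in this post.

```python
import shlex

# Endpoints taken from the download commands in this post
ENDPOINTS = {
    "gcs": "https://storage.googleapis.com",
    "aws": "https://s3.amazonaws.com",
}


def build_download_command(manifest, provider="gcs", s5cmd="s5cmd"):
    """Return the s5cmd invocation as an argument list."""
    if provider not in ENDPOINTS:
        raise ValueError(f"provider must be one of {sorted(ENDPOINTS)}")
    return [
        s5cmd,
        "--no-sign-request",
        "--endpoint-url",
        ENDPOINTS[provider],
        "run",
        manifest,
    ]


cmd = build_download_command("file_manifest_gcs.s5cmd", provider="gcs")
print(shlex.join(cmd))
# To actually start the download, pass the list to subprocess.run(cmd, check=True)
```

Remember that the manifest itself must match the provider: a manifest generated from gcs_url should be used with the Google endpoint, and one generated from aws_url with the AWS endpoint.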
Using SQL gives you a lot of flexibility compared to the IDC Portal. Here are a few examples:
- you can download just a subset of all of the series that meet your selection requirement using the LIMIT clause, which can be handy for quick experiments:
SELECT
ANY_VALUE(CONCAT("cp s3",REGEXP_SUBSTR(gcs_url, "(://.*)/"),"/* .")) as s5cmd_command
FROM
`bigquery-public-data.idc_current.dicom_all`
WHERE
modality = "SM"
GROUP BY
SeriesInstanceUID
# return only the first 3 rows
# effectively, selecting just the first 3 series
LIMIT
3
- you can choose a specific collection:
SELECT
ANY_VALUE(CONCAT("cp s3",REGEXP_SUBSTR(gcs_url, "(://.*)/"),"/* .")) as s5cmd_command
FROM
`bigquery-public-data.idc_current.dicom_all`
WHERE
modality = "SM"
# select only series from the CPTAC-BRCA collection
AND collection_id = "cptac_brca"
GROUP BY
SeriesInstanceUID
LIMIT
3
- with just a little bit of complexity, you can add SM-specific selection criteria that have been curated from DICOM metadata - in the example below we select those SM series that were obtained using 20x objective lens power:
SELECT
ANY_VALUE(CONCAT("cp s3",REGEXP_SUBSTR(gcs_url, "(://.*)/"),"/* .")) AS s5cmd_command
FROM
`bigquery-public-data.idc_current.dicom_all` AS dicom_all
JOIN
# this table simplifies access to the DICOM attributes specific to SM
`bigquery-public-data.idc_current.dicom_metadata_curated_series_level` AS sm_attributes
ON
dicom_all.SeriesInstanceUID = sm_attributes.SeriesInstanceUID
WHERE
dicom_all.Modality = "SM"
# select only series from the CPTAC-BRCA collection
AND collection_id = "cptac_brca"
# select only the images collected using 20x objective lens power
AND sm_attributes.ObjectiveLensPower = 20
GROUP BY
dicom_all.SeriesInstanceUID
LIMIT
3
You can see the schema of the idc_current.dicom_metadata_curated_series_level table in the BigQuery Console, or in https://github.com/ImagingDataCommons/etl_flow/blob/master/bq/generate_tables_and_views/derived_tables/BQ_Table_Building/derived_data_views/schema/dicom_metadata_curated_series_level.json.
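The first three queries in this post differ only in the optional collection filter and the LIMIT clause, so if you generate manifests regularly it may be convenient to assemble the query text in code. Below is a rough Python sketch that only builds the SQL string; the function name and parameters are hypothetical, and the check on collection_id assumes that collection identifiers consist of lowercase letters, digits, and underscores (as in cptac_brca). You can paste the output into the BigQuery console.

```python
import re


def build_manifest_query(collection_id=None, limit=None, url_column="gcs_url"):
    """Build the SM manifest query text with optional filters."""
    if url_column not in ("gcs_url", "aws_url"):
        raise ValueError("url_column must be 'gcs_url' or 'aws_url'")
    lines = [
        "SELECT",
        f'  ANY_VALUE(CONCAT("cp s3",REGEXP_SUBSTR({url_column}, "(://.*)/"),"/* .")) AS s5cmd_command',
        "FROM",
        "  `bigquery-public-data.idc_current.dicom_all`",
        "WHERE",
        '  Modality = "SM"',
    ]
    if collection_id is not None:
        # Assumed identifier format; rejects anything that could break the SQL
        if not re.fullmatch(r"[a-z0-9_]+", collection_id):
            raise ValueError(f"unexpected collection_id: {collection_id!r}")
        lines.append(f'  AND collection_id = "{collection_id}"')
    lines += ["GROUP BY", "  SeriesInstanceUID"]
    if limit is not None:
        lines += ["LIMIT", f"  {int(limit)}"]
    return "\n".join(lines)


print(build_manifest_query(collection_id="cptac_brca", limit=3))
```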
If you have any follow-up questions, please do not hesitate to ask in this thread!