How to download whole slide microscopy images from IDC

Recently, an IDC user asked how to download whole slide images (stored, needless to say, as DICOM Slide Microscopy objects) from IDC for specific collections. Since this is a question of general interest, I decided to post the recipe in the forum.

As a prerequisite, you will need to install s5cmd on your system, and either note the location of the s5cmd executable or add it to your path. You will need this tool to download files from IDC.

Next, you will need to select the files that you want to download. You can do this using the IDC Portal or BigQuery SQL.

IDC Portal

  1. Select “Slide microscopy” in the Modality filter group on the left. You can use additional filters to, for example, select a specific collection.

  2. Click the “Download images” button in the upper right corner.

  3. Save the manifest as prompted, and run the s5cmd command shown in the popup to download all of the images matching your filter criteria: s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com run file_manifest_gcs.s5cmd

It is important to remember that manifests generated from filters defined in the IDC Portal work at the level of DICOM studies, not DICOM series. For example, if your filter selects the Slide Microscopy modality, the manifest will include all images from every study that contains Slide Microscopy, and in general a DICOM study can include more than one modality.

BigQuery SQL

At the moment, it is not possible to use the IDC Portal to download specific individual series, or a subset of the images matching your filter definition. To achieve this, you will need to use BigQuery SQL. This brings a couple of extra prerequisites, which are easy to satisfy: 1) you will need a Google Account; 2) you will need to create a Google Cloud Project. We have a tutorial on how to complete those prerequisites here.

Assuming you have completed the prerequisites, you should be able to open the BigQuery SQL console, where you can enter and run queries.

The following query will generate a manifest that contains all of the Slide Microscopy image series. You can run it using the BigQuery SQL console.

SELECT
  ANY_VALUE(CONCAT("cp s3",REGEXP_SUBSTR(gcs_url, "(://.*)/"),"/* .")) as s5cmd_command
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  modality = "SM"
GROUP BY
  SeriesInstanceUID
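
To make the string manipulation in this query concrete, here is a minimal Python sketch of what the CONCAT/REGEXP_SUBSTR expression does to a single gcs_url value. The bucket and folder names in the sample URL are made up for illustration, and Python's re module stands in for BigQuery's regular expression engine, so treat it as an approximation rather than the exact query semantics.

```python
import re


def s5cmd_command_from_url(url: str) -> str:
    """Approximate CONCAT("cp s3", REGEXP_SUBSTR(url, "(://.*)/"), "/* .")."""
    # The greedy capture group keeps everything from "://" up to (but not
    # including) the last "/", i.e. the bucket plus the folder that holds
    # the file.
    match = re.search(r"(://.*)/", url)
    if match is None:
        raise ValueError(f"not a recognizable bucket URL: {url!r}")
    return "cp s3" + match.group(1) + "/* ."


# Hypothetical URL, for illustration only (not a real IDC object).
sample_url = "gs://example-bucket/example-series-folder/example-instance.dcm"
print(s5cmd_command_from_url(sample_url))
```

For the sample URL this prints cp s3://example-bucket/example-series-folder/* . which is one s5cmd copy command for the whole folder. The query appears to rely on all files of a series sharing that folder, which is why it can take ANY_VALUE per SeriesInstanceUID.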

Once you run the query, save the result as a local CSV file directly from the BigQuery console.
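
One detail worth checking before you hand the file to s5cmd: as far as I know, a CSV export starts with a header row holding the column name (s5cmd_command here), which is not an s5cmd command. The sketch below, assuming that layout, copies the command column into a plain manifest file and skips the header. The input file name query_result.csv is a placeholder of mine, not something the BigQuery console produces.

```python
import csv
from pathlib import Path


def csv_to_manifest(csv_path: str, manifest_path: str) -> int:
    """Write one s5cmd command per line; return the number of commands written."""
    commands = []
    with open(csv_path, newline="", encoding="utf-8") as src:
        for row in csv.reader(src):
            # Skip blank lines and the assumed header row.
            if not row or row[0].strip() == "s5cmd_command":
                continue
            commands.append(row[0].strip())
    Path(manifest_path).write_text(
        "".join(command + "\n" for command in commands), encoding="utf-8"
    )
    return len(commands)


if __name__ == "__main__":
    # "query_result.csv" is a hypothetical name for the file saved from the console.
    if Path("query_result.csv").exists():
        count = csv_to_manifest("query_result.csv", "file_manifest_gcs.s5cmd")
        print(f"wrote {count} commands")
```

Using the csv module rather than plain line splitting also takes care of any quoting the export may have added around the values.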

Then run the same command as you did for the manifest downloaded using the portal (assuming you saved the query result as file_manifest_gcs.s5cmd):

s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com run file_manifest_gcs.s5cmd

If you want to download from AWS instead, replace gcs_url with aws_url in the query, and replace the https://storage.googleapis.com endpoint in the download command with https://s3.amazonaws.com.
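
If you switch between the two clouds often, the pairing of URL column and endpoint can be captured in a small helper. This is a sketch only: the function and dictionary names are my own, it assumes s5cmd is on your path, and it uses just the flags already shown in the commands in this post.

```python
import subprocess

# Endpoints taken from this post; the keys are labels chosen for this sketch.
ENDPOINTS = {
    "gcs": "https://storage.googleapis.com",
    "aws": "https://s3.amazonaws.com",
}


def build_download_command(manifest: str, cloud: str = "gcs") -> list[str]:
    """Return the s5cmd invocation for a manifest built from gcs_url or aws_url."""
    if cloud not in ENDPOINTS:
        raise ValueError(f"unknown cloud {cloud!r}; expected one of {sorted(ENDPOINTS)}")
    return [
        "s5cmd",
        "--no-sign-request",
        "--endpoint-url",
        ENDPOINTS[cloud],
        "run",
        manifest,
    ]


def download(manifest: str, cloud: str = "gcs") -> None:
    """Run s5cmd; raises subprocess.CalledProcessError if it exits with an error."""
    subprocess.run(build_download_command(manifest, cloud), check=True)
```

The manifest has to match the endpoint: one generated from gcs_url goes with the Google endpoint, one generated from aws_url with the AWS endpoint.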

Using SQL gives you a lot of flexibility compared to the IDC Portal. Here are a few examples:

  • you can download just a subset of all of the series that meet your selection requirement using the LIMIT clause, which can be handy for quick experiments:
SELECT
  ANY_VALUE(CONCAT("cp s3",REGEXP_SUBSTR(gcs_url, "(://.*)/"),"/* .")) as s5cmd_command
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  modality = "SM"
GROUP BY
  SeriesInstanceUID

# return only the first 3 rows 
#  effectively, selecting just the first 3 series
LIMIT
  3
  • you can choose the specific collection:
SELECT
  ANY_VALUE(CONCAT("cp s3",REGEXP_SUBSTR(gcs_url, "(://.*)/"),"/* .")) as s5cmd_command
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  modality = "SM"

# select only series from the CPTAC-BRCA collection
  AND collection_id = "cptac_brca"
GROUP BY
  SeriesInstanceUID
LIMIT
  3
  • with just a little bit of complexity, you can add SM-specific selection criteria that have been curated from DICOM metadata - in the example below we select those SM series that were obtained using 20x objective lens power:
SELECT
  ANY_VALUE(CONCAT("cp s3",REGEXP_SUBSTR(gcs_url, "(://.*)/"),"/* .")) AS s5cmd_command
FROM
  `bigquery-public-data.idc_current.dicom_all` AS dicom_all
JOIN
  # this table simplifies access to the DICOM attributes specific to SM
  `bigquery-public-data.idc_current.dicom_metadata_curated_series_level` AS sm_attributes
ON
  dicom_all.SeriesInstanceUID = sm_attributes.SeriesInstanceUID
WHERE
  dicom_all.Modality = "SM"
  # select only series from the CPTAC-BRCA collection
  AND collection_id = "cptac_brca"
  # select only the images collected using 20x objective lens power
  AND sm_attributes.ObjectiveLensPower = 20
GROUP BY
  dicom_all.SeriesInstanceUID
LIMIT
  3

You can see the schema of the idc_current.dicom_metadata_curated_series_level table in the BigQuery Console, or in https://github.com/ImagingDataCommons/etl_flow/blob/master/bq/generate_tables_and_views/derived_tables/BQ_Table_Building/derived_data_views/schema/dicom_metadata_curated_series_level.json.
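
Since the three example queries differ only in their JOIN, WHERE and LIMIT clauses, they can be generated from one template. The sketch below is a hypothetical helper of mine, not an IDC tool; it only assembles the SQL text, which you would still paste into the BigQuery console. It does no escaping of its inputs, so use it only with values you trust.

```python
from typing import Optional

TABLE = "bigquery-public-data.idc_current.dicom_all"
CURATED = "bigquery-public-data.idc_current.dicom_metadata_curated_series_level"


def build_manifest_query(
    collection_id: Optional[str] = None,
    objective_lens_power: Optional[int] = None,
    limit: Optional[int] = None,
    url_column: str = "gcs_url",
) -> str:
    """Assemble the manifest query text for SM series with optional filters."""
    lines = [
        "SELECT",
        f'  ANY_VALUE(CONCAT("cp s3",REGEXP_SUBSTR({url_column}, "(://.*)/"),"/* .")) AS s5cmd_command',
        "FROM",
        f"  `{TABLE}` AS dicom_all",
    ]
    if objective_lens_power is not None:
        # The SM-specific attributes live in the curated series-level table.
        lines += [
            "JOIN",
            f"  `{CURATED}` AS sm_attributes",
            "ON",
            "  dicom_all.SeriesInstanceUID = sm_attributes.SeriesInstanceUID",
        ]
    lines += ["WHERE", '  dicom_all.Modality = "SM"']
    if collection_id is not None:
        lines.append(f'  AND collection_id = "{collection_id}"')
    if objective_lens_power is not None:
        lines.append(f"  AND sm_attributes.ObjectiveLensPower = {int(objective_lens_power)}")
    lines += ["GROUP BY", "  dicom_all.SeriesInstanceUID"]
    if limit is not None:
        lines += ["LIMIT", f"  {int(limit)}"]
    return "\n".join(lines)


print(build_manifest_query(collection_id="cptac_brca", objective_lens_power=20, limit=3))
```

Called with no arguments it produces the plain "all SM series" query; passing url_column="aws_url" gives the AWS variant.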

If you have any follow-up questions, please do not hesitate to ask in this thread!

I am sure this is exactly what I am looking for, but it is going way over my head. Is there any way someone can walk me through it?

Happy to help here @jcbpssmr!

Would you be able to join IDC Community Hours so we can help you?

IDC Community Office Hours open to all every Tuesday 16:30 – 17:30 (New York) and Wednesday 10:30-11:30 (New York) via Google Meet at https://meet.google.com/xyt-vody-tvb .