Bulk Download CT images with age and sex

Hi,

I am trying to create a dataset of CT images that have age and sex embedded in the file. I have looked through several files, and some of them do not have these attributes. Is it possible to run a search with this as a condition, and then download only the matching CT images?

Best,
Hans Martin

Hans, yes, that is possible.

You will have to use BigQuery and follow the instructions on the “Downloading data with s5cmd” page of the IDC User Guide, using the query below to create the manifest (i.e., plug it into the “Step 1: Create the manifest” section):

SELECT
  DISTINCT(CONCAT(series_aws_url, "* ."))
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  PatientAge IS NOT NULL
  AND PatientSex IS NOT NULL
  AND Modality = "CT"

The query above returns 29482 rows against the current IDC release v18.

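Note that PatientAge values use the “Age String” (AS) Value Representation defined in the DICOM standard (https://dicom.nema.org/medical/dicom/current/output/chtml/part05/sect_6.2.html), so they are strings such as 045Y (three digits followed by D, W, M, or Y for days, weeks, months, or years) rather than plain integers.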
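If you need numeric ages after downloading, here is a minimal Python sketch of how an Age String could be converted to years. The function name and the approximate conversion factors are my own assumptions, not part of IDC or the DICOM standard:

```python
import re
from typing import Optional

# DICOM Age String (AS): three digits followed by a unit letter,
# D = days, W = weeks, M = months, Y = years (for example "045Y").
_AGE_PATTERN = re.compile(r"(\d{3})([DWMY])")

# Approximate number of each unit per year (an assumption of this sketch).
_UNITS_PER_YEAR = {"D": 365.25, "W": 52.18, "M": 12.0, "Y": 1.0}


def dicom_age_to_years(value: Optional[str]) -> Optional[float]:
    """Convert a DICOM Age String to years; return None if it is missing or malformed."""
    if value is None:
        return None
    match = _AGE_PATTERN.fullmatch(value.strip())
    if match is None:
        return None
    number, unit = match.groups()
    return int(number) / _UNITS_PER_YEAR[unit]


if __name__ == "__main__":
    for raw in ["045Y", "018M", "", "45"]:
        print(repr(raw), "->", dicom_age_to_years(raw))
```

Values that do not follow the AS pattern come back as None, so you can decide separately how to handle them.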

You can increase the number of series that fit your selection criteria considerably with just a bit more effort. NLST is the largest collection of chest CTs in IDC, but all of its PatientAge values are blank. Patient age is, however, available in the accompanying clinical data tables.

The query below performs a join between the clinical data table that contains age and the dicom_all table used above for the entries corresponding to the series from the NLST collection:

SELECT
  DISTINCT(CONCAT(series_aws_url, "* .")),
  age,
  # this follows the collection-specific conventions for encoding gender
  CASE gender
    WHEN 1 THEN "M"
    WHEN 2 THEN "F"
  END
  AS PatientSex
FROM
  `bigquery-public-data.idc_current.dicom_all` AS dicom_all
JOIN
  `bigquery-public-data.idc_v18_clinical.nlst_prsn` AS nlst_prsn
ON
  dicom_all.PatientID = nlst_prsn.dicom_patient_id
WHERE
  Modality = "CT"
  AND collection_id = "nlst"

The query above will give you the manifest for another 203K+ CT series. You can use it together with the download instructions on the “Downloading data with s5cmd” page of the IDC User Guide mentioned earlier.
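Since this query returns age and PatientSex next to each manifest entry, you may want to separate the two before downloading. Below is a rough Python sketch. It assumes you exported the query results to a CSV file with a header row and the three columns in the order returned by the query. The function name and file paths are hypothetical:

```python
import csv


def split_query_export(csv_path, manifest_path, metadata_path):
    """Split a CSV export of the query results into a manifest and a metadata table.

    Assumes a header row and three columns in query order:
    manifest entry, age, PatientSex.
    Returns the number of unique manifest entries written.
    """
    seen = set()
    with open(csv_path, newline="") as src, \
            open(manifest_path, "w") as manifest, \
            open(metadata_path, "w", newline="") as meta:
        reader = csv.reader(src)
        next(reader, None)  # skip the header row
        writer = csv.writer(meta)
        writer.writerow(["manifest_entry", "age", "PatientSex"])
        for row in reader:
            if len(row) < 3:
                continue  # ignore incomplete rows
            entry, age, sex = row[0], row[1], row[2]
            writer.writerow([entry, age, sex])
            if entry not in seen:
                # Each manifest entry is written only once.
                seen.add(entry)
                manifest.write(entry + "\n")
    return len(seen)
```

The manifest file then has one entry per line, like the output of the first query, and the metadata file keeps age and sex for each series so you can look them up after the download.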

By way of explanation: the query above uses the table bigquery-public-data.idc_v18_clinical.nlst_prsn, which contains patient age and gender. Its dicom_patient_id column corresponds to PatientID in dicom_all, which is what we use to join the two tables. We also map the collection-specific codes used for recording patient sex to the “M”/“F” values used in DICOM. Age is encoded as an integer.
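If you plan to combine the NLST results with the results of the first query, you may want both to follow the same conventions. Here is a small Python sketch that mirrors the gender mapping from the query and formats the integer age as a DICOM-style Age String in years. The function names are hypothetical, and treating the NLST age as whole years is my assumption:

```python
from typing import Optional

# NLST coding of the gender column, as mapped in the query above.
NLST_GENDER_TO_DICOM_SEX = {1: "M", 2: "F"}


def nlst_gender_to_sex(gender: Optional[int]) -> Optional[str]:
    """Map the NLST gender code to the DICOM PatientSex value, or None if unknown."""
    return NLST_GENDER_TO_DICOM_SEX.get(gender)


def years_to_dicom_age(age: Optional[int]) -> Optional[str]:
    """Format an integer age in years as a DICOM Age String such as "062Y"."""
    if age is None or not 0 <= age <= 999:
        return None
    return f"{age:03d}Y"


if __name__ == "__main__":
    print(nlst_gender_to_sex(2), years_to_dicom_age(62))
```

Codes other than 1 and 2 map to None here, which matches the CASE expression in the query returning NULL for them.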

You can see all the fields in that table (including the dictionary of the values for the gender field) using this query:

SELECT
  *
FROM
  `bigquery-public-data.idc_current_clinical.column_metadata`
WHERE
  collection_id = "nlst"
  AND table_name LIKE "%nlst_prsn%"

If you want to learn more about working with the clinical data accompanying IDC collections, see the clinical_data_intro.ipynb tutorial notebook (notebooks/advanced_topics/clinical_data_intro.ipynb) in the ImagingDataCommons/IDC-Tutorials repository on GitHub.