Hans, yes, that is possible.
You will have to use BigQuery and follow the instructions on the IDC User Guide page “Downloading data with s5cmd”, using the query below to create the manifest (i.e., plug it into the “Step 1: Create the manifest” section):
SELECT
DISTINCT(CONCAT(series_aws_url, "* ."))
FROM
`bigquery-public-data.idc_current.dicom_all`
WHERE
PatientAge IS NOT NULL
AND PatientSex IS NOT NULL
AND Modality = "CT"
The query above returns 29482 rows against the current IDC release v18.
Note that PatientAge values use the “Age String” (AS) Value Representation defined in the DICOM standard, so they are strings such as 045Y rather than plain numbers: https://dicom.nema.org/medical/dicom/current/output/chtml/part05/sect_6.2.html
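If you need numeric ages after running the query, here is a minimal Python sketch for parsing Age String values. It assumes the standard form of three digits followed by D, W, M or Y. The function name and the conversion to approximate years are my own choices for illustration, not something the IDC tables provide.

```python
import re

# DICOM Age String (AS): three digits followed by a unit letter,
# D (days), W (weeks), M (months) or Y (years), for example "045Y".
_AGE_PATTERN = re.compile(r"(\d{3})([DWMY])")

# How many of each unit make up one year. The day and week values are
# approximations chosen for this sketch.
_UNITS_PER_YEAR = {"D": 365.25, "W": 365.25 / 7, "M": 12, "Y": 1}


def age_string_to_years(value):
    """Return the age in years as a float, or None if blank or malformed."""
    if value is None:
        return None
    match = _AGE_PATTERN.fullmatch(value.strip())
    if match is None:
        return None
    number, unit = match.groups()
    return int(number) / _UNITS_PER_YEAR[unit]
```

Values that do not match the pattern come back as None, so you can decide separately how to handle nonconforming entries.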
With just a bit more effort, you can substantially increase the number of series that fit your selection criteria. NLST is the largest collection of chest CTs in IDC, but all of its PatientAge values are blank. Patient age is, however, available in the accompanying tables containing clinical data.
The query below joins the clinical data table that contains age with the dicom_all table used above, restricted to the series from the NLST collection:
SELECT
DISTINCT(CONCAT(series_aws_url, "* .")),
age,
# this follows the collection-specific conventions for encoding gender
CASE gender
WHEN 1 THEN "M"
WHEN 2 THEN "F"
END
AS PatientSex
FROM
`bigquery-public-data.idc_current.dicom_all` AS dicom_all
JOIN
`bigquery-public-data.idc_v18_clinical.nlst_prsn` AS nlst_prsn
ON
dicom_all.PatientID = nlst_prsn.dicom_patient_id
WHERE
Modality = "CT"
AND collection_id = "nlst"
The query above will give you the manifest for another 203K+ CT series. You can use it together with the download instructions on the page mentioned earlier (“Downloading data with s5cmd” in the IDC User Guide).
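Since this query returns age and sex alongside each manifest entry, only the first column belongs in the manifest file. Below is a rough Python sketch of that step, assuming you have the result rows as tuples (for example, from an exported result). The function names and the example bucket are hypothetical. It also assumes the manifest is passed to `s5cmd run`, which expects a complete command on each line, so it prepends "cp " when that prefix is missing; check the query in Step 1 of the guide to confirm what it expects.

```python
def rows_to_manifest_lines(rows):
    """Build manifest lines from query result rows.

    Each row is a sequence whose first element is the string produced by
    CONCAT(series_aws_url, "* .") in the query; other columns are ignored.
    """
    lines = []
    seen = set()
    for row in rows:
        entry = row[0].strip()
        # Assumption: the file is consumed by `s5cmd run`, which reads one
        # full command per line, so add the "cp " verb if it is absent.
        if not entry.startswith("cp "):
            entry = "cp " + entry
        # Keep each series once even if it appears in several rows.
        if entry not in seen:
            seen.add(entry)
            lines.append(entry)
    return lines


def write_manifest(rows, path):
    """Write the manifest to `path` and return the number of lines written."""
    lines = rows_to_manifest_lines(rows)
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
    return len(lines)


# Hypothetical rows in the shape returned by the NLST query above.
example_rows = [
    ("s3://example-bucket/series-a/* .", 61, "M"),
    ("s3://example-bucket/series-b/* .", 58, "F"),
    ("s3://example-bucket/series-a/* .", 61, "M"),
]
example_lines = rows_to_manifest_lines(example_rows)
```

Keeping the age and sex columns in a separate file keyed by the series URL lets you match the downloaded series back to the clinical values later.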
To explain: the query above uses the table bigquery-public-data.idc_v18_clinical.nlst_prsn, which contains patient age and gender. Its dicom_patient_id column matches PatientID in dicom_all, which is what we use to join the two tables. We also map the collection-specific conventions used for recording patient sex to the “M”/“F” values used in DICOM. Age is encoded as an integer.
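To make the join and the gender mapping concrete, here is the same logic as a small Python sketch over in-memory records. The sample records are made up, the function name is hypothetical, and the sketch assumes one clinical row per patient; it only illustrates what the SQL does and is not a replacement for running the query in BigQuery.

```python
# Collection-specific gender codes in nlst_prsn mapped to DICOM PatientSex values,
# mirroring the CASE expression in the query.
GENDER_TO_SEX = {1: "M", 2: "F"}


def join_series_with_clinical(series_rows, clinical_rows):
    """Inner-join series rows to clinical rows on the patient identifier."""
    # Index clinical rows by patient ID (assumes one row per patient).
    clinical_by_patient = {row["dicom_patient_id"]: row for row in clinical_rows}
    joined = []
    for series in series_rows:
        person = clinical_by_patient.get(series["PatientID"])
        if person is None:
            continue  # no clinical match: dropped, as in an inner join
        joined.append({
            "series_aws_url": series["series_aws_url"],
            "age": person["age"],
            # Codes other than 1 and 2 map to None, like CASE without ELSE.
            "PatientSex": GENDER_TO_SEX.get(person["gender"]),
        })
    return joined


# Made-up records for illustration only.
example_series = [
    {"PatientID": "P1", "series_aws_url": "s3://example-bucket/series-a/"},
    {"PatientID": "P2", "series_aws_url": "s3://example-bucket/series-b/"},
    {"PatientID": "P9", "series_aws_url": "s3://example-bucket/series-c/"},
]
example_clinical = [
    {"dicom_patient_id": "P1", "age": 61, "gender": 1},
    {"dicom_patient_id": "P2", "age": 58, "gender": 2},
]
example_joined = join_series_with_clinical(example_series, example_clinical)
```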
You can see all the fields in that table (including the dictionary of the values for the gender field) using this query:
SELECT
*
FROM
`bigquery-public-data.idc_current_clinical.column_metadata`
WHERE
collection_id = "nlst"
AND table_name LIKE "%nlst_prsn%"
If you want to learn more about how to work with the clinical data accompanying IDC collections, see this tutorial: notebooks/advanced_topics/clinical_data_intro.ipynb in the ImagingDataCommons/IDC-Tutorials repository on GitHub.