Question about non-cancer lung CT images

Hello, professor. I am a student at the University of Seoul. I am wondering whether I can get CT images of normal (non-cancer) subjects for a dataset. I can find CT images of lung cancer, but my teammates and I are struggling to find CT images of normal subjects. Could you provide such images or, if not, let me know how to obtain them? We need them for a simple research project.

This is a great question!

Indeed, most of the images in IDC are from cancer patients. However, the NLST collection contains a dataset of chest CTs collected in a cancer screening trial, and thus will include images for non-cancer patients.

I am not that familiar with that collection, but you could probably use some of the metadata in the bigquery-public-data.idc_v9.nlst_canc table to select patients that did not have cancer.

As an example, you can query for the counts of patients that have distinct values of clinical_n (Clinical N code for staging, AJCC 6, per dictionary here) with this query:

SELECT
  clinical_n,
  COUNT(DISTINCT(pid)) as num_patients
FROM
  `bigquery-public-data.idc_v9.nlst_canc`
GROUP BY
  clinical_n

which will result in the following:

[image: screenshot of the query results]

But I do not know if the missing value for clinical or pathological stage, for example, can be used as the indication that there was no cancer in that patient. @dclunie do you know?

In the “person” table supplied by the NLST folks there is a “lung_cancer” column described as:

Confirmed Lung Cancer?

Does the participant have a confirmed lung cancer diagnosis?

0=“No”
1=“Yes”


Thank you for your help once again. I have another question, about LIDC-IDRI, an open dataset that we are using.

Does this dataset include non-cancer lung CT images? If so, we would be very happy to know.

On Wed, May 25, 2022 at 9:01 AM, David Clunie via Imaging Data Commons <notifications@canceridc.discoursemail.com> wrote:

@imjkang7 I am sorry I have not responded earlier.

I am afraid that, for the LIDC-IDRI collection, we do not have information on whether a specific patient had cancer.

But you can select patients in the NLST collection that did not have confirmed cancer.

The NLST collection is accompanied by several tables containing clinical data, see Files and metadata - IDC User Guide. If you follow the link, you will find the schema describing the metadata in those tables. One of those tables, prsn, contains the column can_scr, defined as “Indicates whether the cancer followed a positive, negative, or missed screen, or whether it occurred after the screening years.” with values 0=“No Cancer”, 1=“Positive Screen”, 2=“Negative Screen”, 3=“Missed Screen”, 4=“Post Screening”.

We confirmed with the curators of the NLST collection that patients with the value 0 (“No Cancer”) in that column can be treated as “normal” subjects that did not have a confirmed lung cancer diagnosis.

Putting this all together, you can first identify patients that did not have cancer using the following query:

SELECT
  DISTINCT pid
FROM
  `bigquery-public-data.idc_v9.nlst_prsn`
WHERE
  can_scr = 0
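
To make the can_scr coding explicit, here is a minimal Python sketch of the same filter applied to rows that are already in memory. The code-to-label mapping is copied from the schema quoted above. The helper name `non_cancer_pids` and the sample rows are made up for illustration and are not real NLST data.

```python
# can_scr codes as quoted from the NLST schema above.
CAN_SCR_LABELS = {
    0: "No Cancer",
    1: "Positive Screen",
    2: "Negative Screen",
    3: "Missed Screen",
    4: "Post Screening",
}


def non_cancer_pids(rows):
    """Return the sorted, de-duplicated pids whose can_scr is 0 ("No Cancer")."""
    return sorted({row["pid"] for row in rows if row["can_scr"] == 0})


# Hypothetical rows, for illustration only (not real NLST data).
sample_rows = [
    {"pid": 1001, "can_scr": 0},
    {"pid": 1002, "can_scr": 1},
    {"pid": 1003, "can_scr": 0},
    {"pid": 1001, "can_scr": 0},
]

print(non_cancer_pids(sample_rows))  # [1001, 1003]
```

Note that only the value 0 selects non-cancer subjects. According to the column definition, a value of 2 (“Negative Screen”) describes a cancer that followed a negative screen, so it should not be included.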

You can join those patient identifiers with the DICOM metadata in the IDC dicom_all table to get the CT scans for all patients that did not have confirmed cancer. The query below will give you the list of all SeriesInstanceUIDs for CT scans from those non-cancer patients, along with the URL to the viewer, so you can visually examine those series.

SELECT DISTINCT
  SeriesInstanceUID,
  CONCAT("https://viewer.imaging.datacommons.cancer.gov/viewer/", StudyInstanceUID, "?seriesInstanceUID=", SeriesInstanceUID) AS viewer_url
FROM
  `bigquery-public-data.idc_v9.nlst_prsn` as prsn
JOIN
  `bigquery-public-data.idc_v9.dicom_all` as dicom_all
ON
  SAFE_CAST(prsn.pid AS STRING) = dicom_all.PatientID
WHERE
  can_scr = 0 AND Modality = "CT"
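
For readers who prefer to see the join logic spelled out, the following is a rough Python sketch of what the query does, run on a few in-memory rows instead of BigQuery. It assumes pid is stored as a number in prsn (which the SAFE_CAST in the query suggests) and that dicom_all can contain several rows per series (which is why the query uses DISTINCT). The function names and the sample rows with placeholder UIDs are hypothetical.

```python
VIEWER_BASE = "https://viewer.imaging.datacommons.cancer.gov/viewer/"


def viewer_url(study_instance_uid, series_instance_uid):
    """Build the viewer link the same way the CONCAT in the query does."""
    return f"{VIEWER_BASE}{study_instance_uid}?seriesInstanceUID={series_instance_uid}"


def non_cancer_ct_series(prsn_rows, dicom_rows):
    """Map each CT SeriesInstanceUID of a can_scr = 0 patient to its viewer URL."""
    # Compare identifiers as strings, mirroring SAFE_CAST(prsn.pid AS STRING).
    normal_ids = {str(row["pid"]) for row in prsn_rows if row["can_scr"] == 0}
    series = {}
    for row in dicom_rows:
        if row["Modality"] == "CT" and row["PatientID"] in normal_ids:
            # Using the series UID as the key collapses repeated rows per series.
            series[row["SeriesInstanceUID"]] = viewer_url(
                row["StudyInstanceUID"], row["SeriesInstanceUID"]
            )
    return series


# Placeholder rows and UIDs, for illustration only (not real IDC data).
prsn_rows = [{"pid": 1001, "can_scr": 0}, {"pid": 1002, "can_scr": 1}]
dicom_rows = [
    {"PatientID": "1001", "Modality": "CT", "StudyInstanceUID": "1.2.3", "SeriesInstanceUID": "1.2.3.4"},
    {"PatientID": "1001", "Modality": "CT", "StudyInstanceUID": "1.2.3", "SeriesInstanceUID": "1.2.3.4"},
    {"PatientID": "1001", "Modality": "SR", "StudyInstanceUID": "1.2.3", "SeriesInstanceUID": "1.2.3.5"},
    {"PatientID": "1002", "Modality": "CT", "StudyInstanceUID": "1.2.9", "SeriesInstanceUID": "1.2.9.1"},
]

for uid, url in non_cancer_ct_series(prsn_rows, dicom_rows).items():
    print(uid, url)
```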

To download the series you identified, follow the instructions for selecting by specific SeriesInstanceUID on this documentation page: Downloading data - IDC User Guide. You can download all of them, but with ~200K series this will take some time.