NLST Data Completeness: Missing 28 CT Cancer Patients

Hi everyone,

I am working with the NLST collection in IDC and have performed a detailed completeness check against the NLST User Guide and original publications. I am using the idc_index package to work with the data, much like in the very useful sample notebooks. I have verified the data but have a few outstanding questions regarding missing records and columns:

| Metric | Reference | Our Count | Match |
|---|---|---|---|
| Total participants | 53,452 | 53,452 | yes |
| CT arm patients (with imaging) | 26,722* | 26,254 | ~ |
| Total cancer (both arms) | 2,058 | 2,058 | yes |
| CT arm cancer | 1,089 | 1,061 | ~ |
| X-ray arm cancer (inferred) | 969 | 997 | ~ |

* Reference: ~26,722 CT arm participants in original study

| Arm | Published (paper) | Dataset (newer DB) |
|---|---|---|
| CT | 1,060 | 1,089 |
| X-ray | 941 | 969 |
| Total | 2,001 | 2,058 |

Reference values from Table 6 of the NLST User Guide.

1. Missing 28 CT Cancer Patients (Imaging vs. Clinical DB)

I identified 26,254 patients in the CT arm with imaging data (which matches the TCIA manifest I manually downloaded to double check) vs the 26,722 reported in the publication. I guess this decrease makes sense because not all patients may have had imaging. Within this set, I found 1,061 patients with confirmed cancer using can_scr (linking nlst_prsn to nlst_canc via dicom_patient_id).

  • My count (1,061) matches the original NEJM 2011 paper count (1,060) within a 0.1% margin.

  • However, the current NLST User Guide (Table 6) and newer database versions list 1,089 cancers in the CT arm.

  • When I infer the X-ray arm cancer count (Total 2,058 - My CT 1,061), I get 997. This is exactly 28 higher than the reference X-ray count (969), which perfectly accounts for the 28 ‘missing’ CT cancer cases (i.e., they are in the dataset’s total count but missing from the CT imaging arm). This count seems to match the newer database counts from the NLST user guide, which is weird.

  • Question: Are these ~28 missing patients cases that exist in the clinical database but whose images were never transferred to TCIA/IDC? I am mainly interested in making sure my data set is complete, and that everything went correctly during the download.
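
The reconciliation described above can be checked with a few lines of arithmetic. This is only a sketch that restates the counts quoted in this post; the variable names are illustrative and nothing here queries IDC or the clinical tables.

```python
# Counts quoted in the post above.
total_cancers = 2058            # both arms, NLST User Guide (Table 6)
ct_cancers_reference = 1089     # CT arm, NLST User Guide (Table 6)
xray_cancers_reference = 969    # X-ray arm, NLST User Guide (Table 6)
ct_cancers_with_images = 1061   # CT-arm cancer patients found with imaging

# X-ray count inferred by subtracting the imaging-based CT count from the total.
xray_cancers_inferred = total_cancers - ct_cancers_with_images

# If the only difference is CT-arm cancer patients without images,
# the CT shortfall and the X-ray surplus must be the same number.
ct_gap = ct_cancers_reference - ct_cancers_with_images
xray_surplus = xray_cancers_inferred - xray_cancers_reference

print(xray_cancers_inferred, ct_gap, xray_surplus)
```

Both differences come out to 28, which is consistent with the cases being counted in the clinical totals but absent from the CT imaging set.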

2. Missing conflc and rndgroup Columns

I noticed that the conflc (Confirmed Lung Cancer) and rndgroup (Randomization Group) columns are missing from the nlst_prsn table when accessed via IDCClient, even though they are listed in the NLST User Guide.

  • I am currently inferring rndgroup based on Modality (CT vs X-ray) and determining cancer status via other columns.

  • Question: Are these columns available in a different table or view? While I have worked around this by inferring rndgroup from Modality and cancer status from can_scr, it would be helpful to know if the direct flags are available.

@kosmasgal thank you for raising those questions, and glad you are able to complete this analysis using the tools we provide to accompany the data!

The history of the NLST dataset is that it was initially available from TCIA as a restricted access collection, which subsequently became open access and was mirrored at IDC. Since you observe that the numbers for CT images are identical between the TCIA manifest and what you find in IDC, this confirms there are no download errors.

We did observe that there are some patient identifiers in the clinical data that are not accompanied by images, but did not investigate this further. I think the best way to find out why would be to reach out to NCI CDAS (the entity responsible for collecting the data and sharing it with TCIA/IDC for public distribution) via this form https://cdas.cancer.gov/contact/nlst/.

NLST clinical data available publicly via TCIA and IDC is just a subset of the clinical data that was collected. If you would like to have access to additional attributes, you will need to apply for access from CDAS using the link available here: Datasets - NLST - The Cancer Data Access System.

Please let us know if you have any further questions or feedback about NLST collection, IDC or any of the tools we maintain! We are always interested in hearing from the users on how we can make IDC even better.

Thank you so much for the prompt and detailed response @fedorov! You really do cover my points. As you suggested I’ll also reach out to the NLST team for completeness’s sake and add their reply to this post in case someone in the future encounters the same issue.

I also found another little quirk!

I have identified a discrepancy between the TCIA (Cancer Imaging Archive) data manifest (manifest-NLST_allCT.tcia) and the data currently indexed in the IDC (Imaging Data Commons).

A total of 12 CT series that exist in the TCIA manifest are not indexed in IDC.
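
A comparison like this boils down to a set difference between the SeriesInstanceUIDs listed in the manifest and those present in the index. The sketch below is one way to do it, assuming the manifest consists of `key=value` header lines followed by a `ListOfSeriesToDownload=` line and then one UID per line. The function names and the short UIDs are placeholders, and the set of indexed UIDs is passed in directly rather than fetched from IDC.

```python
def read_manifest_series(manifest_text):
    """Return the set of SeriesInstanceUIDs listed in a manifest.

    Assumes every non-empty line after "ListOfSeriesToDownload=" is a UID.
    """
    uids = set()
    in_list = False
    for line in manifest_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if in_list:
            uids.add(line)
        elif line.startswith("ListOfSeriesToDownload="):
            in_list = True
    return uids


def missing_from_index(manifest_uids, indexed_uids):
    """Return the UIDs that are in the manifest but not in the index, sorted."""
    return sorted(set(manifest_uids) - set(indexed_uids))


# Placeholder data, not real NLST identifiers.
example_manifest = (
    "exampleHeader=value\n"
    "ListOfSeriesToDownload=\n"
    "1.2.3.1\n"
    "1.2.3.2\n"
    "1.2.3.3\n"
)
indexed = {"1.2.3.1"}

missing = missing_from_index(read_manifest_series(example_manifest), indexed)
print(missing)
```

In practice the indexed set would come from the SeriesInstanceUID column of the index for the collection being checked.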

Series Breakdown by Patient

The 12 missing series belong to two patients:

| Patient ID | Total Missing CT Series |
|---|---|
| 126153 | 8 series |
| 215303 | 4 series |
| TOTAL | 12 series |

Patient Status Comparison

Both patients are present in IDC, but the counts for their CT series are inconsistent:

  • Patient 126153:

    • In TCIA Manifest: 8 CT series

    • Indexed in IDC: 1 CT series

    • Result: 7 CT series are missing from the IDC index.

  • Patient 215303:

    • In TCIA Manifest: 12 CT series

    • Indexed in IDC: 8 CT series (across 2 studies) plus other series (10 SR, 5 SEG)

    • Result: 4 CT series are missing from the IDC index.

The 12 missing SeriesInstanceUIDs are:

  • Patient 126153 (8 series):

  • 1.2.840.113654.2.55.124920821357452508840472848733176720100

  • 1.2.840.113654.2.55.142299719953499919217431589345749071387

  • 1.2.840.113654.2.55.142382132672452692150734693915194789257

  • 1.2.840.113654.2.55.188627327646685672678367533541182805068

  • 1.2.840.113654.2.55.252348289640389137888240251318457974542

  • 1.2.840.113654.2.55.285617960144868132779721559793059662586

  • 1.2.840.113654.2.55.291684382394288579210996275839384021559

  • 1.2.840.113654.2.55.43665829455927473870872470762447960539

  • Patient 215303 (4 series):

  • 1.3.6.1.4.1.14519.5.2.1.7009.9004.134883620837966968413091837630

  • 1.3.6.1.4.1.14519.5.2.1.7009.9004.217834446566745966670844385685

  • 1.3.6.1.4.1.14519.5.2.1.7009.9004.241503579592351904660372998455

  • 1.3.6.1.4.1.14519.5.2.1.7009.9004.319376119264764431009619572825

Verification and Resolution

I attempted to verify the presence of these 12 series using IDC tools:

  1. IDC Index Search: A search of the full IDC index (all modalities, all collections) found none of the 12 missing SeriesInstanceUIDs.

  2. IDC Download Attempt: An attempt to download the series using IDCClient.download_from_selection() failed with a “No data found” error.

  3. Successful Download: I was successful in downloading the data by creating a custom manifest of the missing UIDs and using the NBIA Data Retriever tool provided by TCIA.
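
For anyone repeating step 3, a custom manifest can be derived from the original one by keeping its header and only the wanted UIDs. This is a sketch under the assumption that the manifest has `key=value` header lines ending with `ListOfSeriesToDownload=`, followed by one UID per line. `subset_manifest` is a hypothetical helper, not part of any TCIA or IDC tool, and the header and UIDs in the example are placeholders.

```python
def subset_manifest(manifest_text, wanted_uids):
    """Copy the manifest header unchanged and keep only the wanted UIDs.

    Assumes every line after "ListOfSeriesToDownload=" is a SeriesInstanceUID.
    """
    wanted = set(wanted_uids)
    out = []
    in_list = False
    for line in manifest_text.splitlines():
        stripped = line.strip()
        if in_list:
            # Past the header: keep a UID only if it was requested.
            if stripped in wanted:
                out.append(stripped)
        else:
            # Header lines are copied as they are, including the list marker.
            out.append(line)
            if stripped.startswith("ListOfSeriesToDownload="):
                in_list = True
    return "\n".join(out) + "\n"


# Placeholder manifest, not real NLST identifiers.
original = (
    "exampleHeader=value\n"
    "ListOfSeriesToDownload=\n"
    "1.2.3.1\n"
    "1.2.3.2\n"
    "1.2.3.3\n"
)
custom = subset_manifest(original, ["1.2.3.2"])
print(custom)
```

Reusing the header of the downloaded manifest avoids guessing any of its values. The result can be saved to a new `.tcia` file and opened with the NBIA Data Retriever.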

Conclusion

These 12 CT series are present on the TCIA platform but have not been mirrored or indexed into the IDC/GCS infrastructure. As a result, they are only accessible through TCIA’s native tools, such as the NBIA Data Retriever, and cannot be accessed using standard IDC tooling.

@kosmasgal your attention to detail and the will to get to the answer is absolutely amazing! I am so happy that we have users like you who identify these kinds of issues, investigate them AND report your findings on this forum! Huge thank you for this - you are doing a service both to IDC and to other users of our data!

TL;DR: I think you are right that somehow those series were missed while mirroring content from TCIA. We will investigate why this happened!

Here’s what I did for the preliminary investigation. One of the reasons I go into the details is that I feel I may have finally found, in you @kosmasgal, the right user to appreciate the power of IDC metadata interrogation!

Confirmed the issue is not in idc-index

The metadata available via idc-index is a hand-picked subset of columns from the complete set of metadata available in IDC BigQuery datasets (you can learn how to get started with using IDC BigQuery tables in this notebook, which will require you to first complete the prerequisites covered in this notebook). In order to confirm that those series were not lost while generating idc-index indices, I checked the series identifiers you located against the BigQuery content, which is populated directly from the DICOM files we ingest.

The following query (which you can execute in Google Cloud BigQuery console after completing the prerequisites tutorial mentioned earlier) does the check, and does not return any rows.

WITH
  filter_uids AS (
    SELECT uid
    FROM
      UNNEST(
        [
          '1.2.840.113654.2.55.124920821357452508840472848733176720100',
          '1.2.840.113654.2.55.142299719953499919217431589345749071387',
          '1.2.840.113654.2.55.142382132672452692150734693915194789257',
          '1.2.840.113654.2.55.188627327646685672678367533541182805068',
          '1.2.840.113654.2.55.252348289640389137888240251318457974542',
          '1.2.840.113654.2.55.285617960144868132779721559793059662586',
          '1.2.840.113654.2.55.291684382394288579210996275839384021559',
          '1.2.840.113654.2.55.43665829455927473870872470762447960539',
          '1.3.6.1.4.1.14519.5.2.1.7009.9004.134883620837966968413091837630',
          '1.3.6.1.4.1.14519.5.2.1.7009.9004.217834446566745966670844385685',
          '1.3.6.1.4.1.14519.5.2.1.7009.9004.241503579592351904660372998455',
          '1.3.6.1.4.1.14519.5.2.1.7009.9004.319376119264764431009619572825'])
        AS uid
  )
SELECT SeriesInstanceUID
FROM `bigquery-public-data.idc_current.dicom_all` AS dicom_all
JOIN filter_uids
  ON dicom_all.SeriesInstanceUID = filter_uids.uid
WHERE collection_id = "nlst"

Conclusion: the issue is not in idc-index.

Confirmed the missing series were not present in any of the earlier releases of IDC

IDC data is versioned. As we update IDC content from version to version, most of the changes are due to adding new content. But in some cases we also modify or remove content available earlier due to issues that need to be remedied, changed patient eligibility, and other reasons.

IDC BigQuery idc_current dataset points to the latest/current version, which is v23 as of writing.

The following query (which relies on the GoogleSQL Procedural Language capabilities - advanced stuff!) looks for the missing series identifiers across all IDC versions through v23, and does not find any of the missing series either.

-- Step 1: Declare variables
DECLARE idc_versions ARRAY<INT64>;
DECLARE latest_idc_version INT64 DEFAULT 23;
DECLARE union_all_query STRING;

-- Step 2: Get all idc_versions
SET idc_versions = GENERATE_ARRAY(1, latest_idc_version);

-- Step 3: Collect all distinct SeriesInstanceUIDs from NLST across all versions
SET union_all_query = (
  SELECT
    STRING_AGG(
      FORMAT(
        """
      SELECT DISTINCT
        SeriesInstanceUID,
        CONCAT('v', CAST(%d AS STRING)) AS idc_version
      FROM
        `bigquery-public-data.idc_v%d.dicom_all`
      WHERE
        collection_id = 'nlst' AND Modality = 'CT'
        """,
        version,
        version),
      " UNION ALL ")
  FROM UNNEST(idc_versions) AS version
);

-- Step 4: Check if any of the series identified as missing are found across all versions
EXECUTE
  IMMEDIATE
    FORMAT(
      """
  WITH all_versions AS (
    %s
  ),
  filter_uids AS (
    SELECT uid
    FROM UNNEST([
      '1.2.840.113654.2.55.124920821357452508840472848733176720100',
      '1.2.840.113654.2.55.142299719953499919217431589345749071387',
      '1.2.840.113654.2.55.142382132672452692150734693915194789257',
      '1.2.840.113654.2.55.188627327646685672678367533541182805068',
      '1.2.840.113654.2.55.252348289640389137888240251318457974542',
      '1.2.840.113654.2.55.285617960144868132779721559793059662586',
      '1.2.840.113654.2.55.291684382394288579210996275839384021559',
      '1.2.840.113654.2.55.43665829455927473870872470762447960539',
      '1.3.6.1.4.1.14519.5.2.1.7009.9004.134883620837966968413091837630',
      '1.3.6.1.4.1.14519.5.2.1.7009.9004.217834446566745966670844385685',
      '1.3.6.1.4.1.14519.5.2.1.7009.9004.241503579592351904660372998455',
      '1.3.6.1.4.1.14519.5.2.1.7009.9004.319376119264764431009619572825'
    ]) AS uid
  )
  SELECT DISTINCT
    a.SeriesInstanceUID,
    a.idc_version
  FROM
    all_versions AS a
    INNER JOIN filter_uids AS f
      ON a.SeriesInstanceUID = f.uid
""",
      union_all_query);

Conclusion: it appears that we never had those series in our possession.
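
The string-building part of the query above (Step 3) can also be done on the client side before submitting the SQL. A minimal Python sketch, assuming the same `idc_v<N>` dataset naming used in the query; `build_all_versions_query` is an illustrative helper, not part of idc-index or any IDC tool.

```python
def build_all_versions_query(latest_version, collection_id="nlst", modality="CT"):
    """Build one SELECT per IDC version and join them with UNION ALL.

    The values are interpolated directly into the SQL text, so this is
    only suitable for trusted constants such as the defaults used here.
    """
    parts = []
    for version in range(1, latest_version + 1):
        parts.append(
            f"SELECT DISTINCT SeriesInstanceUID, 'v{version}' AS idc_version "
            f"FROM `bigquery-public-data.idc_v{version}.dicom_all` "
            f"WHERE collection_id = '{collection_id}' AND Modality = '{modality}'"
        )
    return " UNION ALL ".join(parts)


# Three versions are enough to see the shape of the generated SQL.
query = build_all_versions_query(3)
print(query)
```

The resulting string plays the role of `union_all_query` in the procedural version and could be placed inside the `all_versions` CTE.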

Next steps

I will seek the help of @bill.clifford, who is the IDC lead for ETL and ingestion of data from TCIA. Some of the possibilities (other than bugs in our ingestion pipelines) could be:

  • TCIA API that we use to query the content did not return those specific series (we do not rely on the manifests available on the collection pages to synchronize content, and I do not know if those manifests and TCIA API results are consistent)
  • the files were downloaded, but not ingested (they might have failed ingestion into IDC DICOM stores if there are severe DICOM conformance issues - although I would think we should have detected those errors on import)

We will investigate and follow up on this thread.

Again, huge thank you for raising and documenting these issues!

Thank you very much for all your kind words Andrey. I am happy to help 🙂 Your team’s Python library has made my analysis much easier and more reproducible.

I just heard back from the CDAS team and they have confirmed the patient numbers I identified in my query, so I think the original topic has been fully resolved. Here is the quote from the ticket:

Hello Kosmas,
Of the 26,722 participants, images were only received for 26,254. Similarly, although we have 1,089 cancers in the CT arm, we only have images for 1,061 of them. Does this answer your question?

Regards,
Doug

Thank you for taking the time to investigate the 12 missing series in depth. I am interested to know the outcome when you have it. If I encounter anything else interesting I’ll let you know.

@kosmasgal we looked into this, and confirmed that we are indeed missing the series you identified. We don’t know exactly what went wrong and why they were not ingested originally. We plan to make them available in the next release of IDC data. Thank you for flagging this issue!

That’s awesome. Happy to help.