Availability of lung cancer annotations

I received this message on LinkedIn with regard to the NLSTseg dataset lesion/tumor annotations. I thought it would be good to post a response here.

Our group is developing machine learning models for lung cancer prediction, and we’re currently looking for suitable datasets for external validation. We applied for access to NLST, but it seems the standard NLST imaging release doesn’t include tumor/lesion annotations.

Thank you for your interest!

The original NLST data provides the axial slice number/lung lobe that the lesion is located in, but not a mask, as you correctly pointed out. However, if this data is still useful, IDC also provides clinical metadata tables where you can quickly search for patients (see here for a Google Colab notebook).

To address your interest in actual segmentation masks, over the past few years, we have enhanced the original NLST dataset in IDC with both organ-level masks and nodule/tumor masks. We either used publicly available AI models and ran inference, or we took annotations that already existed and harmonized them to the DICOM format. Here is a list of annotations we have available for free to view and download:

  1. We ran TotalSegmentator CT on all of NLST and obtained 70+ organ segmentations for each patient. Here is an example of a patient and the segmentations produced. You can read this post for more information about these annotations and how to access them.

  2. We harmonized the NLSTseg nodule/tumor masks into the DICOM format and made these publicly available. You can read this post for more information.

  3. We also converted the Sybil tumor bounding boxes to DICOM and ingested them into IDC. This post has more information about these annotations.

Apart from NLST, Imaging Data Commons has other lung cancer imaging datasets that may be suitable. You can search the IDC Portal here: https://portal.imaging.datacommons.cancer.gov/explore/

  1. NSCLC-Radiomics contains patients with both organ and tumor segmentations.
  2. Anti-PD-1_Lung also contains tumor segmentations. See the TCIA page for more info.

Hope this helps!

Note that all of the NLST CT and histopathology images are available from IDC without having to undergo any application process!

This is also indicated on the CDAS NLST page here: NLST - The Cancer Data Access System.

One does need to go through the application/approval process to access full clinical data, but not to access the images.

To download the entire NLST collection, all you need to do is install the prerequisite idc-index Python package:

pip install --upgrade idc-index

and launch this command (assuming you have sufficient disk space):

idc download nlst
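
If you are unsure whether you have enough disk space, here is a minimal sketch for checking before you start. It assumes the idc-index `index` table has a per-series size column named `series_size_MB`, which you should confirm against your installed version. The helper names and the 10% safety margin are my own choices for illustration, not part of idc-index.

```python
import shutil

# Total size of the NLST collection according to the idc-index table.
# ASSUMPTION: the `index` table has a `series_size_MB` column.
SIZE_QUERY = """
SELECT SUM(series_size_MB) AS total_mb
FROM index
WHERE collection_id = 'nlst'
"""


def free_space_mb(path="."):
    """Free space, in MB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / (1024 ** 2)


def fits_on_disk(required_mb, path=".", margin=1.1):
    """True if `required_mb` plus a safety margin fits on the disk at `path`."""
    return free_space_mb(path) >= required_mb * margin


def nlst_total_mb():
    """Ask idc-index for the total NLST size (needs `pip install idc-index`)."""
    from idc_index import IDCClient  # imported lazily so the helpers above work without it

    client = IDCClient()
    return float(client.sql_query(SIZE_QUERY)["total_mb"].iloc[0])


# Example usage (uncomment once idc-index is installed):
# total = nlst_total_mb()
# print(f"NLST needs about {total:.0f} MB; enough space here: {fits_on_disk(total)}")
```

The reported total covers everything that `idc download nlst` would fetch, so treat it as a rough upper bound if you only want the images.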

Note that right now, if you download the NLST collection from IDC using the command-line idc download tool, you will download all of the analysis results collections (such as the segmentations that @deepa mentioned earlier) along with the images. An option to choose what gets downloaded is on my list, and I just created an issue for it: Add the options to download individual analysis results · Issue #217 · ImagingDataCommons/idc-index · GitHub.

If you want to download just the CT images from the NLST collection, you would need to write a small Python script:

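from idc_index import IDCClient

client = IDCClient()

# Select only the CT series that belong to the NLST collection
query = """
SELECT SeriesInstanceUID
FROM index
WHERE Modality = 'CT' AND collection_id = 'nlst'
"""
# sql_query returns a pandas DataFrame; convert the UID column to a plain list
result = client.sql_query(query)['SeriesInstanceUID'].values.tolist()

# Download the selected series into the current directory
client.download_from_selection(seriesInstanceUID=result, downloadDir=".")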
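
The same pattern should also work if you want the series from a single analysis results collection. Below is a rough sketch under one assumption: that the `index` table exposes an `analysis_result_id` column. Please check this against your idc-index version. `build_series_query` and `download_series` are hypothetical helper names, not idc-index functions.

```python
def _quote(value):
    """Wrap a value in single quotes for SQL, rejecting embedded quotes."""
    if "'" in value:
        raise ValueError(f"unexpected quote in value: {value!r}")
    return f"'{value}'"


def build_series_query(collection_id, modality=None, analysis_result_id=None):
    """Build a SeriesInstanceUID query against the idc-index `index` table.

    ASSUMPTION: `analysis_result_id` is a column of the index table.
    """
    clauses = [f"collection_id = {_quote(collection_id)}"]
    if modality is not None:
        clauses.append(f"Modality = {_quote(modality)}")
    if analysis_result_id is not None:
        clauses.append(f"analysis_result_id = {_quote(analysis_result_id)}")
    return "SELECT SeriesInstanceUID FROM index WHERE " + " AND ".join(clauses)


def download_series(query, download_dir="."):
    """Run the query with idc-index and download the matching series."""
    from idc_index import IDCClient  # imported lazily; needs `pip install idc-index`

    client = IDCClient()
    uids = client.sql_query(query)["SeriesInstanceUID"].values.tolist()
    client.download_from_selection(seriesInstanceUID=uids, downloadDir=download_dir)


# Example usage (uncomment to run):
# download_series(build_series_query("nlst", modality="CT"))
# download_series(build_series_query(
#     "nlst", analysis_result_id="TotalSegmentator-CT-Segmentations"))
```

Keeping the query builder separate from the download step lets you print and inspect the SQL before committing to a large download.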

If you have any questions or suggestions, please let us know!

Thanks Deepa and Fedorov,

Thank you very much for the detailed explanation and for sharing all these resources. I’ve spent some time browsing the IDC posts and collections you mentioned, and they are extremely helpful.

This clarifies our confusion around NLST data quite a bit. From what I understand now, if our goal is to get “both lung organ masks and tumor/nodule masks”, using the AIMI Annotations would be more appropriate than NLSTseg, since NLSTseg itself does not appear to include lung organ segmentations.

Please correct me if I misunderstood,

Best,
Yutong

If that is your goal, I recommend you look at all of the sources of lung and tumor segmentations that you can find!

  • The TotalSegmentator-CT-Segmentations collection Deepa mentioned earlier contains segmentations of the anatomy for most of NLST, including lung lobe segmentations. Those were not manually verified, but from what we’ve seen, lung segmentations by TotalSegmentator are very good.
  • AIMI Annotations has segmentations for a subset of that collection, and some of them were expert-verified. But it covers only a subset, and the segmentations are not as detailed. The same applies to tumor segmentations: they are available only for a subset, and only some of them were verified manually.
  • Sybil and NLSTSeg, on the other hand, contain annotations created manually by experts (bounding boxes in Sybil, masks in NLSTSeg), for all of the cancer-positive NLST subjects.

So if I were you, I would combine those - there’s nothing wrong with taking lung segmentations from one collection and tumor segmentations from another. You can configure the portal to show you all cases that have tumor segmentations from the AIMI Annotations collection, and check whether the expert annotations from Sybil/NLSTSeg agree with the AI segmentations in AIMI Annotations. Here’s the filter configuration: https://portal.imaging.datacommons.cancer.gov/explore/filters/?analysis_results_id=BAMF-AIMI-Annotations&SegmentedPropertyCategoryCodeSequence=SCT:49755003&collection_id=NCI_Trials&collection_id=nlst. And at least in this case they do not agree: https://viewer.imaging.datacommons.cancer.gov/v3/viewer/?StudyInstanceUIDs=1.2.840.113654.2.55.238034941445508011386463276954045956831.
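
If you end up building many of these links, here is a small sketch that assembles them programmatically. It only mirrors the structure of the two URLs above; the parameter names are copied from those links, and the function names are mine. I am not claiming this is an officially documented URL scheme.

```python
from urllib.parse import urlencode

PORTAL_FILTERS = "https://portal.imaging.datacommons.cancer.gov/explore/filters/"
VIEWER = "https://viewer.imaging.datacommons.cancer.gov/v3/viewer/"


def portal_filter_url(filters):
    """Build a portal filter link from {parameter: [values, ...]}.

    A parameter with several values is repeated in the query string,
    as `collection_id` is in the link above.
    """
    pairs = [(name, value) for name, values in filters.items() for value in values]
    # Keep ':' unescaped so coded values such as SCT:49755003 stay readable
    return PORTAL_FILTERS + "?" + urlencode(pairs, safe=":")


def viewer_url(study_instance_uid):
    """Build a viewer link for a single study."""
    return VIEWER + "?" + urlencode({"StudyInstanceUIDs": study_instance_uid})


url = portal_filter_url({
    "analysis_results_id": ["BAMF-AIMI-Annotations"],
    "SegmentedPropertyCategoryCodeSequence": ["SCT:49755003"],
    "collection_id": ["NCI_Trials", "nlst"],
})
print(url)
```

With these inputs the printed link matches the filter configuration above, so you can swap in other analysis results or collection identifiers to explore different combinations.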

Make sure you follow the pointers and read about the collections to understand how those various annotations were created before you use them. You can find the reference in the collection tooltip on the left-hand side of the explore page, or on the collections page. Let us know if you need help!