I received this message on LinkedIn with regard to the NLSTseg dataset lesion/tumor annotations. I thought it would be good to post a response here.
Our group is developing machine learning models for lung cancer prediction, and we’re currently looking for suitable datasets for external validation. We applied for access to NLST, but it seems the standard NLST imaging release doesn’t include tumor/lesion annotations.
Thank you for your interest!
The original NLST data provides the axial slice number/lung lobe that the lesion is located in, but not a mask, as you correctly pointed out. However, if this data is still useful, IDC also provides clinical metadata tables where you can quickly search for patients (see here for a Google Colab notebook).
To address your interest in actual segmentation masks, over the past few years, we have enhanced the original NLST dataset in IDC with both organ-level masks and nodule/tumor masks. We either used publicly available AI models and ran inference, or we took annotations that already existed and harmonized them to the DICOM format. Here is a list of annotations we have available for free to view and download:
We ran TotalSegmentator CT on all of NLST and obtained 70+ organ segmentations for each patient. Here is an example of a patient and the segmentations produced. You can read this post for more information about these annotations and how to access them.
We harmonized the NLSTseg nodule/tumor masks into the DICOM format and made these publicly available. You can read this post for more information.
We also converted the Sybil tumor bounding boxes to DICOM and ingested them into IDC. This post has more information about these annotations.
If you want to download just the CT images from the NLST collection, you would need to write a small Python script:
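from idc_index import IDCClient

# Create the idc-index client.
client = IDCClient()

# Select the UIDs of every CT series in the NLST collection.
query = """
SELECT SeriesInstanceUID
FROM index
WHERE Modality = 'CT' AND collection_id = 'nlst'
"""
series_uids = client.sql_query(query)['SeriesInstanceUID'].values.tolist()

# Download the selected series into the current directory.
client.download_from_selection(seriesInstanceUID=series_uids, downloadDir=".")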
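Since the NLST CT collection is large, it may be worth previewing a query on a handful of series before starting a full download. The sketch below builds the same kind of SQL string with an optional `LIMIT`. The helper name `build_series_query` is hypothetical, and the `series_size_MB` column is an assumption about the idc-index `index` table that you should verify against your installed version.

```python
def build_series_query(collection_id, modality, limit=None):
    """Build a SQL string for the idc-index `index` table.

    Hypothetical helper, not part of idc-index. The `series_size_MB`
    column is assumed to exist; check your idc-index version.
    """
    # Only allow simple identifiers so the values can be inlined safely.
    for value in (collection_id, modality):
        if not value.replace("_", "").replace("-", "").isalnum():
            raise ValueError(f"unexpected characters in {value!r}")

    query = (
        "SELECT SeriesInstanceUID, series_size_MB\n"
        "FROM index\n"
        f"WHERE Modality = '{modality}' AND collection_id = '{collection_id}'"
    )
    if limit is not None:
        # Restrict the result to a small preview.
        query += f"\nLIMIT {int(limit)}"
    return query


# Possible usage with the client from the script above:
# preview = client.sql_query(build_series_query("nlst", "CT", limit=5))
# print(preview)
```

Running the preview first gives a rough idea of per-series size before committing disk space to the whole collection.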
If you have any questions or suggestions, please let us know!
Thank you very much for the detailed explanation and for sharing all these resources. I’ve spent some time browsing the IDC posts and collections you mentioned, and they are extremely helpful.
This clarifies our confusion around NLST data quite a bit. From what I understand now, if our goal is to get “both lung organ masks and tumor/nodule masks”, using the AIMI Annotations would be more appropriate than NLSTseg, since NLSTseg itself does not appear to include lung organ segmentations.
If that is your goal, I recommend you look at all of the sources of lung and tumor segmentations that you can find!
The TotalSegmentator-CT-Segmentations collection Deepa mentioned earlier contains segmentations of the anatomy for most of NLST, including lung lobe segmentations. Those were not manually verified, but from what we’ve seen, lung segmentations by TotalSegmentator are very good.
AIMI Annotations has segmentations for a subset of that collection, and some of them were expert-verified. But it covers only a subset, and the segmentations are not as detailed. The same applies to its tumor segmentations: they are available only for a subset, and only some of them were verified manually.
Sybil and NLSTseg, on the other hand, contain annotations created manually by experts (bounding boxes in Sybil, masks in NLSTseg), for all of the cancer-positive NLST subjects.
Make sure you follow the pointers and read about the collections to understand how those various annotations were created before you use them. You can find the reference in the collection tooltip on the left-hand side of the explore page, or on the collections page. Let us know if you need help!
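If the goal is to find subjects that have both lung masks and tumor/nodule masks, one way to reason about it is to intersect the patient lists of the annotation sources. Here is a minimal sketch in plain Python. The function name and the patient IDs are placeholders for illustration; in practice the sets would come from your own queries against each annotation collection.

```python
def patients_covered_by_all(sources, required):
    """Return sorted patient IDs present in every required annotation source.

    `sources` maps a source name to a set of patient IDs.
    Hypothetical helper for illustration only.
    """
    missing = [name for name in required if name not in sources]
    if missing:
        raise KeyError(f"unknown annotation source(s): {missing}")
    if not required:
        return []

    # Start from the first source and keep only IDs found in all others.
    common = set(sources[required[0]])
    for name in required[1:]:
        common &= set(sources[name])
    return sorted(common)


# Placeholder patient IDs; these are NOT real NLST identifiers.
example_sources = {
    "TotalSegmentator-CT-Segmentations": {"P001", "P002", "P003", "P004"},
    "NLSTseg": {"P002", "P004", "P005"},
}

both = patients_covered_by_all(
    example_sources, ["TotalSegmentator-CT-Segmentations", "NLSTseg"]
)
print(both)
```

The same function works with more than two sources, for example if you also want to require Sybil bounding boxes for each subject.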