Issues with the "QIN multi-site collection of Lung CT data with Nodule Segmentations" analysis collection

fedorov · October 15, 2020, 4:25pm

Although we claim to include the collection "QIN multi-site collection of Lung CT data with Nodule Segmentations" (see its entry under TCIA DOIs on the Cancer Imaging Archive Wiki), we actually don’t.

SELECT
  SOPInstanceUID,
  Modality,
  collection_id
FROM
  `canceridc-data.idc_views.dicom_all`
WHERE
  Source_DOI = "10.7937/K9/TCIA.2015.1BUVFJR7"

I think what happened is that some instances from that collection got pulled in because of a TCIA API limitation: it was, or maybe still is, incorrectly assigning the segmentations from that collection that were done for LIDC to the LIDC collection. The analysis collection itself contains more than just those LIDC segmentations.

The right thing to do would be to completely exclude that analysis results collection until we include all of the original collections it corresponds to, and all of its instances rather than a subset. But that would probably be too much effort between now and the broad release.

@bill.clifford what do you think? Should we just document this limitation in the documentation?

fedorov · October 15, 2020, 4:32pm

A similar concern applies to this collection: https://doi.org/10.7937/TCIA.2019.wgllssg1. It includes analysis results across 4 original collections, but we only include the ones for I-SPY. For now, I am going to document these limitations in the Data release notes of the IDC User Guide.

bill.clifford · October 15, 2020, 6:35pm

I’m frankly not sure I understand the problem. Is it that the series that are attributed to this analysis collection are actually original LIDC data?

fedorov · October 15, 2020, 6:48pm

The collection in question contains the following, according to the wiki page:

Note: analysis results for 4 collections, with 378 segmentations.

What is available in IDC as part of that collection is 90 instances, all corresponding to data from the LIDC collection.
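
For anyone who wants to check that breakdown, a query along these lines should work. It is only a sketch: it assumes the same table and columns as the query at the top of this thread, and the `num_instances` alias is just an illustrative name.

```sql
-- Count the instances attributed to the analysis results DOI,
-- broken down by the original collection and modality they are filed under.
SELECT
  collection_id,
  Modality,
  COUNT(DISTINCT SOPInstanceUID) AS num_instances
FROM
  `canceridc-data.idc_views.dicom_all`
WHERE
  Source_DOI = "10.7937/K9/TCIA.2015.1BUVFJR7"
GROUP BY
  collection_id,
  Modality
ORDER BY
  num_instances DESC
```

If only a single `collection_id` comes back, that would be consistent with the subset described here; the total across rows can be compared against the number of segmentations listed on the wiki page.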

So the issue is that the IDC collections page is not correct in stating that IDC includes this collection, since it is not included in its entirety.

Does this explanation make sense?

bill.clifford · October 15, 2020, 7:06pm

I think that it is fine to document that we only include 3rd party data for those collections that we are making public.
Moreover, removing such data would require Suzanne to re-spin the SOLR table. I think it is more important that she work on the manifest download solution.

fedorov · October 15, 2020, 7:37pm

I agree it is more important that we have the manifest download solution!
