Issues with the "QIN multi-site collection of Lung CT data with Nodule Segmentations" analysis collection

Although we claim to include the "QIN multi-site collection of Lung CT data with Nodule Segmentations" collection (see its TCIA DOI page on the Cancer Imaging Archive Wiki), we actually don’t. This query lists what IDC has under that DOI:

SELECT
  SOPInstanceUID,
  Modality,
  collection_id
FROM
  `canceridc-data.idc_views.dicom_all`
WHERE
  Source_DOI = "10.7937/K9/TCIA.2015.1BUVFJR7"

I think what happened is that some instances from that collection were pulled in because of a TCIA API limitation: it was, or maybe still is, incorrectly assigning the segmentations from that collection that were done for LIDC to the LIDC collection itself. But the analysis collection contains more than just those LIDC segmentations.

The right thing to do would be to exclude that analysis results collection completely until we include all of the original collections it corresponds to, and until it includes all of its instances rather than a subset. But that would probably be too much effort between now and the broad release.

@bill.clifford what do you think? Should we just document this limitation in the documentation?

A similar concern applies to this collection: https://doi.org/10.7937/TCIA.2019.wgllssg1, which includes analysis results across 4 original collections, while we only include the ones for I-SPY. For now, I am going to document these limitations on the "Data release notes" page of the IDC User Guide.

I’m frankly not sure I understand the problem. Is it that the series that are attributed to this analysis collection are actually original LIDC data?

The collection in question contains the following, according to the wiki page:

Note: analysis results for 4 collections, with 378 segmentations.

What is available in IDC as part of that collection is 90 instances, all corresponding to data from the LIDC collection.
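
One way to check this, as a sketch that reuses the table and columns from the query at the top of the thread: group the instances under that DOI by collection and modality. If only LIDC data is present, a single collection_id should come back. The `instance_count` alias and the use of COUNT(DISTINCT ...) are my own choices, and I am not quoting any output here.

```sql
-- Count instances attributed to the analysis-results DOI,
-- broken down by originating collection and modality.
SELECT
  collection_id,
  Modality,
  COUNT(DISTINCT SOPInstanceUID) AS instance_count  -- alias is illustrative
FROM
  `canceridc-data.idc_views.dicom_all`
WHERE
  Source_DOI = "10.7937/K9/TCIA.2015.1BUVFJR7"
GROUP BY
  collection_id,
  Modality
ORDER BY
  instance_count DESC
```

Summing `instance_count` over the rows gives the total to compare against the number of segmentations listed on the wiki page.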

So the issue is that the IDC collections page incorrectly states that IDC includes this collection, when it is not included in its entirety.

Does this explanation make sense?

I think that it is fine to document that we only include 3rd party data for those collections that we are making public.
Moreover, removing such data would require Suzanne to re-spin the SOLR table. I think it is more important that she work on the manifest download solution.

I agree it is more important that we have the manifest download solution!