@kosmasgal your attention to detail and your will to get to the answer are absolutely amazing! I am so happy that we have users like you who identify these kinds of issues, investigate them AND report their findings on this forum! Huge thank you for this - you are doing a service both to IDC and to other users of our data!
TL;DR: I think you are right that somehow those series were missed while mirroring content from TCIA. We will investigate why this happened!
Here is what I did for the preliminary investigation. One reason I go into this much detail is that I feel I may have finally found, in you @kosmasgal, the right user to appreciate the power of IDC metadata interrogation!
Confirmed the issue is not in idc-index
The metadata available via idc-index is a hand-picked subset of columns from the complete set of metadata available in IDC BigQuery datasets (you can learn how to get started with using IDC BigQuery tables in this notebook, which will require you to first complete the prerequisites covered in this notebook). In order to confirm that those series were not lost while generating idc-index indices, I checked the series identifiers you located against the BigQuery content, which is populated directly from the DICOM files we ingest.
The following query (which you can execute in Google Cloud BigQuery console after completing the prerequisites tutorial mentioned earlier) does the check, and does not return any rows.
WITH
filter_uids AS (
SELECT uid
FROM
UNNEST(
[
'1.2.840.113654.2.55.124920821357452508840472848733176720100',
'1.2.840.113654.2.55.142299719953499919217431589345749071387',
'1.2.840.113654.2.55.142382132672452692150734693915194789257',
'1.2.840.113654.2.55.188627327646685672678367533541182805068',
'1.2.840.113654.2.55.252348289640389137888240251318457974542',
'1.2.840.113654.2.55.285617960144868132779721559793059662586',
'1.2.840.113654.2.55.291684382394288579210996275839384021559',
'1.2.840.113654.2.55.43665829455927473870872470762447960539',
'1.3.6.1.4.1.14519.5.2.1.7009.9004.134883620837966968413091837630',
'1.3.6.1.4.1.14519.5.2.1.7009.9004.217834446566745966670844385685',
'1.3.6.1.4.1.14519.5.2.1.7009.9004.241503579592351904660372998455',
'1.3.6.1.4.1.14519.5.2.1.7009.9004.319376119264764431009619572825'])
AS uid
)
SELECT SeriesInstanceUID
FROM `bigquery-public-data.idc_current.dicom_all` AS dicom_all
JOIN filter_uids
ON dicom_all.SeriesInstanceUID = filter_uids.uid
WHERE collection_id = "nlst"
Conclusion: the issue is not in idc-index.
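For anyone who wants to repeat this kind of lookup locally before going to BigQuery, here is a minimal Python sketch of the same membership check. It assumes you already have some iterable of series UIDs to test against (for example, a SeriesInstanceUID column exported from whichever table you are inspecting). The helper name `split_by_presence` is hypothetical, and only two of the twelve UIDs are listed to keep it short.

```python
# Two of the twelve series UIDs from the query above; add the rest as needed.
SUSPECT_UIDS = [
    "1.2.840.113654.2.55.124920821357452508840472848733176720100",
    "1.2.840.113654.2.55.142299719953499919217431589345749071387",
]


def split_by_presence(suspect_uids, known_uids):
    """Split suspect_uids into (found, missing) relative to known_uids.

    Hypothetical helper: the order of suspect_uids is preserved in both lists.
    """
    known = set(known_uids)  # set lookup keeps the check fast for large tables
    found = [uid for uid in suspect_uids if uid in known]
    missing = [uid for uid in suspect_uids if uid not in known]
    return found, missing


if __name__ == "__main__":
    # Toy stand-in for the real list of series UIDs (an assumption for the demo).
    known_series = ["1.2.3.4", SUSPECT_UIDS[0]]
    found, missing = split_by_presence(SUSPECT_UIDS, known_series)
    print("found:", found)
    print("missing:", missing)
```

An empty `found` list corresponds to the SQL query above returning no rows.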
Confirmed the missing series were not present in any of the earlier releases of IDC
IDC data is versioned. As we update IDC content from version to version, most of the changes are due to adding new content. But in some cases we also modify or remove content available earlier due to issues that need to be remedied, changed patient eligibility, and other reasons.
The IDC BigQuery idc_current dataset points to the latest/current version, which is v23 as of this writing.
The following query (which relies on the GoogleSQL procedural language capabilities - advanced stuff!) looks for the missing series identifiers across all IDC versions through v23, and does not find any of them either.
-- Step 1: Declare variables
DECLARE idc_versions ARRAY<INT64>;
DECLARE latest_idc_version INT64 DEFAULT 23;
DECLARE union_all_query STRING;
-- Step 2: Get all idc_versions
SET idc_versions = GENERATE_ARRAY(1, latest_idc_version);
-- Step 3: Collect all distinct SeriesInstanceUIDs from NLST across all versions
SET union_all_query = (
SELECT
STRING_AGG(
FORMAT(
"""
SELECT DISTINCT
SeriesInstanceUID,
CONCAT('v', CAST(%d AS STRING)) AS idc_version
FROM
`bigquery-public-data.idc_v%d.dicom_all`
WHERE
collection_id = 'nlst' AND Modality = 'CT'
""",
version,
version),
" UNION ALL ")
FROM UNNEST(idc_versions) AS version
);
-- Step 4: Check whether any of the series identified as missing are found in any version
EXECUTE
IMMEDIATE
FORMAT(
"""
WITH all_versions AS (
%s
),
filter_uids AS (
SELECT uid
FROM UNNEST([
'1.2.840.113654.2.55.124920821357452508840472848733176720100',
'1.2.840.113654.2.55.142299719953499919217431589345749071387',
'1.2.840.113654.2.55.142382132672452692150734693915194789257',
'1.2.840.113654.2.55.188627327646685672678367533541182805068',
'1.2.840.113654.2.55.252348289640389137888240251318457974542',
'1.2.840.113654.2.55.285617960144868132779721559793059662586',
'1.2.840.113654.2.55.291684382394288579210996275839384021559',
'1.2.840.113654.2.55.43665829455927473870872470762447960539',
'1.3.6.1.4.1.14519.5.2.1.7009.9004.134883620837966968413091837630',
'1.3.6.1.4.1.14519.5.2.1.7009.9004.217834446566745966670844385685',
'1.3.6.1.4.1.14519.5.2.1.7009.9004.241503579592351904660372998455',
'1.3.6.1.4.1.14519.5.2.1.7009.9004.319376119264764431009619572825'
]) AS uid
)
SELECT DISTINCT
a.SeriesInstanceUID,
a.idc_version
FROM
all_versions AS a
INNER JOIN filter_uids AS f
ON a.SeriesInstanceUID = f.uid
""",
union_all_query);
Conclusion: it appears that we never had those series in our possession.
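If the dynamic SQL in Step 3 looks opaque, the following Python sketch shows what it expands to. It only builds and inspects the query text and does not connect to BigQuery. The names `QUERY_TEMPLATE` and `build_union_all_query` are hypothetical; the table naming pattern and the filter are copied from the query above.

```python
# Per-version SELECT, mirroring the FORMAT() template used in Step 3.
QUERY_TEMPLATE = """
SELECT DISTINCT
  SeriesInstanceUID,
  'v{version}' AS idc_version
FROM
  `bigquery-public-data.idc_v{version}.dicom_all`
WHERE
  collection_id = 'nlst' AND Modality = 'CT'
"""


def build_union_all_query(latest_version):
    """Join one SELECT per IDC version (1..latest_version) with UNION ALL."""
    # range() excludes its upper bound, while GENERATE_ARRAY includes it.
    versions = range(1, latest_version + 1)
    parts = [QUERY_TEMPLATE.format(version=v) for v in versions]
    return " UNION ALL ".join(parts)


if __name__ == "__main__":
    query = build_union_all_query(23)
    # Print only the beginning; the full text repeats the block 23 times.
    print(query[:400])
```

Building the text this way is roughly what `STRING_AGG(FORMAT(...), " UNION ALL ")` does inside BigQuery, with the advantage that the SQL version runs entirely in the console without any client code.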
Next steps
I will seek the help of @bill.clifford, who is the IDC lead for ETL and ingestion of data from TCIA. Apart from bugs in our ingestion pipelines, some of the possibilities are:
- TCIA API that we use to query the content did not return those specific series (we do not rely on the manifests available on the collection pages to synchronize content, and I do not know if those manifests and TCIA API results are consistent)
- the files were downloaded, but not ingested (they might have failed ingestion into IDC DICOM stores if there are severe DICOM conformance issues - although I would think we should have detected those errors on import)
We will investigate and follow up on this thread.
Again, huge thank you for raising and documenting these issues!