SOPInstances listed multiple times in BigQuery table

I found that some files appear multiple times in the IDC v10 dicom_pivot table; for example, a single crdc_instance_uuid can have 4 rows. Everything about these rows is the same except for some of the measurement fields, such as SUVbw, Volume, and Total_Lesion_Glycolysis. All of the SOP identifier fields are the same, as are the crdc uuid fields. Is there a reason that these different measurements are not distinguished by some kind of annotation identifier or SOP uuid?
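A minimal sketch of how this kind of duplication can be checked, assuming a small in-memory SQLite stand-in for the table. The column names come from the post; the rows, UUIDs, and values are made up for illustration, and the real table lives in BigQuery, where a similar `GROUP BY ... HAVING` query should apply.

```python
import sqlite3

# In-memory stand-in for the dicom_pivot table (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dicom_pivot ("
    "crdc_instance_uuid TEXT, SOPInstanceUID TEXT, SUVbw REAL, Volume REAL)"
)
rows = [
    # Same instance, four different measurement rows.
    ("uuid-a", "1.2.3.1", 2.5, 10.0),
    ("uuid-a", "1.2.3.1", 3.1, 12.5),
    ("uuid-a", "1.2.3.1", 4.0, 8.0),
    ("uuid-a", "1.2.3.1", 1.9, 20.0),
    # An instance with no measurements appears once.
    ("uuid-b", "1.2.3.2", None, None),
]
conn.executemany("INSERT INTO dicom_pivot VALUES (?, ?, ?, ?)", rows)

# List every instance UUID that occupies more than one row.
duplicates = conn.execute(
    "SELECT crdc_instance_uuid, COUNT(*) AS n_rows "
    "FROM dicom_pivot "
    "GROUP BY crdc_instance_uuid "
    "HAVING COUNT(*) > 1 "
    "ORDER BY n_rows DESC"
).fetchall()
print(duplicates)
```

With this sample data, only `uuid-a` is reported, with a row count of 4.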


Donovan is part of the Cancer Data Aggregator (CDA) and is indexing the IDC data.

Thank you for the report! I am sure @bill.clifford will comment on this soon.

We are investigating this.

@Donovan_Ruth can you let us know what exactly you are trying to accomplish, and why you are using the dicom_pivot table rather than some other table for your purposes?

I think the issue is on our side, in that we did not provide clear guidance in the documentation. I suspect dicom_pivot is not the table you should be using, but we need your help to understand what is going on.

We (CDA) are trying to gather metadata for all files available in the IDC. For now, we are trying to collect as much metadata as we can about the files themselves, including annotations, along with any information on the patient and/or specimen that was the source of the image.

Would it be useful for us to set up a time to chat about what data we’re trying to get and what Bill finds out about the possible duplicates?

This was discussed internally, and I think there was a misunderstanding about what the dicom_pivot table is expected to contain. That table was put together to support operation of the portal and was released for the sake of transparency.

But as you can see from the "Files and metadata" page of the IDC User Guide, that table is not even documented.

Multiple rows with the same SOPInstanceUID in that table are expected. They are propagated from dicom_derived_all (another table that is not documented), which has one row per measurement, and there may be multiple segments/measurements per instance. My description of dicom_derived_all and dicom_pivot is very superficial, since I have not used those tables and they are not referenced from any of the learning materials we maintain. @spaquett will be able to provide more details.
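For anyone who needs one record per file, the one-row-per-measurement layout can be collapsed on the client side. The following is only a sketch under assumptions: the field names are the ones mentioned earlier in this thread, while `collapse_by_instance`, the `measurements` key, and the sample rows are hypothetical and not part of any IDC tooling.

```python
# Measurement columns named in this thread; the real table may have more.
MEASUREMENT_FIELDS = ("SUVbw", "Volume", "Total_Lesion_Glycolysis")


def collapse_by_instance(rows):
    """Group one-row-per-measurement rows into one record per instance."""
    instances = {}
    for row in rows:
        key = row["crdc_instance_uuid"]
        record = instances.setdefault(
            key,
            {
                "crdc_instance_uuid": key,
                "SOPInstanceUID": row["SOPInstanceUID"],
                "measurements": [],
            },
        )
        # Keep only the measurement fields that are actually populated.
        measurement = {
            field: row[field]
            for field in MEASUREMENT_FIELDS
            if row.get(field) is not None
        }
        if measurement:
            record["measurements"].append(measurement)
    return list(instances.values())


# Hypothetical rows: one instance with two measurements, one with none.
sample_rows = [
    {"crdc_instance_uuid": "uuid-a", "SOPInstanceUID": "1.2.3.1",
     "SUVbw": 2.5, "Volume": 10.0, "Total_Lesion_Glycolysis": 25.0},
    {"crdc_instance_uuid": "uuid-a", "SOPInstanceUID": "1.2.3.1",
     "SUVbw": 3.1, "Volume": 12.5, "Total_Lesion_Glycolysis": 38.75},
    {"crdc_instance_uuid": "uuid-b", "SOPInstanceUID": "1.2.3.2",
     "SUVbw": None, "Volume": None, "Total_Lesion_Glycolysis": None},
]
collapsed = collapse_by_instance(sample_rows)
for record in collapsed:
    print(record["crdc_instance_uuid"], len(record["measurements"]))
```

Keeping the measurements as a nested list preserves every value while restoring a single entry per crdc_instance_uuid, which is closer to what a file-level index needs.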

Yes, I think it does make sense to meet and summarize the result of the discussion here.