Best DICOM tag for provenance info?

We are anonymizing imaging datasets from multiple institutions for the ROBIN collaboration, with the intent of potentially sharing this data on the IDC platform. We would like to include information about the project in the DICOM tags, something to the effect of “anonymized for usage in the ROBIN Collaboration”. Can the DICOM experts out there recommend the tag that would be most appropriate for this purpose? Should we avoid private (odd-numbered group) tags in case they get filtered out or removed?

Here are the potential addresses we’ve considered as candidates:

(0008,4000) Identifying Comments (Retired)

(0012,0010) ClinicalTrialSponsorName

(0012,0020) ClinicalTrialProtocolID

(0012,0021) ClinicalTrialProtocolName

(0012,0030) ClinicalTrialSiteID

(0012,0031) ClinicalTrialSiteName

(0012,0040) ClinicalTrialSubjectID

(0012,0042) ClinicalTrialSubjectReadingID

(0012,0050) ClinicalTrialTimePointID

(0012,0051) ClinicalTrialTimePointDescription

(0012,0060) CoordinatingCenterName

(0012,0062) PatientIdentityRemoved = YES

**(0012,0063) DeIdentificationMethod** = {CTP Default: based on DICOM PS3.15 AnnexE. Details in 0012,0064}, potentially augmenting this tag with the name of our imaging network?

(0012,0064) DeIdentificationMethodCodeSeq = 113100/113105/113107/113108/113109

(0013,0010) BlockOwner = CTP

(0013,1010) ProjectName

(0013,1011) TrialName

(0013,1012) SiteName

(0013,1013) SiteID

(0020,4000) Image Comments

Thanks for any suggestions! Please let me know if I can clarify further.

Cheers,
Eve

@locastre I would recommend using the recently introduced mechanism of including DOI in the DICOM object.

We have been using the following approach for the recent new datasets in IDC:

  1. reserve a DOI using Zenodo
  2. include the reserved DOI in the DICOM files for the new collection, as described in DICOM CP 2335
  3. publish the data simultaneously with releasing the Zenodo data descriptor

You can see how this approach works in the DICOM-converted Slide Microscopy images for the Cancer Moonshot Biobank initiative collections. Note that for large collections, we do not deposit the actual data to Zenodo, but instead create data descriptors containing manifests that can be used to download the data. So for this collection the steps would be:

  1. download (as an example) this manifest cmb_gec-idc_v19-aws.s5cmd
  2. install s5cmd: pip install s5cmd
  3. download files listed in the manifest using this command: s5cmd --no-sign-request run ./cmb_gec-idc_v19-aws.s5cmd
  4. observe that your DICOM file contains a DOI that points back to the Zenodo record above:
dcmdump df949ce3-ac9a-4a9d-a62f-bdc78902680d.dcm|grep Clinical
(0012,0010) LO [National Cancer Institute (NCI)]        #  32, 1 ClinicalTrialSponsorName
(0012,0020) LO [CMB-GEC]                                #   8, 1 ClinicalTrialProtocolID
(0012,0021) LO [Cancer Moonshot Biobank - Gastroesophageal Cancer (CMB-GEC)] #  60, 1 ClinicalTrialProtocolName
    (0012,0020) LO [doi:10.5281/zenodo.11099112]            #  28, 1 ClinicalTrialProtocolID
(0012,0030) LO (no value available)                     #   0, 0 ClinicalTrialSiteID
(0012,0031) LO (no value available)                     #   0, 0 ClinicalTrialSiteName
(0012,0040) LO [MSB-03738]                              #  10, 1 ClinicalTrialSubjectID
(0012,0050) LO [archival]                               #   8, 1 ClinicalTrialTimePointID
(0012,0060) LO (no value available)                     #   0, 0 ClinicalTrialCoordinatingCenterName
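
To check the last step without reading the dump by eye, here is a minimal Python sketch that scans dcmdump text output for values starting with `doi:` and turns them into `https://doi.org/` resolver URLs. The function name `find_doi_urls` and the regular expression are illustrative assumptions, not part of DCMTK or IDC tooling; the sample text is copied from the output above.

```python
import re

# Matches a bracketed dcmdump value that starts with "doi:" (assumed convention,
# based on the value shown in the dump above).
DOI_VALUE = re.compile(r"\[doi:(?P<doi>10\.[^\]\s]+)\]")


def find_doi_urls(dcmdump_text):
    """Return a resolver URL for every doi: value found in dcmdump text output."""
    return ["https://doi.org/" + m.group("doi") for m in DOI_VALUE.finditer(dcmdump_text)]


sample = "\n".join([
    "(0012,0020) LO [CMB-GEC]                                #   8, 1 ClinicalTrialProtocolID",
    "    (0012,0020) LO [doi:10.5281/zenodo.11099112]            #  28, 1 ClinicalTrialProtocolID",
])

print(find_doi_urls(sample))
# → ['https://doi.org/10.5281/zenodo.11099112']
```

In practice the text would come from piping `dcmdump` output into the script rather than from a hard-coded sample.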

I believe this is the most principled way to achieve what you need.

If you plan to submit your data to IDC, we should discuss this with the IDC stakeholders to make sure we can support you first.

cc: @ulrike @granger

Not sure why (perhaps the version of dcmtk has not been updated with the necessary data elements yet), but @fedorov’s example elides a few details, like the Other Clinical Trial Protocol IDs Sequence (0012,0023) and the Issuer of Clinical Trial Protocol ID (0012,0022).

See the full description of Clinical Trial Protocol ID in the current release of the standard.

If you use a recent (since 20240325) release of my dicom3tools with dcdump, it should show these relatively new data elements with the proper names.

Regarding the part of your original question about describing the de-identification: that should be described separately from the project identification, specifically using Patient Identity Removed (0012,0062) and De-identification Method (0012,0063) and/or De-identification Method Code Sequence (0012,0064). See the description in PS3.15 and the attribute description in the Patient Module. Also consider the Longitudinal Temporal Information Modified (0028,0303) attribute.
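
As a rough illustration of that separation, the sketch below collects the de-identification attributes in a plain Python dict keyed by (group, element). This is only a stand-in, not a DICOM toolkit API: the function name `describe_deidentification` is hypothetical, the code sequence is simplified to a list of code values (a real sequence item also carries a coding scheme designator and code meaning), and the example values are taken from the original question.

```python
# Allowed values for Longitudinal Temporal Information Modified (0028,0303).
TEMPORAL_VALUES = {"UNMODIFIED", "MODIFIED", "REMOVED"}


def describe_deidentification(method_text, method_codes, temporal_info):
    """Build the de-identification attributes, kept apart from project identification."""
    if temporal_info not in TEMPORAL_VALUES:
        raise ValueError("temporal_info must be one of " + ", ".join(sorted(TEMPORAL_VALUES)))
    return {
        (0x0012, 0x0062): "YES",               # Patient Identity Removed
        (0x0012, 0x0063): method_text,         # De-identification Method
        (0x0012, 0x0064): list(method_codes),  # De-identification Method Code Sequence (code values only)
        (0x0028, 0x0303): temporal_info,       # Longitudinal Temporal Information Modified
    }


deid = describe_deidentification(
    "CTP Default: based on DICOM PS3.15 AnnexE. Details in 0012,0064",
    ["113100", "113105", "113107", "113108", "113109"],
    "MODIFIED",  # assumption for illustration: dates were shifted
)

# None of the Clinical Trial Protocol attributes appear here; the project
# identification would be written separately.
print(sorted("(%04X,%04X)" % tag for tag in deid))
```

A real pipeline would write these values with a DICOM library; the dict only shows which attributes carry the de-identification description.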


Ah, nice catch! Indeed, one needs to compile DCMTK from source to see those.

Here’s the complete output with the items mentioned by David now included:

 $ dcmdump ./20d92594-9368-40cd-b086-d5c6aea96121.dcm |grep Clini
(0012,0010) LO [National Cancer Institute (NCI)]        #  32, 1 ClinicalTrialSponsorName
(0012,0020) LO [CMB-AML]                                #   8, 1 ClinicalTrialProtocolID
(0012,0021) LO [Cancer Moonshot Biobank - Acute Myeloid Leukemia (CMB-AML)] #  58, 1 ClinicalTrialProtocolName
(0012,0022) LO [NCI]                                    #   4, 1 IssuerOfClinicalTrialProtocolID
(0012,0023) SQ (Sequence with undefined length #=1)     # u/l, 1 OtherClinicalTrialProtocolIDsSequence
    (0012,0020) LO [doi:10.5281/zenodo.11099112]            #  28, 1 ClinicalTrialProtocolID
    (0012,0022) LO [DOI]                                    #   4, 1 IssuerOfClinicalTrialProtocolID
(0012,0030) LO (no value available)                     #   0, 0 ClinicalTrialSiteID
(0012,0031) LO (no value available)                     #   0, 0 ClinicalTrialSiteName
(0012,0040) LO [MSB-01723]                              #  10, 1 ClinicalTrialSubjectID
(0012,0050) LO [on treatment]                           #  12, 1 ClinicalTrialTimePointID
(0012,0060) LO (no value available)                     #   0, 0 ClinicalTrialCoordinatingCenterName
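
To make the pairing of each protocol ID with its issuer explicit, here is a small Python sketch that reads grep-filtered dcmdump text like the output above. It treats any indented line as belonging to the Other Clinical Trial Protocol IDs Sequence, which is an assumption that holds for this output but not for arbitrary dumps; the function name `protocol_ids` is hypothetical.

```python
import re

# One dcmdump line with a bracketed value: optional indent, (gggg,eeee), VR, [value].
LINE = re.compile(
    r"^(?P<indent>\s*)\((?P<tag>[0-9a-fA-F]{4},[0-9a-fA-F]{4})\)\s+\w\w\s+\[(?P<value>[^\]]*)\]"
)


def protocol_ids(dcmdump_text):
    """Pair each Clinical Trial Protocol ID with the issuer at the same nesting level."""
    records = []
    for line in dcmdump_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # e.g. sequence headers or empty values
        nested = bool(m.group("indent"))
        if m.group("tag") == "0012,0020":
            records.append({"id": m.group("value"), "issuer": None, "nested": nested})
        elif m.group("tag") == "0012,0022" and records and records[-1]["nested"] == nested:
            records[-1]["issuer"] = m.group("value")
    return records


sample = "\n".join([
    "(0012,0020) LO [CMB-AML]                                #   8, 1 ClinicalTrialProtocolID",
    "(0012,0022) LO [NCI]                                    #   4, 1 IssuerOfClinicalTrialProtocolID",
    "(0012,0023) SQ (Sequence with undefined length #=1)     # u/l, 1 OtherClinicalTrialProtocolIDsSequence",
    "    (0012,0020) LO [doi:10.5281/zenodo.11099112]            #  28, 1 ClinicalTrialProtocolID",
    "    (0012,0022) LO [DOI]                                    #   4, 1 IssuerOfClinicalTrialProtocolID",
])

for record in protocol_ids(sample):
    print(record)
```

For the sample, this yields the primary ID `CMB-AML` issued by `NCI` and the nested DOI issued by `DOI`.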