Storing definitions of data collections as DICOM entities

Hi, I am not directly affiliated with IDC (yet), but I have a discussion topic that should be relevant to this audience and was pointed here by @fedorov:

I have been wondering whether and how it makes sense to represent training / validation / test group definitions in DICOM entities. My very specific goal is to be able to use a single DICOM UID for unambiguously referencing the data that should be used for training (and possibly another UID for a validation set). I thought it should suffice to give my deep-learning (CNN-based) segmentation algorithm container the connection information for a PACS together with these UIDs, and the training should then be able to pull all necessary data from the PACS.

Taking a step back, I guess there are a few related use cases, potentially also within IDC, such as uniquely defining (versions and variants of) data collections.

My specific idea / impression was that Key Object Selection Documents may be usable for this purpose (disclaimer: impression based on hearsay; I have not used them before). These could refer to DICOM SEGs, which in turn refer to the original images, allowing the training to pull all intended training data from the server.

Does my idea sound feasible? Are there role models / predecessors / related endeavors or use cases I did not see yet? Any public spec or experiences?


Hi @hmeine - thanks for posting :+1: I agree that whatever we do in this area needs to leverage the DICOM metadata and allow referencing input data and processing variables using a consistent structure that reuses DICOM conventions. Whether it’s ever part of the standard doesn’t matter much at this point, but we can think of it as being in draft mode while the details are being worked through. I don’t think we have any really good examples yet, but I’m interested to hear what others have done.

@hmeine welcome to IDC forum! :fireworks:

I assume that you want to define your data collections in a way that spans multiple patients. If that is indeed the case, my understanding is that saving such a collection as a DICOM object would conflict with the DICOM view of the real world: that object would need to fit into the patient->study->series->instance hierarchy, so what PatientID would be specified in it (whatever kind of object it is going to be - KO or SR)?

Of course, nothing prevents you from hacking your way through and making a container that holds that list, but I am not entirely sure about the benefit. I completely agree that the definition of the collection/cohort should leverage DICOM UIDs, but it is not clear to me that DICOM is even the right mechanism to communicate collections/cohorts.

In IDC, a collection/cohort can be defined using one of two approaches, each of which will uniquely and persistently resolve to the list of files corresponding to that cohort:

  1. as a list of SOPInstanceUIDs selected from IDC dicom_all table + IDC data version number
  2. as an SQL query against IDC dicom_all table + IDC data version number
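
A minimal sketch of how these two options could be captured in one small data structure. This is not IDC's actual tooling: the class name `CohortDefinition`, its fields, and the fingerprint helper are hypothetical and only illustrate that a cohort is pinned by "list or query" plus the IDC data version number.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CohortDefinition:
    """Hypothetical container for a cohort pinned to one IDC data version."""

    idc_version: int                           # IDC data version number
    sop_instance_uids: Tuple[str, ...] = ()    # approach 1: explicit UID list
    sql_query: Optional[str] = None            # approach 2: query text

    def __post_init__(self):
        # Exactly one of the two approaches must be used.
        if bool(self.sop_instance_uids) == bool(self.sql_query):
            raise ValueError("give either a UID list or an SQL query")

    def to_json(self) -> str:
        # Sort the UIDs so that the same set always serializes identically.
        return json.dumps(
            {
                "idc_version": self.idc_version,
                "sop_instance_uids": sorted(self.sop_instance_uids),
                "sql_query": self.sql_query,
            },
            sort_keys=True,
        )

    def fingerprint(self) -> str:
        # Content hash: changes whenever the version, list, or query changes.
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()
```

The fingerprint is only a convenience for detecting accidental changes; resolving the definition to files still requires the versioned dicom_all table.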

A related issue is that DICOM is not intended to handle versioning, while research applications and repositories must deal with it. This adds another layer of complexity, and is perhaps a motivation for a repository- or solution-specific approach to cohort definition.

Of course, within an individual patient/study one can utilize various existing capabilities of the standard to select individual series, or specify the provenance of analysis artifacts, etc.

I would definitely love to hear @dclunie's and @hackermd's thoughts on this.

Coincidentally, I am at our regular weekly Office Hours here until 11:30am Boston time - if anyone wants to drop by and discuss this topic in person!

Actually DICOM just added the capability to do something like this in Sup 223, but it is brand new and not implemented yet. There is no reason, though, why you could not manually create and store an “Inventory Instance”, which per Sup 223 is a so-called “non-patient object”.


Thanks to both of you for the quick replies! I must admit that I did not consider the possibility that my motivation to store these as DICOM objects would be higher than yours. :wink:

So, let me take the “good cop” role and put my DICOM preaching hat on:

  • I am motivated to store the list permanently and immutably, and it makes a lot of sense that it is “close to the data itself”; it kind of belongs together IMO.
  • From the specific perspective of my task, it would be so convenient to specify one UID for the dataset, instead of a list that is not as easy to pass around via cmdline args, environment variables or so, and might also break / lose entries without notice.
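
One way a single UID could guard against silently lost entries is to derive it from the list itself. The following is a minimal sketch under that assumption, not an established convention: it hashes the sorted SOPInstanceUIDs into a name-based UUID and writes it in the `2.25.` form that DICOM PS3.5 permits for UUID-derived UIDs. The function name `dataset_uid` and the namespace string are hypothetical.

```python
import uuid

# Hypothetical namespace for dataset UIDs; any fixed UUID would do.
DATASET_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "datasets.example.org")


def dataset_uid(sop_instance_uids):
    """Derive one deterministic DICOM-style UID from a set of SOPInstanceUIDs."""
    # Sort and de-duplicate so the order of the input does not matter.
    canonical = "\n".join(sorted(set(sop_instance_uids)))
    derived = uuid.uuid5(DATASET_NAMESPACE, canonical)
    # "2.25." followed by the decimal UUID is the UUID-derived UID form.
    return "2.25." + str(derived.int)
```

Such a UID only identifies the list; something on the server side still has to store the list under it. It does let the training container recompute the UID from what it retrieved and notice a missing or extra entry.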

I understand that this would be stretching the purpose of a regular PACS / the DICOM data model. OTOH, I would find it quite disappointing to have to set up another server with a different protocol (HTTP/S3/…) just for serving the list of cases.

Thanks David (as you probably guessed, your post arrived after I composed mine), the Inventory Information Model looks very interesting indeed! I will have to study it. When implementing it, I would probably run into the problem that the new tags and constants are not part of the dcmtk / pydicom / dicom-rs etc. dictionaries yet, right?

I am not so concerned about it being a new supplement, since I am trying to get something running (in a “tentatively standard-conformant way”) that consists only of my own pieces of software talking to each other (and to a PACS, which hopefully does not care about unknown objects).

Thanks @dclunie I wasn’t aware of that either.

I can’t speak for all libraries, but you should be able to supplement the dictionaries to define the new codes as a workaround and of course could submit PRs to add anything that’s missing.

@dclunie I was completely unaware of this - this is great, and we should discuss this in more detail and evaluate its utility in the context of IDC!

Are there any publicly available samples? I looked around the FTP server, but either there are none, or (more likely) I have no idea how to navigate those folders.

No samples AFAIK - this is brand new (was just completed last week).

@hmeine I think it is quite possible that your PACS may balk at those unknown objects, since that supplement is introducing a new SOPClassUID, and it would be quite logical for a PACS to refuse object classes not listed in the conformance statement.

I am considering Orthanc or DCM4CHEE; this is not about a clinical installation (just a research installation under our control in a hospital). Would you also expect these less clinically oriented “PACS-like systems” to have a problem with unknown SOP Class UIDs?

I don’t know, I don’t have much experience with those. But I think it should be easy to test. You can, for example, take any DICOM object, replace SOPClassUID with any value that is valid VR-wise but not listed in the standard, and try to send and retrieve that object with Orthanc or DCM4CHEE. This must be much faster and easier than asking the maintainers through their forum etc.
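
For such a test, the replacement value has to satisfy the syntax of the UI value representation: dot-separated numeric components without leading zeros, at most 64 characters in total. A minimal sketch using only the Python standard library follows; the function names are hypothetical, and writing the value into a file and sending it to the server would be done with a DICOM toolkit, which is not shown here.

```python
import re
import uuid

# A UI value: dot-separated numeric components, no leading zeros.
_UI_PATTERN = re.compile(r"(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*")


def is_valid_ui(value):
    """Return True if `value` is syntactically valid for the DICOM UI VR."""
    return len(value) <= 64 and _UI_PATTERN.fullmatch(value) is not None


def make_test_sop_class_uid():
    """Create a UID that is valid VR-wise but not a SOP Class in the standard."""
    # UUID-derived UIDs under the 2.25 root cannot collide with the standard's
    # SOP Class UIDs, which live under 1.2.840.10008.
    return "2.25." + str(uuid.uuid4().int)
```

If the server stores and returns an object carrying such a SOPClassUID, that is at least a hint that it will not reject a SOP Class it does not know.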

Just a quick remark while I am busy with other stuff and still could not read the full supplement: The slides on Sup 223 state that the persistent inventory object is “Managed like other DICOM non-patient objects” (slide 9). I found that interesting because of the discussion above about DICOM focusing on per-patient objects.

When reading the supplement, one area where my intended use case and the supplement’s original motivation diverge is that creating a copy of the contents is not what I want; my inventory should just refer to the necessary other objects. It’s clear that the latter is somehow also supported (in any case, the original data will be fully identified), but much of the text seems to assume that one wants the inventory to contain all the data. I am not very experienced with reading the standard, even less so with supplements, so I do not feel very confident yet on how good of a fit this is.

Looking at the Inventory Module, it does appear that it references studies/series/instances by UIDs. I do not see where it would keep a copy of all of the content of their attributes.