Understanding IDC portal cohort plots

We had a team meeting today, and based on the discussion, the explanation of the behavior currently observed in IDC portal is the following (to the best of my understanding):

  • Plots that correspond to the groups of facets that have selected facets show counts of cases that have the data for those selected facets.

  • Plots that correspond to the groups of facets that do not have selected facets show counts of cases for all of the facets included in the group of facets for the cohort defined by the current filter.

As such, plots can only be interpreted in conjunction and in the context of the defined filter.

So in the example below, where “CT” is selected for the “Modality” facet group, “Modality” piechart shows the number of cases that have CT Modality. The other facets show the counts for the corresponding facet groups for all of the cases that are included in the cohort defined by the filter.

@wlongabaugh @spaquett @george.white would be great if you could confirm my summary above is correct, and feel free to amend for conciseness/clarity, since we will be using this for the documentation.

Lets use SQL to be precise. SQL queries to generate the charts would look something like the following:
select count(distinct caseId), BodyPartExamined where Modality =‘CT’ group by BodyPartExamined;
select count(distinct caseId), Modality where Modality=‘CT’ group by Modality;
select count(distinct caseId), ObjectClass where Modality=‘CT’ group by ObjectClass;
Now I left out the table name and I don’t know the actual column names but I hope this provides clarification.

1 Like

That’s a great idea George, although we still will need to come up with an explanation for the IDC users that does not require knowledge of SQL.

But in order to be precise, I think we need to have the precise table that is being queried by the frontend. Can @spaquett give a pointer to such table in BQ? Otherwise, these will be hard to confirm.

I am not sure the above is correct. As you can see from the filters configuration in the screenshot, “Object class” piechart includes a lot of categories, including Segmentation, for example. But it will will all depend on what is the table behind the query.

I am not saying the queries you provided should correspond to the behavior I would expect, but - assuming that the table being queried by Solr is similar to the DICOM metadata we have in BQ - the result of the query is quite different from what is shown in the portal interface.

Here’s the specific query against the actual table:

SELECT
  COUNT(DISTINCT(PatientID)) as cases_per_objectClass,
  SOPClassUID
FROM
  `canceridc-data.idc_views.dicom_all`
WHERE
  Modality = "CT"
GROUP BY
  SOPClassUID

And here’s the result of running the query above:

image

Here’s the result of selecting CT modality in the portal:

image

Note that Object Class pie chart has more than 2 entries.

So the question is - are the queries you specified not the ones that reflect the expected behavior of the portal, or is the portal showing incorrect results?

Those queries reflected my understanding of the web app although I admit I am not too familiar with the back end. But both you and Suzanne are more familiar with the source data.

If I go to the webapp and use the filter definition
COLLECTION IN (lidc_idri) AND MODALITY IN (Computed Tomography) then I get 3 non zero Object Classes: CT Image (1010 cases), Enhanced SR (875 cases) and Segmentation (875 cases). I expect you should know if this is correct (or at least makes sense) or not. If it makes sense then something is off with your query or your interpretation of the query.

The dicom_all table isn’t representative because it doesn’t include the derived data. That’s where things get tricky–the derived records have NULL for things like BodyPartExamined and SOPClassUID, or have a modality which is eg. SR or SEG.

There’s an added wrinkle on top of that. This query is counting cases from records with Modality=‘CT’, but it isn’t including all the records from that case (or studies that case is in). This is what the pie charts are showing: records with Modality=‘CT’ are found, the studies those records belong to are pulled, and then the facets are counted. So it’s more like:

SELECT SOPClassUID, COUNT(*) as cases
FROM (
  SELECT PatientID, SOPClassUID
  FROM (
    SELECT PatientID, SOPClassUID
    FROM 	`idc-dev-etl.etl_metadata.dicom_derived_all_new`
    WHERE StudyInstanceUID in (
      SELECT StudyInstanceUID 
      FROM `idc-dev-etl.etl_metadata.dicom_derived_all_new`
      WHERE Modality='CT')
      GROUP BY PatientID, SOPClassUID
  )
)
GROUP BY SOPClassUID;

This comes down to what does Modality=‘CT’ apply to. Case? Instance? Study? Here I’m showing study; the above query is instance. We can’t go by instance if we want derived data to work as indicated.

This is what changing the Solr index to have records by study or case will fix. We just need to decide what we want the filter to result in. Records from a study, or records from a case.

@spaquett if I follow your query, then why don’t I see a similar result for Modality?

This query

FROM (
  SELECT PatientID, Modality
  FROM (
    SELECT PatientID, Modality
    FROM 	`idc-dev-etl.etl_metadata.dicom_derived_all_new`
    WHERE StudyInstanceUID in (
      SELECT StudyInstanceUID 
      FROM `idc-dev-etl.etl_metadata.dicom_derived_all_new`
      WHERE Modality='CT')
      GROUP BY PatientID, Modality
  )
)
GROUP BY Modality;

results in this:

image

not in this:

image

Per the conversation in the GitHub ticket, right now, the above filter is how records are counted. This is the reason for the behavior you’re seeing.

With related data, there’s a slight difference, in that related filters are only applied to related cases. In effect, the above query becomes something like:

SELECT SOPClassUID, COUNT(*) as cases
FROM (
  SELECT PatientID, SOPClassUID
  FROM (
    SELECT PatientID, SOPClassUID
    FROM 	`idc-dev-etl.etl_metadata.dicom_derived_all_new`
    WHERE StudyInstanceUID in (
      SELECT StudyInstanceUID 
      FROM `idc-dev-etl.etl_metadata.dicom_derived_all_new` dicom
      LEFT JOIN `isb-cgc.TCGA_bioclin_v0.clinical_v1` AS clin
      ON clin.case_barcode = dicom.PatientID
      WHERE (has_related='True' AND clin.vital_status='Alive')
    )
    GROUP BY PatientID, SOPClassUID
  )
)
GROUP BY SOPClassUID;

If you had an original filter as well, it becomes:

SELECT SOPClassUID, COUNT(*) as cases
FROM (
  SELECT PatientID, SOPClassUID
  FROM (
    SELECT PatientID, SOPClassUID
    FROM 	`idc-dev-etl.etl_metadata.dicom_derived_all_new`
    WHERE StudyInstanceUID in (
      SELECT StudyInstanceUID 
      FROM `idc-dev-etl.etl_metadata.dicom_derived_all_new` dicom
      LEFT JOIN `isb-cgc.TCGA_bioclin_v0.clinical_v1` AS clin
      ON clin.case_barcode = dicom.PatientID
      WHERE ((has_related='True' AND clin.vital_status='Alive') OR has_related='False') AND (Modality='CT')
    )
    GROUP BY PatientID, SOPClassUID
  )
)
GROUP BY SOPClassUID;

For derived data, we CAN do this, we’re just not doing it currently–because unlike related data, which is linked to PatientID and so relatively easy to determine a simple has/does not have relationship based on the PatientID, derived data are instance records as well, at the same level (if you will) as Original data. Furthermore, derived data is broken down into 3 separate subcategories (Segmentation, Qualitative, Quantitative). So, as for how it’s behaving right now, this is the filter for records:

SELECT SOPClassUID, COUNT(*) as cases
FROM (
  SELECT PatientID, SOPClassUID
  FROM (
    SELECT PatientID, SOPClassUID
    FROM 	`idc-dev-etl.etl_metadata.dicom_derived_all_new`
    WHERE StudyInstanceUID in (
      SELECT StudyInstanceUID 
      FROM `idc-dev-etl.etl_metadata.dicom_derived_all_new` dicom
      WHERE (AnatomicRegionSequence IS NULL)
    )
    GROUP BY PatientID, SOPClassUID
  )
)
GROUP BY SOPClassUID
ORDER BY cases DESC;

The filter for counts, on the other hand:

SELECT SOPClassUID, COUNT(*) as cases
FROM (
  SELECT PatientID, SOPClassUID
  FROM (
    SELECT PatientID, SOPClassUID
    FROM 	`idc-dev-etl.etl_metadata.dicom_derived_all_new`
    WHERE StudyInstanceUID in (
      SELECT StudyInstanceUID 
      FROM `idc-dev-etl.etl_metadata.dicom_derived_all_new` dicom
      WHERE (AnatomicRegionSequence IS NULL AND has_segmentation='True')
    )
    GROUP BY PatientID, SOPClassUID
  )
)
GROUP BY SOPClassUID
ORDER BY cases DESC;

Notice that counts are explicitly done under their derived subcategory. The record filter, on the other hand, assumes that if the user checks AnatomicRegionSequence is None, they mean anything with no AnatomicRegionSequence available.

Because the charts are ‘hardwired’ in the UI to only show the filtered attributes within a category if the category (ie Modality) is used in the filter. Meanwhile the counts appearing in the filter selection panel on the LHS are calculated AS IF ALL FILTERS WITHIN THE GIVEN CATEGORY ARE TURNED OFF. Thus if your filter is just "Modality in ‘CT’ " then the Object Class (or SOPClassUID) counts are calculated using Modality=‘CT’, ie Suzanne’s query given above. However the Modality counts are calculated WITHOUT this filter, or without the “where Modality=‘CT’” clause in your query. This is consistent with portal.gdc.cancer.gov. It’s understandable behavior with exclusive attributes (ie Primary Tumour Site, Program etc). It’s more confusing where one case can ‘belong’ to several attributes at once (ie Modality).

@spaquett I will need time to ponder over your post. Thank you.

@george.white your response brings me back to the question I asked earlier in this thread - is the definition below accurate based on your understanding?

So back to the UI: If I filter with ‘ANATOMIC REGION in None’ the counts I get back for Object Class are identical to those I get using your ‘filter for records’ query and your ‘filter for counts’. Is this expected? Am I passing in the filter to the backend correctly?

https://idc-dev.appspot.com/explore/?counts_only=True&is_json=true&is_dicofdic=True&data_source_type=S&filters={"AnatomicRegionSequence":["None"]}

Correction: … your ‘filter for records’ query and NOT your ‘filter for counts’ query

That explanation does sound correct. We should work on our terminology or borrow from another resource. I have used ‘attribute’ where you are using ‘facet’. I have used ‘category’ in place of ‘groups of facets’.

1 Like

It’s understandable behavior with exclusive attributes (ie Primary Tumour Site, Program etc). It’s more confusing where one case can ‘belong’ to several attributes at once (ie Modality).

This is our primary issue, the notion of exclusive attributes vs. non-exclusive; or rather, the notion of a PatientID/Case being linked to multiple records and so having multi-valued attributes, or just one. It’s not unsolvable, in fact in an indexer like Solr it’s quiet easy to manage. But first we have to decide what level we’re going to use for filters.

So back to the UI: If I filter with ‘ANATOMIC REGION in None’ the counts I get back for Object Class are identical to those I get using your ‘filter for records’ query and your ‘filter for counts’. Is this expected? Am I passing in the filter to the backend correctly?

Sorry, that might not have been explicit: derived data is handled differently from Original. Derived counts (and related counts) only count derived records. The filter for records isn’t restricted, however. So when I said the filter for counts, I meant, the filter for derived counts. (Note has_segmentation='True'; this is what restricts this count to derived records only–explicitly derived records with segmentation data, in fact.).

…apologies, I forgot to add, because SOPClassUID is not a derived attribute, the calculation is done over all records.

Ok, I have a partial understanding. What you have called ‘filter for records’ would be a query to count cases in an original data category (SOPClassUID in this case). It can be adjusted for different filters by changing the ‘WHERE Clause’. Now ‘filter for counts’ would be a query to count cases in derived:segmentation categories (but replace SOPClassUID with a derived:segmentation category). The only difference between this and ‘filter for records’ (besides the category) is the ‘has_segmentation=true’ clause. I think for counts in derived:quantitative you have ‘has_quantitative=true’ instead? Now ‘has_segmentation=true’ is only going to matter when ‘AnatomicRegionSequence is NULL’ is in the filter. If AnatomicRegionSequence has any other filter value then ‘has_segmentation=true’ is redundant.

Actually I I can take a stab at explaining it … with set theory and Venn diagrams. It can be understandable but as is the numbers are not going to be intuitive to anyone who has not spend a few minutes with the documentation.

Example w Discussion:

Here we filter for Atomic Region=Pharyngeal tonsil (adenoid)). From the UI there should only be 25 cases selected.


Here we examine the Modality Counts for the filter ‘Atomic Region=Pharyngeal tonsil (adenoid))’ (see RHS). Each case can have multiple modalities, so we do not expect the sum of all counts to be equal to 25, but the maximum number of counts in any one attribute to be less or equal to 25. This is confirmed.

Here we filter for ‘Atomic Region = None’. The UI is showing that there are only 10 cases with ‘Atomic Region = None’. However, within the Anatomic Region category the UI is only counting cases which have segmentation data but do not have an Anatomic Region within the segmentation data as None. There are hundreds of other cases that do not have any segmentation data.

Here is Modality with the filter ‘Anatomic Region=None’. How is it that many of the case counts are much greater than 10, the total number of cases we show for the ‘Anatomic Region=None’ bucket? Well in counting cases for Modality we now include all the cases that do not have any segmentation data, and by default their ‘Anatomic Region’ is None.

Thank you George, this is precisely what’s happening.

Again–it doesn’t have to work this way, this is just the logical outcome of wanting to associate derived data with the studies/cases they’re linked to. Because we’re working at an instance level with our data, where the derived and original records of a given study or case aren’t unified in one another’s attribute values (because why would they be–they’re different records, after all), we get some weird numbers and lists when we want to count on something other than instance records.

This is is why Ulrika and Justin asked us what we planned to do; it’s a thorny problem with no obvious answer, beyond ‘decide and document’.

Well, not quite - in fact we include all of the cases that have any instance with ‘Anatomic Region’ set to None. And every single case will have such instance, since every single study that has a Segmentation instance will have non-Segmentation instances.

So if the user selects ‘None’ for Derived filters, it has no effect on the definition of the cohort. I do not know how this can be helpful. We can discuss how this behavior can be improved, but until then maybe it’s better to just remove “None” as an option from the filters.

Following the discussion today with @spaquett, @george.white and @wlongabaugh, we all agreed to disable “None” filter selection for the derived data, since its current behavior is most likely not what any user would expect.