Understanding IDC portal cohort plots

fedorov · August 27, 2020, 5:44pm

IDC MVP portal is searching for cases that meet the search criteria. However, I am struggling understanding the plots that summarize the cohort selection.

Here’s the specific example. If I select CT modality, I get the summary below:

If I stick with the stated cohort selection rules, plots should summarize the shown attributes for all cases that include CT modality. Clearly, the middle piechart stands out as not meeting that desired behavior, since most definitely cases that have CT modality also have other modalities (e.g., Segmentation object class will definitely not have CT as the Modality).

Is this a bug or there is an explanation for this behavior? I cannot reconcile the stated expected behavior of the app with what I see.

@spaquett @george.white would be great to hear your thoughts.

spaquett · August 27, 2020, 6:22pm

This gets back to the data structure issue. Right now, our data is all Instance based. So, when you click a filter, you’re saying ‘what is the number of cases with instances that have this modality, and what are the faceted counts for those instances’. What you’re describing instead is, ‘which cases have instances that have this modality, and what are their faceted counts’. The later is case-centric.

This is on our radar to fix post-MVP by restructuring the data the WebApp uses to be case-centric. This will solve a LARGE number of problems we’re having right now.

Going forward, we always want to ask ourselves: are filters meant to return info on cases that have instances which match those filters? Or on instances that match those filters? (Or studies, or…) Filtering on instances, you lose information about the instances which don’t match. If you filter at a higher level, like cases, you don’t.

fedorov · August 27, 2020, 6:42pm

I don’t think that’s what I see. I am afraid it is a mix of both. What I see in the plot for “Modality” seems to be the count of either cases or studies (definitely not instances, the number is too small for that) that have Modality = CT. What I see in the plot for “Object class” on the same screen appears to be the counts for the cases or studies for all types of Object classes in the selected cohort where Modality = CT.

I thought we were having troubles just with the derived data, but it appears that the other counts are also not necessarily correct.

If you ask me, I would say the current behavior is a bug. If we keep it this way for the MVP, I will need to recruit you to write documentation for the portal

spaquett · August 27, 2020, 7:27pm

What I see in the plot for “Modality” seems to be the count of either cases or studies (definitely not instances, the number is too small for that) that have Modality = CT. What I see in the plot for “Object class” on the same screen appears to be the counts for the cases or studies for all types of Object classes in the selected cohort where Modality = CT.

I think I’m not clear on what you would expect to see otherwise. It’s the instances which are selected, not the cases. So, by selecting CT, you’re getting those instances, and then the Object Classes for instances with a modality of CT.

What would you expect to see instead, for Object Class?

spaquett · August 27, 2020, 7:29pm

Sorry I misread–I think I understand what you’re saying. And to be clear, it’s cases. We don’t count studies.

…ah ha. This is due to how we wanted to make Original data ‘child record case search’ against derived. It’s also applying it to other Original data records. Since I’m working on counting related separately (same code) I’ll see if I can sort out a way to only go between orginal/derived and not original/original.

Note though, if we swap to case centric, this IS what we’ll see.

fedorov · August 28, 2020, 2:05am

Great, I made an issue to make sure this is tracked:

Sounds like we need to discuss this at the next meeting to help me understand!

fedorov · August 28, 2020, 10:08pm

We had a team meeting today, and based on the discussion, the explanation of the behavior currently observed in IDC portal is the following (to the best of my understanding):

Plots that correspond to the groups of facets that have selected facets show counts of cases that have the data for those selected facets.
Plots that correspond to the groups of facets that do not have selected facets show counts of cases for all of the facets included in the group of facets for the cohort defined by the current filter.

As such, plots can only be interpreted in conjunction and in the context of the defined filter.

So in the example below, where “CT” is selected for the “Modality” facet group, “Modality” piechart shows the number of cases that have CT Modality. The other facets show the counts for the corresponding facet groups for all of the cases that are included in the cohort defined by the filter.

@wlongabaugh @spaquett @george.white would be great if you could confirm my summary above is correct, and feel free to amend for conciseness/clarity, since we will be using this for the documentation.

george.white · September 1, 2020, 9:19pm

Lets use SQL to be precise. SQL queries to generate the charts would look something like the following:
select count(distinct caseId), BodyPartExamined where Modality =‘CT’ group by BodyPartExamined;
select count(distinct caseId), Modality where Modality=‘CT’ group by Modality;
select count(distinct caseId), ObjectClass where Modality=‘CT’ group by ObjectClass;
Now I left out the table name and I don’t know the actual column names but I hope this provides clarification.

fedorov · September 2, 2020, 3:45pm

That’s a great idea George, although we still will need to come up with an explanation for the IDC users that does not require knowledge of SQL.

But in order to be precise, I think we need to have the precise table that is being queried by the frontend. Can @spaquett give a pointer to such table in BQ? Otherwise, these will be hard to confirm.

george.white:

SQL queries to generate the charts would look something like the following:

select count(distinct caseId), BodyPartExamined 
  where Modality =‘CT’ group by BodyPartExamined;
select count(distinct caseId), Modality 
  where Modality=‘CT’ group by Modality;
select count(distinct caseId), ObjectClass 
  where Modality=‘CT’ group by ObjectClass;

I am not sure the above is correct. As you can see from the filters configuration in the screenshot, “Object class” piechart includes a lot of categories, including Segmentation, for example. But it will will all depend on what is the table behind the query.

I am not saying the queries you provided should correspond to the behavior I would expect, but - assuming that the table being queried by Solr is similar to the DICOM metadata we have in BQ - the result of the query is quite different from what is shown in the portal interface.

Here’s the specific query against the actual table:

SELECT
  COUNT(DISTINCT(PatientID)) as cases_per_objectClass,
  SOPClassUID
FROM
  `canceridc-data.idc_views.dicom_all`
WHERE
  Modality = "CT"
GROUP BY
  SOPClassUID

And here’s the result of running the query above:

Here’s the result of selecting CT modality in the portal:

Note that Object Class pie chart has more than 2 entries.

So the question is - are the queries you specified not the ones that reflect the expected behavior of the portal, or is the portal showing incorrect results?

george.white · September 2, 2020, 4:58pm

Those queries reflected my understanding of the web app although I admit I am not too familiar with the back end. But both you and Suzanne are more familiar with the source data.

If I go to the webapp and use the filter definition
COLLECTION IN (lidc_idri) AND MODALITY IN (Computed Tomography) then I get 3 non zero Object Classes: CT Image (1010 cases), Enhanced SR (875 cases) and Segmentation (875 cases). I expect you should know if this is correct (or at least makes sense) or not. If it makes sense then something is off with your query or your interpretation of the query.

spaquett · September 2, 2020, 5:01pm

The dicom_all table isn’t representative because it doesn’t include the derived data. That’s where things get tricky–the derived records have NULL for things like BodyPartExamined and SOPClassUID, or have a modality which is eg. SR or SEG.

There’s an added wrinkle on top of that. This query is counting cases from records with Modality=‘CT’, but it isn’t including all the records from that case (or studies that case is in). This is what the pie charts are showing: records with Modality=‘CT’ are found, the studies those records belong to are pulled, and then the facets are counted. So it’s more like:

SELECT SOPClassUID, COUNT(*) as cases
FROM (
  SELECT PatientID, SOPClassUID
  FROM (
    SELECT PatientID, SOPClassUID
    FROM 	`idc-dev-etl.etl_metadata.dicom_derived_all_new`
    WHERE StudyInstanceUID in (
      SELECT StudyInstanceUID 
      FROM `idc-dev-etl.etl_metadata.dicom_derived_all_new`
      WHERE Modality='CT')
      GROUP BY PatientID, SOPClassUID
  )
)
GROUP BY SOPClassUID;

This comes down to what does Modality=‘CT’ apply to. Case? Instance? Study? Here I’m showing study; the above query is instance. We can’t go by instance if we want derived data to work as indicated.

This is what changing the Solr index to have records by study or case will fix. We just need to decide what we want the filter to result in. Records from a study, or records from a case.

fedorov · September 2, 2020, 7:21pm

@spaquett if I follow your query, then why don’t I see a similar result for Modality?

This query

FROM (
  SELECT PatientID, Modality
  FROM (
    SELECT PatientID, Modality
    FROM 	`idc-dev-etl.etl_metadata.dicom_derived_all_new`
    WHERE StudyInstanceUID in (
      SELECT StudyInstanceUID 
      FROM `idc-dev-etl.etl_metadata.dicom_derived_all_new`
      WHERE Modality='CT')
      GROUP BY PatientID, Modality
  )
)
GROUP BY Modality;

results in this:

not in this:

spaquett · September 2, 2020, 7:24pm

Per the conversation in the GitHub ticket, right now, the above filter is how records are counted. This is the reason for the behavior you’re seeing.

With related data, there’s a slight difference, in that related filters are only applied to related cases. In effect, the above query becomes something like:

SELECT SOPClassUID, COUNT(*) as cases
FROM (
  SELECT PatientID, SOPClassUID
  FROM (
    SELECT PatientID, SOPClassUID
    FROM 	`idc-dev-etl.etl_metadata.dicom_derived_all_new`
    WHERE StudyInstanceUID in (
      SELECT StudyInstanceUID 
      FROM `idc-dev-etl.etl_metadata.dicom_derived_all_new` dicom
      LEFT JOIN `isb-cgc.TCGA_bioclin_v0.clinical_v1` AS clin
      ON clin.case_barcode = dicom.PatientID
      WHERE (has_related='True' AND clin.vital_status='Alive')
    )
    GROUP BY PatientID, SOPClassUID
  )
)
GROUP BY SOPClassUID;

If you had an original filter as well, it becomes:

SELECT SOPClassUID, COUNT(*) as cases
FROM (
  SELECT PatientID, SOPClassUID
  FROM (
    SELECT PatientID, SOPClassUID
    FROM 	`idc-dev-etl.etl_metadata.dicom_derived_all_new`
    WHERE StudyInstanceUID in (
      SELECT StudyInstanceUID 
      FROM `idc-dev-etl.etl_metadata.dicom_derived_all_new` dicom
      LEFT JOIN `isb-cgc.TCGA_bioclin_v0.clinical_v1` AS clin
      ON clin.case_barcode = dicom.PatientID
      WHERE ((has_related='True' AND clin.vital_status='Alive') OR has_related='False') AND (Modality='CT')
    )
    GROUP BY PatientID, SOPClassUID
  )
)
GROUP BY SOPClassUID;

For derived data, we CAN do this, we’re just not doing it currently–because unlike related data, which is linked to PatientID and so relatively easy to determine a simple has/does not have relationship based on the PatientID, derived data are instance records as well, at the same level (if you will) as Original data. Furthermore, derived data is broken down into 3 separate subcategories (Segmentation, Qualitative, Quantitative). So, as for how it’s behaving right now, this is the filter for records:

SELECT SOPClassUID, COUNT(*) as cases
FROM (
  SELECT PatientID, SOPClassUID
  FROM (
    SELECT PatientID, SOPClassUID
    FROM 	`idc-dev-etl.etl_metadata.dicom_derived_all_new`
    WHERE StudyInstanceUID in (
      SELECT StudyInstanceUID 
      FROM `idc-dev-etl.etl_metadata.dicom_derived_all_new` dicom
      WHERE (AnatomicRegionSequence IS NULL)
    )
    GROUP BY PatientID, SOPClassUID
  )
)
GROUP BY SOPClassUID
ORDER BY cases DESC;

The filter for counts, on the other hand:

SELECT SOPClassUID, COUNT(*) as cases
FROM (
  SELECT PatientID, SOPClassUID
  FROM (
    SELECT PatientID, SOPClassUID
    FROM 	`idc-dev-etl.etl_metadata.dicom_derived_all_new`
    WHERE StudyInstanceUID in (
      SELECT StudyInstanceUID 
      FROM `idc-dev-etl.etl_metadata.dicom_derived_all_new` dicom
      WHERE (AnatomicRegionSequence IS NULL AND has_segmentation='True')
    )
    GROUP BY PatientID, SOPClassUID
  )
)
GROUP BY SOPClassUID
ORDER BY cases DESC;

Notice that counts are explicitly done under their derived subcategory. The record filter, on the other hand, assumes that if the user checks AnatomicRegionSequence is None, they mean anything with no AnatomicRegionSequence available.

george.white · September 2, 2020, 8:49pm

Because the charts are ‘hardwired’ in the UI to only show the filtered attributes within a category if the category (ie Modality) is used in the filter. Meanwhile the counts appearing in the filter selection panel on the LHS are calculated AS IF ALL FILTERS WITHIN THE GIVEN CATEGORY ARE TURNED OFF. Thus if your filter is just "Modality in ‘CT’ " then the Object Class (or SOPClassUID) counts are calculated using Modality=‘CT’, ie Suzanne’s query given above. However the Modality counts are calculated WITHOUT this filter, or without the “where Modality=‘CT’” clause in your query. This is consistent with portal.gdc.cancer.gov. It’s understandable behavior with exclusive attributes (ie Primary Tumour Site, Program etc). It’s more confusing where one case can ‘belong’ to several attributes at once (ie Modality).

fedorov · September 2, 2020, 8:54pm

@spaquett I will need time to ponder over your post. Thank you.

@george.white your response brings me back to the question I asked earlier in this thread - is the definition below accurate based on your understanding?

george.white · September 2, 2020, 9:04pm

So back to the UI: If I filter with ‘ANATOMIC REGION in None’ the counts I get back for Object Class are identical to those I get using your ‘filter for records’ query and your ‘filter for counts’. Is this expected? Am I passing in the filter to the backend correctly?

https://idc-dev.appspot.com/explore/?counts_only=True&is_json=true&is_dicofdic=True&data_source_type=S&filters={"AnatomicRegionSequence":["None"]}

george.white · September 2, 2020, 9:05pm

Correction: … your ‘filter for records’ query and NOT your ‘filter for counts’ query

george.white · September 2, 2020, 9:23pm

That explanation does sound correct. We should work on our terminology or borrow from another resource. I have used ‘attribute’ where you are using ‘facet’. I have used ‘category’ in place of ‘groups of facets’.

spaquett · September 2, 2020, 9:30pm

It’s understandable behavior with exclusive attributes (ie Primary Tumour Site, Program etc). It’s more confusing where one case can ‘belong’ to several attributes at once (ie Modality).

This is our primary issue, the notion of exclusive attributes vs. non-exclusive; or rather, the notion of a PatientID/Case being linked to multiple records and so having multi-valued attributes, or just one. It’s not unsolvable, in fact in an indexer like Solr it’s quiet easy to manage. But first we have to decide what level we’re going to use for filters.

So back to the UI: If I filter with ‘ANATOMIC REGION in None’ the counts I get back for Object Class are identical to those I get using your ‘filter for records’ query and your ‘filter for counts’. Is this expected? Am I passing in the filter to the backend correctly?

Sorry, that might not have been explicit: derived data is handled differently from Original. Derived counts (and related counts) only count derived records. The filter for records isn’t restricted, however. So when I said the filter for counts, I meant, the filter for derived counts. (Note has_segmentation='True'; this is what restricts this count to derived records only–explicitly derived records with segmentation data, in fact.).

spaquett · September 2, 2020, 9:33pm

…apologies, I forgot to add, because SOPClassUID is not a derived attribute, the calculation is done over all records.

Topic		Replies	Views
TCGA-GBM tutorial notebook Support question	27	2451	September 10, 2021
How to modify an existing IDC cohort Feedback and features feature	9	541	April 28, 2021
Text2Cohort: a new LLM toolkit to query IDC database using Natural Language Queries Announcements	4	731	May 27, 2023
"Cart" feature is now available in IDC Portal Announcements	0	54	April 10, 2025
IDC slide microscopy dashboard: when IDC Portal is not enough Data dashboard	0	30	November 25, 2024

Understanding IDC portal cohort plots

Related topics