Understanding IDC portal cohort plots

This gets back to the data structure issue. Right now, our data is all Instance based. So, when you click a filter, you’re saying ‘what is the number of cases with instances that have this modality, and what are the faceted counts for those instances’. What you’re describing instead is, ‘which cases have instances that have this modality, and what are their faceted counts’. The later is case-centric.

This is on our radar to fix post-MVP by restructuring the data the WebApp uses to be case-centric. This will solve a LARGE number of problems we’re having right now.

Going forward, we always want to ask ourselves: are filters meant to return info on cases that have instances which match those filters? Or on instances that match those filters? (Or studies, or…) Filtering on instances, you lose information about the instances which don’t match. If you filter at a higher level, like cases, you don’t.

I don’t think that’s what I see. I am afraid it is a mix of both. What I see in the plot for “Modality” seems to be the count of either cases or studies (definitely not instances, the number is too small for that) that have Modality = CT. What I see in the plot for “Object class” on the same screen appears to be the counts for the cases or studies for all types of Object classes in the selected cohort where Modality = CT.

I thought we were having troubles just with the derived data, but it appears that the other counts are also not necessarily correct.

If you ask me, I would say the current behavior is a bug. If we keep it this way for the MVP, I will need to recruit you to write documentation for the portal :wink:

What I see in the plot for “Modality” seems to be the count of either cases or studies (definitely not instances, the number is too small for that) that have Modality = CT. What I see in the plot for “Object class” on the same screen appears to be the counts for the cases or studies for all types of Object classes in the selected cohort where Modality = CT.

I think I’m not clear on what you would expect to see otherwise. It’s the instances which are selected, not the cases. So, by selecting CT, you’re getting those instances, and then the Object Classes for instances with a modality of CT.

What would you expect to see instead, for Object Class?

Sorry I misread–I think I understand what you’re saying. And to be clear, it’s cases. We don’t count studies.

…ah ha. This is due to how we wanted to make Original data ‘child record case search’ against derived. It’s also applying it to other Original data records. Since I’m working on counting related separately (same code) I’ll see if I can sort out a way to only go between orginal/derived and not original/original.

Note though, if we swap to case centric, this IS what we’ll see.

1 Like

Great, I made an issue to make sure this is tracked:

Sounds like we need to discuss this at the next meeting to help me understand!

We had a team meeting today, and based on the discussion, the explanation of the behavior currently observed in IDC portal is the following (to the best of my understanding):

  • Plots that correspond to the groups of facets that have selected facets show counts of cases that have the data for those selected facets.

  • Plots that correspond to the groups of facets that do not have selected facets show counts of cases for all of the facets included in the group of facets for the cohort defined by the current filter.

As such, plots can only be interpreted in conjunction and in the context of the defined filter.

So in the example below, where “CT” is selected for the “Modality” facet group, “Modality” piechart shows the number of cases that have CT Modality. The other facets show the counts for the corresponding facet groups for all of the cases that are included in the cohort defined by the filter.

@wlongabaugh @spaquett @george.white would be great if you could confirm my summary above is correct, and feel free to amend for conciseness/clarity, since we will be using this for the documentation.

Lets use SQL to be precise. SQL queries to generate the charts would look something like the following:
select count(distinct caseId), BodyPartExamined where Modality =‘CT’ group by BodyPartExamined;
select count(distinct caseId), Modality where Modality=‘CT’ group by Modality;
select count(distinct caseId), ObjectClass where Modality=‘CT’ group by ObjectClass;
Now I left out the table name and I don’t know the actual column names but I hope this provides clarification.

1 Like

That’s a great idea George, although we still will need to come up with an explanation for the IDC users that does not require knowledge of SQL.

But in order to be precise, I think we need to have the precise table that is being queried by the frontend. Can @spaquett give a pointer to such table in BQ? Otherwise, these will be hard to confirm.

I am not sure the above is correct. As you can see from the filters configuration in the screenshot, “Object class” piechart includes a lot of categories, including Segmentation, for example. But it will will all depend on what is the table behind the query.

I am not saying the queries you provided should correspond to the behavior I would expect, but - assuming that the table being queried by Solr is similar to the DICOM metadata we have in BQ - the result of the query is quite different from what is shown in the portal interface.

Here’s the specific query against the actual table:

SELECT
  COUNT(DISTINCT(PatientID)) as cases_per_objectClass,
  SOPClassUID
FROM
  `canceridc-data.idc_views.dicom_all`
WHERE
  Modality = "CT"
GROUP BY
  SOPClassUID

And here’s the result of running the query above:

image

Here’s the result of selecting CT modality in the portal:

image

Note that Object Class pie chart has more than 2 entries.

So the question is - are the queries you specified not the ones that reflect the expected behavior of the portal, or is the portal showing incorrect results?

Those queries reflected my understanding of the web app although I admit I am not too familiar with the back end. But both you and Suzanne are more familiar with the source data.

If I go to the webapp and use the filter definition
COLLECTION IN (lidc_idri) AND MODALITY IN (Computed Tomography) then I get 3 non zero Object Classes: CT Image (1010 cases), Enhanced SR (875 cases) and Segmentation (875 cases). I expect you should know if this is correct (or at least makes sense) or not. If it makes sense then something is off with your query or your interpretation of the query.

The dicom_all table isn’t representative because it doesn’t include the derived data. That’s where things get tricky–the derived records have NULL for things like BodyPartExamined and SOPClassUID, or have a modality which is eg. SR or SEG.

There’s an added wrinkle on top of that. This query is counting cases from records with Modality=‘CT’, but it isn’t including all the records from that case (or studies that case is in). This is what the pie charts are showing: records with Modality=‘CT’ are found, the studies those records belong to are pulled, and then the facets are counted. So it’s more like:

SELECT SOPClassUID, COUNT(*) as cases
FROM (
  SELECT PatientID, SOPClassUID
  FROM (
    SELECT PatientID, SOPClassUID
    FROM 	`idc-dev-etl.etl_metadata.dicom_derived_all_new`
    WHERE StudyInstanceUID in (
      SELECT StudyInstanceUID 
      FROM `idc-dev-etl.etl_metadata.dicom_derived_all_new`
      WHERE Modality='CT')
      GROUP BY PatientID, SOPClassUID
  )
)
GROUP BY SOPClassUID;

This comes down to what does Modality=‘CT’ apply to. Case? Instance? Study? Here I’m showing study; the above query is instance. We can’t go by instance if we want derived data to work as indicated.

This is what changing the Solr index to have records by study or case will fix. We just need to decide what we want the filter to result in. Records from a study, or records from a case.

@spaquett if I follow your query, then why don’t I see a similar result for Modality?

This query

FROM (
  SELECT PatientID, Modality
  FROM (
    SELECT PatientID, Modality
    FROM 	`idc-dev-etl.etl_metadata.dicom_derived_all_new`
    WHERE StudyInstanceUID in (
      SELECT StudyInstanceUID 
      FROM `idc-dev-etl.etl_metadata.dicom_derived_all_new`
      WHERE Modality='CT')
      GROUP BY PatientID, Modality
  )
)
GROUP BY Modality;

results in this:

image

not in this:

image

Per the conversation in the GitHub ticket, right now, the above filter is how records are counted. This is the reason for the behavior you’re seeing.

With related data, there’s a slight difference, in that related filters are only applied to related cases. In effect, the above query becomes something like:

SELECT SOPClassUID, COUNT(*) as cases
FROM (
  SELECT PatientID, SOPClassUID
  FROM (
    SELECT PatientID, SOPClassUID
    FROM 	`idc-dev-etl.etl_metadata.dicom_derived_all_new`
    WHERE StudyInstanceUID in (
      SELECT StudyInstanceUID 
      FROM `idc-dev-etl.etl_metadata.dicom_derived_all_new` dicom
      LEFT JOIN `isb-cgc.TCGA_bioclin_v0.clinical_v1` AS clin
      ON clin.case_barcode = dicom.PatientID
      WHERE (has_related='True' AND clin.vital_status='Alive')
    )
    GROUP BY PatientID, SOPClassUID
  )
)
GROUP BY SOPClassUID;

If you had an original filter as well, it becomes:

SELECT SOPClassUID, COUNT(*) as cases
FROM (
  SELECT PatientID, SOPClassUID
  FROM (
    SELECT PatientID, SOPClassUID
    FROM 	`idc-dev-etl.etl_metadata.dicom_derived_all_new`
    WHERE StudyInstanceUID in (
      SELECT StudyInstanceUID 
      FROM `idc-dev-etl.etl_metadata.dicom_derived_all_new` dicom
      LEFT JOIN `isb-cgc.TCGA_bioclin_v0.clinical_v1` AS clin
      ON clin.case_barcode = dicom.PatientID
      WHERE ((has_related='True' AND clin.vital_status='Alive') OR has_related='False') AND (Modality='CT')
    )
    GROUP BY PatientID, SOPClassUID
  )
)
GROUP BY SOPClassUID;

For derived data, we CAN do this, we’re just not doing it currently–because unlike related data, which is linked to PatientID and so relatively easy to determine a simple has/does not have relationship based on the PatientID, derived data are instance records as well, at the same level (if you will) as Original data. Furthermore, derived data is broken down into 3 separate subcategories (Segmentation, Qualitative, Quantitative). So, as for how it’s behaving right now, this is the filter for records:

SELECT SOPClassUID, COUNT(*) as cases
FROM (
  SELECT PatientID, SOPClassUID
  FROM (
    SELECT PatientID, SOPClassUID
    FROM 	`idc-dev-etl.etl_metadata.dicom_derived_all_new`
    WHERE StudyInstanceUID in (
      SELECT StudyInstanceUID 
      FROM `idc-dev-etl.etl_metadata.dicom_derived_all_new` dicom
      WHERE (AnatomicRegionSequence IS NULL)
    )
    GROUP BY PatientID, SOPClassUID
  )
)
GROUP BY SOPClassUID
ORDER BY cases DESC;

The filter for counts, on the other hand:

SELECT SOPClassUID, COUNT(*) as cases
FROM (
  SELECT PatientID, SOPClassUID
  FROM (
    SELECT PatientID, SOPClassUID
    FROM 	`idc-dev-etl.etl_metadata.dicom_derived_all_new`
    WHERE StudyInstanceUID in (
      SELECT StudyInstanceUID 
      FROM `idc-dev-etl.etl_metadata.dicom_derived_all_new` dicom
      WHERE (AnatomicRegionSequence IS NULL AND has_segmentation='True')
    )
    GROUP BY PatientID, SOPClassUID
  )
)
GROUP BY SOPClassUID
ORDER BY cases DESC;

Notice that counts are explicitly done under their derived subcategory. The record filter, on the other hand, assumes that if the user checks AnatomicRegionSequence is None, they mean anything with no AnatomicRegionSequence available.

Because the charts are ‘hardwired’ in the UI to only show the filtered attributes within a category if the category (ie Modality) is used in the filter. Meanwhile the counts appearing in the filter selection panel on the LHS are calculated AS IF ALL FILTERS WITHIN THE GIVEN CATEGORY ARE TURNED OFF. Thus if your filter is just "Modality in ‘CT’ " then the Object Class (or SOPClassUID) counts are calculated using Modality=‘CT’, ie Suzanne’s query given above. However the Modality counts are calculated WITHOUT this filter, or without the “where Modality=‘CT’” clause in your query. This is consistent with portal.gdc.cancer.gov. It’s understandable behavior with exclusive attributes (ie Primary Tumour Site, Program etc). It’s more confusing where one case can ‘belong’ to several attributes at once (ie Modality).

@spaquett I will need time to ponder over your post. Thank you.

@george.white your response brings me back to the question I asked earlier in this thread - is the definition below accurate based on your understanding?

So back to the UI: If I filter with ‘ANATOMIC REGION in None’ the counts I get back for Object Class are identical to those I get using your ‘filter for records’ query and your ‘filter for counts’. Is this expected? Am I passing in the filter to the backend correctly?

https://idc-dev.appspot.com/explore/?counts_only=True&is_json=true&is_dicofdic=True&data_source_type=S&filters={"AnatomicRegionSequence":["None"]}

Correction: … your ‘filter for records’ query and NOT your ‘filter for counts’ query

That explanation does sound correct. We should work on our terminology or borrow from another resource. I have used ‘attribute’ where you are using ‘facet’. I have used ‘category’ in place of ‘groups of facets’.

1 Like

It’s understandable behavior with exclusive attributes (ie Primary Tumour Site, Program etc). It’s more confusing where one case can ‘belong’ to several attributes at once (ie Modality).

This is our primary issue, the notion of exclusive attributes vs. non-exclusive; or rather, the notion of a PatientID/Case being linked to multiple records and so having multi-valued attributes, or just one. It’s not unsolvable, in fact in an indexer like Solr it’s quiet easy to manage. But first we have to decide what level we’re going to use for filters.

So back to the UI: If I filter with ‘ANATOMIC REGION in None’ the counts I get back for Object Class are identical to those I get using your ‘filter for records’ query and your ‘filter for counts’. Is this expected? Am I passing in the filter to the backend correctly?

Sorry, that might not have been explicit: derived data is handled differently from Original. Derived counts (and related counts) only count derived records. The filter for records isn’t restricted, however. So when I said the filter for counts, I meant, the filter for derived counts. (Note has_segmentation='True'; this is what restricts this count to derived records only–explicitly derived records with segmentation data, in fact.).

…apologies, I forgot to add, because SOPClassUID is not a derived attribute, the calculation is done over all records.

Ok, I have a partial understanding. What you have called ‘filter for records’ would be a query to count cases in an original data category (SOPClassUID in this case). It can be adjusted for different filters by changing the ‘WHERE Clause’. Now ‘filter for counts’ would be a query to count cases in derived:segmentation categories (but replace SOPClassUID with a derived:segmentation category). The only difference between this and ‘filter for records’ (besides the category) is the ‘has_segmentation=true’ clause. I think for counts in derived:quantitative you have ‘has_quantitative=true’ instead? Now ‘has_segmentation=true’ is only going to matter when ‘AnatomicRegionSequence is NULL’ is in the filter. If AnatomicRegionSequence has any other filter value then ‘has_segmentation=true’ is redundant.