Text2Cohort: a new LLM toolkit to query the IDC database using natural language queries

Excited to announce the release of Text2Cohort, a new LLM toolkit that allows users to interact with the Imaging Data Commons (IDC) using natural language! With Text2Cohort, you can easily extract information or discover cohorts without having to write complicated BigQuery scripts. Simply ask a query like “download all male brain MRIs for patients under the age of 25 across all relevant IDC collections” and Text2Cohort will handle the rest! This tool has the potential to change the way we approach data extraction, and we’re thrilled to share it with you.

Check out our paper on arXiv ([2305.07637] Text2Cohort: Democratizing the NCI Imaging Data Commons with Natural Language Cohort Discovery) or visit our GitHub repository to access the code (GitHub - UM2ii/text2cohort).


@Vishwa_Parekh I have not yet had a chance to read the preprint, but the abstract looks very interesting! It’s a very nice idea, one I had never thought of myself, to use IDC/DICOM metadata/BigQuery the way you did! I will read the preprint and may follow up.

Thank you for sharing your results with us - we at IDC are always happy to learn about how IDC helps your research!


Thank you @fedorov!! We hope this tool will make it easier for everyone to query the database and curate cohorts.
Please do let us know your feedback once you have read the preprint.


I finally made the time to play with it a bit. As a disclaimer, I do not know much about ChatGPT beyond occasionally using it via the OpenAI web interface.

What I liked

First of all, I learned new things, and it is useful! I understand there is very limited code that you introduce on top of ChatGPT, but what you do is very helpful. The “pretext” you added (I had no idea there was even a concept of “pretext”!) in this line https://github.com/UM2ii/text2cohort/blob/main/text2cohort.ipynb?short_path=7cd8242#L62 is very handy! If you add this text before anything else in a new ChatGPT interaction, you can then use off-the-shelf ChatGPT as a convenient IDC SQL query helper:

Make sure to use regex. Please be as specific as possible and only return the final query enclosed in ```. Do not provide explanations. Using the table: bigquery-public-data.idc_current.dicom_all:

Here’s an example (for those reading this, I emphasize - this “just works” out of the box with OpenAI ChatGPT interface, no extra code needed!):

(the actual query is a bit different, but the answer is close enough:

SELECT
  collection_id
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  REGEXP_CONTAINS(modality, r'CT')
GROUP BY
  collection_id
ORDER BY
  COUNT(DISTINCT PatientID) DESC

)
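For anyone who prefers scripting over pasting into the web UI, the same pretext trick can be wired into the OpenAI chat API. Here is a minimal sketch — the pretext string is copied from the notebook, but the helper name and the commented-out API call are my own illustration, not part of Text2Cohort:

```python
# Build the chat messages that prepend the Text2Cohort-style "pretext"
# to the user's natural-language question.

PRETEXT = (
    "Make sure to use regex. Please be as specific as possible and only "
    "return the final query enclosed in ```. Do not provide explanations. "
    "Using the table: bigquery-public-data.idc_current.dicom_all:"
)

def build_messages(question):
    """Return an OpenAI chat `messages` list with the pretext as a system prompt."""
    return [
        {"role": "system", "content": PRETEXT},
        {"role": "user", "content": question},
    ]

messages = build_messages(
    "Which collections contain CT, ordered by number of patients?"
)

# The actual call would look roughly like this (it requires the `openai`
# package and an API key, so it is commented out here):
# import openai
# response = openai.ChatCompletion.create(
#     model="gpt-3.5-turbo-0301", messages=messages
# )
```

The only moving part is the `messages` list; everything else is the stock chat-completions interface.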

Second, it was very interesting to see the queries you selected for your evaluation! I think the most unexpected one was “For each collection hosted on IDC, what is the proportion of male and female patients?”.

What I did not like

I note that I was not able to run your notebook - I do not have any OpenAI tokens, and I am not interested in buying any. You may want to add a more prominent disclaimer to your notebook and/or instructions on how to obtain tokens and an API key.

I was not able to get the same results as you demonstrate in supplementary table 1 of version 2 of your preprint. A few examples:

SELECT
  collection,
  COUNTIF(REGEXP_CONTAINS(body_part, r'(?i)chest') AND modality = 'CT') AS num_cases_chest_ct,
  COUNT(DISTINCT case_id) AS num_cases_total,
  COUNTIF(patient_sex = 'M') / COUNT(DISTINCT case_id) AS proportion_male,
  COUNTIF(patient_sex = 'F') / COUNT(DISTINCT case_id) AS proportion_female
FROM
  `bigquery-public-data.idc_current.dicom_all`
GROUP BY
  collection
ORDER BY
  num_cases_chest_ct DESC

In contrast, the result presented in your preprint (purportedly, without any expert corrections) is the following (formatted for presentation purposes):

WITH
  gender_counts AS (
  SELECT
    collection_id,
    PatientSex,
    COUNT(DISTINCT PatientID) AS patient_count
  FROM
    `bigquery-public-data.idc_current.dicom_all`
  GROUP BY
    collection_id,
    PatientSex ),
  total_patients AS (
  SELECT
    collection_id,
    COUNT(DISTINCT PatientID) AS total_count
  FROM
    `bigquery-public-data.idc_current.dicom_all`
  GROUP BY
    collection_id )
SELECT
  gender_counts.collection_id,
  total_patients.total_count,
  ROUND(gender_counts.patient_count / CAST(total_patients.total_count AS numeric), 2) AS male_proportion,
  ROUND((total_patients.total_count - gender_counts.patient_count) / CAST(total_patients.total_count AS numeric), 2) AS female_proportion
FROM
  gender_counts
JOIN
  total_patients
ON
  gender_counts.collection_id = total_patients.collection_id
ORDER BY
  gender_counts.collection_id;

I do not know if there is a mistake in the preprint and the query was revised by the expert, or the results of ChatGPT are not expected to be reproducible, or the results via the API may differ from the web interface, or I was using a different version of ChatGPT behind the scenes … but in any case, this lack of reproducibility is a major problem if you want to present this as an academic study.
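For what it’s worth, part of the run-to-run variability can be reduced (though not eliminated) on the API side by pinning a dated model snapshot and setting the temperature to 0. A sketch of the request parameters — the parameter names follow the OpenAI chat-completions API, but this particular configuration is my assumption, not something taken from the paper:

```python
# Request parameters that reduce (but do not eliminate) nondeterminism in
# chat-completion outputs: a dated model snapshot plus temperature 0.
def reproducible_params(messages):
    return {
        "model": "gpt-3.5-turbo-0301",  # dated snapshot, not a floating alias
        "temperature": 0,               # near-greedy decoding
        "messages": messages,
    }

params = reproducible_params(
    [{"role": "user", "content": "How many CT collections are in IDC?"}]
)
# openai.ChatCompletion.create(**params)  # commented out: requires an API key
```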

Next, and somewhat related to the above, I am curious why the queries in your results used proper DICOM attributes and collection_id, while the queries I was getting did not. Another unexplained observation.

On another occasion, I noticed that the resulting query was simply incorrect and misleading, and, quite likely, this would not be noticed by the user. Here’s the example:

Remarkably, the DICOM modalities were selected correctly. The problem is that the proposed query will select all collections that have exactly 2 modalities, with one of those modalities being either MR or SM. The cleaned-up query, with an added column listing all modalities within the collection (to demonstrate that the query does not satisfy the prescribed requirements), is below, along with a snippet of the result:

SELECT
  collection_id,
  COUNT(DISTINCT PatientID) AS patient_count,
  STRING_AGG(DISTINCT(Modality)) AS modalities
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  modality IN ('MR', 'SM')
GROUP BY
  collection_id
HAVING
  COUNT(DISTINCT modality) = 2
ORDER BY
  patient_count DESC

The above is consistent with my experience with ChatGPT overall - when it gives the correct answer, it is amazing. But then it will give an incorrect answer, without even a slight expression of doubt, and for a user who does not have the domain knowledge, it is impossible to detect that the answer is incorrect. Those answers should always be cross-checked, which greatly diminishes the practical value of this tool.

My takeaway

I can definitely see how the use of ChatGPT can be handy as an aid in exploring IDC - especially with the pretext customization proposed (and I hope this thread will motivate some of the beginners!). Further, I can also see how it can help put together an initial version of a query, even for users who are somewhat familiar with SQL but want to get the initial query automatically.

BUT - especially if you are a novice user! - never treat the results produced by this tool as truth. I am sure the models will evolve, but I think the practical value of this approach is yet to be established. I would also be very interested to see how ChatGPT, with as little effort as possible, can be made more DICOM-aware (e.g., by using DICOM attributes and incorporating knowledge of the DICOM data model).

I would also encourage those who want to continue these explorations to engage with the users of IDC and/or the broader community of imaging researchers and survey their needs with respect to what queries they would find interesting.


Hi @fedorov, thank you for taking the time to evaluate our work and for providing very useful feedback.

  1. The autocorrection module in our work essentially helps our query generator recursively autocorrect itself until it arrives at the correct DICOM attributes. The autocorrection module uses the generated query to interact with the GCP BigQuery client and autocorrects the DICOM attributes using the errors returned by the BigQuery client. Attaching an example where the autocorrection module corrects the query from using scan_type to SeriesDescription through interaction with the BigQuery client. Unfortunately, this would not be possible when using the ChatGPT web interface.
  2. We did notice that the responses when using the API were more consistent compared to the web interface. We are currently using “gpt-3.5-turbo-0301” in our work.
  3. We are currently working on updates to make the model DICOM-aware and correctly understand the data schema in order to consistently generate correct queries.
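To illustrate, the autocorrection loop described in point 1 can be sketched roughly as follows — the function names and the stub interfaces here are hypothetical and for illustration only; the actual module interacts with the OpenAI API and the BigQuery client:

```python
# Skeleton of the recursive autocorrection idea: run the generated SQL,
# and if BigQuery rejects it (e.g., unknown column "scan_type"), feed the
# error message back to the LLM to regenerate the query.
def autocorrect_query(query, run_query, regenerate, max_rounds=3):
    """`run_query(sql)` returns (ok, error_text);
    `regenerate(sql, error)` returns a corrected SQL string."""
    for _ in range(max_rounds):
        ok, error = run_query(query)
        if ok:
            return query
        query = regenerate(query, error)
    raise RuntimeError("could not autocorrect query within the round limit")
```

In the real toolkit, `run_query` would submit the query to BigQuery and capture any error, and `regenerate` would send the query plus the error text back to the model.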