I finally made the time to play with it a bit. As a disclaimer, I do not know much about ChatGPT beyond occasionally using it via the OpenAI web interface.
What I liked
First of all, I learned new things, and it is useful! I understand you introduce very little code on top of ChatGPT, but what you do add is very helpful. The “pretext” you added (I had no idea there was even a concept of a “pretext”!) in this line of text2cohort/text2cohort.ipynb at main · UM2ii/text2cohort · GitHub is very handy! If you add this text before anything else in a new ChatGPT interaction, you can then use off-the-shelf ChatGPT as a convenient IDC SQL query helper:
```
Make sure to use regex. Please be as specific as possible and only return the final query enclosed in ```. Do not provide explanations. Using the table: bigquery-public-data.idc_current.dicom_all:
```
Here’s an example (for those reading this, I emphasize: this “just works” out of the box with the OpenAI ChatGPT interface, no extra code needed!):
(The actual query is a bit different, but the answer is close enough):

```sql
COUNT(DISTINCT PatientID) DESC
```
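For those who prefer scripting over the web interface, the pretext trick boils down to prepending that fixed instruction to the user's question before sending it to the model. A minimal sketch (my own illustration, not code from the repository; the question string is hypothetical):

```python
# Sketch: wrap a natural-language question with the Text2Cohort-style pretext
# so that a chat model answers with an IDC SQL query. The PRETEXT string is
# the one quoted above; everything else here is my own illustration.
PRETEXT = (
    "Make sure to use regex. Please be as specific as possible and only return "
    "the final query enclosed in ```. Do not provide explanations. "
    "Using the table: bigquery-public-data.idc_current.dicom_all:"
)

def build_prompt(question: str) -> str:
    """Prepend the fixed pretext to a user question."""
    return f"{PRETEXT} {question}"

prompt = build_prompt("How many patients are in each collection?")
print(prompt.startswith("Make sure to use regex"))  # the pretext comes first
```

The resulting string is what you would paste into a fresh ChatGPT session (or send as the message content via the API, if you have a key).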
Second, it was very interesting to see the queries you selected for your evaluation! I think the most unexpected one was “For each collection hosted on IDC, what is the proportion of male and female patients?”.
What I did not like
I note that I was not able to run your notebook: I do not have any OpenAI tokens, and I am not interested in buying any. You may want to add a more prominent disclaimer to your notebook and/or instructions on how to obtain tokens and an API key.
I was not able to get the same results as you demonstrate in Supplementary Table 1 of version 2 of your preprint. A few examples:
```sql
COUNTIF(REGEXP_CONTAINS(body_part, r'(?i)chest') AND modality = 'CT') AS num_cases_chest_ct,
COUNT(DISTINCT case_id) AS num_cases_total,
COUNTIF(patient_sex = 'M') / COUNT(DISTINCT case_id) AS proportion_male,
COUNTIF(patient_sex = 'F') / COUNT(DISTINCT case_id) AS proportion_female
```
In contrast, the result presented in your preprint (purportedly, without any expert corrections) is the following (formatted for presentation purposes):
```sql
gender_counts AS (
  COUNT(DISTINCT PatientID) AS patient_count
total_patients AS (
  COUNT(DISTINCT PatientID) AS total_count
ROUND(gender_counts.patient_count / CAST(total_patients.total_count AS numeric), 2) AS male_proportion,
ROUND((total_patients.total_count - gender_counts.patient_count) / CAST(total_patients.total_count AS numeric), 2) AS female_proportion
gender_counts.collection_id = total_patients.collection_id
```
I do not know if there is a mistake in the preprint and it was revised by the expert, or the results of ChatGPT are not expected to be reproducible, or the results via the API may differ from the web interface, or I was using a different version of ChatGPT behind the scenes … but in any case, this lack of reproducibility is a major problem if you want to present this as an academic study.
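One possible mitigation for the API case, which I have not tried myself (so treat this as an assumption, not something from the paper): the chat-completions API lets you pin a specific model snapshot and set the sampling temperature to 0, which should make the output more stable across runs. A sketch of such a request payload, without actually calling the API (the snapshot name is hypothetical):

```python
# Sketch only: a chat-completions request payload that pins the model version
# and disables sampling randomness. Sending it requires an API key.
payload = {
    "model": "gpt-3.5-turbo-0301",  # hypothetical pinned snapshot name
    "temperature": 0,               # deterministic-leaning decoding
    "messages": [
        # the pretext plus the natural-language question would go here
        {"role": "user", "content": "<pretext> <question>"},
    ],
}
print(sorted(payload))  # ['messages', 'model', 'temperature']
```

Even with temperature 0 the outputs are not guaranteed to be identical, but at least the model version would be documented and fixed.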
Next, and somewhat related to the above, I am curious why the queries in your results used proper DICOM attributes and collection_id, while the results I was getting did not. Another unexplained observation.
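For what it is worth, the semantics that the per-collection question asks for can be illustrated on toy data. This is my own sketch using sqlite (not a query from the paper); the column names PatientID and collection_id are the dicom_all names mentioned in this thread, and PatientSex is the standard DICOM attribute:

```python
import sqlite3

# Toy illustration of "proportion of male and female patients per collection".
# Duplicate rows mimic dicom_all having one row per instance, hence the
# COUNT(DISTINCT ...) on PatientID.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dicom_all (collection_id TEXT, PatientID TEXT, PatientSex TEXT);
INSERT INTO dicom_all VALUES
  ('coll_a', 'p1', 'M'), ('coll_a', 'p1', 'M'),
  ('coll_a', 'p2', 'F'),
  ('coll_b', 'p3', 'F'), ('coll_b', 'p4', 'F');
""")
rows = con.execute("""
SELECT collection_id,
       CAST(COUNT(DISTINCT CASE WHEN PatientSex = 'M' THEN PatientID END) AS REAL)
         / COUNT(DISTINCT PatientID) AS male_proportion,
       CAST(COUNT(DISTINCT CASE WHEN PatientSex = 'F' THEN PatientID END) AS REAL)
         / COUNT(DISTINCT PatientID) AS female_proportion
FROM dicom_all
GROUP BY collection_id
ORDER BY collection_id
""").fetchall()
print(rows)  # [('coll_a', 0.5, 0.5), ('coll_b', 0.0, 1.0)]
```

The same shape of query, written against bigquery-public-data.idc_current.dicom_all, is what I would have expected the tool to produce.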
On another occasion, I noticed that the resulting query was simply incorrect and misleading, and, quite likely, this would not be noticed by the user. Here’s the example:
Remarkably, the DICOM modalities were selected correctly. The problem is that the proposed query will select all collections that have exactly 2 modalities, with one of those modalities being either MR or SM. The cleaned-up query, with an added column listing all modalities within the collection (to demonstrate that the query does not satisfy the prescribed requirements), is below, along with a snippet of the result:
```sql
COUNT(DISTINCT PatientID) AS patient_count,
STRING_AGG(DISTINCT(Modality)) AS modalities
modality IN ('MR', 'SM')
COUNT(DISTINCT modality) = 2
```
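To make the failure mode concrete, here is a toy sqlite reproduction (my own reconstruction of the pattern, not the original Text2Cohort query): a collection that contains both MR and SM plus a third modality fails the `COUNT(DISTINCT modality) = 2` test, while a collection with MR and CT only passes it.

```python
import sqlite3

# Toy reproduction of the pitfall: the question "which collections contain
# both MR and SM?" answered with "exactly two distinct modalities, at least
# one of which is MR or SM".
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dicom_all (collection_id TEXT, Modality TEXT);
INSERT INTO dicom_all VALUES
  ('has_both_plus_ct', 'MR'), ('has_both_plus_ct', 'SM'), ('has_both_plus_ct', 'CT'),
  ('mr_and_ct_only',   'MR'), ('mr_and_ct_only',   'CT');
""")

# The flawed pattern (my reconstruction of the generated query's logic):
flawed = con.execute("""
SELECT collection_id FROM dicom_all
WHERE collection_id IN
  (SELECT collection_id FROM dicom_all WHERE Modality IN ('MR', 'SM'))
GROUP BY collection_id
HAVING COUNT(DISTINCT Modality) = 2
""").fetchall()

# A correct pattern: count only the requested modalities per collection.
correct = con.execute("""
SELECT collection_id FROM dicom_all
GROUP BY collection_id
HAVING COUNT(DISTINCT CASE WHEN Modality IN ('MR', 'SM') THEN Modality END) = 2
""").fetchall()

print(flawed)   # [('mr_and_ct_only',)]   <- the wrong collection is selected
print(correct)  # [('has_both_plus_ct',)]
```

The flawed version both misses the collection that actually satisfies the request and returns one that does not, which is exactly the kind of silent error a novice user would not catch.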
The above is consistent with my experience with ChatGPT overall: when it gives the correct answer, it is amazing. But then it will give an incorrect answer without even a slight expression of doubt, and for a user without the domain knowledge, it is impossible to detect that the answer is incorrect. Those answers should always be cross-checked, which greatly diminishes the practical value of the tool.
My takeaway
I can definitely see how ChatGPT can be handy as an aid in exploring IDC, especially with the proposed pretext customization (and I hope this thread will motivate some beginners!). Further, I can also see how it can help put together an initial version of a query, even for users who are somewhat familiar with SQL but want a starting point generated automatically.
BUT, especially if you are a novice user, never treat the results produced by this tool as ground truth. I am sure the models will evolve, but I think the practical value of this approach is yet to be established. I would also be very interested to see how ChatGPT, with as little effort as possible, could become more DICOM-aware (e.g., by using DICOM attributes and incorporating knowledge of the DICOM data model).
I would also encourage those who want to continue these explorations to engage with the users of IDC and/or the broader community of imaging researchers, and to survey their needs with respect to which queries they would find interesting.