Inquiry: BigQuery - idc_current Question

Dear IDC team,

I hope you all are doing well. I had a question about the description of each field in the table dicom_all in the dataset idc_current (https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1sbigquery-public-data!2sidc_current!3sdicom_all!).

I am working on research that needs to understand what those fields are for and how they are related to the other fields either in IDC or GDC. Is there any additional information or document that I missed to follow up for this inquiry.

Best regards,
Varik Hoang

Most of the fields in the dicom_all table are the DICOM metadata attributes extracted from the files in IDC. To understand that table, it is best to look at the query that is used to populate it:

https://github.com/ImagingDataCommons/etl_flow/blob/master/bq/derived_table_creation/BQ_Table_Building/derived_data_views/sql/dicom_all.sql

That query takes almost all of the DICOM attributes (table columns) from the dicom_metadata table (see https://github.com/ImagingDataCommons/etl_flow/blob/master/bq/derived_table_creation/BQ_Table_Building/derived_data_views/sql/dicom_all.sql#L43), and joins that table with the auxiliary metadata in the auxiliary_metadata table that is not in DICOM, but is either computed or populated separately, and collection-level metadata from the original_collections_metadata table (see https://github.com/ImagingDataCommons/etl_flow/blob/master/bq/derived_table_creation/BQ_Table_Building/derived_data_views/sql/dicom_all.sql#L58).

With the explanation above in mind, if you want to know the descriptions of the attributes in dicom_all, you should use the following approach:

This is a great question!

If you want to join IDC data with the information available in GDC, you will need to use PatientID column in the dicom_all table, and join with the information in GDC.

Can you give me an example of the kind of information from GDC that you would like to join? I think it would be better if we figure this out on the example that is relevant to your interests.

Also, do you mind if I move this conversation to the public forum? I think it may be of interest to other users of IDC.

Hello Andrey,

Thank you for your reply.

Can you give me an example of the kind of information from GDC that you would like to join? I think it would be better if we figure this out on the example that is relevant to your interests.

It’s more the description of each field in the table dicom_all. For example, in the view auxiliary_metadata, I could find the description of the field dicom_all.StudyInstanceUID according to https://learn.canceridc.dev/data/organization-of-data/files-and-metadata. Another example is I clearly knew dicom_all.PatientAge specifies the age of the patient based on the name of the field but could not find the official description for this field. It would be nice if we could have all descriptions for each field in the table dicom_all otherwise I just need to look up the other views beside the auxiliary_metadata.

Also, do you mind if I move this conversation to the public forum? I think it may be of interest to other users of IDC.

I don’t mind at all. Please move this conversation to the public if you feel it’s necessary.

Best regards,
Varik Hoang

···

On Tue, Feb 28, 2023 at 7:57 AM Andrey Fedorov via Imaging Data Commons <notifications@canceridc.discoursemail.com> wrote:

| fedorov
February 28 |

  • | - |

Most of the fields in the dicom_all table are the DICOM metadata attributes extracted from the files in IDC. To understand that table, it is best to look at the query that is used to populate it:

github.com

ImagingDataCommons/etl_flow/blob/master/bq/derived_table_creation/BQ_Table_Building/derived_data_views/sql/dicom_all.sql

WITH
  pre_dicom_all AS (
  SELECT
    aux.tcia_api_collection_id AS tcia_api_collection_id,
    aux.idc_webapp_collection_id AS idc_webapp_collection_id,
    aux.idc_webapp_collection_id AS collection_id,
    aux.collection_timestamp AS collection_timestamp,
    aux.collection_hash as collection_hash,
    aux.collection_init_idc_version AS collection_init_idc_version,
    aux.collection_revised_idc_version AS collection_revised_idc_version,
    aux.access AS access,
    dcm.PatientID as PatientID,
    aux.idc_case_id as idc_case_id,
    aux.patient_hash as patient_hash,
    aux.patient_init_idc_version AS patient_init_idc_version,
    aux.patient_revised_idc_version AS patient_revised_idc_version,
    dcm.StudyInstanceUID AS StudyInstanceUID,
    aux.study_uuid as crdc_study_uuid,
    aux.study_hash as study_hash,
    aux.study_init_idc_version AS study_init_idc_version,

This file has been truncated. show original

That query takes almost all of the DICOM attributes (table columns) from the dicom_metadata table (see etl_flow/dicom_all.sql at master · ImagingDataCommons/etl_flow · GitHub), and joins that table with the auxiliary metadata in the auxiliary_metadata table that is not in DICOM, but is either computed or populated separately, and collection-level metadata from the original_collections_metadata table (see etl_flow/dicom_all.sql at master · ImagingDataCommons/etl_flow · GitHub).

With the explanation above in mind, if you want to know the descriptions of the attributes in dicom_all, you should use the following approach:

varikmp:

I am working on research that needs to understand what those fields are for and how they are related to the other fields either in IDC or GDC.

This is a great question!

If you want to join IDC data with the information available in GDC, you will need to use PatientID column in the dicom_all table, and join with the information in GDC.

Can you give me an example of the kind of information from GDC that you would like to join? I think it would be better if we figure this out on the example that is relevant to your interests.

Also, do you mind if I move this conversation to the public forum? I think it may be of interest to other users of IDC.


Visit Message or reply to this email to respond to IDC support email triage group (4).

To unsubscribe from these emails, click here.

I understand. As I discussed in the earlier response above, the answer will depend on where that field is coming from into dicom_all.

For most of the fields, they are DICOM attributes extracted from DICOM images. As mentioned earlier, you can use Innolitics DICOM browser to look up the definitions from the DICOM standard.

So, for example, if you are interested in PatientID, you will do the following:

  1. Open the “Search” tab in Innolitics DICOM browser
  2. Type “Patient ID” (note that you will need to break CamelCase notation of the column names into separate words, so “PatientID” becomes “Patient ID”, etc) in the search field and hit “return”
  3. Select the attribute from the list.
  4. If you want to see the authoritative definition and context within the latest version of the DICOM standard, click the “View in Standard” link.

2023-03-02_11-55-44

It is not practical to replicate the description of the attributes into the BigQuery table schema, since those descriptions can be very extensive, and also the description may vary depending on the context where it is used. As the specific example for PatientID, semantics of this attribute will be a bit different for small imaging.

image

Thank you for asking those questions! DICOM standard is complex, the subject of medical imaging is complex, and we are here to help you navigate this information. I hope this response is helpful.