Use API to get collections metadata

I’m curious if there is a way to send a request to your API to get the list of collections and their metadata listed in the collections panel here:
https://portal.imaging.datacommons.cancer.gov/collections/

I’m trying to write an app that pulls this information with API requests as opposed to using the web-based UI. Specifically, I’d like to use the Python “requests” package, but any example of programmatic access that can be integrated into a script is sufficient.

Thanks!

@cgmeyer welcome to the IDC community!

There are several approaches to getting collection-level metadata:

  1. Metadata for original collections and analysis results is available in BigQuery tables maintained by IDC, as discussed here: Organization of data - IDC User Guide. Specifically, those tables are canceridc-data.idc.data_collections_metadata and canceridc-data.idc.analysis_collections_metadata. You can access BigQuery from Python, and we have examples here: IDC-Examples/LIDC_exploration.ipynb at master · ImagingDataCommons/IDC-Examples · GitHub.
  2. We also have the IDC API, which is not yet part of the IDC release but should expose this information as well. I will let @bill.clifford confirm this.

Will the first approach above work for you?

I’ve just realized that the notebook example above uses features that are specific to Google Colab, but a pure Python approach should be very similar; see BigQuery API Client Libraries | Google Cloud. We should add an example for that too.

The IDC API includes a /collections API that returns the information that you are interested in. Here is a fragment of the returned data:

    {
      "code": 200,
      "programs": [
        {
          "collections": [
            {
              "active": true,
              "cancer_type": "Prostate Cancer",
              "collection_id": "tcga_prad",
              "collection_type": "Original",
              "date_updated": "2021-03-30",
              "description": "\n\tNote: This collection has special restrictions on its usage. See <a href=\"https://wiki.cancerimagingarchive.net/x/c4hF\" target=\"_blank\">Data Usage Policies and Restrictions.\n\n\t \n\n\tThe <a href=\"http://imaging.cancer.gov/\" target=\"_blank\">Cancer Imaging Program (CIP) is working directly with primary investigators from institutes participating in TCGA to obtain and load images relating to the genomic, clinical, and pathological data being stored within the <a href=\"http://tcga-data.nci.nih.gov/\" target=\"_blank\">TCGA Data Portal. Currently this image collection of prostate adenocarcinoma (PRAD) patients can be matched by each unique case identifier with the extensive gene and expression data of the same case from The Cancer Genome Atlas Data Portal to research the link between clinical phenome and tissue genome. \n\t\n\n\t \n\n\tPlease see the <a href=\"https://wiki.cancerimagingarchive.net/x/tgpp\" target=\"_blank\">TCGA-PRAD wiki page to learn more about the images and to obtain any supporting metadata for this collection.\n",
              "doi": "10.7937/K9/TCIA.2016.YXOGLM4Y",
              "idc_data_versions": [
                "1.0"
              ],
              "image_types": "CT, PT, MR, Pathology",
              "location": "Prostate",
              "owner_id": 1,
              "species": "Human",
              "subject_count": 14,
              "supporting_data": "Clinical Genomics"
            },
            {
              "active": true,
              "cancer_type": "Bladder Endothelial Carcinoma",
              "collection_id": "tcga_blca",
              "collection_type": "Original",
              "date_updated": "2021-03-30",
              "description": "\n\tThe Cancer Genome Atlas-Bladder Endothelial Carcinoma (TCGA-BLCA) data collection is part of a larger effort to enhance the TCGA http://cancergenome.nih.gov/ data set with characterized radiological images. The Cancer Imaging Program (CIP), with the cooperation of several of the TCGA tissue-contributing institutions, has archived a large portion of the radiological images of the genetically-analyzed BLCA cases.\n\n\tPlease see the <a href=\"https://wiki.cancerimagingarchive.net/display/Public/TCGA-BLCA\" target=\"_blank\">TCGA-BLCA wiki page to learn more about the images and to obtain any supporting metadata for this collection.\n",
              "doi": "10.7937/K9/TCIA.2016.8LNG8XDR",
              "idc_data_versions": [
                "1.0"
              ],
              "image_types": "CT, CR, MR, PT, DX, Pathology",
              "location": "Bladder",
              "owner_id": 1,
              "species": "Human",
              "subject_count": 120,
              "supporting_data": "Clinical, Genomics"
            },…
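
For anyone who wants to script against this once the API is available, here is a minimal sketch using the "requests" package. It assumes the endpoint is reachable as `<base_url>/collections` with a plain GET and that the response keeps the shape shown in the fragment above (collections nested under "programs"). The base URL is not given in this thread, so it is left as a parameter, and the function names `fetch_collections` and `flatten_collections` are made up for illustration.

```python
from typing import Any, Dict, List


def fetch_collections(base_url: str, timeout: float = 30.0) -> Dict[str, Any]:
    """GET <base_url>/collections and return the decoded JSON body.

    Assumption: the endpoint needs no authentication and answers a plain GET.
    """
    # Imported here so the parsing helper below works without requests installed.
    import requests

    response = requests.get(f"{base_url.rstrip('/')}/collections", timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return response.json()


def flatten_collections(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Gather every collection record nested under payload["programs"]."""
    rows: List[Dict[str, Any]] = []
    for program in payload.get("programs", []):
        rows.extend(program.get("collections", []))
    return rows
```

Usage would look like `rows = flatten_collections(fetch_collections(base_url))`, after which each row is a dict with the keys shown in the fragment (collection_id, subject_count, image_types, and so on).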

Thanks, Bill. Is there some documentation on how to use the IDC API? The example response you posted is the type of data I’m looking for. I’m hoping to simply post a request using the Python “requests” package to avoid using the google packages, as I’ve had some difficulty getting the google packages to work; “google.colab” in particular I couldn’t install without errors.

However, I was able to access some of the collections metadata in the BigQuery tables using the bigquery client via the following code:

    import pyarrow
    from google.cloud import bigquery
    from google.oauth2 import service_account

    # Authenticate with a service-account key file.
    cred_file = 'credentials.json'
    cred = service_account.Credentials.from_service_account_file(cred_file)

    # Queries run under this Google Cloud project.
    pid = 'cgmeyer-001'
    client = bigquery.Client(credentials=cred, project=pid)

    # Backticks keep the hyphenated project ID from being misparsed.
    collections_query = client.query("""
        SELECT *
        FROM `canceridc-data.idc.data_collections_metadata`
        LIMIT 1000
    """)

    # Wait for the query to finish, then save the rows as a TSV file.
    results = collections_query.result()
    df = results.to_dataframe()
    df.to_csv('IDC_data_collections_metadata.tsv', sep='\t', index=False)

When I run this query, however, I only get 25 collections, whereas the table on the web at this URL mentioned in the documentation lists 142 collections:
https://www.cancerimagingarchive.net/collections/

Is only a subset of the collections available via the API? Or is my request malformed?

Finally, some feedback: to run the above query, I needed to create a service account in the Google console, grant myself certain permissions, and generate keys to use with the BigQuery client. At least for me, that took some time and was a steep learning curve for simply accessing a small table of public data. I would have expected to be able to grab that data with a simple, unauthenticated request, cURL, etc.

Thanks for all your help!

Chris, sorry: Discourse flagged your post as potential spam, and I did not get a notification until just now. I have now raised your user level, so hopefully that will not happen again.

The IDC API is not yet available in the release, I believe. I will let @bill.clifford confirm and, if it is not yet released, say whether there is an option for you to use it as a beta tester.

We do not yet have all of the public TCIA collections in IDC; the plan is to have that done by the production release in Fall 2021. Your query returns the collections we currently have in IDC, which are listed here: Collections | IDC and here: https://learn.canceridc.dev/data/data-release-notes.

I do not know of a way to access BQ without authentication, and I am not sure that is possible.

Thanks. I look forward to checking out the IDC API once it’s released!
I’m glad I was able to get all the data I needed. One last question: is there an easy way to get the total number of files in each collection as a simple count? I don’t need any detailed file info, just the count per collection, like the one displayed for subjects.
~Chris

Files map to DICOM instances, and each DICOM instance maps to a row in canceridc-data.idc_views.dicom_all, so you would just need to aggregate over collection_id in that table.
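
As a rough sketch of that aggregation, assuming you already have an authenticated `google.cloud.bigquery.Client` like the one created earlier in this thread: the query below counts rows per collection_id, which equals the number of DICOM instances if each instance is exactly one row. The names `INSTANCE_COUNT_SQL` and `instance_counts` are made up for illustration.

```python
# One row in dicom_all per DICOM instance, so COUNT(*) per collection_id
# gives the number of files in each collection.
INSTANCE_COUNT_SQL = """
SELECT collection_id, COUNT(*) AS instance_count
FROM `canceridc-data.idc_views.dicom_all`
GROUP BY collection_id
ORDER BY collection_id
"""


def instance_counts(client):
    """Return {collection_id: instance_count}.

    `client` is expected to be a google.cloud.bigquery.Client (or anything
    with a compatible query(...).result() interface).
    """
    rows = client.query(INSTANCE_COUNT_SQL).result()
    return {row["collection_id"]: row["instance_count"] for row in rows}
```

Calling `instance_counts(client)` with the client from the earlier snippet should give a plain dict that can be joined to the collections metadata by collection_id.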