Hi,
I’m a developer building the Biomedical Research Hub (BRH) Data Commons. I’m looking for a way to fetch data about IDC studies such that the information available in BRH always stays up to date with the data available in IDC. I have a few questions related to that.
I also see a lot of discussions about BigQuery usage, so I wanted to make sure which is the best way to get the info I need: the API or BigQuery?
To fetch the info I need, I’ve been told to use this query in the BigQuery console:
Query
SELECT *
FROM `canceridc-data.idc.data_collections_metadata`
LIMIT 1000
But this query only yields 25 results, while the API returns 128. I wasn’t sure if I was using the right query.
The subject value in the BigQuery result set is different from the subject_count value in the API. Are these different fields? If not, which one of them is stale?
The API should be functional. If you see problems, we will need to look into this. @bill.clifford should be able to help with this.
BigQuery can be used to access all of the metadata, and you can query it using standard SQL. The API exposes only a very small subset of the metadata, mostly what is available in the IDC portal. I personally use BigQuery for all of my data querying needs.
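To make that concrete, here is a minimal sketch of running a standard-SQL query against an IDC table from Python. The table name below (`bigquery-public-data.idc_current.original_collections_metadata`) is an assumption about the current public dataset, not something stated in this thread; check the IDC documentation for the table that replaces the stale `canceridc-data.idc.data_collections_metadata` one.

```python
# Sketch: querying IDC collection metadata with standard SQL.
# ASSUMPTION: the dataset/table name below may not match the current
# IDC release; verify it in the IDC documentation before relying on it.
QUERY = """
SELECT *
FROM `bigquery-public-data.idc_current.original_collections_metadata`
"""

def list_collections(project: str):
    """Run QUERY and return the result rows as a list of dicts.

    Requires `pip install google-cloud-bigquery` and application
    default credentials for a GCP project with BigQuery enabled.
    """
    from google.cloud import bigquery  # deferred: third-party client
    client = bigquery.Client(project=project)
    return [dict(row) for row in client.query(QUERY).result()]
```

Call it as `list_collections("my-gcp-project")`, where `my-gcp-project` is a placeholder for your own billing project; the query itself runs against the public dataset.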
Can you please let me know who told you to use that query? It uses the wrong table, and if it appears anywhere in public documentation or examples we should fix it.
The API at https://api.imaging.datacommons.cancer.gov/v1/collections is functional and maintained. It is primarily intended to allow programmatic interaction for cohort creation, as opposed to interactive use of the portal. But for the definitive read-only source of what is available in IDC, using the BigQuery tables now hosted in the Google Public Data program is probably the way to go. As Andrey says, the BQ dataset and table you are asking about are out-of-date.
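For completeness, a small standard-library sketch of hitting that collections endpoint. The URL is taken from this thread; the shape of the JSON response (a "collections" key) is an assumption, so verify it against the actual API response.

```python
# Sketch: listing IDC collections through the API endpoint above,
# using only the Python standard library.
import json
import urllib.request

API_BASE = "https://api.imaging.datacommons.cancer.gov/v1"
COLLECTIONS_URL = f"{API_BASE}/collections"

def fetch_collections(url: str = COLLECTIONS_URL) -> dict:
    """Fetch the collections endpoint and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)
```

Something like `len(fetch_collections().get("collections", []))` would then give the collection count the original poster compared against BigQuery (the `"collections"` key is an assumed name, as noted above).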