Problem exporting cohort to csv/tsv/json, missing headers and invalid format

When I export a cohort to CSV or TSV, the downloaded file has no manifest headers and no column headers, which doesn’t match what the documentation page shows the file should look like.

UPENN-GBM-00001,upenn_gbm,10.7937/TCIA.709X_DN49,1.3.6.1.4.1.14519.5.2.1.193084931700557478622366870650889741468,1.3.6.1.4.1.14519.5.2.1.40067253395349591373441931744246881031,cc8b6843-b357-4e5a-8f03-dcd3c3842060,5fd52284-6973-44da-b882-473b97e3c82d,16.0
UPENN-GBM-00001,upenn_gbm,10.7937/TCIA.709X_DN49,1.3.6.1.4.1.14519.5.2.1.27626308777696377724750002938243956383,1.3.6.1.4.1.14519.5.2.1.202685500194820733902722321770171989719,34676c1b-0ead-4ab6-93ca-7da191cef968,01918fce-3589-47a7-883e-f82cc9f730bf,16.0
UPENN-GBM-00001,upenn_gbm,10.7937/TCIA.709X_DN49,1.3.6.1.4.1.14519.5.2.1.303922662365557240161187739636975756435,1.3.6.1.4.1.14519.5.2.1.337491152254065288399657726162931889194,7a2afc1c-9686-41b9-ad79-97f988c563f7,951a4b1e-1ed3-4c59-b7d0-0e877b370b03,16.0

If I download it as JSON, the file is not valid JSON: it’s not an array of objects, just one object per line.

{"collection_id": "upenn_gbm", "SeriesInstanceUID": "1.3.6.1.4.1.14519.5.2.1.40067253395349591373441931744246881031", "source_DOI": "10.7937/TCIA.709X_DN49", "idc_version": "", "PatientID": "UPENN-GBM-00001", "StudyInstanceUID": "1.3.6.1.4.1.14519.5.2.1.193084931700557478622366870650889741468", "crdc_study_uuid": "cc8b6843-b357-4e5a-8f03-dcd3c3842060", "crdc_series_uuid": "5fd52284-6973-44da-b882-473b97e3c82d"}
{"collection_id": "upenn_gbm", "SeriesInstanceUID": "1.3.6.1.4.1.14519.5.2.1.202685500194820733902722321770171989719", "source_DOI": "10.7937/TCIA.709X_DN49", "idc_version": "", "PatientID": "UPENN-GBM-00001", "StudyInstanceUID": "1.3.6.1.4.1.14519.5.2.1.27626308777696377724750002938243956383", "crdc_study_uuid": "34676c1b-0ead-4ab6-93ca-7da191cef968", "crdc_series_uuid": "01918fce-3589-47a7-883e-f82cc9f730bf"}
{"collection_id": "upenn_gbm", "SeriesInstanceUID": "1.3.6.1.4.1.14519.5.2.1.337491152254065288399657726162931889194", "source_DOI": "10.7937/TCIA.709X_DN49", "idc_version": "", "PatientID": "UPENN-GBM-00001", "StudyInstanceUID": "1.3.6.1.4.1.14519.5.2.1.303922662365557240161187739636975756435", "crdc_study_uuid": "7a2afc1c-9686-41b9-ad79-97f988c563f7", "crdc_series_uuid": "951a4b1e-1ed3-4c59-b7d0-0e877b370b03"}

I tried two cohorts with the same results. The cohort for the examples above is the MR modality from the UPENN-GBM collection.

Also, I just noticed a spelling error, "Manfiest Headers", on the Export Manifest dialog.

@vanossj thank you for reaching out!

TL;DR: what are you trying to accomplish with exporting the cohort?

The IDC Portal provides several ways to export a cohort manifest, and each is meant to support a different purpose. We discuss them on this documentation page: https://learn.canceridc.dev/portal/cohort-manifests (which of course can be improved).

If you want to download the cohort, you should use the s5cmd manifest, which is the default option.

The CSV/TSV manifest has rather limited utility, and I asked the opening question to make sure it is actually the right kind of manifest for your needs.

Indeed, there is a known regression in the portal, documented in issue #1255 of ImagingDataCommons/IDC-WebApp on GitHub ("Manifest headers are not showing up for CSV/TSV/JSON Downloads in Test and Prod tiers"). We have not yet resolved this issue.

What you see is called “newline JSON” (see https://jsonlines.org/ as one resource explaining what this means), which is a “flavor” of JSON that is commonly used for representing data streams and row-structured data. I agree that our documentation is not precise, and we should clarify this in the documentation.
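As a minimal sketch of how such a file can be read, assuming only that each non-empty line is one complete JSON object: the function name `read_json_lines` is made up for illustration, and the sample rows are shortened versions of the ones posted above.

```python
import json


def read_json_lines(text):
    """Parse newline-delimited JSON: one object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Two rows from the export above, trimmed to three fields for brevity.
sample = (
    '{"collection_id": "upenn_gbm", "PatientID": "UPENN-GBM-00001", '
    '"crdc_series_uuid": "5fd52284-6973-44da-b882-473b97e3c82d"}\n'
    '{"collection_id": "upenn_gbm", "PatientID": "UPENN-GBM-00001", '
    '"crdc_series_uuid": "01918fce-3589-47a7-883e-f82cc9f730bf"}\n'
)

rows = read_json_lines(sample)
print(len(rows), rows[0]["collection_id"])
```

The result is an ordinary list of dictionaries, so it can be handled the same way as a parsed JSON array.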

Thank you for pointing this out! I’ve just submitted a fix that should propagate to the portal shortly, I hope.

I hope this response helps you better understand IDC manifests. But please let me know what your goal is, so we can make sure we address your specific needs!

I do use s5cmd to download the files, but then I like to sort them into collection_id/PatientID/StudyInstanceUID/SeriesInstanceUID.

I use the CSV file to get the collection_id information, although the CSV doesn’t list the filenames, so I still need to read the SeriesInstanceUID from each downloaded file to sort them.
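Roughly, the lookup part can be sketched like this. The column order is an assumption, inferred by matching the CSV values against the JSON export (the file has no header row to confirm it), and `COLUMNS` and `series_to_folder` are names made up for the sketch.

```python
import csv
import io
from pathlib import PurePosixPath

# Assumed column order (not confirmed by a header row in the export).
COLUMNS = [
    "PatientID", "collection_id", "source_DOI", "StudyInstanceUID",
    "SeriesInstanceUID", "crdc_study_uuid", "crdc_series_uuid", "idc_version",
]


def series_to_folder(csv_text):
    """Map SeriesInstanceUID -> collection_id/PatientID/StudyInstanceUID/SeriesInstanceUID."""
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=COLUMNS)
    return {
        row["SeriesInstanceUID"]: PurePosixPath(
            row["collection_id"], row["PatientID"],
            row["StudyInstanceUID"], row["SeriesInstanceUID"],
        )
        for row in reader
    }


# First row of the headerless CSV export shown above.
sample = (
    "UPENN-GBM-00001,upenn_gbm,10.7937/TCIA.709X_DN49,"
    "1.3.6.1.4.1.14519.5.2.1.193084931700557478622366870650889741468,"
    "1.3.6.1.4.1.14519.5.2.1.40067253395349591373441931744246881031,"
    "cc8b6843-b357-4e5a-8f03-dcd3c3842060,"
    "5fd52284-6973-44da-b882-473b97e3c82d,16.0\n"
)

folders = series_to_folder(sample)
for series_uid, folder in folders.items():
    print(series_uid, "->", folder)
```

The SeriesInstanceUID still has to be read from each downloaded file before this lookup can be used.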

Thanks for the link on newline JSON, I hadn’t run into that format before.

Ah, thanks for letting me know about this!

You can sort the files into any hierarchy of DICOM tag values using dicomsort. But it is a valid point that collection_id is not in DICOM.

If you are not intimidated by SQL, it is best to build an s5cmd manifest that downloads the files into the proper folder structure on the fly. SQL interface gives you complete flexibility, both in terms of how to filter the selection and how to organize the files upon download.

The code below will generate a manifest that, after being passed to s5cmd, will download the files into the collection_id/PatientID/StudyInstanceUID/SeriesInstanceUID hierarchy. But you can easily modify the query to sort into any hierarchy (e.g., by SeriesDescription or SeriesNumber).

In this case, for the sake of an example, I select the cohort based on the value of StudyInstanceUID, but you can just as easily select based on any other criteria, such as the value of collection_id or Modality.

I also updated the cookbook notebook to include this recipe.

from google.cloud import bigquery

# Placeholder: replace with the ID of your own Google Cloud project
my_ProjectID = "my-project-id"
bq_client = bigquery.Client(my_ProjectID)

selection_query ="""
SELECT
  # Organize the files in-place on the fly
  ANY_VALUE(CONCAT("cp s3",
      REGEXP_SUBSTR(aws_url, "(://.*)/"),
      "/* ",collection_id,"/",PatientID,"/",
      StudyInstanceUID,"/",SeriesInstanceUID)) AS s5cmd_command
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  # Use any filtering criteria here
  StudyInstanceUID = "1.3.6.1.4.1.14519.5.2.1.6279.6001.224985459390356936417021464571"
GROUP BY
  SeriesInstanceUID
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()

selection_df.to_csv("/content/s5cmd_aws_manifest.txt", header=False, index=False)
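To make the CONCAT/REGEXP_SUBSTR expression in the query easier to follow, here is a small Python sketch of what it builds for a single row. The function name `s5cmd_command` and the example URL and folder names are hypothetical, and the URL shape (bucket, series folder, file) is an assumption; only the string manipulation mirrors the query.

```python
import re


def s5cmd_command(aws_url, collection_id, patient_id, study_uid, series_uid):
    """Build one manifest line the way the SQL CONCAT expression does."""
    # Same capture as REGEXP_SUBSTR(aws_url, "(://.*)/"):
    # everything from "://" up to, but not including, the last "/".
    prefix = re.search(r"(://.*)/", aws_url).group(1)
    return f"cp s3{prefix}/* {collection_id}/{patient_id}/{study_uid}/{series_uid}"


# Hypothetical values, for illustration only.
line = s5cmd_command(
    "s3://example-bucket/series-folder/instance.dcm",
    "upenn_gbm", "UPENN-GBM-00001", "study-uid", "series-uid",
)
print(line)
```

Dropping the file name and appending `/*` turns a per-file URL into a per-series copy, which is why the query needs only one row per SeriesInstanceUID.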

Implementing features like this in the IDC Portal is a heavyweight operation, but, fortunately, you can do this easily and with full flexibility without having to open the portal.

Would this address your need?

That works! Thanks for the example.