Tutorial: select_unique_ngs


The genieBPC package allows users to link the clinical data to patients’ corresponding genomic samples.

Modifying Function Arguments

`select_unique_ngs()` Function Arguments
Argument Description


Output object of the create_analytic_cohort function.


Character vector specifying which sample OncoTree codes to keep. See 'cpt_oncotree_code' column of data_cohort argument above to get options.


Character specifying which type of genomic sample to prioritize, options are 'Primary', 'Local' and 'Metastasis'. Default is to not select a NGS sample based on the sample type.


Character specifying if the first or last genomic sample recorded should be kept. Options are 'min' (first) and 'max' (last).

The select_unique_ngs() function returns the ‘cohort_ngs’ object of the create_analytic_cohort with a unique genomic sample per patient for each diagnosis for the purpose of creating an analytic dataset based on user-defined criterion, including OncoTree code, primary vs. metastatic tumor sample, and earliest vs. most recent sample. If multiple reports for a patient remain available after the user-defined specifications, or if no specifications are provided, the panel with the largest number of genes is selected by default. Sample optimization is performed in the order that the arguments are specified in the function, regardless of the arguments’ order provided by the user. Namely the OncoTree code is prioritized first, sample type is prioritized second and finally the time is prioritized last. For patients with exactly one genomic sample, that unique genomic sample will be returned regardless of whether it meets the user-specified parameters; the purpose of this function is to specify criteria for selecting a single NGS in the case that there are multiple NGS reports associated with a single cancer diagnosis. Running the select_unique_ngs() function will ensure that the resulting dataset returned by merging the next generation sequencing report data onto the analytic dataset returned by create_analytic_cohort() will maintain a structure of one record per patient.

Note that the NGS dataset serves as the link between the clinical and genomic data, where the NGS dataset includes one record per NGS report per patient, including the NGS sample ID that is used to link to the genomic data files. Merging data from the NGS report onto the analytic cohort returned from create_analytic_cohort() therefore allows users to utilize all clinical and genomic data available.


Before going through the tutorial, load the {genieBPC} library and log into Synapse using the set_synapse_credentials() function. For more information on set_synapse_credentials(), refer to the Tutorial: pull_data_synapse() vignette.



This tutorial will utilize the data downloaded in the Tutorial: pull_data_synapse() vignette, as shown below:

nsclc_2_0 = pull_data_synapse("NSCLC", version = "v2.0-public")


Example 1

Select unique genomic samples from patients with stage IV NSCLC of histology adenocarcinoma. For patients with multiple samples, select the sample with OncoTree code LUAD. For patients with multiple samples with OncoTree code LUAD, select the metastatic genomic sample. If any patients have multiple metastatic samples with OncoTree code LUAD, take the later of the samples.

Note: for patients with exactly one genomic sample, that unique genomic sample will be returned regardless of whether it meets the argument criteria specified below.

ex1 <- create_analytic_cohort(
  data_synapse = nsclc_2_0$NSCLC_v2.0,
  stage_dx = c("Stage IV"),
  histology = "Adenocarcinoma"

samples_data1 <- select_unique_ngs(
  data_cohort = ex1$cohort_ngs,
  oncotree_code = "LUAD",
  sample_type = "Metastasis",
  min_max_time = "max"

Example 2

Create a cohort of all NSCLC patients who received Cisplatin, Pemetrexed Disodium or Cisplatin, Etoposide as their first drug regimen, and select corresponding genomic samples.

For patients with multiple samples, first select samples with OncoTree code LUAD. If multiple samples per patient remain, prioritize samples from the primary tumor. Finally, prioritize the sample taken earliest.

Note: for patients with exactly one genomic sample, that unique genomic sample will be returned regardless of whether it meets the argument criteria specified below.

ex2 <- create_analytic_cohort(
  data_synapse = nsclc_2_0$NSCLC_v2.0,
  regimen_drugs = c("Cisplatin, Pemetrexed Disodium", "Cisplatin, Etoposide"),
  regimen_order = 1,
  regimen_order_type = "within regimen"

samples_data <- select_unique_ngs(
  data_cohort = ex2$cohort_ngs,
  oncotree_code = "LUAD",
  sample_type = "Primary",
  min_max_time = "min"