Package 'gcap' reference manual

Title:	Gene-level Circular Amplicon Prediction
Description:	Provides a data processing pipeline that takes paired BAM files (or allele-specific copy number data) as input, and an XGBoost model for predicting tumor circular amplicons (also known as ecDNA) at the gene level.
Authors:	Shixiang Wang [aut, cre]
Maintainer:	Shixiang Wang <[email protected]>
License:	Non-Commercial Academic License + file LICENSE
Version:	1.2.0
Built:	2025-03-07 06:06:50 UTC
Source:	https://github.com/ShixiangWang/gcap

Example allele specific copy number (ASCN) data

Description

Example allele specific copy number (ASCN) data

Format

A data.frame

Source

Generated from data-raw/; the raw source is from our study, obtained by calling ASCAT v3.0 alpha on the corresponding WES data.

Examples

data("ascn")

Deploy Command Line Interface to System Local Path

Description

Should only be used on Unix-like systems. For details of the arguments passed to the CLI, see gcap.workflow() and gcap.ASCNworkflow().

Usage

deploy(target = "/usr/local/bin")

Arguments

target

the target path to deploy the CLI.

Value

Nothing.

Example ecDNA training data

Description

Example ecDNA training data

Format

A data.table

Source

Generated from data-raw/

Examples

data("ec")

R6 class representing focal copy number amplification list predicted from a cohort

Description

Contains fields storing data and methods to get, process and visualize fCNA information. For examples, see gcap.ASCNworkflow().

Public fields

data

a data.table storing fCNA list, which typically contains following columns:

sample sample or case ID.
band chromosome cytoband.
gene_id gene ID, typically Ensembl ID. You can convert the ID with R package IDConverter.
total_cn total copy number value.
minor_cn copy number value for minor allele.
prob the probability that the gene is located on circular DNA.
gene_class gene level amplicon classification.

sample_summary

a data.table storing sample summary data, which typically contains at least the following columns:

sample sample or case ID. Should only include cases that have been called with the GCAP workflow; otherwise the extra cases are automatically classified as 'nofocal' (i.e. NA in the sample_summary field).
purity, ploidy tumor purity and ploidy.
AScore aneuploidy score.
pLOH percentage of the genome harboring LOH events.
CN1 ... CN19 activity of copy number signatures.
class the sample class based on amplicon type.
ec_genes number of genes predicted as located on circular DNA.
ec_possibly_genes same as ec_genes but with less confidence.
ec_cytobands number of cytobands predicted as located on circular DNA (the regions of ec_possibly_genes are not included in the computation).

Active bindings

min_prob: see the $new() method for details. If you update this value, a function is called to update the sample summary.

Methods

Method `new()`

Create a fCNA object. Typically, you can obtain this object from gcap.workflow() or gcap.ASCNworkflow().

Usage

fCNA$new(
  fcna,
  pdata = fcna[, "sample", drop = FALSE],
  min_prob = 0.6,
  only_oncogenes = FALSE,
  genome_build = c("hg38", "hg19", "mm10")
)

Arguments

fcna: a data.frame storing focal copy number amplicon list.
pdata: a data.frame storing phenotype or sample-level related data. (Optional)
min_prob: the minimal aggregated (in cytoband level) probability to determine a circular amplicon.
only_oncogenes: if TRUE, only known oncogenes are kept for circular prediction.
genome_build: genome version
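
The role of min_prob can be pictured with a minimal base R sketch. It assumes that gene-level probabilities are aggregated per sample and cytoband with the mean and compared with `>=`; neither the aggregation function nor the comparison is stated in this manual, so both are placeholders, and the data values and cytoband labels are hypothetical.

```r
# Hypothetical gene-level predictions; values are illustrative only.
genes <- data.frame(
  sample = c("s1", "s1", "s1", "s2"),
  band   = c("8q24.21", "8q24.21", "7p11.2", "8q24.21"),
  prob   = c(0.9, 0.7, 0.3, 0.2),
  stringsAsFactors = FALSE
)
min_prob <- 0.6

# Assumption: aggregate with the mean; the package may aggregate differently.
band_prob <- aggregate(prob ~ sample + band, data = genes, FUN = mean)

# Assumption: a cytoband is called circular when the aggregate reaches min_prob.
band_prob$circular <- band_prob$prob >= min_prob
band_prob
```

Raising min_prob in this sketch can only reduce the number of cytobands flagged as circular.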

Method `subset()`

Return a subset fCNA object

Usage

fCNA$subset(..., on = c("data", "sample_summary"))

Arguments

...: subset expressions on fCNA$data or fCNA$sample_summary.
on: if it is "data", subset operations are on data field of fCNA object, same for "sample_summary".

Returns

a fCNA

Method `getSampleSummary()`

Get sample summary of fCNA

Usage

fCNA$getSampleSummary(
  only_oncogenes = FALSE,
  genome_build = c("hg38", "hg19", "mm10")
)

Arguments

only_oncogenes: if TRUE, only known oncogenes are kept for circular prediction.
genome_build: genome version.

Returns

a data.table

Method `getGeneSummary()`

Get gene level summary of fCNA type

Usage

fCNA$getGeneSummary(return_mat = FALSE)

Arguments

return_mat: if TRUE, return a gene by sample matrix instead of a summary.

Returns

a data.table or a matrix.

Method `getCytobandSummary()`

Get cytoband level summary of fCNA type

Usage

fCNA$getCytobandSummary(unique = FALSE, return_mat = FALSE)

Arguments

unique: if TRUE, count sample frequency instead of gene frequency.
return_mat: if TRUE, return a cytoband by sample matrix instead of a summary.

Returns

a data.table
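
One plausible reading of the unique argument, sketched in base R on a hypothetical fCNA-like table: gene frequency counts every record in a cytoband, while sample frequency counts each sample at most once per cytoband. This is an illustration of the distinction, not the method's actual implementation.

```r
# Hypothetical records; gene IDs and cytobands are illustrative only.
fcna <- data.frame(
  sample  = c("s1", "s1", "s2", "s2"),
  band    = c("8q24.21", "8q24.21", "8q24.21", "7p11.2"),
  gene_id = c("g1", "g2", "g1", "g3"),
  stringsAsFactors = FALSE
)

# Gene frequency: every (sample, gene) record in a cytoband counts once.
gene_freq <- table(fcna$band)

# Sample frequency: each sample counts at most once per cytoband.
sample_freq <- table(unique(fcna[, c("sample", "band")])$band)

gene_freq
sample_freq
```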

Method `saveToFiles()`

Save the key data to local files

Usage

fCNA$saveToFiles(dirpath, fileprefix = "fCNA")

Arguments

dirpath: directory path storing output files.
fileprefix: file prefix. Two result files shall be generated.

Method `convertGeneID()`

Convert Gene IDs between Ensembl and Hugo Symbol System

Usage

fCNA$convertGeneID(
  type = c("ensembl", "symbol"),
  genome_build = c("hg38", "hg19", "mm10")
)

Arguments

type: type of input IDs, could be 'ensembl' or 'symbol'.
genome_build: reference genome build.

Method `print()`

print the fCNA object

Usage

fCNA$print(...)

Arguments

...: unused.

GCAP workflow for gene-level amplicon prediction from ASCN input

Description

Unlike gcap.workflow, this function directly uses the allele-specific copy number data along with some extra sample information to infer ecDNA genes.

Usage

gcap.ASCNworkflow(
  data,
  genome_build = c("hg38", "hg19"),
  model = "XGB11",
  tightness = 1L,
  gap_cn = 3L,
  overlap = 1,
  only_oncogenes = FALSE,
  outdir = getwd(),
  result_file_prefix = paste0("gcap_", uuid::UUIDgenerate(TRUE))
)

Arguments

`data`	a `data.frame` with the following columns. The key columns can be obtained from common allele-specific CNV calling software, e.g., ASCAT, Sequenza, FACETS.

chromosome: chromosome names starting with 'chr'.
start: start position of the segment.
end: end position of the segment.
total_cn: total integer copy number of the segment.
minor_cn: minor allele integer copy number of the segment. Set it to `NA` if you don't have this data.
sample: sample identifier.
purity: tumor purity of the sample. Set to `1` if you don't know.
ploidy (optional): ploidy value of the sample tumor genome.
age (optional): age of the case, use along with `gender`.
gender (optional): gender of the case, use along with `age`.
type (optional): cancer type of the case, use along with `age` and `gender`. Please refer to gcap.collapse2Genes to see the supported cancer types. This info is only used in the 'XGB56' model. If you don't use this model, you don't need to set it.

`genome_build`	"hg38" or "hg19".
`model`	model name ("XGB11", "XGB32", "XGB56") or a custom model from input. 'toy' can be used for testing.
`tightness`	a coefficient multiplied with the TCGA somatic CN to set a stricter threshold for calling a circular amplicon. The larger the value, the more likely an fCNA is assigned to `noncircular` instead of `circular`. When it is `NA`, TCGA somatic CN data is not used as a reference.
`gap_cn`	a gap copy number value. A gene with copy number above background (`ploidy + gap_cn` in general) is treated as a focal amplicon. A smaller value yields more amplicons.
`overlap`	the overlap percentage on gene.
`only_oncogenes`	if `TRUE`, only known oncogenes are kept for circular prediction.
`outdir`	result output path.
`result_file_prefix`	file name prefix (without directory path) for storing the final model prediction file in CSV format. By default, a unique file name is generated with a UUID.

Value

a list of invisible data.table and corresponding files saved to local machine.
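
As a minimal sketch of the input layout described for the data argument, the following base R code builds a small segment table and checks the column names before it would be handed to the workflow. The sample name, coordinates and copy numbers are hypothetical, and the `required` vector reflects only the columns not marked optional above.

```r
# Hypothetical segment-level input; all values are illustrative only.
ascn_input <- data.frame(
  chromosome = c("chr1", "chr1", "chr8"),
  start      = c(1, 5000001, 127000001),
  end        = c(5000000, 10000000, 128500000),
  total_cn   = c(2L, 4L, 25L),
  minor_cn   = c(1L, 1L, NA),   # NA when the minor allele copy number is unknown
  sample     = "sample_01",
  purity     = 1,               # set to 1 when purity is unknown
  stringsAsFactors = FALSE
)

# Columns not marked as optional in the argument description.
required <- c("chromosome", "start", "end", "total_cn", "minor_cn", "sample", "purity")
missing_cols <- setdiff(required, colnames(ascn_input))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}

# Chromosome names are expected to start with 'chr'.
chr_ok <- all(startsWith(ascn_input$chromosome, "chr"))
chr_ok
```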

Examples

data("ascn")
data <- ascn
rv <- gcap.ASCNworkflow(data, outdir = tempdir(), model = "XGB11")
data$purity <- 1
rv2 <- gcap.ASCNworkflow(data, outdir = tempdir(), model = "XGB11")
data$age <- 60
data$gender <- "XY"
rv3 <- gcap.ASCNworkflow(data, outdir = tempdir(), model = "XGB32")
# If you want to use 'XGB56', you should include 'type' column
data$type <- "LUAD"
rv4 <- gcap.ASCNworkflow(data, outdir = tempdir(), model = "XGB56")
# If you only have total integer copy number
data$minor_cn <- NA
rv5 <- gcap.ASCNworkflow(data, outdir = tempdir(), model = "XGB11")

# R6 class fCNA --------------------------------
print(rv)
print(rv$data)
print(rv$sample_summary)
print(rv$gene_summary)
print(rv$cytoband_summary)

# Create a subset fCNA
rv_subset <- rv$subset(total_cn > 10)
nrow(rv$data)
nrow(rv_subset$data)

rv_subset2 <- rv$subset(sample == "TCGA-02-2485-01")
nrow(rv_subset2$data)
unique(rv_subset2$data$sample)

sum_gene <- rv$getGeneSummary()
sum_gene
mat_gene <- rv$getGeneSummary(return_mat = TRUE)
mat_gene

sum_cytoband <- rv$getCytobandSummary()
sum_cytoband
mat_cytoband <- rv$getCytobandSummary(return_mat = TRUE)
mat_cytoband

Generate unified gene-level feature data

Description

Generate unified gene-level feature data

Usage

gcap.collapse2Genes(
  fts,
  extra_info = NULL,
  include_type = FALSE,
  fix_type = TRUE,
  genome_build = c("hg38", "hg19", "mm10"),
  overlap = 1
)

Arguments

`fts`	(modified) result from `gcap.extractFeatures()`
`extra_info`	(optional) a `data.frame` with 3 columns: 'sample', 'age' and 'gender'. To include cancer type, see parameter `include_type`. Gender should be 'XX' or 'XY'; `0` for 'XX' and `1` for 'XY' are also accepted.
`include_type`	if `TRUE`, a fourth column named 'type' should be included in `extra_info`; the supported cancer types should be given as TCGA cancer type abbreviations.
`fix_type`	default is `TRUE`: only cancer types used in pre-trained models are used, and others are converted to `NA`. If `FALSE`, one-hot encoding is generated only for the cancer types in the input data.
`genome_build`	genome build version, should be one of 'hg38', 'hg19'.
`overlap`	the overlap percentage on gene.
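
A minimal sketch of an extra_info table in base R follows. The helper `normalize_gender()` is hypothetical (it is not part of gcap) and merely maps the two accepted codings onto 'XX'/'XY' so that unexpected values surface as `NA` before the table is used. The sample names and ages are made up, and whether a given cancer type is supported by the pre-trained models is not checked here.

```r
# Hypothetical sample annotation; names and ages are illustrative only.
extra_info <- data.frame(
  sample = c("case_A", "case_B", "case_C"),
  age    = c(61, 47, 70),
  gender = c("XX", "1", "0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map 'XX'/'XY' and 0/1 codings onto 'XX'/'XY'.
normalize_gender <- function(x) {
  x <- as.character(x)
  ifelse(x %in% c("XX", "0"), "XX",
         ifelse(x %in% c("XY", "1"), "XY", NA_character_))
}

extra_info$gender <- normalize_gender(extra_info$gender)

# Fourth column, needed only when include_type = TRUE (TCGA abbreviations).
extra_info$type <- c("LUAD", "BRCA", "GBM")
extra_info
```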

Value

a data.table.

Extract sample and region level features

Description

Extract sample and region level features

Usage

gcap.extractFeatures(
  ascat_files,
  genome_build = c("hg38", "hg19", "mm10"),
  ascn_data = NULL
)

Arguments

`ascat_files`	a list of file paths. Typically the result of `gcap.runASCAT()`
`genome_build`	genome build version, should be one of 'hg38', 'hg19'.
`ascn_data`	if `ascat_files` is missing, an alternative `data.frame` can be provided for ASCN data along with purity and ploidy (optional).

Value

a list.

Run ASCAT on tumor-normal pair WES data files

Description

A wrapper calling ASCAT on WES data from one or more tumor(-normal paired) BAM files. Note that for multiple tumor-normal pairs, each of the first 5 arguments should be a vector of the same length.

Usage

gcap.runASCAT(
  tumourseqfile,
  normalseqfile = NA_character_,
  tumourname,
  normalname = NA_character_,
  jobname = tumourname,
  outdir = getwd(),
  allelecounter_exe = "~/miniconda3/envs/cancerit/bin/alleleCounter",
  g1000allelesprefix = file.path("~/data/snp/1000G_loci_hg38",
    "1kg.phase3.v5a_GRCh38nounref_allele_index_chr"),
  g1000lociprefix = file.path("~/data/snp/1000G_loci_hg38",
    "1kg.phase3.v5a_GRCh38nounref_loci_chrstring_chr"),
  GCcontentfile = "~/data/snp/GC_correction_hg38.txt",
  replictimingfile = "~/data/snp/RT_correction_hg38.txt",
  nthreads = 22,
  minCounts = 10,
  BED_file = NA,
  probloci_file = NA,
  chrom_names = 1:22,
  gender = "XX",
  min_base_qual = 20,
  min_map_qual = 35,
  penalty = 70,
  genome_build = "hg38",
  skip_finished_ASCAT = FALSE
)

Arguments

`tumourseqfile`	Full path to the tumour BAM file.
`normalseqfile`	Full path to the normal BAM file.
`tumourname`	Identifier to be used for tumour output files.
`normalname`	Identifier to be used for normal output files.
`jobname`	job name, typically a unique name for a tumor-normal pair.
`outdir`	result output path.
`allelecounter_exe`	Path to the allele counter executable.
`g1000allelesprefix`	Prefix path to the allele data (e.g. "G1000_alleles_chr").
`g1000lociprefix`	Prefix path to the loci data (e.g. "G1000_loci_chr").
`GCcontentfile`	File containing the GC content around every SNP for increasing window sizes.
`replictimingfile`	File containing replication timing at every SNP for various cell lines.
`nthreads`	The number of parallel processes for getting allele counts (optional, default=22).
`minCounts`	Minimum depth required in the normal for a SNP to be considered (optional, default=10).
`BED_file`	A BED file for only looking at SNPs within specific intervals (optional, default=NA).
`probloci_file`	A file (chromosome <tab> position; no header) containing specific loci to ignore (optional, default=NA).
`chrom_names`	A vector containing the names of chromosomes to be considered (optional, default=1:22).
`gender`	a vector of gender for each case ("XX" or "XY"). Default = all female ("XX"). Ignore this if you don't include sex chromosomes.
`min_base_qual`	Minimum base quality required for a read to be counted (optional, default=20).
`min_map_qual`	Minimum mapping quality required for a read to be counted (optional, default=35).
`penalty`	penalty of introducing an additional ASPCF breakpoint (expert parameter, don't adapt unless you know what you're doing)
`genome_build`	"hg38" or "hg19".
`skip_finished_ASCAT`	if `TRUE`, skip finished ASCAT calls to save time.

Value

Nothing. Check the outdir for results.
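
The requirement that the first five arguments have the same length for multiple tumor-normal pairs can be checked up front. The sketch below uses base R only; the BAM paths and names are hypothetical placeholders.

```r
# Hypothetical inputs for two tumor-normal pairs; paths are placeholders.
pairs <- list(
  tumourseqfile = c("/path/to/T1.bam", "/path/to/T2.bam"),
  normalseqfile = c("/path/to/N1.bam", "/path/to/N2.bam"),
  tumourname    = c("T1", "T2"),
  normalname    = c("N1", "N2"),
  jobname       = c("T1", "T2")
)

# All five vectors must have the same length, one element per pair.
same_length <- length(unique(lengths(pairs))) == 1
if (!same_length) {
  stop("The first 5 arguments must be vectors of the same length.")
}

# Job names are typically unique per tumor-normal pair.
unique_jobs <- !anyDuplicated(pairs$jobname)
unique_jobs
```

After such a check, the vectors would be passed as the corresponding arguments of gcap.runASCAT() together with the reference file settings for your machine.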

Build data for prediction from absolute copy number data

Description

This is a wrapper of gcap.extractFeatures() and gcap.collapse2Genes() that combines the feature extraction and prediction input generation procedures. If you want to modify the result of gcap.extractFeatures(), you should always use the two functions instead of this wrapper.

Usage

gcap.runASCNBuildflow(data, genome_build = c("hg38", "hg19"), overlap = 1)

Arguments

data

a data.frame with following columns. The key columns can be obtained from common allele specific CNV calling software, e.g., ASCAT, Sequenza, FACETS.

chromosome: chromosome names starting with 'chr'.
start: start position of the segment.
end: end position of the segment.
total_cn: total integer copy number of the segment.
minor_cn: minor allele integer copy number of the segment. Set it to NA if you don't have this data.
sample: sample identifier.
purity: tumor purity of the sample. Set to 1 if you don't know.
ploidy (optional): ploidy value of the sample tumor genome.
age (optional): age of the case, use along with gender.
gender (optional): gender of the case, use along with age.
type (optional): cancer type of the case, use along with age and gender. Please refer to gcap.collapse2Genes to see the supported cancer types. This info is only used in 'XGB56' model. If you don't use this model, you don't need to set it.

genome_build

"hg38" or "hg19".

overlap

the overlap percentage on gene.

Value

a data.table.

Build data for prediction from ASCAT result files

Description

Usage

gcap.runBuildflow(
  ascat_files,
  extra_info,
  include_type = FALSE,
  genome_build = c("hg38", "hg19", "mm10"),
  overlap = 1
)

Arguments

`ascat_files`	a list of file paths. Typically the result of `gcap.runASCAT()`
`extra_info`	(optional) a `data.frame` with 3 columns: 'sample', 'age' and 'gender'. To include cancer type, see parameter `include_type`. Gender should be 'XX' or 'XY'; `0` for 'XX' and `1` for 'XY' are also accepted.
`include_type`	if `TRUE`, a fourth column named 'type' should be included in `extra_info`; the supported cancer types should be given as TCGA cancer type abbreviations.
`genome_build`	genome build version, should be one of 'hg38', 'hg19'.
`overlap`	the overlap percentage on gene.

Value

a data.table.

Run gene-level circular prediction

Description

Run gene-level circular prediction

Usage

gcap.runPrediction(data, model = "XGB11")

Arguments

`data`	data to predict (`data.frame`/`matrix` format), from `gcap.collapse2Genes()` in general.
`model`	model name ("XGB11", "XGB32", "XGB56") or a custom model from input. 'toy' can be used for testing.

Value

a numeric vector of predicted probabilities.

Examples

data("ec")
# Use toy model for illustration
y_pred <- gcap.runPrediction(ec, "toy")
y_pred

Summarize prediction result into gene/sample-level

Description

Summarize prediction result into gene/sample-level

Usage

gcap.runScoring(
  data,
  genome_build = "hg38",
  min_prob = 0.6,
  tightness = 1L,
  gap_cn = 3L,
  only_oncogenes = FALSE
)

Arguments

`data`	a `data.table` containing result from gcap.runPrediction.
`genome_build`	genome build version, should be one of 'hg38', 'hg19'.
`min_prob`	the minimal aggregated (at cytoband level) probability to determine a circular amplicon. The default value balances recall and precision. We highly recommend setting it to 0.95 or larger if you want to detect solid positive cases (e.g. for experimental validation) instead of subtyping cases.
`tightness`	a coefficient multiplied with the TCGA somatic CN to set a stricter threshold for calling a circular amplicon. The larger the value, the more likely an fCNA is assigned to `noncircular` instead of `circular`. When it is `NA`, TCGA somatic CN data is not used as a reference.
`gap_cn`	a gap copy number value. A gene with copy number above background (`ploidy + gap_cn` in general) is treated as a focal amplicon. A smaller value yields more amplicons.
`only_oncogenes`	if `TRUE`, only known oncogenes are kept for circular prediction.
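
The effect of `gap_cn` can be illustrated with a simplified base R sketch. It assumes the background is exactly `ploidy + gap_cn` and that "above" means strictly greater; the description only says this holds "in general", and the `tightness` reference is not modeled here. Gene IDs and values are hypothetical.

```r
# Hypothetical gene-level copy numbers; values are illustrative only.
genes <- data.frame(
  gene_id  = c("g1", "g2", "g3"),
  total_cn = c(4, 6, 12),
  ploidy   = c(2, 2.5, 2.5),
  stringsAsFactors = FALSE
)

# Simplified rule: focal when total_cn exceeds ploidy + gap_cn.
count_focal <- function(gap_cn) {
  sum(genes$total_cn > genes$ploidy + gap_cn)
}

gap_cn <- 3
genes$focal <- genes$total_cn > genes$ploidy + gap_cn
genes

# A smaller gap_cn can only keep or increase the number of focal genes.
sapply(c(1, 3, 5), count_focal)
```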

Value

a list of data.table.

Examples

data("ec")
ec2 <- ec
ec2$prob <- gcap.runPrediction(ec)
score <- gcap.runScoring(ec2)
score

GCAP workflow for gene-level amplicon prediction

Description

GCAP workflow for gene-level amplicon prediction

Usage

gcap.workflow(
  tumourseqfile,
  normalseqfile,
  tumourname,
  normalname,
  jobname = tumourname,
  extra_info = NULL,
  include_type = FALSE,
  genome_build = c("hg38", "hg19"),
  model = "XGB11",
  tightness = 1L,
  gap_cn = 3L,
  overlap = 1,
  only_oncogenes = FALSE,
  outdir = getwd(),
  result_file_prefix = paste0("gcap_", uuid::UUIDgenerate(TRUE)),
  allelecounter_exe = "~/miniconda3/envs/cancerit/bin/alleleCounter",
  g1000allelesprefix = file.path("~/data/snp/1000G_loci_hg38",
    "1kg.phase3.v5a_GRCh38nounref_allele_index_chr"),
  g1000lociprefix = file.path("~/data/snp/1000G_loci_hg38",
    "1kg.phase3.v5a_GRCh38nounref_loci_chrstring_chr"),
  GCcontentfile = "~/data/snp/GC_correction_hg38.txt",
  replictimingfile = "~/data/snp/RT_correction_hg38.txt",
  nthreads = 22,
  minCounts = 10,
  BED_file = NA,
  probloci_file = NA,
  chrom_names = 1:22,
  min_base_qual = 20,
  min_map_qual = 35,
  penalty = 70,
  skip_finished_ASCAT = TRUE,
  skip_ascat_call = FALSE
)

Arguments

`tumourseqfile`	Full path to the tumour BAM file.
`normalseqfile`	Full path to the normal BAM file.
`tumourname`	Identifier to be used for tumour output files.
`normalname`	Identifier to be used for normal output files.
`jobname`	job name, typically a unique name for a tumor-normal pair.
`extra_info`	(optional) a (file containing a) `data.frame` with 3 columns: 'sample' (must be identical to the setting of parameter `jobname`), 'age' and 'gender'. Gender should be 'XX' or 'XY'; `0` for 'XX' and `1` for 'XY' are also accepted.
`include_type`	if `TRUE`, a fourth column named 'type' should be included in `extra_info`; the supported cancer types should be given as TCGA cancer type abbreviations.
`genome_build`	"hg38" or "hg19".
`model`	model name ("XGB11", "XGB32", "XGB56") or a custom model from input. 'toy' can be used for testing.
`tightness`	a coefficient multiplied with the TCGA somatic CN to set a stricter threshold for calling a circular amplicon. The larger the value, the more likely an fCNA is assigned to `noncircular` instead of `circular`. When it is `NA`, TCGA somatic CN data is not used as a reference.
`gap_cn`	a gap copy number value. A gene with copy number above background (`ploidy + gap_cn` in general) is treated as a focal amplicon. A smaller value yields more amplicons.
`overlap`	the overlap percentage on gene.
`only_oncogenes`	if `TRUE`, only known oncogenes are kept for circular prediction.
`outdir`	result output path.
`result_file_prefix`	file name prefix (without directory path) for storing the final model prediction file in CSV format. By default, a unique file name is generated with a UUID.
`allelecounter_exe`	Path to the allele counter executable.
`g1000allelesprefix`	Prefix path to the allele data (e.g. "G1000_alleles_chr").
`g1000lociprefix`	Prefix path to the loci data (e.g. "G1000_loci_chr").
`GCcontentfile`	File containing the GC content around every SNP for increasing window sizes.
`replictimingfile`	File containing replication timing at every SNP for various cell lines.
`nthreads`	The number of parallel processes for getting allele counts (optional, default=22).
`minCounts`	Minimum depth required in the normal for a SNP to be considered (optional, default=10).
`BED_file`	A BED file for only looking at SNPs within specific intervals (optional, default=NA).
`probloci_file`	A file (chromosome <tab> position; no header) containing specific loci to ignore (optional, default=NA).
`chrom_names`	A vector containing the names of chromosomes to be considered (optional, default=1:22).
`min_base_qual`	Minimum base quality required for a read to be counted (optional, default=20).
`min_map_qual`	Minimum mapping quality required for a read to be counted (optional, default=35).
`penalty`	penalty of introducing an additional ASPCF breakpoint (expert parameter, don't adapt unless you know what you're doing)
`skip_finished_ASCAT`	if `TRUE`, skip finished ASCAT calls to save time.
`skip_ascat_call`	if `TRUE`, skip calling ASCAT. This is useful when you have done this step and just want to run next steps.

Value

a list of invisible data.table and corresponding files saved to local machine.

GCAP FACETS workflow for gene-level amplicon prediction

Description

GCAP FACETS workflow for gene-level amplicon prediction

Usage

gcap.workflow.facets(
  tumourseqfile,
  normalseqfile,
  jobname,
  extra_info = NULL,
  include_type = FALSE,
  genome_build = c("mm10", "hg38", "hg19"),
  model = "XGB11",
  tightness = 1L,
  gap_cn = 3L,
  overlap = 1,
  pro_cval = 100,
  only_oncogenes = FALSE,
  snp_file = "path/to/genome_build_responding.vcf.gz",
  outdir = getwd(),
  result_file_prefix = paste0("gcap_", uuid::UUIDgenerate(TRUE)),
  util_exe = system.file("extcode", "snp-pileup", package = "facets"),
  nthreads = 1,
  skip_finished_facets = TRUE,
  skip_facets_call = FALSE
)

Arguments

`tumourseqfile`	Full path to the tumour BAM file.
`normalseqfile`	Full path to the normal BAM file.
`jobname`	job name, typically a unique name for a tumor-normal pair.
`extra_info`	(optional) a (file containing a) `data.frame` with 3 columns: 'sample' (must be identical to the setting of parameter `jobname`), 'age' and 'gender'. Gender should be 'XX' or 'XY'; `0` for 'XX' and `1` for 'XY' are also accepted.
`include_type`	if `TRUE`, a fourth column named 'type' should be included in `extra_info`; the supported cancer types should be given as TCGA cancer type abbreviations.
`genome_build`	genome build version, should be one of 'hg38', 'hg19' and 'mm10'.
`model`	model name ("XGB11", "XGB32", "XGB56") or a custom model from input. 'toy' can be used for testing.
`tightness`	a coefficient multiplied with the TCGA somatic CN to set a stricter threshold for calling a circular amplicon. The larger the value, the more likely an fCNA is assigned to `noncircular` instead of `circular`. When it is `NA`, TCGA somatic CN data is not used as a reference.
`gap_cn`	a gap copy number value. A gene with copy number above background (`ploidy + gap_cn` in general) is treated as a focal amplicon. A smaller value yields more amplicons.
`overlap`	the overlap percentage on gene.
`pro_cval`	critical value for segmentation used in `facets::procSample()`.
`only_oncogenes`	if `TRUE`, only known oncogenes are kept for circular prediction.
`snp_file`	a file path to SNP file of genome, should be consistent with `genome_build` option.
`outdir`	result output path.
`result_file_prefix`	file name prefix (without directory path) for storing the final model prediction file in CSV format. By default, a unique file name is generated with a UUID.
`util_exe`	the path to `snp-pileup`.
`nthreads`	The number of parallel processes for getting allele counts (optional, default=1).
`skip_finished_facets`	if `TRUE`, skip finished FACETS runs.
`skip_facets_call`	if `TRUE`, skip calling FACETS. This is useful when you have done this step and just want to run next steps.

Details

To build the snp-pileup program, reference commands are given here. You need to modify the corresponding paths to fit your own machine.

cd /data3/wsx/R/x86_64-pc-linux-gnu-library/4.2/facets/extcode/
g++ -std=c++11 -I/data3/wsx/miniconda3/envs/circlemap/include snp-pileup.cpp -L/data3/wsx/miniconda3/envs/circlemap/lib -lhts -Wl,-rpath=/data3/wsx/miniconda3/envs/circlemap/lib -o snp-pileup

Value

a list of invisible data.table and corresponding files saved to local machine.

GCAP sequenza workflow for gene-level amplicon prediction

Description

GCAP sequenza workflow for gene-level amplicon prediction

Usage

gcap.workflow.seqz(
  tumourseqfile,
  normalseqfile,
  jobname,
  extra_info = NULL,
  include_type = FALSE,
  genome_build = c("mm10", "hg38", "hg19"),
  model = "XGB11",
  tightness = 1L,
  gap_cn = 3L,
  overlap = 1,
  only_oncogenes = FALSE,
  ref_file = "path/to/reference.fa",
  data_tmp_dir = "~/gcap_data",
  outdir = getwd(),
  result_file_prefix = paste0("gcap_", uuid::UUIDgenerate(TRUE)),
  util_exe = "~/miniconda3/bin/sequenza-utils",
  samtools_exe = "~/miniconda3/bin/samtools",
  tabix_exe = "~/miniconda3/bin/tabix",
  nthreads = 1,
  skip_finished_sequenza = TRUE,
  skip_sequenza_call = FALSE
)

Arguments

`tumourseqfile`	Full path to the tumour BAM file.
`normalseqfile`	Full path to the normal BAM file.
`jobname`	job name, typically a unique name for a tumor-normal pair.
`extra_info`	(optional) a (file containing a) `data.frame` with 3 columns: 'sample' (must be identical to the setting of parameter `jobname`), 'age' and 'gender'. Gender should be 'XX' or 'XY'; `0` for 'XX' and `1` for 'XY' are also accepted.
`include_type`	if `TRUE`, a fourth column named 'type' should be included in `extra_info`; the supported cancer types should be given as TCGA cancer type abbreviations.
`genome_build`	genome build version, should be one of 'hg38', 'hg19' and 'mm10'.
`model`	model name ("XGB11", "XGB32", "XGB56") or a custom model from input. 'toy' can be used for testing.
`tightness`	a coefficient multiplied with the TCGA somatic CN to set a stricter threshold for calling a circular amplicon. The larger the value, the more likely an fCNA is assigned to `noncircular` instead of `circular`. When it is `NA`, TCGA somatic CN data is not used as a reference.
`gap_cn`	a gap copy number value. A gene with copy number above background (`ploidy + gap_cn` in general) is treated as a focal amplicon. A smaller value yields more amplicons.
`overlap`	the overlap percentage on gene.
`only_oncogenes`	if `TRUE`, only known oncogenes are kept for circular prediction.
`ref_file`	a reference genome file, should be consistent with `genome_build` option.
`data_tmp_dir`	a directory path for storing temp data for reuse in handling multiple samples.
`outdir`	result output path.
`result_file_prefix`	file name prefix (without directory path) for storing the final model prediction file in CSV format. By default, a unique file name is generated from a UUID.
`util_exe`	the path to `sequenza-utils`.
`samtools_exe`	the path to `samtools`.
`tabix_exe`	the path to `tabix`.
`nthreads`	The number of parallel processes for getting allele counts (optional, default=1).
`skip_finished_sequenza`	if `TRUE`, skip finished sequenza runs.
`skip_sequenza_call`	if `TRUE`, skip calling sequenza. This is useful when you have done this step and just want to run next steps.

Value

A list of `data.table`s, returned invisibly, with the corresponding files saved to the local machine.
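
The `extra_info` and `gap_cn` arguments described above can be illustrated with a small base-R sketch. The job name, age, ploidy and copy number values below are hypothetical, and the comparison only mirrors the `ploidy + gap_cn` rule of thumb stated in the argument description; the package's internal thresholding (for example the effect of `tightness`) is not reproduced here.

```r
# Hypothetical job name for one tumor-normal pair
jobname <- "sample01"

# extra_info: columns 'sample', 'age' and 'gender' as described above;
# 'sample' must match the jobname
extra_info <- data.frame(
  sample = jobname,
  age = 60,
  gender = "XY",
  stringsAsFactors = FALSE
)

# Basic sanity checks before passing extra_info to a workflow call
stopifnot(
  identical(extra_info$sample, jobname),
  all(extra_info$gender %in% c("XX", "XY", 0, 1))
)

# gap_cn arithmetic: background is taken as ploidy + gap_cn (illustrative values)
ploidy <- 2
gap_cn <- 3L
total_cn <- c(geneA = 4, geneB = 6, geneC = 12)

# Genes whose total copy number lies above the background
above_background <- total_cn > ploidy + gap_cn
above_background
```

With these values, lowering `gap_cn` from 3 to 1 would also flag `geneA`, which is the sense in which a smaller gap yields more amplicons.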

Get AUC value

Description

Get AUC value

Usage

get_auc(y_pred, y, type = c("pr", "roc"), curve = FALSE)

Arguments

`y_pred`	y prediction vector.
`y`	y true label vector.
`type`	AUC type, either 'pr' or 'roc'.
`curve`	if `TRUE`, generate plot data, the result can be plotted by `plot()`.

Value

An object.

Examples

if (require("PRROC")) {
  set.seed(2021)
  auc <- get_auc(sample(1:10, 10), c(rep(0, 5), rep(1, 5)))
  auc
}
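
For readers unfamiliar with the 'roc' AUC type, the following base-R sketch computes the area under the ROC curve from the rank-sum (Mann-Whitney) statistic. It is an independent illustration of the quantity, not the implementation used by `get_auc()`; the helper name `roc_auc` is hypothetical.

```r
# ROC AUC via the rank-sum statistic: the probability that a randomly chosen
# positive receives a higher score than a randomly chosen negative
roc_auc <- function(y_pred, y) {
  pos <- y == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(y_pred) # ties receive average ranks
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 1, 1)
auc_perfect <- roc_auc(c(0.1, 0.2, 0.8, 0.9), labels) # positives ranked on top
auc_partial <- roc_auc(c(0.1, 0.8, 0.2, 0.9), labels) # one positive below a negative
auc_perfect
auc_partial
```

The 'pr' type (area under the precision-recall curve) has no equally compact closed form, so it is left out of this sketch.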

Merge a list of data.table

Description

Merge a list of data.table

Usage

mergeDTs(dt_list, by = NULL, sort = FALSE)

Arguments

`dt_list`	a list of `data.table`s.
`by`	the column(s) used for merging.
`sort`	whether to sort the result.

Value

a data.table
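
A common way to merge a list of tables is to fold `merge()` over the list with `Reduce()`. The sketch below shows that general pattern with the data.table package; the helper name `merge_all`, the use of a full outer join (`all = TRUE`) and the example tables are assumptions for illustration, not a statement of how `mergeDTs()` is implemented.

```r
library(data.table)

# Three small tables sharing an 'id' column (illustrative data)
dt_list <- list(
  data.table(id = c("a", "b"), x = 1:2),
  data.table(id = c("b", "c"), y = 3:4),
  data.table(id = c("a", "c"), z = 5:6)
)

# Fold merge() over the list; all = TRUE keeps ids missing from some tables
merge_all <- function(dts, by, sort = FALSE) {
  Reduce(function(x, y) merge(x, y, by = by, all = TRUE, sort = sort), dts)
}

merged <- merge_all(dt_list, by = "id")
merged
```

Ids absent from a given table receive `NA` in that table's columns.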

Oncogene list

Description

Oncogene list

Format

A data.frame

Source

Generated from `data-raw/`; raw source: http://ongene.bioinfo-minzhao.org/

Examples

data("oncogenes")

Get overlaps of two genomic regions

Description

Get overlaps of two genomic regions

Usage

overlaps(x, y)

Arguments

`x, y`	a genomic region in `data.frame` format; the first 3 columns should represent chromosome, start and end position.

Value

a data.table
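
To make the expected input layout concrete, the sketch below finds overlaps between two region tables in base R. It is an independent illustration, not the implementation of `overlaps()`: the helper name `find_overlaps`, the output column names and the assumption of 1-based inclusive coordinates are all hypothetical.

```r
# Pair up regions on the same chromosome, keep the intersecting pairs and
# report the fraction of each x region that is covered
find_overlaps <- function(x, y) {
  # Only the first 3 columns are used: chromosome, start, end
  x <- setNames(as.data.frame(x)[, 1:3], c("chr", "start", "end"))
  y <- setNames(as.data.frame(y)[, 1:3], c("chr", "start", "end"))
  pairs <- merge(x, y, by = "chr", suffixes = c(".x", ".y"))
  hit <- pairs$start.x <= pairs$end.y & pairs$start.y <= pairs$end.x
  pairs <- pairs[hit, , drop = FALSE]
  pairs$ov_start <- pmax(pairs$start.x, pairs$start.y)
  pairs$ov_end <- pmin(pairs$end.x, pairs$end.y)
  # Inclusive coordinates assumed, hence the + 1
  pairs$frac_x <- (pairs$ov_end - pairs$ov_start + 1) /
    (pairs$end.x - pairs$start.x + 1)
  pairs
}

genes <- data.frame(
  chr = c("chr1", "chr1", "chr2"),
  start = c(100, 500, 100),
  end = c(199, 599, 199)
)
segs <- data.frame(
  chr = c("chr1", "chr2"),
  start = c(150, 300),
  end = c(524, 400)
)

ov <- find_overlaps(genes, segs)
ov
```

The pairwise merge is quadratic per chromosome, which is fine for a demonstration; interval-join tools such as `data.table::foverlaps()` are the usual choice for large inputs.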
