Title: | Gene-level Circular Amplicon Prediction |
---|---|
Description: | Provides data processing pipeline feeding paired bam files (or allele-specific copy number data) and XGBOOST model for predicting tumor circular amplicons (also known as ecDNA) in gene level. |
Authors: | Shixiang Wang [aut, cre] |
Maintainer: | Shixiang Wang <[email protected]> |
License: | Non-Commercial Academic License + file LICENSE |
Version: | 1.2.0 |
Built: | 2024-11-16 03:20:07 UTC |
Source: | https://github.com/ShixiangWang/gcap |
Example allele specific copy number (ASCN) data
A data.frame
Generate from data-raw/
, raw source from our study by
calling ASCAT v3.0 alpha on corresponding WES sequencing data.
data("ascn")
data("ascn")
Only should be used in Unix-like system.
For details of the arguments passing to CLI, please check gcap.workflow()
and gcap.ASCNworkflow()
.
deploy(target = "/usr/local/bin")
deploy(target = "/usr/local/bin")
target |
the target path to deploy the CLI. |
Nothing.
Example ecDNA training data
A data.table
Generate from data-raw/
data("ec")
data("ec")
Contains fields storing data and methods to get, process and visualize
fCNA information. Examples please see gcap.ASCNworkflow()
.
data
a data.table
storing fCNA list, which typically contains following columns:
sample
sample or case ID.
band
chromosome cytoband.
gene_id
gene ID, typically Ensembl ID. You can convert the ID with R package IDConverter
.
total_cn
total copy number value.
minor_cn
copy number value for minor allele.
prob
the probability the gene located in circular DNA.
gene_class
gene level amplicon classification.
sample_summary
a data.table
storing sample summary data, which typically contains
at least the following columns:
sample
sample or case ID. Should only include cases have been called with GCAP workflow,
otherwise the extra cases would be automatically classified as 'nofocal' (i.e. NA
in sample_summary
field) class.
purity
, ploidy
for tumor purity or ploidy.
AScore
aneuploidy score.
pLOH
genome percentage harboring LOH events.
CN1 ... CN19
activity of copy number signatures.
class
the sample class based on amplicon type.
ec_genes
number of genes predicted as located on circular DNA.
ec_possibly_genes
same with ec_genes
but with less confidence.
ec_cytobands
number of cytobands predicted as located on circular DNA.
(the regions of ec_possibly_genes
are not included in computation)
min_prob
check $new()
method for details. If you updated this value,
a function will be called to update the sample summary.
new()
Create a fCNA
object.
Typically, you can obtain this object from gcap.workflow()
or gcap.ASCNworkflow()
.
fCNA$new( fcna, pdata = fcna[, "sample", drop = FALSE], min_prob = 0.6, only_oncogenes = FALSE, genome_build = c("hg38", "hg19", "mm10") )
fcna
a data.frame
storing focal copy number amplicon list.
pdata
a data.frame
storing phenotype or sample-level related data. (Optional)
min_prob
the minimal aggregated (in cytoband level) probability to determine a circular amplicon.
only_oncogenes
only_oncogenes if TRUE
, only known oncogenes are kept for circular prediction.
genome_build
genome version
subset()
Return a subset fCNA
object
fCNA$subset(..., on = c("data", "sample_summary"))
...
subset expressions on fCNA$data
or fCNA$sample_summary
.
on
if it is "data", subset operations are on data field of fCNA
object,
same for "sample_summary".
a fCNA
getSampleSummary()
Get sample summary of fCNA
fCNA$getSampleSummary( only_oncogenes = FALSE, genome_build = c("hg38", "hg19", "mm10") )
only_oncogenes
only_oncogenes if TRUE
, only known oncogenes are kept for circular prediction.
genome_build
genome version.
a data.table
getGeneSummary()
Get gene level summary of fCNA type
fCNA$getGeneSummary(return_mat = FALSE)
return_mat
if TRUE
, return a cytoband by sample matrix instead of a summary.
a data.table
or a matrix
.
getCytobandSummary()
Get cytoband level summary of fCNA type
fCNA$getCytobandSummary(unique = FALSE, return_mat = FALSE)
unique
if TRUE
, count sample frequency instead of gene frequency.
return_mat
if TRUE
, return a cytoband by sample matrix instead of a summary.
a data.table
saveToFiles()
Save the key data to local files
fCNA$saveToFiles(dirpath, fileprefix = "fCNA")
dirpath
directory path storing output files.
fileprefix
file prefix. Two result files shall be generated.
convertGeneID()
Convert Gene IDs between Ensembl and Hugo Symbol System
fCNA$convertGeneID( type = c("ensembl", "symbol"), genome_build = c("hg38", "hg19", "mm10") )
type
type of input IDs, could be 'ensembl' or 'symbol'.
genome_build
reference genome build.
print()
print the fCNA object
fCNA$print(...)
...
unused.
Unlike gcap.workflow, this function directly uses the allele-specific copy number data along with some extra sample information to infer ecDNA genes.
gcap.ASCNworkflow( data, genome_build = c("hg38", "hg19"), model = "XGB11", tightness = 1L, gap_cn = 3L, overlap = 1, only_oncogenes = FALSE, outdir = getwd(), result_file_prefix = paste0("gcap_", uuid::UUIDgenerate(TRUE)) )
gcap.ASCNworkflow( data, genome_build = c("hg38", "hg19"), model = "XGB11", tightness = 1L, gap_cn = 3L, overlap = 1, only_oncogenes = FALSE, outdir = getwd(), result_file_prefix = paste0("gcap_", uuid::UUIDgenerate(TRUE)) )
data |
a
|
genome_build |
"hg38" or "hg19". |
model |
model name ("XGB11", "XGB32", "XGB56") or a custom model from input. 'toy' can be used for test. |
tightness |
a coefficient to times to TCGA somatic CN to set a more strict threshold
as a circular amplicon.
If the value is larger, it is more likely a fCNA assigned to |
gap_cn |
a gap copy number value.
A gene with copy number above background ( |
overlap |
the overlap percentage on gene. |
only_oncogenes |
if |
outdir |
result output path. |
result_file_prefix |
file name prefix (without directory path) for storing final model prediction file in CSV format. Default a unique file name is generated by UUID approach. |
a list of invisible data.table
and corresponding files saved to local machine.
data("ascn") data <- ascn rv <- gcap.ASCNworkflow(data, outdir = tempdir(), model = "XGB11") data$purity <- 1 rv2 <- gcap.ASCNworkflow(data, outdir = tempdir(), model = "XGB11") data$age <- 60 data$gender <- "XY" rv3 <- gcap.ASCNworkflow(data, outdir = tempdir(), model = "XGB32") # If you want to use 'XGB56', you should include 'type' column data$type <- "LUAD" rv4 <- gcap.ASCNworkflow(data, outdir = tempdir(), model = "XGB56") # If you only have total integer copy number data$minor_cn <- NA rv5 <- gcap.ASCNworkflow(data, outdir = tempdir(), model = "XGB11") # R6 class fCNA -------------------------------- print(rv) print(rv$data) print(rv$sample_summary) print(rv$gene_summary) print(rv$cytoband_summary) # Create a subset fCNA rv_subset <- rv$subset(total_cn > 10) nrow(rv$data) nrow(rv_subset$data) rv_subset2 <- rv$subset(sample == "TCGA-02-2485-01") nrow(rv_subset2$data) unique(rv_subset2$data$sample) sum_gene <- rv$getGeneSummary() sum_gene mat_gene <- rv$getGeneSummary(return_mat = TRUE) mat_gene sum_cytoband <- rv$getCytobandSummary() sum_cytoband mat_cytoband <- rv$getCytobandSummary(return_mat = TRUE) mat_cytoband
data("ascn") data <- ascn rv <- gcap.ASCNworkflow(data, outdir = tempdir(), model = "XGB11") data$purity <- 1 rv2 <- gcap.ASCNworkflow(data, outdir = tempdir(), model = "XGB11") data$age <- 60 data$gender <- "XY" rv3 <- gcap.ASCNworkflow(data, outdir = tempdir(), model = "XGB32") # If you want to use 'XGB56', you should include 'type' column data$type <- "LUAD" rv4 <- gcap.ASCNworkflow(data, outdir = tempdir(), model = "XGB56") # If you only have total integer copy number data$minor_cn <- NA rv5 <- gcap.ASCNworkflow(data, outdir = tempdir(), model = "XGB11") # R6 class fCNA -------------------------------- print(rv) print(rv$data) print(rv$sample_summary) print(rv$gene_summary) print(rv$cytoband_summary) # Create a subset fCNA rv_subset <- rv$subset(total_cn > 10) nrow(rv$data) nrow(rv_subset$data) rv_subset2 <- rv$subset(sample == "TCGA-02-2485-01") nrow(rv_subset2$data) unique(rv_subset2$data$sample) sum_gene <- rv$getGeneSummary() sum_gene mat_gene <- rv$getGeneSummary(return_mat = TRUE) mat_gene sum_cytoband <- rv$getCytobandSummary() sum_cytoband mat_cytoband <- rv$getCytobandSummary(return_mat = TRUE) mat_cytoband
Generate unified gene-level feature data
gcap.collapse2Genes( fts, extra_info = NULL, include_type = FALSE, fix_type = TRUE, genome_build = c("hg38", "hg19", "mm10"), overlap = 1 )
gcap.collapse2Genes( fts, extra_info = NULL, include_type = FALSE, fix_type = TRUE, genome_build = c("hg38", "hg19", "mm10"), overlap = 1 )
fts |
(modified) result from |
extra_info |
(optional) a |
include_type |
if |
fix_type |
default is |
genome_build |
genome build version, should be one of 'hg38', 'hg19'. |
overlap |
the overlap percentage on gene. |
a data.table
.
Extract sample and region level features
gcap.extractFeatures( ascat_files, genome_build = c("hg38", "hg19", "mm10"), ascn_data = NULL )
gcap.extractFeatures( ascat_files, genome_build = c("hg38", "hg19", "mm10"), ascn_data = NULL )
ascat_files |
a list of file path.
Typically the result of |
genome_build |
genome build version, should be one of 'hg38', 'hg19'. |
ascn_data |
if |
a list
.
A wrapper calling ASCAT on WES data on one or more tumor(-normal paired) bam data. Note, for multiple tumor-normal pairs, the first 5 arguments should be a vector with same length.
gcap.runASCAT( tumourseqfile, normalseqfile = NA_character_, tumourname, normalname = NA_character_, jobname = tumourname, outdir = getwd(), allelecounter_exe = "~/miniconda3/envs/cancerit/bin/alleleCounter", g1000allelesprefix = file.path("~/data/snp/1000G_loci_hg38", "1kg.phase3.v5a_GRCh38nounref_allele_index_chr"), g1000lociprefix = file.path("~/data/snp/1000G_loci_hg38", "1kg.phase3.v5a_GRCh38nounref_loci_chrstring_chr"), GCcontentfile = "~/data/snp/GC_correction_hg38.txt", replictimingfile = "~/data/snp/RT_correction_hg38.txt", nthreads = 22, minCounts = 10, BED_file = NA, probloci_file = NA, chrom_names = 1:22, gender = "XX", min_base_qual = 20, min_map_qual = 35, penalty = 70, genome_build = "hg38", skip_finished_ASCAT = FALSE )
gcap.runASCAT( tumourseqfile, normalseqfile = NA_character_, tumourname, normalname = NA_character_, jobname = tumourname, outdir = getwd(), allelecounter_exe = "~/miniconda3/envs/cancerit/bin/alleleCounter", g1000allelesprefix = file.path("~/data/snp/1000G_loci_hg38", "1kg.phase3.v5a_GRCh38nounref_allele_index_chr"), g1000lociprefix = file.path("~/data/snp/1000G_loci_hg38", "1kg.phase3.v5a_GRCh38nounref_loci_chrstring_chr"), GCcontentfile = "~/data/snp/GC_correction_hg38.txt", replictimingfile = "~/data/snp/RT_correction_hg38.txt", nthreads = 22, minCounts = 10, BED_file = NA, probloci_file = NA, chrom_names = 1:22, gender = "XX", min_base_qual = 20, min_map_qual = 35, penalty = 70, genome_build = "hg38", skip_finished_ASCAT = FALSE )
tumourseqfile |
Full path to the tumour BAM file. |
normalseqfile |
Full path to the normal BAM file. |
tumourname |
Identifier to be used for tumour output files. |
normalname |
Identifier to be used for normal output files. |
jobname |
job name, typically an unique name for a tumor-normal pair. |
outdir |
result output path. |
allelecounter_exe |
Path to the allele counter executable. |
g1000allelesprefix |
Prefix path to the allele data (e.g. "G1000_alleles_chr"). |
g1000lociprefix |
Prefix path to the loci data (e.g. "G1000_loci_chr"). |
GCcontentfile |
File containing the GC content around every SNP for increasing window sizes. |
replictimingfile |
File containing replication timing at every SNP for various cell lines. |
nthreads |
The number of parallel processes for getting allele counts (optional, default=1). |
minCounts |
Minimum depth required in the normal for a SNP to be considered (optional, default=10). |
BED_file |
A BED file for only looking at SNPs within specific intervals (optional, default=NA). |
probloci_file |
A file (chromosome <tab> position; no header) containing specific loci to ignore (optional, default=NA). |
chrom_names |
A vector containing the names of chromosomes to be considered (optional, default=1:22). |
gender |
a vector of gender for each cases ("XX" or "XY"). Default = all female ("XX"). Ignore this if you don't include sex chromosomes. |
min_base_qual |
Minimum base quality required for a read to be counted (optional, default=20). |
min_map_qual |
Minimum mapping quality required for a read to be counted (optional, default=35). |
penalty |
penalty of introducing an additional ASPCF breakpoint (expert parameter, don't adapt unless you know what you're doing) |
genome_build |
"hg38" or "hg19". |
skip_finished_ASCAT |
if |
Nothing. Check the outdir
for results.
This is is a wrapper of gcap.extractFeatures()
and gcap.collapse2Genes()
to combine the feature extraction
and predict input generate procedure.
If you want to modify the result of gcap.extractFeatures()
,
you should always use the two functions instead of this
wrapper.
gcap.runASCNBuildflow(data, genome_build = c("hg38", "hg19"), overlap = 1)
gcap.runASCNBuildflow(data, genome_build = c("hg38", "hg19"), overlap = 1)
data |
a
|
genome_build |
"hg38" or "hg19". |
overlap |
the overlap percentage on gene. |
a data.table
.
This is is a wrapper of gcap.extractFeatures()
and gcap.collapse2Genes()
to combine the feature extraction
and predict input generate procedure.
If you want to modify the result of gcap.extractFeatures()
,
you should always use the two functions instead of this
wrapper.
gcap.runBuildflow( ascat_files, extra_info, include_type = FALSE, genome_build = c("hg38", "hg19", "mm10"), overlap = 1 )
gcap.runBuildflow( ascat_files, extra_info, include_type = FALSE, genome_build = c("hg38", "hg19", "mm10"), overlap = 1 )
ascat_files |
a list of file path.
Typically the result of |
extra_info |
(optional) a |
include_type |
if |
genome_build |
genome build version, should be one of 'hg38', 'hg19'. |
overlap |
the overlap percentage on gene. |
a data.table
.
Run gene-level circular prediction
gcap.runPrediction(data, model = "XGB11")
gcap.runPrediction(data, model = "XGB11")
data |
data to predict ( |
model |
model name ("XGB11", "XGB32", "XGB56") or a custom model from input. 'toy' can be used for test. |
a numeric vector representing prob.
data("ec") # Use toy model for illustration y_pred <- gcap.runPrediction(ec, "toy") y_pred
data("ec") # Use toy model for illustration y_pred <- gcap.runPrediction(ec, "toy") y_pred
Summarize prediction result into gene/sample-level
gcap.runScoring( data, genome_build = "hg38", min_prob = 0.6, tightness = 1L, gap_cn = 3L, only_oncogenes = FALSE )
gcap.runScoring( data, genome_build = "hg38", min_prob = 0.6, tightness = 1L, gap_cn = 3L, only_oncogenes = FALSE )
data |
a |
genome_build |
genome build version, should be one of 'hg38', 'hg19'. |
min_prob |
the minimal aggregated (in cytoband level) probability to determine a circular amplicon. The default value is for the balance of recall and precision. We highly recomment set it to 0.95 or larger if you want to detect solid positive cases (for experimental validation etc.) instead of subtyping cases. |
tightness |
a coefficient to times to TCGA somatic CN to set a more strict threshold
as a circular amplicon.
If the value is larger, it is more likely a fCNA assigned to |
gap_cn |
a gap copy number value.
A gene with copy number above background ( |
only_oncogenes |
if |
a list of data.table
.
data("ec") ec2 <- ec ec2$prob <- gcap.runPrediction(ec) score <- gcap.runScoring(ec2) score
data("ec") ec2 <- ec ec2$prob <- gcap.runPrediction(ec) score <- gcap.runScoring(ec2) score
GCAP workflow for gene-level amplicon prediction
gcap.workflow( tumourseqfile, normalseqfile, tumourname, normalname, jobname = tumourname, extra_info = NULL, include_type = FALSE, genome_build = c("hg38", "hg19"), model = "XGB11", tightness = 1L, gap_cn = 3L, overlap = 1, only_oncogenes = FALSE, outdir = getwd(), result_file_prefix = paste0("gcap_", uuid::UUIDgenerate(TRUE)), allelecounter_exe = "~/miniconda3/envs/cancerit/bin/alleleCounter", g1000allelesprefix = file.path("~/data/snp/1000G_loci_hg38", "1kg.phase3.v5a_GRCh38nounref_allele_index_chr"), g1000lociprefix = file.path("~/data/snp/1000G_loci_hg38", "1kg.phase3.v5a_GRCh38nounref_loci_chrstring_chr"), GCcontentfile = "~/data/snp/GC_correction_hg38.txt", replictimingfile = "~/data/snp/RT_correction_hg38.txt", nthreads = 22, minCounts = 10, BED_file = NA, probloci_file = NA, chrom_names = 1:22, min_base_qual = 20, min_map_qual = 35, penalty = 70, skip_finished_ASCAT = TRUE, skip_ascat_call = FALSE )
gcap.workflow( tumourseqfile, normalseqfile, tumourname, normalname, jobname = tumourname, extra_info = NULL, include_type = FALSE, genome_build = c("hg38", "hg19"), model = "XGB11", tightness = 1L, gap_cn = 3L, overlap = 1, only_oncogenes = FALSE, outdir = getwd(), result_file_prefix = paste0("gcap_", uuid::UUIDgenerate(TRUE)), allelecounter_exe = "~/miniconda3/envs/cancerit/bin/alleleCounter", g1000allelesprefix = file.path("~/data/snp/1000G_loci_hg38", "1kg.phase3.v5a_GRCh38nounref_allele_index_chr"), g1000lociprefix = file.path("~/data/snp/1000G_loci_hg38", "1kg.phase3.v5a_GRCh38nounref_loci_chrstring_chr"), GCcontentfile = "~/data/snp/GC_correction_hg38.txt", replictimingfile = "~/data/snp/RT_correction_hg38.txt", nthreads = 22, minCounts = 10, BED_file = NA, probloci_file = NA, chrom_names = 1:22, min_base_qual = 20, min_map_qual = 35, penalty = 70, skip_finished_ASCAT = TRUE, skip_ascat_call = FALSE )
tumourseqfile |
Full path to the tumour BAM file. |
normalseqfile |
Full path to the normal BAM file. |
tumourname |
Identifier to be used for tumour output files. |
normalname |
Identifier to be used for normal output files. |
jobname |
job name, typically an unique name for a tumor-normal pair. |
extra_info |
(optional) a (file containing) |
include_type |
if |
genome_build |
"hg38" or "hg19". |
model |
model name ("XGB11", "XGB32", "XGB56") or a custom model from input. 'toy' can be used for test. |
tightness |
a coefficient to times to TCGA somatic CN to set a more strict threshold
as a circular amplicon.
If the value is larger, it is more likely a fCNA assigned to |
gap_cn |
a gap copy number value.
A gene with copy number above background ( |
overlap |
the overlap percentage on gene. |
only_oncogenes |
if |
outdir |
result output path. |
result_file_prefix |
file name prefix (without directory path) for storing final model prediction file in CSV format. Default a unique file name is generated by UUID approach. |
allelecounter_exe |
Path to the allele counter executable. |
g1000allelesprefix |
Prefix path to the allele data (e.g. "G1000_alleles_chr"). |
g1000lociprefix |
Prefix path to the loci data (e.g. "G1000_loci_chr"). |
GCcontentfile |
File containing the GC content around every SNP for increasing window sizes. |
replictimingfile |
File containing replication timing at every SNP for various cell lines. |
nthreads |
The number of parallel processes for getting allele counts (optional, default=1). |
minCounts |
Minimum depth required in the normal for a SNP to be considered (optional, default=10). |
BED_file |
A BED file for only looking at SNPs within specific intervals (optional, default=NA). |
probloci_file |
A file (chromosome <tab> position; no header) containing specific loci to ignore (optional, default=NA). |
chrom_names |
A vector containing the names of chromosomes to be considered (optional, default=1:22). |
min_base_qual |
Minimum base quality required for a read to be counted (optional, default=20). |
min_map_qual |
Minimum mapping quality required for a read to be counted (optional, default=35). |
penalty |
penalty of introducing an additional ASPCF breakpoint (expert parameter, don't adapt unless you know what you're doing) |
skip_finished_ASCAT |
if |
skip_ascat_call |
if |
a list of invisible data.table
and corresponding files saved to local machine.
GCAP FACETS workflow for gene-level amplicon prediction
gcap.workflow.facets( tumourseqfile, normalseqfile, jobname, extra_info = NULL, include_type = FALSE, genome_build = c("mm10", "hg38", "hg19"), model = "XGB11", tightness = 1L, gap_cn = 3L, overlap = 1, pro_cval = 100, only_oncogenes = FALSE, snp_file = "path/to/genome_build_responding.vcf.gz", outdir = getwd(), result_file_prefix = paste0("gcap_", uuid::UUIDgenerate(TRUE)), util_exe = system.file("extcode", "snp-pileup", package = "facets"), nthreads = 1, skip_finished_facets = TRUE, skip_facets_call = FALSE )
gcap.workflow.facets( tumourseqfile, normalseqfile, jobname, extra_info = NULL, include_type = FALSE, genome_build = c("mm10", "hg38", "hg19"), model = "XGB11", tightness = 1L, gap_cn = 3L, overlap = 1, pro_cval = 100, only_oncogenes = FALSE, snp_file = "path/to/genome_build_responding.vcf.gz", outdir = getwd(), result_file_prefix = paste0("gcap_", uuid::UUIDgenerate(TRUE)), util_exe = system.file("extcode", "snp-pileup", package = "facets"), nthreads = 1, skip_finished_facets = TRUE, skip_facets_call = FALSE )
tumourseqfile |
Full path to the tumour BAM file. |
normalseqfile |
Full path to the normal BAM file. |
jobname |
job name, typically an unique name for a tumor-normal pair. |
extra_info |
(optional) a (file containing) |
include_type |
if |
genome_build |
genome build version, should be one of 'hg38', 'hg19' and 'mm10'. |
model |
model name ("XGB11", "XGB32", "XGB56") or a custom model from input. 'toy' can be used for test. |
tightness |
a coefficient to times to TCGA somatic CN to set a more strict threshold
as a circular amplicon.
If the value is larger, it is more likely a fCNA assigned to |
gap_cn |
a gap copy number value.
A gene with copy number above background ( |
overlap |
the overlap percentage on gene. |
pro_cval |
critical value for segmentation used in |
only_oncogenes |
if |
snp_file |
a file path to SNP file of genome, should be consistent with |
outdir |
result output path. |
result_file_prefix |
file name prefix (without directory path) for storing final model prediction file in CSV format. Default a unique file name is generated by UUID approach. |
util_exe |
the path to |
nthreads |
The number of parallel processes for getting allele counts (optional, default=1). |
skip_finished_facets |
if |
skip_facets_call |
if |
For generating the snp-pileup
program, reference commands given here.
You need modify corresponding path to fit your own machine.
cd /data3/wsx/R/x86_64-pc-linux-gnu-library/4.2/facets/extcode/ g++ -std=c++11 -I/data3/wsx/miniconda3/envs/circlemap/include snp-pileup.cpp -L/data3/wsx/miniconda3/envs/circlemap/lib -lhts -Wl,-rpath=/data3/wsx/miniconda3/envs/circlemap/lib -o snp-pileup
a list of invisible data.table
and corresponding files saved to local machine.
GCAP sequenza workflow for gene-level amplicon prediction
gcap.workflow.seqz( tumourseqfile, normalseqfile, jobname, extra_info = NULL, include_type = FALSE, genome_build = c("mm10", "hg38", "hg19"), model = "XGB11", tightness = 1L, gap_cn = 3L, overlap = 1, only_oncogenes = FALSE, ref_file = "path/to/reference.fa", data_tmp_dir = "~/gcap_data", outdir = getwd(), result_file_prefix = paste0("gcap_", uuid::UUIDgenerate(TRUE)), util_exe = "~/miniconda3/bin/sequenza-utils", samtools_exe = "~/miniconda3/bin/samtools", tabix_exe = "~/miniconda3/bin/tabix", nthreads = 1, skip_finished_sequenza = TRUE, skip_sequenza_call = FALSE )
gcap.workflow.seqz( tumourseqfile, normalseqfile, jobname, extra_info = NULL, include_type = FALSE, genome_build = c("mm10", "hg38", "hg19"), model = "XGB11", tightness = 1L, gap_cn = 3L, overlap = 1, only_oncogenes = FALSE, ref_file = "path/to/reference.fa", data_tmp_dir = "~/gcap_data", outdir = getwd(), result_file_prefix = paste0("gcap_", uuid::UUIDgenerate(TRUE)), util_exe = "~/miniconda3/bin/sequenza-utils", samtools_exe = "~/miniconda3/bin/samtools", tabix_exe = "~/miniconda3/bin/tabix", nthreads = 1, skip_finished_sequenza = TRUE, skip_sequenza_call = FALSE )
tumourseqfile |
Full path to the tumour BAM file. |
normalseqfile |
Full path to the normal BAM file. |
jobname |
job name, typically an unique name for a tumor-normal pair. |
extra_info |
(optional) a (file containing) |
include_type |
if |
genome_build |
genome build version, should be one of 'hg38', 'hg19' and 'mm10'. |
model |
model name ("XGB11", "XGB32", "XGB56") or a custom model from input. 'toy' can be used for test. |
tightness |
a coefficient to times to TCGA somatic CN to set a more strict threshold
as a circular amplicon.
If the value is larger, it is more likely a fCNA assigned to |
gap_cn |
a gap copy number value.
A gene with copy number above background ( |
overlap |
the overlap percentage on gene. |
only_oncogenes |
if |
ref_file |
a reference genome file, should be consistent with |
data_tmp_dir |
a directory path for storing temp data for reuse in handling multiple samples. |
outdir |
result output path. |
result_file_prefix |
file name prefix (without directory path) for storing final model prediction file in CSV format. Default a unique file name is generated by UUID approach. |
util_exe |
the path to |
samtools_exe |
the path to |
tabix_exe |
the path to |
nthreads |
The number of parallel processes for getting allele counts (optional, default=1). |
skip_finished_sequenza |
if |
skip_sequenza_call |
if |
a list of invisible data.table
and corresponding files saved to local machine.
Get AUC value
get_auc(y_pred, y, type = c("pr", "roc"), curve = FALSE)
get_auc(y_pred, y, type = c("pr", "roc"), curve = FALSE)
y_pred |
y prediction vector. |
y |
y true label vector. |
type |
AUC type, either 'pr' or 'roc'. |
curve |
if |
A object.
if (require("PRROC")) { set.seed(2021) auc <- get_auc(sample(1:10, 10), c(rep(0, 5), rep(1, 5))) auc }
if (require("PRROC")) { set.seed(2021) auc <- get_auc(sample(1:10, 10), c(rep(0, 5), rep(1, 5))) auc }
Merge a list of data.table
mergeDTs(dt_list, by = NULL, sort = FALSE)
mergeDTs(dt_list, by = NULL, sort = FALSE)
dt_list |
a list of |
by |
which column used for merging. |
sort |
should sort the result? |
a data.table
Oncogene list
A data.frame
Generate from data-raw/
, raw source from http://ongene.bioinfo-minzhao.org/
data("oncogenes")
data("oncogenes")
Get overlaps of two genomic regions
overlaps(x, y)
overlaps(x, y)
x , y
|
a genemic region with data.frame format, the first 3 columns should representing chromosome, start and end position. |
a data.table