Title: | Extract, Analyze and Visualize Mutational Signatures for Genomic Variations |
---|---|
Description: | Genomic alterations including single nucleotide substitution, copy number alteration, etc. are the major force for cancer initialization and development. Due to the specificity of molecular lesions caused by genomic alterations, we can generate characteristic alteration spectra, called 'signature' (Wang, Shixiang, et al. (2021) <DOI:10.1371/journal.pgen.1009557> & Alexandrov, Ludmil B., et al. (2020) <DOI:10.1038/s41586-020-1943-3> & Steele Christopher D., et al. (2022) <DOI:10.1038/s41586-022-04738-6>). This package helps users to extract, analyze and visualize signatures from genomic alteration records, thus providing new insight into cancer study. |
Authors: | Shixiang Wang [aut, cre] , Ziyu Tao [aut] , Huimin Li [aut] , Tao Wu [aut] , Xue-Song Liu [aut, ctb] , Anand Mayakonda [ctb] |
Maintainer: | Shixiang Wang <[email protected]> |
License: | MIT + file LICENSE |
Version: | 2.3.2 |
Built: | 2024-11-11 02:47:12 UTC |
Source: | https://github.com/ShixiangWang/sigminer |
Add Horizontal Arrow with Text Label to a ggplot
add_h_arrow( p, x, y, label = "optimal number", space = 0.01, vjust = 0.3, seg_len = 0.1, arrow_len = unit(2, "mm"), arrow_type = c("closed", "open"), font_size = 5, font_family = c("serif", "sans", "mono"), font_face = c("plain", "bold", "italic") )
add_h_arrow( p, x, y, label = "optimal number", space = 0.01, vjust = 0.3, seg_len = 0.1, arrow_len = unit(2, "mm"), arrow_type = c("closed", "open"), font_size = 5, font_family = c("serif", "sans", "mono"), font_face = c("plain", "bold", "italic") )
p |
a |
x |
position at x axis. |
y |
position at y axis. |
label |
text label. |
space |
a small space between arrow and text. |
vjust |
vertical adjustment, set to 0 to align with the bottom, 0.5 for the middle, and 1 (the default) for the top. |
seg_len |
length of the arrow segment. |
arrow_len |
length of the arrow. |
arrow_type |
type of the arrow. |
font_size |
font size. |
font_family |
font family. |
font_face |
font face. |
a ggplot
object.
Add text labels to a ggplot object, such as the result from show_sig_profile.
add_labels( p, x, y, y_end = NULL, n_label = NULL, labels = NULL, revert_order = FALSE, font_size = 5, font_family = "serif", font_face = c("plain", "bold", "italic"), ... )
add_labels( p, x, y, y_end = NULL, n_label = NULL, labels = NULL, revert_order = FALSE, font_size = 5, font_family = "serif", font_face = c("plain", "bold", "italic"), ... )
p |
a |
x |
position at x axis. |
y |
position at y axis. |
y_end |
end position of y axis when |
n_label |
the number of label, when this is set,
the position of labels at y axis is auto-generated
according to |
labels |
text labels or a |
revert_order |
if |
font_size |
font size. |
font_family |
font family. |
font_face |
font face. |
... |
other parameters passing to ggplot2::annotate. |
a ggplot
object.
# Load mutational signature load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) # Show signature profile p <- show_sig_profile(sig2, mode = "SBS") # Method 1 p1 <- add_labels(p, x = 0.75, y = 0.3, y_end = 0.9, n_label = 3, labels = paste0("text", 1:3) ) p1 # Method 2 p2 <- add_labels(p, x = c(0.15, 0.6, 0.75), y = c(0.3, 0.6, 0.9), labels = paste0("text", 1:3) ) p2 # Method 3 sim <- get_sig_similarity(sig2) p3 <- add_labels(p, x = c(0.15, 0.6, 0.75), y = c(0.25, 0.55, 0.8), labels = sim, font_size = 2 ) p3
# Load mutational signature load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) # Show signature profile p <- show_sig_profile(sig2, mode = "SBS") # Method 1 p1 <- add_labels(p, x = 0.75, y = 0.3, y_end = 0.9, n_label = 3, labels = paste0("text", 1:3) ) p1 # Method 2 p2 <- add_labels(p, x = c(0.15, 0.6, 0.75), y = c(0.3, 0.6, 0.9), labels = paste0("text", 1:3) ) p2 # Method 3 sim <- get_sig_similarity(sig2) p3 <- add_labels(p, x = c(0.15, 0.6, 0.75), y = c(0.25, 0.55, 0.8), labels = sim, font_size = 2 ) p3
These functions are combined to provide a best practice for optimally identifying mutational signatures and attributing their activities (exposures) in tumor samples. They are listed in order to use.
bp_extract_signatures()
for extracting signatures.
bp_show_survey()
for showing measures change under different
signature numbers to help user select optimal signature number.
At default, an aggregated score (named score) is generated to
suggest the best solution.
bp_show_survey2()
for showing simplified signature number survey like
show_sig_number_survey()
.
bp_get_sig_obj()
for get a (list of) Signature
object which is common
used in sigminer for analysis and visualization.
bp_attribute_activity()
for optimizing signature activities (exposures).
NOTE: the activities from extraction step may be better!
You can also use sig_extract to get optimal NMF result from multiple NMF runs.
Besides, you can use sig_fit to quantify exposures based on signatures extracted
from bp_extract_signatures()
.
bp_extract_signatures_iter()
for extracting signature in a iteration way.
bp_cluster_iter_list()
for clustering (hclust
with average linkage)
iterated signatures to help collapse
multiple signatures into one. The result cluster can be visualized by
plot()
or factoextra::fviz_dend()
.
bp_get_clustered_sigs()
for getting clustered (grouped) mean signatures from signature clusters.
Extra: bp_get_stats
() for obtaining stats for signatures and samples of a solution.
These stats are aggregated (averaged) as the stats for a solution
(specific signature number).
Extra: bp_get_rank_score()
for obtaining rank score for all signature numbers.
bp_extract_signatures( nmf_matrix, range = 2:5, n_bootstrap = 20L, n_nmf_run = 50, RTOL = 0.001, min_contribution = 0, cores = min(4L, future::availableCores()), cores_solution = min(cores, length(range)), seed = 123456L, handle_hyper_mutation = TRUE, report_integer_exposure = FALSE, only_core_stats = nrow(nmf_matrix) > 100, cache_dir = file.path(tempdir(), "sigminer_bp"), keep_cache = FALSE, pynmf = FALSE, use_conda = TRUE, py_path = "/Users/wsx/anaconda3/bin/python" ) bp_extract_signatures_iter( nmf_matrix, range = 2:5, sim_threshold = 0.95, max_iter = 10L, n_bootstrap = 20L, n_nmf_run = 50, RTOL = 0.001, min_contribution = 0, cores = min(4L, future::availableCores()), cores_solution = min(cores, length(range)), seed = 123456L, handle_hyper_mutation = TRUE, report_integer_exposure = FALSE, only_core_stats = nrow(nmf_matrix) > 100, cache_dir = file.path(tempdir(), "sigminer_bp"), keep_cache = FALSE, pynmf = FALSE, use_conda = FALSE, py_path = "/Users/wsx/anaconda3/bin/python" ) bp_cluster_iter_list(x, k = NULL, include_final_iteration = TRUE) bp_get_clustered_sigs(SigClusters, cluster_label) bp_get_sig_obj(obj, signum = NULL) bp_get_stats(obj) bp_get_rank_score(obj) bp_show_survey2( obj, x = "signature_number", left_y = "silhouette", right_y = "L2_error", left_name = left_y, right_name = right_y, left_color = "black", right_color = "red", left_shape = 16, right_shape = 18, shape_size = 4, highlight = NULL ) bp_show_survey( obj, add_score = FALSE, scales = c("free_y", "free"), fixed_ratio = TRUE ) bp_attribute_activity( input, sample_class = NULL, nmf_matrix = NULL, method = c("bt", "stepwise"), bt_use_prop = FALSE, return_class = c("matrix", "data.table"), use_parallel = FALSE, cache_dir = file.path(tempdir(), "sigminer_attribute_activity"), keep_cache = FALSE )
bp_extract_signatures( nmf_matrix, range = 2:5, n_bootstrap = 20L, n_nmf_run = 50, RTOL = 0.001, min_contribution = 0, cores = min(4L, future::availableCores()), cores_solution = min(cores, length(range)), seed = 123456L, handle_hyper_mutation = TRUE, report_integer_exposure = FALSE, only_core_stats = nrow(nmf_matrix) > 100, cache_dir = file.path(tempdir(), "sigminer_bp"), keep_cache = FALSE, pynmf = FALSE, use_conda = TRUE, py_path = "/Users/wsx/anaconda3/bin/python" ) bp_extract_signatures_iter( nmf_matrix, range = 2:5, sim_threshold = 0.95, max_iter = 10L, n_bootstrap = 20L, n_nmf_run = 50, RTOL = 0.001, min_contribution = 0, cores = min(4L, future::availableCores()), cores_solution = min(cores, length(range)), seed = 123456L, handle_hyper_mutation = TRUE, report_integer_exposure = FALSE, only_core_stats = nrow(nmf_matrix) > 100, cache_dir = file.path(tempdir(), "sigminer_bp"), keep_cache = FALSE, pynmf = FALSE, use_conda = FALSE, py_path = "/Users/wsx/anaconda3/bin/python" ) bp_cluster_iter_list(x, k = NULL, include_final_iteration = TRUE) bp_get_clustered_sigs(SigClusters, cluster_label) bp_get_sig_obj(obj, signum = NULL) bp_get_stats(obj) bp_get_rank_score(obj) bp_show_survey2( obj, x = "signature_number", left_y = "silhouette", right_y = "L2_error", left_name = left_y, right_name = right_y, left_color = "black", right_color = "red", left_shape = 16, right_shape = 18, shape_size = 4, highlight = NULL ) bp_show_survey( obj, add_score = FALSE, scales = c("free_y", "free"), fixed_ratio = TRUE ) bp_attribute_activity( input, sample_class = NULL, nmf_matrix = NULL, method = c("bt", "stepwise"), bt_use_prop = FALSE, return_class = c("matrix", "data.table"), use_parallel = FALSE, cache_dir = file.path(tempdir(), "sigminer_attribute_activity"), keep_cache = FALSE )
nmf_matrix |
a |
range |
a |
n_bootstrap |
number of bootstrapped (resampling) catalogs used.
When it is |
n_nmf_run |
number of NMF runs for each bootstrapped or original catalog. At default, in total n_bootstrap x n_nmf_run (i.e. 1000) NMF runs are used for the task. |
RTOL |
a threshold proposed by Nature Cancer paper to control how to
filter solutions of NMF. Default is |
min_contribution |
a component contribution threshold to filer out small contributed components. |
cores |
number of cpu cores to run NMF. |
cores_solution |
cores for processing solutions, default is equal to argument |
seed |
a random seed to make reproducible result. |
handle_hyper_mutation |
default is |
report_integer_exposure |
if |
only_core_stats |
if |
cache_dir |
a directory for keep temp result files. |
keep_cache |
if |
pynmf |
if |
use_conda |
if |
py_path |
path to Python executable file, e.g. '/Users/wsx/anaconda3/bin/python'. In my
test, it is more stable than |
sim_threshold |
a similarity threshold for selecting samples to auto-rerun
the extraction procedure (i.e. |
max_iter |
the maximum iteration size, default is 10, i.e., at most run the extraction procedure 10 times. |
x |
result from |
k |
an integer sequence specifying the cluster number to get silhouette. |
include_final_iteration |
if |
SigClusters |
result from |
cluster_label |
cluster labels for a specified cluster number, obtain it
from |
obj |
a |
signum |
a integer vector to extract the corresponding |
left_y |
column name for left y axis. |
right_y |
column name for right y axis. |
left_name |
label name for left y axis. |
right_name |
label name for right y axis. |
left_color |
color for left axis. |
right_color |
color for right axis. |
left_shape , right_shape , shape_size
|
shape setting. |
highlight |
a |
add_score |
if |
scales |
one of "free_y" (default) and "free" to control the scales of plot facet. |
fixed_ratio |
if |
input |
result from |
sample_class |
a named string vector whose names are sample names
and values are class labels (i.e. cancer subtype). If it is |
method |
one of 'bt' (use bootstrap exposure median, from reference #2, the most recommended way in my personal view) or stepwise' (stepwise reduce and update signatures then do signature fitting with last signature sets, from reference #2, the result tends to assign the contribution of removed signatures to the remaining signatures, maybe I misunderstand the paper method? PAY ATTENTION). |
bt_use_prop |
this parameter is only used for |
return_class |
string, 'matrix' or 'data.table'. |
use_parallel |
if |
The signature extraction approach is adopted from reference #1, #2, and the whole best practice is adopted from the pipeline used by reference #3. I implement the whole procedure with R code based on the method description of papers. The code is well organized, tested and documented so user will find it pretty simple and useful. Besides, the structure of the results is very clear to see and also visualize like other approaches provided by sigminer.
It depends on the called function.
The survey plot provides a pretty good way to facilitate the signature number
selection. A score
measure is calculated as the weighted mean of selected
measures and visualized as the first sub-plot. The optimal number is highlighted
with red color dot and the best values for each measures are also
highlighted with orange color dots. The detail of 6 measures shown in plot are
explained as below.
score
- an aggregated score based on rank scores from selected measures below.
The higher, the better. When two signature numbers have the same score,
the larger signature number is preferred (this is a rare situation, you
have to double check other measures).
silhouette
- the average silhouette width for signatures, also named as ASW in reference #2.
The signature number with silhouette decreases sharply is preferred.
distance
- the average sample reconstructed cosine distance, the lower value is better.
error
- the average sample reconstructed error calculated with L2 formula
(i.e. L2 error). This lower value is better. This measure represents a
similar concept like distance
above, they are all used to quantify how well
sample mutation profiles can be reconstructed from signatures, but distance
cares the whole mutation profile similarity while error
here cares value difference.
pos cor
- the average positive signature exposure correlation coefficient.
The lower value is better. This measure is constructed based on my understanding
about signatures: mutational signatures are typically treated as independent
recurrent patterns, so their activities are less correlated.
similarity
- the average similarity within in a signature cluster.
Like silhouette
, the point decreases sharply is preferred.
In the practice, results from multiple NMF runs are clustered
with "clustering with match" algorithm proposed by reference #2. This value
indicates if the signature profiles extracted from different NMF runs are similar.
Shixiang Wang [email protected]
Alexandrov, Ludmil B., et al. "Deciphering signatures of mutational processes operative in human cancer." Cell reports 3.1 (2013): 246-259.
Degasperi, Andrea, et al. "A practical framework and online tool for mutational signature analyses show intertissue variation and driver dependencies." Nature cancer 1.2 (2020): 249-263.
Alexandrov, Ludmil B., et al. “The repertoire of mutational signatures in human cancer.” Nature 578.7793 (2020): 94-101.
See sig_estimate, sig_extract, sig_auto_extract, sigprofiler_extract for other approaches.
data("simulated_catalogs") # Here I reduce the values for n_bootstrap and n_nmf_run # for reducing the run time. # In practice, you should keep default or increase the values # for better estimation. # # The input data here is simulated from 10 mutational signatures # e1 <- bp_extract_signatures( # t(simulated_catalogs$set1), # range = 8:12, # n_bootstrap = 5, # n_nmf_run = 10 # ) # # To avoid computation in examples, # Here just load the result # (e1$signature and e1$exposure set to NA to reduce package size) load(system.file("extdata", "e1.RData", package = "sigminer")) # See the survey for different signature numbers # The suggested solution is marked as red dot # with highest integrated score. p1 <- bp_show_survey(e1) p1 # You can also exclude plotting and highlighting the score p2 <- bp_show_survey(e1, add_score = FALSE) p2 # You can also plot a simplified version p3 <- bp_show_survey2(e1, highlight = 10) p3 # Obtain the suggested solution from extraction result obj_suggested <- bp_get_sig_obj(e1, e1$suggested) obj_suggested # If you think the suggested signature number is not right # Just pick up the solution you want obj_s8 <- bp_get_sig_obj(e1, 8) # Track the reconstructed profile similarity rec_sim <- get_sig_rec_similarity(obj_s8, t(simulated_catalogs$set1)) rec_sim # After extraction, you can assign the signatures # to reference COSMIC signatures # More see ?get_sig_similarity sim <- get_sig_similarity(obj_suggested) # Visualize the match result if (require(pheatmap)) { pheatmap::pheatmap(sim$similarity) } # You already got the activities of signatures # in obj_suggested, however, you can still # try to optimize the result. # NOTE: the optimization step may not truly optimize the result! expo <- bp_attribute_activity(e1, return_class = "data.table") expo$abs_activity ## Not run: # Iterative extraction: # This procedure will rerun extraction step # for those samples with reconstructed catalog similarity # lower than a threshold (default is 0.95) e2 <- bp_extract_signatures_iter( t(simulated_catalogs$set1), range = 9:11, n_bootstrap = 5, n_nmf_run = 5, sim_threshold = 0.99 ) e2 # When the procedure run multiple rounds # you can cluster the signatures from different rounds by # the following command # bp_cluster_iter_list(e2) ## Extra utilities rank_score <- bp_get_rank_score(e1) rank_score stats <- bp_get_stats(e2$iter1) # Get the mean reconstructed similarity 1 - stats$stats_sample$cosine_distance_mean ## End(Not run)
data("simulated_catalogs") # Here I reduce the values for n_bootstrap and n_nmf_run # for reducing the run time. # In practice, you should keep default or increase the values # for better estimation. # # The input data here is simulated from 10 mutational signatures # e1 <- bp_extract_signatures( # t(simulated_catalogs$set1), # range = 8:12, # n_bootstrap = 5, # n_nmf_run = 10 # ) # # To avoid computation in examples, # Here just load the result # (e1$signature and e1$exposure set to NA to reduce package size) load(system.file("extdata", "e1.RData", package = "sigminer")) # See the survey for different signature numbers # The suggested solution is marked as red dot # with highest integrated score. p1 <- bp_show_survey(e1) p1 # You can also exclude plotting and highlighting the score p2 <- bp_show_survey(e1, add_score = FALSE) p2 # You can also plot a simplified version p3 <- bp_show_survey2(e1, highlight = 10) p3 # Obtain the suggested solution from extraction result obj_suggested <- bp_get_sig_obj(e1, e1$suggested) obj_suggested # If you think the suggested signature number is not right # Just pick up the solution you want obj_s8 <- bp_get_sig_obj(e1, 8) # Track the reconstructed profile similarity rec_sim <- get_sig_rec_similarity(obj_s8, t(simulated_catalogs$set1)) rec_sim # After extraction, you can assign the signatures # to reference COSMIC signatures # More see ?get_sig_similarity sim <- get_sig_similarity(obj_suggested) # Visualize the match result if (require(pheatmap)) { pheatmap::pheatmap(sim$similarity) } # You already got the activities of signatures # in obj_suggested, however, you can still # try to optimize the result. # NOTE: the optimization step may not truly optimize the result! expo <- bp_attribute_activity(e1, return_class = "data.table") expo$abs_activity ## Not run: # Iterative extraction: # This procedure will rerun extraction step # for those samples with reconstructed catalog similarity # lower than a threshold (default is 0.95) e2 <- bp_extract_signatures_iter( t(simulated_catalogs$set1), range = 9:11, n_bootstrap = 5, n_nmf_run = 5, sim_threshold = 0.99 ) e2 # When the procedure run multiple rounds # you can cluster the signatures from different rounds by # the following command # bp_cluster_iter_list(e2) ## Extra utilities rank_score <- bp_get_rank_score(e1) rank_score stats <- bp_get_stats(e2$iter1) # Get the mean reconstructed similarity 1 - stats$stats_sample$cosine_distance_mean ## End(Not run)
Location of Centromeres at Genome Build hg19
A data.frame
Generate from UCSC gold path
data(centromeres.hg19)
data(centromeres.hg19)
Location of Centromeres at Genome Build hg38
A data.frame
Generate from Genome Reference Consortium
data(centromeres.hg38)
data(centromeres.hg38)
Location of Centromeres at Genome Build mm10
A data.frame
Generate from https://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/gap.txt.gz
data(centromeres.mm10)
data(centromeres.mm10)
Location of Centromeres at Genome Build mm9
A data.frame
Generate from https://hgdownload.soe.ucsc.edu/goldenPath/mm9/database/ with code:
for i in $(seq 1 19) X Y; do wget https://hgdownload.soe.ucsc.edu/goldenPath/mm9/database/chr${i}_gap.txt.gz done
data(centromeres.mm9)
data(centromeres.mm9)
Location of Centromeres at Genome Build T2T
A data.frame
from T2T study
data(centromeres.T2T)
data(centromeres.T2T)
Chromosome Size of Genome Build hg19
A data.frame
Generate from UCSC gold path
data(chromsize.hg19)
data(chromsize.hg19)
Chromosome Size of Genome Build hg38
A data.frame
Generate from UCSC gold path
data(chromsize.hg38)
data(chromsize.hg38)
Chromosome Size of Genome Build mm10
A data.frame
Generate from UCSC gold path http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/mm10.chrom.sizes
data(chromsize.mm10)
data(chromsize.mm10)
Chromosome Size of Genome Build mm9
A data.frame
Generate from UCSC gold path http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/mm9.chrom.sizes
data(chromsize.mm9)
data(chromsize.mm9)
Chromosome Size of Genome Build T2T
A data.frame
from T2T study
data(chromsize.T2T)
data(chromsize.T2T)
Classification Table of Copy Number Features Devised by Wang et al. for Method 'W'
A data.table
with "sigminer.features" class name
Generate from code under data_raw/
data(CN.features)
data(CN.features)
S4 class for storing summarized absolute copy number profile.
data
data.table of absolute copy number calling.
summary.per.sample
data.table of copy number variation summary per sample.
genome_build
genome build version, should be one of 'hg19' or 'hg38'.
genome_measure
Set 'called' will use autosomo called segments size to compute total size for CNA burden calculation, this option is useful for WES and target sequencing. Set 'wg' will autosome size from genome build, this option is useful for WGS, SNP etc..
annotation
data.table of annotation for copy number segments.
dropoff.segs
data.table of copy number segments dropped from raw input.
Calculate Cosine Measures
cosine(x, y)
cosine(x, y)
x |
a numeric vector or matrix with column representing vector to calculate similarity. |
y |
must be same format as |
a numeric value or matrix
.
x <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0) y <- c(0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0) z1 <- cosine(x, y) z1 z2 <- cosine(matrix(x), matrix(y)) z2
x <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0) y <- c(0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0) z1 <- cosine(x, y) z1 z2 <- cosine(matrix(x), matrix(y)) z2
Location of Chromosome Cytobands at Genome Build hg19
A data.frame
from UCSC
data(cytobands.hg19)
data(cytobands.hg19)
Location of Chromosome Cytobands at Genome Build hg38
A data.frame
from UCSC
data(cytobands.hg38)
data(cytobands.hg38)
Location of Chromosome Cytobands at Genome Build mm10
A data.frame
from UCSC http://hgdownload.cse.ucsc.edu/goldenpath/mm10/database/cytoBand.txt.gz
data(cytobands.mm10)
data(cytobands.mm10)
Location of Chromosome Cytobands at Genome Build mm9
A data.frame
from UCSC http://hgdownload.cse.ucsc.edu/goldenpath/mm9/database/cytoBand.txt.gz
data(cytobands.mm9)
data(cytobands.mm9)
Location of Chromosome Cytobands at Genome Build T2T
A data.frame
from T2T study
data(cytobands.T2T)
data(cytobands.T2T)
See sig_tally for examples.
enrich_component_strand_bias(mat)
enrich_component_strand_bias(mat)
mat |
a sample-by-component matrix from sig_tally with strand bias labels "T:" and "B:". |
a data.table
sorted by p_value
.
Setting aes(label=..p.adj..)
in ggpubr::compare_means()
does not
show adjust p values. The returned result of this function can be combined with ggpubr::stat_pvalue_manual()
to fix
this problem.
get_adj_p( data, .col, .grp = "Sample", comparisons = NULL, method = "wilcox.test", p.adjust.method = "fdr", p.digits = 3L, ... )
get_adj_p( data, .col, .grp = "Sample", comparisons = NULL, method = "wilcox.test", p.adjust.method = "fdr", p.digits = 3L, ... )
data |
a |
.col |
column name for comparison. |
.grp |
column name for groups. |
comparisons |
Default is |
method |
a character string indicating which method to be used for comparing means. It can be 't.test', 'wilcox.test' etc.. |
p.adjust.method |
correction method, default is 'fdr'. Run |
p.digits |
how many significant digits are to be used. |
... |
other arguments passed to |
More info see ggpubr::compare_means()
, ggpubr::stat_compare_means()
and stats::p.adjust()
.
a data.frame
containing comparison result
https://github.com/kassambara/ggpubr/issues/143
library(ggpubr) # T-test stat.test <- compare_means( len ~ dose, data = ToothGrowth, method = "t.test", p.adjust.method = "fdr" ) stat.test # Create a simple box plot p <- ggboxplot(ToothGrowth, x = "dose", y = "len") p # Add p values my_comparisons <- list(c("0.5", "1"), c("1", "2"), c("0.5", "2")) p + stat_compare_means(method = "t.test", comparisons = my_comparisons) # Try adding adjust p values # proposed by author of ggpubr # however it does not work p + stat_compare_means(aes(label = ..p.adj..), method = "t.test", comparisons = my_comparisons) # Solution: # calculate adjust p values and their location # then use stat_pvalue_manual() function p_adj <- get_adj_p(ToothGrowth, .col = "len", .grp = "dose") p_adj p + stat_pvalue_manual(p_adj, label = "p.adj") # Show selected comparisons # Of note, p value is ajusted # for three comparisons, but only # two are showed in figure p_adj <- get_adj_p(ToothGrowth, .col = "len", .grp = "dose", comparisons = list(c("0.5", "1"), c("1", "2")) ) p + stat_pvalue_manual(p_adj, label = "p.adj")
library(ggpubr) # T-test stat.test <- compare_means( len ~ dose, data = ToothGrowth, method = "t.test", p.adjust.method = "fdr" ) stat.test # Create a simple box plot p <- ggboxplot(ToothGrowth, x = "dose", y = "len") p # Add p values my_comparisons <- list(c("0.5", "1"), c("1", "2"), c("0.5", "2")) p + stat_compare_means(method = "t.test", comparisons = my_comparisons) # Try adding adjust p values # proposed by author of ggpubr # however it does not work p + stat_compare_means(aes(label = ..p.adj..), method = "t.test", comparisons = my_comparisons) # Solution: # calculate adjust p values and their location # then use stat_pvalue_manual() function p_adj <- get_adj_p(ToothGrowth, .col = "len", .grp = "dose") p_adj p + stat_pvalue_manual(p_adj, label = "p.adj") # Show selected comparisons # Of note, p value is ajusted # for three comparisons, but only # two are showed in figure p_adj <- get_adj_p(ToothGrowth, .col = "len", .grp = "dose", comparisons = list(c("0.5", "1"), c("1", "2")) ) p + stat_pvalue_manual(p_adj, label = "p.adj")
This implements a Cohen-Sharir method (see reference) like "Aneuploidy Score" computation. You can read the source code to see how it works. Basically, it follows the logic of Cohen-Sharir method but with some difference in detail implementation. Their results should be counterpart, but with no data validation for now. Please raise an issue if you find problem/bugs in this function.
get_Aneuploidy_score( data, ploidy_df = NULL, genome_build = "hg19", rm_black_arms = FALSE )
get_Aneuploidy_score( data, ploidy_df = NULL, genome_build = "hg19", rm_black_arms = FALSE )
data |
a CopyNumber object or a |
ploidy_df |
default is |
genome_build |
genome build version, should be 'hg19', 'hg38', 'mm9' or 'mm10'. |
rm_black_arms |
if |
A data.frame
Cohen-Sharir, Y., McFarland, J. M., Abdusamad, M., Marquis, C., Bernhard, S. V., Kazachkova, M., ... & Ben-David, U. (2021). Aneuploidy renders cancer cells vulnerable to mitotic checkpoint inhibition. Nature, 1-6.
Logic reference: https://github.com/quevedor2/aneuploidy_score/.
Taylor, Alison M., et al. "Genomic and functional approaches to understanding cancer aneuploidy." Cancer cell 33.4 (2018): 676-689.
# Load copy number object load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) df <- get_Aneuploidy_score(cn) df df2 <- get_Aneuploidy_score(cn@data) df2 df3 <- get_Aneuploidy_score(cn@data, ploidy_df = get_cn_ploidy(cn@data) ) df3
# Load copy number object load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) df <- get_Aneuploidy_score(cn) df df2 <- get_Aneuploidy_score(cn@data) df2 df3 <- get_Aneuploidy_score(cn@data, ploidy_df = get_cn_ploidy(cn@data) ) df3
Sometimes, we may want to use or inspect specified run result from sig_auto_extract. This function is designed for this purpose.
get_bayesian_result(run_info)
get_bayesian_result(run_info)
run_info |
a |
a list
.
Shixiang Wang
load(system.file("extdata", "toy_copynumber_tally_W.RData", package = "sigminer", mustWork = TRUE )) res <- sig_auto_extract(cn_tally_W$nmf_matrix, result_prefix = "Test_copynumber", nrun = 1) # All run info are stored in res$Raw$summary_run # Obtain result of run 1 res_run1 <- get_bayesian_result(res$Raw$summary_run[1, ])
load(system.file("extdata", "toy_copynumber_tally_W.RData", package = "sigminer", mustWork = TRUE )) res <- sig_auto_extract(cn_tally_W$nmf_matrix, result_prefix = "Test_copynumber", nrun = 1) # All run info are stored in res$Raw$summary_run # Obtain result of run 1 res_run1 <- get_bayesian_result(res$Raw$summary_run[1, ])
Get CNV Frequency Table
get_cn_freq_table( data, genome_build = "hg19", cutoff = 2L, resolution_factor = 1L )
get_cn_freq_table( data, genome_build = "hg19", cutoff = 2L, resolution_factor = 1L )
data |
a |
genome_build |
genome build version, used when |
cutoff |
copy number value cutoff for splitting data into AMP and DEL.
The values equal to cutoff are discarded. Default is |
resolution_factor |
an integer to control the resolution.
When it is |
a data.table
.
Get Ploidy from Absolute Copy Number Profile
get_cn_ploidy(data)
get_cn_ploidy(data)
data |
a CopyNumber object or a |
a value or a data.table
# Load copy number object load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) df <- get_cn_ploidy(cn) df
# Load copy number object load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) df <- get_cn_ploidy(cn) df
Get Genome Annotation
get_genome_annotation( data_type = c("chr_size", "centro_loc", "cytobands", "transcript", "gene"), chrs = paste0("chr", c(1:22, "X", "Y")), genome_build = c("hg19", "hg38", "T2T", "mm10", "mm9", "ce11") )
get_genome_annotation( data_type = c("chr_size", "centro_loc", "cytobands", "transcript", "gene"), chrs = paste0("chr", c(1:22, "X", "Y")), genome_build = c("hg19", "hg38", "T2T", "mm10", "mm9", "ce11") )
data_type |
'chr_size' for chromosome size, 'centro_loc' for location of centromeres, 'cytobands' for location of chromosome cytobands and 'transcript' for location of transcripts. |
chrs |
chromosomes start with 'chr' |
genome_build |
one of 'hg19', 'hg38' |
a data.frame
containing annotation data
df1 <- get_genome_annotation() df1 df2 <- get_genome_annotation(genome_build = "hg38") df2 df3 <- get_genome_annotation(data_type = "centro_loc") df3 df4 <- get_genome_annotation(data_type = "centro_loc", genome_build = "hg38") df4 df5 <- get_genome_annotation(data_type = "cytobands") df5 df6 <- get_genome_annotation(data_type = "cytobands", genome_build = "hg38") df6
df1 <- get_genome_annotation() df1 df2 <- get_genome_annotation(genome_build = "hg38") df2 df3 <- get_genome_annotation(data_type = "centro_loc") df3 df4 <- get_genome_annotation(data_type = "centro_loc", genome_build = "hg38") df4 df5 <- get_genome_annotation(data_type = "cytobands") df5 df6 <- get_genome_annotation(data_type = "cytobands", genome_build = "hg38") df6
Compare genotypes/phenotypes based on signature groups (samples are assigned to
several groups). For categorical
type, calculate fisher p value (using stats::fisher.test) and count table.
In larger than 2 by 2 tables, compute p-values by Monte Carlo simulation.
For continuous type, calculate anova p value (using stats::aov),
summary table and Tukey Honest significant difference (using stats::TukeyHSD).
The result of this function can be plotted by show_group_comparison()
.
get_group_comparison( data, col_group, cols_to_compare, type = "ca", NAs = NA, verbose = FALSE )
get_group_comparison( data, col_group, cols_to_compare, type = "ca", NAs = NA, verbose = FALSE )
data |
a |
col_group |
column name of signature groups. |
cols_to_compare |
column names of genotypes/phenotypes want to summarize based on groups. |
type |
a characater vector with length same as |
NAs |
default is |
verbose |
if |
a list
contains data, summary, p value etc..
Shixiang Wang [email protected]
load(system.file("extdata", "toy_copynumber_signature_by_W.RData", package = "sigminer", mustWork = TRUE )) # Assign samples to clusters groups <- get_groups(sig, method = "k-means") set.seed(1234) groups$prob <- rnorm(10) groups$new_group <- sample(c("1", "2", "3", "4", NA), size = nrow(groups), replace = TRUE) # Compare groups (filter NAs for categorical coloumns) groups.cmp <- get_group_comparison(groups[, -1], col_group = "group", cols_to_compare = c("prob", "new_group"), type = c("co", "ca"), verbose = TRUE ) # Compare groups (Set NAs of categorical columns to 'Rest') groups.cmp2 <- get_group_comparison(groups[, -1], col_group = "group", cols_to_compare = c("prob", "new_group"), type = c("co", "ca"), NAs = "Rest", verbose = TRUE )
load(system.file("extdata", "toy_copynumber_signature_by_W.RData", package = "sigminer", mustWork = TRUE )) # Assign samples to clusters groups <- get_groups(sig, method = "k-means") set.seed(1234) groups$prob <- rnorm(10) groups$new_group <- sample(c("1", "2", "3", "4", NA), size = nrow(groups), replace = TRUE) # Compare groups (filter NAs for categorical coloumns) groups.cmp <- get_group_comparison(groups[, -1], col_group = "group", cols_to_compare = c("prob", "new_group"), type = c("co", "ca"), verbose = TRUE ) # Compare groups (Set NAs of categorical columns to 'Rest') groups.cmp2 <- get_group_comparison(groups[, -1], col_group = "group", cols_to_compare = c("prob", "new_group"), type = c("co", "ca"), NAs = "Rest", verbose = TRUE )
One of key results from signature analysis is to cluster samples into different
groups. This function takes Signature
object as input
and return the membership in each cluster.
get_groups( Signature, method = c("consensus", "k-means", "exposure", "samples"), n_cluster = NULL, match_consensus = TRUE )
get_groups( Signature, method = c("consensus", "k-means", "exposure", "samples"), n_cluster = NULL, match_consensus = TRUE )
Signature |
a |
method |
grouping method, more see details, could be one of the following:
|
n_cluster |
only used when the |
match_consensus |
only used when the |
Users may find there are bigger differences between using method 'samples' and 'exposure' but they use a similar idear to find dominant signature, here goes the reason:
Method 'samples' using data directly from NMF decomposition, this means the two matrix
W
(basis matrix or signature matrix) and H
(coefficient matrix or exposure matrix) are
the results of NMF. For method 'exposure', it uses the signature exposure loading matrix.
In this situation, each signture represents a number of mutations (alterations)
about implementation please see source code of sig_extract()
function.
a data.table
object
# Load copy number prepare object load(system.file("extdata", "toy_copynumber_tally_W.RData", package = "sigminer", mustWork = TRUE )) # Extract copy number signatures library(NMF) sig <- sig_extract(cn_tally_W$nmf_matrix, 2, nrun = 10 ) # Methods 'consensus' and 'samples' are from NMF::predict() g1 <- get_groups(sig, method = "consensus", match_consensus = TRUE) g1 g2 <- get_groups(sig, method = "samples") g2 # Use k-means clustering g3 <- get_groups(sig, method = "k-means") g3
# Load copy number prepare object load(system.file("extdata", "toy_copynumber_tally_W.RData", package = "sigminer", mustWork = TRUE )) # Extract copy number signatures library(NMF) sig <- sig_extract(cn_tally_W$nmf_matrix, 2, nrun = 10 ) # Methods 'consensus' and 'samples' are from NMF::predict() g1 <- get_groups(sig, method = "consensus", match_consensus = TRUE) g1 g2 <- get_groups(sig, method = "samples") g2 # Use k-means clustering g3 <- get_groups(sig, method = "k-means") g3
Get Overlap Size between Interval x and y
get_intersect_size(x.start, x.end, y.start, y.end)
get_intersect_size(x.start, x.end, y.start, y.end)
x.start |
start position of interval x. |
x.end |
start position of interval x. |
y.start |
start position of interval x. |
y.end |
start position of interval x. |
a numeric vector.
o1 <- get_intersect_size(1, 5, 3, 20) o1 o2 <- get_intersect_size(3, 20, 1, 10) o2 o3 <- get_intersect_size(c(1, 2, 1), c(10, 4, 6), c(4, 2, 5), c(10, 3, 22)) o3
o1 <- get_intersect_size(1, 5, 3, 20) o1 o2 <- get_intersect_size(3, 20, 1, 10) o2 o3 <- get_intersect_size(c(1, 2, 1), c(10, 4, 6), c(4, 2, 5), c(10, 3, 22)) o3
pLOH score represents the genome that displayed LOH.
get_pLOH_score(data, rm_chrs = c("chrX", "chrY"), genome_build = "hg19")
get_pLOH_score(data, rm_chrs = c("chrX", "chrY"), genome_build = "hg19")
data |
a CopyNumber object or a |
rm_chrs |
chromosomes to be removed in calculation. Default is sex chromosomes (recommended). |
genome_build |
genome build version, should be 'hg19', 'hg38', 'mm9' or 'mm10'. |
A data.frame
Steele, Christopher D., et al. "Signatures of copy number alterations in human cancer." bioRxiv (2021).
# Load toy dataset of absolute copynumber profile load(system.file("extdata", "toy_segTab.RData", package = "sigminer", mustWork = TRUE )) set.seed(1234) segTabs$minor_cn <- sample(c(0, 1), size = nrow(segTabs), replace = TRUE) cn <- read_copynumber(segTabs, seg_cols = c("chromosome", "start", "end", "segVal"), genome_measure = "wg", complement = TRUE, add_loh = TRUE ) df <- get_pLOH_score(cn) df df2 <- get_pLOH_score(cn@data) df2
# Load toy dataset of absolute copynumber profile load(system.file("extdata", "toy_segTab.RData", package = "sigminer", mustWork = TRUE )) set.seed(1234) segTabs$minor_cn <- sample(c(0, 1), size = nrow(segTabs), replace = TRUE) cn <- read_copynumber(segTabs, seg_cols = c("chromosome", "start", "end", "segVal"), genome_measure = "wg", complement = TRUE, add_loh = TRUE ) df <- get_pLOH_score(cn) df df2 <- get_pLOH_score(cn@data) df2
where n
is the number
of signatures identified in the signature with exposure > cutoff
,
and pi
is the normalized exposure of the ith signature with
exposure > cutoff
. Exposures of signatures were normalized to
sum to 1
.
get_shannon_diversity_index(rel_expo, cutoff = 0.001)
get_shannon_diversity_index(rel_expo, cutoff = 0.001)
rel_expo |
a |
cutoff |
a relative exposure cutoff for filtering signatures,
default is |
a data.frame
Steele, Christopher D., et al. "Undifferentiated sarcomas develop through distinct evolutionary pathways." Cancer Cell 35.3 (2019): 441-456.
# Load mutational signature load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) # Get signature exposure rel_expo <- get_sig_exposure(sig2, type = "relative") rel_expo diversity_index <- get_shannon_diversity_index(rel_expo) diversity_index
# Load mutational signature load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) # Get signature exposure rel_expo <- get_sig_exposure(sig2, type = "relative") rel_expo diversity_index <- get_shannon_diversity_index(rel_expo) diversity_index
Obtain Signature Index for Cancer Types
get_sig_cancer_type_index( sig_type = c("legacy", "SBS", "DBS", "ID"), seq_type = c("WGS", "WES"), source = c("PCAWG", "TCGA", "nonPCAWG"), keyword = NULL )
get_sig_cancer_type_index( sig_type = c("legacy", "SBS", "DBS", "ID"), seq_type = c("WGS", "WES"), source = c("PCAWG", "TCGA", "nonPCAWG"), keyword = NULL )
sig_type |
signature type. |
seq_type |
sequencing type. |
source |
data source. |
keyword |
keyword to search in the signature index database. |
a list
.
l1 <- get_sig_cancer_type_index() l2 <- get_sig_cancer_type_index(sig_type = "SBS") l3 <- get_sig_cancer_type_index(sig_type = "DBS", source = "PCAWG", seq_type = "WGS") l4 <- get_sig_cancer_type_index(sig_type = "ID") l5 <- get_sig_cancer_type_index(keyword = "breast") l1 l2 l3 l4 l5
l1 <- get_sig_cancer_type_index() l2 <- get_sig_cancer_type_index(sig_type = "SBS") l3 <- get_sig_cancer_type_index(sig_type = "DBS", source = "PCAWG", seq_type = "WGS") l4 <- get_sig_cancer_type_index(sig_type = "ID") l5 <- get_sig_cancer_type_index(keyword = "breast") l1 l2 l3 l4 l5
Reference mutational signatures and their aetiologies, mainly obtained from COSMIC database (SigProfiler results) and cleaned before saving into sigminer package. You can obtain:
COSMIC legacy SBS signatures.
COSMIC v3 SBS signatures.
COSMIC v3 DBS signatures.
COSMIC v3 ID (indel) signatures.
SBS and RS (rearrangement) signatures from Nik lab 2020 Nature Cancer paper.
RS signatures from BRCA560 and USARC cohorts.
Copy number signatures from USARC cohort and TCGA.
Copy number signatures from Liu lab 2023. It supports both PCAWG and TCGA cohort.
get_sig_db(sig_db = "legacy")
get_sig_db(sig_db = "legacy")
sig_db |
default 'legacy', it can be 'legacy' (for COSMIC v2 'SBS'),
'SBS', 'DBS', 'ID' and 'TSB' (for COSMIV v3.1 signatures)
for small scale mutations.
For more specific details, it can also be 'SBS_hg19', 'SBS_hg38',
'SBS_mm9', 'SBS_mm10', 'DBS_hg19', 'DBS_hg38', 'DBS_mm9', 'DBS_mm10' to use
COSMIC v3 reference signatures from Alexandrov, Ludmil B., et al. (2020) (reference #1).
In addition, it can be one of "SBS_Nik_lab_Organ", "RS_Nik_lab_Organ",
"SBS_Nik_lab", "RS_Nik_lab" to refer reference signatures from
Degasperi, Andrea, et al. (2020) (reference #2);
"RS_BRCA560", "RS_USARC" to reference signatures from BRCA560 and USARC cohorts;
"CNS_USARC" (40 categories), "CNS_TCGA" (48 categories) to reference copy number signatures from USARC cohort and TCGA;
"CNS_TCGA176" (176 categories) and "CNS_PCAWG176" (176 categories) to reference copy number signatures from PCAWG and TCGA separately.
UPDATE, the latest version of reference version can be automatically
downloaded and loaded from https://cancer.sanger.ac.uk/signatures/downloads/
when a option with |
a list
.
Steele, Christopher D., et al. "Signatures of copy number alterations in human cancer." Nature 606.7916 (2022): 984-991.
Alexandrov, Ludmil B., et al. "The repertoire of mutational signatures in human cancer." Nature 578.7793 (2020): 94-101.
Steele, Christopher D., et al. "Undifferentiated sarcomas develop through distinct evolutionary pathways." Cancer Cell 35.3 (2019): 441-456.
Ziyu Tao, et al. "The repertoire of copy number alteration signatures in human cancer." Briefings in Bioinformatics (2023): bbad053.
get_sig_similarity, sig_fit and show_cosmic_sig_profile.
s1 <- get_sig_db() s2 <- get_sig_db("SBS") s3 <- get_sig_db("DBS") s4 <- get_sig_db("DBS_mm10") s5 <- get_sig_db("SBS_Nik_lab") s6 <- get_sig_db("ID") s7 <- get_sig_db("RS_BRCA560") s8 <- get_sig_db("RS_USARC") s9 <- get_sig_db("RS_Nik_lab") s10 <- get_sig_db("CNS_USARC") s11 <- get_sig_db("CNS_TCGA") s12 <- get_sig_db("CNS_TCGA176") s13 <- get_sig_db("CNS_PCAWG176") s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13
s1 <- get_sig_db() s2 <- get_sig_db("SBS") s3 <- get_sig_db("DBS") s4 <- get_sig_db("DBS_mm10") s5 <- get_sig_db("SBS_Nik_lab") s6 <- get_sig_db("ID") s7 <- get_sig_db("RS_BRCA560") s8 <- get_sig_db("RS_USARC") s9 <- get_sig_db("RS_Nik_lab") s10 <- get_sig_db("CNS_USARC") s11 <- get_sig_db("CNS_TCGA") s12 <- get_sig_db("CNS_TCGA176") s13 <- get_sig_db("CNS_PCAWG176") s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13
The expected number of mutations (or copy number segment records) with each signature was determined after a scaling transformation V ~ WH = W'H' where W' = WU' and H' = UH. The scaling matrix U is a KxK diagnal matrix (K is signature number, U' is the inverse of U) with the element corresponding to the L1-norm of column vectors of W (ie. the sum of the elements of the vector). As a result, the k-th row vector of the final matrix H' represents the absolute exposure (activity) of the k-th process across samples (e.g., for SBS, the estimated (or expected) number of mutations generated by the k-th process). Of note, for copy number signatures, only components of feature CN was used for calculating H'.
get_sig_exposure( Signature, type = c("absolute", "relative"), rel_threshold = 0.01 )
get_sig_exposure( Signature, type = c("absolute", "relative"), rel_threshold = 0.01 )
Signature |
a |
type |
'absolute' for signature exposure and 'relative' for signature relative exposure. |
rel_threshold |
only used when type is 'relative', relative exposure less
than ( |
a data.table
Shixiang Wang [email protected]
Kim, Jaegil, et al. "Somatic ERCC2 mutations are associated with a distinct genomic signature in urothelial tumors." Nature genetics 48.6 (2016): 600.
# Load mutational signature load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) # Get signature exposure expo1 <- get_sig_exposure(sig2) expo1 expo2 <- get_sig_exposure(sig2, type = "relative") expo2
# Load mutational signature load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) # Get signature exposure expo1 <- get_sig_exposure(sig2) expo1 expo2 <- get_sig_exposure(sig2, type = "relative") expo2
Association of signature exposures with other features will be performed using one of two procedures: for a continuous association variable (including ordinal variable), correaltion is performed; for a binary association variable, samples will be divided into two groups and Mann-Whitney U-test is performed to test for differences in signature exposure medians between the two groups. See get_tidy_association for cleaning association result.
get_sig_feature_association( data, cols_to_sigs, cols_to_features, type = "ca", method_co = c("spearman", "pearson", "kendall"), method_ca = stats::wilcox.test, min_n = 0.01, verbose = FALSE, ... )
get_sig_feature_association( data, cols_to_sigs, cols_to_features, type = "ca", method_co = c("spearman", "pearson", "kendall"), method_ca = stats::wilcox.test, min_n = 0.01, verbose = FALSE, ... )
data |
a |
cols_to_sigs |
colnames for signature exposure |
cols_to_features |
colnames for other features |
type |
a character vector containing 'ca' for categorical variable and 'co' for continuous variable,
it must have the same length as |
method_co |
method for continuous variable, default is "spearman", could also be "pearson" and "kendall". |
method_ca |
method for categorical variable, default is "wilcox.test" |
min_n |
a minimal fraction (e.g. 0.01) or a integer number (e.g. 10) for filtering some variables with few positive events. Default is 0.01. |
verbose |
if |
... |
other arguments passing to test functions, like |
a list
. For 'co' features, 'measure' means correlation coefficient.
For 'ca' features, 'measure' means difference in means of signature exposure.
See bp_extract_signatures for examples.
get_sig_rec_similarity(Signature, nmf_matrix)
get_sig_rec_similarity(Signature, nmf_matrix)
Signature |
a |
nmf_matrix |
a |
a data.table
.
The reference signatures can be either a Signature
object specified by Ref
argument
or known COSMIC signatures specified by sig_db
argument.
Two COSMIC databases are used for comparisons - "legacy" which includes 30 signaures,
and "SBS" - which includes updated/refined 65 signatures. This function is modified
from compareSignatures()
in maftools package.
NOTE: all reference signatures are generated from gold standard tool:
SigProfiler.
get_sig_similarity( Signature, Ref = NULL, sig_db = c("SBS", "legacy", "DBS", "ID", "TSB", "SBS_Nik_lab", "RS_Nik_lab", "RS_BRCA560", "RS_USARC", "CNS_USARC", "CNS_TCGA", "CNS_TCGA176", "CNS_PCAWG176", "SBS_hg19", "SBS_hg38", "SBS_mm9", "SBS_mm10", "DBS_hg19", "DBS_hg38", "DBS_mm9", "DBS_mm10", "SBS_Nik_lab_Organ", "RS_Nik_lab_Organ", "latest_SBS_GRCh37", "latest_DBS_GRCh37", "latest_ID_GRCh37", "latest_SBS_GRCh38", "latest_DBS_GRCh38", "latest_SBS_mm9", "latest_DBS_mm9", "latest_SBS_mm10", "latest_DBS_mm10", "latest_SBS_rn6", "latest_DBS_rn6", "latest_CN_GRCh37", "latest_RNA-SBS_GRCh37", "latest_SV_GRCh38"), db_type = c("", "human-exome", "human-genome"), method = "cosine", normalize = c("row", "feature"), feature_setting = sigminer::CN.features, set_order = TRUE, pattern_to_rm = NULL, verbose = TRUE )
get_sig_similarity( Signature, Ref = NULL, sig_db = c("SBS", "legacy", "DBS", "ID", "TSB", "SBS_Nik_lab", "RS_Nik_lab", "RS_BRCA560", "RS_USARC", "CNS_USARC", "CNS_TCGA", "CNS_TCGA176", "CNS_PCAWG176", "SBS_hg19", "SBS_hg38", "SBS_mm9", "SBS_mm10", "DBS_hg19", "DBS_hg38", "DBS_mm9", "DBS_mm10", "SBS_Nik_lab_Organ", "RS_Nik_lab_Organ", "latest_SBS_GRCh37", "latest_DBS_GRCh37", "latest_ID_GRCh37", "latest_SBS_GRCh38", "latest_DBS_GRCh38", "latest_SBS_mm9", "latest_DBS_mm9", "latest_SBS_mm10", "latest_DBS_mm10", "latest_SBS_rn6", "latest_DBS_rn6", "latest_CN_GRCh37", "latest_RNA-SBS_GRCh37", "latest_SV_GRCh38"), db_type = c("", "human-exome", "human-genome"), method = "cosine", normalize = c("row", "feature"), feature_setting = sigminer::CN.features, set_order = TRUE, pattern_to_rm = NULL, verbose = TRUE )
Signature |
a |
Ref |
default is |
sig_db |
default 'legacy', it can be 'legacy' (for COSMIC v2 'SBS'),
'SBS', 'DBS', 'ID' and 'TSB' (for COSMIV v3.1 signatures)
for small scale mutations.
For more specific details, it can also be 'SBS_hg19', 'SBS_hg38',
'SBS_mm9', 'SBS_mm10', 'DBS_hg19', 'DBS_hg38', 'DBS_mm9', 'DBS_mm10' to use
COSMIC v3 reference signatures from Alexandrov, Ludmil B., et al. (2020) (reference #1).
In addition, it can be one of "SBS_Nik_lab_Organ", "RS_Nik_lab_Organ",
"SBS_Nik_lab", "RS_Nik_lab" to refer reference signatures from
Degasperi, Andrea, et al. (2020) (reference #2);
"RS_BRCA560", "RS_USARC" to reference signatures from BRCA560 and USARC cohorts;
"CNS_USARC" (40 categories), "CNS_TCGA" (48 categories) to reference copy number signatures from USARC cohort and TCGA;
"CNS_TCGA176" (176 categories) and "CNS_PCAWG176" (176 categories) to reference copy number signatures from PCAWG and TCGA separately.
UPDATE, the latest version of reference version can be automatically
downloaded and loaded from https://cancer.sanger.ac.uk/signatures/downloads/
when a option with |
db_type |
only used when |
method |
default is 'cosine' for cosine similarity. |
normalize |
one of "row" and "feature". "row" is typically used for common mutational signatures. "feature" is designed by me to use when input are copy number signatures. |
feature_setting |
a |
set_order |
if |
pattern_to_rm |
patterns for removing some features/components in similarity
calculation. A vector of component name is also accepted.
The remove operation will be done after normalization. Default is |
verbose |
if |
a list
containing smilarities, aetiologies if available, best match and RSS.
Shixiang Wang [email protected]
Alexandrov, Ludmil B., et al. "The repertoire of mutational signatures in human cancer." Nature 578.7793 (2020): 94-101.
Degasperi, Andrea, et al. "A practical framework and online tool for mutational signature analyses show intertissue variation and driver dependencies." Nature cancer 1.2 (2020): 249-263.
Steele, Christopher D., et al. "Undifferentiated sarcomas develop through distinct evolutionary pathways." Cancer Cell 35.3 (2019): 441-456.
Nik-Zainal, Serena, et al. "Landscape of somatic mutations in 560 breast cancer whole-genome sequences." Nature 534.7605 (2016): 47-54.
Steele, Christopher D., et al. "Signatures of copy number alterations in human cancer." Nature 606.7916 (2022): 984-991.
# Load mutational signature load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) s1 <- get_sig_similarity(sig2, Ref = sig2) s1 s2 <- get_sig_similarity(sig2) s2 s3 <- get_sig_similarity(sig2, sig_db = "SBS") s3 # Set order for result similarity matrix s4 <- get_sig_similarity(sig2, sig_db = "SBS", set_order = TRUE) s4 ## Remove some components ## in similarity calculation s5 <- get_sig_similarity(sig2, Ref = sig2, pattern_to_rm = c("T[T>G]C", "T[T>G]G", "T[T>G]T") ) s5 ## Same to DBS and ID signatures x1 <- get_sig_db("DBS_hg19") x2 <- get_sig_db("DBS_hg38") s6 <- get_sig_similarity(x1$db, x2$db) s6
# Load mutational signature load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) s1 <- get_sig_similarity(sig2, Ref = sig2) s1 s2 <- get_sig_similarity(sig2) s2 s3 <- get_sig_similarity(sig2, sig_db = "SBS") s3 # Set order for result similarity matrix s4 <- get_sig_similarity(sig2, sig_db = "SBS", set_order = TRUE) s4 ## Remove some components ## in similarity calculation s5 <- get_sig_similarity(sig2, Ref = sig2, pattern_to_rm = c("T[T>G]C", "T[T>G]G", "T[T>G]T") ) s5 ## Same to DBS and ID signatures x1 <- get_sig_db("DBS_hg19") x2 <- get_sig_db("DBS_hg38") s6 <- get_sig_similarity(x1$db, x2$db) s6
Get Tidy Signature Association Results
get_tidy_association(cor_res, p_adjust = FALSE, method = "fdr")
get_tidy_association(cor_res, p_adjust = FALSE, method = "fdr")
cor_res |
data returned by |
p_adjust |
logical, if |
method |
p value correction method, see stats::p.adjust for more detail. |
a data.frame
This function takes a data.frame
as input, compares proportion of positive
cases or mean measure in one subgroup and the remaining samples.
group_enrichment( df, grp_vars = NULL, enrich_vars = NULL, cross = TRUE, co_method = c("t.test", "wilcox.test"), ref_group = NA )
group_enrichment( df, grp_vars = NULL, enrich_vars = NULL, cross = TRUE, co_method = c("t.test", "wilcox.test"), ref_group = NA )
df |
a |
grp_vars |
character vector specifying group variables to split samples into subgroups (at least 2 subgroups, otherwise this variable will be skipped). |
enrich_vars |
character vector specifying measure variables to be compared.
If variable is not numeric, only binary cases are accepted in the form of
|
cross |
logical, default is |
co_method |
test method for continuous variable, default is 't.test'. |
ref_group |
reference group set in |
a data.table
with following columns:
grp_var
: group variable name.
enrich_var
: enrich variable (variable to be compared) name.
grp1
: the first group name, should be a member in grp_var
column.
grp2
: the remaining samples, marked as 'Rest'.
grp1_size
: sample size for grp1
.
grp1_pos_measure
: for binary variable, it stores the proportion of
positive cases in grp1
; for continuous variable, it stores mean value.
grp2_size
: sample size for grp2
.
grp2_pos_measure
: same as grp1_pos_measure
but for grp2
.
measure_observed
: for binary variable, it stores odds ratio;
for continuous variable, it stores scaled mean ratio.
measure_tested
: only for binary variable, it stores
estimated odds ratio and its 95% CI from fisher.test()
.
p_value
: for binary variable, it stores p value from fisher.test()
;
for continuous variable, it stores value from wilcox.test()
or t.test()
.
type
: one of "binary" and "continuous".
method
: one of "fish.test", "wilcox.test" and "t.test".
set.seed(1234) df <- dplyr::tibble( g1 = factor(abs(round(rnorm(99, 0, 1)))), g2 = rep(LETTERS[1:4], c(50, 40, 8, 1)), e1 = sample(c("P", "N"), 99, replace = TRUE), e2 = rnorm(99) ) print(str(df)) print(head(df)) # Compare g1:e1, g1:e2, g2:e1 and g2:e2 x1 <- group_enrichment(df, grp_vars = c("g1", "g2"), enrich_vars = c("e1", "e2")) x1 # Only compare g1:e1, g2:e2 x2 <- group_enrichment(df, grp_vars = c("g1", "g2"), enrich_vars = c("e1", "e2"), co_method = "wilcox.test", cross = FALSE ) x2 # Visualization p1 <- show_group_enrichment(x1, fill_by_p_value = TRUE) p1 p2 <- show_group_enrichment(x1, fill_by_p_value = FALSE) p2 p3 <- show_group_enrichment(x1, return_list = TRUE) p3
set.seed(1234) df <- dplyr::tibble( g1 = factor(abs(round(rnorm(99, 0, 1)))), g2 = rep(LETTERS[1:4], c(50, 40, 8, 1)), e1 = sample(c("P", "N"), 99, replace = TRUE), e2 = rnorm(99) ) print(str(df)) print(head(df)) # Compare g1:e1, g1:e2, g2:e1 and g2:e2 x1 <- group_enrichment(df, grp_vars = c("g1", "g2"), enrich_vars = c("e1", "e2")) x1 # Only compare g1:e1, g2:e2 x2 <- group_enrichment(df, grp_vars = c("g1", "g2"), enrich_vars = c("e1", "e2"), co_method = "wilcox.test", cross = FALSE ) x2 # Visualization p1 <- show_group_enrichment(x1, fill_by_p_value = TRUE) p1 p2 <- show_group_enrichment(x1, fill_by_p_value = FALSE) p2 p3 <- show_group_enrichment(x1, return_list = TRUE) p3
More details see group_enrichment()
.
group_enrichment2( df, subset_var, grp_vars, enrich_vars, co_method = c("t.test", "wilcox.test"), ref_group = NA )
group_enrichment2( df, subset_var, grp_vars, enrich_vars, co_method = c("t.test", "wilcox.test"), ref_group = NA )
df |
a |
subset_var |
a column for subsetting. |
grp_vars |
character vector specifying group variables to split samples into subgroups (at least 2 subgroups, otherwise this variable will be skipped). |
enrich_vars |
character vector specifying measure variables to be compared.
If variable is not numeric, only binary cases are accepted in the form of
|
co_method |
test method for continuous variable, default is 't.test'. |
ref_group |
reference group set in |
This can be used for SNV/INDEL count matrix. For copy number analysis, please skip it.
handle_hyper_mutation(nmf_matrix)
handle_hyper_mutation(nmf_matrix)
nmf_matrix |
a |
a matrix
.
Kim, Jaegil, et al. "Somatic ERCC2 mutations are associated with a distinct genomic signature in urothelial tumors." Nature genetics 48.6 (2016): 600.
S4 class for storing summarized MAF. It is from maftools
package.
More about MAF object please see maftools.
data
data.table of MAF file containing all non-synonymous variants.
variants.per.sample
table containing variants per sample
variant.type.summary
table containing variant types per sample
variant.classification.summary
table containing variant classification per sample
gene.summary
table containing variant classification per gene
summary
table with basic MAF summary stats
maf.silent
subset of main MAF containing only silent variants
clinical.data
clinical data associated with each sample/Tumor_Sample_Barcode in MAF.
Output Signature Bootstrap Fitting Results
output_bootstrap(x, result_dir, mut_type = "SBS", sig_db = mut_type)
output_bootstrap(x, result_dir, mut_type = "SBS", sig_db = mut_type)
x |
result from sig_fit_bootstrap_batch. |
result_dir |
a result directory. |
mut_type |
one of 'SBS', 'DBS', 'ID' or 'CN'. |
sig_db |
default 'legacy', it can be 'legacy' (for COSMIC v2 'SBS'),
'SBS', 'DBS', 'ID' and 'TSB' (for COSMIV v3.1 signatures)
for small scale mutations.
For more specific details, it can also be 'SBS_hg19', 'SBS_hg38',
'SBS_mm9', 'SBS_mm10', 'DBS_hg19', 'DBS_hg38', 'DBS_mm9', 'DBS_mm10' to use
COSMIC v3 reference signatures from Alexandrov, Ludmil B., et al. (2020) (reference #1).
In addition, it can be one of "SBS_Nik_lab_Organ", "RS_Nik_lab_Organ",
"SBS_Nik_lab", "RS_Nik_lab" to refer reference signatures from
Degasperi, Andrea, et al. (2020) (reference #2);
"RS_BRCA560", "RS_USARC" to reference signatures from BRCA560 and USARC cohorts;
"CNS_USARC" (40 categories), "CNS_TCGA" (48 categories) to reference copy number signatures from USARC cohort and TCGA;
"CNS_TCGA176" (176 categories) and "CNS_PCAWG176" (176 categories) to reference copy number signatures from PCAWG and TCGA separately.
UPDATE, the latest version of reference version can be automatically
downloaded and loaded from https://cancer.sanger.ac.uk/signatures/downloads/
when a option with |
Nothing.
Output Signature Fitting Results
output_fit(x, result_dir, mut_type = "SBS", sig_db = mut_type)
output_fit(x, result_dir, mut_type = "SBS", sig_db = mut_type)
x |
result from sig_fit. |
result_dir |
a result directory. |
mut_type |
one of 'SBS', 'DBS', 'ID' or 'CN'. |
sig_db |
default 'legacy', it can be 'legacy' (for COSMIC v2 'SBS'),
'SBS', 'DBS', 'ID' and 'TSB' (for COSMIV v3.1 signatures)
for small scale mutations.
For more specific details, it can also be 'SBS_hg19', 'SBS_hg38',
'SBS_mm9', 'SBS_mm10', 'DBS_hg19', 'DBS_hg38', 'DBS_mm9', 'DBS_mm10' to use
COSMIC v3 reference signatures from Alexandrov, Ludmil B., et al. (2020) (reference #1).
In addition, it can be one of "SBS_Nik_lab_Organ", "RS_Nik_lab_Organ",
"SBS_Nik_lab", "RS_Nik_lab" to refer reference signatures from
Degasperi, Andrea, et al. (2020) (reference #2);
"RS_BRCA560", "RS_USARC" to reference signatures from BRCA560 and USARC cohorts;
"CNS_USARC" (40 categories), "CNS_TCGA" (48 categories) to reference copy number signatures from USARC cohort and TCGA;
"CNS_TCGA176" (176 categories) and "CNS_PCAWG176" (176 categories) to reference copy number signatures from PCAWG and TCGA separately.
UPDATE, the latest version of reference version can be automatically
downloaded and loaded from https://cancer.sanger.ac.uk/signatures/downloads/
when a option with |
Nothing.
Output Signature Results
output_sig(sig, result_dir, mut_type = "SBS", sig_db = mut_type)
output_sig(sig, result_dir, mut_type = "SBS", sig_db = mut_type)
sig |
a |
result_dir |
a result directory. |
mut_type |
one of 'SBS', 'DBS', 'ID' or 'CN'. |
sig_db |
default 'legacy', it can be 'legacy' (for COSMIC v2 'SBS'),
'SBS', 'DBS', 'ID' and 'TSB' (for COSMIV v3.1 signatures)
for small scale mutations.
For more specific details, it can also be 'SBS_hg19', 'SBS_hg38',
'SBS_mm9', 'SBS_mm10', 'DBS_hg19', 'DBS_hg38', 'DBS_mm9', 'DBS_mm10' to use
COSMIC v3 reference signatures from Alexandrov, Ludmil B., et al. (2020) (reference #1).
In addition, it can be one of "SBS_Nik_lab_Organ", "RS_Nik_lab_Organ",
"SBS_Nik_lab", "RS_Nik_lab" to refer reference signatures from
Degasperi, Andrea, et al. (2020) (reference #2);
"RS_BRCA560", "RS_USARC" to reference signatures from BRCA560 and USARC cohorts;
"CNS_USARC" (40 categories), "CNS_TCGA" (48 categories) to reference copy number signatures from USARC cohort and TCGA;
"CNS_TCGA176" (176 categories) and "CNS_PCAWG176" (176 categories) to reference copy number signatures from PCAWG and TCGA separately.
UPDATE, the latest version of reference version can be automatically
downloaded and loaded from https://cancer.sanger.ac.uk/signatures/downloads/
when a option with |
Nothing.
Output Tally Result in Barplots
output_tally(x, result_dir, mut_type = "SBS")
output_tally(x, result_dir, mut_type = "SBS")
x |
a matrix with row representing components (motifs) and column representing samples. |
result_dir |
a result directory. |
mut_type |
one of 'SBS', 'DBS', 'ID' or 'CN'. |
Nothing.
Read absolute copy number profile for preparing CNV signature
analysis. See detail part of sig_tally()
to see how to handle sex to get correct
summary.
read_copynumber( input, pattern = NULL, ignore_case = FALSE, seg_cols = c("Chromosome", "Start.bp", "End.bp", "modal_cn"), samp_col = "sample", add_loh = FALSE, loh_min_len = 10000, loh_min_frac = 0.05, join_adj_seg = TRUE, skip_annotation = FALSE, use_all = add_loh, min_segnum = 0L, max_copynumber = 20L, genome_build = c("hg19", "hg38", "T2T", "mm10", "mm9", "ce11"), genome_measure = c("called", "wg"), complement = FALSE, ... )
read_copynumber( input, pattern = NULL, ignore_case = FALSE, seg_cols = c("Chromosome", "Start.bp", "End.bp", "modal_cn"), samp_col = "sample", add_loh = FALSE, loh_min_len = 10000, loh_min_frac = 0.05, join_adj_seg = TRUE, skip_annotation = FALSE, use_all = add_loh, min_segnum = 0L, max_copynumber = 20L, genome_build = c("hg19", "hg38", "T2T", "mm10", "mm9", "ce11"), genome_measure = c("called", "wg"), complement = FALSE, ... )
input |
a |
pattern |
an optional regular expression used to select part of files if
|
ignore_case |
logical. Should pattern-matching be case-insensitive? |
seg_cols |
four strings used to specify chromosome, start position,
end position and copy number value in |
samp_col |
a character used to specify the sample column name. If |
add_loh |
if |
loh_min_len |
The length cut-off for labeling a segment as 'LOH'.
Default is |
loh_min_frac |
When |
join_adj_seg |
if |
skip_annotation |
if |
use_all |
default is |
min_segnum |
minimal number of copy number segments within a sample. |
max_copynumber |
bigger copy number within a sample will be reset to this value. |
genome_build |
genome build version, should be 'hg19', 'hg38', 'mm9' or 'mm10'. |
genome_measure |
default is 'called', can be 'wg' or 'called'. Set 'called' will use called segments size to compute total size for CNA burden calculation, this option is useful for WES and target sequencing. Set 'wg' will use autosome size from genome build, this option is useful for WGS, SNP etc.. |
complement |
if |
... |
other parameters pass to |
a CopyNumber object.
Shixiang Wang [email protected]
read_maf for reading mutation data to MAF object.
# Load toy dataset of absolute copynumber profile load(system.file("extdata", "toy_segTab.RData", package = "sigminer", mustWork = TRUE )) cn <- read_copynumber(segTabs, seg_cols = c("chromosome", "start", "end", "segVal"), genome_build = "hg19", complement = FALSE ) cn cn_subset <- subset(cn, sample == "TCGA-DF-A2KN-01A-11D-A17U-01") # Add LOH set.seed(1234) segTabs$minor_cn <- sample(c(0, 1), size = nrow(segTabs), replace = TRUE) cn <- read_copynumber(segTabs, seg_cols = c("chromosome", "start", "end", "segVal"), genome_measure = "wg", complement = TRUE, add_loh = TRUE ) # Use tally method "S" (Steele et al.) tally_s <- sig_tally(cn, method = "S") tab_file <- system.file("extdata", "metastatic_tumor.segtab.txt", package = "sigminer", mustWork = TRUE ) cn2 <- read_copynumber(tab_file) cn2
# Load toy dataset of absolute copynumber profile load(system.file("extdata", "toy_segTab.RData", package = "sigminer", mustWork = TRUE )) cn <- read_copynumber(segTabs, seg_cols = c("chromosome", "start", "end", "segVal"), genome_build = "hg19", complement = FALSE ) cn cn_subset <- subset(cn, sample == "TCGA-DF-A2KN-01A-11D-A17U-01") # Add LOH set.seed(1234) segTabs$minor_cn <- sample(c(0, 1), size = nrow(segTabs), replace = TRUE) cn <- read_copynumber(segTabs, seg_cols = c("chromosome", "start", "end", "segVal"), genome_measure = "wg", complement = TRUE, add_loh = TRUE ) # Use tally method "S" (Steele et al.) tally_s <- sig_tally(cn, method = "S") tab_file <- system.file("extdata", "metastatic_tumor.segtab.txt", package = "sigminer", mustWork = TRUE ) cn2 <- read_copynumber(tab_file) cn2
Note, the result is not a CopyNumber
object, you need to generate it
by yourself.
read_copynumber_ascat(x)
read_copynumber_ascat(x)
x |
one or more |
a tidy list
.
Read Absolute Copy Number Profile from Sequenza Result Directory
read_copynumber_seqz(target_dir, return_df = FALSE, ...)
read_copynumber_seqz(target_dir, return_df = FALSE, ...)
target_dir |
a directory path. |
return_df |
if |
... |
other parameters passing to |
a data.frame
or a CopyNumber
object.
This function is a wrapper of maftools::read.maf. Useless options in maftools::read.maf are dropped here. You can also use maftools::read.maf to read the data. All reference alleles and mutation alleles should be recorded in positive strand format.
read_maf(maf, verbose = TRUE) read_maf_minimal(dt)
read_maf(maf, verbose = TRUE) read_maf_minimal(dt)
maf |
tab delimited MAF file. File can also be gz compressed. Required. Alternatively, you can also provide already read MAF file as a dataframe. |
verbose |
TRUE logical. Default to be talkative and prints summary. |
dt |
A data.frame contains at least the following columns: "Tumor_Sample_Barcode", "Chromosome", "Start_Position", "End_Position", "Reference_Allele", "Tumor_Seq_Allele2" |
read_maf_minimal()
: Read Maf data.frame from a minimal maf-like data
read_copynumber for reading copy number data to CopyNumber object.
laml.maf <- system.file("extdata", "tcga_laml.maf.gz", package = "maftools", mustWork = TRUE) if (!require("R.utils")) { message("Please install 'R.utils' package firstly") } else { laml <- read_maf(maf = laml.maf) laml laml_mini <- laml@data[, list( Tumor_Sample_Barcode, Chromosome, Start_Position, End_Position, Reference_Allele, Tumor_Seq_Allele2 )] laml2 <- read_maf_minimal(laml_mini) laml2 }
laml.maf <- system.file("extdata", "tcga_laml.maf.gz", package = "maftools", mustWork = TRUE) if (!require("R.utils")) { message("Please install 'R.utils' package firstly") } else { laml <- read_maf(maf = laml.maf) laml laml_mini <- laml@data[, list( Tumor_Sample_Barcode, Chromosome, Start_Position, End_Position, Reference_Allele, Tumor_Seq_Allele2 )] laml2 <- read_maf_minimal(laml_mini) laml2 }
Read Structural Variation Data as RS object
read_sv_as_rs(input)
read_sv_as_rs(input)
input |
a
|
a list
sv <- readRDS(system.file("extdata", "toy_sv.rds", package = "sigminer", mustWork = TRUE)) rs <- read_sv_as_rs(sv) # svclass is optional rs2 <- read_sv_as_rs(sv[, setdiff(colnames(sv), "svclass")]) identical(rs, rs2) ## Not run: tally_rs <- sig_tally(rs) ## End(Not run)
sv <- readRDS(system.file("extdata", "toy_sv.rds", package = "sigminer", mustWork = TRUE)) rs <- read_sv_as_rs(sv) # svclass is optional rs2 <- read_sv_as_rs(sv[, setdiff(colnames(sv), "svclass")]) identical(rs, rs2) ## Not run: tally_rs <- sig_tally(rs) ## End(Not run)
MAF file is more recommended. In this function, we will mimic
the MAF object from the key c(1, 2, 4, 5, 7)
columns of VCF file.
read_vcf( vcfs, samples = NULL, genome_build = c("hg19", "hg38", "T2T", "mm10", "mm9", "ce11"), keep_only_pass = FALSE, verbose = TRUE )
read_vcf( vcfs, samples = NULL, genome_build = c("hg19", "hg38", "T2T", "mm10", "mm9", "ce11"), keep_only_pass = FALSE, verbose = TRUE )
vcfs |
VCF file paths. |
samples |
sample names for VCF files. |
genome_build |
genome build version like "hg19". |
keep_only_pass |
if |
verbose |
if |
a MAF.
vcfs <- list.files(system.file("extdata", package = "sigminer"), "*.vcf", full.names = TRUE) maf <- read_vcf(vcfs) maf <- read_vcf(vcfs, keep_only_pass = TRUE)
vcfs <- list.files(system.file("extdata", package = "sigminer"), "*.vcf", full.names = TRUE) maf <- read_vcf(vcfs) maf <- read_vcf(vcfs, keep_only_pass = TRUE)
Read UCSC Xena Variant Format Data as MAF Object
read_xena_variants(path)
read_xena_variants(path)
path |
a path to variant file. |
a MAF
object.
if (requireNamespace("UCSCXenaTools")) { library(UCSCXenaTools) options(use_hiplot = TRUE) example_file <- XenaGenerate(subset = XenaDatasets == "mc3/ACC_mc3.txt") %>% XenaQuery() %>% XenaDownload() x <- read_xena_variants(example_file$destfiles) x@data y <- sig_tally(x) y }
if (requireNamespace("UCSCXenaTools")) { library(UCSCXenaTools) options(use_hiplot = TRUE) example_file <- XenaGenerate(subset = XenaDatasets == "mc3/ACC_mc3.txt") %>% XenaQuery() %>% XenaDownload() x <- read_xena_variants(example_file$destfiles) x@data y <- sig_tally(x) y }
See examples in sig_fit_bootstrap.
report_bootstrap_p_value(x, thresholds = c(0.01, 0.05, 0.1))
report_bootstrap_p_value(x, thresholds = c(0.01, 0.05, 0.1))
x |
a (list of) result from sig_fit_bootstrap. |
thresholds |
a vector of relative exposure threshold for calculating p values. |
a (list of) matrix
This is a wrapper for several implementation that classify samples into same size clusters, the details please see this blog. The source code is modified based on code from the blog.
same_size_clustering( mat, diss = FALSE, clsize = NULL, algo = c("nnit", "hcbottom", "kmvar"), method = c("maxd", "random", "mind", "elki", "ward.D", "average", "complete", "single") )
same_size_clustering( mat, diss = FALSE, clsize = NULL, algo = c("nnit", "hcbottom", "kmvar"), method = c("maxd", "random", "mind", "elki", "ward.D", "average", "complete", "single") )
mat |
a data/distance matrix. |
diss |
if |
clsize |
integer, number of sample within a cluster. |
algo |
algorithm. |
method |
method. |
a vector.
set.seed(1234L) x <- rbind( matrix(rnorm(100, sd = 0.3), ncol = 2), matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2) ) colnames(x) <- c("x", "y") y1 <- same_size_clustering(x, clsize = 10) y11 <- same_size_clustering(as.matrix(dist(x)), clsize = 10, diss = TRUE) y2 <- same_size_clustering(x, clsize = 10, algo = "hcbottom", method = "ward.D") y3 <- same_size_clustering(x, clsize = 10, algo = "kmvar") y33 <- same_size_clustering(as.matrix(dist(x)), clsize = 10, algo = "kmvar", diss = TRUE)
set.seed(1234L) x <- rbind( matrix(rnorm(100, sd = 0.3), ncol = 2), matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2) ) colnames(x) <- c("x", "y") y1 <- same_size_clustering(x, clsize = 10) y11 <- same_size_clustering(as.matrix(dist(x)), clsize = 10, diss = TRUE) y2 <- same_size_clustering(x, clsize = 10, algo = "hcbottom", method = "ward.D") y3 <- same_size_clustering(x, clsize = 10, algo = "kmvar") y33 <- same_size_clustering(as.matrix(dist(x)), clsize = 10, algo = "kmvar", diss = TRUE)
Returns quantification of copy number profile and events including tandem duplication and Chromothripisis etc. Only copy number data from autosome is used here. Some of the quantification methods are rough, you use at your risk. You should do some extra work to check the result scores.
scoring(object, TD_size_cutoff = c(1000, 1e+05, 2e+06), TD_cn_cutoff = Inf)
scoring(object, TD_size_cutoff = c(1000, 1e+05, 2e+06), TD_cn_cutoff = Inf)
object |
a object of CopyNumber. |
TD_size_cutoff |
a length-3 numeric vector used to specify the start, midpoint, end segment size for determining tandem duplication size range, midpoint is used to split TD into short TD and long TD. Default is 1Kb to 100Kb for short TD, 100Kb to 2Mb for long TD. |
TD_cn_cutoff |
a number defining the maximum copy number of TD,
default is |
a data.table
with following scores:
cnaBurden: CNA burden representing the altered genomic fraction as previously reported.
cnaLoad: CNA load representing the quantity of copy number alteration.
MACN: mean altered copy number (MACN) reflecting the property of altered copy number segments, calculated as
where is the copy number of altered segment
,
is
the number of CNV.
weightedMACN: same as MACN but weighted with segment length.
where is the length of altered copy number segment
.
Ploidy: ploidy, the formula is same as weightedMACN
but using all copy number segments instead of
altered copy number segments.
TDP_pnas: tandem duplication phenotype score from https://www.pnas.org/doi/10.1073/pnas.1520010113
,
the threshold k
in reference is omitted.
where is the number of TD,
and
are observed number of TD and expected number of TD for each chromosome.
TDP: tandem duplication score used defined by our group work, TD represents segment with copy number greater than 2.
sTDP: TDP score for short TD.
lTDP: TDP score for long TD.
TDP_size : TDP region size (Mb).
sTDP_size: sTDP region size (Mb).
lTDP_size: lTDP region size(Mb).
Chromoth_state: chromothripsis state score, according to reference doi:10.1016/j.cell.2013.02.023, chromothripsis frequently leads to massive loss of segments on the affected chromosome with segmental losses being interspersed with regions displaying normal (disomic) copy-number (e.g., copy-number states oscillating between copy-number = 1 and copy-number = 2), form tens to hundreds of locally clustered DNA rearrangements. Most of methods use both SV and CNV to infer chromothripsis, here we roughly quantify it with
where is the number of oscillating copy number pattern "2-1-2" for each chromosome.
# Load copy number object load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) d <- scoring(cn) d d2 <- scoring(cn, TD_cn_cutoff = 4L) d2
# Load copy number object load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) d <- scoring(cn) d d2 <- scoring(cn, TD_cn_cutoff = 4L) d2
Show Alteration Catalogue Profile
show_catalogue( catalogue, mode = c("SBS", "copynumber", "DBS", "ID", "RS"), method = "Wang", normalize = c("raw", "row", "feature"), style = c("default", "cosmic"), samples = NULL, samples_name = NULL, x_lab = "Components", y_lab = "Counts", ... )
show_catalogue( catalogue, mode = c("SBS", "copynumber", "DBS", "ID", "RS"), method = "Wang", normalize = c("raw", "row", "feature"), style = c("default", "cosmic"), samples = NULL, samples_name = NULL, x_lab = "Components", y_lab = "Counts", ... )
catalogue |
result from sig_tally or a matrix with row representing components (motifs) and column representing samples |
mode |
signature type for plotting, now supports 'copynumber', 'SBS', 'DBS', 'ID' and 'RS' (genome rearrangement signature). |
method |
method for copy number feature classification in sig_tally, can be one of "Wang" ("W"), "S". |
normalize |
normalize method. |
style |
plot style, one of 'default' and 'cosmic'. |
samples |
default is |
samples_name |
set the sample names shown in plot. |
x_lab |
x axis lab. |
y_lab |
y axis lab. |
... |
other arguments passing to show_sig_profile. |
a ggplot
object
data("simulated_catalogs") p <- show_catalogue(simulated_catalogs$set1, style = "cosmic") p
data("simulated_catalogs") p <- show_catalogue(simulated_catalogs$set1, style = "cosmic") p
Another visualization method for copy number profile like show_cn_profile.
show_cn_circos( data, samples = NULL, show_title = TRUE, chrs = paste0("chr", 1:22), genome_build = c("hg19", "hg38", "T2T", "mm10", "mm9", "ce11"), col = NULL, side = "inside", ... )
show_cn_circos( data, samples = NULL, show_title = TRUE, chrs = paste0("chr", 1:22), genome_build = c("hg19", "hg38", "T2T", "mm10", "mm9", "ce11"), col = NULL, side = "inside", ... )
data |
a CopyNumber object or a |
samples |
default is |
show_title |
if |
chrs |
chromosomes start with 'chr'. |
genome_build |
genome build version, used when |
col |
colors for the heatmaps. If it is |
side |
side of the heatmaps. |
... |
other parameters passing to circlize::circos.genomicHeatmap. |
a circos plot
load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) show_cn_circos(cn, samples = 1) show_cn_circos(cn, samples = "TCGA-99-7458-01A-11D-2035-01") ## Remove title show_cn_circos(cn, samples = 1, show_title = FALSE) ## Subset chromosomes show_cn_circos(cn, samples = 1, chrs = c("chr1", "chr2", "chr3")) ## Arrange plots layout(matrix(1:4, 2, 2)) show_cn_circos(cn, samples = 4) layout(1) # reset layout
load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) show_cn_circos(cn, samples = 1) show_cn_circos(cn, samples = "TCGA-99-7458-01A-11D-2035-01") ## Remove title show_cn_circos(cn, samples = 1, show_title = FALSE) ## Subset chromosomes show_cn_circos(cn, samples = 1, chrs = c("chr1", "chr2", "chr3")) ## Arrange plots layout(matrix(1:4, 2, 2)) show_cn_circos(cn, samples = 4) layout(1) # reset layout
Show classified components ("Wang" ("W") method) for copy number data.
show_cn_components( parameters, method = "Wang", show_weights = TRUE, log_y = FALSE, return_plotlist = FALSE, base_size = 12, nrow = 2, align = "hv", ... )
show_cn_components( parameters, method = "Wang", show_weights = TRUE, log_y = FALSE, return_plotlist = FALSE, base_size = 12, nrow = 2, align = "hv", ... )
parameters |
a |
method |
method for feature classification, can be one of "Wang" ("W"), "S" (for method described in Steele et al. 2019), "X" (for method described in Tao et al. 2023). |
show_weights |
default is |
log_y |
logical, if |
return_plotlist |
if |
base_size |
overall font size. |
nrow |
(optional) Number of rows in the plot grid. |
align |
(optional) Specifies whether graphs in the grid should be horizontally ("h") or vertically ("v") aligned. Options are "none" (default), "hv" (align in both directions), "h", and "v". |
... |
other options pass to |
a ggplot
object
Shixiang Wang [email protected]
Visually summarize copy number distribution either by copy number segment length
or chromosome. Input is a CopyNumber object, genome_build
option will
read from genome_build
slot of object.
show_cn_distribution( data, rm_normal = TRUE, mode = c("ld", "cd"), fill = FALSE, scale_chr = TRUE, base_size = 14 )
show_cn_distribution( data, rm_normal = TRUE, mode = c("ld", "cd"), fill = FALSE, scale_chr = TRUE, base_size = 14 )
data |
a CopyNumber object. |
rm_normal |
logical. Whether remove normal copy (i.e. "segVal" equals 2), default is |
mode |
either "ld" for distribution by CN length or "cd" for distribution by chromosome. |
fill |
when |
scale_chr |
logical. If |
base_size |
overall font size. |
a ggplot
object
Shixiang Wang [email protected]
# Load copy number object load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) # Plot distribution p1 <- show_cn_distribution(cn) p1 p2 <- show_cn_distribution(cn, mode = "cd") p2 p3 <- show_cn_distribution(cn, mode = "cd", fill = TRUE) p3
# Load copy number object load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) # Plot distribution p1 <- show_cn_distribution(cn) p1 p2 <- show_cn_distribution(cn, mode = "cd") p2 p3 <- show_cn_distribution(cn, mode = "cd", fill = TRUE) p3
Show Copy Number Feature Distributions
show_cn_features( features, method = "Wang", rm_outlier = FALSE, ylab = NULL, log_y = FALSE, return_plotlist = FALSE, base_size = 12, nrow = 2, align = "hv", ... )
show_cn_features( features, method = "Wang", rm_outlier = FALSE, ylab = NULL, log_y = FALSE, return_plotlist = FALSE, base_size = 12, nrow = 2, align = "hv", ... )
features |
a feature |
method |
method for feature classification, can be one of "Wang" ("W"), "S" (for method described in Steele et al. 2019), "X" (for method described in Tao et al. 2023). |
rm_outlier |
default is |
ylab |
lab of y axis. |
log_y |
logical, if |
return_plotlist |
if |
base_size |
overall font size. |
nrow |
(optional) Number of rows in the plot grid. |
align |
(optional) Specifies whether graphs in the grid should be horizontally ("h") or vertically ("v") aligned. Options are "none" (default), "hv" (align in both directions), "h", and "v". |
... |
other options pass to |
a ggplot
object
Show Copy Number Variation Frequency Profile with Circos
show_cn_freq_circos( data, groups = NULL, cutoff = 2L, resolution_factor = 1L, title = c("AMP", "DEL"), chrs = paste0("chr", 1:22), genome_build = c("hg19", "hg38", "T2T", "mm10", "mm9", "ce11"), cols = NULL, plot_ideogram = TRUE, track_height = 0.5, ideogram_height = 1, ... )
show_cn_freq_circos( data, groups = NULL, cutoff = 2L, resolution_factor = 1L, title = c("AMP", "DEL"), chrs = paste0("chr", 1:22), genome_build = c("hg19", "hg38", "T2T", "mm10", "mm9", "ce11"), cols = NULL, plot_ideogram = TRUE, track_height = 0.5, ideogram_height = 1, ... )
data |
a |
groups |
a named list or a column name for specifying groups. |
cutoff |
copy number value cutoff for splitting data into AMP and DEL.
The values equal to cutoff are discarded. Default is |
resolution_factor |
an integer to control the resolution.
When it is |
title |
length-2 titles for AMP and DEL. |
chrs |
chromosomes start with 'chr'. |
genome_build |
genome build version, used when |
cols |
length-2 colors for AMP and DEL. |
plot_ideogram |
default is |
track_height |
track height in |
ideogram_height |
ideogram height in |
... |
other parameters passing to circlize::circos.genomicLines. |
Nothing.
load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) show_cn_freq_circos(cn) ss <- unique(cn@data$sample) show_cn_freq_circos(cn, groups = list(a = ss[1:5], b = ss[6:10]), cols = c("red", "green"))
load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) show_cn_freq_circos(cn) ss <- unique(cn@data$sample) show_cn_freq_circos(cn, groups = list(a = ss[1:5], b = ss[6:10]), cols = c("red", "green"))
Show Summary Copy Number Profile for Sample Groups
show_cn_group_profile( data, groups = NULL, fill_area = TRUE, cols = NULL, chrs = paste0("chr", c(1:22, "X")), genome_build = c("hg19", "hg38", "T2T", "mm10", "mm9", "ce11"), cutoff = 2L, resolution_factor = 1L, force_y_limit = TRUE, highlight_genes = NULL, repel = FALSE, nrow = NULL, ncol = NULL, return_plotlist = FALSE )
show_cn_group_profile( data, groups = NULL, fill_area = TRUE, cols = NULL, chrs = paste0("chr", c(1:22, "X")), genome_build = c("hg19", "hg38", "T2T", "mm10", "mm9", "ce11"), cutoff = 2L, resolution_factor = 1L, force_y_limit = TRUE, highlight_genes = NULL, repel = FALSE, nrow = NULL, ncol = NULL, return_plotlist = FALSE )
data |
a |
groups |
a named list or a column name for specifying groups. |
fill_area |
default is |
cols |
length-2 colors for AMP and DEL. |
chrs |
chromosomes start with 'chr'. |
genome_build |
genome build version, used when |
cutoff |
copy number value cutoff for splitting data into AMP and DEL.
The values equal to cutoff are discarded. Default is |
resolution_factor |
an integer to control the resolution.
When it is |
force_y_limit |
default is |
highlight_genes |
gene list to highlight. have same y ranges. You can also set a length-2 numeric value. |
repel |
if |
nrow |
number of rows in the plot grid when multiple samples are selected. |
ncol |
number of columns in the plot grid when multiple samples are selected. |
return_plotlist |
default is |
a (list of) ggplot
object.
load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) p1 <- show_cn_group_profile(cn) p1 ss <- unique(cn@data$sample) p2 <- show_cn_group_profile(cn, groups = list(a = ss[1:5], b = ss[6:10])) p2 p3 <- show_cn_group_profile(cn, groups = list(g1 = ss[1:5], g2 = ss[6:10]), force_y_limit = c(-1, 1), nrow = 2 ) p3 ## Set custom cutoff for custom data data <- cn@data data$segVal <- data$segVal - 2L p4 <- show_cn_group_profile(data, groups = list(g1 = ss[1:5], g2 = ss[6:10]), force_y_limit = c(-1, 1), nrow = 2, cutoff = c(0, 0) ) p4 ## Add highlight gene p5 <- show_cn_group_profile(cn, highlight_genes = c("TP53", "EGFR")) p5
load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) p1 <- show_cn_group_profile(cn) p1 ss <- unique(cn@data$sample) p2 <- show_cn_group_profile(cn, groups = list(a = ss[1:5], b = ss[6:10])) p2 p3 <- show_cn_group_profile(cn, groups = list(g1 = ss[1:5], g2 = ss[6:10]), force_y_limit = c(-1, 1), nrow = 2 ) p3 ## Set custom cutoff for custom data data <- cn@data data$segVal <- data$segVal - 2L p4 <- show_cn_group_profile(data, groups = list(g1 = ss[1:5], g2 = ss[6:10]), force_y_limit = c(-1, 1), nrow = 2, cutoff = c(0, 0) ) p4 ## Add highlight gene p5 <- show_cn_group_profile(cn, highlight_genes = c("TP53", "EGFR")) p5
Sometimes it is very useful to check details about copy number profile for one or multiple samples. This function is designed to do this job and can be further modified by ggplot2 related packages.
show_cn_profile( data, samples = NULL, show_n = NULL, show_title = FALSE, show_labels = NULL, chrs = paste0("chr", 1:22), position = NULL, genome_build = c("hg19", "hg38", "T2T", "mm10", "mm9", "ce11"), ylim = NULL, nrow = NULL, ncol = NULL, return_plotlist = FALSE )
show_cn_profile( data, samples = NULL, show_n = NULL, show_title = FALSE, show_labels = NULL, chrs = paste0("chr", 1:22), position = NULL, genome_build = c("hg19", "hg38", "T2T", "mm10", "mm9", "ce11"), ylim = NULL, nrow = NULL, ncol = NULL, return_plotlist = FALSE )
data |
a CopyNumber object or a |
samples |
default is NULL, can be a chracter vector representing multiple samples. If |
show_n |
number of samples to show, this is used for checking. |
show_title |
if |
show_labels |
one of |
chrs |
chromosomes start with 'chr'. |
position |
a position range, e.g. |
genome_build |
genome build version, used when |
ylim |
limites for y axis. |
nrow |
number of rows in the plot grid when multiple samples are selected. |
ncol |
number of columns in the plot grid when multiple samples are selected. |
return_plotlist |
default is |
a ggplot
object or a list
# Load copy number object load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) p <- show_cn_profile(cn, nrow = 2, ncol = 1) p p2 <- show_cn_profile(cn, nrow = 2, ncol = 1, position = "chr1:3218923-116319008" ) p2
# Load copy number object load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) p <- show_cn_profile(cn, nrow = 2, ncol = 1) p p2 <- show_cn_profile(cn, nrow = 2, ncol = 1, position = "chr1:3218923-116319008" ) p2
All variables must be continuous.
The matrix will be returned as an element of ggplot
object.
This is basically a wrapper of R package
ggcorrplot.
show_cor( data, x_vars = colnames(data), y_vars = x_vars, cor_method = "spearman", vis_method = "square", lab = TRUE, test = TRUE, hc_order = FALSE, p_adj = NULL, ... )
show_cor( data, x_vars = colnames(data), y_vars = x_vars, cor_method = "spearman", vis_method = "square", lab = TRUE, test = TRUE, hc_order = FALSE, p_adj = NULL, ... )
data |
a |
x_vars |
variables/column names shown in x axis. |
y_vars |
variables/column names shown in y axis. |
cor_method |
method for correlation, default is 'spearman'. |
vis_method |
visualization method, default is 'square', can also be 'circle'. |
lab |
logical value. If TRUE, add correlation coefficient on the plot. |
test |
if |
hc_order |
logical value. If |
p_adj |
p adjust method, see stats::p.adjust for details. |
... |
other parameters passing to |
a ggplot
object
show_sig_feature_corrplot for specific and more powerful association analysis and visualization.
data("mtcars") p1 <- show_cor(mtcars) p2 <- show_cor(mtcars, x_vars = colnames(mtcars)[1:4], y_vars = colnames(mtcars)[5:8] ) p3 <- show_cor(mtcars, vis_method = "circle", p_adj = "fdr") p1 p1$cor p2 p3 ## Auto detect problem variables mtcars$xx <- 0L p4 <- show_cor(mtcars) p4
data("mtcars") p1 <- show_cor(mtcars) p2 <- show_cor(mtcars, x_vars = colnames(mtcars)[1:4], y_vars = colnames(mtcars)[5:8] ) p3 <- show_cor(mtcars, vis_method = "circle", p_adj = "fdr") p1 p1$cor p2 p3 ## Auto detect problem variables mtcars$xx <- 0L p4 <- show_cor(mtcars) p4
Show Signature Information in Web Browser
show_cosmic(x = "home")
show_cosmic(x = "home")
x |
a string indicating location ("home" for COSMIC signature home, "legacy" for COSMIC v2 signatures, "SBS" for COSMIC v3 SBS signatures, "DBS" for COSMIC v3 DBS signatures, "ID" for COSMIC v3 INDEL signatures) or signature index (e.g. "SBS1", "DBS2", "ID3"). |
Nothing.
## Not run: show_cosmic() show_cosmic("legacy") show_cosmic("SBS") show_cosmic("DBS") show_cosmic("ID") show_cosmic("SBS1") show_cosmic("DBS2") show_cosmic("ID3") ## End(Not run)
## Not run: show_cosmic() show_cosmic("legacy") show_cosmic("SBS") show_cosmic("DBS") show_cosmic("ID") show_cosmic("SBS1") show_cosmic("DBS2") show_cosmic("ID3") ## End(Not run)
Plot Reference (Mainly COSMIC) Signature Profile
show_cosmic_sig_profile( sig_index = NULL, show_index = TRUE, sig_db = "legacy", ... )
show_cosmic_sig_profile( sig_index = NULL, show_index = TRUE, sig_db = "legacy", ... )
sig_index |
a vector for signature index. "ALL" for all signatures. |
show_index |
if |
sig_db |
default 'legacy', it can be 'legacy' (for COSMIC v2 'SBS'),
'SBS', 'DBS', 'ID' and 'TSB' (for COSMIV v3.1 signatures)
for small scale mutations.
For more specific details, it can also be 'SBS_hg19', 'SBS_hg38',
'SBS_mm9', 'SBS_mm10', 'DBS_hg19', 'DBS_hg38', 'DBS_mm9', 'DBS_mm10' to use
COSMIC v3 reference signatures from Alexandrov, Ludmil B., et al. (2020) (reference #1).
In addition, it can be one of "SBS_Nik_lab_Organ", "RS_Nik_lab_Organ",
"SBS_Nik_lab", "RS_Nik_lab" to refer reference signatures from
Degasperi, Andrea, et al. (2020) (reference #2);
"RS_BRCA560", "RS_USARC" to reference signatures from BRCA560 and USARC cohorts;
"CNS_USARC" (40 categories), "CNS_TCGA" (48 categories) to reference copy number signatures from USARC cohort and TCGA;
"CNS_TCGA176" (176 categories) and "CNS_PCAWG176" (176 categories) to reference copy number signatures from PCAWG and TCGA separately.
UPDATE, the latest version of reference version can be automatically
downloaded and loaded from https://cancer.sanger.ac.uk/signatures/downloads/
when a option with |
... |
other arguments passing to show_sig_profile. |
a ggplot
object
Shixiang Wang [email protected]
show_cosmic_sig_profile() show_cosmic_sig_profile(sig_db = "SBS") show_cosmic_sig_profile(sig_index = 1:5) show_cosmic_sig_profile(sig_db = "SBS", sig_index = c("10a", "17a")) gg <- show_cosmic_sig_profile(sig_index = 1:5) gg$aetiology
show_cosmic_sig_profile() show_cosmic_sig_profile(sig_db = "SBS") show_cosmic_sig_profile(sig_index = 1:5) show_cosmic_sig_profile(sig_db = "SBS", sig_index = c("10a", "17a")) gg <- show_cosmic_sig_profile(sig_index = 1:5) gg$aetiology
Using result data from get_group_comparison, this function plots
genotypes/phenotypes comparison between signature groups using ggplot2 package and return
a list of ggplot
object contains individual and combined plots. The combined
plot is easily saved to local using cowplot::save_plot()
. Of note, default fisher
test p values are shown for categorical data and fdr values are shown for
continuous data.
show_group_comparison( group_comparison, xlab = "group", ylab_co = NA, legend_title_ca = NA, legend_position_ca = "bottom", set_ca_sig_yaxis = FALSE, set_ca_custom_xlab = FALSE, show_pvalue = TRUE, ca_p_threshold = 0.01, method = "wilcox.test", p.adjust.method = "fdr", base_size = 12, font_size_x = 12, text_angle_x = 30, text_hjust_x = 0.2, ... )
show_group_comparison( group_comparison, xlab = "group", ylab_co = NA, legend_title_ca = NA, legend_position_ca = "bottom", set_ca_sig_yaxis = FALSE, set_ca_custom_xlab = FALSE, show_pvalue = TRUE, ca_p_threshold = 0.01, method = "wilcox.test", p.adjust.method = "fdr", base_size = 12, font_size_x = 12, text_angle_x = 30, text_hjust_x = 0.2, ... )
group_comparison |
a |
xlab |
lab name of x axis for all plots. if it is |
ylab_co |
lab name of y axis for plots of continuous type data. Of note,
this argument should be a character vector has same length as |
legend_title_ca |
legend title for plots of categorical type data. |
legend_position_ca |
legend position for plots of categorical type data.
Of note,
this argument should be a character vector has same length as |
set_ca_sig_yaxis |
if |
set_ca_custom_xlab |
only works when |
show_pvalue |
if |
ca_p_threshold |
a p threshold for categorical variables, default is 0.01.
A p value less than 0.01 will be shown as |
method |
a character string indicating which method to be used for comparing means. It can be 't.test', 'wilcox.test' etc.. |
p.adjust.method |
correction method, default is 'fdr'. Run |
base_size |
overall font size. |
font_size_x |
font size for x. |
text_angle_x |
text angle for x. |
text_hjust_x |
adjust x axis text |
... |
other paramters pass to |
a list
of ggplot
objects.
Shixiang Wang [email protected]
load(system.file("extdata", "toy_copynumber_signature_by_W.RData", package = "sigminer", mustWork = TRUE )) # Assign samples to clusters groups <- get_groups(sig, method = "k-means") set.seed(1234) groups$prob <- rnorm(10) groups$new_group <- sample(c("1", "2", "3", "4", NA), size = nrow(groups), replace = TRUE) # Compare groups (filter NAs for categorical coloumns) groups.cmp <- get_group_comparison(groups[, -1], col_group = "group", cols_to_compare = c("prob", "new_group"), type = c("co", "ca"), verbose = TRUE ) # Compare groups (Set NAs of categorical columns to 'Rest') groups.cmp2 <- get_group_comparison(groups[, -1], col_group = "group", cols_to_compare = c("prob", "new_group"), type = c("co", "ca"), NAs = "Rest", verbose = TRUE ) show_group_comparison(groups.cmp) ggcomp <- show_group_comparison(groups.cmp2) ggcomp$co_comb ggcomp$ca_comb
load(system.file("extdata", "toy_copynumber_signature_by_W.RData", package = "sigminer", mustWork = TRUE )) # Assign samples to clusters groups <- get_groups(sig, method = "k-means") set.seed(1234) groups$prob <- rnorm(10) groups$new_group <- sample(c("1", "2", "3", "4", NA), size = nrow(groups), replace = TRUE) # Compare groups (filter NAs for categorical coloumns) groups.cmp <- get_group_comparison(groups[, -1], col_group = "group", cols_to_compare = c("prob", "new_group"), type = c("co", "ca"), verbose = TRUE ) # Compare groups (Set NAs of categorical columns to 'Rest') groups.cmp2 <- get_group_comparison(groups[, -1], col_group = "group", cols_to_compare = c("prob", "new_group"), type = c("co", "ca"), NAs = "Rest", verbose = TRUE ) show_group_comparison(groups.cmp) ggcomp <- show_group_comparison(groups.cmp2) ggcomp$co_comb ggcomp$ca_comb
This is a general function, it can be used in any proper analysis.
show_group_distribution( data, gvar, dvar, fun = stats::median, order_by_fun = FALSE, alpha = 0.8, g_label = "label", g_angle = 0, g_position = "top", point_size = 1L, segment_size = 1L, segment_color = "red", xlab = NULL, ylab = NULL, nrow = 1L, background_color = c("#DCDCDC", "#F5F5F5") )
show_group_distribution( data, gvar, dvar, fun = stats::median, order_by_fun = FALSE, alpha = 0.8, g_label = "label", g_angle = 0, g_position = "top", point_size = 1L, segment_size = 1L, segment_color = "red", xlab = NULL, ylab = NULL, nrow = 1L, background_color = c("#DCDCDC", "#F5F5F5") )
data |
a |
gvar |
a group variable name/index. |
dvar |
a distribution variable name/index. |
fun |
a function to summarize, default is stats::median, can also be mean. |
order_by_fun |
if |
alpha |
alpha for points, range from 0 to 1. |
g_label |
a string 'label' (default) for labeling with sample size, or 'norm' to show just group name, or a named vector to set facet labels. |
g_angle |
angle for facet labels, default is |
g_position |
position for facet labels, default is 'top', can also be 'bottom'. |
point_size |
size of point. |
segment_size |
size of segment. |
segment_color |
color of segment. |
xlab |
title for x axis. |
ylab |
title for y axis. |
nrow |
number of row. |
background_color |
background color for plot panel. |
a ggplot
object.
Shixiang Wang [email protected]
set.seed(1234) data <- data.frame( yval = rnorm(120), gr = c(rep("A", 50), rep("B", 40), rep("C", 30)) ) p <- show_group_distribution(data, gvar = 2, dvar = 1, g_label = "norm", background_color = "grey" ) p p2 <- show_group_distribution(data, gvar = "gr", dvar = "yval", g_position = "bottom", order_by_fun = TRUE, alpha = 0.3 ) p2 # Set custom group names p3 <- show_group_distribution(data, gvar = 2, dvar = 1, g_label = c("A" = "X", "B" = "Y", "C" = "Z") ) p3
set.seed(1234) data <- data.frame( yval = rnorm(120), gr = c(rep("A", 50), rep("B", 40), rep("C", 30)) ) p <- show_group_distribution(data, gvar = 2, dvar = 1, g_label = "norm", background_color = "grey" ) p p2 <- show_group_distribution(data, gvar = "gr", dvar = "yval", g_position = "bottom", order_by_fun = TRUE, alpha = 0.3 ) p2 # Set custom group names p3 <- show_group_distribution(data, gvar = 2, dvar = 1, g_label = c("A" = "X", "B" = "Y", "C" = "Z") ) p3
See group_enrichment for examples. NOTE the box fill and the box text have different meanings.
show_group_enrichment( df_enrich, return_list = FALSE, scales = "free", add_text_annotation = TRUE, fill_by_p_value = TRUE, use_fdr = TRUE, cut_p_value = FALSE, cut_breaks = c(-Inf, -5, log10(0.05), -log10(0.05), 5, Inf), cut_labels = c("↓ 1e-5", "↓ 0.05", "non-significant", "↑ 0.05", "↑ 1e-5"), fill_scale = scale_fill_gradient2(low = "#08A76B", mid = "white", high = "red", midpoint = ifelse(fill_by_p_value, 0, 1)), cluster_row = FALSE, cluster_col = FALSE, ... )
show_group_enrichment( df_enrich, return_list = FALSE, scales = "free", add_text_annotation = TRUE, fill_by_p_value = TRUE, use_fdr = TRUE, cut_p_value = FALSE, cut_breaks = c(-Inf, -5, log10(0.05), -log10(0.05), 5, Inf), cut_labels = c("↓ 1e-5", "↓ 0.05", "non-significant", "↑ 0.05", "↑ 1e-5"), fill_scale = scale_fill_gradient2(low = "#08A76B", mid = "white", high = "red", midpoint = ifelse(fill_by_p_value, 0, 1)), cluster_row = FALSE, cluster_col = FALSE, ... )
df_enrich |
result |
return_list |
if |
scales |
Should scales be fixed ( |
add_text_annotation |
if |
fill_by_p_value |
if |
use_fdr |
if |
cut_p_value |
if |
cut_breaks |
when |
cut_labels |
when |
fill_scale |
a |
cluster_row , cluster_col
|
if |
... |
other parameters passing to ggplot2::facet_wrap, only used
when |
a (list of) ggplot
object.
This feature is designed for signature analysis. However, users can also use it in other similar situations.
show_group_mapping( data, col_to_flow, cols_to_map, include_sig = FALSE, fill_na = FALSE, title = NULL, xlab = NULL, ylab = NULL, custom_theme = cowplot::theme_minimal_hgrid() )
show_group_mapping( data, col_to_flow, cols_to_map, include_sig = FALSE, fill_na = FALSE, title = NULL, xlab = NULL, ylab = NULL, custom_theme = cowplot::theme_minimal_hgrid() )
data |
a |
col_to_flow |
length-1 character showing the column to flow, typically a signature group. |
cols_to_map |
character vector showing colnames of other groups. |
include_sig |
default if |
fill_na |
length-1 string to fill NA, default is |
title |
the title. |
xlab |
label for x axis. |
ylab |
label for y axis. |
custom_theme |
theme for plotting, default is |
a ggplot
object
data <- dplyr::tibble( Group1 = rep(LETTERS[1:5], each = 10), Group2 = rep(LETTERS[6:15], each = 5), zzzz = c(rep("xx", 20), rep("yy", 20), rep(NA, 10)) ) p1 <- show_group_mapping(data, col_to_flow = "Group1", cols_to_map = colnames(data)[-1]) p1 p2 <- show_group_mapping(data, col_to_flow = "Group1", cols_to_map = colnames(data)[-1], include_sig = TRUE ) p2
data <- dplyr::tibble( Group1 = rep(LETTERS[1:5], each = 10), Group2 = rep(LETTERS[6:15], each = 5), zzzz = c(rep("xx", 20), rep("yy", 20), rep(NA, 10)) ) p1 <- show_group_mapping(data, col_to_flow = "Group1", cols_to_map = colnames(data)[-1]) p1 p2 <- show_group_mapping(data, col_to_flow = "Group1", cols_to_map = colnames(data)[-1], include_sig = TRUE ) p2
See example section in sig_fit()
for an examples.
show_groups(grp_dt, ...)
show_groups(grp_dt, ...)
grp_dt |
a result |
... |
parameters passing to |
nothing.
See details for description.
show_sig_bootstrap_exposure( bt_result, sample = NULL, signatures = NULL, methods = "QP", plot_fun = c("boxplot", "violin"), agg_fun = c("mean", "median", "min", "max"), highlight = "auto", highlight_size = 4, palette = "aaas", title = NULL, xlab = FALSE, ylab = "Signature exposure", width = 0.3, dodge_width = 0.8, outlier.shape = NA, add = "jitter", add.params = list(alpha = 0.3), ... ) show_sig_bootstrap_error( bt_result, sample = NULL, methods = "QP", plot_fun = c("boxplot", "violin"), agg_fun = c("mean", "median"), highlight = "auto", highlight_size = 4, palette = "aaas", title = NULL, xlab = FALSE, ylab = "Reconstruction error (L2 norm)", width = 0.3, dodge_width = 0.8, outlier.shape = NA, add = "jitter", add.params = list(alpha = 0.3), legend = "none", ... ) show_sig_bootstrap_stability( bt_result, signatures = NULL, measure = c("RMSE", "CV", "MAE", "AbsDiff"), methods = "QP", plot_fun = c("boxplot", "violin"), palette = "aaas", title = NULL, xlab = FALSE, ylab = "Signature instability", width = 0.3, outlier.shape = NA, add = "jitter", add.params = list(alpha = 0.3), ... )
show_sig_bootstrap_exposure( bt_result, sample = NULL, signatures = NULL, methods = "QP", plot_fun = c("boxplot", "violin"), agg_fun = c("mean", "median", "min", "max"), highlight = "auto", highlight_size = 4, palette = "aaas", title = NULL, xlab = FALSE, ylab = "Signature exposure", width = 0.3, dodge_width = 0.8, outlier.shape = NA, add = "jitter", add.params = list(alpha = 0.3), ... ) show_sig_bootstrap_error( bt_result, sample = NULL, methods = "QP", plot_fun = c("boxplot", "violin"), agg_fun = c("mean", "median"), highlight = "auto", highlight_size = 4, palette = "aaas", title = NULL, xlab = FALSE, ylab = "Reconstruction error (L2 norm)", width = 0.3, dodge_width = 0.8, outlier.shape = NA, add = "jitter", add.params = list(alpha = 0.3), legend = "none", ... ) show_sig_bootstrap_stability( bt_result, signatures = NULL, measure = c("RMSE", "CV", "MAE", "AbsDiff"), methods = "QP", plot_fun = c("boxplot", "violin"), palette = "aaas", title = NULL, xlab = FALSE, ylab = "Signature instability", width = 0.3, outlier.shape = NA, add = "jitter", add.params = list(alpha = 0.3), ... )
bt_result |
result object from sig_fit_bootstrap_batch. |
sample |
a sample id. |
signatures |
signatures to show. |
methods |
a subset of |
plot_fun |
set the plot function. |
agg_fun |
set the aggregation function when |
highlight |
set the color for optimal solution. Default is "auto", which use the same color as bootstrap results, you can set it to color like "red", "gold", etc. |
highlight_size |
size for highlighting triangle, default is |
palette |
the color palette to be used for coloring or filling by groups. Allowed values include "grey" for grey color palettes; brewer palettes e.g. "RdBu", "Blues", ...; or custom color palette e.g. c("blue", "red"); and scientific journal palettes from ggsci R package, e.g.: "npg", "aaas", "lancet", "jco", "ucscgb", "uchicago", "simpsons" and "rickandmorty". |
title |
plot main title. |
xlab |
character vector specifying x axis labels. Use xlab = FALSE to hide xlab. |
ylab |
character vector specifying y axis labels. Use ylab = FALSE to hide ylab. |
width |
numeric value between 0 and 1 specifying box width. |
dodge_width |
dodge width. |
outlier.shape |
point shape of outlier. Default is 19. To hide outlier,
specify |
add |
character vector for adding another plot element (e.g.: dot plot or error bars). Allowed values are one or the combination of: "none", "dotplot", "jitter", "boxplot", "point", "mean", "mean_se", "mean_sd", "mean_ci", "mean_range", "median", "median_iqr", "median_hilow", "median_q1q3", "median_mad", "median_range"; see ?desc_statby for more details. |
add.params |
parameters (color, shape, size, fill, linetype) for the argument 'add'; e.g.: add.params = list(color = "red"). |
... |
other parameters passing to ggpubr::ggboxplot or ggpubr::ggviolin. |
legend |
character specifying legend position. Allowed values are one of c("top", "bottom", "left", "right", "none"). To remove the legend use legend = "none". Legend position can be also specified using a numeric vector c(x, y); see details section. |
measure |
measure to estimate the exposure instability, can be one of 'RMSE', 'CV', 'MAE' and 'AbsDiff'. |
Functions:
show_sig_bootstrap_exposure - this function plots exposures from bootstrap samples with both dotted boxplot. The optimal exposure (the exposure from original input) is shown as triangle point. Only one sample can be plotted.
show_sig_bootstrap_error - this function plots decomposition errors from bootstrap samples with both dotted boxplot. The error from optimal solution (the decomposition error from original input) is shown as triangle point. Only one sample can be plotted.
show_sig_bootstrap_stability - this function plots the signature exposure instability for specified signatures. Currently, the instability measure supports 3 types:
'RMSE' for Mean Root Squared Error (default) of bootstrap exposures and original exposures for each sample.
'CV' for Coefficient of Variation (CV) based on RMSE (i.e. RMSE / btExposure_mean
).
'MAE' for Mean Absolute Error of bootstrap exposures and original exposures for each sample.
'AbsDiff' for Absolute Difference between mean bootstram exposure and original exposure.
a ggplot
object
Huang X, Wojtowicz D, Przytycka TM. Detecting presence of mutational signatures in cancer with confidence. Bioinformatics. 2018;34(2):330–337. doi:10.1093/bioinformatics/btx604
sig_fit_bootstrap_batch, sig_fit, sig_fit_bootstrap
if (require("BSgenome.Hsapiens.UCSC.hg19")) { laml.maf <- system.file("extdata", "tcga_laml.maf.gz", package = "maftools") laml <- read_maf(maf = laml.maf) mt_tally <- sig_tally( laml, ref_genome = "BSgenome.Hsapiens.UCSC.hg19", use_syn = TRUE ) library(NMF) mt_sig <- sig_extract(mt_tally$nmf_matrix, n_sig = 3, nrun = 2, cores = 1 ) mat <- t(mt_tally$nmf_matrix) mat <- mat[, colSums(mat) > 0] bt_result <- sig_fit_bootstrap_batch(mat, sig = mt_sig, n = 10) ## Parallel computation ## bt_result = sig_fit_bootstrap_batch(mat, sig = mt_sig, n = 10, use_parallel = TRUE) ## At default, mean bootstrap exposure for each sample has been calculated p <- show_sig_bootstrap_exposure(bt_result, methods = c("QP")) ## Show bootstrap exposure (optimal exposure is shown as triangle) p1 <- show_sig_bootstrap_exposure(bt_result, methods = c("QP"), sample = "TCGA-AB-2802") p1 p2 <- show_sig_bootstrap_exposure(bt_result, methods = c("QP"), sample = "TCGA-AB-3012", signatures = c("Sig1", "Sig2") ) p2 ## Show bootstrap error ## Similar to exposure above p <- show_sig_bootstrap_error(bt_result, methods = c("QP")) p p3 <- show_sig_bootstrap_error(bt_result, methods = c("QP"), sample = "TCGA-AB-2802") p3 ## Show exposure (in)stability p4 <- show_sig_bootstrap_stability(bt_result, methods = c("QP")) p4 p5 <- show_sig_bootstrap_stability(bt_result, methods = c("QP"), measure = "MAE") p5 p6 <- show_sig_bootstrap_stability(bt_result, methods = c("QP"), measure = "AbsDiff") p6 p7 <- show_sig_bootstrap_stability(bt_result, methods = c("QP"), measure = "CV") p7 } else { message("Please install package 'BSgenome.Hsapiens.UCSC.hg19' firstly!") }
if (require("BSgenome.Hsapiens.UCSC.hg19")) { laml.maf <- system.file("extdata", "tcga_laml.maf.gz", package = "maftools") laml <- read_maf(maf = laml.maf) mt_tally <- sig_tally( laml, ref_genome = "BSgenome.Hsapiens.UCSC.hg19", use_syn = TRUE ) library(NMF) mt_sig <- sig_extract(mt_tally$nmf_matrix, n_sig = 3, nrun = 2, cores = 1 ) mat <- t(mt_tally$nmf_matrix) mat <- mat[, colSums(mat) > 0] bt_result <- sig_fit_bootstrap_batch(mat, sig = mt_sig, n = 10) ## Parallel computation ## bt_result = sig_fit_bootstrap_batch(mat, sig = mt_sig, n = 10, use_parallel = TRUE) ## At default, mean bootstrap exposure for each sample has been calculated p <- show_sig_bootstrap_exposure(bt_result, methods = c("QP")) ## Show bootstrap exposure (optimal exposure is shown as triangle) p1 <- show_sig_bootstrap_exposure(bt_result, methods = c("QP"), sample = "TCGA-AB-2802") p1 p2 <- show_sig_bootstrap_exposure(bt_result, methods = c("QP"), sample = "TCGA-AB-3012", signatures = c("Sig1", "Sig2") ) p2 ## Show bootstrap error ## Similar to exposure above p <- show_sig_bootstrap_error(bt_result, methods = c("QP")) p p3 <- show_sig_bootstrap_error(bt_result, methods = c("QP"), sample = "TCGA-AB-2802") p3 ## Show exposure (in)stability p4 <- show_sig_bootstrap_stability(bt_result, methods = c("QP")) p4 p5 <- show_sig_bootstrap_stability(bt_result, methods = c("QP"), measure = "MAE") p5 p6 <- show_sig_bootstrap_stability(bt_result, methods = c("QP"), measure = "AbsDiff") p6 p7 <- show_sig_bootstrap_stability(bt_result, methods = c("QP"), measure = "CV") p7 } else { message("Please install package 'BSgenome.Hsapiens.UCSC.hg19' firstly!") }
This function is a wrapper of NMF::consensusmap()
.
show_sig_consensusmap( sig, main = "Consensus matrix", tracks = c("consensus:", "silhouette:"), lab_row = NA, lab_col = NA, ... )
show_sig_consensusmap( sig, main = "Consensus matrix", tracks = c("consensus:", "silhouette:"), lab_row = NA, lab_col = NA, ... )
sig |
a |
main |
Main title as a character string or a grob. |
tracks |
Special additional annotation tracks to highlight associations between basis components and sample clusters:
|
lab_row |
labels for the rows. |
lab_col |
labels for the columns. |
... |
other parameters passing to |
nothing
Currently support copy number signatures and mutational signatures.
show_sig_exposure( Signature, sig_names = NULL, groups = NULL, grp_order = NULL, grp_size = NULL, samps = NULL, cutoff = NULL, style = c("default", "cosmic"), palette = use_color_style(style), base_size = 12, font_scale = 1, rm_space = FALSE, rm_grid_line = TRUE, rm_panel_border = FALSE, hide_samps = TRUE, legend_position = "top" )
show_sig_exposure( Signature, sig_names = NULL, groups = NULL, grp_order = NULL, grp_size = NULL, samps = NULL, cutoff = NULL, style = c("default", "cosmic"), palette = use_color_style(style), base_size = 12, font_scale = 1, rm_space = FALSE, rm_grid_line = TRUE, rm_panel_border = FALSE, hide_samps = TRUE, legend_position = "top" )
Signature |
a |
sig_names |
set name of signatures, can be a character vector. |
groups |
sample groups, default is |
grp_order |
order of groups, default is |
grp_size |
font size of groups. |
samps |
sample vector to filter samples or sort samples, default is |
cutoff |
a cutoff value to remove hyper-mutated samples. |
style |
plot style, one of 'default' and 'cosmic', works when
parameter |
palette |
palette used to plot, default use a built-in palette
according to parameter |
base_size |
overall font size. |
font_scale |
a number used to set font scale. |
rm_space |
default is |
rm_grid_line |
default is |
rm_panel_border |
default is |
hide_samps |
if |
legend_position |
position of legend, default is 'top'. |
a ggplot
object
Shixiang Wang
# Load mutational signature load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) # Show signature exposure p1 <- show_sig_exposure(sig2) p1 # Load copy number signature load(system.file("extdata", "toy_copynumber_signature_by_W.RData", package = "sigminer", mustWork = TRUE )) # Show signature exposure p2 <- show_sig_exposure(sig) p2
# Load mutational signature load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) # Show signature exposure p1 <- show_sig_exposure(sig2) p1 # Load copy number signature load(system.file("extdata", "toy_copynumber_signature_by_W.RData", package = "sigminer", mustWork = TRUE )) # Show signature exposure p2 <- show_sig_exposure(sig) p2
This function is for association visualization. Of note,
the parameters p_val
and drop
will affect the visualization
of association results under p value threshold.
show_sig_feature_corrplot( tidy_cor, feature_list, sort_features = FALSE, sig_orders = NULL, drop = TRUE, return_plotlist = FALSE, p_val = 0.05, xlab = "Signatures", ylab = "Features", co_gradient_colors = scale_color_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0), ca_gradient_colors = co_gradient_colors, plot_ratio = "auto", breaks_count = NULL )
show_sig_feature_corrplot( tidy_cor, feature_list, sort_features = FALSE, sig_orders = NULL, drop = TRUE, return_plotlist = FALSE, p_val = 0.05, xlab = "Signatures", ylab = "Features", co_gradient_colors = scale_color_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0), ca_gradient_colors = co_gradient_colors, plot_ratio = "auto", breaks_count = NULL )
tidy_cor |
data returned by get_tidy_association. |
feature_list |
a character vector contains features want to be plotted. If missing, all features will be used. |
sort_features |
default is |
sig_orders |
signature levels for ordering. |
drop |
if |
return_plotlist |
if |
p_val |
p value threshold. If p value larger than this threshold, the result becomes blank white. |
xlab |
label for x axis. |
ylab |
label for y axis. |
co_gradient_colors |
a Scale object representing gradient colors used to plot for continuous features. |
ca_gradient_colors |
a Scale object representing gradient colors used to plot for categorical features. |
plot_ratio |
a length-2 numeric vector to set the height/width ratio. |
breaks_count |
breaks for sample count. If set it to |
a ggplot2
object
get_tidy_association and get_sig_feature_association
# The data is generated from Wang, Shixiang et al. load(system.file("extdata", "asso_data.RData", package = "sigminer", mustWork = TRUE )) p <- show_sig_feature_corrplot( tidy_data.seqz.feature, p_val = 0.05, breaks_count = c(0L,200L, 400L, 600L, 800L, 1020L)) p
# The data is generated from Wang, Shixiang et al. load(system.file("extdata", "asso_data.RData", package = "sigminer", mustWork = TRUE )) p <- show_sig_feature_corrplot( tidy_data.seqz.feature, p_val = 0.05, breaks_count = c(0L,200L, 400L, 600L, 800L, 1020L)) p
See sig_fit for examples.
show_sig_fit( fit_result, samples = NULL, signatures = NULL, plot_fun = c("boxplot", "violin", "scatter"), palette = "aaas", title = NULL, xlab = FALSE, ylab = "Signature exposure", legend = "none", width = 0.3, outlier.shape = NA, add = "jitter", add.params = list(alpha = 0.3), ... )
show_sig_fit( fit_result, samples = NULL, signatures = NULL, plot_fun = c("boxplot", "violin", "scatter"), palette = "aaas", title = NULL, xlab = FALSE, ylab = "Signature exposure", legend = "none", width = 0.3, outlier.shape = NA, add = "jitter", add.params = list(alpha = 0.3), ... )
fit_result |
result object from sig_fit. |
samples |
samples to show, if |
signatures |
signatures to show. |
plot_fun |
set the plot function. |
palette |
the color palette to be used for coloring or filling by groups. Allowed values include "grey" for grey color palettes; brewer palettes e.g. "RdBu", "Blues", ...; or custom color palette e.g. c("blue", "red"); and scientific journal palettes from ggsci R package, e.g.: "npg", "aaas", "lancet", "jco", "ucscgb", "uchicago", "simpsons" and "rickandmorty". |
title |
plot main title. |
xlab |
character vector specifying x axis labels. Use xlab = FALSE to hide xlab. |
ylab |
character vector specifying y axis labels. Use ylab = FALSE to hide ylab. |
legend |
character specifying legend position. Allowed values are one of c("top", "bottom", "left", "right", "none"). To remove the legend use legend = "none". Legend position can be also specified using a numeric vector c(x, y); see details section. |
width |
numeric value between 0 and 1 specifying box width. |
outlier.shape |
point shape of outlier. Default is 19. To hide outlier,
specify |
add |
character vector for adding another plot element (e.g.: dot plot or error bars). Allowed values are one or the combination of: "none", "dotplot", "jitter", "boxplot", "point", "mean", "mean_se", "mean_sd", "mean_ci", "mean_range", "median", "median_iqr", "median_hilow", "median_q1q3", "median_mad", "median_range"; see ?desc_statby for more details. |
add.params |
parameters (color, shape, size, fill, linetype) for the argument 'add'; e.g.: add.params = list(color = "red"). |
... |
other arguments to be passed to
|
a ggplot
object.
sig_fit, show_sig_bootstrap_exposure, sig_fit_bootstrap, sig_fit_bootstrap_batch
Who don't like to show a barplot for signature profile? This is for it.
show_sig_profile( Signature, mode = c("SBS", "copynumber", "DBS", "ID", "RS"), method = "Wang", by_context = FALSE, normalize = c("row", "column", "raw", "feature"), y_tr = NULL, filters = NULL, feature_setting = sigminer::CN.features, style = c("default", "cosmic"), palette = use_color_style(style, ifelse(by_context, "SBS", mode), method), set_gradient_color = FALSE, free_space = "free_x", rm_panel_border = style == "cosmic", rm_grid_line = style == "cosmic", rm_axis_text = FALSE, bar_border_color = ifelse(style == "default", "grey50", "white"), bar_width = 0.7, paint_axis_text = TRUE, x_label_angle = ifelse(mode == "copynumber" & !(startsWith(method, "T") | method == "X"), 60, 90), x_label_vjust = ifelse(mode == "copynumber" & !(startsWith(method, "T") | method == "X"), 1, 0.5), x_label_hjust = 1, x_lab = "Components", y_lab = "auto", y_limits = NULL, params = NULL, show_cv = FALSE, params_label_size = 3, params_label_angle = 60, y_expand = 1, digits = 2, base_size = 12, font_scale = 1, sig_names = NULL, sig_orders = NULL, check_sig_names = TRUE )
show_sig_profile( Signature, mode = c("SBS", "copynumber", "DBS", "ID", "RS"), method = "Wang", by_context = FALSE, normalize = c("row", "column", "raw", "feature"), y_tr = NULL, filters = NULL, feature_setting = sigminer::CN.features, style = c("default", "cosmic"), palette = use_color_style(style, ifelse(by_context, "SBS", mode), method), set_gradient_color = FALSE, free_space = "free_x", rm_panel_border = style == "cosmic", rm_grid_line = style == "cosmic", rm_axis_text = FALSE, bar_border_color = ifelse(style == "default", "grey50", "white"), bar_width = 0.7, paint_axis_text = TRUE, x_label_angle = ifelse(mode == "copynumber" & !(startsWith(method, "T") | method == "X"), 60, 90), x_label_vjust = ifelse(mode == "copynumber" & !(startsWith(method, "T") | method == "X"), 1, 0.5), x_label_hjust = 1, x_lab = "Components", y_lab = "auto", y_limits = NULL, params = NULL, show_cv = FALSE, params_label_size = 3, params_label_angle = 60, y_expand = 1, digits = 2, base_size = 12, font_scale = 1, sig_names = NULL, sig_orders = NULL, check_sig_names = TRUE )
Signature |
a |
mode |
signature type for plotting, now supports 'copynumber', 'SBS', 'DBS', 'ID' and 'RS' (genome rearrangement signature). |
method |
method for copy number feature classification in sig_tally, can be one of "Wang" ("W"), "S". |
by_context |
for specific use. |
normalize |
one of 'row', 'column', 'raw' and "feature", for row normalization (signature), column normalization (component), raw data, row normalization by feature, respectively. Of note, 'feature' only works when the mode is 'copynumber'. |
y_tr |
a function (e.g. |
filters |
a pattern used to select components to plot. |
feature_setting |
a |
style |
plot style, one of 'default' and 'cosmic', works when
parameter |
palette |
palette used to plot when |
set_gradient_color |
default is |
free_space |
default is 'free_x'. If "fixed", all panels have the same size. If "free_y" their height will be proportional to the length of the y scale; if "free_x" their width will be proportional to the length of the x scale; or if "free" both height and width will vary. This setting has no effect unless the appropriate scales also vary. |
rm_panel_border |
default is |
rm_grid_line |
default is |
rm_axis_text |
default is |
bar_border_color |
the color of bar border. |
bar_width |
bar width. By default, set to 70% of the resolution of the data. |
paint_axis_text |
if |
x_label_angle |
font angle for x label. |
x_label_vjust |
font vjust for x label. |
x_label_hjust |
font hjust for x label. |
x_lab |
x axis lab. |
y_lab |
y axis lab. |
y_limits |
limits to expand in y axis. e.g., |
params |
params |
show_cv |
default is |
params_label_size |
font size for params label. |
params_label_angle |
font angle for params label. |
y_expand |
y expand height for plotting params of copy number signatures. |
digits |
digits for plotting params of copy number signatures. |
base_size |
overall font size. |
font_scale |
a number used to set font scale. |
sig_names |
subset signatures or set name of signatures, can be a character vector.
Default is |
sig_orders |
set order of signatures, can be a character vector.
Default is |
check_sig_names |
if |
a ggplot
object
Shixiang Wang
show_sig_profile_loop, show_sig_profile_heatmap
# Load SBS signature load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) # Show signature profile p1 <- show_sig_profile(sig2, mode = "SBS") p1 # Use 'y_tr' option to transform values in y axis p11 <- show_sig_profile(sig2, mode = "SBS", y_tr = function(x) x * 100) p11 # Load copy number signature from method "W" load(system.file("extdata", "toy_copynumber_signature_by_W.RData", package = "sigminer", mustWork = TRUE )) # Show signature profile p2 <- show_sig_profile(sig, style = "cosmic", mode = "copynumber", method = "W", normalize = "feature" ) p2 # Visualize rearrangement signatures s <- get_sig_db("RS_Nik_lab") ss <- s$db[, 1:3] colnames(ss) <- c("Sig1", "Sig2", "Sig3") p3 <- show_sig_profile(ss, mode = "RS", style = "cosmic") p3
# Load SBS signature load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) # Show signature profile p1 <- show_sig_profile(sig2, mode = "SBS") p1 # Use 'y_tr' option to transform values in y axis p11 <- show_sig_profile(sig2, mode = "SBS", y_tr = function(x) x * 100) p11 # Load copy number signature from method "W" load(system.file("extdata", "toy_copynumber_signature_by_W.RData", package = "sigminer", mustWork = TRUE )) # Show signature profile p2 <- show_sig_profile(sig, style = "cosmic", mode = "copynumber", method = "W", normalize = "feature" ) p2 # Visualize rearrangement signatures s <- get_sig_db("RS_Nik_lab") ss <- s$db[, 1:3] colnames(ss) <- c("Sig1", "Sig2", "Sig3") p3 <- show_sig_profile(ss, mode = "RS", style = "cosmic") p3
This is a complementary function to show_sig_profile()
, it is used for visualizing
some big signatures, i.e. SBS-1536, not all signatures are supported. See details for
current supported signatures.
show_sig_profile_heatmap( Signature, mode = c("SBS", "DBS"), normalize = c("row", "column", "raw"), filters = NULL, x_lab = NULL, y_lab = NULL, legend_name = "auto", palette = "red", x_label_angle = 90, x_label_vjust = 1, x_label_hjust = 0.5, y_label_angle = 0, y_label_vjust = 0.5, y_label_hjust = 1, flip_xy = FALSE, sig_names = NULL, sig_orders = NULL, check_sig_names = TRUE )
show_sig_profile_heatmap( Signature, mode = c("SBS", "DBS"), normalize = c("row", "column", "raw"), filters = NULL, x_lab = NULL, y_lab = NULL, legend_name = "auto", palette = "red", x_label_angle = 90, x_label_vjust = 1, x_label_hjust = 0.5, y_label_angle = 0, y_label_vjust = 0.5, y_label_hjust = 1, flip_xy = FALSE, sig_names = NULL, sig_orders = NULL, check_sig_names = TRUE )
Signature |
a |
mode |
one of "SBS" and "DBS". |
normalize |
one of 'row', 'column', 'raw' and "feature", for row normalization (signature), column normalization (component), raw data, row normalization by feature, respectively. Of note, 'feature' only works when the mode is 'copynumber'. |
filters |
a pattern used to select components to plot. |
x_lab |
x label. |
y_lab |
y label. |
legend_name |
name of figure legend. |
palette |
color for value. |
x_label_angle |
angle for x axis text. |
x_label_vjust |
vjust for x axis text. |
x_label_hjust |
hjust for x axis text. |
y_label_angle |
angle for y axis text. |
y_label_vjust |
vjust for y axis text. |
y_label_hjust |
hjust for y axis text. |
flip_xy |
if |
sig_names |
subset signatures or set name of signatures, can be a character vector.
Default is |
sig_orders |
set order of signatures, can be a character vector.
Default is |
check_sig_names |
if |
Support:
SBS-24
SBS-96
SBS-384
SBS-1536
SBS-6144
DBS-78
DBS-186
a ggplot
object.
# Load SBS signature load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) # Show signature profile p1 <- show_sig_profile_heatmap(sig2, mode = "SBS") p1
# Load SBS signature load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) # Show signature profile p1 <- show_sig_profile_heatmap(sig2, mode = "SBS") p1
Show Signature Profile with Loop Way
show_sig_profile_loop( Signature, sig_names = NULL, ncol = 1, nrow = NULL, x_lab = "Components", ... )
show_sig_profile_loop( Signature, sig_names = NULL, ncol = 1, nrow = NULL, x_lab = "Components", ... )
Signature |
a |
sig_names |
subset signatures or set name of signatures, can be a character vector.
Default is |
ncol |
(optional) Number of columns in the plot grid. |
nrow |
(optional) Number of rows in the plot grid. |
x_lab |
x axis lab. |
... |
other parameters but |
a ggplot
result from cowplot::plot_grid()
.
load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) # Show signature profile p1 <- show_sig_profile_loop(sig2, mode = "SBS") p1 p2 <- show_sig_profile_loop(sig2, mode = "SBS", style = "cosmic", sig_names = c("A", "B", "C")) p2
load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) # Show signature profile p1 <- show_sig_profile_loop(sig2, mode = "SBS") p1 p2 <- show_sig_profile_loop(sig2, mode = "SBS", style = "cosmic", sig_names = c("A", "B", "C")) p2
A bayesian variant of NMF algorithm to enable optimal inferences for the number of signatures through the automatic relevance determination technique. This functions delevers highly interpretable and sparse representations for both signature profiles and attributions at a balance between data fitting and model complexity (this method may introduce more signatures than expected, especially for copy number signatures (thus I don't recommend you to use this feature to extract copy number signatures)). See detail part and references for more.
sig_auto_extract( nmf_matrix = NULL, result_prefix = "BayesNMF", destdir = tempdir(), method = c("L1W.L2H", "L1KL", "L2KL"), strategy = c("stable", "optimal", "ms"), ref_sigs = NULL, K0 = 25, nrun = 10, niter = 2e+05, tol = 1e-07, cores = 1, optimize = FALSE, skip = FALSE, recover = FALSE )
sig_auto_extract( nmf_matrix = NULL, result_prefix = "BayesNMF", destdir = tempdir(), method = c("L1W.L2H", "L1KL", "L2KL"), strategy = c("stable", "optimal", "ms"), ref_sigs = NULL, K0 = 25, nrun = 10, niter = 2e+05, tol = 1e-07, cores = 1, optimize = FALSE, skip = FALSE, recover = FALSE )
nmf_matrix |
a |
result_prefix |
prefix for result data files. |
destdir |
path to save data runs, default is |
method |
default is "L1W.L2H", which uses an exponential prior for W and a half-normal prior for H (This method is used by PCAWG project, see reference #3). You can also use "L1KL" to set expoential priors for both W and H, and "L2KL" to set half-normal priors for both W and H. The latter two methods are originally implemented by SignatureAnalyzer software. |
strategy |
the selection strategy for returned data. Set 'stable' for getting optimal
result from the most frequent K. Set 'optimal' for getting optimal result from all Ks.
Set 'ms' for getting result with maximum mean cosine similarity with provided reference
signatures. See |
ref_sigs |
A Signature object or matrix or string for specifying
reference signatures, only used when |
K0 |
number of initial signatures. |
nrun |
number of independent simulations. |
niter |
the maximum number of iterations. |
tol |
tolerance for convergence. |
cores |
number of cpu cores to run NMF. |
optimize |
if |
skip |
if |
recover |
if |
There are three methods available in this function: "L1W.L2H", "L1KL" and "L2KL".
They use different priors for the bayesian variant of NMF algorithm
(see method
parameter) written by reference #1 and implemented in
SignatureAnalyzer software
(reference #2).
I copied source code for the three methods from Broad Institute and supplementary
files of reference #3, and wrote this higher function. It is more friendly for users
to extract, visualize and analyze signatures by combining with other powerful functions
in sigminer package. Besides, I implemented parallel computation to speed up
the calculation process and a similar input and output structure like sig_extract()
.
a list
with Signature
class.
Shixiang Wang
Tan, Vincent YF, and Cédric Févotte. "Automatic relevance determination in nonnegative matrix factorization with the/spl beta/-divergence." IEEE Transactions on Pattern Analysis and Machine Intelligence 35.7 (2012): 1592-1605.
Kim, Jaegil, et al. "Somatic ERCC2 mutations are associated with a distinct genomic signature in urothelial tumors." Nature genetics 48.6 (2016): 600.
Alexandrov, Ludmil, et al. "The repertoire of mutational signatures in human cancer." BioRxiv (2018): 322859.
sig_tally for getting variation matrix, sig_extract for extracting signatures using NMF package, sig_estimate for estimating signature number for sig_extract.
load(system.file("extdata", "toy_copynumber_tally_W.RData", package = "sigminer", mustWork = TRUE )) res <- sig_auto_extract(cn_tally_W$nmf_matrix, result_prefix = "Test_copynumber", nrun = 1) # At default, all run files are stored in tempdir() dir(tempdir(), pattern = "Test_copynumber") laml.maf <- system.file("extdata", "tcga_laml.maf.gz", package = "maftools") laml <- read_maf(maf = laml.maf) mt_tally <- sig_tally( laml, ref_genome = "BSgenome.Hsapiens.UCSC.hg19", use_syn = TRUE ) x <- sig_auto_extract(mt_tally$nmf_matrix, strategy = "ms", nrun = 3, ref_sigs = "legacy" ) x
load(system.file("extdata", "toy_copynumber_tally_W.RData", package = "sigminer", mustWork = TRUE )) res <- sig_auto_extract(cn_tally_W$nmf_matrix, result_prefix = "Test_copynumber", nrun = 1) # At default, all run files are stored in tempdir() dir(tempdir(), pattern = "Test_copynumber") laml.maf <- system.file("extdata", "tcga_laml.maf.gz", package = "maftools") laml <- read_maf(maf = laml.maf) mt_tally <- sig_tally( laml, ref_genome = "BSgenome.Hsapiens.UCSC.hg19", use_syn = TRUE ) x <- sig_auto_extract(mt_tally$nmf_matrix, strategy = "ms", nrun = 3, ref_sigs = "legacy" ) x
Converts signatures between two representations relative to different sets of mutational opportunities. Currently, only SBS signature is supported.
sig_convert(sig, from = "human-genome", to = "human-exome")
sig_convert(sig, from = "human-genome", to = "human-exome")
sig |
a |
from |
either one of "human-genome" and "human-exome" or an opportunity matrix
(repeated |
to |
same as |
The default opportunity matrix for "human-genome" and "human-exome" comes from COSMIC signature database v2 and v3.
a matrix
.
convert_signatures
function from sigfit package.
# Load SBS signature load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) # Exome-relative to Genome-relative sig_converted <- sig_convert(sig2, from = "human-exome", to = "human-genome" ) sig_converted show_sig_profile(sig2, style = "cosmic") show_sig_profile(sig_converted, style = "cosmic")
# Load SBS signature load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) # Exome-relative to Genome-relative sig_converted <- sig_convert(sig2, from = "human-exome", to = "human-genome" ) sig_converted show_sig_profile(sig2, style = "cosmic") show_sig_profile(sig_converted, style = "cosmic")
Use NMF package to evaluate the optimal number of signatures.
This is used along with sig_extract.
Users should library(NMF)
firstly. If NMF objects are returned,
the result can be further visualized by NMF plot methods like
NMF::consensusmap()
and NMF::basismap()
.
sig_estimate()
shows comprehensive rank survey generated by
NMF package, sometimes
it is hard to consider all measures. show_sig_number_survey()
provides a
one or two y-axis visualization method to help users determine
the optimal signature number (showing both
stability ("cophenetic") and error (RSS) at default).
Users can also set custom measures to show.
show_sig_number_survey2()
is modified from NMF package to
better help users to explore survey of signature number.
sig_estimate( nmf_matrix, range = 2:5, nrun = 10, use_random = FALSE, method = "brunet", seed = 123456, cores = 1, keep_nmfObj = FALSE, save_plots = FALSE, plot_basename = file.path(tempdir(), "nmf"), what = "all", verbose = FALSE ) show_sig_number_survey( object, x = "rank", left_y = "cophenetic", right_y = "rss", left_name = left_y, right_name = toupper(right_y), left_color = "black", right_color = "red", left_shape = 16, right_shape = 18, shape_size = 4, highlight = NULL ) show_sig_number_survey2( x, y = NULL, what = c("all", "cophenetic", "rss", "residuals", "dispersion", "evar", "sparseness", "sparseness.basis", "sparseness.coef", "silhouette", "silhouette.coef", "silhouette.basis", "silhouette.consensus"), na.rm = FALSE, xlab = "Total signatures", ylab = "", main = "Signature number survey using NMF package" )
sig_estimate( nmf_matrix, range = 2:5, nrun = 10, use_random = FALSE, method = "brunet", seed = 123456, cores = 1, keep_nmfObj = FALSE, save_plots = FALSE, plot_basename = file.path(tempdir(), "nmf"), what = "all", verbose = FALSE ) show_sig_number_survey( object, x = "rank", left_y = "cophenetic", right_y = "rss", left_name = left_y, right_name = toupper(right_y), left_color = "black", right_color = "red", left_shape = 16, right_shape = 18, shape_size = 4, highlight = NULL ) show_sig_number_survey2( x, y = NULL, what = c("all", "cophenetic", "rss", "residuals", "dispersion", "evar", "sparseness", "sparseness.basis", "sparseness.coef", "silhouette", "silhouette.coef", "silhouette.basis", "silhouette.consensus"), na.rm = FALSE, xlab = "Total signatures", ylab = "", main = "Signature number survey using NMF package" )
nmf_matrix |
a |
range |
a |
nrun |
a |
use_random |
Should generate random data from input to test measurements. Default is |
method |
specification of the NMF algorithm. Use 'brunet' as default. Available methods for NMF decompositions are 'brunet', 'lee', 'ls-nmf', 'nsNMF', 'offset'. |
seed |
specification of the starting point or seeding method, which will compute a starting point, usually using data from the target matrix in order to provide a good guess. |
cores |
number of cpu cores to run NMF. |
keep_nmfObj |
default is |
save_plots |
if |
plot_basename |
when save plots, set custom basename for file path. |
what |
a character vector whose elements partially match one of the following item,
which correspond to the measures computed by |
verbose |
if |
object |
a |
x |
a |
left_y |
column name for left y axis. |
right_y |
column name for right y axis. |
left_name |
label name for left y axis. |
right_name |
label name for right y axis. |
left_color |
color for left axis. |
right_color |
color for right axis. |
left_shape , right_shape , shape_size
|
shape setting. |
highlight |
a |
y |
for random simulation,
a |
na.rm |
single logical that specifies if the rank
for which the measures are NA values should be removed
from the graph or not (default to |
xlab |
x-axis label |
ylab |
y-axis label |
main |
main title |
The most common approach is to choose the smallest rank for which cophenetic correlation coefficient starts decreasing (Used by this function). Another approach is to choose the rank for which the plot of the residual sum of squares (RSS) between the input matrix and its estimate shows an inflection point. More custom features please directly use NMF::nmfEstimateRank.
sig_estimate: a list
contains information of NMF run and rank survey.
show_sig_number_survey: a ggplot
object
show_sig_number_survey2: a ggplot
object
Shixiang Wang
Gaujoux, Renaud, and Cathal Seoighe. "A flexible R package for nonnegative matrix factorization." BMC bioinformatics 11.1 (2010): 367.
sig_extract for extracting signatures using NMF package, sig_auto_extract for extracting signatures using automatic relevance determination technique.
sig_estimate for estimating signature number for sig_extract, show_sig_number_survey2 for more visualization method.
load(system.file("extdata", "toy_copynumber_tally_W.RData", package = "sigminer", mustWork = TRUE )) library(NMF) cn_estimate <- sig_estimate(cn_tally_W$nmf_matrix, cores = 1, nrun = 5, verbose = TRUE ) p <- show_sig_number_survey2(cn_estimate$survey) p # Show two measures show_sig_number_survey(cn_estimate) # Show one measure p1 <- show_sig_number_survey(cn_estimate, right_y = NULL) p1 p2 <- add_h_arrow(p, x = 4.1, y = 0.953, label = "selected number") p2 # Show data from a data.frame p3 <- show_sig_number_survey(cn_estimate$survey) p3 # Show other measures head(cn_estimate$survey) p4 <- show_sig_number_survey(cn_estimate$survey, right_y = "dispersion", right_name = "dispersion" ) p4 p5 <- show_sig_number_survey(cn_estimate$survey, right_y = "evar", right_name = "evar" ) p5
load(system.file("extdata", "toy_copynumber_tally_W.RData", package = "sigminer", mustWork = TRUE )) library(NMF) cn_estimate <- sig_estimate(cn_tally_W$nmf_matrix, cores = 1, nrun = 5, verbose = TRUE ) p <- show_sig_number_survey2(cn_estimate$survey) p # Show two measures show_sig_number_survey(cn_estimate) # Show one measure p1 <- show_sig_number_survey(cn_estimate, right_y = NULL) p1 p2 <- add_h_arrow(p, x = 4.1, y = 0.953, label = "selected number") p2 # Show data from a data.frame p3 <- show_sig_number_survey(cn_estimate$survey) p3 # Show other measures head(cn_estimate$survey) p4 <- show_sig_number_survey(cn_estimate$survey, right_y = "dispersion", right_name = "dispersion" ) p4 p5 <- show_sig_number_survey(cn_estimate$survey, right_y = "evar", right_name = "evar" ) p5
Do NMF de-composition and then extract signatures.
sig_extract( nmf_matrix, n_sig, nrun = 10, cores = 1, method = "brunet", optimize = FALSE, pynmf = FALSE, use_conda = TRUE, py_path = "/Users/wsx/anaconda3/bin/python", seed = 123456, ... )
sig_extract( nmf_matrix, n_sig, nrun = 10, cores = 1, method = "brunet", optimize = FALSE, pynmf = FALSE, use_conda = TRUE, py_path = "/Users/wsx/anaconda3/bin/python", seed = 123456, ... )
nmf_matrix |
a |
n_sig |
number of signature. Please run sig_estimate to select a suitable value. |
nrun |
a |
cores |
number of cpu cores to run NMF. |
method |
specification of the NMF algorithm. Use 'brunet' as default. Available methods for NMF decompositions are 'brunet', 'lee', 'ls-nmf', 'nsNMF', 'offset'. |
optimize |
if |
pynmf |
if |
use_conda |
if |
py_path |
path to Python executable file, e.g. '/Users/wsx/anaconda3/bin/python'. In my
test, it is more stable than |
seed |
specification of the starting point or seeding method, which will compute a starting point, usually using data from the target matrix in order to provide a good guess. |
... |
other arguments passed to |
a list
with Signature
class.
Shixiang Wang
Gaujoux, Renaud, and Cathal Seoighe. "A flexible R package for nonnegative matrix factorization." BMC bioinformatics 11.1 (2010): 367.
Mayakonda, Anand, et al. "Maftools: efficient and comprehensive analysis of somatic variants in cancer." Genome research 28.11 (2018): 1747-1756.
sig_tally for getting variation matrix, sig_estimate for estimating signature number for sig_extract, sig_auto_extract for extracting signatures using automatic relevance determination technique.
load(system.file("extdata", "toy_copynumber_tally_W.RData", package = "sigminer", mustWork = TRUE )) # Extract copy number signatures res <- sig_extract(cn_tally_W$nmf_matrix, 2, nrun = 1)
load(system.file("extdata", "toy_copynumber_tally_W.RData", package = "sigminer", mustWork = TRUE )) # Extract copy number signatures res <- sig_extract(cn_tally_W$nmf_matrix, 2, nrun = 1)
The function performs a signatures decomposition of a given mutational
catalogue V
with known signatures W
by solving the minimization problem
min(||W*H - V||)
where W and V are known.
sig_fit( catalogue_matrix, sig, sig_index = NULL, sig_db = c("legacy", "SBS", "DBS", "ID", "TSB", "SBS_Nik_lab", "RS_Nik_lab", "RS_BRCA560", "RS_USARC", "CNS_USARC", "CNS_TCGA", "CNS_TCGA176", "CNS_PCAWG176", "SBS_hg19", "SBS_hg38", "SBS_mm9", "SBS_mm10", "DBS_hg19", "DBS_hg38", "DBS_mm9", "DBS_mm10", "SBS_Nik_lab_Organ", "RS_Nik_lab_Organ", "latest_SBS_GRCh37", "latest_DBS_GRCh37", "latest_ID_GRCh37", "latest_SBS_GRCh38", "latest_DBS_GRCh38", "latest_SBS_mm9", "latest_DBS_mm9", "latest_SBS_mm10", "latest_DBS_mm10", "latest_SBS_rn6", "latest_DBS_rn6", "latest_CN_GRCh37", "latest_RNA-SBS_GRCh37", "latest_SV_GRCh38"), db_type = c("", "human-exome", "human-genome"), show_index = TRUE, method = c("QP", "NNLS", "SA"), auto_reduce = FALSE, type = c("absolute", "relative"), return_class = c("matrix", "data.table"), return_error = FALSE, rel_threshold = 0, mode = c("SBS", "DBS", "ID", "copynumber"), true_catalog = NULL, ... )
sig_fit( catalogue_matrix, sig, sig_index = NULL, sig_db = c("legacy", "SBS", "DBS", "ID", "TSB", "SBS_Nik_lab", "RS_Nik_lab", "RS_BRCA560", "RS_USARC", "CNS_USARC", "CNS_TCGA", "CNS_TCGA176", "CNS_PCAWG176", "SBS_hg19", "SBS_hg38", "SBS_mm9", "SBS_mm10", "DBS_hg19", "DBS_hg38", "DBS_mm9", "DBS_mm10", "SBS_Nik_lab_Organ", "RS_Nik_lab_Organ", "latest_SBS_GRCh37", "latest_DBS_GRCh37", "latest_ID_GRCh37", "latest_SBS_GRCh38", "latest_DBS_GRCh38", "latest_SBS_mm9", "latest_DBS_mm9", "latest_SBS_mm10", "latest_DBS_mm10", "latest_SBS_rn6", "latest_DBS_rn6", "latest_CN_GRCh37", "latest_RNA-SBS_GRCh37", "latest_SV_GRCh38"), db_type = c("", "human-exome", "human-genome"), show_index = TRUE, method = c("QP", "NNLS", "SA"), auto_reduce = FALSE, type = c("absolute", "relative"), return_class = c("matrix", "data.table"), return_error = FALSE, rel_threshold = 0, mode = c("SBS", "DBS", "ID", "copynumber"), true_catalog = NULL, ... )
catalogue_matrix |
a numeric matrix |
sig |
a |
sig_index |
a vector for signature index. "ALL" for all signatures. |
sig_db |
default 'legacy', it can be 'legacy' (for COSMIC v2 'SBS'),
'SBS', 'DBS', 'ID' and 'TSB' (for COSMIV v3.1 signatures)
for small scale mutations.
For more specific details, it can also be 'SBS_hg19', 'SBS_hg38',
'SBS_mm9', 'SBS_mm10', 'DBS_hg19', 'DBS_hg38', 'DBS_mm9', 'DBS_mm10' to use
COSMIC v3 reference signatures from Alexandrov, Ludmil B., et al. (2020) (reference #1).
In addition, it can be one of "SBS_Nik_lab_Organ", "RS_Nik_lab_Organ",
"SBS_Nik_lab", "RS_Nik_lab" to refer reference signatures from
Degasperi, Andrea, et al. (2020) (reference #2);
"RS_BRCA560", "RS_USARC" to reference signatures from BRCA560 and USARC cohorts;
"CNS_USARC" (40 categories), "CNS_TCGA" (48 categories) to reference copy number signatures from USARC cohort and TCGA;
"CNS_TCGA176" (176 categories) and "CNS_PCAWG176" (176 categories) to reference copy number signatures from PCAWG and TCGA separately.
UPDATE, the latest version of reference version can be automatically
downloaded and loaded from https://cancer.sanger.ac.uk/signatures/downloads/
when a option with |
db_type |
only used when |
show_index |
if |
method |
method to solve the minimazation problem. 'NNLS' for non-negative least square; 'QP' for quadratic programming; 'SA' for simulated annealing. |
auto_reduce |
if |
type |
'absolute' for signature exposure and 'relative' for signature relative exposure. |
return_class |
string, 'matrix' or 'data.table'. |
return_error |
if |
rel_threshold |
numeric vector, a signature with relative exposure
lower than (equal is included, i.e. |
mode |
signature type for plotting, now supports 'copynumber', 'SBS', 'DBS', 'ID' and 'RS' (genome rearrangement signature). |
true_catalog |
used by sig_fit_bootstrap, user never use it. |
... |
control parameters passing to argument |
The method 'NNLS' solves the minimization problem with nonnegative least-squares constraints. The method 'QP' and 'SA' are modified from SignatureEstimation package. See references for details. Of note, when fitting exposures for copy number signatures, only components of feature CN is used.
The exposure result either in matrix
or data.table
format.
If return_error
set TRUE
, a list
is returned.
Daniel Huebschmann, Zuguang Gu and Matthias Schlesner (2019). YAPSA: Yet Another Package for Signature Analysis. R package version 1.12.0.
Huang X, Wojtowicz D, Przytycka TM. Detecting presence of mutational signatures in cancer with confidence. Bioinformatics. 2018;34(2):330–337. doi:10.1093/bioinformatics/btx604
Kim, Jaegil, et al. "Somatic ERCC2 mutations are associated with a distinct genomic signature in urothelial tumors." Nature genetics 48.6 (2016): 600.
sig_extract, sig_auto_extract, sig_fit_bootstrap, sig_fit_bootstrap_batch
# For mutational signatures ---------------- # SBS is used for illustration, similar # operations can be applied to DBS, INDEL, CN, RS, etc. # Load simulated data data("simulated_catalogs") data = simulated_catalogs$set1 data[1:5, 1:5] # Fitting with all COSMIC v2 reference signatures sig_fit(data, sig_index = "ALL") # Check ?sig_fit for sig_db options # e.g., use the COSMIC SBS v3 sig_fit(data, sig_index = "ALL", sig_db = "SBS") # Fitting with specified signatures # opt 1. use selected reference signatures sig_fit(data, sig_index = c(1, 5, 9, 2, 13), sig_db = "SBS") # opt 2. use user specified signatures ref = get_sig_db()$db ref[1:5, 1:5] ref = ref[, 1:10] # The `sig` used here can be result object from `sig_extract` # or any reference matrix with similar structure (96-motif) v1 = sig_fit(data, sig = ref) v1 # If possible, auto-reduce the reference signatures # for better fitting data from a sample v2 = sig_fit(data, sig = ref, auto_reduce = TRUE) v2 all.equal(v1, v2) # Some samples reported signatures dropped # but its original activity values are 0s, # so the data remain same (0 -> 0) all.equal(v1[, 2], v2[, 2]) # For COSMIC_10, 6.67638 -> 0 v1[, 4]; v2[, 4] all.equal(v1[, 4], v2[, 4]) # For general purpose ----------------------- W <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2) colnames(W) <- c("sig1", "sig2") W <- apply(W, 2, function(x) x / sum(x)) H <- matrix(c(2, 5, 3, 6, 1, 9, 1, 2), ncol = 4) colnames(H) <- paste0("samp", 1:4) V <- W %*% H V if (requireNamespace("quadprog", quietly = TRUE)) { H_infer <- sig_fit(V, W, method = "QP") H_infer H H_dt <- sig_fit(V, W, method = "QP", auto_reduce = TRUE, return_class = "data.table") H_dt ## Show results show_sig_fit(H_infer) show_sig_fit(H_dt) ## Get clusters/groups H_dt_rel <- sig_fit(V, W, return_class = "data.table", type = "relative") z <- get_groups(H_dt_rel, method = "k-means") show_groups(z) } # if (requireNamespace("GenSA", quietly = TRUE)) { # H_infer <- sig_fit(V, W, method = "SA") # H_infer # H # # H_dt <- sig_fit(V, W, method = "SA", return_class = "data.table") # H_dt # # ## Modify arguments to method # sig_fit(V, W, method = "SA", maxit = 10, temperature = 100) # # ## Show results # show_sig_fit(H_infer) # show_sig_fit(H_dt) # }
# For mutational signatures ---------------- # SBS is used for illustration, similar # operations can be applied to DBS, INDEL, CN, RS, etc. # Load simulated data data("simulated_catalogs") data = simulated_catalogs$set1 data[1:5, 1:5] # Fitting with all COSMIC v2 reference signatures sig_fit(data, sig_index = "ALL") # Check ?sig_fit for sig_db options # e.g., use the COSMIC SBS v3 sig_fit(data, sig_index = "ALL", sig_db = "SBS") # Fitting with specified signatures # opt 1. use selected reference signatures sig_fit(data, sig_index = c(1, 5, 9, 2, 13), sig_db = "SBS") # opt 2. use user specified signatures ref = get_sig_db()$db ref[1:5, 1:5] ref = ref[, 1:10] # The `sig` used here can be result object from `sig_extract` # or any reference matrix with similar structure (96-motif) v1 = sig_fit(data, sig = ref) v1 # If possible, auto-reduce the reference signatures # for better fitting data from a sample v2 = sig_fit(data, sig = ref, auto_reduce = TRUE) v2 all.equal(v1, v2) # Some samples reported signatures dropped # but its original activity values are 0s, # so the data remain same (0 -> 0) all.equal(v1[, 2], v2[, 2]) # For COSMIC_10, 6.67638 -> 0 v1[, 4]; v2[, 4] all.equal(v1[, 4], v2[, 4]) # For general purpose ----------------------- W <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2) colnames(W) <- c("sig1", "sig2") W <- apply(W, 2, function(x) x / sum(x)) H <- matrix(c(2, 5, 3, 6, 1, 9, 1, 2), ncol = 4) colnames(H) <- paste0("samp", 1:4) V <- W %*% H V if (requireNamespace("quadprog", quietly = TRUE)) { H_infer <- sig_fit(V, W, method = "QP") H_infer H H_dt <- sig_fit(V, W, method = "QP", auto_reduce = TRUE, return_class = "data.table") H_dt ## Show results show_sig_fit(H_infer) show_sig_fit(H_dt) ## Get clusters/groups H_dt_rel <- sig_fit(V, W, return_class = "data.table", type = "relative") z <- get_groups(H_dt_rel, method = "k-means") show_groups(z) } # if (requireNamespace("GenSA", quietly = TRUE)) { # H_infer <- sig_fit(V, W, method = "SA") # H_infer # H # # H_dt <- sig_fit(V, W, method = "SA", return_class = "data.table") # H_dt # # ## Modify arguments to method # sig_fit(V, W, method = "SA", maxit = 10, temperature = 100) # # ## Show results # show_sig_fit(H_infer) # show_sig_fit(H_dt) # }
This can be used to obtain the confidence of signature exposures or search the suboptimal decomposition solution.
sig_fit_bootstrap( catalog, sig, n = 100L, sig_index = NULL, sig_db = "legacy", db_type = c("", "human-exome", "human-genome"), show_index = TRUE, method = c("QP", "NNLS", "SA"), auto_reduce = FALSE, SA_not_bootstrap = FALSE, type = c("absolute", "relative"), rel_threshold = 0, mode = c("SBS", "DBS", "ID", "copynumber"), find_suboptimal = FALSE, suboptimal_ref_error = NULL, suboptimal_factor = 1.05, ... )
sig_fit_bootstrap( catalog, sig, n = 100L, sig_index = NULL, sig_db = "legacy", db_type = c("", "human-exome", "human-genome"), show_index = TRUE, method = c("QP", "NNLS", "SA"), auto_reduce = FALSE, SA_not_bootstrap = FALSE, type = c("absolute", "relative"), rel_threshold = 0, mode = c("SBS", "DBS", "ID", "copynumber"), find_suboptimal = FALSE, suboptimal_ref_error = NULL, suboptimal_factor = 1.05, ... )
catalog |
a named numeric vector or a numeric matrix with dimension Nx1. N is the number of component, 1 is the sample. |
sig |
a |
n |
the number of bootstrap replicates. |
sig_index |
a vector for signature index. "ALL" for all signatures. |
sig_db |
default 'legacy', it can be 'legacy' (for COSMIC v2 'SBS'),
'SBS', 'DBS', 'ID' and 'TSB' (for COSMIV v3.1 signatures)
for small scale mutations.
For more specific details, it can also be 'SBS_hg19', 'SBS_hg38',
'SBS_mm9', 'SBS_mm10', 'DBS_hg19', 'DBS_hg38', 'DBS_mm9', 'DBS_mm10' to use
COSMIC v3 reference signatures from Alexandrov, Ludmil B., et al. (2020) (reference #1).
In addition, it can be one of "SBS_Nik_lab_Organ", "RS_Nik_lab_Organ",
"SBS_Nik_lab", "RS_Nik_lab" to refer reference signatures from
Degasperi, Andrea, et al. (2020) (reference #2);
"RS_BRCA560", "RS_USARC" to reference signatures from BRCA560 and USARC cohorts;
"CNS_USARC" (40 categories), "CNS_TCGA" (48 categories) to reference copy number signatures from USARC cohort and TCGA;
"CNS_TCGA176" (176 categories) and "CNS_PCAWG176" (176 categories) to reference copy number signatures from PCAWG and TCGA separately.
UPDATE, the latest version of reference version can be automatically
downloaded and loaded from https://cancer.sanger.ac.uk/signatures/downloads/
when a option with |
db_type |
only used when |
show_index |
if |
method |
method to solve the minimazation problem. 'NNLS' for non-negative least square; 'QP' for quadratic programming; 'SA' for simulated annealing. |
auto_reduce |
if |
SA_not_bootstrap |
if |
type |
'absolute' for signature exposure and 'relative' for signature relative exposure. |
rel_threshold |
numeric vector, a signature with relative exposure
lower than (equal is included, i.e. |
mode |
signature type for plotting, now supports 'copynumber', 'SBS', 'DBS', 'ID' and 'RS' (genome rearrangement signature). |
find_suboptimal |
logical, if |
suboptimal_ref_error |
baseline error used for finding suboptimal solution.
if it is |
suboptimal_factor |
suboptimal factor to get suboptimal error, default is |
... |
control parameters passing to argument |
a list
Huang X, Wojtowicz D, Przytycka TM. Detecting presence of mutational signatures in cancer with confidence. Bioinformatics. 2018;34(2):330–337. doi:10.1093/bioinformatics/btx604
report_bootstrap_p_value, sig_fit, sig_fit_bootstrap_batch
# This function is designed for processing # one sample, thus is not very useful in practice # please check `sig_fit_bootstrap_batch` # For general purpose ------------------- W <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2) colnames(W) <- c("sig1", "sig2") W <- apply(W, 2, function(x) x / sum(x)) H <- matrix(c(2, 5, 3, 6, 1, 9, 1, 2), ncol = 4) colnames(H) <- paste0("samp", 1:4) V <- W %*% H V if (requireNamespace("quadprog", quietly = TRUE)) { H_bootstrap <- sig_fit_bootstrap(V[, 1], W, n = 10, type = "absolute") ## Typically, you have to run many times to get close to the answer boxplot(t(H_bootstrap$expo)) H[, 1] ## Return P values ## In practice, run times >= 100 ## is recommended report_bootstrap_p_value(H_bootstrap) ## For multiple samples ## Input a list report_bootstrap_p_value(list(samp1 = H_bootstrap, samp2 = H_bootstrap)) # ## Find suboptimal decomposition # H_suboptimal <- sig_fit_bootstrap(V[, 1], W, # n = 10, # type = "absolute", # method = "SA", # find_suboptimal = TRUE # ) }
# This function is designed for processing # one sample, thus is not very useful in practice # please check `sig_fit_bootstrap_batch` # For general purpose ------------------- W <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2) colnames(W) <- c("sig1", "sig2") W <- apply(W, 2, function(x) x / sum(x)) H <- matrix(c(2, 5, 3, 6, 1, 9, 1, 2), ncol = 4) colnames(H) <- paste0("samp", 1:4) V <- W %*% H V if (requireNamespace("quadprog", quietly = TRUE)) { H_bootstrap <- sig_fit_bootstrap(V[, 1], W, n = 10, type = "absolute") ## Typically, you have to run many times to get close to the answer boxplot(t(H_bootstrap$expo)) H[, 1] ## Return P values ## In practice, run times >= 100 ## is recommended report_bootstrap_p_value(H_bootstrap) ## For multiple samples ## Input a list report_bootstrap_p_value(list(samp1 = H_bootstrap, samp2 = H_bootstrap)) # ## Find suboptimal decomposition # H_suboptimal <- sig_fit_bootstrap(V[, 1], W, # n = 10, # type = "absolute", # method = "SA", # find_suboptimal = TRUE # ) }
Read sig_fit_bootstrap for more option setting.
sig_fit_bootstrap_batch( catalogue_matrix, methods = c("QP"), n = 100L, min_count = 1L, p_val_thresholds = c(0.05), use_parallel = FALSE, seed = 123456L, job_id = NULL, result_dir = tempdir(), ... )
sig_fit_bootstrap_batch( catalogue_matrix, methods = c("QP"), n = 100L, min_count = 1L, p_val_thresholds = c(0.05), use_parallel = FALSE, seed = 123456L, job_id = NULL, result_dir = tempdir(), ... )
catalogue_matrix |
a numeric matrix |
methods |
a subset of |
n |
the number of bootstrap replicates. |
min_count |
minimal exposure in a sample, default is 1. Any patient has total exposure less than this value will be filtered out. |
p_val_thresholds |
a vector of relative exposure threshold for calculating p values. |
use_parallel |
if |
seed |
random seed to reproduce the result. |
job_id |
a job ID, default is |
result_dir |
see above, default is temp directory defined by R. |
... |
other common parameters passing to sig_fit_bootstrap, including
|
a list
of data.table
.
# For mutational signatures ---------------- # SBS is used for illustration, similar # operations can be applied to DBS, INDEL, CN, RS, etc. # Load simulated data data("simulated_catalogs") data = simulated_catalogs$set1 data[1:5, 1:5] # Fitting with COSMIC reference signatures # Generally set n = 100 rv = sig_fit_bootstrap_batch(data, sig_index = c(1, 5, 9, 2, 13), sig_db = "SBS", n = 10) rv # For general purpose -------------------- W <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2) colnames(W) <- c("sig1", "sig2") W <- apply(W, 2, function(x) x / sum(x)) H <- matrix(c(2, 5, 3, 6, 1, 9, 1, 2), ncol = 4) colnames(H) <- paste0("samp", 1:4) V <- W %*% H V if (requireNamespace("quadprog")) { z10 <- sig_fit_bootstrap_batch(V, sig = W, n = 10) z10 }
# For mutational signatures ---------------- # SBS is used for illustration, similar # operations can be applied to DBS, INDEL, CN, RS, etc. # Load simulated data data("simulated_catalogs") data = simulated_catalogs$set1 data[1:5, 1:5] # Fitting with COSMIC reference signatures # Generally set n = 100 rv = sig_fit_bootstrap_batch(data, sig_index = c(1, 5, 9, 2, 13), sig_db = "SBS", n = 10) rv # For general purpose -------------------- W <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2) colnames(W) <- c("sig1", "sig2") W <- apply(W, 2, function(x) x / sum(x)) H <- matrix(c(2, 5, 3, 6, 1, 9, 1, 2), ncol = 4) colnames(H) <- paste0("samp", 1:4) V <- W %*% H V if (requireNamespace("quadprog")) { z10 <- sig_fit_bootstrap_batch(V, sig = W, n = 10) z10 }
Obtain or Modify Signature Information
sig_names(sig) sig_modify_names(sig, new_names) sig_number(sig) sig_attrs(sig) sig_signature(sig, normalize = c("row", "column", "raw", "feature")) sig_exposure(sig, type = c("absolute", "relative"))
sig_names(sig) sig_modify_names(sig, new_names) sig_number(sig) sig_attrs(sig) sig_signature(sig, normalize = c("row", "column", "raw", "feature")) sig_exposure(sig, type = c("absolute", "relative"))
sig |
a |
new_names |
new signature names. |
normalize |
one of 'row', 'column', 'raw' and "feature", for row normalization (signature), column normalization (component), raw data, row normalization by feature, respectively. |
type |
one of 'absolute' and 'relative'. |
a Signature
object or data.
## Operate signature names load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) sig_names(sig2) cc <- sig_modify_names(sig2, new_names = c("Sig2", "Sig1", "Sig3")) sig_names(cc) # The older names are stored in tags. print(attr(cc, "tag")) ## Get signature number sig_number(sig2) ## Get signature attributes sig_number(sig2) ## Get signature matrix z <- sig_signature(sig2) z <- sig_signature(sig2, normalize = "raw") ## Get exposure matrix ## Of note, this is different from get_sig_exposure() ## it returns a matrix instead of data table. z <- sig_exposure(sig2) # it is same as sig$Exposure z <- sig_exposure(sig2, type = "relative") # it is same as sig2$Exposure.norm
## Operate signature names load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) sig_names(sig2) cc <- sig_modify_names(sig2, new_names = c("Sig2", "Sig1", "Sig3")) sig_names(cc) # The older names are stored in tags. print(attr(cc, "tag")) ## Get signature number sig_number(sig2) ## Get signature attributes sig_number(sig2) ## Get signature matrix z <- sig_signature(sig2) z <- sig_signature(sig2, normalize = "raw") ## Get exposure matrix ## Of note, this is different from get_sig_exposure() ## it returns a matrix instead of data table. z <- sig_exposure(sig2) # it is same as sig$Exposure z <- sig_exposure(sig2, type = "relative") # it is same as sig2$Exposure.norm
Tally a variation object like MAF, CopyNumber and return a matrix for NMF de-composition and more. This is a generic function, so it can be further extended to other mutation cases. Please read details about how to set sex for identifying copy number signatures. Please read https://osf.io/s93d5/ for the generation of SBS, DBS and ID (INDEL) components.
sig_tally(object, ...) ## S3 method for class 'CopyNumber' sig_tally( object, method = "Wang", ignore_chrs = NULL, indices = NULL, add_loh = FALSE, feature_setting = sigminer::CN.features, cores = 1, keep_only_matrix = FALSE, ... ) ## S3 method for class 'RS' sig_tally(object, keep_only_matrix = FALSE, ...) ## S3 method for class 'MAF' sig_tally( object, mode = c("SBS", "DBS", "ID", "ALL"), ref_genome = "BSgenome.Hsapiens.UCSC.hg19", genome_build = NULL, add_trans_bias = FALSE, ignore_chrs = NULL, use_syn = TRUE, keep_only_matrix = FALSE, ... )
sig_tally(object, ...) ## S3 method for class 'CopyNumber' sig_tally( object, method = "Wang", ignore_chrs = NULL, indices = NULL, add_loh = FALSE, feature_setting = sigminer::CN.features, cores = 1, keep_only_matrix = FALSE, ... ) ## S3 method for class 'RS' sig_tally(object, keep_only_matrix = FALSE, ...) ## S3 method for class 'MAF' sig_tally( object, mode = c("SBS", "DBS", "ID", "ALL"), ref_genome = "BSgenome.Hsapiens.UCSC.hg19", genome_build = NULL, add_trans_bias = FALSE, ignore_chrs = NULL, use_syn = TRUE, keep_only_matrix = FALSE, ... )
object |
a CopyNumber object or MAF object or SV object (from read_sv_as_rs). |
... |
custom setting for operating object. Detail see S3 method for
corresponding class (e.g. |
method |
method for feature classification, can be one of "Wang" ("W"), "S" (for method described in Steele et al. 2019), "X" (for method described in Tao et al. 2023). |
ignore_chrs |
Chromsomes to ignore from analysis. e.g. chrX and chrY. |
indices |
integer vector indicating segments to keep. |
add_loh |
flag to add LOH classifications. |
feature_setting |
a |
cores |
number of computer cores to run this task.
You can use |
keep_only_matrix |
if |
mode |
type of mutation matrix to extract, can be one of 'SBS', 'DBS' and 'ID'. |
ref_genome |
'BSgenome.Hsapiens.UCSC.hg19', 'BSgenome.Hsapiens.UCSC.hg38', 'BSgenome.Mmusculus.UCSC.mm10', 'BSgenome.Mmusculus.UCSC.mm9', etc. |
genome_build |
genome build 'hg19', 'hg38', 'mm9' or "mm10", if not set, guess it by |
add_trans_bias |
if |
use_syn |
Logical. If |
For identifying copy number signatures, we have to derive copy number features firstly. Due to the difference of copy number values in sex chromosomes between male and female, we have to do an extra step if we don't want to ignore them.
I create two options to control this, the default values are shown as the following, you can use the same way to set (per R session).
options(sigminer.sex = "female", sigminer.copynumber.max = NA_integer_)
If your cohort are all females, you can totally ignore this.
If your cohort are all males, set sigminer.sex
to 'male' and
sigminer.copynumber.max
to a proper value (the best is consistent
with read_copynumber).
If your cohort contains both males and females, set sigminer.sex
as a data.frame
with two columns "sample" and "sex". And
set sigminer.copynumber.max
to a proper value (the best is consistent
with read_copynumber).
a list
contains a matrix
used for NMF de-composition.
sig_tally(CopyNumber)
: Returns copy number features, components and component-by-sample matrix
sig_tally(RS)
: Returns genome rearrangement sample-by-component matrix
sig_tally(MAF)
: Returns SBS mutation sample-by-component matrix and APOBEC enrichment
Shixiang Wang
Wang, Shixiang, et al. "Copy number signature analyses in prostate cancer reveal distinct etiologies and clinical outcomes." medRxiv (2020).
Steele, Christopher D., et al. "Undifferentiated sarcomas develop through distinct evolutionary pathways." Cancer Cell 35.3 (2019): 441-456.
Mayakonda, Anand, et al. "Maftools: efficient and comprehensive analysis of somatic variants in cancer." Genome research 28.11 (2018): 1747-1756.
Roberts SA, Lawrence MS, Klimczak LJ, et al. An APOBEC Cytidine Deaminase Mutagenesis Pattern is Widespread in Human Cancers. Nature genetics. 2013;45(9):970-976. doi:10.1038/ng.2702.
Bergstrom EN, Huang MN, Mahto U, Barnes M, Stratton MR, Rozen SG, Alexandrov LB: SigProfilerMatrixGenerator: a tool for visualizing and exploring patterns of small mutational events. BMC Genomics 2019, 20:685 https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6041-2
sig_estimate for estimating signature number for sig_extract, sig_auto_extract for extracting signatures using automatic relevance determination technique.
# Load copy number object load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) # Use method designed by Wang, Shixiang et al. cn_tally_W <- sig_tally(cn, method = "W") # Use method designed by Steele et al. # See example in read_copynumber # Prepare SBS signature analysis laml.maf <- system.file("extdata", "tcga_laml.maf.gz", package = "maftools") laml <- read_maf(maf = laml.maf) if (require("BSgenome.Hsapiens.UCSC.hg19")) { mt_tally <- sig_tally( laml, ref_genome = "BSgenome.Hsapiens.UCSC.hg19", use_syn = TRUE ) mt_tally$nmf_matrix[1:5, 1:5] ## Use strand bias categories mt_tally <- sig_tally( laml, ref_genome = "BSgenome.Hsapiens.UCSC.hg19", use_syn = TRUE, add_trans_bias = TRUE ) ## Test it by enrichment analysis enrich_component_strand_bias(mt_tally$nmf_matrix) enrich_component_strand_bias(mt_tally$all_matrices$SBS_24) } else { message("Please install package 'BSgenome.Hsapiens.UCSC.hg19' firstly!") }
# Load copy number object load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) # Use method designed by Wang, Shixiang et al. cn_tally_W <- sig_tally(cn, method = "W") # Use method designed by Steele et al. # See example in read_copynumber # Prepare SBS signature analysis laml.maf <- system.file("extdata", "tcga_laml.maf.gz", package = "maftools") laml <- read_maf(maf = laml.maf) if (require("BSgenome.Hsapiens.UCSC.hg19")) { mt_tally <- sig_tally( laml, ref_genome = "BSgenome.Hsapiens.UCSC.hg19", use_syn = TRUE ) mt_tally$nmf_matrix[1:5, 1:5] ## Use strand bias categories mt_tally <- sig_tally( laml, ref_genome = "BSgenome.Hsapiens.UCSC.hg19", use_syn = TRUE, add_trans_bias = TRUE ) ## Test it by enrichment analysis enrich_component_strand_bias(mt_tally$nmf_matrix) enrich_component_strand_bias(mt_tally$all_matrices$SBS_24) } else { message("Please install package 'BSgenome.Hsapiens.UCSC.hg19' firstly!") }
This function provides an unified interface to signature extractor
implemented in sigminer. If you determine a specific approach
,
please also read the documentation of corresponding extractor.
See "Arguments" part.
sig_unify_extract( nmf_matrix, range = 2:5, nrun = 10, approach = c("bayes_nmf", "repeated_nmf", "bootstrap_nmf", "sigprofiler"), cores = 1L, ... )
sig_unify_extract( nmf_matrix, range = 2:5, nrun = 10, approach = c("bayes_nmf", "repeated_nmf", "bootstrap_nmf", "sigprofiler"), cores = 1L, ... )
nmf_matrix |
a |
range |
signature number range, i.e. |
nrun |
the number of iteration to be performed to extract each signature number. |
approach |
approach name.
|
cores |
number of cores used for computation. |
... |
other parameters passing to signature extractor based
on the |
Result dependent on the approach
setting.
sig_extract, sig_auto_extract, bp_extract_signatures, sigprofiler
load(system.file("extdata", "toy_copynumber_tally_W.RData", package = "sigminer", mustWork = TRUE )) # Extract signatures # It is same as sig_extract(cn_tally_W$nmf_matrix, 2, nrun = 1) res <- sig_unify_extract(cn_tally_W$nmf_matrix, 2, nrun = 1, approach = "repeated_nmf" ) # Auto-extract signatures based on bayesian NMF res2 <- sig_unify_extract(cn_tally_W$nmf_matrix, nrun = 1, approach = "bayes_nmf" )
load(system.file("extdata", "toy_copynumber_tally_W.RData", package = "sigminer", mustWork = TRUE )) # Extract signatures # It is same as sig_extract(cn_tally_W$nmf_matrix, 2, nrun = 1) res <- sig_unify_extract(cn_tally_W$nmf_matrix, 2, nrun = 1, approach = "repeated_nmf" ) # Auto-extract signatures based on bayesian NMF res2 <- sig_unify_extract(cn_tally_W$nmf_matrix, nrun = 1, approach = "bayes_nmf" )
This function provides an interface to software SigProfiler.
More please see https://github.com/AlexandrovLab/SigProfilerExtractor.
Typically, a reference genome is not required because the input is a matrix (my understanding).
If you are using refitting result by SigProfiler, please make sure you have input the matrix same order as examples at https://github.com/AlexandrovLab/SigProfilerMatrixGenerator/tree/master/SigProfilerMatrixGenerator/references/matrix/BRCA_example. If not, use sigprofiler_reorder()
firstly.
sigprofiler_extract( nmf_matrix, output, output_matrix_only = FALSE, range = 2:5, nrun = 10L, refit = FALSE, refit_plot = FALSE, is_exome = FALSE, init_method = c("random", "nndsvd_min", "nndsvd", "nndsvda", "nndsvdar"), cores = -1L, genome_build = c("hg19", "hg38", "T2T", "mm10", "mm9", "ce11"), use_conda = FALSE, py_path = NULL, sigprofiler_version = "1.1.3" ) sigprofiler_import( output, order_by_expo = FALSE, type = c("suggest", "refit", "all") ) sigprofiler_reorder( nmf_matrix, type = c("SBS96", "SBS6", "SBS12", "SBS192", "SBS1536", "SBS3072", "DBS78", "DBS312", "DBS1248", "DBS4992") )
sigprofiler_extract( nmf_matrix, output, output_matrix_only = FALSE, range = 2:5, nrun = 10L, refit = FALSE, refit_plot = FALSE, is_exome = FALSE, init_method = c("random", "nndsvd_min", "nndsvd", "nndsvda", "nndsvdar"), cores = -1L, genome_build = c("hg19", "hg38", "T2T", "mm10", "mm9", "ce11"), use_conda = FALSE, py_path = NULL, sigprofiler_version = "1.1.3" ) sigprofiler_import( output, order_by_expo = FALSE, type = c("suggest", "refit", "all") ) sigprofiler_reorder( nmf_matrix, type = c("SBS96", "SBS6", "SBS12", "SBS192", "SBS1536", "SBS3072", "DBS78", "DBS312", "DBS1248", "DBS4992") )
nmf_matrix |
a |
output |
output directory. |
output_matrix_only |
if |
range |
signature number range, i.e. |
nrun |
the number of iteration to be performed to extract each signature number. |
refit |
if |
refit_plot |
if |
is_exome |
if |
init_method |
the initialization algorithm for W and H matrix of NMF. Options are 'random', 'nndsvd', 'nndsvda', 'nndsvdar', 'alexandrov-lab-custom' and 'nndsvd_min'. |
cores |
number of cores used for computation. |
genome_build |
I think this option is useless when input is |
use_conda |
if |
py_path |
path to Python executable file, e.g. '/Users/wsx/anaconda3/bin/python'. |
sigprofiler_version |
version of |
order_by_expo |
if |
type |
mutational signature type. |
For sigprofiler_extract()
, returns nothing. See output
directory.
For sigprofiler_import()
, a list
containing Signature
object.
A NMF matrix for input of sigprofiler_extract()
.
if (FALSE) { load(system.file("extdata", "toy_copynumber_tally_W.RData", package = "sigminer", mustWork = TRUE )) reticulate::conda_list() sigprofiler_extract(cn_tally_W$nmf_matrix, "~/test/test_sigminer", use_conda = TRUE ) sigprofiler_extract(cn_tally_W$nmf_matrix, "~/test/test_sigminer", use_conda = FALSE, py_path = "/Users/wsx/anaconda3/bin/python" ) } data("simulated_catalogs") sigprofiler_reorder(t(simulated_catalogs$set1))
if (FALSE) { load(system.file("extdata", "toy_copynumber_tally_W.RData", package = "sigminer", mustWork = TRUE )) reticulate::conda_list() sigprofiler_extract(cn_tally_W$nmf_matrix, "~/test/test_sigminer", use_conda = TRUE ) sigprofiler_extract(cn_tally_W$nmf_matrix, "~/test/test_sigminer", use_conda = FALSE, py_path = "/Users/wsx/anaconda3/bin/python" ) } data("simulated_catalogs") sigprofiler_reorder(t(simulated_catalogs$set1))
Data from doi:10.1038/s43018-020-0027-5. 5 simulated mutation catalogs are used by the paper but only 4 are available. The data are simulated from COSMIC mutational signatures 1, 2, 3, 5, 6, 8, 12, 13, 17 and 18. Each sample is a linear combination of 5 randomly selected signatures with the addiction of Poisson noise. The number of mutation in each sample is randomly selected between 1,000 and 50,000 mutations, in log scale so that a lower number of mutations is more likely to be selected. The proportion of each signature in each sample is also random.
A list of matrix
Generate from code under data_raw/
data(simulated_catalogs)
data(simulated_catalogs)
simulate_signature()
- Simulate signatures from signature pool.
simulate_catalogue()
- Simulate catalogs from signature/catalog pool.
simulate_catalogue_matrix()
- Simulate a bootstrapped catalog matrix.
simulate_signature(x, weights = NULL) simulate_catalogue(x, n, weights = NULL) simulate_catalogue_matrix(x)
simulate_signature(x, weights = NULL) simulate_catalogue(x, n, weights = NULL) simulate_catalogue_matrix(x)
x |
a numeric vector representing a signature/catalog or matrix with rows representing signatures/samples and columns representing components. |
weights |
a numeric vector for weights. |
n |
an integer indicating mutation number to be generated in a catalog. |
a matrix
.
# Generate a catalog set.seed(1234) catalog <- as.integer(table(sample(1:96, 1000, replace = TRUE))) names(catalog) <- paste0("comp", 1:96) # Generate a signature sig <- catalog / sum(catalog) # Simulate catalogs x1 <- simulate_catalogue(catalog, 10) # 10 mutations x1 x2 <- simulate_catalogue(catalog, 100) # 100 mutations x2 x3 <- simulate_catalogue(catalog, 1000) # 1000 mutations x3 # Similar with a signature x4 <- simulate_catalogue(sig, 10) # 10 mutations x4 # Load SBS signature load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) s <- t(sig2$Signature.norm) # Generate a signature from multiple signatures/catalogs s1 <- simulate_signature(s) s1 s2 <- simulate_signature(s, weights = 1:3) s2 # Generate a catalog from multiple signatures/catalogs c1 <- simulate_catalogue(s, 100, weights = 1:3) c1
# Generate a catalog set.seed(1234) catalog <- as.integer(table(sample(1:96, 1000, replace = TRUE))) names(catalog) <- paste0("comp", 1:96) # Generate a signature sig <- catalog / sum(catalog) # Simulate catalogs x1 <- simulate_catalogue(catalog, 10) # 10 mutations x1 x2 <- simulate_catalogue(catalog, 100) # 100 mutations x2 x3 <- simulate_catalogue(catalog, 1000) # 1000 mutations x3 # Similar with a signature x4 <- simulate_catalogue(sig, 10) # 10 mutations x4 # Load SBS signature load(system.file("extdata", "toy_mutational_signature.RData", package = "sigminer", mustWork = TRUE )) s <- t(sig2$Signature.norm) # Generate a signature from multiple signatures/catalogs s1 <- simulate_signature(s) s1 s2 <- simulate_signature(s, weights = 1:3) s2 # Generate a catalog from multiple signatures/catalogs c1 <- simulate_catalogue(s, 100, weights = 1:3) c1
Subset data
slot of CopyNumber object, un-selected rows will move to
dropoff.segs slot, annotation slot will update in the same way.
## S3 method for class 'CopyNumber' subset(x, subset = TRUE, ...)
## S3 method for class 'CopyNumber' subset(x, subset = TRUE, ...)
x |
a CopyNumber object to be subsetted. |
subset |
logical expression indicating rows to keep. |
... |
further arguments to be passed to or from other methods. Useless here. |
a CopyNumber object
Shixiang Wang
Merged Transcript Location at Genome Build hg19
A data.table
from GENCODE release v33.
data(transcript.hg19)
data(transcript.hg19)
Merged Transcript Location at Genome Build hg38
A data.table
from GENCODE release v33.
data(transcript.hg38)
data(transcript.hg38)
Merged Transcript Location at Genome Build mm10
A data.table
from GENCODE release M25.
data(transcript.mm10)
data(transcript.mm10)
Merged Transcript Location at Genome Build mm9
A data.table
from UCSC http://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/transcriptome.txt.gz
data(transcript.mm9)
data(transcript.mm9)
Merged Transcript Location at Genome Build T2T
A data.table
from T2T study.
data(transcript.T2T)
data(transcript.T2T)
Transform Copy Number Table
transform_seg_table( data, genome_build = c("hg19", "hg38", "T2T", "mm10", "mm9", "ce11"), ref_type = c("cytoband", "gene"), values_fill = NA, values_fn = function(x, ...) { round(mean(x, ...)) }, resolution_factor = 1L )
transform_seg_table( data, genome_build = c("hg19", "hg38", "T2T", "mm10", "mm9", "ce11"), ref_type = c("cytoband", "gene"), values_fill = NA, values_fn = function(x, ...) { round(mean(x, ...)) }, resolution_factor = 1L )
data |
a |
genome_build |
genome build version, used when |
ref_type |
annotation data type used for constructing matrix. |
values_fill |
Optionally, a (scalar) value that specifies what each
This can be a named list if you want to apply different fill values to different value columns. |
values_fn |
Optionally, a function applied to the value in each cell
in the output. You will typically use this when the combination of
This can be a named list if you want to apply different aggregations
to different |
resolution_factor |
an integer to control the resolution.
When it is |
a data.table
.
load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) # Compute the mean segVal in each cytoband x <- transform_seg_table(cn, resolution_factor = 1) x # Compute the mean segVal in each half-cytoband x2 <- transform_seg_table(cn, resolution_factor = 2) x2
load(system.file("extdata", "toy_copynumber.RData", package = "sigminer", mustWork = TRUE )) # Compute the mean segVal in each cytoband x <- transform_seg_table(cn, resolution_factor = 1) x # Compute the mean segVal in each half-cytoband x2 <- transform_seg_table(cn, resolution_factor = 2) x2
Set Color Style for Plotting
use_color_style( style, mode = c("SBS", "copynumber", "DBS", "ID", "RS"), method = "Wang" )
use_color_style( style, mode = c("SBS", "copynumber", "DBS", "ID", "RS"), method = "Wang" )
style |
one of 'default' and 'cosmic'. |
mode |
only used when the |
method |
used to set a more custom palette for different methods. |
color values.
use_color_style("default") use_color_style("cosmic")
use_color_style("default") use_color_style("cosmic")