Title: | Convert Identifiers in Biological Databases |
---|---|
Description: | Identifiers in biological databases connect different levels of metadata, phenotype data or genotype data. This tool is designed to easily convert identifiers within or between different biological databases (Wang, Shixiang, et al. (2021) <DOI:10.1371/journal.pgen.1009557>). |
Authors: | Shixiang Wang [aut, cre] |
Maintainer: | Shixiang Wang <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3.5 |
Built: | 2024-11-15 02:50:34 UTC |
Source: | https://github.com/ShixiangWang/IDConverter |
Convert Identifiers with Custom Database
convert_custom(x, from = NULL, to = NULL, dt = NULL, multiple = FALSE)
x | A character vector to convert. |
from | Identifier type to convert from. |
to | Identifier type to convert to. |
dt | A data.table used as the custom database. |
multiple | if |
A character vector.
dt <- data.table::data.table(UpperCase = LETTERS[1:5], LowerCase = letters[1:5])
dt
x <- convert_custom(c("B", "C", "E", "E", "FF"), from = "UpperCase", to = "LowerCase", dt = dt)
x
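The example above maps each input through a two-column table. As a rough illustration of that idea, here is a minimal base R sketch of a keyed lookup. It is not the package's implementation: `lookup_ids` is a hypothetical helper, and returning `NA` for unmatched inputs such as "FF" is a property of this sketch rather than a documented behaviour of convert_custom().

```r
# Toy lookup table mirroring the example above (plain data.frame, no data.table needed).
db <- data.frame(
  UpperCase = LETTERS[1:5],
  LowerCase = letters[1:5],
  stringsAsFactors = FALSE
)

# Hypothetical helper: map each element of `x` from column `from` to column `to`.
# match() preserves the order and length of the input and yields NA when
# an input value is absent from the `from` column.
lookup_ids <- function(x, from, to, db) {
  stopifnot(from %in% names(db), to %in% names(db))
  db[[to]][match(x, db[[from]])]
}

res <- lookup_ids(c("B", "C", "E", "E", "FF"), from = "UpperCase", to = "LowerCase", db = db)
print(res)
```

Because match() returns only the first hit per input, a one-to-many mapping would need a join (for example merge()) instead of this vector lookup.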
Convert Human/Mouse Gene IDs between Ensembl and Hugo Symbol System
convert_hm_genes(
  IDs,
  type = c("ensembl", "symbol"),
  genome_build = c("hg38", "hg19", "mm10", "mm9"),
  multiple = FALSE
)
IDs | A character vector to convert. |
type | Type of input |
genome_build | Reference genome build. |
multiple | if |
a vector or a data.table.
convert_hm_genes("ENSG00000243485")
convert_hm_genes("ENSG00000243485", multiple = TRUE)
convert_hm_genes(c("TP53", "KRAS", "EGFR", "MYC"), type = "symbol")
Run data("icgc") to see the detailed database used for conversion.
convert_icgc(
  x,
  from = "icgc_specimen_id",
  to = "icgc_donor_id",
  multiple = FALSE
)
x | A character vector to convert. |
from | Identifier type to convert from. One of . |
to | Identifier type to convert to. Same as parameter from. |
multiple | if |
A character vector.
x <- convert_icgc("SP29019")
x
## Not run:
convert_icgc("SA170678")
## End(Not run)
Run data("pcawg_full") or data("pcawg_simple") to see the detailed database used for conversion.
The pcawg_simple database only contains PCAWG white-list donors.
convert_pcawg(
  x,
  from = "icgc_specimen_id",
  to = "icgc_donor_id",
  db = c("full", "simple"),
  multiple = FALSE
)
x | A character vector to convert. |
from | Identifier type to convert from. For db "full", one of . For db "simple", one of . |
to | Identifier type to convert to. Same as parameter from. |
db | Database, one of "full" (for pcawg_full) or "simple" (for pcawg_simple). |
multiple | if |
A character vector.
x <- convert_pcawg("SP1677")
x
y <- convert_pcawg("DO804",
  from = "icgc_donor_id", to = "icgc_specimen_id",
  multiple = TRUE
)
y
## Not run:
convert_pcawg("SA5213")
## End(Not run)
Run data("tcga") to see the detailed database used for conversion.
convert_tcga(x, from = "sample_id", to = "submitter_id", multiple = FALSE)
x | A character vector to convert. |
from | Identifier type to convert from. One of . |
to | Identifier type to convert to. Same as parameter from. |
multiple | if |
A character vector.
x <- convert_tcga("TCGA-02-0001-10")
x
## Not run:
convert_tcga("TCGA-02-0001-10A-01W-0188-10")
## End(Not run)
Check details for filter rules.
filter_tcga_barcodes(
  tsb,
  analyte_target = c("DNA", "RNA"),
  decreasing = TRUE,
  analyte_position = 20,
  plate = c(22, 25),
  portion = c(18, 19),
  filter_FFPE = FALSE
)
tsb | A vector of TCGA sample barcodes. |
analyte_target | Type of barcodes, "DNA" or "RNA". |
decreasing | if |
analyte_position | Character position of the analyte code in the barcode. Do not change it unless you understand the barcode layout. |
plate | Start and end character positions of the plate field in the barcode. Do not change it unless you understand the barcode layout. |
portion | Start and end character positions of the portion field in the barcode. Do not change it unless you understand the barcode layout. |
filter_FFPE | if |
In many instances there is more than one aliquot for a given combination of individual, platform, and data type. However, only one aliquot may be ingested into Firehose. Therefore, a set of precedence rules is applied to select the most scientifically advantageous one among them. Two filters are applied to achieve this aim: an Analyte Replicate Filter and a Sort Replicate Filter.
Analyte Replicate Filter: the following precedence rules are applied when the aliquots have differing analytes. For RNA aliquots, T analytes are dropped in preference to H and R analytes, since T is the inferior extraction protocol. If H and R are encountered, H is the chosen analyte. This is somewhat arbitrary and subject to change, since it is not clear at present whether H or R is the better protocol. If there are multiple aliquots associated with the chosen RNA analyte, the aliquot with the later plate number is chosen. For DNA aliquots, D analytes (native DNA) are preferred over G, W, or X (whole-genome amplified) analytes, unless the G, W, or X analyte sample has a higher plate number.
Sort Replicate Filter: the following precedence rule is applied when the analyte filter still produces more than one sample. The sort filter chooses the aliquot with the highest lexicographical sort value, to ensure that the barcode with the highest portion and/or plate number is selected when all other barcode fields are identical.
NOTE: In most cases, providing tsb and analyte_target is sufficient.
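To make the position arguments and the sort rule concrete, the following base R sketch extracts the portion, analyte, and plate fields from two barcodes using the default positions from the usage above, then picks the barcode with the highest lexicographical value. It is only an illustration: `parse_barcode_fields` is a hypothetical helper, not a function of this package, and the sketch does not implement the analyte precedence rules or the FFPE filter.

```r
barcodes <- c("TCGA-44-2656-01B-06D-A271-08", "TCGA-44-2656-01B-06D-A273-01")

# Hypothetical helper: slice barcode fields out by character position.
# The defaults mirror analyte_position, plate, and portion in the usage above.
parse_barcode_fields <- function(tsb,
                                 analyte_position = 20,
                                 plate = c(22, 25),
                                 portion = c(18, 19)) {
  data.frame(
    barcode = tsb,
    portion = substr(tsb, portion[1], portion[2]),
    analyte = substr(tsb, analyte_position, analyte_position),
    plate = substr(tsb, plate[1], plate[2]),
    stringsAsFactors = FALSE
  )
}

fields <- parse_barcode_fields(barcodes)
print(fields)

# Sort rule only: keep the barcode with the highest lexicographical value.
# method = "radix" sorts in the C locale, so the result does not depend on
# the collation settings of the current session.
chosen <- sort(barcodes, decreasing = TRUE, method = "radix")[1]
print(chosen)
```

Both barcodes here share the same portion and analyte, so the plate field alone decides which one the sort rule keeps.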
a barcode list.
Rules: https://confluence.broadinstitute.org/display/GDAC/FAQ#FAQ-sampleTypesQWhatTCGAsampletypesareFirehosepipelinesexecutedupon
FFPE cases: http://gdac.broadinstitute.org/runs/sampleReports/latest/FPPP_FFPE_Cases.html
filter_tcga_barcodes(c("TCGA-44-2656-01B-06D-A271-08", "TCGA-44-2656-01B-06D-A273-01"))
filter_tcga_barcodes(c("TCGA-44-2656-01B-06D-A271-08", "TCGA-44-2656-01B-06D-A273-01"),
  filter_FFPE = TRUE
)
ICGC Sample Identifiers
A data frame with 155874 rows and 6 variables.
https://dcc.icgc.org/repositories
load_data("icgc")
Data are stored in a remote Zenodo repository. This function downloads the required data and loads it into R.
load_data(x)
x | A dataset name. |
typically a data.frame, depends on x.
load_data("pcawg_full")
load_data("pcawg_simple")
load_data("tcga")
load_data("icgc")
annotables package
The tables are obtained from the annotables package and stored in Zenodo for better management. They can be downloaded and loaded with load_data(). See Details for more info.
ls_annotables()
Many bioinformatics tasks require converting gene identifiers from one convention to another, or annotating gene identifiers with gene symbol, description, position, etc. Sure, biomaRt does this for you, but users may get tired of remembering biomaRt syntax and hammering Ensembl's servers every time. These tables have basic annotation information from Ensembl Genes for:
Human build 38 (grch38)
Human build 37 (grch37)
Mouse (grcm38)
Rat (rnor6)
Chicken (galgal5)
Worm (wbcel235)
Fly (bdgp6)
Macaque (mmul801)
Where each table contains:
ensgene: Ensembl gene ID
entrez: Entrez gene ID
symbol: Gene symbol
chr: Chromosome
start: Start
end: End
strand: Strand
biotype: Protein coding, pseudogene, mitochondrial tRNA, etc.
description: Full gene name/description
Additionally, there are tx2gene tables that link Ensembl gene IDs to Ensembl transcript IDs.
NOTE: the description above is copied from the README of the annotables package.
If anything about the data tables is unclear, please refer to annotables.
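As a hedged illustration of how a table with these columns can be used for annotation, the base R sketch below attaches symbol and chromosome to a vector of Ensembl gene IDs. The table is a hand-typed toy with three well-known human genes and only a subset of the columns; a real table would be obtained with load_data(). `annotate_ids` is a hypothetical helper, and "ENSG00000000000" is a placeholder ID chosen so that one input has no match.

```r
# Toy stand-in for an annotation table (subset of columns, three rows typed by hand).
genes <- data.frame(
  ensgene = c("ENSG00000141510", "ENSG00000133703", "ENSG00000146648"),
  symbol = c("TP53", "KRAS", "EGFR"),
  chr = c("17", "12", "7"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: annotate `ids` with selected columns of `tbl`,
# keeping the input order and filling NA where an ID is not in the table.
annotate_ids <- function(ids, tbl, key = "ensgene", cols = c("symbol", "chr")) {
  idx <- match(ids, tbl[[key]])
  out <- data.frame(id = ids, stringsAsFactors = FALSE)
  for (col in cols) {
    out[[col]] <- tbl[[col]][idx]
  }
  out
}

# The middle ID is a placeholder that is absent from the toy table.
annotated <- annotate_ids(
  c("ENSG00000146648", "ENSG00000000000", "ENSG00000141510"),
  genes
)
print(annotated)
```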
a data.frame
https://github.com/stephenturner/annotables
ls_annotables()
load_data(ls_annotables()[1])
Parse Sample ID from GDC Portal File UUID
parse_gdc_file_uuid(
  x,
  legacy = FALSE,
  fields = "cases.samples.submitter_id,cases.samples.sample_type,file_id",
  token = NULL,
  max_try = 5L
)
x | A GDC manifest file or a vector of file UUIDs. |
legacy | Whether to use GDC legacy data. |
fields | The fields to query, given either as a single string with fields separated by commas or as a character vector. See https://docs.gdc.cancer.gov/API/Users_Guide/Appendix_A_Available_Fields/#file-fields for the list of available fields. |
token | The token used for querying. |
max_try | Maximum number of tries. |
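Since fields may be given either as one comma-separated string or as a character vector, the two forms can be brought to a common shape before building a query. The base R sketch below shows one way to do that. `normalize_fields` is a hypothetical helper written for illustration; how parse_gdc_file_uuid() handles this internally is not shown in this manual.

```r
# Hypothetical helper: accept a comma-separated string or a character vector
# and return a single comma-separated string with surrounding spaces removed.
normalize_fields <- function(fields) {
  parts <- trimws(unlist(strsplit(fields, ",", fixed = TRUE)))
  paste(parts[nzchar(parts)], collapse = ",")
}

as_string <- normalize_fields(
  "cases.samples.submitter_id,cases.samples.sample_type,file_id"
)
as_vector <- normalize_fields(
  c("cases.samples.submitter_id", "cases.samples.sample_type", "file_id")
)
print(identical(as_string, as_vector))
```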
a data.frame
parse_gdc_file_uuid("fe522fc8-e690-49b9-b3b6-fa3658705057")
parse_gdc_file_uuid(
  c(
    "fe522fc8-e690-49b9-b3b6-fa3658705057",
    "2c16506f-1110-4d60-81e3-a85233c79909"
  )
)
PCAWG Full Sample Identifiers
A data frame with 7255 rows and 8 variables.
https://dcc.icgc.org/releases/PCAWG
load_data("pcawg_full")
This dataset contains fewer records than data("pcawg_full") but more ID columns. Of note, only white-list donors are included.
A data frame with 2583 rows and 12 variables.
https://www.nature.com/articles/s41586-020-1969-6
load_data("pcawg_simple")
The code used to build the dataset can be found under data-raw.
Cases in the case_id column can be directly mapped to a GDC portal page, e.g. https://portal.gdc.cancer.gov/cases/30a1fe5e-5b12-472c-aa86-c2db8167ab23.
A data frame with 150849 rows and 5 variables.
https://portal.gdc.cancer.gov/
load_data("tcga")