Title: | Download and Explore Datasets from UCSC Xena Data Hubs |
---|---|
Description: | Download and explore datasets from UCSC Xena data hubs, which are a collection of UCSC-hosted public databases such as TCGA, ICGC, TARGET, GTEx, CCLE, and others. Databases are normalized so they can be combined, linked, filtered, explored and downloaded. |
Authors: | Shixiang Wang [aut, cre] , Xue-Song Liu [aut] , Martin Morgan [ctb], Christine Stawitz [rev] (Christine reviewed the package for ropensci, see <https://github.com/ropensci/software-review/issues/315>), Carl Ganz [rev] (Carl reviewed the package for ropensci, see <https://github.com/ropensci/software-review/issues/315>) |
Maintainer: | Shixiang Wang <[email protected]> |
License: | GPL-3 |
Version: | 1.6.0 |
Built: | 2024-10-30 09:18:30 UTC |
Source: | https://github.com/ropensci/UCSCXenaTools |
Get or Check TCGA Available ProjectID, DataType and FileType
availTCGA(which = c("all", "ProjectID", "DataType", "FileType"))
which |
a character of |
Shixiang Wang [email protected]
availTCGA("all")
Get cohorts of XenaHub object
cohorts(x)
x |
a XenaHub object |
a character vector containing cohorts
xe = XenaGenerate(subset = XenaHostNames == "tcgaHub"); cohorts(xe)
Get datasets of XenaHub object
datasets(x)
x |
a XenaHub object |
a character vector containing datasets
xe = XenaGenerate(subset = XenaHostNames == "tcgaHub"); datasets(xe)
TCGA is a widely used database, and this function downloads TCGA (including TCGA Pancan) datasets in a human-friendly way. It is aimed at users who are less familiar with R.
downloadTCGA( project = NULL, data_type = NULL, file_type = NULL, destdir = tempdir(), force = FALSE, ... )
project |
default is |
data_type |
default is |
file_type |
default is |
destdir |
location in which to store downloaded data. Defaults to the system temporary directory. |
force |
logical. if |
... |
other argument to |
All available information about TCGA datasets can be accessed via availTCGA() and checked with showTCGA().
Same as the result of XenaDownload().
Shixiang Wang [email protected]
XenaQuery(), XenaFilter(), XenaDownload(), XenaPrepare(), availTCGA(), showTCGA()
## Not run:
# download RNASeq data (use UVM as example)
downloadTCGA(project = "UVM",
             data_type = "Gene Expression RNASeq",
             file_type = "IlluminaHiSeq RNASeqV2")
## End(Not run)
When you only need data for a few genes or samples from a UCSC Xena dataset, use these fetch_ functions instead of downloading the whole dataset. See the following sections for details on each function.
fetch(host, dataset)
fetch_dense_values(host, dataset, identifiers = NULL, samples = NULL, check = TRUE, use_probeMap = FALSE, time_limit = 30)
fetch_sparse_values(host, dataset, genes, samples = NULL, time_limit = 30)
fetch_dataset_samples(host, dataset, limit = NULL)
fetch_dataset_identifiers(host, dataset)
has_probeMap(host, dataset, return_url = FALSE)
host |
a UCSC Xena host, like "https://toil.xenahubs.net".
All available hosts can be printed by |
dataset |
a UCSC Xena dataset, like "tcga_RSEM_gene_tpm".
All available datasets can be printed by running |
identifiers |
Identifiers can be probes (like "ENSG00000000419.12"),
genes (like "TP53"), etc. If it is |
samples |
ID of samples, like "TCGA-02-0047-01".
If it is |
check |
if |
use_probeMap |
if |
time_limit |
time limit for getting response in seconds. |
genes |
gene names. |
limit |
number of samples, if |
return_url |
if |
There are three primary data types: dense matrix (samples by probes, also called identifiers), sparse (sample, position, variant), and segmented (sample, position, value).
A dense matrix is a sample-by-identifier matrix and can be genotypic or phenotypic. Phenotypic matrices have associated field metadata (descriptive names, codes, etc.). Genotypic matrices may have an associated probeMap, which maps probes to genomic locations. If a matrix has a HUGO probeMap, the probes themselves are gene names. Otherwise, a probeMap is used to map a gene location to a set of probes.
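The three layouts described above can be pictured with small hand-made objects. This is only a sketch in base R: every sample name, probe name, position and value below is invented for illustration and is not fetched from any Xena hub, and the column names simply follow the (sample, position, variant) and (sample, position, value) wording above.

```r
# Toy illustration of the three Xena data layouts (all values are made up).
samples <- c("S1", "S2", "S3")
probes  <- c("probeA", "probeB")

# Dense: one numeric value for every sample/identifier pair.
dense <- matrix(
  c(1.2, 0.0, 3.4,   # column probeA
    5.6, 7.8, 9.0),  # column probeB
  nrow = length(samples),
  dimnames = list(samples, probes)
)

# Sparse: one row per observed variant; samples without variants have no rows.
sparse <- data.frame(
  sample   = c("S1", "S3"),
  position = c("chr17:7674220", "chr13:48303912"),
  variant  = c("C>T", "G>A"),
  stringsAsFactors = FALSE
)

# Segmented: one row per genomic segment, each carrying a value.
segmented <- data.frame(
  sample   = c("S1", "S1", "S2"),
  position = c("chr1:1-1000", "chr1:1001-5000", "chr1:1-5000"),
  value    = c(0.1, -0.8, 0.0),
  stringsAsFactors = FALSE
)

# Selecting samples and identifiers from a dense matrix is plain indexing.
dense_subset <- dense[c("S1", "S2"), "probeB", drop = FALSE]
dense_subset
```

Only the dense layout has a cell for every combination, which is why it can be subset by both samples and identifiers at once.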
a matrix, character vector, or list.
fetch_dense_values(): fetches values from a dense matrix.
fetch_sparse_values(): fetches values from a sparse data.frame.
fetch_dataset_samples(): fetches samples from a dataset.
fetch_dataset_identifiers(): fetches identifiers from a dataset.
has_probeMap(): checks if a dataset has a ProbeMap.
library(UCSCXenaTools)
host <- "https://toil.xenahubs.net"
dataset <- "tcga_RSEM_gene_tpm"
samples <- c("TCGA-02-0047-01", "TCGA-02-0055-01", "TCGA-02-2483-01", "TCGA-02-2485-01")
probes <- c("ENSG00000282740.1", "ENSG00000000005.5", "ENSG00000000419.12")
genes <- c("TP53", "RB1", "PIK3CA")
# Fetch samples
fetch_dataset_samples(host, dataset, 2)
# Fetch identifiers
fetch_dataset_identifiers(host, dataset)
# Fetch expression value by probes
fetch_dense_values(host, dataset, probes, samples, check = FALSE)
# Fetch expression value by gene symbol (if the dataset has probeMap)
has_probeMap(host, dataset)
fetch_dense_values(host, dataset, genes, samples, check = FALSE, use_probeMap = TRUE)
This is the most convenient function for downloading common TCGA datasets. It is similar to the getFirehoseData function in the RTCGAToolbox package.
getTCGAdata( project = NULL, clinical = TRUE, download = FALSE, forceDownload = FALSE, destdir = tempdir(), mRNASeq = FALSE, mRNAArray = FALSE, mRNASeqType = "normalized", miRNASeq = FALSE, exonRNASeq = FALSE, RPPAArray = FALSE, ReplicateBaseNormalization = FALSE, Methylation = FALSE, MethylationType = c("27K", "450K"), GeneMutation = FALSE, SomaticMutation = FALSE, GisticCopyNumber = FALSE, Gistic2Threshold = TRUE, CopyNumberSegment = FALSE, RemoveGermlineCNV = TRUE, ... )
project |
default is |
clinical |
logical. if |
download |
logical. if |
forceDownload |
logical. if |
destdir |
location in which to store downloaded data. Defaults to the system temporary directory. |
mRNASeq |
logical. if |
mRNAArray |
logical. if |
mRNASeqType |
character vector. Can be one, two or three
in |
miRNASeq |
logical. if |
exonRNASeq |
logical. if |
RPPAArray |
logical. if |
ReplicateBaseNormalization |
logical. if |
Methylation |
logical. if |
MethylationType |
character vector. Can be one or two in |
GeneMutation |
logical. if |
SomaticMutation |
logical. if |
GisticCopyNumber |
logical. if |
Gistic2Threshold |
logical. if |
CopyNumberSegment |
logical. if |
RemoveGermlineCNV |
logical. if |
... |
other argument to |
TCGA Common Data Sets are frequently used for biological analysis. To make these data easier to obtain, this function provides simple options for choosing datasets and behavior. All available information about TCGA datasets can be accessed via availTCGA() and checked with showTCGA().
If download = TRUE, returns the data.frame from XenaDownload(); otherwise returns a list containing a XenaHub object and dataset information.
Shixiang Wang [email protected]
###### get data, but not download
# 1 choose project and data types you wanna download
getTCGAdata(project = "LUAD", mRNASeq = TRUE, mRNAArray = TRUE,
            mRNASeqType = "normalized", miRNASeq = TRUE, exonRNASeq = TRUE,
            RPPAArray = TRUE, Methylation = TRUE, MethylationType = "450K",
            GeneMutation = TRUE, SomaticMutation = TRUE)
# 2 only choose 'LUAD' and its clinical data
getTCGAdata(project = "LUAD")
## Not run:
###### download datasets
# 3 download clinical datasets of LUAD and LUSC
getTCGAdata(project = c("LUAD", "LUSC"), clinical = TRUE, download = TRUE)
# 4 download clinical, RPPA and gene mutation datasets of LUAD and LUSC
# getTCGAdata(project = c("LUAD", "LUSC"), clinical = TRUE, RPPAArray = TRUE, GeneMutation = TRUE)
## End(Not run)
Get hosts of XenaHub object
hosts(x)
x |
a XenaHub object |
a character vector containing hosts
xe = XenaGenerate(subset = XenaHostNames == "tcgaHub"); hosts(xe)
One is often interested in identifying samples or features present in each data set, shared by all data sets, or present in any of several data sets. This function identifies these samples, including samples in arbitrarily chosen data sets.
samples( x, i = character(), by = c("hosts", "cohorts", "datasets"), how = c("each", "any", "all") )
x |
a XenaHub object |
i |
default is an empty character vector; it is used to specify
the host, cohort or dataset by |
by |
a character specify |
how |
a character specify |
a list containing samples
## Not run:
xe = XenaHub(cohorts = "Cancer Cell Line Encyclopedia (CCLE)")
# samples in each dataset, first host
x = samples(xe, by="datasets", how="each")[[1]]
lengths(x) # data sets in ccle cohort on first (only) host
## End(Not run)
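The three values of how ("each", "any", "all") correspond to the description above: samples present in each data set, present in any of them, or shared by all of them. The sketch below mimics those semantics with base R set operations on a made-up list of sample IDs. It is an illustration of the idea under that assumption, not the package's implementation, and the dataset and sample names are hypothetical.

```r
# Hypothetical sample IDs per dataset (made-up data, not fetched from Xena).
samples_by_dataset <- list(
  dataset_a = c("S1", "S2", "S3"),
  dataset_b = c("S2", "S3", "S4"),
  dataset_c = c("S3", "S5")
)

# "each": keep one vector of samples per dataset.
each_samples <- samples_by_dataset

# "any": samples present in at least one dataset (set union).
any_samples <- Reduce(union, samples_by_dataset)

# "all": samples shared by every dataset (set intersection).
all_samples <- Reduce(intersect, samples_by_dataset)

lengths(each_samples)
any_samples
all_samples
```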
This can be used to check manually whether a data type or file type exists in one or more projects.
showTCGA(project = "all")
project |
a character vector. Can be "all" or one or more of TCGA Project IDs. |
a data.frame including project data structure information.
Shixiang Wang [email protected]
showTCGA("all")
Convert camel case to snake case
to_snake(name)
name |
a character vector |
a character vector of the same length as name, converted to snake case
to_snake("sparseDataRange")
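For readers unfamiliar with the conversion, here is a minimal base R sketch of one way to turn camel case into snake case. The helper name camel_to_snake is hypothetical and this is not the package's to_snake() implementation; it only shows the general technique of inserting an underscore at each lower-to-upper boundary and then lowercasing.

```r
# Hypothetical helper; not the package's to_snake() implementation.
camel_to_snake <- function(name) {
  # Insert "_" between a lowercase letter or digit and the uppercase letter after it.
  with_breaks <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", name)
  tolower(with_breaks)
}

converted <- camel_to_snake(c("sparseDataRange", "datasetSamples"))
converted
# "sparse_data_range" "dataset_samples"
```

Runs of consecutive capitals (for example "XMLParser") are not split by this simple pattern, so a production version may need extra rules.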
Return Xena default hosts
xena_default_hosts()
A character vector containing the current default hosts
Shixiang Wang [email protected]
Opens the dataset or cohort link on UCSC Xena in the user's default browser.
XenaBrowse(x, type = c("dataset", "cohort"), multiple = FALSE)
x |
a XenaHub object. |
type |
one of "dataset" and "cohort". |
multiple |
if |
XenaGenerate(subset = XenaHostNames == "tcgaHub") %>%
  XenaFilter(filterDatasets = "clinical") %>%
  XenaFilter(filterDatasets = "LUAD") -> to_browse
This data.frame is useful for selecting datasets quickly, independent of the UCSC Xena Hubs APIs.
A tibble.
Generated from UCSC Xena Data Hubs.
data(XenaData)
str(XenaData)
Get or Update Newest Data Information of UCSC Xena Data Hubs
XenaDataUpdate(saveTolocal = TRUE)
saveTolocal |
logical. Whether to save the data to the local R package data directory for permanent use. |
a data.frame containing information on all Xena datasets.
Shixiang Wang [email protected]
## Not run:
XenaDataUpdate()
XenaDataUpdate(saveTolocal = TRUE)
## End(Not run)
Available datasets list: https://xenabrowser.net/datapages/
XenaDownload( xquery, destdir = tempdir(), download_probeMap = FALSE, trans_slash = FALSE, force = FALSE, max_try = 3L, ... )
xquery |
a tibble object generated by XenaQuery function. |
destdir |
location in which to store downloaded data. Defaults to the system temporary directory. |
download_probeMap |
if |
trans_slash |
logical, default is |
force |
logical. if |
max_try |
maximum number of attempts to download the data. |
... |
other argument to |
a tibble
Shixiang Wang [email protected]
## Not run:
xe = XenaGenerate(subset = XenaHostNames == "tcgaHub")
hosts(xe)
xe_query = XenaQuery(xe)
xe_download = XenaDownload(xe_query)
## End(Not run)
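The max_try argument suggests that a failed download is retried a limited number of times. The sketch below shows that general pattern in base R, assuming max_try counts attempts. The helper with_retries and the simulated flaky_download are hypothetical names invented here, and this is not the code XenaDownload() runs internally.

```r
# Hypothetical retry helper; not the package's internal code.
with_retries <- function(action, max_try = 3L) {
  for (attempt in seq_len(max_try)) {
    # Capture an error as a value instead of aborting.
    result <- tryCatch(action(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
  }
  stop("All ", max_try, " attempts failed")
}

# Simulated download that fails twice before succeeding.
calls <- 0L
flaky_download <- function() {
  calls <<- calls + 1L
  if (calls < 3L) stop("connection reset")
  "ok"
}

status <- with_retries(flaky_download, max_try = 3L)
status
```

Capping the number of attempts keeps a permanently unavailable file from blocking the rest of a batch download.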
One of the main functions in UCSCXenaTools. It filters a XenaHub object by cohorts and datasets. All datasets can be found at https://xenabrowser.net/datapages/.
XenaFilter( x, filterCohorts = NULL, filterDatasets = NULL, ignore.case = TRUE, ... )
x |
a XenaHub object |
filterCohorts |
default is |
filterDatasets |
default is |
ignore.case |
if |
... |
other arguments except |
a XenaHub object
Shixiang Wang [email protected]
# operate TCGA datasets
xe = XenaGenerate(subset = XenaHostNames == "tcgaHub")
xe
# get all names of clinical data
xe2 = XenaFilter(xe, filterDatasets = "clinical")
datasets(xe2)
Generate and Subset a XenaHub Object from 'XenaData'
XenaGenerate(XenaData = UCSCXenaTools::XenaData, subset = TRUE)
XenaData |
a |
subset |
logical expression indicating elements or rows to keep. |
a XenaHub object.
Shixiang Wang [email protected]
# 1 get all datasets
XenaGenerate()
# 2 get TCGA BRCA
XenaGenerate(subset = XenaCohorts == "TCGA Breast Cancer (BRCA)")
# 3 get all datasets containing BRCA
XenaGenerate(subset = grepl("BRCA", XenaCohorts))
Generates the original XenaHub object according to hosts, cohorts, datasets or hostName. If these arguments are not specified, all hosts and their corresponding datasets are returned as a XenaHub object. All datasets can be found at https://xenabrowser.net/datapages/.
XenaHub( hosts = xena_default_hosts(), cohorts = character(), datasets = character(), hostName = c("publicHub", "tcgaHub", "gdcHub", "gdcHubV18", "icgcHub", "toilHub", "pancanAtlasHub", "treehouseHub", "pcawgHub", "atacseqHub", "singlecellHub", "kidsfirstHub", "tdiHub") )
hosts |
a character vector specifying UCSC Xena hosts; all available hosts can be
found by |
cohorts |
default is empty character vector, all cohorts will be returned. |
datasets |
default is empty character vector, all datasets will be returned. |
hostName |
name of host, available options can be accessed by |
a XenaHub object
Shixiang Wang [email protected]
## Not run:
#1 query all hosts, cohorts and datasets
xe = XenaHub()
xe
#2 query only TCGA hosts
xe = XenaHub(hostName = "tcgaHub")
xe
hosts(xe) # get hosts
cohorts(xe) # get cohorts
datasets(xe) # get datasets
samples(xe) # get samples
## End(Not run)
An S4 class to represent UCSC Xena Data Hubs
hosts
hosts of data hubs
cohorts
cohorts of data hubs
datasets
datasets of data hubs
Prepare (Load) Downloaded Datasets to R
XenaPrepare( objects, objectsName = NULL, use_chunk = FALSE, chunk_size = 100, subset_rows = TRUE, select_cols = TRUE, callback = NULL, comment = "#", na = c("", "NA", "[Discrepancy]"), ... )
objects |
a object of character vector or data.frame. If |
objectsName |
specify names for elements of return object, i.e. names of list |
use_chunk |
default is |
chunk_size |
the number of rows to include in each chunk |
subset_rows |
logical expression indicating elements or rows to keep:
missing values are taken as false. |
select_cols |
expression, indicating columns to select from a data frame.
'x' can be a representation of data frame you wanna do subset operation,
e.g. |
callback |
a function to call on each chunk, default is |
comment |
a character specifying comment rows in files |
na |
a character vector specifying |
... |
other arguments transfer to |
a list containing the file data as tibbles
Shixiang Wang [email protected]
## Not run:
xe = XenaGenerate(subset = XenaHostNames == "tcgaHub")
hosts(xe)
xe_query = XenaQuery(xe)
xe_download = XenaDownload(xe_query)
dat = XenaPrepare(xe_download)
## End(Not run)
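The use_chunk, chunk_size and callback arguments describe reading a large file a fixed number of rows at a time and applying a function to each chunk. The base R sketch below illustrates that idea on a small temporary TSV file. The helper read_in_chunks and the toy data are hypothetical, and nothing here is taken from XenaPrepare()'s actual implementation, which may rely on a different reader.

```r
# Write a small tab-separated file to stand in for a downloaded dataset.
path <- tempfile(fileext = ".tsv")
write.table(
  data.frame(sample = paste0("S", 1:10), value = 1:10),
  path, sep = "\t", quote = FALSE, row.names = FALSE
)

# Hypothetical helper: read `chunk_size` rows at a time and apply `callback`.
read_in_chunks <- function(path, chunk_size, callback) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- strsplit(readLines(con, n = 1), "\t")[[1]]
  results <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    chunk <- read.delim(text = lines, header = FALSE,
                        col.names = header, stringsAsFactors = FALSE)
    results[[length(results) + 1]] <- callback(chunk)
  }
  # Combine the per-chunk results into one data.frame.
  do.call(rbind, results)
}

# Keep only rows with value > 7, never holding more than 4 rows at once.
kept <- read_in_chunks(path, chunk_size = 4,
                       callback = function(chunk) chunk[chunk$value > 7, ])
kept
```

Filtering inside the callback means only the rows of interest are kept in memory, which is the main reason to read in chunks.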
Query URL of Datasets before Downloading
XenaQuery(x)
x |
a XenaHub object |
a data.frame containing hosts, datasets and URLs
Shixiang Wang [email protected]
xe = XenaGenerate(subset = XenaHostNames == "tcgaHub")
hosts(xe)
## Not run:
xe_query = XenaQuery(xe)
## End(Not run)
If a dataset has no ProbeMap, it is ignored.
XenaQueryProbeMap(x)
x |
a XenaHub object |
a data.frame containing hosts, datasets and URLs
Shixiang Wang [email protected]
xe = XenaGenerate(subset = XenaHostNames == "tcgaHub")
hosts(xe)
## Not run:
xe_query = XenaQueryProbeMap(xe)
## End(Not run)
XenaScan() is a function that can be used before XenaGenerate().
XenaScan( XenaData = UCSCXenaTools::XenaData, pattern = NULL, ignore.case = TRUE )
XenaData |
a |
pattern |
character string containing a regular expression
(or character string for |
ignore.case |
if |
a data.frame
x1 <- XenaScan(pattern = "Blood")
x2 <- XenaScan(pattern = "LUNG", ignore.case = FALSE)
x1 %>% XenaGenerate()
x2 %>% XenaGenerate()
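To show what pattern and ignore.case do in a scan like this, the sketch below filters a toy data.frame by a regular expression with base R grepl(). The helper scan_rows and the rows of toy_data are hypothetical. The XenaHostNames and XenaCohorts column names are borrowed from the examples in this manual, XenaDatasets is assumed for illustration, and searching every column is an assumption about the behavior, not a description of XenaScan() itself.

```r
# Toy stand-in for XenaData (rows are made up).
toy_data <- data.frame(
  XenaHostNames = c("tcgaHub", "tcgaHub", "publicHub"),
  XenaCohorts   = c("TCGA Lung Adenocarcinoma (LUAD)",
                    "TCGA Breast Cancer (BRCA)",
                    "Example Blood Cohort"),
  XenaDatasets  = c("luad_clinical", "brca_clinical", "blood_expression"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows where any column matches the pattern.
scan_rows <- function(data, pattern, ignore.case = TRUE) {
  hits <- lapply(data, function(column) {
    grepl(pattern, column, ignore.case = ignore.case)
  })
  keep <- Reduce(`|`, hits)
  data[keep, , drop = FALSE]
}

lung_any_case  <- scan_rows(toy_data, "lung")                       # matches the LUAD row
lung_exact     <- scan_rows(toy_data, "LUNG", ignore.case = FALSE)  # no row contains "LUNG"
blood_any_case <- scan_rows(toy_data, "Blood")                      # matches the blood row
```

The result is still a data.frame with the same columns, which is what allows a scan result to be passed on to a later subsetting step.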