The creDB package provides easy access to a sqlite database containing enhancer sequences annotated across a range of biological contexts (tissues, cell types). The package contains the database, it builds on the dplyr/dbplyr packages to provide data access, and it provides a creGet function to make querying the database easy. It aims to fill the gap between “enhancer browsers” on the one hand that provide a web-based graphical user interface, and simple flat files for download on the other hand. The former often cannot be employed in automated analyses, while the latter typically requires substantial time and effort for pre-processing etc. before data can be integrated in an analysis.
In the following we show some examples of how it can be used.
After loading the package an object of type creDB provides access to the database. In the following we show how to extract the set of enhancers annotated to be active in the heart.
First we load the database/package:
library(creDB)
CREDB.0.1
cdb = cdb
## Object of type creDB
The creDB class extends SQLiteConnection class from the RSQLite package. It provides easy access to the data with the creGet, so we do not have to worry about details of the database (see DB-schema). creGet retrieves information about annotated enhancers as columns (e.g., information about cis-regulatory sequences) for a range of keytypes (like biological context, genomic location, enhancer track, or study) and associated keys. Here enhancer tracks encode the annotation/identification approach employed to annotate enhancers. Examples are “H3K27ac” for enhancers called based on this histone mark, “eRNA”, “enhancerAtlas_consensus”, and others (see queries with creGet).
Next, we retrieve heart enhancers contained in creDB:
library(dplyr, quietly = TRUE)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
c("enh_chrm","enh_start","enh_end","context_name")
columns = creGet(cdb,keytype="context_name",keys="Heart",columns)
enh = enh
## # Source: lazy query [?? x 4]
## # Database: sqlite 3.30.1
## enh_chrm enh_start enh_end context_name
## <chr> <int> <int> <chr>
## 1 chr1 931201 934340 Heart
## 2 chr1 934811 934900 Heart
## 3 chr1 934991 935070 Heart
## 4 chr1 948961 949360 Heart
## 5 chr1 1207641 1209040 Heart
## 6 chr1 1309831 1310080 Heart
## 7 chr1 1341841 1342280 Heart
## 8 chr1 1342401 1342510 Heart
## 9 chr1 1407471 1408220 Heart
## 10 chr1 1471181 1475060 Heart
## # … with more rows
enh %>% head %>% collect
enh =#enh = enh %>% collect(n=Inf)
We can find out about the columns, keytypes, and keys creDB supports with creGet as follows:
listKeytypes(cdb)
## [1] "enh_id" "context_name" "track" "range" "study"
listKeys(cdb,keytype="context_name")
## # Source: lazy query [?? x 1]
## # Database: sqlite 3.30.1
## context_name
## <chr>
## 1 786-O
## 2 A375
## 3 A549
## 4 Adipocyte
## 5 Aml_blast
## 6 Astrocyte
## 7 B_cell_blood
## 8 Be2c
## 9 Bj
## 10 Bronchia_epithelial
## # … with more rows
listColumns(cdb)
## enhid
## "Integer; internal enhancer ID"
## enh_chrm
## "Character; chromosome"
## enh_start
## "Integer; enhancer start"
## enh_end
## "Integer; enhancer end"
## enh_clst
## "Integer; internal cluster ID"
## context_id
## "Integer; internal context ID"
## context_name
## "Character; name of the context (tissue / cell line)"
## study
## "Character; name of the study"
## pmid
## "Integer; pubmed ID"
## species
## "Character; species name"
## track
## "Character; annotation track"
listKeys(cdb,keytype="track")
## # Source: lazy query [?? x 1]
## # Database: sqlite 3.30.1
## track
## <chr>
## 1 enhancer-atlas-v2_consensus
## 2 benton_ChromHMM
## 3 benton_DNasePlusHistone
## 4 benton_EncodeEnhancerlike
## 5 benton_FANTOM
## 6 benton_H3K27acMinusH3K4me3
## 7 benton_H3K27acPlusH3K4me1
## 8 benton_H3K27ac
## 9 benton_Ho14
## 10 benton_Yip12
## # … with more rows
Therefore, another creGet–based query could be:
library(GenomicRanges, quietly = TRUE)
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
## lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
## tapply, union, unique, unsplit, which, which.max, which.min
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
##
## first, rename
## The following object is masked from 'package:base':
##
## expand.grid
##
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
GRanges("chr1", IRanges(107362617,107362617+100000))
my.range = creGet(cdb,keytype="range",keys=my.range,columns) tmp =
In addition to creGet, the creDB object provides direct access to the SQLite database (via the SQLiteConnection class). Here we display the available tables, please see the RSQLite and DBI packages for more information:
library(RSQLite)
dbListTables(cdb)
## [1] "contextid_annot" "dbplyr_001"
## [3] "enhid_contextid" "enhid_promid"
## [5] "enhid_range" "enhid_track"
## [7] "hg_promid_range" "hg_promid_transid_geneid"
## [9] "sqlite_stat1" "sqlite_stat4"
tbl(cdb,"enhid_range")
enh_range = filter(enh_range, enh_start %% 2 == 1) enh_odd =
It is straight forward to export enhancer sets as bed files, for example to use them in downstream analyses:
library(rtracklayer)
GRanges(enh$enh_chrm,IRanges(enh$enh_start,enh$enh_end))
heart_enh = sortSeqlevels(heart_enh) ; sort(heart_enh) heart_enh =
## GRanges object with 6 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 931201-934340 *
## [2] chr1 934811-934900 *
## [3] chr1 934991-935070 *
## [4] chr1 948961-949360 *
## [5] chr1 1207641-1209040 *
## [6] chr1 1309831-1310080 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
export.bed(sort(heart_enh),con="./heart-enhancers_hg19.bed")
For more details about export.bed and on how to include meta-data/annotation, please see the rtracklayer package. The file ./heart-enhancers_hg19.bed now contains the selected enhancers.
Here we show how to access the data used in Benton et al. As an example, we compare annotated liver enhancers from two strategies: H3K27ac and ChromHmm. We find that 44% of the H3K27ac enhancers are overlapped by ChromHmm enhancers, and that 68% of the ChromHmm enhancers are overlapped by the H3K27ac enhancers:
library(creDB)
library(GenomicRanges)
library(stringr)
CREDB.0.1
cdb =
#- information per enhancers
c("enh_id","enh_chrm","enh_start","enh_end","context_name","track")
columns =
#- retrieve enhancers from database
creGet(cdb,keytype="study", keys="benton",columns) %>% dplyr::collect()
enh =
#- remove duplicate annotation from enhancers also in enhancerAtlas
enh %>% dplyr::filter(str_detect(track,"^benton_"))
enh =
#- make GenomicRanges for Liver enhancer using two different identification strategies
enh %>%
liver_ac = dplyr::filter(context_name =="Liver" , track == "benton_H3K27ac") %>%
dplyr::select(enh_chrm,enh_start,enh_end)
GRanges(liver_ac$enh_chrm,IRanges(liver_ac$enh_start,liver_ac$enh_end))
liver_ac =
enh %>%
liver_ch = dplyr::filter(context_name =="Liver" , track == "benton_ChromHMM") %>%
dplyr::select(enh_chrm,enh_start,enh_end)
GRanges(liver_ch$enh_chrm,IRanges(liver_ch$enh_start,liver_ch$enh_end))
liver_ch =
#- how much do they overlap?
findOverlaps(liver_ac,liver_ch)
ov = length(unique(queryHits(ov))) /length(liver_ac) #- 44%
f1 = length(unique(subjectHits(ov)))/length(liver_ch) #- 68% f2 =
creDB contains the downloadable data from the EnhancerAtlas publication. Here is how to access and cast it into binary matrix format, with enhancers as (named) rows, contexts as columns, and a binary code for enhancer activity.
library(creDB)
library(reshape2)
library(stringr)
library(Matrix)
CREDB.0.1
cdb =
#- get enhancers from database
creGet( cdb, keytype="study", keys="EnhancerAtlas",
enh =columns=c( "enh_clst", "enh_chrm", "enh_start","enh_end","context_name")) %>%
collect
#- add an enhancer-name column
$name = str_c(enh$enh_chrm,":",enh$enh_start,"-",enh$enh_end)
enh
#- make a binary matrix indicating tissue expression per enhancer
dcast(enh,name~context_name,value.var="name")
tmp =rownames(tmp) = tmp[,"name"] ; tmp = tmp[,-1]
Matrix(!is.na(tmp),sparse=T) ; rm(tmp)
mat = mat
creDB contains the downloadable data from the Fantom repository
library(creDB)
CREDB.0.1
cdb =
#- get enhancers from database
creGet( cdb, keytype="study", keys="fantom",
enh =columns=c( "enh_clst", "enh_chrm", "enh_start","enh_end","context_name")) %>%
collect
head(enh)
Heer we give an overview of the tables in creDB and their relations:
Each of the tables can be accessed from R directly (see above):
library(creDB)
library(dplyr,quietly = TRUE)
dbListFields(cdb,"enhid_range")
## [1] "enh_id" "enh_chrm" "enh_start" "enh_end" "assembly" "enh_clst"
tbl(cdb, "enhid_range")
enhid_range = enhid_range
## # Source: table<enhid_range> [?? x 6]
## # Database: sqlite 3.30.1
## enh_id enh_chrm enh_start enh_end assembly enh_clst
## <int> <chr> <int> <int> <chr> <int>
## 1 601774 chr1 12231 12610 hg19 1
## 2 601775 chr1 12721 13220 hg19 2
## 3 723142 chr1 14831 14960 hg19 3
## 4 4358371 chr1 14992 15456 hg19 3
## 5 723143 chr1 15041 15640 hg19 3
## 6 601776 chr1 15041 15790 hg19 3
## 7 3011473 chr1 15951 16580 hg19 4
## 8 1414603 chr1 15951 16600 hg19 4
## 9 2234554 chr1 16771 16850 hg19 -1
## 10 2234555 chr1 17061 17230 hg19 5
## # … with more rows
In the following we describe each table and its columns:
Table enhid_range:
This table contains information about the location of each enhancer. The enh_clst column groups enhancers with not more than 50bp distance into gene regulatory loci.
Column | Type | Example value | Description |
---|---|---|---|
enh_id | INTEGER | 1607650 | Internal enhancer ID |
enh_chrm | VARCHAR | chr1 | Chromosome |
enh_start | INTEGER | 958511 | Enhancer start |
enh_end | INTEGER | 961000 | Enhancer end |
assembly | VARCHAR | hg19 | Assembly |
enh_clst | INTEGER | 98 | Internal enhancer locus/cluster ID |
Table: enhid_contextid
This table connects enhancer IDs with context IDs. A context is describes the conditions under which the enhancer was identified as active. See the contextid_annot table.
Column | Type | Example value | Description |
---|---|---|---|
enh_id | INTEGER | 1607650 | Internal enhancer ID |
context_id | INTEGER | 74 | Internal context ID |
score | VARCHAR | 3.27685525 | Enhancer activity score in the context (if available) |
Table: contextid_annot
This table describes the contexts of enhancers in creDB
Column | Type | Example value | Description |
---|---|---|---|
context_id | INTEGER | 74 | Internal context ID |
context_name | VARCHAR | Lung | Name of tissue / cell type / context |
ontology | VARCHAR | UBERON:0002048 | ontology of tissue / cell type / primary cells |
study | VARCHAR | EnhancerAtlas | Name of the study the context was assayed |
pmid | INTEGER | 27515742 | Pubmed ID of the study (if available) |
species | VARCHAR | HomoSapiens | Species name |
Table: enhid_track
This table links enhancer IDs to tracks. Tracks refer to the assay types that were used to identify enhancers in a certain context (cell type / tissue).
Column | Type | Example value | Description |
---|---|---|---|
enh_id | INTEGER | 1607650 | Internal enhancer ID |
track | VARCHAR | enhancerAtlas_consensus | Name describing track |
Table: enhid_promid
This table links enhancers to promoters, where interactions have been identified / predicted. Not for all studies.
Column | Type | Example value | Description |
---|---|---|---|
contact_id | INTEGER | 1754258 | Internal enhancer-promoter contact ID |
enh_id | INTEGER | 1607650 | Internal enhancer ID |
prom_id | INTEGER | 279730 | Internal promoter ID |
consistent | INTEGER | 1 | Does annotated promoter matches creDB annotation? |
study | VARCHAR | EnhancerAtlas | Name of the study the contact was assayed |
e_p_score | VARCHAR | 0.957 | Confidence score of the enhancer-target interaction |
context_id | INTEGER | 74 | Internal context ID |
method | VARCHAR | IM-PET | Method to predict enhancer targets |
In creDB GENCODE gene annotations (v26) are used (hg38 and “backmapped” to hg19). According to this annotation we define promoters as one base (the TSS). Therefore promoters can be associated with more than one transcript, and also with more than one gene and strand (happens, but rarely).
Table: hg_promid_range
This table describes human promoters (GENCODE v. 26)
Column | Type | Example value | Description |
---|---|---|---|
prom_id | INTEGER | 279730 | Internal promoter ID |
prom_chrom | VARCHAR | chr1 | Promoter chromosome |
tss | INTEGER | 955503 | Promoter (TSS) location |
strand | CHAR | + | Strand of promoter {+,-,*} |
assembly | VARCHAR | hg19 | Assembly |
Table: hg_promid_transid_geneid
This table links promoters to transcripts and genes (in human) using GENCODE.
Column | Type | Example value | Description |
---|---|---|---|
prom_id | INTEGER | 3899 | Internal promoter ID |
trans_id | VARCHAR | ENST00000379370 | Ensembl transcript name |
gene_id | VARCHAR | ENSG00000188157 | Ensembl gene name |