Introduction

The creDB package provides easy access to a sqlite database containing enhancer sequences annotated across a range of biological contexts (tissues, cell types). The package contains the database, it builds on the dplyr/dbplyr packages to provide data access, and it provides a creGet function to make querying the database easy. It aims to fill the gap between “enhancer browsers” on the one hand that provide a web-based graphical user interface, and simple flat files for download on the other hand. The former often cannot be employed in automated analyses, while the latter typically requires substantial time and effort for pre-processing etc. before data can be integrated in an analysis.

In the following we show some examples of how it can be used.

Accessing enhancers by biological context (tissue / cell type)

After loading the package an object of type creDB provides access to the database. In the following we show how to extract the set of enhancers annotated to be active in the heart.

First we load the database/package:

library(creDB)
cdb = CREDB.0.1
cdb
## Object of type creDB

The creDB class extends SQLiteConnection class from the RSQLite package. It provides easy access to the data with the creGet, so we do not have to worry about details of the database (see DB-schema). creGet retrieves information about annotated enhancers as columns (e.g., information about cis-regulatory sequences) for a range of keytypes (like biological context, genomic location, enhancer track, or study) and associated keys. Here enhancer tracks encode the annotation/identification approach employed to annotate enhancers. Examples are “H3K27ac” for enhancers called based on this histone mark, “eRNA”, “enhancerAtlas_consensus”, and others (see queries with creGet).

Next, we retrieve heart enhancers contained in creDB:

library(dplyr,quietly = TRUE)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
columns = c("enh_chrm","enh_start","enh_end","context_name")
enh     = creGet(cdb,keytype="context_name",keys="HEART",columns)
enh
## # Source:   lazy query [?? x 4]
## # Database: sqlite 3.19.3
## #   [/Library/Frameworks/R.framework/Versions/3.4/Resources/library/creDB/extdata/creDB.sqlite]
##    enh_chrm enh_start  enh_end context_name
##       <chr>     <int>    <int>        <chr>
##  1     chr1   1509090  1509860        HEART
##  2     chr1   1589750  1591110        HEART
##  3     chr1   1839740  1841270        HEART
##  4     chr1   2474880  2481230        HEART
##  5     chr1   2573340  2575700        HEART
##  6     chr1   8021850  8022390        HEART
##  7     chr1   8761100  8764030        HEART
##  8     chr1  11072800 11073420        HEART
##  9     chr1  11967670 11969280        HEART
## 10     chr1  11969580 11970340        HEART
## # ... with more rows
enh     = enh %>% head %>% collect
#enh     = enh %>% collect(n=Inf)

Queries with creGet

We can find out about the columns, keytypes, and keys creDB supports with creGet as follows:

listKeytypes(cdb)
## [1] "enh_id"       "context_name" "track"        "range"       
## [5] "study"
listKeys(cdb,keytype="context_name")
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.19.3
## #   [/Library/Frameworks/R.framework/Versions/3.4/Resources/library/creDB/extdata/creDB.sqlite]
##           context_name
##                  <chr>
##  1                A549
##  2           ASTROCYTE
##  3                  BJ
##  4 BRONCHIA_EPITHELIAL
##  5              CACO-2
##  6              CD133+
##  7               CD14+
##  8               CD19+
##  9               CD20+
## 10               CD34+
## # ... with more rows
listColumns(cdb)
##                                                 enhid 
##                       "Integer; internal enhancer ID" 
##                                              enh_chrm 
##                               "Character; chromosome" 
##                                             enh_start 
##                             "Integer; enhancer start" 
##                                               enh_end 
##                               "Integer; enhancer end" 
##                                              enh_clst 
##                        "Integer; internal cluster ID" 
##                                            context_id 
##                        "Integer; internal context ID" 
##                                          context_name 
## "Character; name of the context (tissue / cell line)" 
##                                                 study 
##                        "Character; name of the study" 
##                                                  pmid 
##                                  "Integer; pubmed ID" 
##                                               species 
##                             "Character; species name" 
##                                                 track 
##                         "Character; annotation track"
listKeys(cdb,keytype="track")
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.19.3
## #   [/Library/Frameworks/R.framework/Versions/3.4/Resources/library/creDB/extdata/creDB.sqlite]
##                         track
##                         <chr>
##  1    enhancerAtlas_consensus
##  2            benton_ChromHMM
##  3    benton_DNasePlusHistone
##  4  benton_EncodeEnhancerlike
##  5              benton_FANTOM
##  6             benton_H3K27ac
##  7 benton_H3K27acMinusH3K4me3
##  8  benton_H3K27acPlusH3K4me1
##  9                benton_Ho14
## 10               benton_Yip12
## # ... with more rows

Therefore, another creGet–based query could be:

library(GenomicRanges, quietly = TRUE)
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, cbind, colMeans,
##     colnames, colSums, do.call, duplicated, eval, evalq, Filter,
##     Find, get, grep, grepl, intersect, is.unsorted, lapply,
##     lengths, Map, mapply, match, mget, order, paste, pmax,
##     pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
##     rowMeans, rownames, rowSums, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which, which.max, which.min
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
## 
##     first, rename
## The following object is masked from 'package:base':
## 
##     expand.grid
## 
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice
my.range = GRanges("chr1", IRanges(107362617,107362617+100000))
tmp      = creGet(cdb,keytype="range",keys=my.range,columns)

Custom queries against creDB

In addition to creGet, the creDB object provides direct access to the SQLite database (via the SQLiteConnection class). Here we display the available tables, please see the RSQLite and DBI packages for more information:

library(RSQLite)

dbListTables(cdb)
##  [1] "contextid_annot"          "enhid_contextid"         
##  [3] "enhid_promid"             "enhid_range"             
##  [5] "enhid_track"              "hg_promid_range"         
##  [7] "hg_promid_transid_geneid" "nxvdbxzkkk"              
##  [9] "sqlite_stat1"             "sqlite_stat4"
enh_range = tbl(cdb,"enhid_range")
enh_odd   = filter(enh_range, enh_start %% 2 == 1)

Saving query results

It is straight forward to export enhancer sets as bed files, for example to use them in downstream analyses:

library(rtracklayer)

heart_enh = GRanges(enh$enh_chrm,IRanges(enh$enh_start,enh$enh_end))
heart_enh = sortSeqlevels(heart_enh) ; sort(heart_enh)
## GRanges object with 6 ranges and 0 metadata columns:
##       seqnames             ranges strand
##          <Rle>          <IRanges>  <Rle>
##   [1]     chr1 [1509090, 1509860]      *
##   [2]     chr1 [1589750, 1591110]      *
##   [3]     chr1 [1839740, 1841270]      *
##   [4]     chr1 [2474880, 2481230]      *
##   [5]     chr1 [2573340, 2575700]      *
##   [6]     chr1 [8021850, 8022390]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
export.bed(sort(heart_enh),con="./heart-enhancers_hg19.bed")

For more details about export.bed and on how to include meta-data/annotation, please see the rtracklayer package. The file ./heart-enhancers_hg19.bed now contains the selected enhancers.

Benton et al. data

Here we show how to access the data used in Benton et al. As an example, we compare annotated liver enhancers from two strategies: H3K27ac and ChromHmm. We find that 44% of the H3K27ac enhancers are overlapped by ChromHmm enhancers, and that 68% of the ChromHmm enhancers are overlapped by the H3K27ac enhancers:

library(creDB)
library(GenomicRanges)
cdb = CREDB.0.1

#- information per enhancers
columns = c("enh_id","enh_chrm","enh_start","enh_end","context_name","track")

#- retrieve enhancers from database
enh = creGet(cdb,keytype="study", keys="benton",columns) %>% collect

#- remove duplicate annotation from enhancers also in enhancerAtlas
enh = enh %>% filter(track != "enhancerAtlas_consensus")

#- make GenomicRanges for Liver enhancer using two different identification strategies
liver_ac = enh %>%
           filter(context_name =="LIVER" , track == "benton_H3K27ac") %>%
           select(enh_chrm,enh_start,enh_end)
liver_ac = GRanges(liver_ac$enh_chrm,IRanges(liver_ac$enh_start,liver_ac$enh_end))

liver_ch = enh %>%
           filter(context_name =="LIVER" , track == "benton_ChromHMM") %>%
           select(enh_chrm,enh_start,enh_end)
liver_ch = GRanges(liver_ch$enh_chrm,IRanges(liver_ch$enh_start,liver_ch$enh_end))

#- how much do they overlap?
ov = findOverlaps(liver_ac,liver_ch)
f1 = length(unique(queryHits(ov)))  /length(liver_ac) #- 44%
f2 = length(unique(subjectHits(ov)))/length(liver_ch) #- 68%

Enhancer Atlas data

creDB contains the downloadable data from the EnhancerAtlas publication. Here is how to access and cast it into binary matrix format, with enhancers as (named) rows, contexts as columns, and a binary code for enhancer activity.

library(creDB)
library(reshape2)
library(stringr)
library(Matrix)

cdb = CREDB.0.1

#- get enhancers from database
enh = creGet( cdb,  keytype="study", keys="EnhancerAtlas",
              columns=c( "enh_clst", "enh_chrm", "enh_start","enh_end","context_name")) %>%
              collect

#- add an enhancer-name column
enh$name = str_c(enh$enh_chrm,":",enh$enh_start,"-",enh$enh_end)

#- make a binary matrix indicating tissue expression per enhancer
tmp           = dcast(enh,name~context_name,value.var="name")
rownames(tmp) = tmp[,"name"] ; tmp = tmp[,-1]
mat           = Matrix(!is.na(tmp),sparse=T) ; rm(tmp)
mat

Database structure

Heer we give an overview of the tables in creDB and their relations:

creDBschema

creDBschema

Each of the tables can be accessed from R directly (see above):

library(creDB)
library(dplyr,quietly = TRUE)
dbListFields(cdb,"enhid_range")
## [1] "enh_id"    "enh_chrm"  "enh_start" "enh_end"   "enh_clst"
enhid_range  = tbl(cdb, "enhid_range")
enhid_range
## # Source:   table<enhid_range> [?? x 5]
## # Database: sqlite 3.19.3
## #   [/Library/Frameworks/R.framework/Versions/3.4/Resources/library/creDB/extdata/creDB.sqlite]
##     enh_id enh_chrm enh_start enh_end enh_clst
##      <int>    <chr>     <int>   <int>    <int>
##  1 3055278     chr1     14992   15456        1
##  2 3055279     chr1     23096   24164        2
##  3  208416     chr1     24890   29320        3
##  4 3055280     chr1     26192   26662        3
##  5 3055281     chr1     26992   27475        3
##  6  208417     chr1     27460   29320        3
##  7  208418     chr1     28600   29320        3
##  8  208419     chr1     28640   29320        3
##  9  208420     chr1     35180   35270       -1
## 10 3055282     chr1     40011   40432        4
## # ... with more rows

In the following we describe each table and its columns:

Table enhid_range:

This table contains information about the location of each enhancer. The enh_clst column groups enhancers with not more than 50bp distance into gene regulatory loci.

Column Type Example value Description
enh_id INTEGER 34759 Internal enhancer ID
enh_chrm VARCHAR chr1 Chromosome
enh_start INTEGER 12983 Enhancer start
enh_end INTEGER 13930 Enhancer end
assembly VARCHAR hg19 Assembly
enh_clst INTEGER 34 Internal enhancer locus/cluster ID

Table: enhid_contextid

This table connects enhancer IDs with context IDs. A context is describes the conditions under which the enhancer was identified as active. See the contextid_annot table.

Column Type Example value Description
enh_id INTEGER 34759 Internal enhancer ID
context_id INTEGER 24857 Internal context ID

Table: contextid_annot

This table describes the contexts of enhancers in creDB

Column Type Example value Description
context_id INTEGER 24857 Internal context ID
context_name VARCHAR HEART Name of tissue / cell type / context
study VARCHAR enhancerAtlas Name of the study the context was assayed
pmid INTEGER 1928304 Pubmed ID of the study (if available)
species VARCHAR HomoSapiens Species name

Table: enhid_track

This table links enhancer IDs to tracks. Tracks refer to the assay types that were used to identify enhancers in a certain context (cell type / tissue).

Column Type Example value Description
enh_id INTEGER 34759 Internal enhancer ID
track VARCHAR H3K27Ac Name describing track

Table: enhid_promid

This table links enhancers to promoters, where interactions have been identified / predicted. Not for all studies.

Column Type Example value Description
contact_id INTEGER 43 Internal enhancer-promoter contact ID
enh_id INTEGER 34759 Internal enhancer ID
prom_id INTEGER 3899 Internal promoter ID
consistent INTEGER 1 Does annotated promoter matches creDB annotation?
study VARCHAR enhancerAtlas Name of the study the contact was assayed
context_id INTEGER 24857 Internal context ID

In creDB GENCODE gene annotations (v26) are used (hg38 and “backmapped” to hg19). According to this annotation we define promoters as one base (the TSS). Therefore promoters can be associated with more than one transcript, and also with more than one gene and strand (happens, but rarely).

Table: hg_promid_range

This table describes human promoters (GENCODE v. 26)

Column Type Example value Description
prom_id INTEGER 3899 Internal promoter ID
prom_chrom VARCHAR chr1 Promoter chromosome
tss INTEGER 123 Promoter (TSS) location
strand CHAR + Strand of promoter {+,-,*}
assembly VARCHAR hg19 Assembly

Table: enhid_promid_transid_geneid

This table links promoters to transcripts and genes (in human) using GENCODE.

Column Type Example value Description
prom_id INTEGER 3899 Internal promoter ID
trans_id VARCHAR ENST00000578629 Ensembl transcript name
gene_id VARCHAR ENSG00000284596 Ensembl gene name