Cis–REgulatory elements DataBase

2020-10-05

Introduction

The creDB package provides easy access to a sqlite database containing enhancer sequences annotated across a range of biological contexts (tissues, cell types). The package contains the database, it builds on the dplyr/dbplyr packages to provide data access, and it provides a creGet function to make querying the database easy. It aims to fill the gap between “enhancer browsers” on the one hand that provide a web-based graphical user interface, and simple flat files for download on the other hand. The former often cannot be employed in automated analyses, while the latter typically requires substantial time and effort for pre-processing etc. before data can be integrated in an analysis.

In the following we show some examples of how it can be used.

Accessing enhancers by biological context (tissue / cell type)

After loading the package an object of type creDB provides access to the database. In the following we show how to extract the set of enhancers annotated to be active in the heart.

First we load the database/package:

library(creDB)
cdb = CREDB.0.1
cdb
## Object of type creDB

The creDB class extends SQLiteConnection class from the RSQLite package. It provides easy access to the data with the creGet, so we do not have to worry about details of the database (see DB-schema). creGet retrieves information about annotated enhancers as columns (e.g., information about cis-regulatory sequences) for a range of keytypes (like biological context, genomic location, enhancer track, or study) and associated keys. Here enhancer tracks encode the annotation/identification approach employed to annotate enhancers. Examples are “H3K27ac” for enhancers called based on this histone mark, “eRNA”, “enhancerAtlas_consensus”, and others (see queries with creGet).

Next, we retrieve heart enhancers contained in creDB:

library(dplyr, quietly = TRUE)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
columns = c("enh_chrm","enh_start","enh_end","context_name")
enh     = creGet(cdb,keytype="context_name",keys="Heart",columns)
enh
## # Source:   lazy query [?? x 4]
## # Database: sqlite 3.30.1
##    enh_chrm enh_start enh_end context_name
##    <chr>        <int>   <int> <chr>
##  1 chr1        931201  934340 Heart
##  2 chr1        934811  934900 Heart
##  3 chr1        934991  935070 Heart
##  4 chr1        948961  949360 Heart
##  5 chr1       1207641 1209040 Heart
##  6 chr1       1309831 1310080 Heart
##  7 chr1       1341841 1342280 Heart
##  8 chr1       1342401 1342510 Heart
##  9 chr1       1407471 1408220 Heart
## 10 chr1       1471181 1475060 Heart
## # … with more rows
enh     = enh %>% head %>% collect
#enh     = enh %>% collect(n=Inf)

Queries with creGet

We can find out about the columns, keytypes, and keys creDB supports with creGet as follows:

listKeytypes(cdb)
## [1] "enh_id"       "context_name" "track"        "range"        "study"
listKeys(cdb,keytype="context_name")
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.30.1
##    context_name
##    <chr>
##  1 786-O
##  2 A375
##  3 A549
##  4 Adipocyte
##  5 Aml_blast
##  6 Astrocyte
##  7 B_cell_blood
##  8 Be2c
##  9 Bj
## 10 Bronchia_epithelial
## # … with more rows
listColumns(cdb)
##                                                 enhid
##                       "Integer; internal enhancer ID"
##                                              enh_chrm
##                               "Character; chromosome"
##                                             enh_start
##                             "Integer; enhancer start"
##                                               enh_end
##                               "Integer; enhancer end"
##                                              enh_clst
##                        "Integer; internal cluster ID"
##                                            context_id
##                        "Integer; internal context ID"
##                                          context_name
## "Character; name of the context (tissue / cell line)"
##                                                 study
##                        "Character; name of the study"
##                                                  pmid
##                                  "Integer; pubmed ID"
##                                               species
##                             "Character; species name"
##                                                 track
##                         "Character; annotation track"
listKeys(cdb,keytype="track")
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.30.1
##    track
##    <chr>
##  1 enhancer-atlas-v2_consensus
##  2 benton_ChromHMM
##  3 benton_DNasePlusHistone
##  4 benton_EncodeEnhancerlike
##  5 benton_FANTOM
##  6 benton_H3K27acMinusH3K4me3
##  7 benton_H3K27acPlusH3K4me1
##  8 benton_H3K27ac
##  9 benton_Ho14
## 10 benton_Yip12
## # … with more rows

Therefore, another creGet–based query could be:

library(GenomicRanges, quietly = TRUE)
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:dplyr':
##
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which, which.max, which.min
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
##
##     first, rename
## The following object is masked from 'package:base':
##
##     expand.grid
##
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
##
##     collapse, desc, slice
my.range = GRanges("chr1", IRanges(107362617,107362617+100000))
tmp      = creGet(cdb,keytype="range",keys=my.range,columns)

Custom queries against creDB

In addition to creGet, the creDB object provides direct access to the SQLite database (via the SQLiteConnection class). Here we display the available tables, please see the RSQLite and DBI packages for more information:

library(RSQLite)

dbListTables(cdb)
##  [1] "contextid_annot"          "dbplyr_001"
##  [3] "enhid_contextid"          "enhid_promid"
##  [5] "enhid_range"              "enhid_track"
##  [7] "hg_promid_range"          "hg_promid_transid_geneid"
##  [9] "sqlite_stat1"             "sqlite_stat4"
enh_range = tbl(cdb,"enhid_range")
enh_odd   = filter(enh_range, enh_start %% 2 == 1)

Saving query results

It is straight forward to export enhancer sets as bed files, for example to use them in downstream analyses:

library(rtracklayer)

heart_enh = GRanges(enh$enh_chrm,IRanges(enh$enh_start,enh$enh_end))
heart_enh = sortSeqlevels(heart_enh) ; sort(heart_enh)
## GRanges object with 6 ranges and 0 metadata columns:
##       seqnames          ranges strand
##          <Rle>       <IRanges>  <Rle>
##   [1]     chr1   931201-934340      *
##   [2]     chr1   934811-934900      *
##   [3]     chr1   934991-935070      *
##   [4]     chr1   948961-949360      *
##   [5]     chr1 1207641-1209040      *
##   [6]     chr1 1309831-1310080      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
export.bed(sort(heart_enh),con="./heart-enhancers_hg19.bed")

For more details about export.bed and on how to include meta-data/annotation, please see the rtracklayer package. The file ./heart-enhancers_hg19.bed now contains the selected enhancers.

Benton et al. data

Here we show how to access the data used in Benton et al. As an example, we compare annotated liver enhancers from two strategies: H3K27ac and ChromHmm. We find that 44% of the H3K27ac enhancers are overlapped by ChromHmm enhancers, and that 68% of the ChromHmm enhancers are overlapped by the H3K27ac enhancers:

library(creDB)
library(GenomicRanges)
library(stringr)
cdb = CREDB.0.1

#- information per enhancers
columns = c("enh_id","enh_chrm","enh_start","enh_end","context_name","track")

#- retrieve enhancers from database
enh = creGet(cdb,keytype="study", keys="benton",columns) %>% dplyr::collect()

#- remove duplicate annotation from enhancers also in enhancerAtlas
enh = enh %>% dplyr::filter(str_detect(track,"^benton_"))

#- make GenomicRanges for Liver enhancer using two different identification strategies
liver_ac = enh %>%
           dplyr::filter(context_name =="Liver" , track == "benton_H3K27ac") %>%
           dplyr::select(enh_chrm,enh_start,enh_end)
liver_ac = GRanges(liver_ac$enh_chrm,IRanges(liver_ac$enh_start,liver_ac$enh_end))

liver_ch = enh %>%
           dplyr::filter(context_name =="Liver" , track == "benton_ChromHMM") %>%
           dplyr::select(enh_chrm,enh_start,enh_end)
liver_ch = GRanges(liver_ch$enh_chrm,IRanges(liver_ch$enh_start,liver_ch$enh_end))

#- how much do they overlap?
ov = findOverlaps(liver_ac,liver_ch)
f1 = length(unique(queryHits(ov)))  /length(liver_ac) #- 44%
f2 = length(unique(subjectHits(ov)))/length(liver_ch) #- 68%

Enhancer Atlas data

creDB contains the downloadable data from the EnhancerAtlas publication. Here is how to access and cast it into binary matrix format, with enhancers as (named) rows, contexts as columns, and a binary code for enhancer activity.

library(creDB)
library(reshape2)
library(stringr)
library(Matrix)

cdb = CREDB.0.1

#- get enhancers from database
enh = creGet( cdb,  keytype="study", keys="EnhancerAtlas",
              columns=c( "enh_clst", "enh_chrm", "enh_start","enh_end","context_name")) %>%
              collect

#- add an enhancer-name column
enh$name = str_c(enh$enh_chrm,":",enh$enh_start,"-",enh$enh_end)

#- make a binary matrix indicating tissue expression per enhancer
tmp           = dcast(enh,name~context_name,value.var="name")
rownames(tmp) = tmp[,"name"] ; tmp = tmp[,-1]
mat           = Matrix(!is.na(tmp),sparse=T) ; rm(tmp)
mat

Fantom data

creDB contains the downloadable data from the Fantom repository

library(creDB)

cdb = CREDB.0.1

#- get enhancers from database
enh = creGet( cdb,  keytype="study", keys="fantom",
              columns=c( "enh_clst", "enh_chrm", "enh_start","enh_end","context_name")) %>%
              collect
head(enh)

Database structure

Heer we give an overview of the tables in creDB and their relations:

creDBschema

Each of the tables can be accessed from R directly (see above):

library(creDB)
library(dplyr,quietly = TRUE)
dbListFields(cdb,"enhid_range")
## [1] "enh_id"    "enh_chrm"  "enh_start" "enh_end"   "assembly"  "enh_clst"
enhid_range  = tbl(cdb, "enhid_range")
enhid_range
## # Source:   table<enhid_range> [?? x 6]
## # Database: sqlite 3.30.1
##     enh_id enh_chrm enh_start enh_end assembly enh_clst
##      <int> <chr>        <int>   <int> <chr>       <int>
##  1  601774 chr1         12231   12610 hg19            1
##  2  601775 chr1         12721   13220 hg19            2
##  3  723142 chr1         14831   14960 hg19            3
##  4 4358371 chr1         14992   15456 hg19            3
##  5  723143 chr1         15041   15640 hg19            3
##  6  601776 chr1         15041   15790 hg19            3
##  7 3011473 chr1         15951   16580 hg19            4
##  8 1414603 chr1         15951   16600 hg19            4
##  9 2234554 chr1         16771   16850 hg19           -1
## 10 2234555 chr1         17061   17230 hg19            5
## # … with more rows

In the following we describe each table and its columns:

Table enhid_range:

This table contains information about the location of each enhancer. The enh_clst column groups enhancers with not more than 50bp distance into gene regulatory loci.

Column Type Example value Description
enh_id INTEGER 1607650 Internal enhancer ID
enh_chrm VARCHAR chr1 Chromosome
enh_start INTEGER 958511 Enhancer start
enh_end INTEGER 961000 Enhancer end
assembly VARCHAR hg19 Assembly
enh_clst INTEGER 98 Internal enhancer locus/cluster ID

Table: enhid_contextid

This table connects enhancer IDs with context IDs. A context is describes the conditions under which the enhancer was identified as active. See the contextid_annot table.

Column Type Example value Description
enh_id INTEGER 1607650 Internal enhancer ID
context_id INTEGER 74 Internal context ID
score VARCHAR 3.27685525 Enhancer activity score in the context (if available)

Table: contextid_annot

This table describes the contexts of enhancers in creDB

Column Type Example value Description
context_id INTEGER 74 Internal context ID
context_name VARCHAR Lung Name of tissue / cell type / context
ontology VARCHAR UBERON:0002048 ontology of tissue / cell type / primary cells
study VARCHAR EnhancerAtlas Name of the study the context was assayed
pmid INTEGER 27515742 Pubmed ID of the study (if available)
species VARCHAR HomoSapiens Species name

Table: enhid_track

This table links enhancer IDs to tracks. Tracks refer to the assay types that were used to identify enhancers in a certain context (cell type / tissue).

Column Type Example value Description
enh_id INTEGER 1607650 Internal enhancer ID
track VARCHAR enhancerAtlas_consensus Name describing track

Table: enhid_promid

This table links enhancers to promoters, where interactions have been identified / predicted. Not for all studies.

Column Type Example value Description
contact_id INTEGER 1754258 Internal enhancer-promoter contact ID
enh_id INTEGER 1607650 Internal enhancer ID
prom_id INTEGER 279730 Internal promoter ID
consistent INTEGER 1 Does annotated promoter matches creDB annotation?
study VARCHAR EnhancerAtlas Name of the study the contact was assayed
e_p_score VARCHAR 0.957 Confidence score of the enhancer-target interaction
context_id INTEGER 74 Internal context ID
method VARCHAR IM-PET Method to predict enhancer targets

In creDB GENCODE gene annotations (v26) are used (hg38 and “backmapped” to hg19). According to this annotation we define promoters as one base (the TSS). Therefore promoters can be associated with more than one transcript, and also with more than one gene and strand (happens, but rarely).

Table: hg_promid_range

This table describes human promoters (GENCODE v. 26)

Column Type Example value Description
prom_id INTEGER 279730 Internal promoter ID
prom_chrom VARCHAR chr1 Promoter chromosome
tss INTEGER 955503 Promoter (TSS) location
strand CHAR + Strand of promoter {+,-,*}
assembly VARCHAR hg19 Assembly

Table: hg_promid_transid_geneid

This table links promoters to transcripts and genes (in human) using GENCODE.

Column Type Example value Description
prom_id INTEGER 3899 Internal promoter ID
trans_id VARCHAR ENST00000379370 Ensembl transcript name
gene_id VARCHAR ENSG00000188157 Ensembl gene name