Cis–REgulatory elements DataBase

Accessing enhancers by biological context (tissue / cell type)

After loading the package an object of type creDB provides access to the database. In the following we show how to extract the set of enhancers annotated to be active in the heart.

First we load the database/package:

library(creDB)
cdb = CREDB.0.1
cdb

## Object of type creDB

The creDB class extends SQLiteConnection class from the RSQLite package. It provides easy access to the data with the creGet, so we do not have to worry about details of the database (see DB-schema). creGet retrieves information about annotated enhancers as columns (e.g., information about cis-regulatory sequences) for a range of keytypes (like biological context, genomic location, enhancer track, or study) and associated keys. Here enhancer tracks encode the annotation/identification approach employed to annotate enhancers. Examples are “H3K27ac” for enhancers called based on this histone mark, “eRNA”, “enhancerAtlas_consensus”, and others (see queries with creGet).

Next, we retrieve heart enhancers contained in creDB:

library(dplyr, quietly = TRUE)

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
##
##     filter, lag

## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

columns = c("enh_chrm","enh_start","enh_end","context_name")
enh     = creGet(cdb,keytype="context_name",keys="Heart",columns)
enh

## # Source:   lazy query [?? x 4]
## # Database: sqlite 3.30.1
##    enh_chrm enh_start enh_end context_name
##    <chr>        <int>   <int> <chr>
##  1 chr1        931201  934340 Heart
##  2 chr1        934811  934900 Heart
##  3 chr1        934991  935070 Heart
##  4 chr1        948961  949360 Heart
##  5 chr1       1207641 1209040 Heart
##  6 chr1       1309831 1310080 Heart
##  7 chr1       1341841 1342280 Heart
##  8 chr1       1342401 1342510 Heart
##  9 chr1       1407471 1408220 Heart
## 10 chr1       1471181 1475060 Heart
## # … with more rows

enh     = enh %>% head %>% collect
#enh     = enh %>% collect(n=Inf)

Queries with creGet

We can find out about the columns, keytypes, and keys creDB supports with creGet as follows:

listKeytypes(cdb)

## [1] "enh_id"       "context_name" "track"        "range"        "study"

listKeys(cdb,keytype="context_name")

## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.30.1
##    context_name
##    <chr>
##  1 786-O
##  2 A375
##  3 A549
##  4 Adipocyte
##  5 Aml_blast
##  6 Astrocyte
##  7 B_cell_blood
##  8 Be2c
##  9 Bj
## 10 Bronchia_epithelial
## # … with more rows

listColumns(cdb)

##                                                 enhid
##                       "Integer; internal enhancer ID"
##                                              enh_chrm
##                               "Character; chromosome"
##                                             enh_start
##                             "Integer; enhancer start"
##                                               enh_end
##                               "Integer; enhancer end"
##                                              enh_clst
##                        "Integer; internal cluster ID"
##                                            context_id
##                        "Integer; internal context ID"
##                                          context_name
## "Character; name of the context (tissue / cell line)"
##                                                 study
##                        "Character; name of the study"
##                                                  pmid
##                                  "Integer; pubmed ID"
##                                               species
##                             "Character; species name"
##                                                 track
##                         "Character; annotation track"

listKeys(cdb,keytype="track")

## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.30.1
##    track
##    <chr>
##  1 enhancer-atlas-v2_consensus
##  2 benton_ChromHMM
##  3 benton_DNasePlusHistone
##  4 benton_EncodeEnhancerlike
##  5 benton_FANTOM
##  6 benton_H3K27acMinusH3K4me3
##  7 benton_H3K27acPlusH3K4me1
##  8 benton_H3K27ac
##  9 benton_Ho14
## 10 benton_Yip12
## # … with more rows

Therefore, another creGet–based query could be:

library(GenomicRanges, quietly = TRUE)

##
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
##
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:dplyr':
##
##     combine, intersect, setdiff, union

## The following objects are masked from 'package:stats':
##
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
##
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which, which.max, which.min

##
## Attaching package: 'S4Vectors'

## The following objects are masked from 'package:dplyr':
##
##     first, rename

## The following object is masked from 'package:base':
##
##     expand.grid

##
## Attaching package: 'IRanges'

## The following objects are masked from 'package:dplyr':
##
##     collapse, desc, slice

my.range = GRanges("chr1", IRanges(107362617,107362617+100000))
tmp      = creGet(cdb,keytype="range",keys=my.range,columns)

Custom queries against creDB

In addition to creGet, the creDB object provides direct access to the SQLite database (via the SQLiteConnection class). Here we display the available tables, please see the RSQLite and DBI packages for more information:

library(RSQLite)

dbListTables(cdb)

##  [1] "contextid_annot"          "dbplyr_001"
##  [3] "enhid_contextid"          "enhid_promid"
##  [5] "enhid_range"              "enhid_track"
##  [7] "hg_promid_range"          "hg_promid_transid_geneid"
##  [9] "sqlite_stat1"             "sqlite_stat4"

enh_range = tbl(cdb,"enhid_range")
enh_odd   = filter(enh_range, enh_start %% 2 == 1)

Saving query results

It is straight forward to export enhancer sets as bed files, for example to use them in downstream analyses:

library(rtracklayer)

heart_enh = GRanges(enh$enh_chrm,IRanges(enh$enh_start,enh$enh_end))
heart_enh = sortSeqlevels(heart_enh) ; sort(heart_enh)

## GRanges object with 6 ranges and 0 metadata columns:
##       seqnames          ranges strand
##          <Rle>       <IRanges>  <Rle>
##   [1]     chr1   931201-934340      *
##   [2]     chr1   934811-934900      *
##   [3]     chr1   934991-935070      *
##   [4]     chr1   948961-949360      *
##   [5]     chr1 1207641-1209040      *
##   [6]     chr1 1309831-1310080      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

export.bed(sort(heart_enh),con="./heart-enhancers_hg19.bed")

For more details about export.bed and on how to include meta-data/annotation, please see the rtracklayer package. The file ./heart-enhancers_hg19.bed now contains the selected enhancers.

Benton et al. data

Here we show how to access the data used in Benton et al. As an example, we compare annotated liver enhancers from two strategies: H3K27ac and ChromHmm. We find that 44% of the H3K27ac enhancers are overlapped by ChromHmm enhancers, and that 68% of the ChromHmm enhancers are overlapped by the H3K27ac enhancers:

library(creDB)
library(GenomicRanges)
library(stringr)
cdb = CREDB.0.1

#- information per enhancers
columns = c("enh_id","enh_chrm","enh_start","enh_end","context_name","track")

#- retrieve enhancers from database
enh = creGet(cdb,keytype="study", keys="benton",columns) %>% dplyr::collect()

#- remove duplicate annotation from enhancers also in enhancerAtlas
enh = enh %>% dplyr::filter(str_detect(track,"^benton_"))

#- make GenomicRanges for Liver enhancer using two different identification strategies
liver_ac = enh %>%
           dplyr::filter(context_name =="Liver" , track == "benton_H3K27ac") %>%
           dplyr::select(enh_chrm,enh_start,enh_end)
liver_ac = GRanges(liver_ac$enh_chrm,IRanges(liver_ac$enh_start,liver_ac$enh_end))

liver_ch = enh %>%
           dplyr::filter(context_name =="Liver" , track == "benton_ChromHMM") %>%
           dplyr::select(enh_chrm,enh_start,enh_end)
liver_ch = GRanges(liver_ch$enh_chrm,IRanges(liver_ch$enh_start,liver_ch$enh_end))

#- how much do they overlap?
ov = findOverlaps(liver_ac,liver_ch)
f1 = length(unique(queryHits(ov)))  /length(liver_ac) #- 44%
f2 = length(unique(subjectHits(ov)))/length(liver_ch) #- 68%

Enhancer Atlas data

creDB contains the downloadable data from the EnhancerAtlas publication. Here is how to access and cast it into binary matrix format, with enhancers as (named) rows, contexts as columns, and a binary code for enhancer activity.

library(creDB)
library(reshape2)
library(stringr)
library(Matrix)

cdb = CREDB.0.1

#- get enhancers from database
enh = creGet( cdb,  keytype="study", keys="EnhancerAtlas",
              columns=c( "enh_clst", "enh_chrm", "enh_start","enh_end","context_name")) %>%
              collect

#- add an enhancer-name column
enh$name = str_c(enh$enh_chrm,":",enh$enh_start,"-",enh$enh_end)

#- make a binary matrix indicating tissue expression per enhancer
tmp           = dcast(enh,name~context_name,value.var="name")
rownames(tmp) = tmp[,"name"] ; tmp = tmp[,-1]
mat           = Matrix(!is.na(tmp),sparse=T) ; rm(tmp)
mat

Fantom data

creDB contains the downloadable data from the Fantom repository

library(creDB)

cdb = CREDB.0.1

#- get enhancers from database
enh = creGet( cdb,  keytype="study", keys="fantom",
              columns=c( "enh_clst", "enh_chrm", "enh_start","enh_end","context_name")) %>%
              collect
head(enh)

Database structure

Heer we give an overview of the tables in creDB and their relations:

creDBschema

Each of the tables can be accessed from R directly (see above):

library(creDB)
library(dplyr,quietly = TRUE)
dbListFields(cdb,"enhid_range")

## [1] "enh_id"    "enh_chrm"  "enh_start" "enh_end"   "assembly"  "enh_clst"

enhid_range  = tbl(cdb, "enhid_range")
enhid_range

## # Source:   table<enhid_range> [?? x 6]
## # Database: sqlite 3.30.1
##     enh_id enh_chrm enh_start enh_end assembly enh_clst
##      <int> <chr>        <int>   <int> <chr>       <int>
##  1  601774 chr1         12231   12610 hg19            1
##  2  601775 chr1         12721   13220 hg19            2
##  3  723142 chr1         14831   14960 hg19            3
##  4 4358371 chr1         14992   15456 hg19            3
##  5  723143 chr1         15041   15640 hg19            3
##  6  601776 chr1         15041   15790 hg19            3
##  7 3011473 chr1         15951   16580 hg19            4
##  8 1414603 chr1         15951   16600 hg19            4
##  9 2234554 chr1         16771   16850 hg19           -1
## 10 2234555 chr1         17061   17230 hg19            5
## # … with more rows

In the following we describe each table and its columns:

Table enhid_range:

This table contains information about the location of each enhancer. The enh_clst column groups enhancers with not more than 50bp distance into gene regulatory loci.

Column	Type	Example value	Description
enh_id	INTEGER	1607650	Internal enhancer ID
enh_chrm	VARCHAR	chr1	Chromosome
enh_start	INTEGER	958511	Enhancer start
enh_end	INTEGER	961000	Enhancer end
assembly	VARCHAR	hg19	Assembly
enh_clst	INTEGER	98	Internal enhancer locus/cluster ID

Table: enhid_contextid

This table connects enhancer IDs with context IDs. A context is describes the conditions under which the enhancer was identified as active. See the contextid_annot table.

Column	Type	Example value	Description
enh_id	INTEGER	1607650	Internal enhancer ID
context_id	INTEGER	74	Internal context ID
score	VARCHAR	3.27685525	Enhancer activity score in the context (if available)

Table: contextid_annot

This table describes the contexts of enhancers in creDB

Column	Type	Example value	Description
context_id	INTEGER	74	Internal context ID
context_name	VARCHAR	Lung	Name of tissue / cell type / context
ontology	VARCHAR	UBERON:0002048	ontology of tissue / cell type / primary cells
study	VARCHAR	EnhancerAtlas	Name of the study the context was assayed
pmid	INTEGER	27515742	Pubmed ID of the study (if available)
species	VARCHAR	HomoSapiens	Species name

Table: enhid_track

This table links enhancer IDs to tracks. Tracks refer to the assay types that were used to identify enhancers in a certain context (cell type / tissue).

Column	Type	Example value	Description
enh_id	INTEGER	1607650	Internal enhancer ID
track	VARCHAR	enhancerAtlas_consensus	Name describing track

Table: enhid_promid

This table links enhancers to promoters, where interactions have been identified / predicted. Not for all studies.

Column	Type	Example value	Description
contact_id	INTEGER	1754258	Internal enhancer-promoter contact ID
enh_id	INTEGER	1607650	Internal enhancer ID
prom_id	INTEGER	279730	Internal promoter ID
consistent	INTEGER	1	Does annotated promoter matches creDB annotation?
study	VARCHAR	EnhancerAtlas	Name of the study the contact was assayed
e_p_score	VARCHAR	0.957	Confidence score of the enhancer-target interaction
context_id	INTEGER	74	Internal context ID
method	VARCHAR	IM-PET	Method to predict enhancer targets

In creDB GENCODE gene annotations (v26) are used (hg38 and “backmapped” to hg19). According to this annotation we define promoters as one base (the TSS). Therefore promoters can be associated with more than one transcript, and also with more than one gene and strand (happens, but rarely).

Table: hg_promid_range

This table describes human promoters (GENCODE v. 26)

Column	Type	Example value	Description
prom_id	INTEGER	279730	Internal promoter ID
prom_chrom	VARCHAR	chr1	Promoter chromosome
tss	INTEGER	955503	Promoter (TSS) location
strand	CHAR	+	Strand of promoter {+,-,*}
assembly	VARCHAR	hg19	Assembly

Table: hg_promid_transid_geneid

This table links promoters to transcripts and genes (in human) using GENCODE.

Column	Type	Example value	Description
prom_id	INTEGER	3899	Internal promoter ID
trans_id	VARCHAR	ENST00000379370	Ensembl transcript name
gene_id	VARCHAR	ENSG00000188157	Ensembl gene name