1 Introduction

The aim of TKCat (Tailored Knowledge Catalog) is to facilitate the management of data from knowledge resources that are frequently used alone or together in research environments. In TKCat, knowledge resources are manipulated as modeled database (MDB) objects. These objects provide access to the data tables along with a general description of the resource and a detailed data model generated with ReDaMoR documenting the tables, their fields and their relationships. These MDBs are then gathered in catalogs that can be easily explored and shared. TKCat provides tools to easily subset, filter and combine MDBs and create new catalogs suited for specific needs.

The TKCat R package is licensed under GPL-3.

This vignette focuses on local usage of TKCat in the R console. Two other vignettes describe more specifically how TKCat can be used with a ClickHouse database from a user or an operational perspective. A final vignette is dedicated to extended documentation of collections.

2 Modeled databases and embedded information

A modeled database (MDB) in TKCat gathers the following information:

  • General database information including a mandatory name and optionally the following fields: title, description, url, version and maintainer.
  • A data model created using the ReDaMoR package.
  • A list of tables corresponding to reference concepts shared by different MDBs. The way these concepts are identified is defined in specific documents called collections.
  • The data themselves organized according to the data model.

2.1 Reading examples

2.1.1 HPO

A subset of the Human Phenotype Ontology (HPO) is provided within the ReDaMoR package. The HPO aims to provide a standardized vocabulary of phenotypic abnormalities encountered in human diseases (Köhler et al. 2019). An MDB object based on files (see MDB implementations) can be read as shown below. As explained above, the data provided by the path parameter are documented with a model (dataModel parameter) and general information (dbInfo parameter).

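library(TKCat)
library(magrittr) # %>% and set_colnames() used in the examples
library(dplyr)    # as_tibble(), mutate(), filter(), select(), ...
file_hpo <- read_fileMDB(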
   path=system.file("examples/HPO-subset", package="ReDaMoR"),
   dataModel=system.file("examples/HPO-model.json", package="ReDaMoR"),
   dbInfo=list(
      "name"="HPO",
      "title"="Data extracted from the HPO database",
      "description"=paste(
         "This is a very small subset of the HPO!",
         "Visit the reference URL for more information."
      ),
      "url"="http://human-phenotype-ontology.github.io/"
   )
)
## HPO
## SUCCESS
## 
## Check configuration
##    - Optional checks: 
##    - Maximum number of records: 10

The message displayed in the console indicates whether the data fit the data model. It relies on the ReDaMoR::confront_data() function and checks by default the first 10 rows of each file.

The data model can then be drawn.

plot(data_model(file_hpo))

In this model, the HPO_hp table refers to the concept of phenotype and the HPO_diseases table to the concept of disease. These concepts are used to define the condition of individuals. The Condition collection is built into the TKCat package. Identifying the collection members in an MDB is done by providing a table shaped as displayed below and using the collection_members()<- function. As described in the Merging with collections section, collections identify concepts shared by different MDBs and can be used to merge resources according to these concepts.

cn <- c(
   "collection", "cid",                "resource", "mid", "table",        "field",     "static", "value",    "type"
)
cm <- matrix(data=c(
   "Condition",  "HPO_conditions_1.0", "HPO",      1,     "HPO_hp",       "condition",  TRUE,    "Phenotype", NA,
   "Condition",  "HPO_conditions_1.0", "HPO",      1,     "HPO_hp",       "source",     TRUE,    "HP",        NA,
   "Condition",  "HPO_conditions_1.0", "HPO",      1,     "HPO_hp",       "identifier", FALSE,   "id",        NA,
   "Condition",  "HPO_conditions_1.0", "HPO",      2,     "HPO_diseases", "condition",  TRUE,    "Disease",   NA,
   "Condition",  "HPO_conditions_1.0", "HPO",      2,     "HPO_diseases", "source",     FALSE,   "db",        NA,
   "Condition",  "HPO_conditions_1.0", "HPO",      2,     "HPO_diseases", "identifier", FALSE,   "id",        NA
   ),
   ncol=9, byrow=TRUE
) %>%
   set_colnames(cn) %>% 
   as_tibble() %>% 
   mutate(mid=as.integer(mid), static=as.logical(static))
collection_members(file_hpo) <- cm
file_hpo
## fileMDB HPO: Data extracted from the HPO database
##    - 9 tables with 25 fields
## 
## Collection members: 
##    - 2 Condition members
## 
## This is a very small subset of the HPO! Visit the reference URL for more information.
## (http://human-phenotype-ontology.github.io/)
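Because a matrix holds a single type, the construction above turns every cell into a character string, which is why mid and static are converted back with mutate(). As a sketch of an alternative, the same table can be built column by column with base R so that each column gets its type directly. Whether collection_members()<- accepts a plain data frame is an assumption here; converting it with as_tibble() first is the cautious option.

```r
# Column-wise construction of the HPO collection members table.
# Same content as the matrix-based version; only the construction differs.
cm_df <- data.frame(
   collection = "Condition",
   cid        = "HPO_conditions_1.0",
   resource   = "HPO",
   mid        = c(1L, 1L, 1L, 2L, 2L, 2L),
   table      = rep(c("HPO_hp", "HPO_diseases"), each = 3),
   field      = rep(c("condition", "source", "identifier"), times = 2),
   static     = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
   value      = c("Phenotype", "HP", "id", "Disease", "db", "id"),
   type       = NA_character_,
   stringsAsFactors = FALSE
)
# mid is already integer and static already logical: no conversion needed
str(cm_df)
```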

2.1.2 ClinVar

A subset of the ClinVar database is provided within this package. ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence (Landrum et al. 2018). This resource can be read as a fileMDB as shown above, except that in this case all the documenting information is included in the resource directory, which is organized as follows:

  • DESCRIPTION.json contains db_information

  • data is a directory with all the data tables

  • model is a directory with model information:

    • A json file with the ClinVar data model created with the ReDaMoR package
    • A Collections subfolder with one json file per collection with members in the ClinVar resource

file_clinvar <- read_fileMDB(
   path=system.file("examples/ClinVar", package="TKCat")
)
## ClinVar
## SUCCESS
## 
## Check configuration
##    - Optional checks: 
##    - Maximum number of records: 10

2.1.3 CHEMBL

A self-documented subset of the CHEMBL database is also provided in this package. It can be read the same way as the ClinVar resource.

file_chembl <- read_fileMDB(
   path=system.file("examples/CHEMBL", package="TKCat")
)
## CHEMBL
## SUCCESS
## 
## Check configuration
##    - Optional checks: 
##    - Maximum number of records: 10

CHEMBL is a manually curated chemical database of bioactive molecules with drug-like properties (Mendez et al. 2019).

2.2 MDB implementations

There are 3 main implementations of MDBs:

  • fileMDB objects keep the data in files and load them only when requested by the user. This implementation is the first one used when reading MDBs, as demonstrated in the examples above.

  • memoMDB objects have all the data loaded in memory. These objects are very easy to use but can take time to load and can use a lot of memory.

  • chMDB objects get the data from a ClickHouse database providing a catalog of MDBs as described in the chTKCat section below. More information about chTKCat and chMDB objects can also be found in the chTKCat user guide and the chTKCat operational manual.

The different implementations can be converted to one another using the as_fileMDB(), as_memoMDB() and as_chMDB() functions.

memo_clinvar <- as_memoMDB(file_clinvar)
object.size(file_clinvar) %>% print(units="Kb")
## 154.5 Kb
object.size(memo_clinvar) %>% print(units="Kb")
## 689 Kb

A fourth implementation is metaMDB which combines several MDBs glued together with relational tables (see the Merging with collections section).

Most of the functions described below work with any MDB implementation. A few functions are specific to each implementation.

2.3 Getting information

2.3.1 General information

db_info(file_clinvar)
## $name
## [1] "ClinVar"
## 
## $title
## [1] "Data extracted from the ClinVar database"
## 
## $description
## [1] "ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. This is a very small subset of ClinVar! Visit the reference URL for more information."
## 
## $url
## [1] "https://www.ncbi.nlm.nih.gov/clinvar/"
## 
## $version
## [1] "0.9"
## 
## $maintainer
## [1] "Patrice Godard <patrice.godard@ucb.com>"

The function db_info()<- can be used to update this information.

2.3.2 Data model

As shown above, the data model of an MDB can be retrieved and plotted as follows.

plot(data_model(file_clinvar))

Table names can be listed with the names() function and changed with names()<- or rename().

names(file_clinvar)
##  [1] "ClinVar_ReferenceClinVarAssertion" "ClinVar_rcvaVariant"              
##  [3] "ClinVar_ClinVarAssertions"         "ClinVar_rcvaInhMode"              
##  [5] "ClinVar_rcvaObservedIn"            "ClinVar_rcvaTraits"               
##  [7] "ClinVar_clinSigOrder"              "ClinVar_revStatOrder"             
##  [9] "ClinVar_variants"                  "ClinVar_cvaObservedIn"            
## [11] "ClinVar_cvaSubmitters"             "ClinVar_traits"                   
## [13] "ClinVar_varEntrez"                 "ClinVar_varAttributes"            
## [15] "ClinVar_varCytoLoc"                "ClinVar_varNames"                 
## [17] "ClinVar_varSeqLoc"                 "ClinVar_varXRef"                  
## [19] "ClinVar_traitCref"                 "ClinVar_traitNames"               
## [21] "ClinVar_entrezNames"

The different collection members of an MDB are listed with the collection_members() function and updated with collection_members()<-.

collection_members(file_clinvar)
## # A tibble: 10 x 9
##    collection cid        resource   mid table     field   static value   type   
##    <chr>      <chr>      <chr>    <int> <chr>     <chr>   <lgl>  <chr>   <chr>  
##  1 BE         ClinVar_B… ClinVar      1 ClinVar_… be      TRUE   Gene    <NA>   
##  2 BE         ClinVar_B… ClinVar      1 ClinVar_… identi… FALSE  entrez  <NA>   
##  3 BE         ClinVar_B… ClinVar      1 ClinVar_… organi… TRUE   Homo s… Scient…
##  4 BE         ClinVar_B… ClinVar      1 ClinVar_… source  TRUE   Entrez… <NA>   
##  5 Condition  ClinVar_c… ClinVar      1 ClinVar_… condit… TRUE   Disease <NA>   
##  6 Condition  ClinVar_c… ClinVar      1 ClinVar_… identi… FALSE  id      <NA>   
##  7 Condition  ClinVar_c… ClinVar      1 ClinVar_… source  FALSE  db      <NA>   
##  8 Condition  ClinVar_c… ClinVar      2 ClinVar_… condit… TRUE   Disease <NA>   
##  9 Condition  ClinVar_c… ClinVar      2 ClinVar_… identi… FALSE  id      <NA>   
## 10 Condition  ClinVar_c… ClinVar      2 ClinVar_… source  TRUE   ClinVar <NA>

2.3.3 Size

The following functions are used to get the number of tables, the number of fields per table and the number of records.

length(file_clinvar)        # Number of tables
## [1] 21
lengths(file_clinvar)       # Number of fields per table
## ClinVar_ReferenceClinVarAssertion               ClinVar_rcvaVariant 
##                                 8                                 2 
##         ClinVar_ClinVarAssertions               ClinVar_rcvaInhMode 
##                                 4                                 2 
##            ClinVar_rcvaObservedIn                ClinVar_rcvaTraits 
##                                 6                                 3 
##              ClinVar_clinSigOrder              ClinVar_revStatOrder 
##                                 2                                 2 
##                  ClinVar_variants             ClinVar_cvaObservedIn 
##                                 3                                 4 
##             ClinVar_cvaSubmitters                    ClinVar_traits 
##                                 3                                 2 
##                 ClinVar_varEntrez             ClinVar_varAttributes 
##                                 3                                 5 
##                ClinVar_varCytoLoc                  ClinVar_varNames 
##                                 2                                 3 
##                 ClinVar_varSeqLoc                   ClinVar_varXRef 
##                                18                                 4 
##                 ClinVar_traitCref                ClinVar_traitNames 
##                                 4                                 3 
##               ClinVar_entrezNames 
##                                 3
count_records(file_clinvar) # Number of records per table
## ClinVar_ReferenceClinVarAssertion               ClinVar_rcvaVariant 
##                               166                               166 
##         ClinVar_ClinVarAssertions               ClinVar_rcvaInhMode 
##                               409                                16 
##            ClinVar_rcvaObservedIn                ClinVar_rcvaTraits 
##                               337                               166 
##              ClinVar_clinSigOrder              ClinVar_revStatOrder 
##                                11                                 2 
##                  ClinVar_variants             ClinVar_cvaObservedIn 
##                               138                               412 
##             ClinVar_cvaSubmitters                    ClinVar_traits 
##                               416                                18 
##                 ClinVar_varEntrez             ClinVar_varAttributes 
##                               145                              2262 
##                ClinVar_varCytoLoc                  ClinVar_varNames 
##                               138                               188 
##                 ClinVar_varSeqLoc                   ClinVar_varXRef 
##                               280                               244 
##                 ClinVar_traitCref                ClinVar_traitNames 
##                                50                                44 
##               ClinVar_entrezNames 
##                                20

The count_records() function can take a lot of time when dealing with fileMDB objects if the data files are very large. In such cases, it can be more efficient to list the size of the data files instead.

data_file_size(file_clinvar, hr=TRUE)
## ClinVar_ReferenceClinVarAssertion               ClinVar_rcvaVariant 
##                          "4.6 KB"                           "947 B" 
##         ClinVar_ClinVarAssertions               ClinVar_rcvaInhMode 
##                          "4.2 KB"                           "152 B" 
##            ClinVar_rcvaObservedIn                ClinVar_rcvaTraits 
##                          "1.4 KB"                           "788 B" 
##              ClinVar_clinSigOrder              ClinVar_revStatOrder 
##                           "145 B"                           "101 B" 
##                  ClinVar_variants             ClinVar_cvaObservedIn 
##                          "2.1 KB"                          "1.8 KB" 
##             ClinVar_cvaSubmitters                    ClinVar_traits 
##                          "2.6 KB"                           "403 B" 
##                 ClinVar_varEntrez             ClinVar_varAttributes 
##                           "711 B"                         "18.3 KB" 
##                ClinVar_varCytoLoc                  ClinVar_varNames 
##                           "544 B"                          "2.6 KB" 
##                 ClinVar_varSeqLoc                   ClinVar_varXRef 
##                          "3.9 KB"                          "2.3 KB" 
##                 ClinVar_traitCref                ClinVar_traitNames 
##                           "697 B"                           "914 B" 
##               ClinVar_entrezNames 
##                           "619 B"

2.4 Pulling, subsetting and combining

There are several possible ways to pull data tables from MDBs. The following lines return the same results.

data_tables(file_clinvar, "ClinVar_traitNames")[[1]]
file_clinvar[["ClinVar_traitNames"]]
file_clinvar$"ClinVar_traitNames"
file_clinvar %>% pull(ClinVar_traitNames)
## # A tibble: 44 x 3
##     t.id name                                                           type    
##    <int> <chr>                                                          <chr>   
##  1   912 Chudley-McCullough syndrome                                    Preferr…
##  2   912 Deafness, autosomal recessive 82                               Alterna…
##  3   912 Deafness, bilateral sensorineural, and hydrocephalus due to f… Alterna…
##  4   912 Deafness, sensorineural, with partial agenesis of the corpus … Alterna…
##  5  1352 CTSD-Related Neuronal Ceroid-Lipofuscinosis                    Alterna…
##  6  1352 Ceroid lipofuscinosis neuronal Cathepsin D-deficient           Alterna…
##  7  1352 Neuronal ceroid lipofuscinosis 10                              Preferr…
##  8  1352 Neuronal ceroid lipofuscinosis due to Cathepsin D deficiency   Alterna…
##  9  1481 Diabetes mellitus, neonatal, with congenital hypothyroidism    Preferr…
## 10  1481 NDH SYNDROME                                                   Alterna…
## # … with 34 more rows

MDBs can also be subset and combined. The corresponding functions ensure that the data model is fulfilled by the data tables.

file_clinvar[1:3]
## fileMDB ClinVar (version 0.9, Patrice Godard <patrice.godard@ucb.com>): Data extracted from the ClinVar database
##    - 3 tables with 14 fields
## 
## No collection member
## 
## ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. This is a very small subset of ClinVar! Visit the reference URL for more information.
## (https://www.ncbi.nlm.nih.gov/clinvar/)
c(file_clinvar[1:3], file_hpo[c(1,5,7)]) %>% 
   data_model() %>% auto_layout(force=TRUE) %>% plot()

The function c() concatenates the provided MDBs after checking that table names are not duplicated. It does not integrate the data with any relational table. This can be achieved by merging the MDBs as described in the Merging with collections section.

2.5 Filtering and joining

An MDB can be filtered by filtering one or several tables based on field values. The filtering is propagated to other tables using the embedded data model.
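As a rough picture of what propagation means, the sketch below filters one toy table and then restricts a second table to the records that still refer to it. It uses plain data frames rather than TKCat objects. The varEntrez columns and values are made up for illustration, and TKCat follows the foreign keys of the whole data model rather than this single link.

```r
# Toy parent table (identifiers taken from the ClinVar_entrezNames extract)
entrezNames <- data.frame(
   entrez = c(1509L, 3910L, 5296L),
   symbol = c("CTSD", "LAMA4", "PIK3R2"),
   stringsAsFactors = FALSE
)
# Toy child table referring to entrezNames through the entrez key
# (hypothetical content)
varEntrez <- data.frame(
   varId  = c(1L, 2L, 3L, 4L),
   entrez = c(1509L, 5296L, 5296L, 3910L)
)

# Step 1: filter the table on a field value
kept_genes <- entrezNames[entrezNames$symbol %in% "PIK3R2", ]

# Step 2: propagate the filter through the foreign key
kept_vars <- varEntrez[varEntrez$entrez %in% kept_genes$entrez, ]
kept_vars
```

Only the two variants attached to PIK3R2 remain in the child table; the records pointing to genes that were filtered out are dropped.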

In the example below, the file_clinvar object is filtered in order to focus on a few genes with pathogenic variants (the tables have been renamed using the set_names() function to improve the readability of the example). The object returned by filter() or slice() is a memoMDB: all the data are in memory.

filtered_clinvar <- file_clinvar %>% 
   set_names(sub("ClinVar_", "", names(.))) %>%
   filter(
      entrezNames = symbol %in% c("PIK3R2", "UGT1A8")
   ) %>% 
   slice(ReferenceClinVarAssertion=grep(
      "pathogen",
      .$ReferenceClinVarAssertion$clinicalSignificance,
      ignore.case=TRUE
   ))

Tables can be easily joined to get the diseases associated with the genes of interest.

gene_traits <- filtered_clinvar %>% 
   join_mdb_tables(
      "entrezNames", "varEntrez", "variants", "rcvaVariant",
      "ReferenceClinVarAssertion", "rcvaTraits", "traits"
   )
gene_traits$entrezNames %>%
   select(symbol, name, variants.type, variants.name, traitType, traits.name)
## # A tibble: 4 x 6
##   symbol name       variants.type   variants.name    traitType traits.name      
##   <chr>  <chr>      <chr>           <chr>            <chr>     <chr>            
## 1 PIK3R2 phosphoin… single nucleot… NM_005027.4(PIK… Disease   Megalencephaly-p…
## 2 PIK3R2 phosphoin… single nucleot… NM_005027.4(PIK… Disease   not provided     
## 3 PIK3R2 phosphoin… single nucleot… NM_005027.4(PIK… Disease   not provided     
## 4 UGT1A8 UDP glucu… Microsatellite  UGT1A1*28        Disease   Gilbert's syndro…

2.6 Merging with collections

2.6.1 Collections and collection members

Some databases refer to the same concepts and could be merged accordingly. However they often use different vocabularies.

For example, CHEMBL refers to biological entities (BE) in the CHEMBL_component_sequence table using mainly Uniprot peptide identifiers from different species.

file_chembl$CHEMBL_component_sequence %>% head()
## # A tibble: 6 x 5
##   component_id accession organism              db_source db_version
##          <int> <chr>     <chr>                 <chr>     <chr>     
## 1          259 P15260    Homo sapiens          Uniprot   2019_09   
## 2          327 Q99062    Homo sapiens          Uniprot   2019_09   
## 3          752 P35563    Rattus norvegicus     Uniprot   2019_09   
## 4          917 P07339    Homo sapiens          Uniprot   2019_09   
## 5         1807 Q54A96    Plasmodium falciparum Uniprot   2019_09   
## 6         2180 P67774    Bos taurus            Uniprot   2019_09

In contrast, ClinVar refers to BE in the ClinVar_entrezNames table using human Entrez gene identifiers.

file_clinvar$ClinVar_entrezNames %>% head()
## # A tibble: 6 x 3
##   entrez name                                             symbol
##    <int> <chr>                                            <chr> 
## 1   1509 cathepsin D                                      CTSD  
## 2   1903 sphingosine-1-phosphate receptor 3               S1PR3 
## 3   3300 DnaJ heat shock protein family (Hsp40) member B2 DNAJB2
## 4   3423 iduronate 2-sulfatase                            IDS   
## 5   3910 laminin subunit alpha 4                          LAMA4 
## 6   5296 phosphoinositide-3-kinase regulatory subunit 2   PIK3R2

Some tools exist to convert such BE identifiers from one scope to another (BED, mygene, biomaRt). TKCat provides a mechanism to document these scopes in order to allow automatic conversions from and to any of them. Those concepts are called Collections in TKCat and they must be formally defined before any of their members can be documented. Two collection definitions are provided within the TKCat package and others can be imported with the import_local_collection() function.

list_local_collections()
## # A tibble: 2 x 2
##   title     description                                  
##   <chr>     <chr>                                        
## 1 BE        Collection of biological entity (BE) concepts
## 2 Condition Collection of condition concepts

The way to describe the scope of a collection member is formally defined by a JSON schema (use get_local_collection() to get the JSON of a collection). Here are the definitions of the BE collection members provided by the CHEMBL_component_sequence and the ClinVar_entrezNames tables.

collection_members(file_chembl, "BE")
## # A tibble: 4 x 9
##   collection cid      resource   mid table        field   static value  type    
##   <chr>      <chr>    <chr>    <int> <chr>        <chr>   <lgl>  <chr>  <chr>   
## 1 BE         CHEMBL_… CHEMBL       1 CHEMBL_comp… be      TRUE   Pepti… <NA>    
## 2 BE         CHEMBL_… CHEMBL       1 CHEMBL_comp… identi… FALSE  acces… <NA>    
## 3 BE         CHEMBL_… CHEMBL       1 CHEMBL_comp… source  FALSE  db_so… <NA>    
## 4 BE         CHEMBL_… CHEMBL       1 CHEMBL_comp… organi… FALSE  organ… Scienti…
collection_members(file_clinvar, "BE")
## # A tibble: 4 x 9
##   collection cid       resource   mid table      field   static value   type    
##   <chr>      <chr>     <chr>    <int> <chr>      <chr>   <lgl>  <chr>   <chr>   
## 1 BE         ClinVar_… ClinVar      1 ClinVar_e… be      TRUE   Gene    <NA>    
## 2 BE         ClinVar_… ClinVar      1 ClinVar_e… identi… FALSE  entrez  <NA>    
## 3 BE         ClinVar_… ClinVar      1 ClinVar_e… organi… TRUE   Homo s… Scienti…
## 4 BE         ClinVar_… ClinVar      1 ClinVar_e… source  TRUE   Entrez… <NA>

The collection column indicates the collection to which the table refers. The cid column indicates the version of the collection definition, which should correspond to the $id of the JSON schema. The resource column indicates the name of the resource and the mid column an identifier which is unique for each member of a collection in each resource. The field column indicates each part of the scope of the collection. In the case of BE, 4 fields should be documented:

  • be: the type of BE (e.g. Gene or Peptide)
  • source: the source of the identifier (e.g. EntrezGene or Uniprot)
  • organism: the organism to which the identifier refers (e.g. Homo sapiens)
  • identifier: the identifier itself.

Each of these fields can be static or not. TRUE means that the value of this field is the same for all the records and is provided in the value column, whereas FALSE means that the value can be different for each record and is provided in the column whose name is given in the value column. The type column is only used for the organism field in the case of the BE collection and can take 2 values: “Scientific name” or “NCBI taxon identifier”. The definition of the pre-built BE collection members follows the terminology used in the BED package (Godard and Eyll 2018), but it can be adapted according to the solution chosen for converting BE identifiers from one scope to another.
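The difference between static and non-static fields can be illustrated with a small base R sketch that expands a member definition into one scope per record. The resolve_member_scope() helper is hypothetical and is not part of TKCat. The member definition is the one documented for HPO_diseases in the Reading HPO example, and the two data rows are toy values.

```r
# Hypothetical helper (not a TKCat function): build the scope of each
# record of a table from a collection member definition.
resolve_member_scope <- function(members, data) {
   scope <- lapply(seq_len(nrow(members)), function(i) {
      if (members$static[i]) {
         # Static field: the value column holds the value itself
         rep(members$value[i], nrow(data))
      } else {
         # Non-static field: the value column names the column to read
         as.character(data[[members$value[i]]])
      }
   })
   names(scope) <- members$field
   as.data.frame(scope, stringsAsFactors = FALSE)
}

# Condition member documented for the HPO_diseases table
hpo_diseases_member <- data.frame(
   field  = c("condition", "source", "identifier"),
   static = c(TRUE, FALSE, FALSE),
   value  = c("Disease", "db", "id"),
   stringsAsFactors = FALSE
)
# Toy records with the db and id columns
hpo_diseases <- data.frame(
   db = c("OMIM", "DECIPHER"),
   id = c("100050", "15"),
   stringsAsFactors = FALSE
)

scope <- resolve_member_scope(hpo_diseases_member, hpo_diseases)
scope
```

In the result, condition is "Disease" for every record because the field is static, while source and identifier are read record by record from the db and id columns.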

Setting up the definition of such a scope is done using the collection_members()<- function as shown in the Reading HPO example above.

2.6.2 Shared collections and merging

The aim of collections is to identify potential bridges between MDBs. The get_shared_collections() function is used to list all the collections shared by two MDBs.

get_shared_collections(filtered_clinvar, file_chembl)
## # A tibble: 3 x 5
##   collection mid.x table.x     mid.y table.y                  
##   <chr>      <int> <chr>       <int> <chr>                    
## 1 Condition      2 traits          1 CHEMBL_drug_indication   
## 2 Condition      1 traitCref       1 CHEMBL_drug_indication   
## 3 BE             1 entrezNames     1 CHEMBL_component_sequence

In this example, there are 3 different ways to merge the two MDBs filtered_clinvar and file_chembl:

  • Based on conditions provided respectively in the traits and in the CHEMBL_drug_indication tables
  • Based on conditions provided respectively in the traitCref and in the CHEMBL_drug_indication tables
  • Based on BE provided respectively in the entrezNames and in the CHEMBL_component_sequence tables

The code below shows how to merge these two resources based on BE information. To achieve this task, it relies on a function provided with TKCat along with the BE collection definition (to get the function: get_collection_mapper("BE")). This function uses the BED package, which must be installed with a connection to a BED database in order to run the code below.

sel_coll <- get_shared_collections(filtered_clinvar, file_chembl) %>% 
   filter(collection=="BE")
filtered_cv_chembl <- merge(
   x=filtered_clinvar,
   y=file_chembl,
   by=sel_coll
)

The returned object is a metaMDB gathering the original MDBs and a relational table between members of the same collection as defined by the by parameter.

Additional information about collections can be found in the vignette dedicated to collections.

2.6.3 Merging without collection

If the collection column of the by parameter is NA, then the relational table is built by merging identical columns in table.x and table.y (no conversion occurs). For example, file_hpo and file_clinvar MDBs could be merged according to conditions provided in the HPO_diseases and the ClinVar_traitCref tables respectively.

get_shared_collections(file_hpo, file_clinvar)
## # A tibble: 4 x 5
##   collection mid.x table.x      mid.y table.y          
##   <chr>      <int> <chr>        <int> <chr>            
## 1 Condition      1 HPO_hp           1 ClinVar_traitCref
## 2 Condition      1 HPO_hp           2 ClinVar_traits   
## 3 Condition      2 HPO_diseases     1 ClinVar_traitCref
## 4 Condition      2 HPO_diseases     2 ClinVar_traits

These conditions could be converted using a function provided with TKCat (get_collection_mapper("Condition")), which relies on the DODO package. The two tables can also be simply matched without applying any conversion (obviously losing the advantage of such a conversion).

sel_coll <- get_shared_collections(file_hpo, file_clinvar) %>% 
   filter(table.x=="HPO_diseases", table.y=="ClinVar_traitCref") %>% 
   mutate(collection=NA)
sel_coll
## # A tibble: 1 x 5
##   collection mid.x table.x      mid.y table.y          
##   <lgl>      <int> <chr>        <int> <chr>            
## 1 NA             2 HPO_diseases     1 ClinVar_traitCref
hpo_clinvar <- merge(file_hpo, file_clinvar, by=sel_coll)
plot(data_model(hpo_clinvar))
hpo_clinvar$HPO_diseases_ClinVar_traitCref %>% head()
## # A tibble: 6 x 2
##   db       id    
##   <chr>    <chr> 
## 1 DECIPHER 15    
## 2 DECIPHER 45    
## 3 DECIPHER 65    
## 4 OMIM     100050
## 5 OMIM     100650
## 6 OMIM     101800
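The relational table shown above can be pictured as a join on the columns that the two tables have in common. The base R sketch below reproduces that idea on toy data frames. It is an illustration under simplifying assumptions, not the code run by merge() on MDBs, and the label and t.id columns as well as the MedGen row are made up.

```r
# Toy extracts of the two tables (hypothetical content)
hpo_diseases <- data.frame(
   db    = c("DECIPHER", "OMIM", "OMIM"),
   id    = c("15", "100050", "999999"),
   label = c("disease A", "disease B", "disease C"),
   stringsAsFactors = FALSE
)
clinvar_traitCref <- data.frame(
   t.id = c(1L, 2L, 3L),
   db   = c("OMIM", "DECIPHER", "MedGen"),
   id   = c("100050", "15", "C0000000"),
   stringsAsFactors = FALSE
)

# Columns with identical names in both tables: "db" and "id"
shared <- intersect(names(hpo_diseases), names(clinvar_traitCref))

# Relational table: pairs of values found in both tables, without conversion
rel <- unique(merge(
   hpo_diseases[shared], clinvar_traitCref[shared], by = shared
))
rel
```

Only identifiers written identically in both tables are linked, which is the limitation mentioned above: equivalent conditions recorded under different sources are not matched without a conversion step.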

3 MDB catalogs as TKCat objects

3.1 Local TKCat

MDBs can be gathered in a TKCat (Tailored Knowledge Catalog) object.

k <- TKCat(file_hpo, file_clinvar)

Gathering MDBs in such a catalog facilitates their exploration and their preparation for potential integration. Several functions are available to achieve this goal.

list_MDBs(k)                     # List all the MDBs in a TKCat object
## # A tibble: 2 x 6
##   name   title        description            url         version maintainer     
##   <chr>  <chr>        <chr>                  <chr>       <chr>   <chr>          
## 1 HPO    Data extrac… This is a very small … http://hum… <NA>    <NA>           
## 2 ClinV… Data extrac… ClinVar is a freely a… https://ww… 0.9     Patrice Godard…
get_MDB(k, "HPO")                # Get a specific MDB from the catalog
## fileMDB HPO: Data extracted from the HPO database
##    - 9 tables with 25 fields
## 
## Collection members: 
##    - 2 Condition members
## 
## This is a very small subset of the HPO! Visit the reference URL for more information.
## (http://human-phenotype-ontology.github.io/)
search_MDB_tables(k, "disease")  # Search tables about "disease"
## # A tibble: 3 x 3
##   resource name                comment                 
##   <chr>    <chr>               <chr>                   
## 1 HPO      HPO_diseases        Diseases                
## 2 HPO      HPO_diseaseHP       HP presented by diseases
## 3 HPO      HPO_diseaseSynonyms Disease synonyms
search_MDB_fields(k, "disease")  # Search a field about "disease"
## # A tibble: 8 x 7
##   resource table          name    type    nullable unique comment               
##   <chr>    <chr>          <chr>   <chr>   <lgl>    <lgl>  <chr>                 
## 1 HPO      HPO_diseases   db      charac… FALSE    FALSE  Disease database      
## 2 HPO      HPO_diseases   id      charac… FALSE    FALSE  Disease ID            
## 3 HPO      HPO_diseases   label   charac… FALSE    FALSE  Disease lable (prefer…
## 4 HPO      HPO_diseaseHP  db      charac… FALSE    FALSE  Disease database      
## 5 HPO      HPO_diseaseHP  id      charac… FALSE    FALSE  Disease ID            
## 6 HPO      HPO_diseaseSy… db      charac… FALSE    FALSE  Disease database      
## 7 HPO      HPO_diseaseSy… id      charac… FALSE    FALSE  Disease ID            
## 8 HPO      HPO_diseaseSy… synonym charac… FALSE    FALSE  Disease synonym
collection_members(k)            # Get collection members of the different MDBs
## # A tibble: 5 x 3
##   resource collection table              
##   <chr>    <chr>      <chr>              
## 1 HPO      Condition  HPO_hp             
## 2 HPO      Condition  HPO_diseases       
## 3 ClinVar  BE         ClinVar_entrezNames
## 4 ClinVar  Condition  ClinVar_traitCref  
## 5 ClinVar  Condition  ClinVar_traits
c(k, TKCat(file_chembl))         # Merge 2 TKCat objects
## TKCat gathering 3 MDB objects

3.2 chTKCat

A chTKCat object is a catalog of MDBs, like the TKCat object described above, but relying on a ClickHouse database. It therefore requires the installation and the initialization of such a database. Two additional vignettes describe how to use it and how to operate it: the chTKCat user guide and the chTKCat operational manual.

3.3 A shiny app for exploring MDBs

The function explore_MDBs(k) launches a shiny interface to explore MDBs in a TKCat or a chTKCat object. This exploration interface can be easily deployed using an app.R file with content similar to the one below.

library(TKCat)
explore_MDBs(k, download=TRUE)

In this interface the users can explore the resources available in the catalog. They can browse the data model of each of them with some sample data. They can also search for information provided in resources, tables or fields. Finally, if the parameter download is set to TRUE, the users will also be able to download the data: either each table individually or an archive of the whole MDB.

4 Acknowledgments

This work was entirely supported by UCB Pharma (Early Solutions department).

5 References

Godard, Patrice, and Jonathan van Eyll. 2018. “BED: A Biological Entity Dictionary Based on a Graph Data Model.” F1000Research 7: 195. https://doi.org/10.12688/f1000research.13925.3.
Köhler, Sebastian, Leigh Carmody, Nicole Vasilevsky, Julius O B Jacobsen, Daniel Danis, Jean-Philippe Gourdine, Michael Gargano, et al. 2019. “Expansion of the Human Phenotype Ontology (HPO) Knowledge Base and Resources.” Nucleic Acids Research 47 (D1): D1018–27. https://doi.org/10.1093/nar/gky1105.
Landrum, Melissa J., Jennifer M. Lee, Mark Benson, Garth R. Brown, Chen Chao, Shanmuga Chitipiralla, Baoshan Gu, et al. 2018. “ClinVar: Improving Access to Variant Interpretations and Supporting Evidence.” Nucleic Acids Research 46 (D1): D1062–67. https://doi.org/10.1093/nar/gkx1153.
Mendez, David, Anna Gaulton, A Patrícia Bento, Jon Chambers, Marleen De Veij, Eloy Félix, María Paula Magariños, et al. 2019. “ChEMBL: Towards Direct Deposition of Bioassay Data.” Nucleic Acids Research 47 (D1): D930–40. https://doi.org/10.1093/nar/gky1075.