1 Introduction

The aim of TKCat (Tailored Knowledge Catalog) is to facilitate the management of data from knowledge resources that are frequently used alone or together in research environments. In TKCat, knowledge resources are manipulated as modeled database (MDB) objects. These objects provide access to the data tables along with a general description of the resource and a detailed data model, generated with ReDaMoR, documenting the tables, their fields and their relationships. These MDBs are then gathered in catalogs that can be easily explored and shared. TKCat provides tools to easily subset, filter and combine MDBs and create new catalogs suited for specific needs.

The TKCat R package is licensed under GPL-3.

Some MDBs refer to the same concepts and could be merged accordingly. However, they often use different vocabularies or scopes. Collections are used to identify such concepts and to formally document the scope used by the different members of these collections. Thanks to this formal description, tools can automatically combine MDBs referring to the same collection but using different scopes.

This vignette describes how to create TKCat Collections, document collection members and create functions to support the merging of MDBs. The advantages and uses of Collections are presented in the general user guide.

2 Creating a collection

A collection is defined by a JSON document. This document should fulfill the requirements defined by the Collection-Schema.json. Two collections are available by default in the TKCat package.

list_local_collections()
## # A tibble: 2 x 2
##   title     description                                  
##   <chr>     <chr>                                        
## 1 BE        Collection of biological entity (BE) concepts
## 2 Condition Collection of condition concepts

Here is how the BE collection is defined.

get_local_collection("BE") %>%
   paste('```json', ., '```', sep="\n") %>% cat()
{
   "$schema": "TKCat_collections_1.0",
   "$id":"TKCat_BE_collection_1.0",
    "title": "BE collection",
    "type": "object",
    "description": "Collection of biological entity (BE) concepts",
    "properties": {
      "$schema": {"enum": ["TKCat_BE_collection_1.0"]},
      "$id": {"type": "string"},
        "collection": {"enum":["BE"]},
        "resource": {"type": "string"},
        "tables": {
            "type": "array",
            "minItems": 1,
            "items":{
                "type": "object",
                "properties":{
                    "name": {"type": "string"},
                    "fields": {
                        "type": "object",
                        "properties": {
                            "be": {
                                "type": "object",
                                "properties": {
                                    "static": {"type": "boolean"},
                                    "value": {"type": "string"}
                                },
                                "required": ["static", "value"],
                                "additionalProperties": false
                            },
                            "source": {
                                "type": "object",
                                "properties": {
                                    "static": {"type": "boolean"},
                                    "value": {"type": "string"}
                                },
                                "required": ["static", "value"],
                                "additionalProperties": false
                            },
                            "organism": {
                                "type": "object",
                                "properties": {
                                    "static": {"type": "boolean"},
                                    "value": {"type": "string"},
                                    "type": {"enum": ["Scientific name", "NCBI taxon identifier"]}
                                },
                                "required": ["static", "value", "type"],
                                "additionalProperties": false
                            },
                            "identifier": {
                                "type": "object",
                                "properties": {
                                    "static": {"type": "boolean"},
                                    "value": {"type": "string"}
                                },
                                "required": ["static", "value"],
                                "additionalProperties": false
                            }
                        },
                        "required": ["be", "source", "identifier"],
                        "additionalProperties": false
                    }
                },
                "required": ["name", "fields"],
                "additionalProperties": false
            }
        }
    },
    "required": ["$schema", "$id", "collection", "resource", "tables"],
    "additionalProperties": false
}

A collection should refer to the "TKCat_collections_1.0" $schema. It should then have the following properties:

  • $id: the identifier of the collection

  • title: the title of the collection

  • type: always object

  • description: a short description of the collection

  • properties: the properties that should be provided by collection members. In this case:

    • $schema: should be the $id of the collection

    • $id: the identifier of the collection member: a string

    • collection: should be “BE”

    • resource: the name of the resource having collection members: a string

    • tables: an array of tables corresponding to collection members. Each item is a table with the following features:

      • name: the name of the table

      • fields: the fields documenting the scope of the table (be, source and identifier are required; organism is optional)

        • be: if static is true, then value corresponds to the be value valid for all the records. If not, value corresponds to the table column with the be value for each record.
        • source: if static is true, then value corresponds to the source value valid for all the records. If not, value corresponds to the table column with the source value for each record.
        • organism: if static is true, then value corresponds to the organism value valid for all the records. If not, value corresponds to the table column with the organism value for each record. type indicates how organisms are identified: "Scientific name" or "NCBI taxon identifier".
        • identifier: if static is true, then value corresponds to the identifier value valid for all the records. If not, value corresponds to the table column with the identifier value for each record.
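The static/value logic described above can be sketched in a few lines of base R. This is only an illustration: resolve_member_fields() is a hypothetical helper that is not part of TKCat, and the toy table, its column names and its values are made up for the example.

```r
## Hypothetical helper (NOT a TKCat function): turn the "fields" part of a
## collection member definition into explicit columns, one per field.
resolve_member_fields <- function(tbl, fields) {
   resolved <- lapply(fields, function(f) {
      if (isTRUE(f$static)) {
         ## static: the same value is valid for all the records
         rep(f$value, nrow(tbl))
      } else {
         ## not static: value is the name of the column holding the values
         tbl[[f$value]]
      }
   })
   as.data.frame(resolved, stringsAsFactors = FALSE)
}

## Toy table and member definition (made-up names and values)
toy <- data.frame(
   accession = c("ID1", "ID2"),
   db_source = c("SRC_A", "SRC_A"),
   stringsAsFactors = FALSE
)
fields <- list(
   be = list(static = TRUE, value = "Peptide"),
   source = list(static = FALSE, value = "db_source"),
   identifier = list(static = FALSE, value = "accession")
)
scope <- resolve_member_fields(toy, fields)
print(scope)
```

The resulting data frame has one column per documented field (be, source, identifier), which is the kind of input a mapper function works on.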

3 Identifying collection members

Collection members of an MDB can be identified by providing a table, as shown in the general user guide, or by writing a JSON file like the following one, which corresponds to the BE members of the CHEMBL MDB.

system.file(
   "examples/CHEMBL/model/Collections/BE-CHEMBL_BE_1.0.json",
   package="TKCat"
) %>% 
   readLines() %>% paste(collapse="\n")
{
  "$schema": "TKCat_BE_collection_1.0",
  "$id": "CHEMBL_BE_1.0",
  "collection": "BE",
  "resource": "CHEMBL",
  "tables": [
    {
      "name": "CHEMBL_component_sequence",
      "fields": {
        "be": {
          "static": true,
          "value": "Peptide"
        },
        "identifier": {
          "static": false,
          "value": "accession"
        },
        "source": {
          "static": false,
          "value": "db_source"
        },
        "organism": {
          "static": false,
          "value": "organism",
          "type": "Scientific name"
        }
      }
    }
  ]
}

The identification of collection members should fulfill the requirements defined by the collection JSON document, and therefore pass the following validation.

jsonvalidate::json_validate(
   json=system.file(
      "examples/CHEMBL/model/Collections/BE-CHEMBL_BE_1.0.json",
      package="TKCat"
   ),
   schema=get_local_collection("BE")
)
## [1] TRUE

This validation is done automatically when reading a fileMDB object or when setting collection members with the collection_members() function.

4 Collection mapper functions

The merge.MDB() and map_collection_members() functions rely on mapper functions to map members of the same collection. When recorded with the import_collection_mapper() function, these functions are identified automatically by TKCat. Otherwise, or according to user needs, they can be provided using the funs (for merge.MDB()) or the fun (for map_collection_members()) parameter. Two mappers are pre-recorded in TKCat, one for the BE collection and one for the Condition collection. They can be retrieved with the get_collection_mapper() function.

get_collection_mapper("BE")
function (x, y, orthologs = FALSE, restricted = FALSE, ...) 
{
    if (!requireNamespace("BED")) {
        stop("The BED package is required")
    }
    if (!BED::checkBedConn()) {
        stop("You need to connect to a BED database using", " the BED::connectToBed() function")
    }
    if (!"organism" %in% colnames(x)) {
        d <- x
        scopes <- dplyr::distinct(d, be, source)
        nd <- c()
        for (i in 1:nrow(scopes)) {
            be <- scopes$be[i]
            source <- scopes$source[i]
            toadd <- d %>% dplyr::filter(be == be, source == 
                source)
            organism <- BED::guessIdScope(toadd$identifier, be = be, 
                source = source, tcLim = Inf) %>% attr("details") %>% 
                filter(be == !!be & source == !!source) %>% pull(organism) %>% 
                unique()
            toadd <- merge(toadd, tibble(organism = organism))
            nd <- bind_rows(nd, toadd)
        }
        x <- nd %>% mutate(organism_type = "Scientific name")
    }
    if (!"organism" %in% colnames(y)) {
        d <- y
        scopes <- dplyr::distinct(d, be, source)
        nd <- c()
        for (i in 1:nrow(scopes)) {
            be <- scopes$be[i]
            source <- scopes$source[i]
            toadd <- d %>% dplyr::filter(be == be, source == 
                source)
            organism <- BED::guessIdScope(toadd$identifier, be = be, 
                source = source, tcLim = Inf) %>% attr("details") %>% 
                filter(be == !!be & source == !!source) %>% pull(organism) %>% 
                unique()
            toadd <- merge(toadd, tibble(organism = organism))
            nd <- bind_rows(nd, toadd)
        }
        y <- nd %>% mutate(organism_type = "Scientific name")
    }
    xscopes <- dplyr::distinct(x, be, source, organism, organism_type)
    yscopes <- dplyr::distinct(y, be, source, organism, organism_type)
    toRet <- NULL
    for (i in 1:nrow(xscopes)) {
        xscope <- xscopes[i, ]
        if (any(apply(xscope, 2, is.na))) {
            (next)()
        }
        xi <- dplyr::right_join(x, xscope, by = c("be", "source", 
            "organism", "organism_type"))
        xorg <- ifelse(xscope$organism_type == "NCBI taxon identifier", 
            BED::getOrgNames(xscope$organism) %>% dplyr::filter(nameClass == 
                "scientific name") %>% dplyr::pull(name), xscope$organism)
        for (j in 1:nrow(yscopes)) {
            yscope <- yscopes[j, ]
            if (any(apply(yscope, 2, is.na))) {
                (next)()
            }
            yi <- dplyr::right_join(y, yscope, by = c("be", "source", 
                "organism", "organism_type"))
            yorg <- ifelse(yscope$organism_type == "NCBI taxon identifier", 
                BED::getOrgNames(yscope$organism) %>% dplyr::filter(nameClass == 
                  "scientific name") %>% dplyr::pull(name), yscope$organism)
            if (xorg == yorg || orthologs) {
                xy <- BED::convBeIds(ids = xi$identifier, from = xscope$be, 
                  from.source = xscope$source, from.org = xorg, 
                  to = yscope$be, to.source = yscope$source, 
                  to.org = yorg, restricted = restricted) %>% 
                  dplyr::as_tibble() %>% dplyr::select(from, 
                  to)
                if (restricted) {
                  xy <- dplyr::bind_rows(xy, BED::convBeIds(ids = yi$identifier, 
                    from = yscope$be, from.source = yscope$source, 
                    from.org = yorg, to = xscope$be, to.source = xscope$source, 
                    to.org = xorg, restricted = restricted) %>% 
                    dplyr::as_tibble() %>% dplyr::select(to = from, 
                    from = to))
                }
                xy <- xy %>% dplyr::rename(identifier_x = "from", 
                  identifier_y = "to") %>% dplyr::mutate(be_x = xscope$be, 
                  source_x = xscope$source, organism_x = xscope$organism, 
                  be_y = yscope$be, source_y = yscope$source, 
                  organism_y = yscope$organism)
                toRet <- dplyr::bind_rows(toRet, xy)
            }
        }
    }
    toRet <- dplyr::distinct(toRet)
    return(toRet)
}

A mapper function must have at least an x and a y parameter. Each of them should be a data.frame with all the field values corresponding to the fields defined in the collection. Additional parameters can be defined and will be forwarded through the ... argument. The function should return a data frame with all the field values, in columns whose names carry the “_x” or “_y” suffix accordingly.
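As a minimal sketch of this contract, the following base R function maps the members of an imaginary collection that has only two fields, source and identifier. Everything here is an assumption made for illustration: toy_mapper(), its ignore_case parameter and the toy data are hypothetical and do not come from TKCat.

```r
## Hypothetical mapper for an imaginary two-field collection
## ("source" and "identifier"); illustration only, NOT a TKCat function.
toy_mapper <- function(x, y, ignore_case = FALSE, ...) {
   stopifnot(is.data.frame(x), is.data.frame(y))
   fields <- c("source", "identifier")
   xk <- unique(x[, fields, drop = FALSE])
   yk <- unique(y[, fields, drop = FALSE])
   ## Comparison key: the identifier, optionally case-insensitive
   xk$key <- if (ignore_case) toupper(xk$identifier) else xk$identifier
   yk$key <- if (ignore_case) toupper(yk$identifier) else yk$identifier
   ## merge() adds the suffixes to the columns shared by both tables
   toRet <- merge(xk, yk, by = "key", suffixes = c("_x", "_y"))
   toRet$key <- NULL
   unique(toRet)
}

x <- data.frame(
   source = "A", identifier = c("id1", "id2"), stringsAsFactors = FALSE
)
y <- data.frame(
   source = "B", identifier = c("ID2", "id3"), stringsAsFactors = FALSE
)
strict <- toy_mapper(x, y)                       # identifiers differ by case
relaxed <- toy_mapper(x, y, ignore_case = TRUE)  # id2 is mapped to ID2
print(relaxed)
```

The returned columns are source_x, identifier_x, source_y and identifier_y. A function of this shape is what the fun parameter of map_collection_members() or the funs parameter of merge.MDB() is described as accepting; the extra ignore_case argument stands in for the kind of additional parameter that would be forwarded to the mapper.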

5 Acknowledgments

This work was entirely supported by UCB Pharma (Early Solutions department).