---
title: "Creating a new connector."
author: "Pierrick Roger"
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('biodb')`"
vignette: |
  %\VignetteIndexEntry{Creating a new connector class for accessing a database.}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
abstract: |
  This vignette shows how to create a new connector class and the corresponding new entry class for accessing a remote database.
output:
  BiocStyle::html_document:
    toc: yes
    toc_depth: 4
    toc_float:
      collapsed: false
  BiocStyle::pdf_document: default
bibliography: references.bib
---

```{r, echo=FALSE}
source(system.file('vignettes_inc.R', package='biodb'))
```

# Introduction

*biodb* is a framework designed to help you implement new connectors for
databases.
To illustrate this, we will show you a practical example where we create a
connector for the [ChEBI](https://www.ebi.ac.uk/chebi/) database.
In this example, we present a small implementation of a *ChEBI*
connector, and show how to declare it to your *biodb* instance.

A more complete and functional connector for accessing the *ChEBI* database is
implemented in the [biodbChebi](https://github.com/pkrog/biodbChebi) package.
See \@ref(tab:biodbChebiCapabilities) for a list of the capabilities of this
official *biodb* connector.

Title / method name | Description
------------------- | -------------------------------------------
Fields parsing      | Formula, charge, InChI, InChIKey, molecular mass, monoisotopic mass, KEGG id, entity stars, SMILES.
getEntryPageUrl()   | Returns the URL of the website page of an entry.
getEntryImageUrl()  | Returns the URL to the molecule image of an entry.
wsWsdl()            | Returns the WSDL definition (i.e.: list of available web services and their parameters).
wsGetLiteEntity()   | Runs the getLiteEntity web service that returns database entries with their contents.
convIdsToChebiIds() | Converts a list of IDs (InChI, InChI Keys, CAS, ...) into a list of ChEBI IDs.
convInchiToChebi()  | Converts a list of InChI or InChI KEYs into a list of ChEBI IDs.
convCasToChebi()    | Converts a list of CAS IDs into a list of ChEBI IDs.
searchForEntries()  | Searches for entries by mass and/or by name.
: (\#tab:biodbChebiCapabilities) Capabilities of the *biodbChebi* extension package.

# Generating a new extension package

When creating a new extension package, *biodb* can help you generate all the
necessary files.

A call to `genNewExtPkg()` will generate the skeletons for the *biodb*
connector class and the *biodb* entry class, along with the testthat files, the
DESCRIPTION file, etc.
A simplified call might look like this:
```{r}
biodb::genNewExtPkg(path='biodbChebiEx', dbName='chebi.ex', connType='compound',
                    dbTitle='ChEBI connector example', entryType='xml', remote=TRUE)
```
See \@ref(tab:generatorParameters) for a brief description of the parameters.
Other parameters exist for the author's email, the author's name, for
generating a `Makefile`, or configuring for writing C++ code with `Rcpp`.

Parameter | Description
--------- | --------------------------------
path      | The path to the package folder to create.
dbName    | The name of the connector to create. 
dbTitle   | A short description of the database.
connType  | The type of connector.
entryType | The type of the entry.
remote    | Must be set to `TRUE` if a connection to a web server is needed.
: (\#tab:generatorParameters) A brief description of some parameters of `biodb::genNewExtPkg()`.

The files generated by the `genNewExtPkg()` function are the following ones:
```{r}
list.files('biodbChebiEx', all.files=TRUE, recursive=TRUE)
```
Inside the `biodb_ext.yml` file are stored the values of the parameters used
with `biodb::genNewExtPkg()`.
This is in case you want to upgrade some of the generated files (`.gitignore`,
`.travis.yml`, `Makefile`, etc.) with newer versions from the *biodb* package.
You would then only need to call `biodb::upgradeExtPkg(path='biodbChebiEx')`
and the `biodb_ext.yml` file would be read for parameter values.

The `inst/definitions.yml` file defines the new connector; we will fill in some
values inside it.
Then we need to write implementations for the methods in the connector class
`R/ChebiExConn.R`.
On the other hand, `R/ChebiExEntry.R`, the entry class, needs no modification
for our basic usage.

The test files in `tests/testthat` will be executed when running `R CMD check`,
but they need to be edited first.
Generic tests need to be enabled inside `tests/testthat/test_100_generic.R`.
The files `tests/testthat/test_050_fcts.R` and
`tests/testthat/test_200_example.R` contain only examples, so they need to
be modified or removed.

The test files in `tests/long` will not be executed when running `R CMD check`.
They can be run manually after installing the package locally, by calling
`R -e "testthat::test_dir('tests/long')"`.

A skeleton vignette has also been generated (`vignettes/intro.Rmd`), and should
be completed with specific examples for this package.

# Editing the generated skeleton

Starting from the skeleton files generated by `genNewExtPkg()`, we now need to
fill in the blanks.

The first file to take care of is `inst/definitions.yml`, which contains the
definition of the new connector.

Then we will look quickly at `R/ChebiExEntry.R`, which is rather empty in our
case, and `R/ChebiExConn.R`, which requires much more attention, having several
methods that need implementation.

The naming of the classes inside the R files is important.
They must be named `ChebiExEntry` and `ChebiExConn`, in order to match the name
defined inside `inst/definitions.yml` (`chebi.ex`).
Fortunately the generator has taken care of this; the only thing required of
you is to leave the names unchanged.

## Editing the YAML definition of the new connector

The content of the generated YAML file `inst/definitions.yml` is as follows:
```{r, eval=FALSE, highlight=FALSE, code=readLines('biodbChebiEx/inst/definitions.yml')}
```
It is mainly filled with examples.

This YAML file contains two main parts: `databases` and `fields`.
The `databases` part is where you list the new connectors you've created, and
the `fields` part is where you define the new entry fields your new connectors
need.

### Fields definition

We just have one new field to define: `chebi.ex.id`.
This is the accession field for our new connector.
All connector accession fields are in the form `<connector_class_id>.id`.
This accession field is mainly used inside other databases, when they make
references to other databases.
The field `accession`, which is used in all entries of *biodb* connectors,
contains the same value as the connector accession field (`chebi.ex.id` in our
case) and is preferable when accessing an entry.
The definition of the new field is quite simple. See \@ref(tab:fielddecl) for
explanations of the different parameters.

Parameter            | Description
-------------------- | --------------------------------
`description`        | A free description of your field.
`type`               | The type of the field. Here we declare that this is an accession (identifier) field: `id`.
`card`               | The cardinality of the field: `one` if field accepts only one value, or `many` if multiple values can be stored inside the field.
`forbids.duplicates` | If `TRUE` then duplicates are forbidden. This presupposes that the field can store multiple values (i.e.: cardinality is set to `many`).
`case.insensitive`   | If `TRUE` then values will be compared in case insensitive mode. This is mostly useful when looking for duplicates.

: (\#tab:fielddecl) Field's parameters. Description of the parameters used when declaring a new entry field.

### Database definition

The main part is the declaration of the new connector.
This is done in the `databases` section, under the key `chebi.ex`, which is the
database identifier.
See \@ref(tab:conndecl) for explanations of the different parameters.

Parameter                | Description
------------------------ | --------------------------------
`name`                   | The full name of your new connector.
`urls`                   | A list (key/values) of URLs of the remote database. The common URLs to define are `base.url` to access pages of the database website, and `ws.url` for web service URLs. Those URLs are just prefixes and are used inside the connector class for building real URLs. You can define as many URLs as the remote database requires, like a second base URL (`base2.url`) or a second web service URL (`ws2.url`), or any other URL with the key name you want.
`xml.ns`                 | This parameter defines namespaces for XML documents returned by the remote database. This is thus only useful for databases that return data in XML format.
`scheduler.n`            | The maximum number of queries to send to the remote database, each T (stored as `scheduler.t`) seconds.
`scheduler.t`            | The time (in seconds) during which a maximum of N (stored as `scheduler.n`) queries is allowed.
`entry.content.type`     | The type of content sent by the database for an entry. Here we have specified `xml`. Allowed values are: `html`, `sdf`, `txt`, `xml`, `csv`, `tsv`, `json`, `list`. This is mainly used to add an extension to the file saved inside *biodb* cache.
`entry.content.encoding` | The text encoding used inside the entry's content by the database.
`parsing.expr`           | This is the most important part of the declaration. It lists the different expressions to use in order to parse the values of the entry fields. The format is a key/value list, the key being the *biodb* field name, and the value the expression to run. Since the entry content type is XML, we have to use XPath expressions here. See this [XPath Tutorial](https://www.w3schools.com/xml/xpath_intro.asp), for instance, to get an introduction to XPath. Note that with XPath expressions we can define multiple expressions for one field, as for the `formula` field. If the first expression fails, then the next expressions will be tried in order.
`searchable.fields`      | A list of *biodb* entry fields that are searchable when calling a search function like `searchCompound()`.
: (\#tab:conndecl) Connector declaration's parameters. Description of the parameters used when declaring a new connector. 
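
To make the meaning of `scheduler.n` and `scheduler.t` more concrete, here is a
minimal sketch in base R of a rule allowing at most N queries every T seconds.
It only illustrates the idea and is not the scheduler implemented inside
*biodb*; the function `secondsToWait()` and its arguments are hypothetical.

```r
# Hypothetical helper: number of seconds to wait before a new query may be
# sent, given the times (in seconds) at which previous queries were sent.
secondsToWait <- function(sent, now, n, t) {

    # Keep only the queries sent during the last t seconds
    recent <- sent[sent > now - t]

    # Fewer than n queries in the window: no need to wait
    if (length(recent) < n)
        return(0)

    # Otherwise wait until the n-th most recent query leaves the window
    nth <- sort(recent, decreasing=TRUE)[n]
    nth + t - now
}

# Three queries allowed per second, three already sent: we must wait
secondsToWait(sent=c(0.0, 0.2, 0.4), now=0.5, n=3, t=1)

# Only two queries sent so far: the next one can go immediately
secondsToWait(sent=c(0.0, 0.2), now=0.5, n=3, t=1)
```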

### Final version of the YAML file

After setting some parsing expressions, the URLs and the searchable fields, we
get a complete definition file, that you can find at:
```{r}
defFile <- system.file("extdata", "chebi_ex.yml", package='biodb')
```

Its content is as follows:
```{r, eval=FALSE, highlight=FALSE, code=readLines(system.file("extdata", "chebi_ex.yml", package='biodb'))}
```

## The entry class

The entry class represents an entry from the database.
Each instance of an entry contains the values parsed from the database
downloaded content.

The entry class of our example extension package has been generated inside
`R/ChebiExEntry.R`.
Here is its content:
```{r, eval=FALSE, highlight=TRUE, code=readLines('biodbChebiEx/R/ChebiExEntry.R')}
```

The class inherits from `BiodbXmlEntry` since we have set the `entryType`
parameter to `"xml"`.
An entry class must inherit from the `BiodbEntry` class and define some
methods.
To simplify this step, several generic entry classes have been defined in
*biodb* (see \@ref(tab:entryClasses)), depending on the type of content
downloaded from the database.
To use one of these classes for your entry class, you only have to make your
class inherit from the desired generic class.

Entry class      | Content type handled
---------------- | --------------------------------
`BiodbCsvEntry`  | CSV file.
`BiodbHtmlEntry` | HTML, the parsing will be done using XPath expressions.
`BiodbJsonEntry` | JSON.
`BiodbListEntry` | R list.
`BiodbSdfEntry`  | SDF file (chemical data file format).
`BiodbTxtEntry`  | Text file, the parsing will be done using regular expressions.
`BiodbXmlEntry`  | XML file, the parsing will be done using XPath expressions.
: (\#tab:entryClasses) Provided abstract entry classes. These are the entry classes already defined inside *biodb* package that facilitates the parsing of the corresponding content type.

Two methods are defined that can be used to enhance our implementation.
The method `doCheckContent()` can be used to further check the parsed
content of an entry, for instance for some incoherence between fields.
The method `doParseFieldsStep2()` allows you to run some custom code for complex
parsing of the entry's content.
This method is run after `doParseFieldsStep1()`, which is defined inside the
mother class (here `BiodbXmlEntry`) and executes the parsing expression defined
inside `inst/definitions.yml`.

Note: *biodb* uses [R6](https://adv-r.hadley.nz/r6.html) as
OOP (Object Oriented Programming) model.
Please see vignette
```{r, echo=FALSE, results='asis'}
make_vignette_ref('details')
```
, for more explanations.

## The connector class

The generator has generated the full class, and thus has taken care of the
inheritance part, as well as the declaration of the required methods.
See \@ref(tab:chebiExMethods) for a description of these methods.
What is left to us is the implementation of those methods.

Here is the generated skeleton:
```{r, eval=FALSE, highlight=TRUE, code=readLines('biodbChebiEx/R/ChebiExConn.R')}
```

### Inheritance

The connector class is responsible for the connection to the database.
In our case, the database is a compound database.

### Methods to implement

Method                        | Description
----------------------------- | --------------------------------
`doGetEntryPageUrl()`         | This method returns the official URL of the entry page on the database website, for each accession number passed. The return type is thus a list. If no entry pages are available for the database, the method must return a list of `NULL` values, the same length as the input vector.
`doGetEntryImageUrl()`        | This method returns the official URL of the entry picture on the database website, for each accession number passed. The picture returned must be a visual representation of the entry (a molecule 3D model, a mass spectrum, ...). The return type is thus a list. If no entry pictures are available for the database, the method must return a list of `NULL` values, the same length as the input vector.
`doGetEntryContentRequest()`  | This method is called by `getEntryContentRequest()`, and must return a list of URLs used to retrieve entry contents. If `concatenate` parameter is `FALSE`, the list returned must be the same length as the vector `id` and each URL must point to one entry content only. If `concatenate` parameter is `TRUE`, then it is permitted (but not compulsory) to return URLs that get more than one entry at a time.
`doGetEntryIds()`             | This method, called by `getEntryIds()`, should return the full list of accession numbers of the entries contained in the database, or a subset if `max.results` is set. This method is used for testing, in order to get a sample of existing entries, but may also be useful for users when developing.
`doSearchForEntries()`        | This method implements the search of entries by filtering on some field values. For our example, we have kept it simple by implementing only the search by name (field `"name"`), because a full implementation with mass search would require much more code with complex calls to *ChEBI* API. You can however see a real implementation inside [biodbChebi](https://github.com/pkrog/biodbChebi), the package that implements the *ChEBI* connector.
: (\#tab:chebiExMethods) Methods to implement inside the chebi.ex connector.

See the help inside R about `BiodbConn` for details
about the parameters of those functions.
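
The role of the `concatenate` parameter of `doGetEntryContentRequest()` can be
sketched in base R as follows.
This is only an illustration of the contract described above: the function
`entryContentUrls()`, the `ws.url` value and the URL layout are hypothetical,
and a real connector would read its URL prefix from the YAML definition and
build the URLs with *biodb* classes.

```r
# Hypothetical web service prefix (made up for this sketch)
ws.url <- 'https://example.org/ws/'

# Hypothetical stand-in for the logic of doGetEntryContentRequest()
entryContentUrls <- function(id, concatenate=TRUE) {

    # When allowed, a single URL may retrieve several entries at once
    if (concatenate)
        return(paste0(ws.url, 'entries?ids=', paste(id, collapse=',')))

    # Otherwise return exactly one URL per accession number, in order
    paste0(ws.url, 'entry?id=', id)
}

entryContentUrls(c('1018', '1390'), concatenate=FALSE)
entryContentUrls(c('1018', '1390'), concatenate=TRUE)
```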

### Remote connection methods

The remote methods serve three different goals.
First, to build URLs that access the web site: the URL of an entry page
(`doGetEntryPageUrl()`) or the URL of an entry picture
(`doGetEntryImageUrl()`), such as a molecule representation.
Second, to get a list of database entry identifiers (`doGetEntryIds()`).
Third, to get the content of an entry (`doGetEntryContentRequest()`).

In our implementations of `doGetEntryPageUrl()`, `doGetEntryImageUrl()` and
`doGetEntryContentRequest()` (see below), you may notice the use of the
`getPropValSlot()` method to get some base URLs (`"base.url"`, `"ws.url"`).
These values are defined inside the connector YAML definition file that we
detailed above.
Also, in those methods, we use the `BiodbUrl` class to build the URLs.
`BiodbUrl` handles the building of the URL parameters, as well as the encoding
of special characters.
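
To give an idea of what building a URL from a prefix involves, here is a small
base R sketch that appends a path and encoded parameters to a prefix.
It is not the `BiodbUrl` class, only an illustration of the same steps; the
function `buildUrl()` and the example URL are hypothetical.

```r
# Hypothetical helper: build a URL from a prefix, a path and named parameters
buildUrl <- function(prefix, path, params=list()) {

    # Join prefix and path with exactly one slash
    url <- paste0(sub('/+$', '', prefix), '/', path)

    if (length(params) > 0) {
        # Encode special characters (spaces, '&', ...) inside the values
        values <- vapply(params, function(v)
            utils::URLencode(as.character(v), reserved=TRUE), character(1))
        query <- paste(names(params), values, sep='=', collapse='&')
        url <- paste0(url, '?', query)
    }

    url
}

buildUrl('https://example.org/db/', 'search',
         list(name='amino acid', max=10))
```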

### Method for searching for entries

The implemented method (`doSearchForEntries()`) is a generic method used to
search for entries inside the database by name, mass, or any other field.
For our example we have decided to implement only the search by name in order
to keep the code as simple and short as possible.
To see a full implementation of this method, look at the official *biodb*
*ChEBI* connector at [biodbChebi](https://github.com/pkrog/biodbChebi).
Inside the method's code you will see that the implementation of the call to
the *ChEBI* web service API has been left to the dedicated method
`wsGetLiteEntity()`.

### Prototype to respect for web service methods

In *biodb* official implementations of remote connectors, the implementations
of calls to web services are done in separate dedicated methods having in
common some principles.

These principles are important, because they ensure uniformity between
*biodb* extension packages, allowing users to identify immediately a web
service method and recognize the *biodb* generic parameters inside it.

Example of a web service method, taken from official *biodb* *ChEBI* extension
package:
```{r, eval=FALSE}
wsGetLiteEntity=function(search=NULL, search.category='ALL', stars='ALL',
                         max.results=10,
                         retfmt=c('plain', 'parsed', 'request', 'ids')) {
}
```

A web service method name must start with the prefix `ws`, which stands for
*web service*, and be followed by the database API name of the web service
written in Java style (i.e.: an uppercase letter for the start of each word and
lowercase letters for the rest).

The first parameters of the method are the database web service parameters.

The last parameters (`max.results` and `retfmt`) are *biodb* specific.

`max.results` controls the maximum number of results wanted, and must have a
default value (usually `10`).

`retfmt`, which stands for *return format*, controls the format of the method's
returned value.
The default value of `retfmt` is set to a vector and then processed inside the
method with the `match.arg()` function.
Thus the "real" default value is the first value of the vector, which must
always be `"plain"`.
The set of possible values for `retfmt` is variable from one web service method
to another.
However some of the values are compulsory.
See \@ref(tab:retfmtValues) for a full list of `retfmt` possible values
officially accepted by *biodb*.

Value        | Compulsory | Description
------------ | ---------- | --------------------------------
`plain`      |     yes    | Results are returned verbatim, without any change on the data returned by the server.
`parsed`     |     yes    | Results are parsed according to the data format expected from the server (JSON, CSV, ...) before being returned.
`request`    |     yes    | Instead of returning the results of the query, the query is returned as a `BiodbRequest` object. The query is only built, and is never sent to the server.
`ids`        |     no     | Results are returned as a character vector of entry identifiers.
`queryid`    |     no     | This value is used when dealing with an asynchronous web service. The value returned is the ID of the asynchronous query extracted from the parsed results returned by the server. This query ID is then used to query the query status and to query the query results, usually with two other web services.
`status`     |     no     | When dealing with an asynchronous web service query, this value asks for the current status of the query.
`data.frame` |     no     | Results are formatted into a data frame.

: (\#tab:retfmtValues) `retfmt` accepted values. The list of values of `retfmt` officially accepted by *biodb*.
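
The handling of `retfmt` and `max.results` can be sketched with a plain R
function, as below.
This is a simplified illustration and not code from a *biodb* package: the
method name `wsFindByName()`, the URL and the simulated server answer are all
made up, and a real method would return a `BiodbRequest` object for
`retfmt='request'` instead of a list.

```r
# Hypothetical web service method, written as a plain function
wsFindByName <- function(name=NULL, max.results=10,
                         retfmt=c('plain', 'parsed', 'request', 'ids')) {

    # The first value of the vector ('plain') acts as the default
    retfmt <- match.arg(retfmt)

    # Build the request only; nothing is sent to a server in this sketch
    request <- list(url='https://example.org/ws/find',
                    params=list(name=name, max=max.results))
    if (retfmt == 'request')
        return(request)

    # Simulated raw answer (a real method would send the request here)
    results <- "id,name\n1001,foo acid\n1002,foo ester"
    if (retfmt == 'plain')
        return(results)

    # Parse the raw answer according to its format (CSV in this sketch)
    results <- read.csv(text=results, stringsAsFactors=FALSE)
    results <- head(results, max.results)
    if (retfmt == 'parsed')
        return(results)

    # retfmt == 'ids': return a character vector of entry identifiers
    as.character(results$id)
}

wsFindByName('foo')                 # raw text, since 'plain' is the default
wsFindByName('foo', retfmt='ids')   # identifiers only
```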

You may want to look into some of *biodb* implementations of connectors to
official remote databases, and see how the calls to web services have been
implemented in dedicated web service methods.
See \@ref(tab:biodbOfficialRemoteConns).

Package                                               | Official database site
----------------------------------------------------- | --------------------------------
[biodbChebi](https://github.com/pkrog/biodbChebi)     | [ChEBI](https://www.ebi.ac.uk/chebi/)
[biodbHmdb](https://github.com/pkrog/biodbHmdb)       | [HMDB](https://hmdb.ca/)
[biodbKegg](https://github.com/pkrog/biodbKegg)       | [KEGG](https://www.kegg.jp/)
[biodbUniprot](https://github.com/pkrog/biodbUniprot) | [UniProt](https://www.uniprot.org/)

: (\#tab:biodbOfficialRemoteConns) *biodb* connectors to remote databases. Some of the *biodb* packages implementing connectors to official remote databases.

### Implementation

```{r, echo=FALSE, results='hide'}
connClass <- system.file("extdata", "ChebiExConn.R", package='biodb')
entryClass <- system.file("extdata", "ChebiExEntry.R", package='biodb')
source(connClass)
source(entryClass)
```

Here is our implementation of the connector class:
```{r, code=readLines(connClass)}
```

Here is our implementation of the entry class:
```{r, code=readLines(entryClass)}
```

## Using the new connector

To use the new connector, we first need to load the YAML definition file inside
our *biodb* instance.

To start we create an instance of the `BiodbMain` class:
```{r}
mybiodb <- biodb::newInst()
```

The loading of the definitions is done with a call to `loadDefinitions()`:
```{r}
mybiodb$loadDefinitions(defFile)
```

Now our *biodb* instance is aware of our new connector, and is ready to create
instances of it.

To create an instance of our new connector class, we proceed as usual in
*biodb*, by calling `createConn()` on the factory instance, using our connector
identifier:
```{r}
conn <- mybiodb$getFactory()$createConn('chebi.ex')
```

Now we can retrieve a *ChEBI* entry from the remote database:
```{r}
entry <- conn$getEntry('17001')
entry$getFieldsAsDataframe()
```

Do not forget to terminate your biodb instance once you are done with it:
```{r Closing of the biodb instance}
mybiodb$terminate()
```

## Other types of connectors and entries

We describe here the other types of connectors and entries that *biodb*
provides.
The generator that we have used to generate the package skeleton for `chebi.ex`
can also be used to generate skeletons for all the types described here.

### Connector for a local database

With *biodb* we can also write a connector for a local database.
As a matter of fact, all the connectors included in *biodb* base package are
local connectors only: `mass.csv.file`, `comp.csv.file` and `mass.sqlite`.
See \@ref(tab:connMethods) for a list of methods to implement when writing a local connector.

Method                         | Description
------------------------------ | --------------------------------
`doGetNbEntries()`             | Must return the number of entries contained in the database.
`doGetEntryContentFromDb()`    | Returns the content(s), as strings, of one or more entries from the database.
`doDefineParsingExpressions()` | May be overridden in order to define parsing expressions dynamically (see `CsvFileConn` class for an example).
`doGetEntryIds()`              | This method, called by `getEntryIds()`, should return the full list of accession numbers of the entries contained in the database, or a subset if `max.results` is set. This method is used for testing, in order to get a sample of existing entries, but may also be useful for users when developing.

: (\#tab:connMethods) `BiodbConn` methods to implement. The list of methods to implement when inheriting from the `BiodbConn` class.

### Connector for a mass spectra database

In the example above, we have implemented a compound database.
Another type of database is a mass spectra database.
The following connectors included in *biodb* package are mass spectra database
connectors: `mass.csv.file` and `mass.sqlite`.
See \@ref(tab:massdbConnMethods) for a list of methods to implement when
writing a mass spectra database connector.

Method              | Description
------------------- | --------------------------------
`doGetChromCol()`   | Returns a data frame containing the description of the chromatographic columns.
`doGetNbPeaks()`    | Returns the total number of MS peaks contained in the database.
`doGetMzValues()`   | Returns a list of M/Z values contained inside the database, with the possibility of filtering on MS mode, MS level, and some other variables.
`doSearchMzRange()` | Searches for spectra using an M/Z range and optional filtering on some other variables.

: (\#tab:massdbConnMethods) Methods to implement when defining a connector to a mass spectra database.

### Connector for a downloadable database

Some database servers do not offer web services, or any other connection to the
database, but let you download the whole database for local processing.

*biodb* offers the possibility to handle the connection to such database
servers, by setting `downloadable` to `TRUE` inside the definition of the
database connector.

See \@ref(tab:downloadableMethods) for a list of methods to implement inside your connector when writing a downloadable database connector.

Method                       | Description
---------------------------- | --------------------------------
`doesRequireDownload()`      | This method must return `TRUE` if the connector needs to download files locally with the `BiodbDownloadable` interface.
`doDownload()`               | This method must implement the download of the database file.
`doExtractDownload()`        | This method must implement the extraction of the database files (e.g.: from a zip).

: (\#tab:downloadableMethods) Methods to implement when defining a downloadable connector class.

### How to implement other types of entry classes

We have seen in the example how to parse XML entries by writing an entry class
that inherits from the `BiodbXmlEntry` class.
As stated before, *biodb* provides other types of abstract entry classes, that
facilitate the parsing of diverse entry content formats.
Here is a review of those formats.

#### HTML content

To parse HTML content, your entry class should inherit from `BiodbHtmlEntry`.
The parsing expressions must be written in *XPath* language, as for XML
content, but it uses a special parsing algorithm since HTML is less strict than
XML and allows some "illegal" constructs.

Example of a parsing expression:
```
path: //input[@id='DATA']
```

#### JSON content

To parse JSON content, your entry class should inherit from `BiodbJsonEntry`.
The parsing expressions are written in the form of lists of keys to follow as a
path inside the JSON tree.
Here is an example:
```
chrom.col.id:
- liquidChromatography
- columnCode
```
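
To illustrate how such a list of keys is followed inside the JSON tree, here is
a small base R sketch working on nested R lists, which is the form a parsed JSON
document usually takes in R.
It is not the code of `BiodbJsonEntry`; the helper `followKeys()` and the
example content are hypothetical.

```r
# Entry content as it might look once the JSON has been parsed into
# nested R lists (made-up values)
parsed <- list(
    id='S0001',
    liquidChromatography=list(columnCode='c18-col', flowRate=0.4)
)

# Hypothetical helper: follow the keys down the tree, one level per key,
# and return NULL as soon as a key is missing
followKeys <- function(tree, keys) {
    for (key in keys) {
        if ( ! is.list(tree) || is.null(tree[[key]]))
            return(NULL)
        tree <- tree[[key]]
    }
    tree
}

# Same path as the parsing expression of chrom.col.id above
followKeys(parsed, c('liquidChromatography', 'columnCode'))
```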

#### List content

If your connector gets entry contents directly as an R list object, as in the
case of `MassSqliteConn`, you should make your entry class inherit
from the `BiodbListEntry` abstract class.
With this class, the entry content is provided as a flat named R list object,
although it is also possible to pass a JSON string containing flat key/value
pairs instead.
The parsing expressions are the names used inside the list object.
Here is an example:
```
accession: id
compound.id: comp_id
formula: chem_form
```

#### CSV content

The `BiodbCsvEntry` class helps you handle entry content in CSV (using comma
separator or any other character) format.
When declaring the constructor for your own entry class, do not forget to call
the mother class constructor to pass it your separator and/or the string values
that have to be converted to `NA`:
```{r}
MyEntryClass <- R6::R6Class("MyEntryClass", inherit=biodb::BiodbCsvEntry,
    public=list(
        initialize=function() {
            super$initialize(sep=';', na.strings=c('', 'NA'))
        }
))
```

The parsing expressions are the column names of the CSV file:
```
accession: id
name: fullname
```
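
The effect of the separator, the `NA` strings and the column names can be
sketched in base R with `read.csv()`.
This is only an illustration under simplified assumptions, not the code of
`BiodbCsvEntry`; the content string and the `formula` column are made up.

```r
# Made-up CSV content of one entry, using ';' as separator and an empty
# value for the formula
content <- "id;fullname;formula\nC1;water;"

# Read the content, converting empty strings and 'NA' to NA
row <- read.csv(text=content, sep=';', na.strings=c('', 'NA'),
                stringsAsFactors=FALSE)

# Parsing expressions: biodb field name on the left, CSV column on the right
parsing.expr <- list(accession='id', name='fullname')

# Extract the value of each field from its column
fields <- lapply(parsing.expr, function(column) row[[column]])
fields
```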

#### SDF content

If your entry content is in SDF (Structure Data File) chemical file format,
make your entry class inherit from the `BiodbSdfEntry` abstract class.
Since SDF is a standard format, no parsing expressions are needed in this
case: your class only has to inherit from `BiodbSdfEntry`.

#### Text content

The `BiodbTxtEntry` abstract class allows you to handle any text file content
for entries.
Parsing expressions are defined as regular expressions, using the
[stringr](https://stringr.tidyverse.org/) package, hence in [ICU Regular
Expressions](https://unicode-org.github.io/icu/userguide/strings/regexp.html)
format.

Here is an example:
```
accession: ^ENTRY\s+(\S+)\s+Compound
exact.mass: ^EXACT_MASS\s+(\S+)$
formula: ^FORMULA\s+(\S+)$
```
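
As a rough illustration of how such expressions extract values, the following
sketch applies them line by line using base R regular expressions.
*biodb* itself relies on *stringr*, so this is only an approximation of the
behaviour; the helper `parseFields()` and the content string are hypothetical.
Note that backslashes must be doubled inside R strings, which is not the case
inside the YAML file.

```r
# Made-up text content of an entry
content <- paste('ENTRY       C00001                      Compound',
                 'FORMULA     H2O',
                 'EXACT_MASS  18.0106', sep="\n")
lines <- strsplit(content, "\r?\n")[[1]]

# Same parsing expressions as above, written as R strings
parsing.expr <- list(accession='^ENTRY\\s+(\\S+)\\s+Compound',
                     exact.mass='^EXACT_MASS\\s+(\\S+)$',
                     formula='^FORMULA\\s+(\\S+)$')

# Hypothetical helper: keep, for each field, the first capture group of
# every line that matches the expression
parseFields <- function(lines, parsing.expr) {
    values <- list()
    for (field in names(parsing.expr)) {
        matches <- regmatches(lines,
            regexec(parsing.expr[[field]], lines, perl=TRUE))
        hits <- Filter(function(m) length(m) > 1, matches)
        if (length(hits) > 0)
            values[[field]] <- vapply(hits, function(m) m[[2]], character(1))
    }
    values
}

fields <- parseFields(lines, parsing.expr)
fields
```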

#### Implementing your own parsing

If none of the predefined formats fits your needs, your class has to inherit directly from `BiodbEntry`.

Two methods have to be implemented in this case.
The first is `doParseContent()`, which parses a string into the acceptable
format for the second function, `doParseFieldsStep1()`.

Look for instance at the code of `BiodbTxtEntry` class for a good example.
Here is an excerpt:
```{r, eval=FALSE}
doParseContent=function(content) {

    # Get lines of content
    lines <- strsplit(content, "\r?\n")[[1]]

    return(lines)
},

doParseFieldsStep1=function(parsed.content) {

    # Get parsing expressions
    parsing.expr <- .self$getParent()$getPropertyValue('parsing.expr')

    .self$.assertNotNull(parsed.content)
    .self$.assertNotNa(parsed.content)
    .self$.assertNotNull(parsing.expr)
    .self$.assertNotNa(parsing.expr)
    .self$.assertNotNull(names(parsing.expr))

    # Loop on all parsing expressions
    for (field in names(parsing.expr)) {

        # Match whole content 
        g <- stringr::str_match(parsed.content, parsing.expr[[field]])

        # Get positive results
        results <- g[ ! is.na(g[, 1]), , drop=FALSE]

        # Any match ?
        if (nrow(results) > 0)
            .self$setFieldValue(field, results[, 2])
    }
}
```

#### Extending the parsing of an existing class

When inheriting from one of the abstract classes listed above (`BiodbTxtEntry`,
`BiodbJsonEntry`, `BiodbXmlEntry`, ...), you also have the opportunity to write
some custom parsing code by implementing `doParseFieldsStep2()`.

This method will be called just after `doParseFieldsStep1()`, which is
implemented by the abstract class.

See `HmdbMetabolitesEntry` class inside [biodbHmdb](https://github.com/pkrog/biodbHmdb) extension package for an example.
Here is an extract:
```{r, eval=FALSE}
doParseFieldsStep2=function(parsed.content) {

    # Remove fields with empty string
    for (f in .self$getFieldNames()) {
        v <- .self$getFieldValue(f)
        if (is.character(v) && ! is.na(v) && v == '')
            .self$removeField(f)
    }

    # Correct InChIKey
    if (.self$hasField('INCHIKEY')) {
        v <- sub('^InChIKey=', '', .self$getFieldValue('INCHIKEY'), perl=TRUE)
        .self$setFieldValue('INCHIKEY', v)
    }

    # Synonyms
    synonyms <- XML::xpathSApply(parsed.content, "//synonym", XML::xmlValue)
    if (length(synonyms) > 0)
        .self$appendFieldValue('name', synonyms)
}
```

# Session information

```{r}
sessionInfo()
```