Example of package usage

2021-08-06

Concatipede requirements and workflow

The concatipede package concatenates sequences from different Multiple Sequence Alignments (MSAs) based on a correspondence table that indicates how the sequences from the different MSAs are linked to each other.

For this package to work, all the MSAs must be in fasta format and placed in the same folder (which can be the working directory). Only the fasta files to be concatenated should be present in that folder. If other fasta files are present and you don't want to include them, see the concatipede_prepare(exclude = "") option.

[Figure: Concatipede workflow]

How to use concatipede

Prepare the files

All the fasta files of interest should be put in a target directory. For this example to work, we first create a concatipede_test/ directory, set it as the working directory, and copy the example fasta files shipped with the package into it. By default, the concatipede functions read and write files in the working directory, but options are available to direct them to other directories.

# Save the path to the initial directory for later clean-up
old_dir <- getwd()
# Create a directory to put the fasta files for this example
dir.create("concatipede_test")
# Set it as the working directory
setwd("concatipede_test")
# Copy the example fasta files shipped with the package into that directory
example_files <- list.files(system.file("extdata", package = "concatipede"), full.names = TRUE)
file.copy(from = example_files, to = getwd())

All the data used in this vignette has now been copied into the working directory.

Set up the template for the correspondence table

The first step is to generate a template correspondence table with all the sequence names. You can do this with the function concatipede_prepare().

But first, let’s check which fasta files are in our directory:

library(concatipede)
library(tidyverse)
find_fasta()
## [1] "/private/var/folders/0j/r7b11bl54m94xd1tl6s576w80000gn/T/Rtmp01NSFA/Rbuilde572723b7179/concatipede/vignettes/COI_Macrobiotidae.fas" 
## [2] "/private/var/folders/0j/r7b11bl54m94xd1tl6s576w80000gn/T/Rtmp01NSFA/Rbuilde572723b7179/concatipede/vignettes/ITS2_Macrobiotidae.fas"
## [3] "/private/var/folders/0j/r7b11bl54m94xd1tl6s576w80000gn/T/Rtmp01NSFA/Rbuilde572723b7179/concatipede/vignettes/LSU_Macrobiotidae.fas" 
## [4] "/private/var/folders/0j/r7b11bl54m94xd1tl6s576w80000gn/T/Rtmp01NSFA/Rbuilde572723b7179/concatipede/vignettes/SSU_Macrobiotidae.fas"

These are alignments for four markers of tardigrades from the family Macrobiotidae. The function concatipede_prepare() generates an Excel table listing the sequence names in the order in which they are found in each alignment.

concatipede_prepare(out = "seqnames")

Once this is done, an Excel file should have been saved in your working directory. The template Excel file looks like this:

[Figure: template correspondence table]
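To confirm from R that the template was written, one option is to look for files whose name starts with the prefix passed to `out`. This is only a sketch using base R: it assumes that the output file name begins with `seqnames` and makes no assumption about the extension.

```r
# Prefix passed to concatipede_prepare(out = ...)
out_prefix <- "seqnames"

# Look in the working directory for files starting with that prefix
template_files <- list.files(pattern = paste0("^", out_prefix))

if (length(template_files) > 0) {
  print(template_files)
} else {
  message("No file starting with '", out_prefix, "' found in ", getwd())
}
```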

Modify the correspondence table

Each row of the correspondence table is used to build one concatenated sequence, by extracting and concatenating the corresponding sequences from the input files. You can modify the template Excel correspondence table to reflect how you want to concatenate the sequences from the different alignments.

The first cell of the first row (cell A1, “name”) must not be modified; the other column names are the file names of the fasta alignments. In the name column, enter the name you want for each concatenated sequence.
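The row-by-row logic can be illustrated with a toy example in base R. This is a sketch of the idea only, not the package's implementation: the marker names, sequence names, and sequences are invented, and filling a missing marker with gap characters is an assumption made for this illustration.

```r
# Toy alignments: named character vectors, one aligned sequence per name.
# All names and sequences are invented for illustration.
coi  <- c(sp1_COI = "ATGC", sp2_COI = "ATGA")
its2 <- c(sp1_ITS2 = "GGT", sp3_ITS2 = "GCT")
alignments <- list(COI.fas = coi, ITS2.fas = its2)

# Toy correspondence table: first column "name", then one column per alignment
corr <- data.frame(
  name     = c("species_1", "species_2", "species_3"),
  COI.fas  = c("sp1_COI", "sp2_COI", NA),
  ITS2.fas = c("sp1_ITS2", NA, "sp3_ITS2"),
  stringsAsFactors = FALSE
)

# Build one concatenated sequence from one row of the table
concatenate_row <- function(row_index) {
  pieces <- vapply(names(alignments), function(marker) {
    aln <- alignments[[marker]]
    seq_name <- corr[row_index, marker]
    if (is.na(seq_name)) {
      # Assumption: a missing marker is filled with gaps of the alignment length
      strrep("-", nchar(aln[[1]]))
    } else {
      aln[[seq_name]]
    }
  }, character(1))
  paste(pieces, collapse = "")
}

concatenated <- vapply(seq_len(nrow(corr)), concatenate_row, character(1))
names(concatenated) <- corr$name
print(concatenated)
```

Each element of `concatenated` corresponds to one row of the table, named after the entry in the name column.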

You can save different versions of the correspondence table in different sheets of the Excel file; a sheet can then be selected with the appropriate option in the R session later.
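To inspect the sheets of the workbook independently of concatipede, the readxl package can be used. The sketch below is not the package's own sheet-selection option; the file name `seqnames.xlsx` is a hypothetical placeholder, and the code does nothing if readxl is not installed or the file is missing.

```r
# Hypothetical file name: adjust to the file written by concatipede_prepare()
xlsx_path <- "seqnames.xlsx"

if (requireNamespace("readxl", quietly = TRUE) && file.exists(xlsx_path)) {
  # List the sheet names stored in the workbook
  sheets <- readxl::excel_sheets(xlsx_path)
  print(sheets)

  # Read the second sheet if there is one, otherwise the first
  sheet_to_read <- if (length(sheets) >= 2) sheets[2] else sheets[1]
  corr_table <- readxl::read_excel(xlsx_path, sheet = sheet_to_read)
  print(head(corr_table))
}
```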

After modifying the template to specify the correct matches for concatenation, the first sheet of our example Excel file now looks like this: