Package 'concatipede' reference manual

Title:	Easy Concatenation of Fasta Sequences
Description:	Concatenation of multiple sequence alignments based on a correspondence table that can be edited in Excel <doi:10.5281/zenodo.5130603>.
Authors:	Matteo Vecchi [aut, cre], Mattieu Bruneaux [aut]
Maintainer:	Matteo Vecchi <[email protected]>
License:	MIT + file LICENSE
Version:	1.0.1
Built:	2025-03-27 03:49:30 UTC
Source:	https://github.com/tardipede/concatipede

Build a template table with automatically matched sequence names

Description

The algorithm used to match sequences across fasta files based on their names is outlined below.

Usage

auto_match_seqs(x, method = "lv", xlsx)

Arguments

`x`	A table (data frame or tibble) typically produced by `concatipede_prepare`. It must have the same format as a table returned by `concatipede_prepare`: a first column called "name" followed by one column per fasta file. Each of those columns is named after its corresponding fasta file and contains the names of the sequences in that file, with one sequence name per cell. The number of rows is the number of sequences in the fasta file with the most sequences, and the columns for the other fasta files are padded with `NA`.
`method`	Method for string distance calculation. See `?stringdist::stringdist-metrics` for details. Default is `"lv"`.
`xlsx`	Optional, a path to use to save the output table as an Excel file.
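
As an illustration of the table format expected for `x`, the following minimal sketch builds such a table by hand in base R. The file names and sequence names are hypothetical, and filling the "name" column with `NA` is an assumption made for the example, not a documented behaviour of `concatipede_prepare`.

```r
# Hypothetical sequence names for two fasta files (not shipped with the package)
seq_names <- list(
  "COI.fas" = c("sp1_COI", "sp2_COI", "sp3_COI"),
  "SSU.fas" = c("sp1_SSU", "sp3_SSU")
)

# The table has as many rows as the fasta file with the most sequences
n_rows <- max(lengths(seq_names))

# Pad the shorter columns with NA so that all columns have n_rows entries
padded <- lapply(seq_names, function(v) {
  c(v, rep(NA_character_, n_rows - length(v)))
})

# First column "name", then one column per fasta file
# (the NA content of "name" is an assumption made for this sketch)
x <- data.frame(name = rep(NA_character_, n_rows), padded,
                check.names = FALSE, stringsAsFactors = FALSE)
print(x)
```

Here the "SSU.fas" column ends with one `NA` because that file has one sequence fewer than "COI.fas".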

Details

Let's assume a situation with N fasta files, with each fasta file i having n_i sequence names. The problem of matching the names in the best possible way across the fasta files is similar to that of identifying homologous proteins across species, using e.g. reciprocal blast.

The algorithm steps are:

1. For each pair of fasta files, identify matching names using a reciprocal match approach: two names match if and only if each is the other's best match.
2. These matches across fasta files define a graph.
3. Identify sub-graphs such that (i) they contain at most one sequence name per fasta file and (ii) all nodes in a given sub-graph are fully connected (i.e., every pair of names in the sub-graph is a reciprocal best match).
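
The reciprocal match idea for a single pair of fasta files can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the helper `reciprocal_best_matches()` and the sequence names are hypothetical, `utils::adist()` (generalized Levenshtein distance) stands in for the string distance computed by the stringdist package, and ties are resolved by taking the first minimum. The sub-graph step is not covered.

```r
# Hypothetical helper (not part of concatipede): reciprocal best matches
# between two character vectors of sequence names.
reciprocal_best_matches <- function(a, b) {
  # Pairwise Levenshtein distances: rows index `a`, columns index `b`
  d <- utils::adist(a, b)
  # Best match in `b` for each name in `a`, and in `a` for each name in `b`
  best_for_a <- apply(d, 1, which.min)
  best_for_b <- apply(d, 2, which.min)
  # Keep only the pairs where each name is the other's best match
  keep <- best_for_b[best_for_a] == seq_along(a)
  data.frame(a = a[keep], b = b[best_for_a[keep]], stringsAsFactors = FALSE)
}

# Hypothetical sequence names from two fasta files
coi <- c("Macrobiotus_hufelandi_COI", "Paramacrobiotus_richtersi_COI")
ssu <- c("Paramacrobiotus_richtersi_SSU", "Macrobiotus_hufelandi_SSU",
         "Mesobiotus_harmsworthi_SSU")

matches <- reciprocal_best_matches(coi, ssu)
print(matches)
```

In this example the two COI names are each paired with the SSU name of the same species, while "Mesobiotus_harmsworthi_SSU" has no reciprocal best match and is left out of the result.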

Value

A table (tibble) with the same columns as x and with sequence names automatically matched across fasta files. Sequence names which did not have a best reciprocal match in other fasta files are appended to the end of the table, so that the output table columns contain all the unique sequence names present in the corresponding column of the input table. The first column, "name", contains a suggested name for the row (not guaranteed to be unique). If a path was provided to the xlsx argument, an Excel file is saved and the table is returned invisibly.

Examples

xlsx_file <- concatipede_example("sequences-test-matching.xlsx")
xlsx_template <- readxl::read_xlsx(xlsx_file)
auto_match_seqs(xlsx_template)
## Not run: 
  auto_match_seqs(xlsx_template, xlsx = "my-automatic-output.xlsx")

## End(Not run)

Concatenate alignments

Description

This function concatenates sequences from alignments present in the working directory based on a correspondence table and saves the output in a new directory.

Usage

concatipede(
  df = NULL,
  filename = NULL,
  format = c("fasta", "nexus", "phylip"),
  dir,
  plotimg = FALSE,
  out = NULL,
  remove.gaps = TRUE,
  write.outputs = TRUE,
  save.partitions = TRUE,
  excel.sheet = 1
)

Arguments

`df`	The user-defined correspondence table, as a data frame or equivalent. This is used only if no `filename` argument is provided.
`filename`	Filename of input correspondence table. Alternatively, if no filename is provided, the user can provide their own correspondence table as the `df` argument.
`format`	A character vector specifying the format(s) in which the alignment should be saved: "fasta", "nexus", and/or "phylip".
`dir`	Optional, path to the directory containing the fasta files. This argument has an effect only if the fasta file names are taken from the columns of the `df` argument and `df` does not itself have a `dir_name` attribute. If no `dir` is provided and `df` does not have a `dir_name` attribute, the current working directory is used, with a warning.
`plotimg`	Logical, save a graphical representation of the alignment in pdf format. Default: FALSE.
`out`	Specifies the output filenames.
`remove.gaps`	Logical, remove gap-only columns. Useful if not using all sequences in the alignments. Default: TRUE.
`write.outputs`	Logical, save the concatenated alignment, the partition position table, and the graphical representation. If FALSE, it overrides plotimg. Default: TRUE.
`save.partitions`	Logical, save a text file with the partition limits of the concatenated alignment in the concatenated alignment directory. Default: TRUE.
`excel.sheet`	Which sheet of the Excel spreadsheet to read. Either a string (the name of a sheet), or an integer (the position of the sheet).
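
To make the notions of concatenation and partition limits concrete, here is a minimal base R sketch that works on plain character strings. It is not the package's implementation: the marker names, sequence names, and sequences are hypothetical, and filling a missing sequence with a run of gaps is an assumption made for the example.

```r
# Two hypothetical aligned markers (sequences within a marker have equal length)
markers <- list(
  COI = c(sp1_COI = "ATGC", sp2_COI = "ATGA"),
  SSU = c(sp1_SSU = "GG-TTA")
)

# Correspondence table: one row per output sequence, one column per marker
corr <- data.frame(name = c("sp1", "sp2"),
                   COI = c("sp1_COI", "sp2_COI"),
                   SSU = c("sp1_SSU", NA),
                   stringsAsFactors = FALSE)

# For each marker, take the sequence named in the table,
# or a run of gaps of the marker's width when the cell is NA (assumption)
pieces <- lapply(names(markers), function(m) {
  aln <- markers[[m]]
  gap <- strrep("-", nchar(aln[[1]]))
  ifelse(is.na(corr[[m]]), gap, aln[corr[[m]]])
})

# Paste the per-marker pieces together, row by row
concatenated <- do.call(paste0, pieces)
names(concatenated) <- corr$name

# Partition limits: first and last column of each marker in the concatenation
widths <- vapply(markers, function(aln) nchar(aln[[1]]), integer(1))
partitions <- data.frame(marker = names(markers),
                         start = cumsum(widths) - widths + 1,
                         end = cumsum(widths))
print(concatenated)
print(partitions)
```

In this sketch COI occupies columns 1 to 4 and SSU columns 5 to 10 of the concatenated alignment, and "sp2" receives six gap characters for the marker it lacks.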

Value

The concatenated alignment (invisibly if out is not NULL).

Examples

dir <- system.file("extdata", package = "concatipede")
z <- concatipede(filename = paste0(dir,"/Macrobiotidae_seqnames.xlsx"), dir = dir,
                 write.outputs = FALSE)
z

Get the path to one of concatipede example files

Description

Several example files are shipped with the concatipede package. This function facilitates the access to those files.

Usage

concatipede_example(example_file = NULL)

Arguments

`example_file`	Basename of the target example file. If NULL (the default), the basenames of the available files are listed.

Details

COI_Macrobiotidae.fas: Example fasta file.
ITS2_Macrobiotidae.fas: Example fasta file.
LSU_Macrobiotidae.fas: Example fasta file.
SSU_Macrobiotidae.fas: Example fasta file.
sequences-test-matching.xlsx: This is an Excel file (extension .xlsx) typically used to test or demonstrate the automatic matching capabilities of the concatipede package. This file represents the Excel template that could be produced by concatipede_prepare after detecting the fasta files present in a working directory.
Macrobiotidae_seqnames.xlsx: This is an Excel file (extension .xlsx) that contains the correspondence table that can be used to concatenate the sequences contained in the example fasta files COI_Macrobiotidae.fas, ITS2_Macrobiotidae.fas, LSU_Macrobiotidae.fas, and SSU_Macrobiotidae.fas.

Value

The full path to access the example file, or a list of available example files if no example_file argument was provided.

Examples

concatipede_example()
example <- concatipede_example("sequences-test-matching.xlsx")
if (requireNamespace("readxl")) {
  seqs <- readxl::read_xlsx(example)
  seqs
}

Load alignments and prepare template correspondence table for concatipede() function

Description

This function creates a template correspondence table that can also be saved in the working directory.

Usage

concatipede_prepare(fasta_files, out = "seqnames", excel = TRUE, exclude)

Arguments

`fasta_files`	Optional, a vector of paths to the fasta files that should be merged. If this argument is missing, the function automatically detects and uses all the fasta files present in the working directory.
`out`	Optional, a filename for the correspondence table template to save (without extension). No file is saved if `out` is not provided. In all cases, the function also returns a tibble with the correspondence table template (invisibly if `out` is provided).
`excel`	Boolean, should the correspondence table template be saved in excel format? If `FALSE`, it is saved as a tab-separated text file instead. Default is `TRUE`. The correct file extension is automatically appended to the `out` argument. If `out` is missing, this argument has no effect.
`exclude`	If no `fasta_files` argument has been passed, fasta files matching the `exclude` pattern will be ignored by the function when it automatically detects fasta files in the working directory.
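
The sequence names that fill the template come from the fasta files themselves. The following base R sketch shows one way to read them from the header lines of a fasta file. The file content is hypothetical, and keeping the whole header line as the sequence name is an assumption made for the example. The package may read the files differently.

```r
# Write a tiny hypothetical fasta file to a temporary location
fasta_path <- tempfile(fileext = ".fas")
writeLines(c(">sp1_COI", "ATGC", ">sp2_COI some description", "ATGA"),
           fasta_path)

# Header lines start with ">"; drop that character to get the sequence names
fasta_lines <- readLines(fasta_path)
headers <- fasta_lines[startsWith(fasta_lines, ">")]
seq_names <- sub("^>", "", headers)
print(seq_names)
```

Each element of `seq_names` would then occupy one cell in the column of the template that corresponds to this fasta file.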

Value

A tibble with the correspondence table template (invisibly if an out argument was provided to save the table to a file).

Examples

dir <- system.file("extdata", package = "concatipede")
fasta_files <- find_fasta(dir)
z <- concatipede_prepare(fasta_files)
z

Find fasta files present in a folder

Description

Find fasta files present in a folder

Usage

find_fasta(dir, pattern = "\\.fa$|\\.fas$|\\.fasta$", exclude)

Arguments

`dir`	Path to the directory which should be examined. If not provided, the current working directory (as returned by `getwd`) is used.
`pattern`	Regular expression used by `list.files` to detect the fasta files. The default is to list all files ending in ".fa", ".fas", and ".fasta".
`exclude`	Optional regular expression used to exclude some filenames from the list of detected files.
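
The interplay of `pattern` and `exclude` can be sketched with base R alone. The helper `list_fasta()` below is a hypothetical stand-in, not the package's `find_fasta`. In particular, applying `exclude` to the file basenames rather than to the full paths is an assumption.

```r
# Hypothetical stand-in for find_fasta(), built on base R only
list_fasta <- function(dir = getwd(),
                       pattern = "\\.fa$|\\.fas$|\\.fasta$",
                       exclude = NULL) {
  # Keep files whose names match the pattern, returned with their full paths
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  # Drop files whose basename matches the exclusion pattern (assumption)
  if (!is.null(exclude)) {
    files <- files[!grepl(exclude, basename(files))]
  }
  files
}

# Demonstration in a temporary directory with hypothetical file names
tmp <- tempfile("fasta-demo-")
dir.create(tmp)
invisible(file.create(file.path(tmp, c("COI.fas", "SSU.fasta",
                                       "LSU.fa", "notes.txt"))))

found <- basename(list_fasta(tmp))
found_no_coi <- basename(list_fasta(tmp, exclude = "COI"))
print(found)
print(found_no_coi)
```

The file "notes.txt" is never listed because it does not match the default pattern, and "COI.fas" disappears from the second listing because it matches the exclusion pattern.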

Value

A vector with the full paths to the found files.

Examples

# Get the directory containing the package example files
dir <- system.file("extdata", package = "concatipede")
# List the fasta files contained in that directory
find_fasta(dir)
# Exclude some files
find_fasta(dir, exclude = "COI")

Extract GenBank accession number from correspondence table

Description

Extracts GenBank accession numbers from a correspondence table formatted according to the same requirements as for concatipede().

Usage

get_genbank_table(
  df = NULL,
  filename = NULL,
  writetable = FALSE,
  out = "",
  excel.sheet = 1
)

Arguments

`df`	The user-defined correspondence table, as a data frame or equivalent. This is used only if no `filename` argument is provided.
`filename`	Filename of input correspondence table. Alternatively, if no filename is provided, the user can provide their own correspondence table as the `df` argument.
`writetable`	Logical, if TRUE, save the GenBank table as an Excel file in the working directory.
`out`	If `writetable` is TRUE, the name to be attached to the Excel filename.
`excel.sheet`	Which sheet of the Excel spreadsheet to read. Either a string (the name of a sheet), or an integer (the position of the sheet).

Value

A table with GenBank accession numbers.

Read an Excel file

Description

This function loads a table from an Excel file.

Usage

read_xl(path, sheet = 1)

Arguments

`path`	The path to the Excel file to read.
`sheet`	Optional, the sheet to read (either a string with the name of the sheet or an integer with its position). Default: 1.

Value

A tibble.

Rename sequences

Description

This function renames sequences in fasta files based on a correspondence table.

Usage

rename_sequences(
  fasta_files,
  df = NULL,
  filename = NULL,
  marker_names = NULL,
  out = NULL,
  format = "fasta",
  excel.sheet = 1,
  unalign = FALSE,
  exclude
)

Arguments

`fasta_files`	Optional, a vector of paths to the fasta files that should be renamed. If this argument is missing, the function automatically detects and uses all the fasta files present in the working directory.
`df`	The user-defined correspondence table, as a data frame or equivalent. This is used only if no `filename` argument is provided.
`filename`	Filename of correspondence table. Alternatively, if no filename is provided, the user can provide their own correspondence table as the `df` argument.
`marker_names`	The name of the marker for each alignment, to be appended to the end of the sequence names, in the same order as in the correspondence table.
`out`	Specifies the output filename.
`format`	A string specifying the format in which to save the alignment. Can be "fasta", "phylip", or "nexus".
`excel.sheet`	Which sheet of the Excel spreadsheet to read. Either a string (the name of a sheet), or an integer (the position of the sheet).
`unalign`	Logical, return unaligned fasta files as output.
`exclude`	Optional regular expression used to exclude some filenames from the list of detected files.

Value

No return value, called for side effect of saving a correspondence table.

Writing alignments

Description

Alignments can be saved in fasta, nexus, and phylip formats.

Usage

write_fasta(x, path)

write_nexus(x, path)

write_phylip(x, path)

Arguments

`x`	Alignment to save (an object of class `DNAbin`).
`path`	Path of the file to be written, without file extension (the appropriate extension is added automatically, i.e. the path will be extended with ".fasta", ".nexus", or ".phy" depending on the file format used).

Value

The input x (invisibly).

Examples

## Not run: 
  # Path to an example alignment file
  pkg_aln <- concatipede_example("COI_Macrobiotidae.fas")
  # Load the alignment into the R session
  aln <- ape::read.FASTA(pkg_aln)
  # Write the alignment in various formats
  # Note that the appropriate file extension is added by the writing functions.
  write_fasta(aln, "my-alignment")
  write_nexus(aln, "my-alignment")
  write_phylip(aln, "my-alignment")

## End(Not run)

Write an Excel file

Description

This function writes an input table to an Excel file and returns its input (invisibly).

Usage

write_xl(x, path)

Arguments

`x`	A data frame or tibble to write to an Excel file.
`path`	The path to the Excel file to be written.

Value

The input table x, invisibly (so that the function can be part of a pipeline with the pipe operator).
