Title: | Easy Concatenation of Fasta Sequences |
---|---|
Description: | Concatenation of multiple sequence alignments based on a correspondence table that can be edited in Excel <doi:10.5281/zenodo.5130603>. |
Authors: | Matteo Vecchi [aut, cre] |
Maintainer: | Matteo Vecchi <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.1 |
Built: | 2025-02-25 03:46:55 UTC |
Source: | https://github.com/tardipede/concatipede |
The algorithm used to match sequences across fasta files based on their names is outlined below.
auto_match_seqs(x, method = "lv", xlsx)
x |
A table (data frame or tibble) typically produced by
|
method |
Method for string distance calculation. See
|
xlsx |
Optional, a path to use to save the output table as an Excel file. |
Let's assume a situation with N fasta files, with each fasta file i having n_i sequence names. The problem of matching the names in the best possible way across the fasta files is similar to that of identifying homologous proteins across species, using e.g. reciprocal blast.
The algorithm steps are:
For each pair of fasta files, identify matching names using a reciprocal match approach: two names match if and only if each is the other's best match.
These matches across fasta files define a graph.
We identify sub-graphs such that (i) they contain at most one sequence name per fasta file and (ii) all nodes in a given sub-graph are fully connected (i.e., every pair of names in the sub-graph is a reciprocal best match between the two corresponding fasta files).
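The steps above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: it uses `utils::adist()` (Levenshtein distance) as a stand-in for the string distance, the helper `reciprocal_best()` and all sequence names are hypothetical, and it only reports groups that span all three files.

```r
# Hypothetical sequence names from three fasta files (illustrative only)
names_a <- c("alpha_COI", "beta_COI", "gamma_COI")
names_b <- c("gamma_ITS", "alpha_ITS")
names_c <- c("alpha_LSU", "gamma_LSU", "beta_LSU")

# Hypothetical helper: for each name in x, return the index of its
# reciprocal best match in y, or NA if the best match is not reciprocal.
reciprocal_best <- function(x, y) {
  d <- utils::adist(x, y)               # rows: x, columns: y
  best_in_y <- apply(d, 1, which.min)   # closest y for each x
  best_in_x <- apply(d, 2, which.min)   # closest x for each y
  ok <- best_in_x[best_in_y] == seq_along(x)
  ifelse(ok, best_in_y, NA_integer_)
}

# Step 1: reciprocal best matches for each pair of files
ab <- reciprocal_best(names_a, names_b)
ac <- reciprocal_best(names_a, names_c)
bc <- reciprocal_best(names_b, names_c)

# Steps 2 and 3: keep only fully connected groups, i.e. the match of a in B
# and the match of a in C must also be reciprocal best matches of each other.
complete <- (bc[ab] == ac) %in% TRUE

groups <- data.frame(
  COI = names_a[complete],
  ITS = names_b[ab[complete]],
  LSU = names_c[ac[complete]],
  stringsAsFactors = FALSE
)
groups
```

In this toy data, "beta_COI" has no counterpart in the ITS file, so it does not appear in a complete three-file group, even though it does match "beta_LSU" reciprocally.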
A table (tibble) with the same columns as x and with sequence names automatically matched across fasta files. Sequence names which did not have a best reciprocal match in other fasta files are appended to the end of the table, so that the output table columns contain all the unique sequence names present in the corresponding column of the input table. The first column, "name", contains a suggested name for the row (not guaranteed to be unique). If a path was provided to the xlsx argument, an Excel file is saved and the table is returned invisibly.
xlsx_file <- concatipede_example("sequences-test-matching.xlsx")
xlsx_template <- readxl::read_xlsx(xlsx_file)
auto_match_seqs(xlsx_template)
## Not run:
auto_match_seqs(xlsx_template, xlsx = "my-automatic-output.xlsx")
## End(Not run)
This function concatenates sequences from alignments present in the working directory based on a correspondence table and saves the output in a new directory.
concatipede( df = NULL, filename = NULL, format = c("fasta", "nexus", "phylip"), dir, plotimg = FALSE, out = NULL, remove.gaps = TRUE, write.outputs = TRUE, save.partitions = TRUE, excel.sheet = 1 )
df |
The user-defined correspondence table, as a data frame or equivalent. This is used only if no |
filename |
Filename of input correspondence table. Alternatively, if no filename is provided, the user can provide their own correspondence table as the |
format |
A string specifying the format(s) in which the alignment should be saved. |
dir |
Optional, path to the directory containing the fasta files. This argument has an effect only if fasta files names are taken from the columns of the |
plotimg |
Logical, save a graphical representation of the alignment in pdf format. Default: FALSE. |
out |
Specifies the output filenames. |
remove.gaps |
Logical, remove gap-only columns. Useful if not using all sequences in the alignments. Default: TRUE. |
write.outputs |
Logical, save the concatenated alignment, the partition position table, and the graphical representation. If FALSE, it overrides plotimg. Default: TRUE. |
save.partitions |
Logical, save a text file with the partition limits for the concatenated alignment in the concatenated alignment directory. Default: TRUE. |
excel.sheet |
Specifies which sheet of the Excel spreadsheet to read. Either a string (the name of a sheet) or an integer (the position of the sheet). |
The concatenated alignment (invisibly if out is not NULL).
dir <- system.file("extdata", package = "concatipede")
z <- concatipede(filename = paste0(dir, "/Macrobiotidae_seqnames.xlsx"), dir = dir, write.outputs = FALSE)
z
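To make the idea of concatenating alignments from a correspondence table concrete, here is a minimal base R sketch. It is not how concatipede() is implemented: the toy alignments, the table `corr`, and the helper `pick()` are all hypothetical, sequences are plain strings rather than alignment objects, and gap-only column removal is not shown.

```r
# Hypothetical aligned markers: one string per sequence, equal width per marker
coi <- c(sp1_COI = "ATG-CA", sp2_COI = "ATGACA")
its <- c(sp1_ITS = "GG-T", sp3_ITS = "GGCT")
markers <- list(COI = coi, ITS = its)

# Hypothetical correspondence table: one row per taxon, one column per marker
corr <- data.frame(
  name = c("sp1", "sp2", "sp3"),
  COI  = c("sp1_COI", "sp2_COI", NA),
  ITS  = c("sp1_ITS", NA, "sp3_ITS"),
  stringsAsFactors = FALSE
)

# For one marker, fetch the sequence of each row, or an all-gap string
# of the marker's width when the row has no sequence for that marker.
pick <- function(aln, ids) {
  gap <- strrep("-", nchar(aln[[1]]))
  out <- unname(aln[ids])
  out[is.na(out)] <- gap
  out
}

# Paste the per-marker pieces side by side, row by row
pieces <- lapply(names(markers), function(m) pick(markers[[m]], corr[[m]]))
concatenated <- do.call(paste0, pieces)
names(concatenated) <- corr$name

# Partition limits of each marker within the concatenated alignment
widths <- vapply(markers, function(a) nchar(a[[1]]), integer(1))
ends <- cumsum(widths)
partitions <- data.frame(marker = names(markers),
                         start = ends - widths + 1,
                         end = ends)

concatenated
partitions
```

Rows with a missing marker are padded with gaps so that every concatenated sequence has the same length, which is what allows a single partition table to describe all of them.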
Several example files are shipped with the concatipede package. This function facilitates the access to those files.
concatipede_example(example_file = NULL)
example_file |
Basename of the target example file. If |
Example fasta files: COI_Macrobiotidae.fas, ITS2_Macrobiotidae.fas, LSU_Macrobiotidae.fas, and SSU_Macrobiotidae.fas.
This is an Excel file (extension .xlsx) typically used to test or demonstrate the automatic matching capabilities of the concatipede package. This file represents the Excel template that could be produced by concatipede_prepare after detecting the fasta files present in a working directory.
This is an Excel file (extension .xlsx) that contains the correspondence table that can be used to concatenate the sequences contained in the example fasta files COI_Macrobiotidae.fas, ITS2_Macrobiotidae.fas, LSU_Macrobiotidae.fas, and SSU_Macrobiotidae.fas.
The full path to access the example file, or a list of available example files if no example_file argument was provided.
concatipede_example()
example <- concatipede_example("sequences-test-matching.xlsx")
if (requireNamespace("readxl")) {
  seqs <- readxl::read_xlsx(example)
  seqs
}
This function creates a template correspondence table that can also be saved in the working directory.
concatipede_prepare(fasta_files, out = "seqnames", excel = TRUE, exclude)
fasta_files |
Optional, a vector of paths to the fasta files that should be merged. If this argument is missing, the function automatically detects and uses all the fasta files present in the working directory. |
out |
Optional, a filename for the correspondence table template to save (without extension). No file is saved if |
excel |
Boolean, should the correspondence table template be saved in excel format? If |
exclude |
If no |
A tibble with the correspondence table template (invisibly if an out argument was provided to save the table to a file).
dir <- system.file("extdata", package = "concatipede")
fasta_files <- find_fasta(dir)
z <- concatipede_prepare(fasta_files)
z
Finds the fasta files present in a folder.
find_fasta(dir, pattern = "\\.fa$|\\.fas$|\\.fasta$", exclude)
dir |
Path to the directory which should be examined. If not provided,
the current working directory (as returned by |
pattern |
Regular expression used by |
exclude |
Optional regular expression used to exclude some filenames from the list of detected files. |
A vector with the full paths to the found files.
# Get the directory containing the package example files
dir <- system.file("extdata", package = "concatipede")
# List the fasta files contained in that directory
find_fasta(dir)
# Exclude some files
find_fasta(dir, exclude = "COI")
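The combined effect of the pattern and exclude arguments can be approximated with base R alone. The sketch below assumes detection works like `list.files()` followed by a `grepl()` filter on the file names; that is an assumption for illustration, not a description of the internals of find_fasta(), and the file names are made up.

```r
# Create a temporary folder with a few empty, hypothetical files
tmp <- tempfile("fasta-demo-")
dir.create(tmp)
invisible(file.create(file.path(tmp, c("COI.fas", "ITS2.fasta", "LSU.fa", "notes.txt"))))

# Keep only files whose names end in .fa, .fas, or .fasta
# (same regular expression as the default pattern shown in the usage above)
found <- list.files(tmp, pattern = "\\.fa$|\\.fas$|\\.fasta$", full.names = TRUE)

# Drop the files whose basename matches the exclusion regular expression
kept <- found[!grepl("COI", basename(found))]

basename(found)
basename(kept)
```

Filtering on `basename()` rather than on the full path avoids accidentally excluding every file when the exclusion pattern happens to match a parent directory name.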
Extracts GenBank accession numbers from a correspondence table formatted according to the same requirements as for concatipede().
get_genbank_table( df = NULL, filename = NULL, writetable = FALSE, out = "", excel.sheet = 1 )
df |
The user-defined correspondence table, as a data frame or equivalent. This is used only if no |
filename |
Filename of input correspondence table. Alternatively, if no filename is provided, the user can provide their own correspondence table as the |
writetable |
If TRUE, save the GenBank table as an Excel file in the working directory. |
out |
If writetable is TRUE, the name to be attached to the Excel filename. |
excel.sheet |
Specifies which sheet of the Excel spreadsheet to read. Either a string (the name of a sheet) or an integer (the position of the sheet). |
Table with GenBank accession numbers
This function loads a table from an Excel file.
read_xl(path, sheet = 1)
path |
The path to the Excel file to read. |
sheet |
Optional, the sheet to read (either a string with the name of the sheet or an integer with its position). Default: 1. |
A tibble.
This function renames sequences in fasta files based on a correspondence table.
rename_sequences( fasta_files, df = NULL, filename = NULL, marker_names = NULL, out = NULL, format = "fasta", excel.sheet = 1, unalign = FALSE, exclude )
fasta_files |
Optional, a vector of paths to the fasta files that should be renamed. If this argument is missing, the function automatically detects and uses all the fasta files present in the working directory. |
df |
The user-defined correspondence table, as a data frame or equivalent. This is used only if no |
filename |
Filename of correspondence table. Alternatively, if no filename is provided, the user can provide their own correspondence table as the |
marker_names |
The name of the marker for each alignment, to be appended to the end of the sequence names, in the same order as in the correspondence table. |
out |
Specifies the output filename. |
format |
A string specifying the format in which the alignment should be saved. Can be "fasta", "phylip", or "nexus". |
excel.sheet |
Specifies which sheet of the Excel spreadsheet to read. Either a string (the name of a sheet) or an integer (the position of the sheet). |
unalign |
If TRUE, return unaligned fasta files as output. |
exclude |
Optional regular expression used to exclude some filenames from the list of detected files. |
No return value, called for the side effect of saving the renamed sequences.
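The renaming logic can be pictured as a lookup in the correspondence table. The base R sketch below is only an illustration: the table, the sequence names, and the "_" separator between the new name and the marker name are assumptions, not the documented behaviour of rename_sequences().

```r
# Hypothetical correspondence table: "name" is the new name, "COI" the old one
corr <- data.frame(
  name = c("sp1", "sp2"),
  COI  = c("seqA_Macrobiotus", "seqB_Mesobiotus"),
  stringsAsFactors = FALSE
)

# Sequence names as they appear in the (hypothetical) COI fasta file
old_names <- c("seqB_Mesobiotus", "seqA_Macrobiotus")

# Look up each old name in the marker column and append the marker name
# (the "_" separator is an assumption made for this sketch)
marker <- "COI"
new_names <- paste0(corr$name[match(old_names, corr[[marker]])], "_", marker)
new_names
```

Using `match()` keeps the order of the sequences in the fasta file, independently of the row order of the correspondence table.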
Alignments can be saved in fasta, nexus, and phylip formats.
write_fasta(x, path) write_nexus(x, path) write_phylip(x, path)
x |
Alignment to save (an object of class |
path |
Path of the file to be written, without file extension (the appropriate extension is added automatically, i.e. the path will be extended with ".fasta", ".nexus", or ".phy" depending on the file format used). |
The input x (invisibly).
## Not run:
# Path to an example alignment file
pkg_aln <- concatipede_example("COI_Macrobiotidae.fas")
# Load the alignment into the R session
aln <- ape::read.FASTA(pkg_aln)
# Write the alignment in various formats
# Note that the appropriate file extension is added by the writing functions.
write_fasta(aln, "my-alignment")
write_nexus(aln, "my-alignment")
write_phylip(aln, "my-alignment")
## End(Not run)
This function writes an input table to an Excel file and returns its input (invisibly).
write_xl(x, path)
x |
A data frame or tibble to write to an Excel file. |
path |
The path to the Excel file to be written. |
The input table x, invisibly (so that the function can be part of a pipeline with the pipe operator).
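Returning the input invisibly is what lets a writing function sit in the middle of a pipeline. The sketch below shows the pattern with a hypothetical stand-in, `write_table_invisibly()`, that writes a CSV file with base R instead of an Excel file; it assumes R 4.1 or later for the native pipe `|>`.

```r
# Hypothetical stand-in for write_xl(): writes a CSV instead of an Excel file
write_table_invisibly <- function(x, path) {
  utils::write.csv(x, path, row.names = FALSE)
  invisible(x)  # hand the input back without printing it
}

tab <- data.frame(name = c("sp1", "sp2"), COI = c("seqA", "seqB"))
path <- tempfile(fileext = ".csv")

# The table is written to disk and then passed on to the next step
n_rows <- tab |>
  write_table_invisibly(path) |>
  nrow()

n_rows
file.exists(path)
```

Because the value is returned invisibly, calling the function on its own at the console prints nothing, while the pipeline still receives the table.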