alignment_tools package

Submodules

alignment_tools.core module

alignment_tools.core.PDB2dataFrame(id: str, pdb_file: str) → DataFrame

Parses a PDB file and extracts atomic-level information into a pandas DataFrame.

Parameters:

id (str): An identifier for the structure (e.g., PDB ID or custom name).
pdb_file (str): Path to the PDB file to be parsed.

Returns:

pd.DataFrame

A DataFrame where each row corresponds to an atom in the structure. Columns include:

‘atom_name’: Name of the atom (e.g., ‘CA’, ‘N’).

‘residue_name’: 3-letter residue name (e.g., ‘ALA’).

‘chain_id’: Chain identifier (e.g., ‘A’).

‘residue_id’: Residue number.

‘x’, ‘y’, ‘z’: Cartesian coordinates of the atom.

‘bfactor’: B-factor (temperature factor).

‘occupancy’: Occupancy value of the atom.

‘element’: Chemical element symbol.

‘id’: The structure ID passed to the function.

Notes

Uses Biopython’s PDBParser to extract atomic data.
Only includes atoms from the first model (if multiple models exist).
Does not filter hydrogen atoms: they appear in the output whenever they are present in the file.

alignment_tools.core.batch_find_sequence_start_end(fullsequence: str, subsequence_grouped: str) → str

Find the start and end positions of multiple peptide subsequences within a full sequence.

Splits the subsequence_grouped string (which contains semicolon-separated peptide subsequences), then uses find_peptide_positions_within_full_sequence to compute the start and end position of each peptide in the fullsequence.

Parameters:

fullsequence (str): The full protein or nucleotide sequence in which to search for subsequences.
subsequence_grouped (str): A semicolon-separated string of peptide subsequences (e.g., “ABC;DEF;GHI”).

Returns:

str: A semicolon-separated string of start and end positions for each peptide, in the format ‘start_end;start_end;…’.

Notes

The function assumes each peptide appears exactly once in the full sequence. If a peptide is not found, find_peptide_positions_within_full_sequence may raise an error or return an invalid result depending on its internal logic.
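
The split-and-locate procedure described above can be sketched in plain Python. This is a minimal illustration, not the package's implementation: the helper names sketch_peptide_span and sketch_batch_spans are hypothetical, and raising ValueError for a missing peptide is an assumption, since the behavior of the real helper in that case is not specified here.

```python
def sketch_peptide_span(fullsequence: str, peptide: str) -> str:
    """Return 'start_end' (1-based, inclusive) for the first match of peptide."""
    idx = fullsequence.find(peptide)  # 0-based index, -1 when absent
    if idx < 0:
        # Assumption: fail loudly; the real helper may behave differently.
        raise ValueError(f"peptide {peptide!r} not found")
    start = idx + 1
    end = start + len(peptide) - 1
    return f"{start}_{end}"


def sketch_batch_spans(fullsequence: str, subsequence_grouped: str) -> str:
    """Apply sketch_peptide_span to each semicolon-separated peptide."""
    peptides = subsequence_grouped.split(";")
    return ";".join(sketch_peptide_span(fullsequence, p) for p in peptides)


print(sketch_batch_spans("MKTLLILTCLVAVALARPKH", "MKT;CLVAV;RPKH"))
# prints 1_3;9_13;17_20
```

Note that str.find reports only the first match, so a peptide that occurs more than once would silently map to its first occurrence.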

alignment_tools.core.calculate_SASA_from_alphafold_pdb(uniprot_id: str) → None

Calculates the solvent accessible surface area (SASA) of a protein structure from an AlphaFold PDB file and writes the SASA data to a new PDB file.

Parameters:

uniprot_id (str): The UniProt identifier corresponding to the AlphaFold structure file named ‘AF-{uniprot_id}-F1.pdb’, which must be present in the working directory.

Returns:

None: The function writes a new PDB file named ‘AF-{uniprot_id}-F1.sasa.pdb’ with SASA values included.

Notes

Uses the freesasa Python library to calculate SASA.
The AlphaFold PDB file must exist in the current directory before calling this function.
Any exceptions during calculation or file handling are caught, and a failure message is printed. This prevents the program from crashing but suppresses detailed error information.

alignment_tools.core.calculate_distance_from_ref_point(df: DataFrame, resid: int, atom_name: str = 'CA')

Calculates the Euclidean distance from a reference atom in a specified residue to all other atoms in the DataFrame.

Parameters:

df (pd.DataFrame): A DataFrame containing atomic coordinates and residue information. Expected columns: [‘residue_id’, ‘atom_name’, ‘x’, ‘y’, ‘z’].
resid (int): The residue ID containing the reference atom.
atom_name (str, optional): The name of the reference atom within the residue (default is ‘CA’).

Returns:

pd.DataFrame: The input DataFrame with an additional column ‘distance’ representing the Euclidean distance of each atom to the reference point.

Raises:

ValueError: If the specified reference atom is not found in the DataFrame.
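
The distance calculation can be illustrated without pandas. The sketch below uses a list of dictionaries in place of the DataFrame; the function name sketch_add_distances and the sample coordinates are hypothetical and exist only for illustration.

```python
import math


def sketch_add_distances(atoms, resid, atom_name="CA"):
    """Return copies of the atom records with a 'distance' key added."""
    # Locate the reference atom in the requested residue.
    ref = next(
        (a for a in atoms
         if a["residue_id"] == resid and a["atom_name"] == atom_name),
        None,
    )
    if ref is None:
        raise ValueError(f"atom {atom_name!r} not found in residue {resid}")
    ref_xyz = (ref["x"], ref["y"], ref["z"])
    # Euclidean distance from the reference atom to every atom.
    return [
        {**a, "distance": math.dist(ref_xyz, (a["x"], a["y"], a["z"]))}
        for a in atoms
    ]


atoms = [
    {"residue_id": 1, "atom_name": "CA", "x": 0.0, "y": 0.0, "z": 0.0},
    {"residue_id": 2, "atom_name": "CA", "x": 3.0, "y": 4.0, "z": 0.0},
]
for row in sketch_add_distances(atoms, resid=1):
    print(row["residue_id"], row["distance"])
```

The reference atom itself gets a distance of 0.0, which mirrors the description of computing the distance to all atoms in the table.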

alignment_tools.core.calculte_conservation_scores_per_residue(fasta_df: DataFrame, residue='C')

Calculates conservation scores per residue type across aligned sequences.

This function computes a conservation score for a specified residue type (e.g., ‘C’) based on the average frequency of that residue appearing in each position across a multiple sequence alignment. The input DataFrame is expected to have aligned sequences.

Parameters:

fasta_df (pandas.DataFrame): DataFrame containing aligned sequences with columns:

‘Proteins’: identifiers of sequences
‘sequence’: aligned sequences (strings) containing gaps if any

residue (str, optional): The amino acid residue to calculate conservation for (default is ‘C’).

Returns:

fasta_df (pandas.DataFrame): The original DataFrame augmented with a new column conserved_<residue> containing conservation scores per sequence position as semicolon-separated strings.
conserved_df (pandas.DataFrame): A numeric matrix (DataFrame) of the conservation scores per residue per position, indexed by ‘Proteins’.

Notes

The function filters out sequences with the exact value ‘..:.’ in the ‘sequence’ column.
Uses a helper function _tokenize (not provided here) to transform sequences into numeric vectors indicating the presence of the specified residue.
Conservation score per position is computed as the product of the residue presence matrix and the mean frequency across sequences.
The returned conserved_df can be used for further analysis or visualization.

Examples

fasta_df, conserved_df = calculte_conservation_scores_per_residue(fasta_df, residue='C')
fasta_df.columns
Index(['Proteins', 'sequence', 'tokens', 'conserved_C'], dtype='object')
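
The scoring rule in the notes (presence matrix multiplied by the mean per-column frequency) can be sketched with plain lists. This is an interpretation of that description, not the package's code: the name sketch_conservation_scores is hypothetical, the toy alignment is invented, and the real function works on a DataFrame through its _tokenize helper.

```python
def sketch_conservation_scores(sequences, residue="C"):
    """Score every alignment position of every sequence for one residue."""
    # Presence matrix: 1 where the sequence carries the residue, else 0.
    presence = [[1 if ch == residue else 0 for ch in seq] for seq in sequences]
    n_seqs = len(sequences)
    # Mean frequency of the residue in each alignment column.
    column_freq = [sum(col) / n_seqs for col in zip(*presence)]
    # Element-wise product: a position scores the column frequency only
    # where this particular sequence has the residue.
    return [[p * f for p, f in zip(row, column_freq)] for row in presence]


aligned = ["AC-C", "ACGC", "A-GC"]  # toy alignment of equal-length rows
for seq, scores in zip(aligned, sketch_conservation_scores(aligned)):
    # Semicolon-separated form, similar to the conserved_<residue> column.
    print(seq, ";".join(f"{s:.2f}" for s in scores))
```

In this toy alignment the last column is fully conserved, so every sequence scores 1.00 there, while the second column scores 0.67 only in the two sequences that carry the residue.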

alignment_tools.core.convert_aln2fasta(input_file, output_file)

Converts a CLUSTAL .aln alignment file to FASTA format.

This function reads a CLUSTAL-formatted alignment file (commonly with .aln extension), parses the sequence blocks, reconstructs full sequences per identifier, and writes them in standard FASTA format to the output file.

Parameters:

input_file (str): Path to the input .aln file in CLUSTAL format.
output_file (str): Path to the output .fasta file to be created.

Returns:

None: Writes output to the specified FASTA file.

Notes

Skips CLUSTAL headers, empty lines, and consensus lines (lines containing ‘*’).
Handles multiple alignment blocks and concatenates sequences by ID.
Assumes alignment lines are whitespace-separated: <ID> <SEQ>.

Examples

convert_aln2fasta("alignment.aln", "alignment.fasta")
Converted alignment.aln to alignment.fasta
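
The block-concatenation logic described in the notes can be sketched on an in-memory string. The function name sketch_clustal_to_fasta and the toy alignment are hypothetical. Skipping lines that start with whitespace is an extra assumption added here so that consensus lines without ‘*’ are also ignored.

```python
def sketch_clustal_to_fasta(aln_text: str) -> str:
    """Convert CLUSTAL-formatted text to FASTA text."""
    sequences = {}  # insertion order keeps the original sequence order
    for line in aln_text.splitlines():
        if not line.strip():
            continue  # empty line between blocks
        if line.startswith("CLUSTAL"):
            continue  # header line
        if line[0].isspace() or "*" in line:
            continue  # consensus line
        parts = line.split()
        if len(parts) < 2:
            continue  # not an "<ID> <SEQ>" line
        seq_id, fragment = parts[0], parts[1]
        # Append this block's fragment to what was collected so far.
        sequences[seq_id] = sequences.get(seq_id, "") + fragment
    return "".join(f">{sid}\n{seq}\n" for sid, seq in sequences.items())


toy_alignment = (
    "CLUSTAL W (1.83) multiple sequence alignment\n"
    "\n"
    "seq1    MKT-LL\n"
    "seq2    MKTALL\n"
    "        *** **\n"
    "\n"
    "seq1    ILT\n"
    "seq2    ILS\n"
    "        **\n"
)
print(sketch_clustal_to_fasta(toy_alignment), end="")
```

Gap characters are kept in the output, so the FASTA records remain aligned.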

alignment_tools.core.download_alphafold_pdb_from_uniprot_id(uniprot_id: str, database_version: str = 'v4') → None

Downloads the AlphaFold PDB structure file for a given UniProt ID.

Parameters:

uniprot_id (str): The UniProt identifier of the protein to download (e.g., ‘P12345’).
database_version (str, optional): The AlphaFold database version to use (default is ‘v4’).

Returns:

None: The function saves the downloaded PDB file in the current working directory with the filename format: ‘AF-{uniprot_id}-F1.pdb’.

Notes

The function uses curl via os.system(), so it requires curl to be installed and available in the system’s PATH.
The file is fetched from: https://alphafold.ebi.ac.uk/files/
The downloaded file corresponds to: ‘AF-{uniprot_id}-F1-model_{database_version}.pdb’ but it is saved locally as: ‘AF-{uniprot_id}-F1.pdb’
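
The naming scheme in these notes can be made concrete with a small helper that builds the remote URL and the local file name without touching the network. The function name sketch_alphafold_paths is hypothetical. Joining the base URL and the remote file name directly is an assumption based on the two notes above.

```python
def sketch_alphafold_paths(uniprot_id: str, database_version: str = "v4"):
    """Return (remote URL, local file name) for an AlphaFold model."""
    remote_name = f"AF-{uniprot_id}-F1-model_{database_version}.pdb"
    # Assumption: the file sits directly under the documented base URL.
    url = f"https://alphafold.ebi.ac.uk/files/{remote_name}"
    # The local copy drops the "-model_<version>" suffix.
    local_name = f"AF-{uniprot_id}-F1.pdb"
    return url, local_name


url, local_name = sketch_alphafold_paths("P12345")
print(url)
print(local_name)
```

Because the version suffix is dropped locally, downloading a different database version for the same UniProt ID overwrites the earlier file.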

alignment_tools.core.find_max_tuple(tuples)

Returns the maximum tuple from a list of tuples, ignoring any tuples containing None.

Parameters:

tuples (list of tuple): A list of tuples to evaluate.

Returns:

tuple or None: The tuple with the largest value, or None if no valid tuples exist.

alignment_tools.core.find_min_tuple(tuples)

Returns the minimum tuple from a list of tuples, ignoring any tuples containing None.

Parameters:

tuples (list of tuple): A list of tuples to evaluate.

Returns:

tuple or None: The tuple with the smallest value, or None if no valid tuples exist.
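
A minimal sketch of both helpers follows. It assumes that “largest” and “smallest” mean Python's default lexicographic tuple ordering, which is not confirmed by the text above. The sketch_ names are hypothetical.

```python
def _sketch_valid(tuples):
    """Keep only tuples that contain no None entries."""
    return [t for t in tuples if all(v is not None for v in t)]


def sketch_find_max_tuple(tuples):
    valid = _sketch_valid(tuples)
    # Assumption: tuples are compared lexicographically.
    return max(valid) if valid else None


def sketch_find_min_tuple(tuples):
    valid = _sketch_valid(tuples)
    return min(valid) if valid else None


pairs = [(1, 5), (None, 9), (3, 2)]
print(sketch_find_max_tuple(pairs))  # prints (3, 2)
print(sketch_find_min_tuple(pairs))  # prints (1, 5)
print(sketch_find_max_tuple([(None, None)]))  # prints None
```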

alignment_tools.core.find_motif_positions(sequence: str, motif: str) → int

Finds the 1-based index of the first occurrence of a motif in a given sequence.

This function searches for the first exact occurrence of the specified motif (substring) within a larger sequence (e.g., a protein or nucleotide sequence). It returns the index of the first character of that match using 1-based indexing, which is standard in bioinformatics.

Parameters:

sequence (str): The full sequence in which to search for the motif.
motif (str): The motif (substring) to locate within the sequence.

Returns:

int: The 1-based position of the first occurrence of the motif in the sequence. Returns None if the motif is not found or an error occurs.

Examples

find_motif_positions("ACDEFGHIK", "EFG")
4

find_motif_positions("ACDEFGHIK", "XYZ")
None
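
The 1-based convention amounts to adding one to the 0-based index that Python string search returns. A minimal sketch, with the hypothetical name sketch_find_motif_position and assuming an exact substring search:

```python
def sketch_find_motif_position(sequence: str, motif: str):
    """Return the 1-based start of the first match, or None if absent."""
    idx = sequence.find(motif)  # 0-based, -1 when the motif is missing
    return idx + 1 if idx >= 0 else None


print(sketch_find_motif_position("ACDEFGHIK", "EFG"))  # prints 4
print(sketch_find_motif_position("ACDEFGHIK", "XYZ"))  # prints None
```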

alignment_tools.core.find_peptide_positions_within_full_sequence(fullsequence: str, peptide: str) → str

Find the start and end positions of a peptide within a full sequence.

Uses alignment_functions.find_motif_positions to locate the starting index of the peptide motif in the full sequence, then calculates the end position based on the peptide length.

Parameters:

fullsequence (str): The full protein or nucleotide sequence in which to search.
peptide (str): The peptide sequence (motif) to find within the full sequence.

Returns:

str: A string formatted as ‘start_end’ representing the 1-based inclusive positions of the peptide within the full sequence.

Notes

The function relies on alignment_functions.find_motif_positions for the start position, which that function documents as 1-based. The end position then follows as start + len(peptide) - 1.

alignment_tools.core.find_residue_position(sequence: str, start_pos: int = 1, residue: str = 'C') → str

Finds all positions of a specific residue in a given sequence and returns them as a semicolon-separated string.

This function scans the input sequence for occurrences of a specified residue (e.g., ‘C’ for cysteine) and returns their positions, adjusted by the given start position (useful when the sequence is a subregion of a longer protein).

Parameters:

sequence (str): The amino acid sequence in which to search.
start_pos (int, optional): The starting position of the sequence relative to a global context (default is 1). This is added to the zero-based index to get the correct biological position.
residue (str, optional): The single-letter code of the amino acid to search for (default is ‘C’).

Returns:

str: A semicolon-separated string of 1-based positions where the residue is found. Returns an empty string if the residue is not found.

Examples

find_residue_position("ACKCDEFGC", start_pos=1, residue='C')
'2;4;9'

find_residue_position("MQDRVKRPMNAFIVWSRDQRRKMALEN", start_pos=10, residue='R')
'13;16;26;29;30'
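
The offset arithmetic can be sketched in a few lines. Following the parameter description, start_pos is added to the zero-based index of each match. The name sketch_find_residue_position is hypothetical.

```python
def sketch_find_residue_position(sequence: str, start_pos: int = 1,
                                 residue: str = "C") -> str:
    """Return semicolon-separated positions of residue in sequence."""
    positions = [
        str(i + start_pos)  # zero-based index shifted by start_pos
        for i, ch in enumerate(sequence)
        if ch == residue
    ]
    return ";".join(positions)


print(sketch_find_residue_position("ACKCDEFGC"))  # prints 2;4;9
print(sketch_find_residue_position("CKRCR", start_pos=10, residue="R"))
# prints 12;14
```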

alignment_tools.core.get_Cys_sasa_per_protein(uniprot_id, resids)

Retrieves the solvent accessible surface area (SASA) values for the sulfur atom (SG) of cysteine residues specified by residue IDs from an AlphaFold SASA-annotated PDB file.

Parameters:

uniprot_id (str): The UniProt identifier of the protein. The function expects a corresponding AlphaFold SASA PDB file named ‘AF-{uniprot_id}-F1.sasa.pdb’ in the working directory.
resids (str): A semicolon-separated string of residue IDs (integers) corresponding to cysteine residues. Example: “45;78;102”

Returns:

str: A semicolon-separated string of SASA values (as strings) for the sulfur atom (SG) of each cysteine residue in the input list. If SASA cannot be found or an error occurs for a residue, ‘n.d.’ (not determined) is returned for that residue.

Notes

Internally uses a helper function _get_CYS_sasa_for_residue_id that attempts to extract the ‘bfactor’ value (used here to store SASA) for the SG atom of the specified residue.
The function handles exceptions gracefully, returning ‘n.d.’ for missing or problematic residues, and returns None silently if the entire process fails.

alignment_tools.core.get_domain_part(x, start, end, tolerance=20) → str

Extracts a substring from a protein sequence with extended boundaries.

This function extracts a region of a protein sequence around a specified start and end position, extending the region by a specified tolerance (in amino acids) on both sides. The function ensures that the extended range does not exceed the sequence boundaries.

Parameters:

x (str): The full protein sequence.
start (int): The starting index of the domain (1-based, inclusive).
end (int): The ending index of the domain (1-based, inclusive).
tolerance (int, optional): Number of residues to extend on both sides of the domain (default is 20).

Returns:

str: The extracted domain region from the protein sequence, including the tolerance extension, if within bounds.

Notes

Indexing is adjusted for 0-based Python slicing.
If the start or end with tolerance goes beyond the sequence, it is clipped to valid indices.
Prints the start and end indices used for slicing (for debugging).

Examples

get_domain_part("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 5, 12, tolerance=2)
'CDEFGHIJKLMN'
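
The boundary handling can be sketched as follows. The name sketch_get_domain_part is hypothetical, and the debugging print mentioned in the notes is left out.

```python
def sketch_get_domain_part(x: str, start: int, end: int,
                           tolerance: int = 20) -> str:
    """Slice x around a 1-based inclusive [start, end] range, padded by tolerance."""
    # Convert the 1-based start to a 0-based slice index and clip at 0.
    lo = max(start - 1 - tolerance, 0)
    # A 1-based inclusive end equals the 0-based exclusive end; clip at len(x).
    hi = min(end + tolerance, len(x))
    return x[lo:hi]


print(sketch_get_domain_part("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 5, 12, tolerance=2))
# prints CDEFGHIJKLMN
print(sketch_get_domain_part("ABCDEF", 2, 5))
# prints ABCDEF (both sides clipped to the sequence)
```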

alignment_tools.core.get_matching_peptides_index(sites: str, digested_peptidess_intervals: str) → str

Find indexes of digested peptide intervals that contain any of the given sites.

Parses a semicolon-separated list of numeric sites and a semicolon-separated list of interval strings (in ‘start_end’ format), then identifies which intervals contain at least one of the sites.

Parameters:

sites (str): A semicolon-separated string of integers representing site positions (e.g., “71;87;129”).
digested_peptidess_intervals (str): A semicolon-separated string of peptide intervals in ‘start_end’ format (e.g., “70_80;81_103;127_140”).

Returns:

str: A semicolon-separated string of indexes (0-based) corresponding to intervals that contain at least one of the given sites.
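
The interval test can be sketched as below. It assumes both interval bounds are inclusive, which the description does not state explicitly. The name sketch_matching_indexes is hypothetical.

```python
def sketch_matching_indexes(sites: str, intervals: str) -> str:
    """Return 0-based indexes of 'start_end' intervals containing any site."""
    site_values = [int(s) for s in sites.split(";")]
    hits = []
    for i, interval in enumerate(intervals.split(";")):
        start, end = (int(v) for v in interval.split("_"))
        # Assumption: start and end are both inclusive.
        if any(start <= s <= end for s in site_values):
            hits.append(str(i))
    return ";".join(hits)


print(sketch_matching_indexes("71;87;129", "70_80;81_103;127_140"))
# prints 0;1;2
print(sketch_matching_indexes("5", "1_4;5_9"))
# prints 1
```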

alignment_tools.core.get_min_max_residueIDs_from_reference_residue(uniprot_id, ref_residue_id: int, neighbor_distance: int = 6, atom_name='CA')

Identifies the minimum and maximum residue IDs within a specified distance from a reference residue in a protein structure.

Parameters:

uniprot_id (str): The UniProt ID of the protein. Used to construct the PDB file name (expected format: ‘AF-{uniprot_id}-F1.pdb’).
ref_residue_id (int): The residue ID to use as the reference point.
neighbor_distance (int, optional): The distance threshold (in Ångströms) to define neighboring residues (default is 6 Å).
atom_name (str, optional): The atom name in the reference residue to calculate distance from (default is ‘CA’ for alpha carbon).

Returns:

tuple: A tuple of (minimum residue ID, maximum residue ID) within the distance threshold.

Raises:

ValueError: If the reference atom is not found in the structure.

Notes

Requires the corresponding PDB file to be named ‘AF-{uniprot_id}-F1.pdb’ and located in the working directory.
Uses Euclidean distance from the specified atom in the reference residue.
Assumes the structure only contains one model.

alignment_tools.core.get_peptides_list_by_index(inds: str, list_peptides: str) → str

Select peptides by index from a semicolon-separated peptide list.

Parses inds as a semicolon-separated list of integer indexes, and uses them to extract peptides from the list_peptides string.

Parameters:

inds (str): A semicolon-separated string of integers representing the indexes of the peptides to select. Example: “0;2;4”
list_peptides (str): A semicolon-separated string of peptide sequences. Example: “PEP1;PEP2;PEP3;PEP4;PEP5”

Returns:

str or None: A semicolon-separated string of selected peptides, or None if an error occurs (e.g., index out of range or invalid input).
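
A minimal sketch of the selection follows, using the parameter examples above. The name sketch_select_peptides is hypothetical, and catching only ValueError and IndexError is an assumption about which errors lead to a None result.

```python
def sketch_select_peptides(inds: str, list_peptides: str):
    """Pick peptides by 0-based index; return None on invalid input."""
    try:
        peptides = list_peptides.split(";")
        return ";".join(peptides[int(i)] for i in inds.split(";"))
    except (ValueError, IndexError):
        # Non-integer index or index out of range.
        return None


print(sketch_select_peptides("0;2;4", "PEP1;PEP2;PEP3;PEP4;PEP5"))
# prints PEP1;PEP3;PEP5
print(sketch_select_peptides("7", "PEP1;PEP2"))
# prints None
```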

alignment_tools.core.make_dictionary_seqs(fasta: str, splitter='|') → dict

Parses a FASTA file and returns a dictionary and DataFrame of sequences.

This function reads a FASTA file and constructs:

1. A dictionary where keys are parsed from the FASTA header using a delimiter and values are the corresponding sequences.
2. A pandas DataFrame containing the same data with columns: ‘Proteins’ and ‘sequence’.

Parameters:

fasta (str): Path to the input FASTA file.
splitter (str, optional): A character used to split the FASTA header (default is ‘|’). The second part after the split (index 1) is used as the dictionary key. If splitter is None or an empty string, the full header is used.

Returns:

tuple: (fasta_ids_dic, fasta_df)

fasta_ids_dic (dict): Dictionary mapping parsed identifiers to sequences.
fasta_df (pandas.DataFrame): DataFrame with columns:

‘Proteins’: parsed identifiers
‘sequence’: full sequence strings

Raises:

IndexError: If the FASTA header does not contain at least two parts when using a splitter.
FileNotFoundError: If the FASTA file path is invalid.

Examples

dic, df = make_dictionary_seqs("example.fasta")
dic["P12345"]
'MKTLLILTCLVAVALARPKH'
df.head()
  Proteins              sequence
0   P12345  MKTLLILTCLVAVALARPKH

alignment_tools.core.make_fasta_file(sub_df, file_name='final.fasta')

Writes protein sequences from a DataFrame to a FASTA-format file.

Each row in the DataFrame should contain a protein sequence and a corresponding protein name. The function creates a FASTA file where each sequence is preceded by a header line with the protein name.

Parameters:

sub_df (pandas.DataFrame): A DataFrame containing at least two columns:

‘sequence’: the protein sequence as a string
‘Proteins’: the identifier or name of the protein

file_name (str, optional): The name of the output FASTA file (default is “final.fasta”).

Returns:

None: The function writes to a file and returns nothing.

Raises:

KeyError: If the ‘sequence’ or ‘Proteins’ column is not present in sub_df.

alignment_tools.core.make_input_data(df, column)

Converts a DataFrame column containing array-like objects into a clean 2D DataFrame.

This function stacks the values in the specified column (which should contain array-like entries, such as lists or arrays) into a 2D NumPy array, constructs a new DataFrame from it, and sets its index to the original uniprot_id column.

Parameters:

df (pandas.DataFrame): The input DataFrame containing a column of array-like values and a uniprot_id column.
column (str): The name of the column in df that contains the array-like values to be stacked.

Returns:

df_clean (pandas.DataFrame): A new DataFrame where each row corresponds to one entry in the original column, and the index is set to uniprot_id.

Raises:

KeyError: If column or uniprot_id does not exist in the input DataFrame.
ValueError: If the entries in the specified column are not array-like or are inconsistent in length.

alignment_tools.core.make_interproscan_annotaiton(annotation_file: str)

Parses an InterProScan annotation file and extracts entries with ATP-related domains.

This function reads an InterProScan annotation TSV file (without a header), assigns meaningful column names, and filters for entries where the domain description contains the substring “ATP” (e.g., ATP-binding, ATPase, etc.).

Parameters:

annotation_file (str): Path to the InterProScan annotation file (TSV format, no header expected).

Returns:

pd.DataFrame

A DataFrame containing only the rows where the domain name includes “ATP”. The DataFrame includes the following columns:

uniprot : str

hash : str

length : int

db : str

pfam : str

domain : str

start_position : int

end_position : int

Notes

Assumes the file has at least 8 tab-separated columns.
No validation is performed on the structure or content of the file beyond column count.
Filtering is case-sensitive (i.e., only matches “ATP”, not “atp”).
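
The column assignment and the case-sensitive “ATP” filter can be sketched with the standard csv module in place of pandas. The column names come from the Returns section above. The function name sketch_filter_atp and the two toy rows are invented for illustration and do not come from a real InterProScan run.

```python
import csv
import io

COLUMNS = ["uniprot", "hash", "length", "db", "pfam",
           "domain", "start_position", "end_position"]


def sketch_filter_atp(tsv_text: str):
    """Return rows (as dicts) whose domain description contains 'ATP'."""
    rows = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) < len(COLUMNS):
            continue  # assumption: skip lines with too few columns
        row = dict(zip(COLUMNS, fields[:len(COLUMNS)]))
        if "ATP" in row["domain"]:  # case-sensitive substring match
            rows.append(row)
    return rows


# Toy rows with placeholder values, for illustration only.
toy_tsv = (
    "ID1\th1\t300\tPfam\tPF_DEMO1\tATPase domain\t10\t120\n"
    "ID2\th2\t250\tPfam\tPF_DEMO2\tatp-binding region\t5\t90\n"
)
for row in sketch_filter_atp(toy_tsv):
    print(row["uniprot"], row["domain"])
```

Only the first toy row is kept, because the lowercase “atp” in the second row does not match the case-sensitive filter.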

alignment_tools.core.parse_fasta_sequences(input_file, from_site=0, to_site=None)

Parses a FASTA file and returns a dictionary of sequence IDs and their (optionally sliced) sequences.

This function reads a FASTA file and returns a dictionary where each key is a sequence ID (from the FASTA header) and the value is the corresponding nucleotide or amino acid sequence. Optionally, you can extract a slice of the sequence using from_site and to_site.

Parameters:

input_file (str): Path to the FASTA file to be parsed.
from_site (int, optional): Start index for slicing the sequence (default is 0, inclusive).
to_site (int or None, optional): End index for slicing the sequence (default is None, meaning slice until the end).

Returns:

dict: A dictionary where keys are sequence IDs and values are sequences (or sliced subsequences).

Notes

Slicing follows Python’s 0-based indexing and is inclusive of from_site, exclusive of to_site.
Sequences are returned as plain strings.

Examples

parse_fasta_sequences("example.fasta")
{'seq1': 'MKTLLILTCLVAVALARPKH', 'seq2': 'GAVRQKLIED'}

parse_fasta_sequences("example.fasta", from_site=5, to_site=10)
{'seq1': 'ILTCL', 'seq2': 'KLIED'}
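
The parsing and slicing behavior can be sketched on an iterable of lines without extra dependencies. The name sketch_parse_fasta is hypothetical. Taking the first whitespace-delimited token of the header as the sequence ID is an assumption.

```python
def sketch_parse_fasta(lines, from_site=0, to_site=None):
    """Parse FASTA lines into {id: sequence[from_site:to_site]}."""
    sequences = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the first token after '>'.
            current_id = (line[1:].split() or [""])[0]
            sequences[current_id] = ""
        elif current_id is not None:
            # Sequence data may span several lines; join them.
            sequences[current_id] += line
    # Python slicing: from_site inclusive, to_site exclusive, 0-based.
    return {sid: seq[from_site:to_site] for sid, seq in sequences.items()}


toy_fasta = ">seq1 demo\nABCDEFGH\nIJ\n>seq2\nKLMNOPQRST\n"
print(sketch_parse_fasta(toy_fasta.splitlines()))
print(sketch_parse_fasta(toy_fasta.splitlines(), from_site=2, to_site=5))
# second call prints {'seq1': 'CDE', 'seq2': 'MNO'}
```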

alignment_tools.core.replace_scores_in_aligned_values(list1, list2)

Replaces non-zero values in a list with new values from another list.

This function takes two inputs:

list1: a string representation of a Python list containing numerical values, possibly with zeros indicating positions to keep unchanged.
list2: a list (or iterable) of values that will replace the non-zero elements in list1 in the order they appear.

The function returns a NumPy array where each non-zero element in the parsed list1 is replaced by the corresponding value from list2.

Parameters:

list1 (str): String representation of a list of floats, e.g. ‘[0.0, 1.5, 0.0, 2.3]’. Zero values indicate positions to be preserved.
list2 (list or iterable of float-compatible values): List of replacement values for non-zero positions in list1.

Returns:

numpy.ndarray: Array with non-zero positions replaced by values from list2.

Raises:

ValueError: If the number of non-zero elements in list1 does not match the length of list2.

Examples

replace_scores_in_aligned_values('[0, 2.0, 0, 3.5]', [10, 20])
array([ 0., 10.,  0., 20.])
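
The replacement rule can be sketched without NumPy. The version below returns a plain list rather than an array and parses list1 with ast.literal_eval. Both choices are assumptions made for illustration, and the name sketch_replace_nonzero is hypothetical.

```python
import ast


def sketch_replace_nonzero(list1: str, list2):
    """Replace non-zero entries of the parsed list1 with values from list2."""
    values = [float(v) for v in ast.literal_eval(list1)]
    replacements = [float(v) for v in list2]
    n_nonzero = sum(1 for v in values if v != 0)
    if n_nonzero != len(replacements):
        raise ValueError(
            f"{n_nonzero} non-zero positions but {len(replacements)} replacements"
        )
    it = iter(replacements)
    # Zeros stay in place; each non-zero slot takes the next replacement.
    return [next(it) if v != 0 else 0.0 for v in values]


print(sketch_replace_nonzero("[0, 2.0, 0, 3.5]", [10, 20]))
# prints [0.0, 10.0, 0.0, 20.0]
```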

alignment_tools.core.run_clustal(input_file: str, draw_tree=False)

Runs ClustalW on a FASTA file to generate a multiple sequence alignment (MSA) and optionally draws a dendrogram.

This function runs the external ClustalW program on the given FASTA file, producing an alignment file (.aln) and a dendrogram file (.dnd). It reads the alignment and computes a consensus sequence. Optionally, it draws and displays the phylogenetic tree.

Parameters:

input_file (str): Path to the input FASTA file containing sequences to be aligned.
draw_tree (bool, optional): If True, displays a dendrogram of the phylogenetic tree (default is False).

Returns:

alignment (Bio.Align.MultipleSeqAlignment): The multiple sequence alignment object read from the ClustalW output.
aln_output_file (str): Path to the generated alignment file (.aln).

Raises:

subprocess.CalledProcessError: If the ClustalW command fails.

Notes

Requires ClustalW installed and available in the system PATH.
Uses Biopython modules: AlignIO, AlignInfo, Phylo.
Uses matplotlib.pyplot for drawing the tree.

Examples

alignment, aln_file = run_clustal("sequences.fasta", draw_tree=True)
print(alignment)

alignment_tools.core.run_interproscan(input_file: str, interpro_command='/media/LIMS/Src/fasta_annotation/interproscan-5.73-104.0/interproscan.sh') → None

Runs InterProScan to annotate protein sequences with domain and family information.

This function executes the InterProScan command-line tool on a given FASTA file to generate protein domain annotations (e.g., Pfam, TIGRFAM, etc.) in TSV format.

Parameters:

input_file (str): Path to the input FASTA file containing protein sequences to annotate.
interpro_command (str, optional): Full path to the InterProScan executable script (default is a specific local path).

Returns:

None: The function does not return a value but generates an output TSV file in the same directory as the input file, with annotations from InterProScan.

Raises:

subprocess.CalledProcessError: If the InterProScan command fails.

Notes

Ensure InterProScan is installed and the path provided in interpro_command is correct.
The output TSV file will have the same base name as the input but with a .tsv extension.
Additional command-line options (e.g., output directory, formats) can be added by modifying the command list.

Examples

run_interproscan("proteins.fasta")
