alignment_tools package
Submodules
alignment_tools.core module
- alignment_tools.core.PDB2dataFrame(id: str, pdb_file: str) DataFrame
Parses a PDB file and extracts atomic-level information into a pandas DataFrame.
- Parameters:
- idstr
An identifier for the structure (e.g., PDB ID or custom name).
- pdb_filestr
Path to the PDB file to be parsed.
- Returns:
- pd.DataFrame
A DataFrame where each row corresponds to an atom in the structure. Columns include:
‘atom_name’: Name of the atom (e.g., ‘CA’, ‘N’).
‘residue_name’: 3-letter residue name (e.g., ‘ALA’).
‘chain_id’: Chain identifier (e.g., ‘A’).
‘residue_id’: Residue number.
‘x’, ‘y’, ‘z’: Cartesian coordinates of the atom.
‘bfactor’: B-factor (temperature factor).
‘occupancy’: Occupancy value of the atom.
‘element’: Chemical element symbol.
‘id’: The structure ID passed to the function.
Notes
Uses Biopython’s PDBParser to extract atomic data.
Only includes atoms from the first model (if multiple models exist).
Skips hydrogen atoms only if they are not present in the file (no filtering is done).
- alignment_tools.core.batch_find_sequence_start_end(fullsequence: str, subsequence_grouped: str) str
Find the start and end positions of multiple peptide subsequences within a full sequence.
Splits the subsequence_grouped string (which contains semicolon-separated peptide subsequences), then uses find_peptide_positions_within_full_sequence to compute the start and end position of each peptide in the fullsequence.
- Parameters:
- fullsequencestr
The full protein or nucleotide sequence in which to search for subsequences.
- subsequence_groupedstr
A semicolon-separated string of peptide subsequences (e.g., “ABC;DEF;GHI”).
- Returns:
- str
A semicolon-separated string of start and end positions for each peptide, in the format ‘start_end;start_end;…’.
Notes
The function assumes each peptide appears exactly once in the full sequence. If a peptide is not found, find_peptide_positions_within_full_sequence may raise an error or return an invalid result depending on its internal logic.
- alignment_tools.core.calculate_SASA_from_alphafold_pdb(uniprot_id: str) None
Calculates the solvent accessible surface area (SASA) of a protein structure from an AlphaFold PDB file and writes the SASA data to a new PDB file.
- Parameters:
- uniprot_idstr
The UniProt identifier corresponding to the AlphaFold structure file named ‘AF-{uniprot_id}-F1.pdb’ which must be present in the working directory.
- Returns:
- None
The function writes a new PDB file named ‘AF-{uniprot_id}-F1.sasa.pdb’ with SASA values included.
Notes
Uses the freesasa Python library to calculate SASA.
The AlphaFold PDB file must exist in the current directory before calling this function.
Any exceptions during calculation or file handling are caught, and a failure message is printed. This prevents the program from crashing but suppresses detailed error information.
- alignment_tools.core.calculate_distance_from_ref_point(df: DataFrame, resid: int, atom_name: str = 'CA')
Calculates the Euclidean distance from a reference atom in a specified residue to all other atoms in the DataFrame.
- Parameters:
- dfpd.DataFrame
A DataFrame containing atomic coordinates and residue information. Expected columns: [‘residue_id’, ‘atom_name’, ‘x’, ‘y’, ‘z’].
- residint
The residue ID containing the reference atom.
- atom_namestr, optional
The name of the reference atom within the residue (default is ‘CA’).
- Returns:
- pd.DataFrame
The input DataFrame with an additional column ‘distance’ representing the Euclidean distance of each atom to the reference point.
- Raises:
- ValueError
If the specified reference atom is not found in the DataFrame.
- alignment_tools.core.calculte_conservation_scores_per_residue(fasta_df: DataFrame, residue='C')
Calculates conservation scores per residue type across aligned sequences.
This function computes a conservation score for a specified residue type (e.g., ‘C’) based on the average frequency of that residue appearing in each position across a multiple sequence alignment. The input DataFrame is expected to have aligned sequences.
- Parameters:
- fasta_dfpandas.DataFrame
DataFrame containing aligned sequences with columns: - ‘Proteins’: identifiers of sequences - ‘sequence’: aligned sequences (strings) containing gaps if any
- residuestr, optional
The amino acid residue to calculate conservation for (default is ‘C’).
- Returns:
- fasta_dfpandas.DataFrame
The original DataFrame augmented with a new column conserved_<residue> containing conservation scores per sequence position as semicolon-separated strings.
- conserved_dfpandas.DataFrame
A numeric matrix (DataFrame) of the conservation scores per residue per position, indexed by ‘Proteins’.
Notes
The function filters out sequences with the exact value ‘..:.’ in the ‘sequence’ column.
Uses a helper function _tokenize (not provided here) to transform sequences into numeric vectors indicating the presence of the specified residue.
Conservation score per position is computed as the product of the residue presence matrix and the mean frequency across sequences.
The returned conserved_df can be used for further analysis or visualization.
Examples
fasta_df, conserved_df = calculate_conservation_scores_per_residue(fasta_df, residue=’C’) fasta_df.columns Index([‘Proteins’, ‘sequence’, ‘tokens’, ‘conserved_C’], dtype=’object’)
- alignment_tools.core.convert_aln2fasta(input_file, output_file)
Converts a CLUSTAL .aln alignment file to FASTA format.
This function reads a CLUSTAL-formatted alignment file (commonly with .aln extension), parses the sequence blocks, reconstructs full sequences per identifier, and writes them in standard FASTA format to the output file.
- Parameters:
- input_filestr
Path to the input .aln file in CLUSTAL format.
- output_filestr
Path to the output .fasta file to be created.
- Returns:
- None
Writes output to the specified FASTA file.
Notes
Skips CLUSTAL headers, empty lines, and consensus lines (lines containing ‘*’).
Handles multiple alignment blocks and concatenates sequences by ID.
Assumes alignment lines are whitespace-separated: <ID> <SEQ>.
Examples
convert_aln2fasta(“alignment.aln”, “alignment.fasta”) Converted alignment.aln to alignment.fasta
- alignment_tools.core.download_alphafold_pdb_from_uniprot_id(uniprot_id: str, database_version: str = 'v4') None
Downloads the AlphaFold PDB structure file for a given UniProt ID.
- Parameters:
- uniprot_idstr
The UniProt identifier of the protein to download (e.g., ‘P12345’).
- database_versionstr, optional
The AlphaFold database version to use (default is ‘v4’).
- Returns:
- None
The function saves the downloaded PDB file in the current working directory with the filename format: ‘AF-{uniprot_id}-F1.pdb’.
Notes
The function uses curl via os.system(), so it requires curl to be installed and available in the system’s PATH.
The file is fetched from: https://alphafold.ebi.ac.uk/files/
The downloaded file corresponds to: ‘AF-{uniprot_id}-F1-model_{database_version}.pdb’ but it is saved locally as: ‘AF-{uniprot_id}-F1.pdb’
- alignment_tools.core.find_max_tuple(tuples)
Returns the maximum tuple from a list of tuples, ignoring any tuples containing None.
- Parameters:
- tupleslist of tuple
A list of tuples to evaluate.
- Returns:
- tuple or None
The tuple with the largest value, or None if no valid tuples exist.
- alignment_tools.core.find_min_tuple(tuples)
Returns the minimum tuple from a list of tuples, ignoring any tuples containing None.
- Parameters:
- tupleslist of tuple
A list of tuples to evaluate.
- Returns:
- tuple or None
The tuple with the smallest value, or None if no valid tuples exist.
- alignment_tools.core.find_motif_positions(sequence: str, motif: str) int
Finds the 1-based index of the first occurrence of a motif in a given sequence.
This function searches for the first exact occurrence of the specified motif (substring) within a larger sequence (e.g., a protein or nucleotide sequence). It returns the index of the first character of that match using 1-based indexing, which is standard in bioinformatics.
- Parameters:
- sequencestr
The full sequence in which to search for the motif.
- motifstr
The motif (substring) to locate within the sequence.
- Returns:
- int
The 1-based position of the first occurrence of the motif in the sequence. Returns None if the motif is not found or an error occurs.
Examples
find_motif_positions(“ACDEFGHIK”, “EFG”) 4
find_motif_positions(“ACDEFGHIK”, “XYZ”) None
- alignment_tools.core.find_peptide_positions_within_full_sequence(fullsequence: str, peptide: str) str
Find the start and end positions of a peptide within a full sequence.
Uses alignment_functions.find_motif_positions to locate the starting index of the peptide motif in the full sequence, then calculates the end position based on the peptide length.
- Parameters:
- fullsequencestr
The full protein or nucleotide sequence in which to search.
- peptidestr
The peptide sequence (motif) to find within the full sequence.
- Returns:
- str
A string formatted as ‘start_end’ representing the 1-based inclusive positions of the peptide within the full sequence.
Notes
The function assumes that alignment_functions.find_motif_positions returns the starting index (0-based) of the peptide in the full sequence.
- alignment_tools.core.find_residue_position(sequence: str, start_pos: int = 1, residue: str = 'C') str
Finds all positions of a specific residue in a given sequence and returns them as a semicolon-separated string.
This function scans the input sequence for occurrences of a specified residue (e.g., ‘C’ for cysteine) and returns their positions, adjusted by the given start position (useful when the sequence is a subregion of a longer protein).
- Parameters:
- sequencestr
The amino acid sequence in which to search.
- start_posint, optional
The starting position of the sequence relative to a global context (default is 1). This is added to the zero-based index to get the correct biological position.
- residuestr, optional
The single-letter code of the amino acid to search for (default is ‘C’).
- Returns:
- str
A semicolon-separated string of 1-based positions where the residue is found. Returns an empty string if the residue is not found.
Examples
find_residue_position(“ACKCDEFGC”, start_pos=1, residue=’C’) ‘2;4;9’
find_residue_position(“MQDRVKRPMNAFIVWSRDQRRKMALEN”, start_pos=10, residue=’R’) ‘17;20;21;23’
- alignment_tools.core.get_Cys_sasa_per_protein(uniprot_id, resids)
Retrieves the solvent accessible surface area (SASA) values for the sulfur atom (SG) of cysteine residues specified by residue IDs from an AlphaFold SASA-annotated PDB file.
- Parameters:
- uniprot_idstr
The UniProt identifier of the protein. The function expects a corresponding AlphaFold SASA PDB file named ‘AF-{uniprot_id}-F1.sasa.pdb’ in the working directory.
- residsstr
A semicolon-separated string of residue IDs (integers) corresponding to cysteine residues. Example: “45;78;102”
- Returns:
- str
A semicolon-separated string of SASA values (as strings) for the sulfur atom (SG) of each cysteine residue in the input list. If SASA cannot be found or an error occurs for a residue, ‘n.d.’ (not determined) is returned for that residue.
Notes
Internally uses a helper function _get_CYS_sasa_for_residue_id that attempts to extract the ‘bfactor’ value (used here to store SASA) for the SG atom of the specified residue.
The function handles exceptions gracefully, returning ‘n.d.’ for missing or problematic residues, and returns None silently if the entire process fails.
- alignment_tools.core.get_domain_part(x, start, end, tolerance=20) str
Extracts a substring from a protein sequence with extended boundaries.
This function extracts a region of a protein sequence around a specified start and end position, extending the region by a specified tolerance (in amino acids) on both sides. The function ensures that the extended range does not exceed the sequence boundaries.
- Parameters:
- xstr
The full protein sequence.
- startint
The starting index of the domain (1-based, inclusive).
- endint
The ending index of the domain (1-based, inclusive).
- toleranceint, optional
Number of residues to extend on both sides of the domain (default is 20).
- Returns:
- str
The extracted domain region from the protein sequence, including the tolerance extension, if within bounds.
Notes
Indexing is adjusted for 0-based Python slicing.
If the start or end with tolerance goes beyond the sequence, it is clipped to valid indices.
Prints the start and end indices used for slicing (for debugging).
Examples
get_domain_part(“ABCDEFGHIJKLMNOPQRSTUVWXYZ”, 5, 12,tolerance=2) ‘CDEFGHIJKLMN’
- alignment_tools.core.get_matching_peptides_index(sites: str, digested_peptidess_intervals: str) str
Find indexes of digested peptide intervals that contain any of the given sites.
Parses a semicolon-separated list of numeric sites and a semicolon-separated list of interval strings (in ‘start_end’ format), then identifies which intervals contain at least one of the sites.
- Parameters:
- sitesstr
A semicolon-separated string of integers representing site positions (e.g., “71;87;129”).
- digested_peptidess_intervalsstr
A semicolon-separated string of peptide intervals in ‘start_end’ format (e.g., “70_80;81_103;127_140”).
- Returns:
- str
A semicolon-separated string of indexes (0-based) corresponding to intervals that contain at least one of the given sites.
- alignment_tools.core.get_min_max_residueIDs_from_reference_residue(uniprot_id, ref_residue_id: int, neighbor_distance: int = 6, atom_name='CA')
Identifies the minimum and maximum residue IDs within a specified distance from a reference residue in a protein structure.
- Parameters:
- uniprot_idstr
The UniProt ID of the protein. Used to construct the PDB file name (expected format: ‘AF-{uniprot_id}-F1.pdb’).
- ref_residue_idint
The residue ID to use as the reference point.
- neighbor_distanceint, optional
The distance threshold (in Ångströms) to define neighboring residues (default is 6 Å).
- atom_namestr, optional
The atom name in the reference residue to calculate distance from (default is ‘CA’ for alpha carbon).
- Returns:
- tuple
A tuple containing: - minimum residue ID within the distance threshold - maximum residue ID within the distance threshold
- Raises:
- ValueError
If the reference atom is not found in the structure.
Notes
Requires the corresponding PDB file to be named ‘AF-{uniprot_id}-F1.pdb’ and located in the working directory.
Uses Euclidean distance from the specified atom in the reference residue.
Assumes the structure only contains one model.
- alignment_tools.core.get_peptides_list_by_index(inds: str, list_peptides: str) str
Select peptides by index from a semicolon-separated peptide list.
Parses inds as a semicolon-separated list of integer indexes, and uses them to extract peptides from the list_peptides string.
- Parameters:
- indsstr
A semicolon-separated string of integers representing the indexes of the peptides to select. Example: “0;2;4”
- list_peptidesstr
A semicolon-separated string of peptide sequences. Example: “PEP1;PEP2;PEP3;PEP4;PEP5”
- Returns:
- str or None
A semicolon-separated string of selected peptides, or None if an error occurs (e.g., index out of range or invalid input).
- alignment_tools.core.make_dictionary_seqs(fasta: str, splitter='|') dict
Parses a FASTA file and returns a dictionary and DataFrame of sequences.
This function reads a FASTA file and constructs: 1. A dictionary where keys are parsed from the FASTA header using a delimiter
and values are the corresponding sequences.
A pandas DataFrame containing the same data with columns: ‘Proteins’ and ‘sequence’.
- Parameters:
- fastastr
Path to the input FASTA file.
- splitterstr, optional
A character used to split the FASTA header (default is ‘|’). The second part after the split (index 1) is used as the dictionary key. If splitter is None or an empty string, the full header is used.
- Returns:
- tuple
- fasta_ids_dicdict
Dictionary mapping parsed identifiers to sequences.
- fasta_dfpandas.DataFrame
- DataFrame with columns:
‘Proteins’: parsed identifiers
‘sequence’: full sequence strings
- Raises:
- IndexError
If the FASTA header does not contain at least two parts when using a splitter.
- FileNotFoundError
If the FASTA file path is invalid.
Examples
dic, df = make_dictionary_seqs(“example.fasta”) dic[“P12345”] ‘MKTLLILTCLVAVALARPKH’ df.head()
Proteins sequence
0 P12345 MKTLLILTCLVAVALARPKH
- alignment_tools.core.make_fasta_file(sub_df, file_name='final.fasta')
Writes protein sequences from a DataFrame to a FASTA-format file.
Each row in the DataFrame should contain a protein sequence and a corresponding protein name. The function creates a FASTA file where each sequence is preceded by a header line with the protein name.
- Parameters:
- sub_dfpandas.DataFrame
A DataFrame containing at least two columns: - ‘sequence’: the protein sequence as a string - ‘Proteins’: the identifier or name of the protein
- file_namestr, optional
The name of the output FASTA file (default is “final.fasta”).
- Returns:
- None
The function writes to a file and returns nothing.
- Raises:
- KeyError
If the ‘sequence’ or ‘Proteins’ column is not present in sub_df.
- alignment_tools.core.make_input_data(df, column)
Converts a DataFrame column containing array-like objects into a clean 2D DataFrame.
This function stacks the values in the specified column (which should contain array-like entries, such as lists or arrays) into a 2D NumPy array, constructs a new DataFrame from it, and sets its index to the original uniprot_id column.
- Parameters:
- dfpandas.DataFrame
The input DataFrame containing a column of array-like values and a uniprot_id column.
- columnstr
The name of the column in df that contains the array-like values to be stacked.
- Returns:
- df_cleanpandas.DataFrame
A new DataFrame where each row corresponds to one entry in the original column, and the index is set to uniprot_id.
- Raises:
- KeyError
If column or uniprot_id does not exist in the input DataFrame.
- ValueError
If the entries in the specified column are not array-like or are inconsistent in length.
- alignment_tools.core.make_interproscan_annotaiton(annotation_file: str)
Parses an InterProScan annotation file and extracts entries with ATP-related domains.
This function reads an InterProScan annotation TSV file (without a header), assigns meaningful column names, and filters for entries where the domain description contains the substring “ATP” (e.g., ATP-binding, ATPase, etc.).
- Parameters:
- annotation_filestr
Path to the InterProScan annotation file (TSV format, no header expected).
- Returns:
- pd.DataFrame
A DataFrame containing only the rows where the domain name includes “ATP”. The DataFrame includes the following columns:
uniprot : str
hash : str
length : int
db : str
pfam : str
domain : str
start_position : int
end_position : int
Notes
Assumes the file has at least 8 tab-separated columns.
No validation is performed on the structure or content of the file beyond column count.
Filtering is case-sensitive (i.e., only matches “ATP”, not “atp”).
- alignment_tools.core.parse_fasta_sequences(input_file, from_site=0, to_site=None)
Parses a FASTA file and returns a dictionary of sequence IDs and their (optionally sliced) sequences.
This function reads a FASTA file and returns a dictionary where each key is a sequence ID (from the FASTA header) and the value is the corresponding nucleotide or amino acid sequence. Optionally, you can extract a slice of the sequence using from_site and to_site.
- Parameters:
- input_filestr
Path to the FASTA file to be parsed.
- from_siteint, optional
Start index for slicing the sequence (default is 0, inclusive).
- to_siteint or None, optional
End index for slicing the sequence (default is None, meaning slice until the end).
- Returns:
- dict
A dictionary where keys are sequence IDs and values are sequences (or sliced subsequences).
Notes
Slicing follows Python’s 0-based indexing and is inclusive of from_site, exclusive of to_site.
Sequences are returned as plain strings.
Examples
parse_fasta_sequences(“example.fasta”) {‘seq1’: ‘MKTLLILTCLVAVALARPKH’, ‘seq2’: ‘GAVRQKLIED’}
parse_fasta_sequences(“example.fasta”, from_site=5, to_site=10) {‘seq1’: ‘LILTC’, ‘seq2’: ‘QKLIE’}
- alignment_tools.core.replace_scores_in_aligned_values(list1, list2)
Replaces non-zero values in a list with new values from another list.
This function takes two inputs: - list1: a string representation of a Python list containing numerical values,
possibly with zeros indicating positions to keep unchanged.
list2: a list (or iterable) of values that will replace the non-zero elements in list1 in the order they appear.
The function returns a NumPy array where each non-zero element in the parsed list1 is replaced by the corresponding value from list2.
- Parameters:
- list1str
String representation of a list of floats, e.g. ‘[0.0, 1.5, 0.0, 2.3]’. Zero values indicate positions to be preserved.
- list2list or iterable of float-compatible
List of replacement values for non-zero positions in list1.
- Returns:
- numpy.ndarray
Array with non-zero positions replaced by values from list2.
- Raises:
- ValueError
If the number of non-zero elements in list1 does not match the length of list2.
Examples
replace_scores_in_aligned_values(‘[0, 2.0, 0, 3.5]’, [10, 20]) array([ 0., 10., 0., 20.])
- alignment_tools.core.run_clustal(input_file: str, draw_tree=False)
Runs ClustalW on a FASTA file to generate a multiple sequence alignment (MSA) and optionally draws a dendrogram.
This function runs the external ClustalW program on the given FASTA file, producing an alignment file (.aln) and a dendrogram file (.dnd). It reads the alignment and computes a consensus sequence. Optionally, it draws and displays the phylogenetic tree.
- Parameters:
- input_filestr
Path to the input FASTA file containing sequences to be aligned.
- draw_treebool, optional
If True, displays a dendrogram of the phylogenetic tree (default is False).
- Returns:
- alignmentBio.Align.MultipleSeqAlignment
The multiple sequence alignment object read from the ClustalW output.
- aln_output_filestr
Path to the generated alignment file (.aln).
- Raises:
- subprocess.CalledProcessError
If the ClustalW command fails.
Notes
Requires ClustalW installed and available in the system PATH.
Uses Biopython modules: AlignIO, AlignInfo, Phylo.
Uses matplotlib.pyplot for drawing the tree.
Examples
alignment, aln_file = run_clustal(“sequences.fasta”, draw_tree=True) print(alignment)
- alignment_tools.core.run_interproscan(input_file: str, interpro_command='/media/LIMS/Src/fasta_annotation/interproscan-5.73-104.0/interproscan.sh') None
Runs InterProScan to annotate protein sequences with domain and family information.
This function executes the InterProScan command-line tool on a given FASTA file to generate protein domain annotations (e.g., Pfam, TIGRFAM, etc.) in TSV format.
- Parameters:
- input_filestr
Path to the input FASTA file containing protein sequences to annotate.
- interpro_commandstr, optional
Full path to the InterProScan executable script (default is a specific local path).
- Returns:
- None
The function does not return a value but generates an output TSV file in the same directory as the input file, with annotations from InterProScan.
- Raises:
- subprocess.CalledProcessError
If the InterProScan command fails.
Notes
Ensure InterProScan is installed and the path provided in interpro_command is correct.
The output TSV file will have the same base name as the input but with an .tsv extension.
Additional command-line options (e.g., output directory, formats) can be added by modifying the command list.
Examples
run_interproscan(“proteins.fasta”)