PDB File Parser#

High-performance PDB file parsing and molecular structure handling using the pdbreader library.

Module Overview#

This module parses PDB (Protein Data Bank) files with the pdbreader library and extracts atomic coordinates and molecular information. On top of the parsed records it provides error handling, automatic bond detection, and structure validation, and it exposes the results through dataclass objects (Atom, Residue, and Bond).

Main Classes#

PDBParser#

class hbat.core.pdb_parser.PDBParser[source]#

Bases: object

Parser for PDB format files using pdbreader.

High-performance PDB file parser with integrated structure analysis capabilities.

Key Features:

Robust Parsing: Handles malformed PDB files with comprehensive error recovery
Automatic Bond Detection: Identifies covalent bonds using distance criteria and atomic data
Element Mapping: Uses utility functions for accurate atom type identification
Structure Validation: Provides comprehensive structure quality assessment
Performance Optimization: Efficient processing of large molecular complexes

Usage Examples:

from hbat.core.pdb_parser import PDBParser

# Basic parsing: parse_file() returns True on success, False otherwise
parser = PDBParser()
if parser.parse_file("protein.pdb"):
    residues = parser.get_residue_list()
    bonds = parser.get_bonds()

    print(f"Found {len(residues)} residues")
    print(f"Detected {len(bonds)} bonds")
    print(f"Hydrogen atoms: {len(parser.get_hydrogen_atoms())}")

# Parsing with error handling
try:
    if not parser.parse_file("complex.pdb"):
        raise ValueError("parser reported a failure")

    # get_statistics() returns a Dict[str, Any]
    stats = parser.get_statistics()
    print(f"Statistics: {stats}")
    print(f"Has hydrogens: {parser.has_hydrogens()}")
    print(f"Chain count: {len(parser.get_chain_ids())}")

except (IOError, ValueError) as e:
    print(f"Parsing failed: {e}")

Performance Characteristics:

Processes ~50,000 atoms per second on modern hardware
Memory usage scales linearly with structure size
Efficient handling of large protein complexes (>100k atoms)
Optimized for both single structures and batch processing

__init__() → None[source]#

Initialize PDB parser.

Creates a new parser instance with empty atom and residue lists.

parse_file(filename: str) → bool[source]#

Parse a PDB file.

Reads and parses a PDB format file, extracting all ATOM and HETATM records and converting them to HBAT’s internal representation.

Parameters:: filename (str) – Path to the PDB file to parse
Returns:: True if parsing completed successfully, False otherwise
Return type:: bool
Raises:: IOError if file cannot be read

parse_lines(lines: List[str]) → bool[source]#

Parse PDB format lines.

Parses PDB format content provided as a list of strings, useful for processing in-memory PDB data.

Parameters:: lines (List[str]) – List of PDB format lines
Returns:: True if parsing completed successfully, False otherwise
Return type:: bool
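
To make the input to parse_lines concrete, the sketch below assembles one ATOM record column by column and slices it back apart using the fixed-width layout of the PDB format. HBAT delegates this work to pdbreader, so this is only an illustration of what each line carries; `SAMPLE_ATOM` and `parse_atom_line` are hypothetical names, and the dictionary keys simply mirror the Atom constructor parameters.

```python
from typing import Any, Dict

# One 80-column ATOM record, built field by field (column ranges are 1-based).
SAMPLE_ATOM = (
    "ATOM  "      # 1-6   record type
    "    1"       # 7-11  atom serial number
    " "           # 12    unused
    " N  "        # 13-16 atom name
    " "           # 17    alternate location indicator
    "ALA"         # 18-20 residue name
    " "           # 21    unused
    "A"           # 22    chain identifier
    "   1"        # 23-26 residue sequence number
    " "           # 27    insertion code
    "   "         # 28-30 unused
    "  11.104"    # 31-38 x coordinate
    "   6.134"    # 39-46 y coordinate
    "  -6.504"    # 47-54 z coordinate
    "  1.00"      # 55-60 occupancy
    "  0.00"      # 61-66 temperature factor
    "          "  # 67-76 unused
    " N"          # 77-78 element symbol
    "  "          # 79-80 charge
)


def parse_atom_line(line: str) -> Dict[str, Any]:
    """Slice a single ATOM/HETATM line into named fields."""
    line = line.ljust(80)  # tolerate lines with trailing columns trimmed
    return {
        "record_type": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "alt_loc": line[16].strip(),
        "res_name": line[17:20].strip(),
        "chain_id": line[21].strip(),
        "res_seq": int(line[22:26]),
        "i_code": line[26].strip(),
        "coords": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),
        "temp_factor": float(line[60:66]),
        "element": line[76:78].strip(),
        "charge": line[78:80].strip(),
    }


atom = parse_atom_line(SAMPLE_ATOM)
print(atom["name"], atom["res_name"], atom["coords"])
```

A list of strings in this layout is the kind of input parse_lines expects.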

get_atoms_by_element(element: str) → List[Atom][source]#

Get all atoms of specific element.

Parameters:: element (str) – Element symbol (e.g., ‘C’, ‘N’, ‘O’)
Returns:: List of atoms matching the element
Return type:: List[Atom]

get_atoms_by_residue(res_name: str) → List[Atom][source]#

Get all atoms from residues with specific name.

Parameters:: res_name (str) – Residue name (e.g., ‘ALA’, ‘GLY’)
Returns:: List of atoms from matching residues
Return type:: List[Atom]

get_hydrogen_atoms() → List[Atom][source]#

Get all hydrogen atoms.

Returns:: List of all hydrogen and deuterium atoms
Return type:: List[Atom]

has_hydrogens() → bool[source]#

Check if structure contains hydrogen atoms.

Determines if the structure has a reasonable number of hydrogen atoms compared to heavy atoms, indicating explicit hydrogen modeling.

Returns:: True if structure appears to contain explicit hydrogens
Return type:: bool
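
The check described above can be pictured as a ratio test. The following is a minimal sketch, not HBAT's implementation: the function name `appears_hydrogenated` and the 0.25 threshold are assumptions chosen for illustration, since this reference does not state the actual cutoff.

```python
from typing import Iterable


def appears_hydrogenated(elements: Iterable[str], min_ratio: float = 0.25) -> bool:
    """Return True when hydrogens are plentiful relative to heavy atoms.

    min_ratio is a hypothetical threshold, not a documented HBAT value.
    """
    hydrogens = 0
    heavy = 0
    for element in elements:
        # Deuterium is counted together with hydrogen.
        if element.strip().upper() in {"H", "D"}:
            hydrogens += 1
        else:
            heavy += 1
    if heavy == 0:
        return hydrogens > 0
    return hydrogens / heavy >= min_ratio


print(appears_hydrogenated(["C", "H", "H", "H", "O", "H"]))
print(appears_hydrogenated(["C", "N", "O", "C", "C", "H"]))
```

A ratio is more robust than a simple "any hydrogen present" test, because many crystal structures contain a handful of hydrogens on ligands while the protein itself has none.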

get_residue_list() → List[Residue][source]#

Get list of all residues.

Returns:: List of all residues in the structure
Return type:: List[Residue]

get_chain_ids() → List[str][source]#

Get list of unique chain IDs.

Returns:: List of unique chain identifiers in the structure
Return type:: List[str]

get_statistics() → Dict[str, Any][source]#

Get basic statistics about the structure.

Provides counts of atoms, residues, chains, and element composition.

Returns:: Dictionary containing structure statistics
Return type:: Dict[str, Any]
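
As a rough picture of what such a dictionary might hold, the sketch below computes the four kinds of counts named above from plain tuples. The key names ("atoms", "residues", "chains", "elements") and the `summarize` function are assumptions for illustration; the real keys returned by get_statistics are not listed in this reference.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple

# (element, res_name, chain_id, res_seq): a stand-in for Atom objects.
AtomRecord = Tuple[str, str, str, int]


def summarize(atoms: List[AtomRecord]) -> Dict[str, Any]:
    """Count atoms, residues, chains, and elements in a list of records."""
    residues = {(chain, seq, name) for _, name, chain, seq in atoms}
    # dict.fromkeys keeps first-seen order while removing duplicates.
    chains = list(dict.fromkeys(chain for _, _, chain, _ in atoms))
    return {
        "atoms": len(atoms),
        "residues": len(residues),
        "chains": len(chains),
        "elements": dict(Counter(element for element, _, _, _ in atoms)),
    }


records = [("N", "ALA", "A", 1), ("C", "ALA", "A", 1), ("O", "GLY", "B", 1)]
print(summarize(records))
```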

get_bonds() → List[Bond][source]#

Get list of all bonds.

Returns:: List of all bonds in the structure
Return type:: List[Bond]

get_bonds_for_atom(serial: int) → List[Bond][source]#

Get all bonds involving a specific atom.

Parameters:: serial (int) – Atom serial number
Returns:: List of bonds involving this atom
Return type:: List[Bond]

get_bonded_atoms(serial: int) → List[int][source]#

Get serial numbers of atoms bonded to the specified atom.

Parameters:: serial (int) – Atom serial number
Returns:: List of bonded atom serial numbers
Return type:: List[int]
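
Both lookups above amount to an adjacency query over the bond list. The sketch below shows one way to answer such queries, with bonds reduced to pairs of serial numbers; `build_adjacency` is a hypothetical helper, and nothing here describes how HBAT stores its bonds internally.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_adjacency(bonds: Iterable[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Map each atom serial to the serials of its bonded partners."""
    adjacency: Dict[int, List[int]] = defaultdict(list)
    for first, second in bonds:
        # A bond is symmetric, so record it from both ends.
        adjacency[first].append(second)
        adjacency[second].append(first)
    return dict(adjacency)


bonds = [(1, 2), (2, 3), (2, 4)]
adjacency = build_adjacency(bonds)
print(adjacency.get(2, []))  # [1, 3, 4]
print(adjacency.get(99, []))  # [] for an atom with no bonds
```

Building the index once makes each later query a dictionary lookup instead of a scan over every bond.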

get_bond_detection_statistics() → Dict[str, int][source]#

Get statistics about bond detection methods used.

Returns a dictionary with counts of bonds detected by each method.

Data Structure Classes#

Atom#

class hbat.core.pdb_parser.Atom(serial: int, name: str, alt_loc: str, res_name: str, chain_id: str, res_seq: int, i_code: str, coords: NPVec3D, occupancy: float, temp_factor: float, element: str, charge: str, record_type: str, residue_type: str = 'L', backbone_sidechain: str = 'S', aromatic: str = 'N')[source]#

Bases: object

Represents an atom from a PDB file.

This class stores all atomic information parsed from PDB format including coordinates, properties, and residue information.

Parameters:

serial (int) – Atom serial number
name (str) – Atom name
alt_loc (str) – Alternate location indicator
res_name (str) – Residue name
chain_id (str) – Chain identifier
res_seq (int) – Residue sequence number
i_code (str) – Insertion code
coords (NPVec3D) – 3D coordinates
occupancy (float) – Occupancy factor
temp_factor (float) – Temperature factor
element (str) – Element symbol
charge (str) – Formal charge
record_type (str) – PDB record type (ATOM or HETATM)

Comprehensive atomic data structure with PDB information and calculated properties.

Core Properties:

PDB Information: Serial number, name, residue context, coordinates
Chemical Properties: Element, formal charge, occupancy, B-factor
Geometric Properties: 3D coordinates as NPVec3D objects
Connectivity: Bond partners and chemical environment
Validation: Quality metrics and flags

Usage Example:

from hbat.core.pdb_parser import PDBParser

parser = PDBParser()
parser.parse_file("protein.pdb")

# Pick two atoms from the parsed structure
carbons = parser.get_atoms_by_element("C")
atom, other_atom = carbons[0], carbons[1]

# Attribute names follow the constructor parameters
print(f"Atom: {atom.name} ({atom.element})")
print(f"Residue: {atom.res_name} {atom.res_seq}")
print(f"Position: {atom.coords}")
print(f"B-factor: {atom.temp_factor:.2f}")

# Geometric calculations
distance = atom.coords.distance_to(other_atom.coords)
print(f"Distance: {distance:.2f} Å")

__init__(serial: int, name: str, alt_loc: str, res_name: str, chain_id: str, res_seq: int, i_code: str, coords: NPVec3D, occupancy: float, temp_factor: float, element: str, charge: str, record_type: str, residue_type: str = 'L', backbone_sidechain: str = 'S', aromatic: str = 'N') → None[source]#

Initialize an Atom object.

Parameters:

serial (int) – Atom serial number
name (str) – Atom name
alt_loc (str) – Alternate location indicator
res_name (str) – Residue name
chain_id (str) – Chain identifier
res_seq (int) – Residue sequence number
i_code (str) – Insertion code
coords (NPVec3D) – 3D coordinates
occupancy (float) – Occupancy factor
temp_factor (float) – Temperature factor
element (str) – Element symbol
charge (str) – Formal charge
record_type (str) – PDB record type (ATOM or HETATM)

is_hydrogen() → bool[source]#

Check if atom is hydrogen.

Returns:: True if atom is hydrogen or deuterium
Return type:: bool

is_metal() → bool[source]#

Check if atom is a metal.

Returns:: True if atom is a common metal ion
Return type:: bool

__iter__() → Iterator[Tuple[str, Any]][source]#

Iterate over atom attributes as (name, value) pairs.

Returns:: Iterator of (attribute_name, value) tuples
Return type:: Iterator[Tuple[str, Any]]

to_dict() → Dict[str, Any][source]#

Convert atom to dictionary.

Returns:: Dictionary representation of the atom
Return type:: Dict[str, Any]

classmethod fields() → List[str][source]#

Get list of field names.

Returns:: List of field names
Return type:: List[str]

__repr__() → str[source]#: String representation of the atom.

__eq__(other: object) → bool[source]#: Check equality with another Atom.
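
The __iter__, to_dict, and fields helpers follow a pattern that is easy to reproduce with a standard dataclass. The sketch below uses `MiniAtom`, a hypothetical three-field stand-in rather than the real Atom class, to show how the three helpers relate to one another; whether HBAT implements them this way is an assumption.

```python
import dataclasses
from typing import Any, Dict, Iterator, List, Tuple


@dataclasses.dataclass
class MiniAtom:
    """Hypothetical cut-down atom used only to illustrate the helpers."""

    serial: int
    name: str
    element: str

    def __iter__(self) -> Iterator[Tuple[str, Any]]:
        # Yield (field_name, value) pairs in declaration order.
        for field in dataclasses.fields(self):
            yield field.name, getattr(self, field.name)

    def to_dict(self) -> Dict[str, Any]:
        # dict() consumes the (name, value) pairs produced by __iter__.
        return dict(self)

    @classmethod
    def fields(cls) -> List[str]:
        return [field.name for field in dataclasses.fields(cls)]


atom = MiniAtom(serial=1, name="CA", element="C")
print(MiniAtom.fields())
print(atom.to_dict())
```

Because __iter__ yields pairs, `dict(atom)` and `atom.to_dict()` give the same result, and the dataclass decorator supplies __repr__ and __eq__ automatically.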

Residue#

class hbat.core.pdb_parser.Residue(name: str, chain_id: str, seq_num: int, i_code: str, atoms: List[Atom])[source]#

Bases: object

Represents a residue containing multiple atoms.

This class groups atoms belonging to the same residue and provides methods for accessing and analyzing residue-level information.

Parameters:

name (str) – Residue name (e.g., ‘ALA’, ‘GLY’)
chain_id (str) – Chain identifier
seq_num (int) – Residue sequence number
i_code (str) – Insertion code
atoms (List[Atom]) – List of atoms in this residue

Residue-level data structure containing atom collections and residue properties.

Properties:

Identification: Residue name, number, chain, insertion code
Atom Collections: All atoms, backbone atoms, side chain atoms
Chemical Classification: Protein, DNA, RNA, or heterogen residue
Geometric Properties: Center of mass, radius of gyration
Connectivity: Inter-residue bonds and interactions

Usage Example:

# Access residue information (parser is a PDBParser that has parsed a file)
residue = parser.get_residue_list()[0]

print(f"Residue: {residue.name} {residue.seq_num}")
print(f"Chain: {residue.chain_id}")
print(f"Atom count: {len(residue.atoms)}")

# Look up atoms and derived geometry
ca_atom = residue.get_atom("CA")  # None if the residue has no CA atom
nitrogens = residue.get_atoms_by_element("N")
ring_center = residue.get_aromatic_center()  # None for non-aromatic residues

print(f"Nitrogen atoms: {len(nitrogens)}")
print(f"Center of mass: {residue.center_of_mass()}")

__init__(name: str, chain_id: str, seq_num: int, i_code: str, atoms: List[Atom]) → None[source]#

Initialize a Residue object.

Parameters:

name (str) – Residue name (e.g., ‘ALA’, ‘GLY’)
chain_id (str) – Chain identifier
seq_num (int) – Residue sequence number
i_code (str) – Insertion code
atoms (List[Atom]) – List of atoms in this residue

get_atom(atom_name: str) → Atom | None[source]#

Get specific atom by name.

Parameters:: atom_name (str) – Name of the atom to find
Returns:: The atom if found, None otherwise
Return type:: Optional[Atom]

get_atoms_by_element(element: str) → List[Atom][source]#

Get all atoms of specific element.

Parameters:: element (str) – Element symbol (e.g., ‘C’, ‘N’, ‘O’)
Returns:: List of atoms matching the element
Return type:: List[Atom]

center_of_mass() → NPVec3D[source]#

Calculate center of mass of residue.

Computes the mass-weighted centroid of all atoms in the residue.

Returns:: Center of mass coordinates
Return type:: NPVec3D
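
A mass-weighted centroid is the sum of each position multiplied by its atomic mass, divided by the total mass. The sketch below works on plain tuples rather than NPVec3D, and the small `ATOMIC_MASS` table (approximate standard atomic weights) and the function name are illustrative assumptions, not HBAT's data.

```python
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float, float]

# Approximate standard atomic weights for a few common elements.
ATOMIC_MASS: Dict[str, float] = {
    "H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06,
}


def center_of_mass(atoms: Iterable[Tuple[str, Point]]) -> Point:
    """Return the mass-weighted centroid of (element, coords) pairs."""
    total = 0.0
    sum_x = sum_y = sum_z = 0.0
    for element, (x, y, z) in atoms:
        mass = ATOMIC_MASS[element]  # KeyError for elements not in the table
        total += mass
        sum_x += mass * x
        sum_y += mass * y
        sum_z += mass * z
    if total == 0.0:
        raise ValueError("cannot compute a center of mass without atoms")
    return (sum_x / total, sum_y / total, sum_z / total)


# The heavier oxygen pulls the centroid toward x = 1.
print(center_of_mass([("C", (0.0, 0.0, 0.0)), ("O", (1.0, 0.0, 0.0))]))
```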

get_aromatic_center() → NPVec3D | None[source]#

Calculate aromatic ring center if residue is aromatic.

For aromatic residues (PHE, TYR, TRP, HIS), calculates the geometric center of the aromatic ring atoms.

Returns:: Center coordinates of aromatic ring, None if not aromatic
Return type:: Optional[NPVec3D]
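
The geometric center of a ring is the unweighted mean of its atom positions. The sketch below illustrates the idea with standard PDB atom names for each ring. Which ring HBAT uses for TRP, which has two fused rings, is not stated in this reference, so the six-membered ring chosen here is an assumption, as are the names `AROMATIC_RING_ATOMS` and `aromatic_center`.

```python
from typing import Dict, Optional, Tuple

Point = Tuple[float, float, float]

SIX_RING = ("CG", "CD1", "CD2", "CE1", "CE2", "CZ")
AROMATIC_RING_ATOMS: Dict[str, Tuple[str, ...]] = {
    "PHE": SIX_RING,
    "TYR": SIX_RING,
    "HIS": ("CG", "ND1", "CD2", "CE1", "NE2"),
    # Assumption: only the six-membered ring of TRP is considered.
    "TRP": ("CD2", "CE2", "CE3", "CZ2", "CZ3", "CH2"),
}


def aromatic_center(res_name: str, coords_by_name: Dict[str, Point]) -> Optional[Point]:
    """Return the mean position of the ring atoms, or None if unavailable."""
    ring = AROMATIC_RING_ATOMS.get(res_name)
    if ring is None or any(name not in coords_by_name for name in ring):
        return None  # not aromatic, or the ring is incomplete
    points = [coords_by_name[name] for name in ring]
    count = len(points)
    return (
        sum(p[0] for p in points) / count,
        sum(p[1] for p in points) / count,
        sum(p[2] for p in points) / count,
    )


print(aromatic_center("ALA", {"CA": (0.0, 0.0, 0.0)}))  # None
```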

__iter__() → Iterator[Tuple[str, Any]][source]#

Iterate over residue attributes as (name, value) pairs.

Returns:: Iterator of (attribute_name, value) tuples
Return type:: Iterator[Tuple[str, Any]]

to_dict() → Dict[str, Any][source]#

Convert residue to dictionary.

Returns:: Dictionary representation of the residue
Return type:: Dict[str, Any]

classmethod fields() → List[str][source]#

Get list of field names.

Returns:: List of field names
Return type:: List[str]

__repr__() → str[source]#: String representation of the residue.

__eq__(other: object) → bool[source]#: Check equality with another Residue.

Bond#

class hbat.core.pdb_parser.Bond(atom1_serial: int, atom2_serial: int, bond_type: str = 'covalent', distance: float | None = None, detection_method: str = 'distance_based')[source]#

Bases: object

Represents a chemical bond between two atoms.

This class stores information about atomic bonds, including the atoms involved and bond type/origin.

Parameters:

atom1_serial (int) – Serial number of first atom
atom2_serial (int) – Serial number of second atom
bond_type (str) – Type of bond (‘covalent’, ‘explicit’, etc.)
distance (Optional[float]) – Distance between bonded atoms in Angstroms
detection_method (str) – Method used to detect this bond

Chemical bond representation with geometric and chemical properties.

Bond Properties:

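Atom Partners: Serial numbers of the two bonded atoms (atom1_serial, atom2_serial)
Bond Length: Distance between the bonded atoms in Angstroms (distance), or None if not recorded
Bond Type: Bond category such as ‘covalent’ or ‘explicit’ (bond_type)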
Chemical Environment: Intra-residue vs. inter-residue bonds
Validation: Bond length validation against expected values

Usage Example:

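# Analyze bond properties (parser is a PDBParser that has parsed a file)
bond = parser.get_bonds()[0]

print(f"Bond: {bond.atom1_serial} - {bond.atom2_serial}")
if bond.distance is not None:
    print(f"Length: {bond.distance:.3f} Å")
print(f"Type: {bond.bond_type} ({bond.detection_method})")

# Find the atom at the other end of the bond
partner = bond.get_partner(bond.atom1_serial)
print(f"Partner of atom {bond.atom1_serial}: {partner}")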

__init__(atom1_serial: int, atom2_serial: int, bond_type: str = 'covalent', distance: float | None = None, detection_method: str = 'distance_based') → None[source]#

Initialize a Bond object.

Parameters:

atom1_serial (int) – Serial number of first atom
atom2_serial (int) – Serial number of second atom
bond_type (str) – Type of bond (‘covalent’, ‘explicit’, etc.)
distance (Optional[float]) – Distance between bonded atoms in Angstroms
detection_method (str) – Method used to detect this bond

involves_atom(serial: int) → bool[source]#

Check if bond involves the specified atom.

Parameters:: serial (int) – Atom serial number
Returns:: True if bond involves this atom
Return type:: bool

get_partner(serial: int) → int | None[source]#

Get the bonding partner of the specified atom.

Parameters:: serial (int) – Atom serial number
Returns:: Serial number of bonding partner, None if atom not in bond
Return type:: Optional[int]

__iter__() → Iterator[Tuple[str, Any]][source]#

Iterate over bond attributes as (name, value) pairs.

Returns:: Iterator of (attribute_name, value) tuples
Return type:: Iterator[Tuple[str, Any]]

to_dict() → Dict[str, Any][source]#

Convert bond to dictionary.

Returns:: Dictionary representation of the bond
Return type:: Dict[str, Any]

classmethod fields() → List[str][source]#

Get list of field names.

Returns:: List of field names
Return type:: List[str]

__repr__() → str[source]#: String representation of the bond.

__eq__(other: object) → bool[source]#: Check equality with another Bond.

Parsing Methods#

File Parsing#

PDBParser.parse_file(filename: str) → bool[source]#

Parse a PDB file.

Reads and parses a PDB format file, extracting all ATOM and HETATM records and converting them to HBAT’s internal representation.

Parameters:: filename (str) – Path to the PDB file to parse
Returns:: True if parsing completed successfully, False otherwise
Return type:: bool
Raises:: IOError if file cannot be read

Parse PDB file from disk with comprehensive error handling.

PDBParser.parse_lines(lines: List[str]) → bool[source]#

Parse PDB format lines.

Parses PDB format content provided as a list of strings, useful for processing in-memory PDB data.

Parameters:: lines (List[str]) – List of PDB format lines
Returns:: True if parsing completed successfully, False otherwise
Return type:: bool

Parse PDB content from string lines for in-memory processing.

Structure Analysis#

PDBParser.get_statistics() → Dict[str, Any][source]#

Get basic statistics about the structure.

Provides counts of atoms, residues, chains, and element composition.

Returns:: Dictionary containing structure statistics
Return type:: Dict[str, Any]

Retrieve comprehensive parsing and structure statistics.

PDBParser.has_hydrogens() → bool[source]#

Check if structure contains hydrogen atoms.

Determines if the structure has a reasonable number of hydrogen atoms compared to heavy atoms, indicating explicit hydrogen modeling.

Returns:: True if structure appears to contain explicit hydrogens
Return type:: bool

Check if the parsed structure contains hydrogen atoms.

Bond Detection#
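
This reference describes automatic bond detection only as using distance criteria and atomic data. The sketch below illustrates that general idea and is not HBAT's actual algorithm: two atoms are treated as bonded when their separation does not exceed the sum of their covalent radii plus a tolerance. The radii are approximate published values; the 0.4 Å tolerance and the names `COVALENT_RADII` and `detect_bonds` are assumptions.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

# Approximate single-bond covalent radii in Angstroms.
COVALENT_RADII: Dict[str, float] = {
    "H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05,
}

# (serial, element, coords): a stand-in for Atom objects.
AtomTuple = Tuple[int, str, Tuple[float, float, float]]


def detect_bonds(atoms: List[AtomTuple], tolerance: float = 0.4) -> List[Tuple[int, int, float]]:
    """Return (serial1, serial2, distance) for every pair within bonding range."""
    bonds = []
    for (s1, e1, c1), (s2, e2, c2) in combinations(atoms, 2):
        cutoff = COVALENT_RADII[e1] + COVALENT_RADII[e2] + tolerance
        distance = math.dist(c1, c2)
        if distance <= cutoff:
            bonds.append((s1, s2, distance))
    return bonds


# A water molecule: both O-H pairs are bonded, the H-H pair is not.
water = [
    (1, "O", (0.0, 0.0, 0.0)),
    (2, "H", (0.96, 0.0, 0.0)),
    (3, "H", (-0.24, 0.93, 0.0)),
]
for first, second, distance in detect_bonds(water):
    print(f"{first}-{second}: {distance:.2f} Å")
```

The all-pairs loop is quadratic in the number of atoms, which is acceptable for a small example; a parser handling large structures would normally restrict the comparison to nearby atoms.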

Utility Functions#

Type Conversion#

hbat.core.pdb_parser._safe_int_convert(value: Any, default: int = 0) → int[source]#

Safely convert a value to integer, handling NaN and None values.

Parameters:

value (Any) – Value to convert
default (int) – Default value to use if conversion fails

Returns:

Integer value or default

Return type:

int

Safely convert values to integers with NaN and None handling.

hbat.core.pdb_parser._safe_float_convert(value: Any, default: float = 0.0) → float[source]#

Safely convert a value to float, handling NaN and None values.

Parameters:

value (Any) – Value to convert
default (float) – Default value to use if conversion fails

Returns:

Float value or default

Return type:

float

Safely convert values to floats with robust error handling.
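
The behavior described for these two helpers can be reproduced in a few lines. The functions below are a hedged reimplementation of the same idea under public names, not a copy of HBAT's private helpers, so edge cases may differ from the real code.

```python
import math
from typing import Any


def safe_int_convert(value: Any, default: int = 0) -> int:
    """Convert value to int, falling back to default for None, NaN, or junk."""
    if value is None:
        return default
    if isinstance(value, float) and math.isnan(value):
        return default
    try:
        return int(value)
    except (TypeError, ValueError, OverflowError):
        return default


def safe_float_convert(value: Any, default: float = 0.0) -> float:
    """Convert value to float, falling back to default for None, NaN, or junk."""
    if value is None:
        return default
    try:
        result = float(value)
    except (TypeError, ValueError):
        return default
    return default if math.isnan(result) else result


print(safe_int_convert("42"), safe_int_convert(float("nan"), default=-1))
print(safe_float_convert("1.5"), safe_float_convert(None))
```

Helpers like these are useful when a table-based reader reports missing fields as NaN, because a plain `int(float("nan"))` raises ValueError.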

Error Handling#

Exception Types:

The parser handles various error conditions gracefully:

File I/O Errors: Missing files, permission issues, corrupted data
Format Errors: Malformed PDB records, invalid coordinates
Chemical Errors: Invalid atom types, impossible geometries
Memory Errors: Structures too large for available memory

Error Recovery:

parser = PDBParser()
try:
    # parse_file() returns False when parsing does not complete
    if not parser.parse_file("problematic.pdb"):
        print("Parsing did not complete successfully")
except FileNotFoundError:
    print("PDB file not found")
except IOError as e:
    print(f"Could not read PDB file: {e}")
except MemoryError:
    print("Structure too large for available memory")

Validation Warnings:

The parser provides detailed warnings for common issues:

Missing atoms in standard residues
Unusual bond lengths or angles
Non-standard residue names
Duplicate atom serial numbers
Chain breaks and missing residues
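
Two of the checks listed above are simple enough to sketch. How HBAT detects and reports these conditions is not described in this reference, so the helper names and return shapes below are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple


def duplicate_serials(serials: List[int]) -> List[int]:
    """Return the atom serial numbers that occur more than once."""
    counts = Counter(serials)
    return sorted(serial for serial, count in counts.items() if count > 1)


def numbering_gaps(residue_numbers: List[int]) -> List[Tuple[int, int]]:
    """Return (before, after) pairs where residue numbering skips ahead."""
    ordered = sorted(set(residue_numbers))
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > 1]


print(duplicate_serials([1, 2, 2, 3, 3, 3]))
print(numbering_gaps([1, 2, 3, 7, 8, 10]))
```

A gap in residue numbering only suggests a chain break or missing residues; confirming one would also require checking the distance between the flanking backbone atoms.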

Performance Optimization#

Efficient Data Structures:

Dataclasses: Minimal memory overhead with fast attribute access
NPVec3D Integration: Optimized 3D coordinate handling
Lazy Evaluation: Properties computed on-demand
Memory Pooling: Efficient object reuse for large structures

Algorithmic Optimizations:

Spatial Indexing: Fast neighbor searching for bond detection
Vectorized Operations: NumPy-compatible coordinate processing
Chunked Processing: Memory-efficient handling of large files
Parallel Parsing: Future support for multi-threaded parsing
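
Spatial indexing for neighbor searches is commonly done with a uniform grid (a cell list): atoms are binned into cubic cells whose edge equals the cutoff, so each atom only needs to be compared with atoms in its own and the 26 adjacent cells. The sketch below shows that technique in plain Python; whether HBAT uses this exact scheme is not stated here, and `neighbor_pairs` is a hypothetical name.

```python
import math
from collections import defaultdict
from itertools import product
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Cell = Tuple[int, int, int]


def neighbor_pairs(coords: List[Point], cutoff: float) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose distance is within cutoff."""
    # Bin every point into a cubic cell with edge length equal to the cutoff.
    cells: Dict[Cell, List[int]] = defaultdict(list)
    for index, (x, y, z) in enumerate(coords):
        cell = (math.floor(x / cutoff), math.floor(y / cutoff), math.floor(z / cutoff))
        cells[cell].append(index)

    pairs = set()
    for (cx, cy, cz), members in cells.items():
        # Only the cell itself and its 26 neighbors can hold points in range.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                for i in members:
                    if i < j and math.dist(coords[i], coords[j]) <= cutoff:
                        pairs.add((i, j))
    return sorted(pairs)


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0), (5.5, 5.0, 5.0), (-0.5, 0.0, 0.0)]
print(neighbor_pairs(points, cutoff=1.2))
```

For roughly uniform atom densities the work grows linearly with the number of atoms, compared with the quadratic cost of testing every pair.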

Benchmarks:

Typical performance on modern hardware:

Small proteins (<1000 atoms): <10 ms parsing time
Medium proteins (1000-10000 atoms): 10-100 ms parsing time
Large complexes (10000+ atoms): 100-1000 ms parsing time
Memory usage: ~1-2 MB per 1000 atoms

Integration with Analysis Pipeline#

Analyzer Integration:

The parser integrates seamlessly with the analysis pipeline:

from hbat.core.analyzer import MolecularInteractionAnalyzer
from hbat.core.pdb_parser import PDBParser

# Direct integration
analyzer = MolecularInteractionAnalyzer()
results = analyzer.analyze_file("protein.pdb")  # Uses parser internally

# Manual parsing for custom processing
parser = PDBParser()
parser.parse_file("protein.pdb")  # Returns a bool; read data through the accessors
residues = parser.get_residue_list()
bonds = parser.get_bonds()
atoms = [atom for residue in residues for atom in residue.atoms]

# Custom pre-processing: drop hydrogen and deuterium atoms
filtered_atoms = [a for a in atoms if not a.is_hydrogen()]

# Analyze processed structure
results = analyzer.analyze_structure(filtered_atoms, residues, bonds)

Structure Fixing Integration:

The parser works with the PDB fixer for structure enhancement:

from hbat.core.pdb_fixer import PDBFixer

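# Parse original structure (parse_file() returns a bool, not the parsed data)
parser = PDBParser()
parser.parse_file("original.pdb")
residues = parser.get_residue_list()
atoms = [atom for residue in residues for atom in residue.atoms]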

# Apply structure fixing
fixer = PDBFixer()
fixed_structure = fixer.add_missing_hydrogens(atoms, residues)

# Re-parse enhanced structure
enhanced_atoms, enhanced_residues, enhanced_bonds = parser.parse_structure(fixed_structure)
