Implement Modular I/O Subsystem with Encapsulated bio-forge Pipeline #3

@TKanX

Description:

This task focuses on building the complete Input/Output (io) subsystem for dreid-forge. The architecture will follow a highly modular, Facade-based design, where each file format is handled by its own dedicated submodule, further broken down into reader and writer components. A key requirement is the complete encapsulation of the bio-forge library, which will be used internally to handle complex biological file formats (PDB, mmCIF) and their preparation pipeline (repair, protonation, topology generation). For standard chemical formats (SDF, MOL2), the readers will perform direct parsing of topology. The entire subsystem will be exposed through a clean, high-level API in src/io/mod.rs, providing a unified interface for all data ingestion and serialization tasks.

Tasks:

  • Phase 1: Establish I/O Module Architecture

    • Create the full module layout under src/io/: mod.rs, error.rs, util.rs, and one subdirectory per format (pdb, mmcif, sdf, mol2, lammps, bgf), each containing reader.rs and/or writer.rs.
    • In src/io/error.rs: Define the io::Error enum using thiserror to handle file parsing, I/O operations, missing metadata, and errors propagated from the internal bio-forge library.
    • In src/io/mod.rs:
      • Define the public-facing configuration structs: BioReadConfig and ProtonationConfig.
      • Implement the top-level API functions: read_structure, write_structure, read_template, and write_lammps_package.
      • Define the WritableStructure trait to allow write_structure to accept both System and ForgedSystem.
      • Re-export public types and functions for a clean user-facing module.
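As a rough illustration of the WritableStructure idea in Phase 1, the sketch below shows one way a single write_structure function could accept both System and ForgedSystem. The struct fields, the as_system method name, and the plain-text output are all assumptions made for this example; the real trait may expose different methods.

```rust
use std::io::{self, Write};

/// Hypothetical stand-ins for the crate's model types (fields are assumptions).
pub struct Atom {
    pub element: String,
    pub position: [f64; 3],
}

pub struct System {
    pub atoms: Vec<Atom>,
}

/// A parameterized system that wraps the base `System` (assumed layout).
pub struct ForgedSystem {
    pub system: System,
    pub atom_types: Vec<String>,
}

/// Anything that can expose a plain `System` view can be written.
pub trait WritableStructure {
    fn as_system(&self) -> &System;
}

impl WritableStructure for System {
    fn as_system(&self) -> &System {
        self
    }
}

impl WritableStructure for ForgedSystem {
    fn as_system(&self) -> &System {
        &self.system
    }
}

/// Simplified `write_structure`: emits a placeholder text listing, not a real format.
pub fn write_structure<S: WritableStructure, W: Write>(structure: &S, mut out: W) -> io::Result<()> {
    let system = structure.as_system();
    writeln!(out, "{} atoms", system.atoms.len())?;
    for atom in &system.atoms {
        let [x, y, z] = atom.position;
        writeln!(out, "{} {:.3} {:.3} {:.3}", atom.element, x, y, z)?;
    }
    Ok(())
}
```

With this shape, callers pass either type by reference, and format dispatch (omitted here) can stay inside write_structure.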
  • Phase 2: Implement Core Conversion Layer

    • In src/io/util.rs:
      • Implement from_bio_topology function to convert a fully processed bio_forge::Topology into our model::System. This is the primary bridge from bio-forge.
      • Implement to_bio_topology function to convert our model::System (with BioMetadata) back into a bio_forge::Topology. This is the primary bridge to bio-forge for writing.
      • Implement necessary helper functions for converting enums (Element, BondOrder) between the two crates to ensure type safety.
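The enum helpers in Phase 2 could look roughly like the sketch below. BioBondOrder stands in for the bio-forge enum and ModelBondOrder for the local one; both names and their variant lists are assumptions, since the issue does not spell them out. The point is the exhaustive match: if either side gains a variant, the conversion stops compiling instead of silently mis-mapping.

```rust
/// Hypothetical stand-in for the bond-order enum exposed by bio-forge.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BioBondOrder {
    Single,
    Double,
    Triple,
    Aromatic,
}

/// Hypothetical stand-in for the local `model` bond-order enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelBondOrder {
    Single,
    Double,
    Triple,
    Aromatic,
}

// Exhaustive matches (no `_` arm) make the compiler flag any new variant.
impl From<BioBondOrder> for ModelBondOrder {
    fn from(order: BioBondOrder) -> Self {
        match order {
            BioBondOrder::Single => ModelBondOrder::Single,
            BioBondOrder::Double => ModelBondOrder::Double,
            BioBondOrder::Triple => ModelBondOrder::Triple,
            BioBondOrder::Aromatic => ModelBondOrder::Aromatic,
        }
    }
}

impl From<ModelBondOrder> for BioBondOrder {
    fn from(order: ModelBondOrder) -> Self {
        match order {
            ModelBondOrder::Single => BioBondOrder::Single,
            ModelBondOrder::Double => BioBondOrder::Double,
            ModelBondOrder::Triple => BioBondOrder::Triple,
            ModelBondOrder::Aromatic => BioBondOrder::Aromatic,
        }
    }
}
```

If the two crates ever disagree on the variant set, the infallible From would need to become a TryFrom that returns an io::Error.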
  • Phase 3: Implement Biological Format Readers (PDB & mmCIF)

    • In src/io/pdb/reader.rs:
      • Implement the read function that orchestrates the full bio-forge pipeline:
        • Call bio_forge::io::read_pdb_structure.
        • Conditionally apply repair and protonation based on BioReadConfig.
        • Build the topology using bio_forge::ops::TopologyBuilder (a mandatory step to get bonds).
        • Convert the final bio_forge::Topology to model::System using the util layer.
    • In src/io/mmcif/reader.rs:
      • Implement the read function following the same pipeline as the PDB reader.
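The control flow of the Phase 3 read function can be sketched without bio-forge itself. Everything below is a stand-in: the stage functions only record that they ran, and the BioReadConfig and ProtonationConfig fields (repair, protonation, ph) are assumed names, not the confirmed API. What the sketch does show is the ordering: parse, optional repair, optional protonation, mandatory topology build, then conversion.

```rust
/// Hypothetical config shapes; field names are assumptions for this sketch.
#[derive(Debug, Clone)]
pub struct ProtonationConfig {
    pub ph: f64,
}

#[derive(Debug, Clone, Default)]
pub struct BioReadConfig {
    pub repair: bool,
    pub protonation: Option<ProtonationConfig>,
}

/// Stand-in for the intermediate bio-forge data; it only records which stages ran.
#[derive(Debug, Default)]
pub struct Structure {
    pub stages: Vec<String>,
}

/// Stand-in for `model::System`.
#[derive(Debug)]
pub struct System {
    pub stages: Vec<String>,
}

// Each function below is a placeholder for the corresponding bio-forge call.
fn parse(text: &str) -> Result<Structure, String> {
    if text.trim().is_empty() {
        return Err("empty input".to_string());
    }
    Ok(Structure { stages: vec!["parse".to_string()] })
}

fn repair(s: &mut Structure) {
    s.stages.push("repair".to_string());
}

fn protonate(s: &mut Structure, cfg: &ProtonationConfig) {
    s.stages.push(format!("protonate(pH {:.1})", cfg.ph));
}

fn build_topology(mut s: Structure) -> Structure {
    s.stages.push("topology".to_string());
    s
}

fn to_system(s: Structure) -> System {
    let mut stages = s.stages;
    stages.push("convert".to_string());
    System { stages }
}

/// Orchestrates the pipeline: optional steps are gated by the config,
/// while topology building and conversion always run.
pub fn read(text: &str, config: &BioReadConfig) -> Result<System, String> {
    let mut structure = parse(text)?;
    if config.repair {
        repair(&mut structure);
    }
    if let Some(protonation) = &config.protonation {
        protonate(&mut structure, protonation);
    }
    let topology = build_topology(structure);
    Ok(to_system(topology))
}
```

Because the mmCIF reader follows the same pipeline, only the parse step would differ between the two modules.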
  • Phase 4: Implement Chemical Format Readers (SDF & MOL2)

    • In src/io/sdf/reader.rs:
      • Implement a direct-to-System parser for the SDF/MOL format. It should read the connection table (counts line, atom block, and bond block) to obtain atom elements, coordinates, and bonds, and use them to populate System.atoms and System.bonds. BioMetadata will be None.
    • In src/io/mol2/reader.rs:
      • Implement a direct parser for MOL2 format molecules.
    • In src/io/mod.rs:
      • Implement the read_template wrapper around bio_forge::io::read_mol2_template as specified.
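For the SDF reader in Phase 4, a minimal direct parser might look like the sketch below. It assumes the V2000 fixed-width layout (three header lines, a counts line with atom and bond counts in columns 1-3 and 4-6, atom lines with coordinates in three 10-character fields and the element symbol in columns 32-34, bond lines with three 3-character fields). The Atom, Bond, and System types are simplified stand-ins, and properties, charges, and V3000 are deliberately ignored.

```rust
use std::str::FromStr;

#[derive(Debug, PartialEq)]
pub struct Atom {
    pub element: String,
    pub position: [f64; 3],
}

#[derive(Debug, PartialEq)]
pub struct Bond {
    pub i: usize, // zero-based atom index
    pub j: usize, // zero-based atom index
    pub order: u8,
}

#[derive(Debug, Default)]
pub struct System {
    pub atoms: Vec<Atom>,
    pub bonds: Vec<Bond>,
}

/// Parses the fixed-width column range `start..end` of `line`, tolerating short lines.
fn field<T: FromStr>(line: &str, start: usize, end: usize) -> Result<T, String> {
    line.get(start..end.min(line.len()))
        .map(str::trim)
        .and_then(|s| s.parse().ok())
        .ok_or_else(|| format!("bad field {}..{} in line {:?}", start, end, line))
}

/// Reads the first molecule of a V2000 molfile / SDF record.
pub fn parse_molfile(text: &str) -> Result<System, String> {
    let mut lines = text.lines().skip(3); // name, program stamp, comment
    let counts = lines.next().ok_or("missing counts line")?;
    let n_atoms: usize = field(counts, 0, 3)?;
    let n_bonds: usize = field(counts, 3, 6)?;

    let mut system = System::default();
    for _ in 0..n_atoms {
        let line = lines.next().ok_or("truncated atom block")?;
        let position: [f64; 3] = [field(line, 0, 10)?, field(line, 10, 20)?, field(line, 20, 30)?];
        let element: String = field(line, 31, 34)?;
        if element.is_empty() {
            return Err(format!("missing element symbol in line {:?}", line));
        }
        system.atoms.push(Atom { element, position });
    }
    for _ in 0..n_bonds {
        let line = lines.next().ok_or("truncated bond block")?;
        let a: usize = field(line, 0, 3)?;
        let b: usize = field(line, 3, 6)?;
        if a == 0 || b == 0 || a > n_atoms || b > n_atoms {
            return Err(format!("bond refers to unknown atom in line {:?}", line));
        }
        // The file uses one-based atom numbers; store zero-based indices.
        system.bonds.push(Bond { i: a - 1, j: b - 1, order: field(line, 6, 9)? });
    }
    Ok(system)
}
```

A real reader would map the element string and bond type code onto the crate's Element and BondOrder enums and report failures through io::Error rather than String.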
  • Phase 5: Implement All Writers

    • In src/io/pdb/writer.rs and src/io/mmcif/writer.rs:
      • Implement write functions that check for BioMetadata, convert the System to bio_forge::Topology, and call the corresponding bio-forge writer.
    • In src/io/bgf/writer.rs:
      • Implement a write function for the BGF format, leveraging the bio-forge writer.
    • In src/io/sdf/writer.rs and src/io/mol2/writer.rs:
      • Implement writers for standard chemical formats.
    • In src/io/lammps/writer.rs:
      • Implement the write function to generate the *.data and *.in.settings file pair.
      • The settings writer must implement the "smart" if/else logic to adapt to user-defined boundary conditions.
      • The data writer must correctly map ForgedSystem to all required LAMMPS sections, including Masses, Atoms (with molecule IDs), and topology sections with type IDs.
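To make the data-writer requirement more concrete, here is a small sketch that maps a simplified ForgedSystem onto a LAMMPS data file. It assumes atom_style full (atom ID, molecule ID, type, charge, x, y, z) and writes only the Masses, Atoms, and Bonds sections; angles, dihedrals, and the settings file are omitted. The struct layout is hypothetical and much smaller than what the real ForgedSystem would carry.

```rust
use std::fmt::{self, Write};

/// Hypothetical, reduced view of a forged system (all fields are assumptions).
pub struct AtomType {
    pub name: String,
    pub mass: f64,
}

pub struct ForgedAtom {
    pub molecule: usize,   // one-based molecule ID
    pub type_index: usize, // zero-based index into `atom_types`
    pub charge: f64,
    pub position: [f64; 3],
}

pub struct ForgedBond {
    pub type_index: usize, // zero-based bond type
    pub i: usize,          // zero-based atom indices
    pub j: usize,
}

pub struct ForgedSystem {
    pub atom_types: Vec<AtomType>,
    pub bond_type_count: usize,
    pub atoms: Vec<ForgedAtom>,
    pub bonds: Vec<ForgedBond>,
    pub bounds: [[f64; 2]; 3], // [lo, hi] per axis
}

/// Renders the system as LAMMPS data-file text, converting all indices to one-based IDs.
pub fn write_data(sys: &ForgedSystem) -> Result<String, fmt::Error> {
    let mut out = String::new();
    writeln!(out, "LAMMPS data file (sketch)\n")?;
    writeln!(out, "{} atoms", sys.atoms.len())?;
    writeln!(out, "{} bonds", sys.bonds.len())?;
    writeln!(out, "{} atom types", sys.atom_types.len())?;
    writeln!(out, "{} bond types\n", sys.bond_type_count)?;
    for (axis, b) in ["x", "y", "z"].iter().zip(sys.bounds.iter()) {
        writeln!(out, "{:.4} {:.4} {}lo {}hi", b[0], b[1], axis, axis)?;
    }

    writeln!(out, "\nMasses\n")?;
    for (k, t) in sys.atom_types.iter().enumerate() {
        writeln!(out, "{} {:.4} # {}", k + 1, t.mass, t.name)?;
    }

    writeln!(out, "\nAtoms # full\n")?;
    for (k, a) in sys.atoms.iter().enumerate() {
        let [x, y, z] = a.position;
        writeln!(
            out,
            "{} {} {} {:.4} {:.4} {:.4} {:.4}",
            k + 1, a.molecule, a.type_index + 1, a.charge, x, y, z
        )?;
    }

    // Skip the section entirely when there are no bonds.
    if !sys.bonds.is_empty() {
        writeln!(out, "\nBonds\n")?;
        for (k, b) in sys.bonds.iter().enumerate() {
            writeln!(out, "{} {} {} {}", k + 1, b.type_index + 1, b.i + 1, b.j + 1)?;
        }
    }
    Ok(out)
}
```

Building the text in memory first keeps the writer easy to unit test; the real write function can then flush the string to the *.data path.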
  • Phase 6: Verification

    • Add unit tests for each reader and writer to ensure format correctness.
    • Create integration tests that read a file, process it through a mock forge pipeline, and write it out, ensuring data integrity.
    • Verify that the LAMMPS output can successfully run the water molecule test case without manual modification.

Status

Done