Project G2D: A Proposal for Generative
AI in Precision Drug Discovery
I. Executive Summary: Project G2D - A Paradigm Shift
in Genotype-to-Drug Design
The pharmaceutical research and development (R&D) sector is confronting a profound and
systemic crisis of productivity. The traditional model of drug discovery is characterized by
staggering costs, protracted timelines, and a debilitatingly low probability of success. On
average, bringing a single new therapy to market requires an investment exceeding $2.5 billion
and takes between 13 and 15 years. This unsustainable trajectory, often termed "Eroom's Law"
(Moore's Law in reverse), is further compounded by an approximately 90% failure rate for drug
candidates that enter clinical trials, with most failures resulting from a lack of efficacy or
unforeseen toxicity. This high-risk, low-yield paradigm represents a critical bottleneck to medical
innovation and a significant barrier to addressing unmet patient needs.
This proposal introduces "Project G2D" (Genotype-to-Drug), a next-generation computational
platform engineered to fundamentally redesign the drug discovery process. G2D leverages a
bespoke, multimodal generative artificial intelligence (AI) engine to design novel,
highly efficacious, and safe small-molecule drugs entirely from scratch (de novo). Inspired by
the transformative impact of AI platforms like AlphaFold on structural biology, G2D aims to solve
the more complex generation problem for therapeutics, moving beyond the mere prediction of
existing structures or screening of known libraries to explore and create within the vast,
uncharted chemical space.
The core innovation and principal differentiator of the G2D platform is its deep, foundational
integration of pharmacogenomics. Unlike generic drug design platforms that create
one-size-fits-all molecules, G2D utilizes patient-specific genetic data—including genomic,
transcriptomic, and proteomic information—as a primary input to the generative process. This
allows the platform to design therapeutic candidates tailored to the unique genetic and
molecular signatures of specific patient subpopulations, representing a true leap forward in the
practical application of precision medicine. By conditioning the drug design process on the very
biology of the intended patient, G2D aims to dramatically increase the probability of clinical
success.
The primary objective of this project is to develop, train, and validate the G2D platform by
executing a complete discovery cycle over a 24-month period. This cycle will encompass the
identification of a novel therapeutic target, the generation of a diverse library of optimized
candidate molecules, and the selection of a lead preclinical candidate (PCC) with a robust in
silico and initial experimental data package. This timeline represents a radical acceleration
compared to the industry standard of 3 to 6 years for this discovery phase.
To achieve this ambitious goal, this proposal outlines a request for seed funding to assemble a
world-class interdisciplinary team, secure the necessary high-performance computational
resources, acquire essential software licenses, and execute the comprehensive 24-month R&D
plan detailed herein. The ultimate vision for Project G2D extends beyond a single asset; it is to
build a scalable, continuously improving engine for creating a new class of personalized
pharmaceuticals. This platform will serve as the foundation for a robust internal pipeline of
high-value therapeutic programs and create opportunities for strategic, revenue-generating
partnerships with leading biopharmaceutical companies.
II. The Imperative for Innovation: Redefining the Drug
Discovery Landscape
The strategic rationale for Project G2D is anchored in the confluence of two powerful, opposing
forces: the declining efficiency of the traditional pharmaceutical R&D model and the exponential
rise of technological capabilities in artificial intelligence and computation. This creates a unique
and timely opportunity to apply a proven technological paradigm to solve one of biology's most
pressing and costly challenges.
2.1. The Systemic Failure of the Traditional Pipeline
The current drug discovery pipeline is a sequential, high-attrition process that is becoming
economically unsustainable. The widely cited cost to bring a single drug to market, estimated
between $1 billion and $2.6 billion, is not merely an expenditure on a single successful product;
it is an average that must account for the immense cost of the vast majority of projects that fail.
These failures occur at every stage, with Phase 2 clinical trials costing anywhere from $7 million
to $19 million and Phase 3 trials escalating to between $11.5 million and $52.9 million per study.
The probability of success is alarmingly low. Only about 10-13% of drugs that enter Phase 1
clinical trials ultimately receive regulatory approval. The primary drivers of this attrition are a lack
of clinical efficacy and unacceptable safety or toxicity profiles, issues that are often only
discovered late in the development process after hundreds of millions of dollars have been
invested. This inefficiency begins at the earliest stages of discovery. Methods like
High-Throughput Screening (HTS), while capable of testing millions of compounds, are akin to
searching for a needle in a haystack, yielding hit rates as low as 2.5% and often identifying
compounds with poor drug-like properties that require extensive and costly optimization. This
entire paradigm is governed by Eroom's Law, the troubling observation that despite decades of
advances in biology and technology, the number of new drugs approved per billion dollars of
R&D spending has halved roughly every nine years. This trend is not sustainable and demands
a fundamental shift in methodology.
2.2. The AI Revolution in Biology
Simultaneously, the world of computation is governed by Moore's Law, the observation of
exponential growth in computing power at decreasing cost. This has enabled the rise of artificial
intelligence and machine learning, which are now mature enough to tackle the immense
complexity of biological data. The discontinuity between the negative trajectory of pharma R&D
and the positive trajectory of computational power presents the central opportunity for this
project.
The transformative potential of AI in biology is no longer theoretical. The success of DeepMind's
AlphaFold serves as a powerful narrative precedent. By applying deep learning to vast datasets
of protein sequences and structures, AlphaFold effectively solved the grand challenge of protein
structure prediction, providing highly accurate 3D models for nearly all known proteins and
revolutionizing the field of structural biology. This achievement demonstrated that complex,
high-dimensional biological problems, once thought intractable, are amenable to AI-driven
solutions.
Project G2D is conceived as the logical successor to this revolution. While AlphaFold solved a
prediction problem, G2D aims to solve the far more complex generation problem. Instead of
predicting the structure of an existing biological entity, the G2D platform will generate entirely
new biological entities—novel small molecules—that are designed from scratch to perform a
specific therapeutic function. This moves beyond optimizing molecules within existing chemical
libraries and circumvents their inherent limitations by exploring a vastly larger and unknown
chemical space. This is the next frontier for AI in biology, shifting from understanding what is to
creating what could be.
2.3. Market Validation and Investment Momentum
The financial markets have recognized this paradigm shift, leading to a surge of investment and
activity in the AI-driven drug discovery sector. The market is projected to grow at a compound
annual growth rate (CAGR) of 37.67% between 2024 and 2030, with forecasts anticipating a
market size of tens of billions of dollars. This explosive growth is fueled by AI's potential to
dramatically improve cost efficiency and accelerate development timelines, two of the most
critical value drivers in the pharmaceutical industry.
This market enthusiasm is substantiated by a series of landmark venture capital investments,
which serve as clear validation of the strategic direction of Project G2D:
● Xaira Therapeutics, an AI platform for drug discovery, launched with a staggering $1
billion in Series A funding.
● Isomorphic Labs, a spin-off from Google's DeepMind, raised $600 million in its first
external funding round to advance its AI-first approach to drug discovery.
● Formation Bio, which uses AI to accelerate drug development, secured $372 million in a
Series D round.
● Insilico Medicine, a pioneer in generative AI for drug R&D, has raised hundreds of
millions and advanced multiple AI-designed drugs into human clinical trials.
These massive investments in companies pursuing similar goals underscore the immense
perceived value and strategic importance of building foundational AI platforms for drug
discovery. They confirm that the market is actively seeking and funding solutions to the R&D
crisis, positioning Project G2D in a high-growth, high-demand sector with a clear pathway to
significant value creation and future funding or acquisition opportunities.
III. The G2D Platform: A Technical Blueprint for
Genotype-to-Drug Design
The G2D platform is not a single algorithm but a cohesive, end-to-end computational system
designed to translate complex biological and genetic data into novel, optimized therapeutic
candidates. The workflow is structured into four integrated phases, creating a continuous cycle
of generation, prediction, and validation that bridges the gap between in silico design and
real-world biology.
3.1. Phase 1: Precision Target and Pharmacogenomic Footprinting
The foundation of any successful drug discovery program is the selection of a high-quality
biological target. This phase moves beyond traditional target identification by deeply integrating
patient-level genetic data to define not just what to target, but how to target it in specific
populations.
● Objective: To identify and validate a high-value biological target with a clear, causal link
to a disease of high unmet need, and to comprehensively map its structural variations
across patient populations.
● Methodology:
1. Multi-Omics Data Integration and Target Identification: The process begins by
aggregating and analyzing vast, high-dimensional datasets from public and
proprietary sources, including genomics (TCGA, UK Biobank), transcriptomics
(GEO), and proteomics. AI algorithms will analyze these multi-omics datasets to
identify genes and proteins whose expression or activity is strongly correlated with
disease pathology.
2. Network Pharmacology Analysis: To understand the target in its biological
context, graph-based neural networks will be employed to construct and analyze
protein-protein interaction networks and disease pathways. This systems-level view
helps prioritize targets that are central to disease mechanisms and predicts
potential on-target and off-target effects, moving beyond a simplistic one-drug,
one-target paradigm.
3. Pharmacogenomic Variation Analysis: This is a cornerstone of the G2D
platform's innovation. Once a primary target is selected, we will mine large-scale
human genetic databases (e.g., the Million Health Discoveries Program, dbSNP) to
identify single nucleotide polymorphisms (SNPs), insertions, and deletions that
occur within the patient population. The focus will be on non-synonymous variants
that alter the amino acid sequence of the target protein, particularly those within or
near potential binding sites.
4. Personalized Binding Site Characterization: Using state-of-the-art protein
structure prediction tools like AlphaFold3, we will generate a high-resolution 3D
model of the target protein. Crucially, this process will be repeated for each
identified genetic variant, creating an ensemble of protein structures that represent
the target's natural variation in the human population. Computational tools like
SiteMap (from the Schrödinger suite) will then be used to analyze these structures
and identify and characterize the precise 3D geometry of the binding pockets for
each variant. The output of this phase is not a single binding pocket, but a set of
personalized binding pockets, each representing a specific patient genotype.
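To make step 4 concrete, the short sketch below shows one way the per-genotype inputs for structure prediction could be prepared: annotated missense variants are applied to a reference target sequence and written out as separate FASTA files, one per genotype. The reference sequence, variant notation, and file naming are illustrative assumptions only; AlphaFold-style structure prediction and SiteMap-style pocket characterization run downstream of this step and are not shown.

```python
# Illustrative sketch: build per-variant target sequences for downstream structure prediction.
# The reference sequence and variant labels below are placeholders, not project data.
import re

REFERENCE_SEQ = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSY"  # placeholder reference sequence

def apply_missense_variant(sequence: str, variant: str) -> str:
    """Apply a single missense variant written as <ref><1-based position><alt>, e.g. 'L19F'."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", variant)
    if not match:
        raise ValueError(f"Unrecognized variant notation: {variant}")
    ref_aa, pos, alt_aa = match.group(1), int(match.group(2)), match.group(3)
    if pos > len(sequence) or sequence[pos - 1] != ref_aa:
        raise ValueError(f"Reference mismatch for {variant}")
    return sequence[: pos - 1] + alt_aa + sequence[pos:]

def write_variant_fastas(variants: list[str], out_prefix: str = "target") -> None:
    """Write one FASTA per genotype (wild type plus each missense variant)."""
    records = {"WT": REFERENCE_SEQ}
    records.update({v: apply_missense_variant(REFERENCE_SEQ, v) for v in variants})
    for label, seq in records.items():
        with open(f"{out_prefix}_{label}.fasta", "w") as fh:
            fh.write(f">{out_prefix}_{label}\n{seq}\n")

if __name__ == "__main__":
    # Hypothetical non-synonymous SNPs observed in the patient population.
    write_variant_fastas(["L19F", "S17N"])
```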
3.2. Phase 2: The Generative Engine - Crafting Novel Therapeutic
Molecules
This phase constitutes the creative core of the G2D platform, where novel molecular structures
are generated from scratch, guided by the personalized target information from Phase 1.
● Objective: To generate libraries of novel, diverse, and chemically valid small molecules
that are computationally optimized to bind with high affinity to the specific,
genotype-defined target pockets.
● Proposed Architecture (A Hybrid, Multi-Objective Approach): The choice of
generative architecture is a critical design decision. Early models based on 1D/2D string
representations like SMILES have shown promise but suffer from fundamental limitations:
they often fail to capture the crucial 3D spatial information required for structure-based
design and can struggle to represent topological similarity effectively. A small change in a
molecule's structure can lead to a completely different SMILES string, making the
learned chemical space non-smooth and difficult to navigate. Given that our project is
explicitly structure-based, a 3D-native approach is required.
1. Core Generative Model (3D Graph Diffusion): The G2D engine will be built
around a 3D graph-based diffusion model. These models operate directly on the 3D
coordinates and atom types of a molecule, represented as a graph. They work by
learning to reverse a "diffusion" process that gradually adds noise to a molecule's
structure until it is unrecognizable. By learning this reversal, the model can start
from random noise and generate novel, coherent 3D molecular structures. This
approach has demonstrated state-of-the-art performance in generating high-quality,
valid, and novel molecules that inherently possess 3D structure.
2. Conditional Generation via Pharmacogenomic Input: The generative process
will be explicitly conditioned. The pharmacogenomic information from Phase
1—such as the 3D coordinates of a variant-specific binding pocket or a
disease-specific gene expression signature—will be encoded into a numerical
vector. This vector will serve as a conditional input or "prompt" for the diffusion
model, guiding it to generate molecules that are specifically tailored to interact with
that particular biological context. This technique, similar in concept to models like
G2D-Diff, ensures that the generated molecules are not random but are biased
towards the desired personalized therapeutic profile (a simplified sketch of this conditional training step follows this list).
3. Reinforcement Learning (RL) for Multi-Objective Optimization: To refine the
generated molecules towards multiple desired properties simultaneously, the
generative model will be coupled with a reinforcement learning framework. An
Actor-Critic model is well-suited for this task. The diffusion model acts as the
"Actor," proposing new molecules. A "Critic" network then evaluates these
molecules based on a comprehensive reward function. The feedback from the Critic
is used to update the Actor, progressively teaching it to generate molecules that
achieve higher and higher reward scores.
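As a simplified illustration of items 1 and 2 above, the sketch below shows a single training step of a conditional denoising diffusion model in PyTorch: noise is added to 3D atom coordinates at a random timestep, and a network conditioned on a pocket/genotype embedding learns to predict that noise. This is a deliberate toy (a plain MLP rather than an SE(3)-equivariant graph network), and all dimensions, names, and hyperparameters are assumptions for illustration, not the production architecture.

```python
# Minimal conditional denoising-diffusion training step (illustrative only).
import torch
import torch.nn as nn

N_ATOMS, COND_DIM, T = 32, 128, 1000  # assumed molecule size, conditioning size, diffusion steps

# Standard linear beta schedule and cumulative alpha products.
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

class ConditionalDenoiser(nn.Module):
    """Predicts the noise added to flattened 3D coordinates, given timestep and conditioning."""
    def __init__(self):
        super().__init__()
        in_dim = N_ATOMS * 3 + COND_DIM + 1  # coords + pocket embedding + timestep
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.SiLU(),
            nn.Linear(512, 512), nn.SiLU(),
            nn.Linear(512, N_ATOMS * 3),
        )

    def forward(self, noisy_coords, t, cond):
        t_feat = (t.float() / T).unsqueeze(-1)                    # normalize timestep to [0, 1]
        x = torch.cat([noisy_coords.flatten(1), cond, t_feat], dim=-1)
        return self.net(x).view(-1, N_ATOMS, 3)

def training_step(model, optimizer, coords, cond):
    """Add noise at a random timestep, predict it, and minimize the MSE to that noise."""
    t = torch.randint(0, T, (coords.shape[0],))
    eps = torch.randn_like(coords)
    a_bar = alpha_bars[t].view(-1, 1, 1)
    noisy = a_bar.sqrt() * coords + (1.0 - a_bar).sqrt() * eps    # forward diffusion q(x_t | x_0)
    loss = nn.functional.mse_loss(model(noisy, t, cond), eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = ConditionalDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
coords = torch.randn(8, N_ATOMS, 3)           # stand-in for real ligand conformations
pocket_embedding = torch.randn(8, COND_DIM)   # stand-in for the genotype-specific pocket encoding
print(training_step(model, opt, coords, pocket_embedding))
```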
3.3. Phase 3: The In Silico Gauntlet - Predictive Filtering and Scoring
This phase defines the "Critic" in the RL framework. It involves a battery of predictive models
that screen the millions of generated molecules and assign a single, multi-parameter reward
score to each, guiding the optimization process. Addressing these properties in silico before
synthesis is a major departure from traditional methods and is critical for reducing the high
attrition rates caused by poor pharmacokinetics and toxicity.
● Objective: To rapidly and accurately screen and prioritize generated molecules based on
a holistic profile of efficacy, safety, and developability.
● Multi-Parameter Reward Function Components:
1. Binding Affinity: The predicted binding energy and pose of the generated
molecule within the target binding pocket. This will be calculated using
high-throughput molecular docking software like Glide or AutoDock. This score
directly measures the primary efficacy objective.
2. ADMET Profile: A suite of machine learning models will predict the key
pharmacokinetic and toxicological properties of each molecule. This includes
models for Absorption (e.g., solubility, permeability), Distribution, Metabolism (e.g.,
inhibition of key Cytochrome P450 enzymes), Excretion, and Toxicity (e.g.,
cardiotoxicity, hepatotoxicity).
3. Drug-likeness and Chemical Validity: A set of standard chemoinformatic filters
will be applied. This includes checking for chemical stability, ensuring correct atomic
valency, and calculating a Quantitative Estimate of Drug-likeness (QED) score,
which assesses how closely a molecule resembles known oral drugs. These checks
are often performed using toolkits like RDKit.
4. Synthetic Accessibility: A crucial parameter to avoid generating "fantasy
molecules" is a synthetic accessibility score (e.g., SAscore). This score estimates
the difficulty of synthesizing a molecule in a laboratory, penalizing overly complex or
unusual chemical structures. This directly addresses a major limitation of early
generative models and ensures the prioritized candidates are practically viable.
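A minimal sketch of how these components could be combined into a single reward is shown below, assuming generated molecules are available as SMILES strings. Chemical validity and QED come from RDKit; the docking, ADMET, and synthetic-accessibility terms are hypothetical stubs standing in for Glide/AutoDock outputs, trained ADMET models, and an SAscore-style estimate, and the weights are placeholders to be tuned.

```python
# Illustrative multi-parameter reward ("Critic") combining the objectives described above.
from rdkit import Chem
from rdkit.Chem import QED

# Assumed weights for combining the normalized objectives; real values would be tuned.
WEIGHTS = {"affinity": 0.4, "admet": 0.3, "qed": 0.2, "synthesis": 0.1}

def predicted_affinity(mol) -> float:
    """Stub: would call a docking engine and map the score into [0, 1]."""
    return 0.5

def predicted_admet(mol) -> float:
    """Stub: would aggregate ML predictions for solubility, CYP inhibition, toxicity, etc."""
    return 0.5

def synthetic_accessibility(mol) -> float:
    """Stub: would use an SAscore-style estimate, rescaled so 1.0 = easiest to synthesize."""
    return 0.5

def reward(smiles: str) -> float:
    """Return 0 for invalid molecules, otherwise a weighted sum of the objective scores."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                      # chemical validity gate
        return 0.0
    scores = {
        "affinity": predicted_affinity(mol),
        "admet": predicted_admet(mol),
        "qed": QED.qed(mol),             # drug-likeness in [0, 1]
        "synthesis": synthetic_accessibility(mol),
    }
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

print(reward("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin as a sanity check
print(reward("not_a_molecule"))          # invalid SMILES -> 0.0
```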
3.4. Phase 4: The "Lab-in-the-Loop" - Iterative Experimental Validation
Computational prediction alone is insufficient. The G2D platform's long-term success depends
on its ability to learn from real-world experimental data. This phase establishes a virtuous cycle
of prediction, testing, and refinement.
● Objective: To bridge the in silico-to-in vivo gap by using targeted experimental data to
continuously improve the accuracy and predictive power of the entire G2D platform.
● Methodology:
1. Candidate Selection and Synthesis: A small, diverse set of the highest-scoring
molecules from the in silico gauntlet are selected for physical synthesis, which can
be performed by a contract research organization (CRO).
2. Experimental Assay: The synthesized compounds undergo a panel of initial
wet-lab experiments, such as biochemical binding assays (to confirm target
engagement) and cell-based functional assays (to measure biological activity).
3. Data Feedback: The results of these experiments—critically, both positive and
negative outcomes—are digitized, structured, and fed back into the G2D platform's
proprietary database.
4. Platform Retraining: This new, high-quality, experimentally-validated data is used
as a fresh training set to fine-tune all the predictive models within the platform. This
iterative "lab-in-the-loop" process ensures that with each cycle, the generative
engine becomes better at creating successful molecules, and the predictive models
become more accurate at identifying them.
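The skeleton below sketches the shape of this cycle. Every function body is a placeholder for an external component (the generative engine, the CRO, and the retraining job); what it illustrates is the flow of data, including negative assay results, back into the platform's database at each cycle.

```python
# Schematic "lab-in-the-loop" skeleton; all component functions are stubs.
from dataclasses import dataclass

@dataclass
class AssayResult:
    smiles: str
    predicted_score: float
    measured_activity: float   # wet-lab readout; failures are kept, not discarded
    active: bool

def generate_and_score(n: int) -> list[tuple[str, float]]:
    """Stub: run the generative engine + in silico gauntlet, return (SMILES, score) pairs."""
    return [("CCO", 0.42)] * n

def run_cro_assays(candidates: list[tuple[str, float]]) -> list[AssayResult]:
    """Stub: synthesis and biochemical/cell-based assays performed by a CRO."""
    return [AssayResult(s, p, measured_activity=0.0, active=False) for s, p in candidates]

def retrain_models(results: list[AssayResult]) -> None:
    """Stub: fine-tune the predictive models on the accumulated experimental data."""
    pass

def lab_in_the_loop(cycles: int = 3, batch_size: int = 20) -> list[AssayResult]:
    database: list[AssayResult] = []
    for cycle in range(cycles):
        candidates = generate_and_score(batch_size)   # in silico design + ranking
        results = run_cro_assays(candidates)          # prospective experimental test
        database.extend(results)                      # positives AND negatives are stored
        retrain_models(database)                      # close the loop
        hit_rate = sum(r.active for r in results) / len(results)
        print(f"cycle {cycle + 1}: hit rate {hit_rate:.0%}")
    return database

if __name__ == "__main__":
    lab_in_the_loop()
```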
To provide clear justification for the selection of our proposed generative architecture, the
following table compares leading generative AI models across key attributes relevant to de novo
drug design.
Table 3.1: Comparative Analysis of Generative AI Architectures for Drug Design
Variational Autoencoders (VAEs)
● Core Mechanism: Learns a compressed, continuous "latent space" representation of molecular data from which new samples can be drawn.
● Strengths: Excellent for exploring and optimizing within a learned chemical space; computationally efficient.
● Weaknesses: Can struggle to generate highly novel molecules outside the training distribution; output quality can be lower than other methods.
● Suitability for G2D Project: Moderate. Good for lead optimization but less ideal for de novo exploration of uncharted chemical space.
Generative Adversarial Networks (GANs)
● Core Mechanism: A "Generator" network creates molecules, and a "Discriminator" network tries to distinguish them from real ones. They train in competition.
● Strengths: Capable of generating highly realistic and novel samples that closely match the training data distribution.
● Weaknesses: Notoriously difficult to train ("mode collapse"); provides less control over the generation process.
● Suitability for G2D Project: Low. The training instability and lack of fine-grained control make it less suitable for our multi-objective optimization needs.
3D Graph Diffusion Models
● Core Mechanism: Learns to reverse a process of gradually adding noise to 3D molecular structures, enabling generation from a random starting point.
● Strengths: Generates high-quality, valid 3D structures directly; state-of-the-art sample quality and diversity; well-suited for structure-based design.
● Weaknesses: Computationally more intensive to train and sample from compared to VAEs or GANs.
● Suitability for G2D Project: High. The ability to generate molecules directly in 3D space is a perfect match for our structure-based, personalized binding pocket approach.
Reinforcement Learning (RL)
● Core Mechanism: An agent (the generator) learns to take actions (build a molecule) to maximize a cumulative reward (a score based on desired properties).
● Strengths: Excellent for goal-directed, multi-objective optimization; can guide generation towards specific, desirable properties (e.g., high affinity, low toxicity).
● Weaknesses: Requires a well-defined and accurate reward function; can be sample-inefficient without a good pre-trained generator.
● Suitability for G2D Project: High (as a hybrid component). RL is the ideal framework for optimizing the molecules. By combining it with a powerful pre-trained 3D Diffusion Model, we get the best of both worlds: high-quality generation and precise, multi-objective optimization.
This analysis makes it clear that a hybrid architecture, combining the high-quality 3D generation
capabilities of a Diffusion Model with the goal-directed optimization power of Reinforcement
Learning, provides the most robust and technically advanced solution for achieving the
objectives of Project G2D.
IV. Navigating the Frontier: Acknowledged Risks and
Proactive Mitigation
A proposal that ignores the significant challenges of its field is not credible. The frontier of AI in
drug discovery is fraught with known risks. Acknowledging these challenges and presenting a
clear, proactive mitigation strategy for each is a hallmark of expert planning and is essential for
building investor confidence.
4.1. Challenge: Data Quality, Scarcity, and Bias ("Garbage In, Garbage
Out")
● Risk: The performance of any AI model is fundamentally constrained by the quality,
quantity, and representativeness of its training data. Publicly available biomedical
datasets are often noisy, incomplete, inconsistent, and may contain inherent biases (e.g.,
demographic or experimental). Training a model on such "garbage" data will inevitably
lead to a "garbage" model that produces unreliable predictions, fails to generalize to
real-world scenarios, and may even perpetuate health disparities. Furthermore, for truly
novel biological targets, high-quality bioactivity data can be extremely scarce, a significant
hurdle for data-hungry deep learning models.
● Mitigation Strategy:
1. Rigorous Data Governance and Curation: A formal data governance framework
will be established from day one. All data ingested into the G2D platform, whether
from public or proprietary sources, will pass through a rigorous curation pipeline.
This involves automated and manual checks for quality, normalization of formats,
tracking of data provenance, and version control. Every data point will be treated as
a valuable asset (an illustrative curation sketch follows this list).
2. Advanced Learning Techniques for Data Scarcity: To overcome the challenge of
limited data for novel targets, we will employ several state-of-the-art machine
learning techniques. Transfer Learning will be used to pre-train our models on
massive, general-purpose chemical databases (e.g., ZINC, ChEMBL) before
fine-tuning them on the smaller, target-specific dataset. This allows the model to
learn general chemical principles from the large dataset and apply them to the
specific problem. Data Augmentation and Data Synthesis techniques will also be
used to artificially expand our training sets in a chemically meaningful way.
3. Strategic Data Collaboration: Following the successful model of companies like
Inductive Bio, we will actively explore participation in or the creation of
pre-competitive data-sharing consortia. While proprietary data on novel targets and
compounds remains confidential, data related to common challenges like ADMET
properties can be anonymized and pooled. This "give-to-get" model allows all
participants to build more robust and generalizable predictive models by training on
a much larger and more diverse dataset than any single company could assemble
alone.
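As a small illustration of the curation step in item 1, the sketch below standardizes SMILES with RDKit, drops records that fail to parse, de-duplicates on the canonical form, and retains source provenance. The record fields and input format are assumptions; a production pipeline would add many more quality checks.

```python
# Minimal data-curation sketch: canonicalize, filter invalid records, de-duplicate, keep provenance.
from rdkit import Chem

def curate(records: list[dict]) -> list[dict]:
    """Each record is assumed to look like {'smiles': ..., 'activity': ..., 'source': ...}."""
    seen = {}
    for rec in records:
        mol = Chem.MolFromSmiles(rec.get("smiles", ""))
        if mol is None:
            continue                                   # drop unparseable structures
        canonical = Chem.MolToSmiles(mol)              # normalize the representation
        if canonical in seen:
            seen[canonical]["sources"].append(rec.get("source", "unknown"))  # keep provenance
        else:
            seen[canonical] = {
                "smiles": canonical,
                "activity": rec.get("activity"),
                "sources": [rec.get("source", "unknown")],
            }
    return list(seen.values())

raw = [
    {"smiles": "OCC", "activity": 1.2, "source": "ChEMBL"},
    {"smiles": "CCO", "activity": 1.3, "source": "internal"},  # duplicate of OCC after canonicalization
    {"smiles": "???", "activity": 0.0, "source": "internal"},  # invalid, dropped
]
print(curate(raw))
```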
4.2. Challenge: The "Black Box" and Synthesizability
● Risk: A common criticism of deep learning is the "black box" problem. The complex,
non-linear nature of these models can make their internal decision-making processes
opaque and difficult for human experts to interpret. This lack of transparency can erode
trust with medicinal chemists, who are reluctant to invest significant resources in
synthesizing a molecule without understanding why the model believes it will be effective.
A related and equally critical risk is that generative models can propose molecules that
are theoretically potent but are practically impossible or prohibitively expensive to
synthesize, a problem that plagued early de novo design efforts.
● Mitigation Strategy:
1. Integration of Explainable AI (XAI): The G2D platform will not be a black box. We
will integrate XAI methodologies to enhance the interpretability of our models. For
example, when the platform proposes a high-scoring molecule, it will also provide
an analysis highlighting the specific molecular substructures or predicted
interactions that contributed most significantly to its high score. This provides a
causal rationale that a medicinal chemist can evaluate, critique, and use to guide
their own intuition, fostering a collaborative human-machine partnership rather than
a blind reliance on the algorithm (a simple attribution sketch follows this list).
2. Synthesizability-Aware Generation: As detailed in the methodology (Section 3.3),
a synthetic accessibility score will be a core component of the multi-objective
reward function used in our reinforcement learning framework. The model will be
explicitly trained to optimize for molecules that are not only potent and safe but also
easy to make. By penalizing overly complex or rare chemical motifs during the
generation process, we guide the model to explore regions of chemical space that
are synthetically tractable, dramatically increasing the probability that our top
candidates can be successfully produced in the lab.
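One simple way to realize the XAI goal in item 1 is occlusion-style attribution: mask one atom at a time, rescore the molecule, and report the atoms whose removal most reduces the score. The sketch below uses RDKit and substitutes QED for the platform's full multi-parameter score; it is an assumed, illustrative approach rather than a committed XAI method.

```python
# Occlusion-style atom attribution (illustrative; QED stands in for the real scoring model).
from rdkit import Chem
from rdkit.Chem import QED

def atom_attributions(smiles: str) -> list[tuple[int, str, float]]:
    mol = Chem.MolFromSmiles(smiles)
    base = QED.qed(mol)
    contributions = []
    for atom in mol.GetAtoms():
        editable = Chem.RWMol(mol)
        editable.RemoveAtom(atom.GetIdx())             # occlude one atom
        try:
            masked = editable.GetMol()
            Chem.SanitizeMol(masked)                   # the masked fragment may be invalid
            delta = base - QED.qed(masked)             # positive delta = atom supports the score
        except Exception:
            delta = float("nan")                       # unscorable after masking
        contributions.append((atom.GetIdx(), atom.GetSymbol(), delta))
    return contributions

for idx, symbol, delta in atom_attributions("CC(=O)Oc1ccccc1C(=O)O"):
    print(f"atom {idx:2d} ({symbol}): delta = {delta:+.3f}")
```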
4.3. Challenge: Validation and the In Silico to In Vivo Gap
● Risk: The ultimate test of a drug candidate is not its performance on a computer, but its
behavior in a biological system. There is a well-documented and significant gap between
in silico predictions and real-world experimental outcomes. Over-reliance on
computational benchmarks alone is a known pitfall that can lead to wasted effort on
candidates that look promising on paper but fail in the lab. Retrospective validation, or
"rediscovering" known drugs, is often biased and not a true test of a model's prospective
power.
● Mitigation Strategy:
1. A Multi-Tiered, Prospective Validation Framework: We will implement a
rigorous, multi-stage validation process that moves progressively from
computational checks to experimental confirmation.
■ Tier 1 (Computational Benchmarking): All generated libraries will be
assessed against a standard set of in silico metrics, including chemical
validity, uniqueness, novelty against the training set, and diversity. This
ensures the basic quality of the generated output (a metrics sketch follows this list).
■ Tier 2 (Prospective Experimental Validation): The "Lab-in-the-Loop"
(Section 3.4) is our primary strategy for mitigating the reality gap. We will
mandate that a defined percentage of the most promising candidates from
each generation cycle be synthesized and tested experimentally. This
prospective validation is the only true measure of the model's real-world
performance.
■ Tier 3 (Orthogonal Confirmation): A key principle of robust science is the
confirmation of results using independent methods. A candidate's predicted
high binding affinity from a docking simulation should be confirmed with an
orthogonal experimental method, such as a biochemical binding assay.
Similarly, a positive result in one cell-based assay should be confirmed in a
different, complementary assay to rule out artifacts.
2. Embracing Failure as Valuable Data: In the G2D paradigm, a failed experiment is
not a setback; it is a crucial data point. When a highly-ranked in silico candidate
fails in a wet-lab assay, that negative result is captured, structured, and fed back
into the platform. This information is incredibly valuable, as it teaches the model
about the limitations of its current predictive abilities and helps it learn the subtle
chemical features that lead to experimental failure. This process of learning from
both successes and failures is what allows the platform to progressively close the in
silico-to-in vivo gap and become more accurate with every iteration.
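The Tier 1 benchmarks referenced above can be computed with standard chemoinformatics tooling. The sketch below, using RDKit Morgan fingerprints, estimates validity, uniqueness, novelty against a training set, and internal diversity for a generated batch; the example molecule lists are placeholders.

```python
# Tier 1 computational benchmarks: validity, uniqueness, novelty, and internal diversity.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def benchmark(generated: list[str], training: set[str]) -> dict:
    mols = [Chem.MolFromSmiles(s) for s in generated]
    valid = [Chem.MolToSmiles(m) for m in mols if m is not None]
    unique = set(valid)
    novel = {s for s in unique if s not in training}

    # Internal diversity: mean pairwise Tanimoto distance over Morgan fingerprints.
    fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
           for s in unique]
    dists = [1.0 - DataStructs.TanimotoSimilarity(fps[i], fps[j])
             for i in range(len(fps)) for j in range(i + 1, len(fps))]

    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / max(len(valid), 1),
        "novelty": len(novel) / max(len(unique), 1),
        "diversity": sum(dists) / max(len(dists), 1),
    }

training_set = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in ["CCO", "c1ccccc1"]}
generated_batch = ["CCO", "CCN", "c1ccccc1O", "bad_smiles"]
print(benchmark(generated_batch, training_set))
```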
V. Operational and Financial Framework
The successful execution of Project G2D requires a strategic allocation of capital to assemble a
specialized team and secure the necessary computational and software resources. This section
outlines a detailed operational plan and a comprehensive two-year budget projection.
5.1. Core Team and Expertise
A small, elite, interdisciplinary team is required to build and operate the G2D platform. The
following roles are critical for the initial 24-month period:
● Project Lead / Principal Investigator (1 FTE): Responsible for overall project strategy,
scientific direction, milestone management, fundraising, and building strategic
partnerships. This role requires extensive experience in both computational science and
pharmaceutical R&D.
● Lead Machine Learning Engineer (1 FTE): Responsible for the end-to-end design,
implementation, training, and optimization of the G2D generative platform. This role
requires a PhD in a relevant field (e.g., computer science, computational biology) and
demonstrated expertise in generative models (specifically diffusion models and
reinforcement learning) and their application to scientific problems. Salary benchmarks for
senior AI/ML engineers in the drug discovery space range from $160,000 to over
$360,000 annually, depending on experience and location. We will target a competitive
salary to attract top talent.
● Computational Chemist (1 FTE): Responsible for all chemoinformatics aspects of the
project. This includes target analysis, defining the parameters of the RL reward function
(binding affinity, ADMET, etc.), analyzing the chemical properties of generated molecules,
and acting as the primary interface with experimental collaborators or CROs. This role
requires a PhD in computational chemistry or a related field. Average salaries for
experienced computational chemists range from approximately $110,000 to $190,000+.
● Data Engineer / Bioinformatician (1 FTE): Responsible for building and maintaining the
project's data infrastructure. This includes the automated data ingestion pipeline,
management of multi-omics and pharmacogenomic databases, and ensuring data quality
and integrity.
5.2. Computational Infrastructure: A Cloud-First Strategy
The computational demands of training large-scale generative models are immense. An
on-premise high-performance computing (HPC) cluster represents a significant upfront capital
expenditure (a single high-performance GPU server can cost up to $290,000) and comes with
ongoing maintenance overhead and a lack of flexibility. Therefore, Project G2D will adopt a
cloud-first strategy.
● Rationale: Cloud platforms such as Amazon Web Services (AWS), Google Cloud
Platform (GCP), and Microsoft Azure provide on-demand access to state-of-the-art GPU
accelerators, scalable storage, and a rich ecosystem of managed AI/ML services. This
approach converts a large capital expense (capex) into a more manageable operational
expense (opex) and provides the elasticity to scale resources up or down based on
project needs, which is critical for a startup.
● Provider and Services: We propose utilizing AWS as the primary cloud provider, given
its market leadership and robust life sciences ecosystem. Specifically, we will leverage
Amazon SageMaker for model development, training, and deployment, a platform
successfully used by industry leaders like Insilico Medicine to accelerate their model
implementation cycles. Data will be stored in Amazon S3 for cost-effective, scalable
object storage.
● Cost Estimation: Cloud computing will be a major component of the operational budget.
Costs will be modeled based on:
○ GPU Training: The most significant cost driver. This involves renting high-end GPU
instances (e.g., AWS P4/P5 instances with NVIDIA A100/H100 GPUs) for training
and fine-tuning the G2D models.
○ Data Storage: Costs for storing terabytes of genomic data, chemical libraries, and
experimental results. Based on current pricing, 100 TB of cloud storage costs
approximately $2,000-$2,300 per month, while 500 TB costs around
$10,000-$11,000 per month.
○ General Compute and Networking: Costs for data preprocessing, API hosting,
and day-to-day operations on standard CPU instances.
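A back-of-the-envelope version of this cost model is sketched below. All unit prices are assumptions for illustration (on-demand rates vary by region and change over time); the structure of the estimate, not the exact figures, is the point. With the assumed rates, roughly 760 GPU-hours per month and 200 TB of storage reproduce the Year 1 cloud line items in Table 5.1.

```python
# Back-of-the-envelope annual cloud cost model; all unit prices are illustrative assumptions.
GPU_RATE_PER_HOUR = 33.0            # assumed rate for an 8-GPU A100-class instance
STORAGE_RATE_PER_TB_MONTH = 21.0    # assumed object-storage rate (~$2,100/month per 100 TB)
GENERAL_COMPUTE_PER_MONTH = 3300.0  # assumed CPU/networking/API baseline

def annual_cloud_cost(gpu_hours_per_month: float, storage_tb: float) -> dict:
    gpu = gpu_hours_per_month * GPU_RATE_PER_HOUR * 12
    storage = storage_tb * STORAGE_RATE_PER_TB_MONTH * 12
    general = GENERAL_COMPUTE_PER_MONTH * 12
    return {"gpu": gpu, "storage": storage, "general": general, "total": gpu + storage + general}

# Roughly reproduces the Year 1 budget (about $300k GPU, $50k storage, $40k general compute).
print(annual_cloud_cost(gpu_hours_per_month=760, storage_tb=200))
```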
5.3. Software and Data Resources
In addition to computational hardware, the project requires access to specialized commercial
software and public data repositories.
● Commercial Software:
○ Schrödinger Suite: This is the industry-standard software suite for structure-based
drug design. Access to modules like Maestro (visualization), Glide (docking),
SiteMap (binding site analysis), and Desmond (molecular dynamics) is essential.
While academic licenses can be heavily discounted (starting around $6,000 for a
base package), a commercial startup license with sufficient capacity (tokens) will be
required. We will budget accordingly based on quotes for a comprehensive
package.
○ Chemical Computing Group (CCG) MOE: A powerful and well-regarded
alternative or complementary suite to Schrödinger, providing an integrated drug
discovery software environment.
● Public Data Repositories:
○ The project will leverage numerous freely available, high-quality public databases
that are foundational to modern drug discovery. These include ChEMBL (bioactivity
data), DrugBank (approved drug data), and the Protein Data Bank (PDB) (protein
structures).
○ Large chemical libraries like ZINC and GDB-17 will be used for model pre-training
and benchmarking.
5.4. Comprehensive Two-Year Budget Projection
The following table provides a detailed line-item budget for the first two years of operation. All
figures are estimates based on the market research and data presented in the sourced
materials.
Table 5.1: Projected Two-Year Operational Budget
Personnel
● Lead ML Engineer (Salary + Benefits): Year 1 $250,000; Year 2 $260,000; Total $510,000. Based on senior AI/ML drug discovery role benchmarks.
● Computational Chemist (Salary + Benefits): Year 1 $180,000; Year 2 $187,000; Total $367,000. Based on experienced PhD-level computational chemist salary data.
● Data Engineer (Salary + Benefits): Year 1 $170,000; Year 2 $177,000; Total $347,000. Competitive salary for a specialized bioinformatics/data engineering role.
● Project Lead (Salary + Benefits): Year 1 $280,000; Year 2 $290,000; Total $570,000. Salary for a senior leadership role with scientific and business responsibilities.
● Personnel Subtotal: Year 1 $880,000; Year 2 $914,000; Total $1,794,000.
Cloud Computing
● GPU Training (High-Performance Instances): Year 1 $300,000; Year 2 $350,000; Total $650,000. Estimated cost for training and iterative fine-tuning of large generative models.
● Data Storage (S3/Blob, ~200 TB): Year 1 $50,000; Year 2 $60,000; Total $110,000. Based on storage costs of ~$2,100/month per 100 TB.
● General Compute & Networking: Year 1 $40,000; Year 2 $40,000; Total $80,000. For data processing, APIs, web servers, and daily operations.
● Cloud Subtotal: Year 1 $390,000; Year 2 $450,000; Total $840,000.
Software & Data
● Schrödinger Suite (Startup License): Year 1 $75,000; Year 2 $75,000; Total $150,000. Estimate for a comprehensive commercial startup license with multiple tokens.
● CCG MOE Suite (Startup License): Year 1 $20,000; Year 2 $20,000; Total $40,000. Estimate for a complementary software package.
● Data Acquisition (Proprietary Datasets): Year 1 $50,000; Year 2 $50,000; Total $100,000. Budget for licensing specialized datasets or biobank access if required.
● Software Subtotal: Year 1 $145,000; Year 2 $145,000; Total $290,000.
External R&D
● CRO Services (Synthesis & Assays): Year 1 $200,000; Year 2 $300,000; Total $500,000. For synthesis and initial experimental validation of top candidates from the "Lab-in-the-Loop".
● External R&D Subtotal: Year 1 $200,000; Year 2 $300,000; Total $500,000.
General & Admin
● Legal, Accounting, Insurance: Year 1 $50,000; Year 2 $50,000; Total $100,000. Standard operational costs for a startup.
● Office Space & Utilities: Year 1 $60,000; Year 2 $60,000; Total $120,000. For a small, collaborative office space.
● G&A Subtotal: Year 1 $110,000; Year 2 $110,000; Total $220,000.
Total Operating Cost: Year 1 $1,725,000; Year 2 $1,919,000; Total $3,644,000.
Contingency Fund (15%): Year 1 $258,750; Year 2 $287,850; Total $546,600. To cover unforeseen expenses and fluctuations in cloud computing costs.
Total Funding Request: $4,190,600.
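The arithmetic behind the table can be verified directly from the line items, as in the short check below: the category subtotals sum to the annual operating costs, the contingency is 15% of each year's operating cost, and the total funding request is operating cost plus contingency.

```python
# Quick arithmetic check of Table 5.1, reproduced from the line items above.
year1 = {"personnel": 880_000, "cloud": 390_000, "software": 145_000, "external_rd": 200_000, "ga": 110_000}
year2 = {"personnel": 914_000, "cloud": 450_000, "software": 145_000, "external_rd": 300_000, "ga": 110_000}

op1, op2 = sum(year1.values()), sum(year2.values())
cont1, cont2 = 0.15 * op1, 0.15 * op2
print(op1, op2, op1 + op2)                                  # 1725000 1919000 3644000
print(round(cont1), round(cont2), round(cont1 + cont2))     # 258750 287850 546600
print(round(op1 + op2 + cont1 + cont2))                     # 4190600 total funding request
```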
VI. Strategic Vision and Pathway to a Preclinical Asset
The ultimate goal of Project G2D is to create a durable, scalable, and highly valuable enterprise.
This requires not only technical excellence but also a clear strategic vision for converting
scientific innovation into tangible assets. The long-term product of this endeavor is not a single
drug, but a reusable, ever-improving platform for generating personalized medicines. The first
preclinical candidate serves as the critical proof-of-concept that validates the platform's power
and potential.
6.1. Benchmarking Against Success: Learning from the Pioneers
To establish credible timelines and ambitious yet achievable goals, we benchmark our project
against the demonstrated successes of leading companies in the AI drug discovery space. Their
achievements provide a roadmap for what is possible with a well-executed AI-driven strategy.
● Insilico Medicine: A clear leader in generative AI, Insilico has advanced multiple
AI-designed drug candidates into human clinical trials. Their platform has demonstrated
remarkable efficiency, reducing the average time from project initiation to Preclinical
Candidate (PCC) nomination to just 12-18 months, compared to the industry norm of 3-6
years. This acceleration is achieved with extreme capital efficiency, requiring the
synthesis and testing of only ~60-80 molecules per program. Their lead asset for
Idiopathic Pulmonary Fibrosis (IPF), INS018_055, progressed from novel target discovery
to Phase 1 clinical trials in under 30 months, a feat that showcases the power of their
end-to-end AI platform.
● Atomwise: Atomwise has successfully pioneered a platform-centric business model,
leveraging its AtomNet® technology for massive-scale virtual screening. The company
has engaged in over 775 research collaborations, demonstrating the high demand for and
value of its computational services. Their platform's ability to screen millions of
compounds in a matter of days has enabled the discovery of novel hits for previously
"undruggable" targets.
● Recursion Pharmaceuticals: Recursion's strategy highlights the immense value of
integrating automated wet-lab experiments with machine learning at scale. By generating
vast, proprietary maps of biology through cellular imaging, they have built a powerful,
data-first engine for unbiased drug discovery, demonstrating the virtuous cycle created
when computational models are continuously fed with high-quality, purpose-built
experimental data.
These case studies collectively prove that AI platforms can dramatically shorten timelines,
reduce costs, and tackle previously intractable biological problems. The G2D project is designed
to incorporate the key lessons from these pioneers: the end-to-end integration of Insilico, the
collaborative potential demonstrated by Atomwise, and the data-centric, iterative learning
philosophy of Recursion.
6.2. Business Model: A Hybrid Strategy for De-Risked Growth
Early-stage technology companies often face a strategic dilemma: should they focus on
developing their own internal products or license their technology platform as a service to
others? A purely internal pipeline approach concentrates risk on a few assets but captures all
the potential upside. A purely platform-as-a-service model generates near-term revenue but
may give away significant long-term value.
We propose a hybrid strategy for the initial 24-48 months to de-risk the venture and maximize
value creation:
1. Internal Pipeline Development (75% Focus): The majority of the team's resources and
efforts will be dedicated to advancing our own lead asset, as outlined in this proposal.
Successfully nominating a high-quality PCC is the single most important value inflection
point and will form the core of the company's enterprise value.
2. Selective Strategic Partnerships (25% Focus): In parallel, we will pursue 1-2 selective,
paid collaborations with established pharmaceutical or biotechnology companies. These
partnerships will involve applying the G2D platform to a target of interest to our partner.
This serves multiple strategic purposes: it provides early, non-dilutive revenue to extend
our operational runway; it serves as external validation of the G2D platform's capabilities
by industry experts; and it builds critical relationships for potential future licensing or
acquisition opportunities.
This balanced approach allows us to build long-term asset value while simultaneously validating
our technology and generating near-term revenue, creating a more resilient and attractive
investment proposition.
6.3. Project Roadmap and Key Milestones (24 Months)
To ensure disciplined execution and provide clear, measurable indicators of progress, the
project will be managed according to the following phased 24-month roadmap. Each phase
concludes with a critical deliverable that de-risks the project and serves as a key milestone for
investors and partners.
Table 6.1: Phased 24-Month Project Roadmap
Phase 1: Foundation & Target ID (Months 1-6)
● Key Objectives: Assemble core team. Establish secure cloud infrastructure and software environment. Execute multi-omics and pharmacogenomic analysis to identify and validate a lead target.
● Deliverables/Milestones: Fully staffed core team. Operational and benchmarked computational environment on AWS. A comprehensive report detailing the selection and validation of the lead biological target, including its pharmacogenomic variation profile and a set of personalized binding pockets.
Phase 2: G2D Platform Build & Initial Generation (Months 7-15)
● Key Objectives: Build, train, and validate v1.0 of the end-to-end G2D generative platform. Perform initial large-scale generation cycles conditioned on the personalized target data.
● Deliverables/Milestones: A functional, documented G2D platform capable of generating and scoring molecules. A generated library of over 1 million novel, diverse, and valid molecules for the lead target. A prioritized short-list of the top 100 candidate molecules, with full in silico profiles, ready for initial experimental validation.
Phase 3: Iterative Optimization & Lead Identification (Months 16-24)
● Key Objectives: Execute a minimum of three full "Lab-in-the-Loop" cycles, using CROs for synthesis and assays. Continuously retrain and improve the G2D platform with new experimental data. Optimize lead chemical series for potency, selectivity, and ADMET properties.
● Deliverables/Milestones: Demonstrated improvement in model predictive accuracy across at least three iterative cycles. Identification of a lead chemical series of <10 highly optimized compounds with initial positive experimental data. Final Milestone: Nomination of a single Preclinical Candidate (PCC) with a comprehensive data package, including all in silico predictions and initial experimental validation, demonstrating readiness for formal IND-enabling studies.
This roadmap provides a clear and actionable plan to translate the vision of Project G2D into a
tangible, high-value preclinical asset within a 24-month timeframe, positioning the company for
its next phase of growth and clinical development.