• When Scientific AI Forgets How It Got There

    Pharma R&D has spent the last three years quietly making AI load-bearing. Models now propose targets, score compounds, design assays, summarize literature, and increasingly draft sections of regulatory documents. The conversation has been about capability — what AI can do. The conversation we are not having, and need to, is about provenance — what the institution can prove about how it did it.

    The argument is narrow and specific: scientific AI without deterministic provenance is not just messy. It is institutionally dangerous.

    By deterministic provenance, I mean the ability to point at any AI-derived result — a hit list, a predicted structure, an annotated cohort, a generated experimental protocol — and reconstruct, bit-for-bit, the data, code, model version, parameters, random seeds, and upstream transformations that produced it. Not “logged somewhere.” Not “the analyst remembers.” Reproducible on demand, by someone other than the person who first ran it.
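
    As an illustration only, the definition above can be sketched as a content-addressed manifest: hash every ingredient, serialize the manifest canonically, and derive a single fingerprint from it. The function name `provenance_record`, the field names, and the sample values are hypothetical, chosen for this sketch; it is not how any particular platform implements provenance.

```python
import hashlib
import json


def sha256_bytes(payload: bytes) -> str:
    """Content hash used to pin an input exactly."""
    return hashlib.sha256(payload).hexdigest()


def provenance_record(data: bytes, code: str, model_version: str,
                      params: dict, seed: int, upstream: list) -> dict:
    """Pin every ingredient of a result and derive one fingerprint from them."""
    manifest = {
        "data_sha256": sha256_bytes(data),
        "code_sha256": sha256_bytes(code.encode("utf-8")),
        "model_version": model_version,
        "params": params,
        "seed": seed,
        "upstream": sorted(upstream),  # fingerprints of upstream results
    }
    # Canonical JSON (sorted keys, fixed separators) so the same ingredients
    # always serialize to the same bytes.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    manifest["fingerprint"] = sha256_bytes(canonical.encode("utf-8"))
    return manifest


# Hypothetical inputs: a tiny scored-compound file and a placeholder scoring function.
data = b"compound,score\nA,0.91\n"
code = "def score(x): ..."

record_a = provenance_record(data, code, "model-1.2.0", {"threshold": 0.5}, 42, [])
record_b = provenance_record(data, code, "model-1.2.0", {"threshold": 0.5}, 42, [])
record_c = provenance_record(data, code, "model-1.2.0", {"threshold": 0.5}, 43, [])

print(record_a["fingerprint"] == record_b["fingerprint"])  # True: same ingredients
print(record_a["fingerprint"] == record_c["fingerprint"])  # False: only the seed changed
```

    Two runs with identical ingredients share a fingerprint, and changing only the seed produces a different one. That property is what turns "reproduce this result" into a checkable request rather than a recollection.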

    Most pharma AI today does not meet this bar. Notebooks reference data files whose schemas have since drifted. Foundation models are version-pinned on paper but rarely in practice. Pipelines stitch together SaaS APIs, internal scripts, and curated spreadsheets. The output looks crisp; the lineage is a smear.

    This is fine when AI is a brainstorming partner. It becomes dangerous the moment AI output enters the institutional record — IND-enabling packages, IP filings, partnership data rooms, internal go/no-go decisions. Three failure modes follow.

    First, regulatory exposure. FDA and EMA expectations are converging on reconstructable analyses. “Our model said so” is not a defense if the model, weights, and inputs cannot be reproduced. Sponsors who accelerate with AI but cannot defend the chain of derivation are accumulating a regulatory debt that comes due during inspection, not before.

    Second, audit blindness when programs fail. Most clinical and preclinical programs fail. The institutional value of a failed program is the post-mortem — what did we believe, why did we believe it, where did the evidence break. When AI-derived intermediates cannot be reconstructed, the post-mortem cannot be performed honestly. The organization loses the ability to learn from its own failures, which is the most expensive form of institutional damage.

    Third, decision velocity outrunning evidence integrity. AI compresses cycle times, which is the point. But the same compression means provenance debt accrues faster than humans can repair it. A team can make twelve AI-assisted decisions in the time it used to make one, with one-twelfth the lineage discipline. The danger is not that any one decision is wrong. It is that the organization can no longer tell which decisions rest on which evidence.

    The fix is not more logging. Logging is observational and lossy. Deterministic provenance has to be a property of the system that produces results, not a record kept alongside it — data, code, and computation tracked in the same structure, with versioning and lineage that are queryable and reproducible. Designed in, not bolted on.

    Leaders evaluating AI tooling should ask a single question of every vendor and every internal team: if a regulator, a partner, or a future post-mortem asks how this result was produced, can we rebuild it from first inputs, without depending on the person who ran it? If the honest answer is no, the institution is not adopting AI. It is borrowing against its own credibility.

    The organizations that win the next decade of scientific AI will not be the ones with the largest models. They will be the ones whose AI outputs are still defensible three years after the analyst has left.


    This is the conviction behind how we built the DataJoint platform: provenance as a first-class property of scientific computation, not a logbook kept beside it.

  • DataJoint Enables Seamless Migration of CWL Pipelines to Its Governed, Reproducible Scientific Data Infrastructure

    DataJoint today announced native support for converting Common Workflow Language (CWL) pipelines into DataJoint pipelines, enabling research organizations to immediately modernize existing scientific workflows — without sacrificing prior investment or starting from scratch.


    CWL: Widely Adopted, But Increasingly Constrained

    Common Workflow Language has become a de facto standard across pharmaceutical R&D, genomics, and academic research for defining portable, reproducible computational workflows. Major cloud and bioinformatics platforms support CWL natively, and it is broadly adopted across federally funded genomics programs and industry R&D consortia — making it one of the most widely deployed workflow standards in life sciences.

    Yet CWL has recognized limitations in production environments: limited error handling and debugging, no native provenance tracking, poor support for partial re-runs when a step fails mid-pipeline, and no mechanism to query workflow state. As AI-driven research demands tighter auditability and reproducibility, these gaps create real scientific and operational risk.

    What DataJoint Provides

    DataJoint’s CWL conversion layer reads existing CWL workflow definitions and executes them as native DataJoint pipelines. Research teams can extend these pipelines — mixing CWL definitions with DataJoint’s Python-based schema framework — and run them in interpreted mode today, with compiled execution on the roadmap. Key capabilities include:

    Automatic provenance. Every CWL step is backed by DataJoint’s schema-driven provenance layer, creating a complete, queryable record of inputs, outputs, and computational history.

    Granular retry and resilience. Failed steps can be individually retried or corrected without re-running the entire pipeline — a critical capability for long-running, high-cost workflows.

    Queryable state. Workflow state is accessible via DataJoint’s standard query syntax, enabling real-time monitoring and downstream analysis.

    Natural parallelization. Pipelines are decomposed into discrete, independently executable steps that support cluster-level parallelism and graceful pause/resume without lost progress.

    Structured entity database. Critically, DataJoint does not simply execute CWL workflows — it builds a structured database around the scientific entities those workflows produce. The conversion process involves explicitly defining the entities created at each stage (such as processed samples, imaging results, or analysis outputs) and the dependencies between them. This transforms a pipeline from a sequence of compute steps into a living, queryable scientific record — one that captures not just what ran, but what was produced, how it relates to other data, and how it can be reused.
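
    To make the retry and queryable-state ideas concrete, here is a minimal sketch that uses only Python’s standard library. It is not DataJoint’s API or its CWL conversion layer; the `step_state` table, the `run_pipeline` helper, and the three step names are assumptions invented for this example. Each step’s outcome is written to a table, so a re-run skips finished steps and the failure itself can be queried.

```python
import sqlite3

# In-memory table holding one row of state per pipeline step (hypothetical layout).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE step_state (step TEXT PRIMARY KEY, status TEXT, detail TEXT)")

calls = []  # records which steps actually executed, for illustration


def run_pipeline(steps):
    """Run steps in order, skipping finished ones and stopping at the first failure."""
    for name, func in steps:
        row = db.execute(
            "SELECT status FROM step_state WHERE step = ?", (name,)
        ).fetchone()
        if row and row[0] == "done":
            continue  # completed in an earlier run; do not recompute
        try:
            calls.append(name)
            func()
            status, detail = "done", ""
        except Exception as exc:
            status, detail = "failed", str(exc)
        db.execute(
            "INSERT OR REPLACE INTO step_state VALUES (?, ?, ?)",
            (name, status, detail),
        )
        if status == "failed":
            break  # downstream steps wait until this one is fixed


# Hypothetical steps; the second one fails on the first attempt.
def align():
    pass


def call_variants_broken():
    raise RuntimeError("reference index missing")


def call_variants_fixed():
    pass


def annotate():
    pass


run_pipeline([("align", align), ("call_variants", call_variants_broken), ("annotate", annotate)])
failed = db.execute("SELECT step, detail FROM step_state WHERE status = 'failed'").fetchall()
print(failed)  # [('call_variants', 'reference index missing')]

# After the fix, only the failed step and the step after it execute.
run_pipeline([("align", align), ("call_variants", call_variants_fixed), ("annotate", annotate)])
print(calls)  # ['align', 'call_variants', 'call_variants', 'annotate']
```

    The first run leaves a queryable failure record; the second run executes `align` zero additional times. A production system would also need concurrency control and per-entity rather than per-step state, which this sketch leaves out.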

    “Scientific AI will only be as trustworthy as the data foundation beneath it. CWL gave the research community a powerful way to define workflows — DataJoint gives those workflows the provenance, traceability, and governance they need to support defensible science and AI-ready research at scale.” — Jim Olson, CEO, DataJoint

  • DataJoint at CoSyNe 2026: Building AI-Ready Data Workflows for Neuroscience

    This March, DataJoint’s Chief Science Officer Dimitri Yatsenko, PhD, and SciOps Engineer Milagros Marín, PhD, presented a tutorial at CoSyNe 2026 (Computational and Systems Neuroscience) called Building AI-Ready Data Workflows for Neuroscience Experiments.

    The materials from this talk are now available to the public at the conclusion of this blog post.


    The session brought together computational and systems neuroscientists for a hands-on look at what it takes to make scientific data infrastructure ready for AI — not someday, but now.

    Here are the key ideas we shared:

    1. Operational rigor is the foundation for AI in science. Dimitri opened with a provocation: How must research teams transform their work to harness AI? The answer isn’t better models — it’s better data discipline. Without structured schemas, enforced provenance, and reproducible computations, AI agents have nothing reliable to work with. We built on the SciOps Capability Maturity Model — a five-level roadmap from ad hoc scripts to closed-loop AI-assisted discovery — giving labs a concrete path to assess and grow their operational readiness.

    Dr. Dimitri Yatsenko presenting the DataJoint RNA-Seq pipeline in collaboration with the Cadwell Lab at UCSF

    2. The schema is not a record of the science; it is the science. We showed how DataJoint’s relational workflow model unifies database, code, and computation into a single formal schema. Tables represent workflow steps, rows represent artifacts, and foreign keys prescribe execution order. The pipeline diagram is the database, not documentation that drifts from reality.
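
    A small sketch may help show what it means for foreign keys to prescribe execution order. The table names below are hypothetical, and the plain dictionary stands in for a real schema; it maps each table to the parent tables it references, and a topological sort (Python’s standard `graphlib`) yields an order in which every parent is computed before its children.

```python
from graphlib import TopologicalSorter

# Hypothetical tables, each mapped to the parent tables its foreign keys reference.
foreign_keys = {
    "Session": set(),
    "Recording": {"Session"},
    "SpikeSorting": {"Recording"},
    "BehaviorTracking": {"Session"},
    "Analysis": {"SpikeSorting", "BehaviorTracking"},
}

# static_order() yields each table only after all of its parents.
order = list(TopologicalSorter(foreign_keys).static_order())
print(order)

# Check the guarantee explicitly: every parent precedes its child.
position = {name: index for index, name in enumerate(order)}
parents_first = all(
    position[parent] < position[child]
    for child, parents in foreign_keys.items()
    for parent in parents
)
print(parents_first)  # True
```

    The relative order of independent tables (here `Recording` and `BehaviorTracking`) is not fixed, which is also why such steps can run in parallel.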

    DataJoint platform architecture. The open-source Python library provides the relational workflow model—schema definition, query algebra, and distributed computation. This core integrates with a relational database (system of record), object storage (for scalable data), and code repositories (for version-controlled pipeline definitions). The managed platform adds infrastructure, observability, and orchestration for production deployments. Milagros’ adaptation from Yatsenko & Nguyen, arXiv:2602.16585 (2026)

    3. Three production pipelines, three scales, one platform. Milagros walked through three real-world projects running on DataJoint: the ORION pipelines (brain organoid generation with four integrated electrophysiology modalities, tracking complete provenance from iPSC to spike waveform, in collaboration with the Shcheglovitov lab at the University of Utah); Project AEON (24/7 continuous behavior at the Sainsbury Wellcome Centre, processing 7 million data points per day and weeks of Neuropixels recordings); and the DataJoint MoSeq pipeline (unsupervised behavioral syllable discovery in collaboration with the Datta Lab at Harvard Medical School).

    Dr. Milagros Marín demonstrates how DataJoint orchestrates the AEON foraging platform at UCL’s Sainsbury Wellcome Centre — unifying Bonsai-acquired data streams, SLEAP pose estimation, and continuous electrophysiology into a structured, automated pipeline for weeks-long freely moving behavior experiments.

    4. Reproducibility validated, not just claimed. Each project included rigorous validation — positive and negative controls for ORION, dynamic schema generation tested at scale for AEON, and benchmark-matched syllable durations for MoSeq. The pipeline reproduces the science, not just the workflow.

    5. AI agents can query and reason over structured pipelines. We demonstrated an AI assistant that connects directly to a DataJoint pipeline, queries behavioral data, interprets distributions, and generates scientific summaries, all made possible by a self-documenting, queryable schema.

    As Milagros put it: “Scientists direct, AI agents execute, and the data infrastructure doesn’t just store science — it understands it.”

    6. Open-source, community-ready, publication-grade. All three project codebases are open-source. The ORION pipeline has a paper in preparation (Marín et al., 2026) and a poster to be presented at FENS Forum 2026, and AEON’s preprint (Campagner et al., 2025) is out. We’re building toward an ecosystem where any lab can adopt these workflows and plug in their own protocols.

    DataJoint Tutorial at CoSyNe 2026

    It was energizing to connect with the CoSyNe community — researchers who think deeply about computation and are ready to bring that rigor to their data infrastructure. The conversation reinforced something we believe strongly: the lab of the future doesn’t just manage files — it manages knowledge.


    Missed the tutorial?

    The DataJoint tutorial materials are available at datajoint.com/cosyne-2026, including live demos of all three pipelines and DataJoint’s framework for AI-ready research operations.

    Want to explore what AI-ready workflows look like for your lab? Visit https://docs.datajoint.com or email [email protected] to talk about building AI-ready infrastructure for labs and institutions.

    Trusted Data. Trusted AI. Trusted Science.

  • Dave Schuette Joins DataJoint Board of Directors

    DataJoint, the scientific data infrastructure company enabling defensible and reproducible AI in regulated R&D, today announced that Dave Schuette has joined its Board of Directors as an independent member. The appointment strengthens DataJoint’s leadership as the company expands into pharmaceutical and life sciences markets following the recent launch of its Agentic AI platform.

    Schuette is the founder and managing partner of Slide3, a boutique consulting firm serving pharmaceutical, financial services, and technology clients. With more than 25 years of experience as a business management executive, he brings a track record of transforming organizations and creating disruptive operational strategies at scale.

    Prior to founding Slide3 in 2018, Schuette served as EVP and President of the Enterprise Business Unit at Synchronoss Technologies, leading growth across healthcare and life sciences. He was a founding partner of Knowledgent, a data and analytics firm acquired by Accenture, and held senior roles at BusinessEdge Solutions, acquired by EMC. A Top 25 Consultant of the Year honoree from Consulting Magazine, he also brings direct pharma industry experience, including work with Bristol-Myers Squibb.

    “Dave’s expertise in pharma and technology, and his ability to help companies scale with clarity and purpose, are exactly what DataJoint needs at this inflection point,” said Jim Olson, CEO of DataJoint. “His experience bridging scientific rigor with operational agility makes him an invaluable addition to our board.”

    Schuette joins weeks after DataJoint launched DataJoint Agentic AI, a governed execution layer that enables semi-autonomous AI operation across scientific workflows — allowing pharma and biotech organizations to automate complex pipelines while maintaining full reproducibility and auditability.

    “DataJoint has built something genuinely differentiated — a platform that makes AI-ready data a reality, not just an aspiration,” said Dave Schuette. “I’m proud to join the board and help accelerate its mission at a time when trustworthy scientific AI has never mattered more.”

  • DataJoint Launches Agentic AI Control Layer for Scientific Workflows

    DataJoint today announced the launch of DataJoint Agentic AI, a governed execution layer for scientific workflows that enables semi-autonomous AI operation on rigorously structured, provenance-rich data.

    As pharmaceutical and academic institutions accelerate investment in generative and agentic AI to further innovation, many are confronting a critical constraint: AI systems trained on fragmented, under-described scientific data cannot reliably reproduce, audit, or defend their outputs. In regulated research environments, this lack of context creates material scientific and operational risk.


    DataJoint addresses this challenge at its source

    The platform captures multi-modal scientific data in precisely defined, interconnected frameworks — embedding rich metadata and full computational provenance at the point of every experimental result. By grounding AI agents in this context-rich foundation, DataJoint enables automated workflow execution while preserving reproducibility, traceability, and decision accountability.

    “Scientific AI will only be as trustworthy as the data foundation beneath it,” said Jim Olson, CEO of DataJoint. “We built DataJoint to ensure that every AI-driven insight is grounded in structured provenance and computational context — so that scientific decisions are not just faster, but defensible and reliable.”

    DataJoint’s agentic AI enables semi-autonomous execution of complex, multi-step scientific pipelines across imaging, electrophysiology, genomics, behavioral data, and more, within a governed, reproducible framework built for regulated and research environments. For pharma and biotech, this means faster hypothesis validation and AI-ready datasets that support regulatory confidence. For academic and medical centers, it means scaling sophisticated research without sacrificing rigor. In both cases, the goal is to accelerate discovery and speed innovation.

    For example, an AI agent operating within DataJoint can validate experimental inputs, trigger downstream processing, detect data and structure inconsistencies, and ensure computational reproducibility — all while maintaining a complete, queryable record of decisions and transformations.

    DataJoint’s structured scientific data infrastructure is already deployed in leading academic medical centers and industry research environments, supporting reproducible multi-modal pipelines at scale.


    Industry Showcases

    DataJoint will demonstrate its Agentic AI capabilities at:

    PMWC 2026 (Precision Medicine World Conference)
    March 4–6, 2026 | San Jose, CA

    Lab of the Future USA Congress
    March 2–3, 2026 | Boston, MA

    These events convene leaders in precision medicine, biopharma R&D, and digital laboratory transformation.

  • Welcoming John Apathy to the DataJoint Team as Strategic Advisor

    We’re thrilled to share that John Apathy has joined DataJoint as a Strategic Advisor, bringing deep expertise in data-driven innovation and AI strategy in life sciences R&D.


    John has spent over three decades leading digital transformation across organizations like Bristol Myers Squibb, Celgene, and GlaxoSmithKline—helping research and development teams turn complex data into scientific breakthroughs. Today, as Chief Solutions Officer at XponentL Data, a Genpact company, he continues to guide organizations in making data and AI a true competitive advantage.


    At DataJoint, we’re on a mission to make scientific research more reproducible, integrated, and AI-ready through our SciOps platform. John’s experience will be key in helping us accelerate that mission—empowering scientists to connect instruments, data, and computation into automated workflows that drive discovery.


    As our CEO Jim Olson put it:

    “John’s deep experience in digital and data transformation makes him an outstanding addition to our advisory team.”

    We couldn’t agree more. Welcome, John—excited to build the future of data-driven science together!

  • DataJoint at SfN 2025

    The DataJoint team is excited to connect with the neuroscience community at the Society for Neuroscience (SfN) 2025 meeting in San Diego, November 15–19! We’re bringing everything from insights on groundbreaking large-scale projects to practical tools that are shaping how labs manage and share their data today.

    From Circuit Mapping to Career Pathways: Our SfN Journey

    Our presence at SfN this year tells a story that begins with one of the most ambitious neuroscience projects of the decade and extends to the everyday challenges facing labs worldwide.


    Saturday, Nov 15: The MICrONS Legacy

    The Machine Intelligence from Cortical Networks (MICrONS) project represents a watershed moment in systems neuroscience by creating the largest functional connectome of mammalian cortex to date. This massive collaborative effort, recently published in Nature, has generated unprecedented multimodal datasets combining electron microscopy, calcium imaging, and electrophysiology across millions of synapses.

    DataJoint’s Chief Science Officer, Dimitri Yatsenko, will chair a nanosymposium diving deep into these insights:

    Nanosymposium NANO003: Insights From the MICrONS Project
    Saturday, November 15, 1:00–2:45 PM
    📍 San Diego Convention Center, Room 30


    Sunday, Nov 16: Building Your Career in Neuroinformatics

    The increasing complexity of multimodal datasets in neuroscience presents not only technical challenges but also new career opportunities. This Sunday, Dimitri, alongside Mathew Abrams, Uma Karmarkar, and Stephanie Albin, will lead a professional development workshop. The session will focus on the expanding field of neuroinformatics and explore career paths beyond traditional academia. Panelists, who have successfully navigated diverse careers leveraging neuroscience skills and big data analysis, will share their journeys and offer advice to attendees on how to pursue similar unconventional roles.

    Life After the PhD: Career Opportunities in Brain Data Science
    Sunday, November 16, 3:00–5:00 PM
    📍 San Diego Convention Center, Room 2

    Whether you’re a graduate student considering your options or a PI thinking about how to support your team’s development, this session will illuminate the expanding opportunities at the intersection of neuroscience, data science, and software engineering.


    Monday, Nov 17: Practical Solutions for Multimodal Data

    The lessons from large-scale collaborations like MICrONS are directly informing how we approach everyday lab data challenges. Visit our poster to see how principled data management frameworks are making multiphoton imaging data more accessible and reusable:

    Poster 12424: A Principled Framework for Compression and Standardization of Multiphoton Data
    Session PSTR198: Techniques and Software for Imaging and Neural Analyses
    Monday, November 17, 8:00 AM–12:00 PM

    This work reflects DataJoint’s current focus: helping labs capture, standardize, and share the increasingly complex multimodal datasets that modern neuroscience demands.


    Let’s Connect: Visit Booth 3326

    Throughout the conference, our team will be at Booth 3326, ready to discuss:

    • How collaborative data frameworks like those used in MICrONS can scale down to individual labs
    • Strategies for managing multimodal datasets (ephys, imaging, behavior, and more)
    • Your specific data management challenges and how DataJoint might help

    From the largest circuit mapping projects to your next experiment, the thread connecting our SfN activities is clear: neuroscience is increasingly about managing, integrating, and sharing complex data.

    We can’t wait to see you in San Diego. Let’s connect and explore what you are working on and how we can get there together.

  • DataJoint at the PharmStars PharmaTech Innovation Summit

    On November 18th, DataJoint will join innovators, pharma R&D leaders, and investors at the PharmStars PharmaTech Innovation Summit in Boston. We’re excited to present how the DataJoint Scientific Data Platform is helping research organizations modernize their data foundations to accelerate discovery, improve reproducibility, and support the next generation of AI-driven science.


    Modern R&D workflows generate massive, multimodal datasets: imaging, sequencing, electrophysiology, clinical measures, behavioral data, and more. Yet the context needed to make that data useful often remains locked across instruments, lab systems, and custom scripts. This fragmentation slows research, limits collaboration, and undermines confidence in AI/ML insights.

    DataJoint solves this by unifying three pillars of scientific data operations in one platform:

    • Multimodal Data Integration
      • Structured and linked data spanning experiments, samples, instruments, pipelines, and analyses.
    • Scientific Context & Provenance
      • Every dataset and model result is traceable back to its experimental conditions and computational lineage.
    • Reproducible Computational Workflows
      • Standardized workflows that run reliably across teams, data centers, and cloud environments.

    The result: trusted, AI-ready data with transparency and governance built in.

    At PharmStars, DataJoint’s VP of Growth & Partnerships, Dana Wojtasinski, together with DataJoint’s CEO, Jim Olson, will be sharing real examples of how leading neuroscience, translational biology, and research teams are using DataJoint to:

    • Accelerate time to insight
    • Improve repeatability across studies, labs, and therapeutic programs
    • Scale computational pipelines for ML and data science
    • Reduce data engineering overhead and silos

    The PharmaTech Innovation Summit brings together the people building the next wave of R&D infrastructure—and we’re looking forward to collaborating with organizations shaping that future.

    If you work in biopharma or invest in digital R&D innovation and would like to attend, feel free to reach out for an invitation at [email protected].

    Trusted Data. Trusted AI. Trusted Science.

  • Brian Napack Joins DataJoint as Strategic Advisor to Advance Scientific Research

    DataJoint is thrilled to welcome Brian Napack as a strategic advisor and investor. A visionary leader with decades of experience in education, research, publishing, and technology, Brian brings a proven track record of scaling innovation to create lasting impact. As the former CEO of John Wiley and current Executive Chairman of 2U, Brian has consistently championed initiatives that enhance the productivity and ROI of science and education.

    Brian will support DataJoint’s mission to revolutionize data management and AI in scientific research, helping labs worldwide overcome challenges in fragmented data, collaboration, and reproducibility. With over 100 labs already leveraging DataJoint’s platform and a recent $4.9M Seed funding round, the company is poised to transform research workflows in academia, life sciences, and beyond.

    We’re excited to partner with Brian as we continue to drive scientific discovery forward!

  • DataJoint Raises $4.9M to Transform Life Sciences Data Management with AI

    DataJoint has closed a $4.9M Seed funding round, co-led by Nina Capital, Inoca Capital Partners, and Capital Factory. The funding will fuel the growth of its team, expand its AI-powered SaaS platform, and extend its reach into life sciences and pharma industries across the U.S. and Europe.

    Already used by over 100 labs, including Johns Hopkins and Harvard, DataJoint’s platform harmonizes multimodal data and streamlines workflows. The company’s participation in the PharmStars accelerator further highlights its role as a key innovator in digital health and pharma collaboration.


    “This investment enables us to scale and bring transformative solutions to researchers and organizations,” said CEO Jim Olson.

    With its cutting-edge AI integration, DataJoint is set to revolutionize data management in life sciences.