Facilities & Other Resources (Phase I)

PromptGenix LLC · Contact: dohoon.kim1@icloud.com · promptgenix.org
Summary

Phase I development and validation will be performed using a modern workstation + optional cloud burst capacity, an open-source bioinformatics/ML stack, and public biomedical repositories (datasets + literature). The environment is designed to be auditable and deterministic: pinned versions, captured configs, and checksummed artifacts enable reproducible evidence objects and stable hypothesis ranking outputs.

Key design choice: Confidence is computed by an evidence-weighted Bayesian decision engine; LLMs are used only for interpretability (evidence-linked rationale + citations) under “no evidence → no claim” constraints.
Computing environment

Primary development

  • Workstation: macOS and/or Linux development environment for pipeline engineering, testing, and report generation.
  • Local execution: accession-driven runs, caching of intermediate artifacts, and deterministic re-runs for regression tests.
  • Storage: structured project directories with immutable run outputs (configs + logs + evidence objects + reports); a minimal layout sketch follows this list.
Deterministic runs · Immutable artifacts · Checksums · Audit logs
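As a minimal sketch of the storage and deterministic re-run behavior described above, the snippet below derives a run ID from an accession plus a config snapshot and lays out an immutable per-run directory. The directory names and the make_run_id / init_run_dir helpers are illustrative, not the pipeline's actual interface.

```python
import hashlib
import json
from pathlib import Path

def make_run_id(accession: str, config: dict) -> str:
    """Derive a deterministic run ID from the accession plus a config snapshot."""
    payload = json.dumps({"accession": accession, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

def init_run_dir(root: Path, accession: str, config: dict) -> Path:
    """Create an immutable per-run directory and freeze the config snapshot."""
    run_dir = root / f"run_{make_run_id(accession, config)}"
    run_dir.mkdir(parents=True, exist_ok=False)  # refuse to overwrite a past run
    (run_dir / "config.json").write_text(json.dumps(config, indent=2, sort_keys=True))
    for sub in ("evidence", "logs", "reports"):
        (run_dir / sub).mkdir()
    return run_dir
```

Because the run ID is a pure function of its inputs, repeating a run with an identical accession and config targets the same directory, so drift or accidental overwrites surface immediately.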

Optional scalable compute

  • Cloud burst (optional): containerized or VM-based execution for larger public datasets and parallel jobs.
  • Batch execution: reproducible job definitions with pinned dependencies and resource-limited profiles (a minimal job-definition sketch follows this list).
  • Security boundary: Phase I uses public datasets only; no sensitive/private data required for feasibility.
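The job-definition sketch below shows one hedged way a resource-limited job could be launched against a container image pinned by digest; the image reference and the pipeline CLI are placeholders, and the actual Phase I runtime or scheduler may differ.

```python
import subprocess

# Placeholder image pinned by digest so the job cannot silently pick up a new build.
IMAGE = "ghcr.io/example/omics-pipeline@sha256:<digest>"

def run_batch_job(accession: str, cpus: int = 4, memory: str = "16g") -> None:
    """Launch one accession-driven job with explicit CPU and memory limits."""
    subprocess.run(
        [
            "docker", "run", "--rm",
            "--cpus", str(cpus),
            "--memory", memory,
            IMAGE,
            "pipeline", "run", "--accession", accession,  # placeholder pipeline CLI
        ],
        check=True,
    )
```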
Phase I boundary: all validation uses public sources (datasets + literature) to minimize IP/data-use conflicts.
Software & pipeline stack

Omics pipelines

  • RNA-Seq / scRNA-Seq: standard alignment/quantification + QC + differential analysis + pathway scoring.
  • Flow cytometry: QC + harmonization + summary features + cell-type/marker-level evidence extraction.
  • Evidence objects: effect size, uncertainty, quality, and context descriptors produced per dataset.
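One way the evidence objects above could be represented, as an illustrative schema only (the production schema may carry additional QC and provenance fields):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EvidenceObject:
    """One dataset-level evidence record feeding the Bayesian ranking engine."""
    hypothesis_id: str          # hypothesis this evidence bears on
    dataset_accession: str      # e.g. a GEO/SRA accession
    effect_size: float          # signed effect estimate (direction matters)
    std_error: float            # uncertainty of the effect estimate
    quality: float              # 0-1 QC-derived weight
    direction: str              # "supporting" or "conflicting"
    context: dict = field(default_factory=dict)   # tissue, cell type, platform, ...
```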

Inference & ranking

  • Bayesian updating: priors built from structured biological knowledge + literature signals, updated by per-dataset evidence likelihoods (a toy numerical sketch follows this list).
  • Uncertainty: posterior confidence with credible intervals + explicit supporting vs. conflicting evidence coverage.
  • Calibration tests: held-out checks to verify confidence behavior and robustness to heterogeneity.
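The toy sketch below shows how a prior and quality-weighted per-dataset evidence might combine into a posterior confidence via log-odds accumulation; the real engine's likelihood model and weighting scheme are more involved, so treat this purely as an illustration.

```python
import math

def posterior_confidence(prior_prob: float, evidence: list[tuple[float, float]]) -> float:
    """Combine a prior with (log-likelihood-ratio, quality-weight) pairs per dataset."""
    log_odds = math.log(prior_prob / (1.0 - prior_prob))
    for llr, weight in evidence:
        log_odds += weight * llr  # low-quality datasets contribute less
    return 1.0 / (1.0 + math.exp(-log_odds))

# Toy example: weak prior, two supporting datasets, one conflicting dataset.
print(round(posterior_confidence(0.10, [(1.2, 0.9), (0.8, 0.7), (-0.5, 0.5)]), 3))  # ~0.31
```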

Interpretability & outputs

  • LLM guardrails: narrative generation constrained to supplied evidence objects + citations; claims lacking linked evidence are withheld rather than generated (see the filter sketch after this list).
  • Reports: exportable HTML/PDF with ranked hypotheses, confidence bands, evidence links, and next-step experiments.
  • Traceability: every claim points to evidence artifacts or cited literature sources.
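The filter sketch referenced above illustrates one way to apply the "no evidence → no claim" constraint after generation; the in-text reference formats (EV-… identifiers, PMID citations) are assumptions for illustration.

```python
import re

# Assumed in-text reference formats for evidence objects and literature citations.
EVIDENCE_REF = re.compile(r"\[(?:EV-\d+|PMID:\d+)\]")

def enforce_no_evidence_no_claim(sentences: list[str]) -> tuple[list[str], list[str]]:
    """Keep only narrative sentences that cite an evidence object or a paper."""
    kept, withheld = [], []
    for sentence in sentences:
        (kept if EVIDENCE_REF.search(sentence) else withheld).append(sentence)
    return kept, withheld
```

A filter like this pairs naturally with the run logs described later: withheld sentences can be recorded rather than silently discarded, keeping the narrative auditable.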
Public data sources

  • Public omics datasets: GEO / SRA (bulk RNA-Seq, scRNA-Seq) and curated public immunology studies. Used for accession-driven retrieval, evidence extraction (effect sizes, uncertainty, context), and re-run reproducibility tests (a retrieval sketch follows this list).
  • Immune profiling: FlowRepository and other public cytometry resources. Used for harmonization + summary evidence signals and cross-dataset consistency checks for hypothesis ranking.
  • Scientific literature: PubMed / PMC (open abstracts/full text where available). Used for literature-derived priors, citation grounding, and support vs. conflict signal extraction for traceable explanations.
  • Knowledge resources: pathway and gene-set resources (e.g., KEGG/Reactome-like sets), marker panels, and cell-type references. Used for prior construction features (pathway relevance, cell specificity) and standardized hypothesis templates.
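The retrieval sketch referenced in the first bullet shows one way an accession lookup and its timestamp could be recorded for the input manifest, using NCBI E-utilities; the accession handling and manifest fields are examples only.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def log_geo_retrieval(accession: str) -> dict:
    """Look up a GEO accession and return an input-manifest entry with a timestamp."""
    params = urllib.parse.urlencode({"db": "gds", "term": accession, "retmode": "json"})
    with urllib.request.urlopen(f"{EUTILS}?{params}") as resp:
        result = json.load(resp)["esearchresult"]
    return {
        "accession": accession,
        "uids": result.get("idlist", []),
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }
```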
Reproducibility controls

Version pinning + checksums + immutable artifacts.

What is pinned and captured

  • Code + configs: repository commit hash and per-run configuration snapshot.
  • Dependencies: pinned package versions (Python/R), tool versions, and runtime metadata.
  • Inputs: accessions, retrieval timestamps, and input-file manifests.
  • Artifacts: evidence objects, intermediate outputs, and final reports stored immutably per run.
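A minimal sketch of how these pinned items might be captured into a per-run manifest; the package list, file name, and layout are illustrative.

```python
import json
import platform
import subprocess
from importlib import metadata
from pathlib import Path

def write_run_manifest(run_dir: Path, accessions: list[str], packages: list[str]) -> None:
    """Snapshot code version, dependency versions, and inputs for a single run."""
    manifest = {
        "commit": subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        ).stdout.strip(),
        "python": platform.python_version(),
        "packages": {pkg: metadata.version(pkg) for pkg in packages},
        "accessions": sorted(accessions),
    }
    (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2, sort_keys=True))
```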

How outputs are verified

  • Checksums: per-file SHA-256 checksums for key artifacts (evidence objects, figures, reports); a verification sketch follows this list.
  • Determinism tests: same inputs/configs should reproduce the same artifacts (within defined tolerances).
  • Run logs: structured logs capturing steps, warnings, and evidence coverage (supporting/conflicting/missing).
  • Audit trail: ranked hypotheses link back to evidence objects and cited sources.
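The verification sketch referenced above streams each artifact through SHA-256 and compares against recorded checksums; the checksums.json file name and layout are assumptions.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large artifacts are handled safely."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifacts(run_dir: Path) -> dict[str, bool]:
    """Compare current artifact hashes against the recorded checksums.json."""
    recorded = json.loads((run_dir / "checksums.json").read_text())
    return {name: sha256_of(run_dir / name) == digest for name, digest in recorded.items()}
```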
Reviewer-facing deliverable: Each Phase I pilot run produces a compact, verifiable package: (1) run manifest, (2) pinned versions + config, (3) evidence object bundle, (4) hypothesis ranking outputs with uncertainty, and (5) HTML/PDF report with traceable links.
Availability

All resources required for Phase I feasibility (compute, software stack, and public data access) are available to execute the proposed work. The system is designed to support scalable Phase II deployments (secure VPC/on-prem) without changing the core reproducibility guarantees.