Facilities & Other Resources (Phase I)

PromptGenix LLC · Contact: dohoon.kim1@icloud.com · promptgenix.org

Summary

Phase I development and validation will be performed on a modern workstation with optional cloud burst capacity, using an open-source bioinformatics/ML stack and public biomedical repositories (datasets and literature). The environment is designed to be auditable and deterministic: pinned versions, captured configs, and checksummed artifacts make evidence objects reproducible and keep hypothesis ranking outputs stable across re-runs.

Key design choice: Confidence is computed by an evidence-weighted Bayesian decision engine; LLMs are used only for interpretability (evidence-linked rationale + citations) under “no evidence → no claim” constraints.

Computing environment

Primary development

Workstation: macOS and/or Linux development environment for pipeline engineering, testing, and report generation.
Local execution: accession-driven runs, caching of intermediate artifacts, and deterministic re-runs for regression tests.
Storage: structured project directories with immutable run outputs (configs + logs + evidence objects + reports).

Deterministic runs · Immutable artifacts · Checksums · Audit logs

Optional scalable compute

Cloud burst (optional): containerized or VM-based execution for larger public datasets and parallel jobs.
Batch execution: reproducible job definitions with pinned dependencies and resource-limited profiles.
Security boundary: Phase I uses public datasets only; no sensitive/private data required for feasibility.

Phase I boundary: all validation uses public sources (datasets + literature) to minimize IP/data-use conflicts.

Software & pipeline stack

Omics pipelines

RNA-Seq / scRNA-Seq: standard alignment/quantification + QC + differential analysis + pathway scoring.
Flow cytometry: QC + harmonization + summary features + cell-type/marker-level evidence extraction.
Evidence objects: effect size, uncertainty, quality, and context descriptors produced per dataset.
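
As an illustration of the evidence objects described above, one per-dataset record could be represented as a small immutable structure. This is a minimal sketch under stated assumptions: the class name `EvidenceObject`, its field names, and the placeholder accession are hypothetical and introduced only for this example; they are not the pipeline's actual schema.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass(frozen=True)
class EvidenceObject:
    """Hypothetical per-dataset evidence record (field names are illustrative)."""
    accession: str       # public dataset identifier (placeholder values below)
    hypothesis_id: str   # hypothesis this evidence bears on
    effect_size: float   # e.g. a log fold change or standardized effect
    std_error: float     # uncertainty of the effect estimate
    quality: float       # QC-derived weight, assumed to lie in [0, 1]
    context: dict = field(default_factory=dict)  # tissue, cell type, assay, ...

    def __post_init__(self):
        # Reject records that later inference steps could not interpret.
        if not 0.0 <= self.quality <= 1.0:
            raise ValueError("quality must lie in [0, 1]")
        if self.std_error <= 0:
            raise ValueError("std_error must be positive")

    def to_json(self) -> str:
        # sort_keys gives a stable serialization, so checksums stay stable too.
        return json.dumps(asdict(self), sort_keys=True)


if __name__ == "__main__":
    ev = EvidenceObject("GSE000000", "H1", 1.2, 0.3, 0.8, {"assay": "bulk RNA-Seq"})
    print(ev.to_json())
```

Freezing the record and serializing with sorted keys is one way to keep an evidence object byte-stable between runs, which is what makes per-file checksums meaningful.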

Inference & ranking

Bayesian updating: priors from structured biology + literature signals updated by evidence likelihoods.
Uncertainty: posterior confidence with credible intervals + explicit supporting vs. conflicting evidence coverage.
Calibration tests: held-out checks to verify confidence behavior and robustness to heterogeneity.
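
The updating step above can be sketched with a conjugate normal model. This is an assumption made for illustration, not a description of the engine itself: the prior on a hypothesis effect is taken to be normal, each dataset contributes an effect size with a standard error, and a quality weight in [0, 1] tempers that dataset's precision. The function names `update_posterior` and `summarize` are hypothetical.

```python
from statistics import NormalDist


def update_posterior(prior_mean, prior_sd, evidence):
    """Conjugate normal update with quality-tempered likelihoods.

    evidence: iterable of (effect_size, std_error, quality) tuples, where
    quality in [0, 1] scales down each dataset's precision contribution.
    """
    precision = 1.0 / prior_sd ** 2
    weighted_sum = prior_mean * precision
    for effect, std_error, quality in evidence:
        w = quality / std_error ** 2   # tempered precision of this dataset
        precision += w
        weighted_sum += w * effect
    return weighted_sum / precision, precision ** -0.5


def summarize(post_mean, post_sd, level=0.95):
    """Posterior confidence that the effect is positive, plus a credible interval."""
    dist = NormalDist(post_mean, post_sd)
    tail = (1.0 - level) / 2.0
    return {
        "confidence": 1.0 - dist.cdf(0.0),   # P(effect > 0 | evidence)
        "ci_low": dist.inv_cdf(tail),
        "ci_high": dist.inv_cdf(1.0 - tail),
    }


if __name__ == "__main__":
    # Two supporting datasets and one low-quality conflicting dataset.
    evidence = [(1.0, 0.5, 1.0), (0.8, 0.4, 0.9), (-0.6, 0.5, 0.2)]
    mean, sd = update_posterior(0.0, 1.0, evidence)
    print(summarize(mean, sd))
```

With a standard normal prior and a single full-quality observation of 1.0 with standard error 0.5, the posterior mean is 0.8: the prior precision is 1, the data precision is 4, and the mean is the precision-weighted average. Evidence with quality 0 leaves the prior unchanged, which is the intended behavior for datasets that fail QC.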

Interpretability & outputs

LLM guardrails: narrative constrained to evidence objects + citations; statements that cannot be linked to an evidence object or cited source are excluded.
Reports: exportable HTML/PDF with ranked hypotheses, confidence bands, evidence links, and next-step experiments.
Traceability: every claim points to evidence artifacts or cited literature sources.
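
The "no evidence → no claim" constraint can be enforced mechanically before any narrative is rendered. The sketch below is hypothetical: it assumes each generated claim is a dictionary carrying an `evidence_ids` list, and that the set of known evidence and citation identifiers is available at report time. Both the claim layout and the function name `filter_claims` are assumptions for illustration.

```python
def filter_claims(claims, known_evidence_ids):
    """Split claims into those fully backed by known evidence and the rest.

    A claim is kept only if it cites at least one ID and every cited ID
    resolves to a known evidence artifact or literature source.
    """
    kept, dropped = [], []
    for claim in claims:
        cited = claim.get("evidence_ids", [])
        if cited and all(eid in known_evidence_ids for eid in cited):
            kept.append(claim)
        else:
            dropped.append(claim)   # logged for audit, never rendered
    return kept, dropped


if __name__ == "__main__":
    known = {"ev-001", "ev-002", "cite-A"}
    claims = [
        {"text": "Pathway X is upregulated.", "evidence_ids": ["ev-001", "cite-A"]},
        {"text": "Marker Y drives the effect.", "evidence_ids": []},
        {"text": "Cell type Z is depleted.", "evidence_ids": ["ev-999"]},
    ]
    kept, dropped = filter_claims(claims, known)
    print(len(kept), "kept;", len(dropped), "dropped")
```

Returning the dropped claims rather than discarding them silently keeps the filter itself auditable.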

Public data sources

Source category	Examples	How used in Phase I
Public omics datasets	GEO / SRA (bulk RNA-Seq, scRNA-Seq), curated public immunology studies	Accession-driven retrieval; evidence extraction (effect sizes, uncertainty, context) and re-run reproducibility tests
Immune profiling	FlowRepository and other public cytometry resources	Harmonization + summary evidence signals; cross-dataset consistency checks for hypothesis ranking
Scientific literature	PubMed / PMC (open abstracts/full text where available)	Literature-derived priors and citation grounding; support vs. conflict signal extraction for traceable explanations
Knowledge resources	Pathway and gene-set resources (e.g., KEGG/Reactome-like sets), marker panels, cell-type references	Prior construction features (pathway relevance, cell specificity); standardized hypothesis templates

Reproducibility controls

Version pinning + checksums + immutable artifacts.

What is pinned and captured

Code + configs: repository commit hash and per-run configuration snapshot.
Dependencies: pinned package versions (Python/R), tool versions, and runtime metadata.
Inputs: accessions, retrieval timestamps, and input-file manifests.
Artifacts: evidence objects, intermediate outputs, and final reports stored immutably per run.
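
The items above could be gathered into a single per-run manifest. The following is a minimal sketch using only the Python standard library; the manifest layout and the function name `build_manifest` are assumptions for illustration, and the commit hash is passed in by the caller (for example, from the output of `git rev-parse HEAD`) rather than discovered by the function.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def build_manifest(commit_hash, config, accessions, packages):
    """Assemble a per-run manifest (hypothetical layout, not a fixed schema)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None   # recorded explicitly rather than skipped
    return {
        "commit": commit_hash,
        "config": config,                        # per-run configuration snapshot
        "accessions": sorted(accessions),        # sorted for stable output
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    manifest = build_manifest(
        commit_hash="<commit hash>",
        config={"min_quality": 0.5},
        accessions=["GSE000000"],        # placeholder accession
        packages=["numpy", "pandas"],    # example package names
    )
    print(json.dumps(manifest, indent=2, sort_keys=True))
```

The retrieval timestamp changes on every run, so a determinism check would compare the remaining fields and treat the timestamp as metadata.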

How outputs are verified

Checksums: per-file SHA-256 checksums for key artifacts (evidence objects, figures, reports).
Determinism tests: same inputs/configs should reproduce the same artifacts (within defined tolerances).
Run logs: structured logs capturing steps, warnings, and evidence coverage (supporting/conflicting/missing).
Audit trail: ranked hypotheses link back to evidence objects and cited sources.
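
The checksum and determinism checks above can be sketched as follows. This is an illustrative example, not the project's verification code: the helper names and the tolerance defaults are assumptions, and the comparison assumes ranking outputs are available as a mapping from hypothesis ID to score.

```python
import hashlib
import math
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(run_dir):
    """Map each file under run_dir (by POSIX-style relative path) to its checksum."""
    root = Path(run_dir)
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def scores_match(a, b, rel_tol=1e-9, abs_tol=1e-12):
    """Compare two {hypothesis_id: score} dicts within numeric tolerances."""
    return a.keys() == b.keys() and all(
        math.isclose(a[k], b[k], rel_tol=rel_tol, abs_tol=abs_tol) for k in a
    )


if __name__ == "__main__":
    # Hypothetical usage: compare two runs produced from identical inputs.
    first = {"H1": 0.912345678901, "H2": 0.401}
    second = {"H1": 0.912345678901, "H2": 0.401}
    print("rankings reproduce:", scores_match(first, second))
```

Exact checksum equality suits serialized artifacts such as evidence objects, while the tolerance-based comparison suits floating-point scores, which can differ in the last digits across platforms.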

Reviewer-facing deliverable: Each Phase I pilot run produces a compact, verifiable package: (1) run manifest, (2) pinned versions + config, (3) evidence object bundle, (4) hypothesis ranking outputs with uncertainty, and (5) HTML/PDF report with traceable links.

Availability

All resources required for Phase I feasibility (compute, software stack, and public data access) are available to execute the proposed work. The system is designed to support scalable Phase II deployments (secure VPC/on-prem) without changing the core reproducibility guarantees.