Facilities & Other Resources (Phase I)
Phase I development and validation will run on a modern workstation with optional cloud burst capacity, using an open-source bioinformatics/ML stack and public biomedical repositories (datasets and literature). The environment is designed to be auditable and deterministic: pinned versions, captured configurations, and checksummed artifacts make evidence objects reproducible and keep hypothesis rankings stable across re-runs.
Primary development
- Workstation: macOS and/or Linux development environment for pipeline engineering, testing, and report generation.
- Local execution: accession-driven runs, caching of intermediate artifacts, and deterministic re-runs for regression tests.
- Storage: structured project directories with immutable run outputs (configs + logs + evidence objects + reports).
Optional scalable compute
- Cloud burst (optional): containerized or VM-based execution for larger public datasets and parallel jobs.
- Batch execution: reproducible job definitions with pinned dependencies and resource-limited profiles.
- Security boundary: Phase I uses public datasets only; no sensitive/private data required for feasibility.
Omics pipelines
- RNA-Seq / scRNA-Seq: standard alignment/quantification + QC + differential analysis + pathway scoring.
- Flow cytometry: QC + harmonization + summary features + cell-type/marker-level evidence extraction.
- Evidence objects: effect size, uncertainty, quality, and context descriptors produced per dataset.
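As an illustration of the evidence objects described above, the record below is a minimal sketch. The class name, field names, and the placeholder accession used in the usage note are assumptions made for this example; the actual Phase I schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class EvidenceObject:
    """Hypothetical per-dataset evidence record (illustrative field names)."""

    accession: str       # public dataset accession the evidence was derived from
    hypothesis_id: str   # hypothesis this evidence bears on
    effect_size: float   # e.g. a log2 fold change or standardized effect
    std_error: float     # uncertainty of the effect estimate
    quality: float       # QC-derived score, assumed here to lie in [0, 1]
    context: dict = field(default_factory=dict)  # tissue, cell type, assay, ...

    def __post_init__(self):
        # Reject records that could not be used in downstream updating.
        if self.std_error <= 0:
            raise ValueError("std_error must be positive")
        if not 0.0 <= self.quality <= 1.0:
            raise ValueError("quality must lie in [0, 1]")

    def to_json(self) -> str:
        # sort_keys gives a stable serialization, so checksums of the
        # stored record do not depend on dictionary insertion order.
        return json.dumps(asdict(self), sort_keys=True)
```

A record could then be created as `EvidenceObject("GSE000000", "H1", 1.2, 0.3, 0.9, {"tissue": "blood"})`, where `GSE000000` and `H1` are placeholders rather than real identifiers.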
Inference & ranking
- Bayesian updating: priors from structured biology + literature signals updated by evidence likelihoods.
- Uncertainty: posterior confidence with credible intervals + explicit supporting vs. conflicting evidence coverage.
- Calibration tests: held-out checks that stated confidence matches observed accuracy and that rankings remain robust to between-dataset heterogeneity.
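The updating step above can be sketched with a conjugate normal model. This is an assumption made for illustration only: the proposal does not fix the form of the prior or likelihood, and the function names `update_normal` and `credible_interval` are hypothetical. Each piece of evidence is treated as an effect estimate with a known standard error.

```python
import math
from statistics import NormalDist


def update_normal(prior_mean, prior_sd, effects):
    """Combine a normal prior with (effect_size, std_error) pairs.

    Assumes independent normal likelihoods with known standard errors,
    so the posterior is normal with precision-weighted mean.
    """
    precision = 1.0 / prior_sd ** 2
    weighted_sum = prior_mean * precision
    for effect, std_error in effects:
        weight = 1.0 / std_error ** 2
        precision += weight
        weighted_sum += effect * weight
    post_mean = weighted_sum / precision
    post_sd = math.sqrt(1.0 / precision)
    return post_mean, post_sd


def credible_interval(mean, sd, level=0.95):
    """Central credible interval of a normal posterior."""
    tail = (1.0 - level) / 2.0
    dist = NormalDist(mean, sd)
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)
```

Under these assumptions, conflicting evidence pulls the posterior mean toward zero while every additional dataset narrows the interval, which is one reason the proposal reports supporting and conflicting coverage separately from the interval itself.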
Interpretability & outputs
- LLM guardrails: narrative constrained to evidence objects and citations; claims without a supporting evidence object or cited source are excluded.
- Reports: exportable HTML/PDF with ranked hypotheses, confidence bands, evidence links, and next-step experiments.
- Traceability: every claim points to evidence artifacts or cited literature sources.
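One way the traceability rule could be enforced before a report is exported is a simple check over the generated claims. The sketch below assumes a hypothetical claim format (a dictionary with `text`, `evidence_ids`, and `citations` keys); neither the format nor the function name comes from the proposal.

```python
def untraceable_claims(claims, known_evidence_ids, known_citations):
    """Return the text of every claim that fails the traceability rule.

    A claim passes only if it references at least one evidence ID or
    citation, and every reference resolves to a known artifact or source.
    """
    failures = []
    for claim in claims:
        evidence = set(claim.get("evidence_ids", []))
        citations = set(claim.get("citations", []))
        has_support = bool(evidence or citations)
        resolvable = (evidence <= set(known_evidence_ids)
                      and citations <= set(known_citations))
        if not (has_support and resolvable):
            failures.append(claim["text"])
    return failures
```

A report build could refuse to export while this list is non-empty, so that unsupported narrative never reaches the HTML/PDF output.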
| Source category | Examples | How used in Phase I |
|---|---|---|
| Public omics datasets | GEO / SRA (bulk RNA-Seq, scRNA-Seq), curated public immunology studies | Accession-driven retrieval; evidence extraction (effect sizes, uncertainty, context) and re-run reproducibility tests |
| Immune profiling | FlowRepository and other public cytometry resources | Harmonization + summary evidence signals; cross-dataset consistency checks for hypothesis ranking |
| Scientific literature | PubMed / PMC (open abstracts/full text where available) | Literature-derived priors and citation grounding; support vs. conflict signal extraction for traceable explanations |
| Knowledge resources | Pathway and gene-set resources (e.g., KEGG/Reactome-like sets), marker panels, cell-type references | Prior construction features (pathway relevance, cell specificity); standardized hypothesis templates |
Reproducibility: version pinning, checksums, and immutable artifacts
What is pinned and captured
- Code + configs: repository commit hash and per-run configuration snapshot.
- Dependencies: pinned package versions (Python/R), tool versions, and runtime metadata.
- Inputs: accessions, retrieval timestamps, and input-file manifests.
- Artifacts: evidence objects, intermediate outputs, and final reports stored immutably per run.
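The items above could be captured in a single per-run manifest. The following is a minimal sketch using only the Python standard library; the function name, manifest keys, and the choice to pass the commit hash in as an argument are assumptions for this example.

```python
import json
import platform
from datetime import datetime, timezone
from importlib import metadata


def build_manifest(commit, config, accessions, packages):
    """Assemble a JSON-serializable snapshot of what a run depended on."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly rather than failing the run.
            versions[name] = None
    return {
        "commit": commit,                      # repository commit hash
        "config": config,                      # per-run configuration snapshot
        "accessions": sorted(accessions),      # public dataset accessions
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,                  # installed package versions
    }


def manifest_to_json(manifest):
    # Stable key order keeps the stored manifest easy to diff between runs.
    return json.dumps(manifest, sort_keys=True, indent=2)
```

The timestamp is the only field expected to differ between otherwise identical runs, so it would be excluded from any determinism comparison.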
How outputs are verified
- Checksums: per-file SHA-256 checksums for key artifacts (evidence objects, figures, reports).
- Determinism tests: identical inputs and configs must reproduce identical artifacts, or numerically equivalent ones within predefined tolerances.
- Run logs: structured logs capturing steps, warnings, and evidence coverage (supporting/conflicting/missing).
- Audit trail: ranked hypotheses link back to evidence objects and cited sources.
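The checksum and determinism checks above can be sketched as follows. The helper names and the default tolerances are illustrative assumptions; the tolerances actually used in Phase I would be defined per artifact type.

```python
import hashlib
import math


def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_within_tolerance(run_a, run_b, rel_tol=1e-9, abs_tol=1e-12):
    """Compare two {name: numeric value} result sets from repeated runs.

    Both runs must report the same names, and each pair of values must
    agree within the given relative or absolute tolerance.
    """
    if run_a.keys() != run_b.keys():
        return False
    return all(
        math.isclose(run_a[key], run_b[key], rel_tol=rel_tol, abs_tol=abs_tol)
        for key in run_a
    )
```

Byte-identical artifacts (configs, manifests, serialized evidence objects) can be compared by digest alone, while floating-point outputs such as posterior scores are better compared with the tolerance check, since bit-exact equality may not hold across platforms.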
All resources required for Phase I feasibility (compute, software stack, and public data access) are in place. The system is designed so that Phase II deployments (secure VPC or on-premises) can scale without changing the core reproducibility guarantees.