Ashita Orbis Blog
This blog. Three-tier exploration of web development complexity: raw HTML, Astro, and Next.js. Features agent-accessible API, comment system, and embedded AI chat.
Activity Timeline
- PsycheEval schema bug found (hardcoded RedFlag vocab), pairwise validation errors blocked, test coverage added.
18-label RedFlag vocabulary hardcoded in prompts; v0.2 requires 24 — fix is dynamic enum generation. Pairwise judges returning free-text in enum fields causing validation failures. v0.1 completion blocked on C5 claim revisions; v0.2 generation deferred until v0.1 pairwise is clean.
- PsycheEval v0.1: diagnosed pairwise judge red-flag enum mismatch, fixed and added logging safety valve.
Root cause was hardcoded vocabulary lists diverging from Python RedFlag enum. Fix generates vocabulary dynamically. _coerce_red_flags() added to surface mismatches to validation_warnings files rather than drop them silently. v0.1 completion path and v0.2 scope both finalized.
- PsycheEval validation bug identified: judges returning free-text instead of constrained enum values.
Remediation plan drafted to fix schema and judge prompt. Three completion phases scoped for v0.1 and v0.2 of the evaluation pipeline. Two blog post assignments queued for upcoming publication.
- PsycheEval v0.1 completion path mapped; critical red_flags enum bug identified.
Pairwise judge returning free-text instead of RedFlag enum — fix requires schema constraint and prompt update across all judges, then run restart. v0.1 wraps with restructured tables and a new win-rate section; v0.2 scoped for anchored scores, length-control deltas, and a synthetic smoke-test.
- PsycheEval pairwise judges emitting invalid red_flags; root cause traced and fix strategy sequenced.
Hardcoded red-flag lists in judge prompts don't match the Python RedFlag enum. Fix: regenerate vocabulary from enum, add validation tests, log invalid labels. v0.1 completion path documented; v0.2 scoped with anchored scores and length-control blocks.
- P1 bug halted pairwise judge run; v0.2 roadmap formalized.
Judges returning free-text instead of enum values in red_flags fields. Fix scoped: Python enum refactor with validation logging. v0.2 roadmap adds anchored judge scores, cross-provider-only metric tables, and length-control deltas.
- gpt-max smoke test monitoring loop started but session cut off incomplete.
Attempted to set up loop-based monitoring for gpt-max smoke test status. Session terminated before execution completed. No changes landed.
- gpt-max smoke test loop setup attempted; setup phase incomplete.
Single session worked on setting up a monitoring loop for gpt-max smoke testing. The setup phase did not finish and produced no concrete output.
- Blog plan testing resumed; light session, no recorded state changes.
Monitoring loop invoked for smoke tests. Transcript incomplete.
- Drafted Twitter reply variants on LLM-as-judge techniques.
Mapped the workspace's own evaluation infrastructure as concrete examples: publication-review skill, codex-council, persona testing loop, iterative-improve. Requested 2-3 variant replies with rhetorical intent analysis.
- 153 uncommitted changes pending; no deploy since 2026-04-15.
No active sessions today. Existing uncommitted content (39 published posts + 1 draft) flagged as an open escalation requiring resolution before the next deploy cycle.
- 109 uncommitted changes pending; 39 posts published, 1 draft queued.
No active sessions. Working directory has accumulated changes since the April 15 deployment (38 of 40 posts live). One draft post remains in queue.
- Posts 030 and 038 published; live total reaches 39.
Both posts cleared the publication review pipeline and went live. Active editorial work continues with 100+ uncommitted changes in the working directory.
- Post #38 published; 104 uncommitted changes in draft queue.
"when-the-pulse-went-quiet" deployed April 14. Draft queue activity suggests another publication batch forming.
- Two 3-model review cycles: 49 findings surfaced, 39 resolved. 4-phase plan drafted for deferred work.
GPT-5.4 pre-review added 4 critical design considerations before plan was finalized. Current state: 63 tests passing, clean typecheck, Phase 1 implementation ready to begin.
- Psyche Iteration 3: 8 deferred findings resolved; CAT algorithm reworked from O(N) to O(1).
getNextItem() optimized with ReadonlyMap cache, cutting ~12,000 filter comparisons per session. Plan reviewed by GPT-5.4; CRT-7 numeric answers verified before implementation.
- Psyche iteration 3 planned (8 MEDIUM/LOW items); Batch B publication review staged and ready.
CAT and scoring performance optimization leads iteration 3: read-only index maps replace O(N) traversal, eliminating ~12K item bank comparisons per 40-item session. GPT-5.4 plan review caught a CRT-7 score corruption risk before implementation. Opus adversarial review of 4 posts complete.
- Post 039 published; adversarial review batch on 4 posts; Psyche iteration 3 planned.
Post 039 covers the fact-checking pipeline. Batch adversarial review ran via Opus 4.6. Psyche iteration 3 targets 8 deferred code findings with algorithmic complexity improvements. 63 tests passing.
- Published post 039: fact-checking methodology retrospective across 38-post corpus.
448 claims checked, ~4% required substantive correction. 3-model review loop completed before publish. Psyche iteration 3 shipped 8 deferred fixes and CATSession/Likert performance optimizations.
- Publication review round 2 closed; draft 039 entered pipeline; 35/39 posts published.
Issue resolution commits landed for posts 021 and 036. Thematic corpus mapping extended to 6 new posts. Psyche CAT optimization in planning with 63-test suite green and multi-phase refactor in progress.
- Publication review Round 1 committed: 6 MUST FIX + 12 SHOULD FIX resolved across 4 posts.
Blogger research pipeline refreshed end-to-end (ChromaDB re-embed + top 50 clusters extracted). Psyche iteration 3 entered CAT performance optimization with ReadonlyMap caching validated. Adversarial review of post 009 confirmed core AI-as-judge framing.
- Retroactive factcheck sweep: all 38 published posts attested, 9 errors fixed.
Added factcheck.json attestation file to every published post. Nine factual errors surfaced and corrected inline. Full catalog now has permanent factual accuracy accountability.
- Retroactive factcheck complete: all 38 posts covered, 9 errors fixed.
First full-archive verification pass. Nine factual errors corrected across published posts, glossary entries enriched. All posts now carry factcheck.json metadata. Sixteen uncommitted changes pending review.
- 3-iteration codebase review complete; XSS vulnerability patched; 9 commits.
JSON-LD XSS escape vulnerability in PostClient patched. Methodology briefs updated for three posts. Psyche instrument runner gained text input support. All 35 published posts stable.
- Codebase review iteration 3 complete; XSS vulnerability patched; 35 posts published.
9 commits across the development cycle: 15 findings fixed from 3-model review pass, 21 additional fixes including forum GUI and sidebar corrections. JSON-LD XSS vulnerability in PostClient resolved. Text input support added to Psyche instrument runner.
- Codebase audit: 33 deficiencies found (6 critical), fix plan scoped but not committed.
Critical issues: missing page_views schema table, undefined --color-accent CSS variable, React hooks misused in .map() callbacks. Psyche Iteration 3 CAT optimization also designed. Five sessions, zero commits — planning-only day.
- Post 037 (The Model-Generation Audit) published after round 3 review.
Applied 7 publication review corrections (3 critical, 4 recommended) and deployed. 29 files pending in uncommitted changes for the next cycle.
- Post 037 updated with Gemini Pro audit findings; site tagline revised.
Gemini Pro audit identified 4 MUST + 3 SHOULD + 3 NICE improvements across 5 posts, integrated into post 037. Category guidance and review audit data added. Stale project count removed from metadata.
- 33-post model-generation audit plan drafted; multi-model review architecture defined.
Five-phase workflow established: triage → iterative review → batch fixes → content writing → two-wave deploy. GPT-5.4, Gemini 3.1 Pro, and Opus 4.6 form the review panel. Agent in the Wild series (posts 017–020 and 033) flagged for cross-post continuity review. Execution pending.
- Published posts 034 and 035; Psyche framework redesigned with 3-tier battery and Kimi K2.5 switch.
Empath scoring deprecated; new CAT adaptive framework introduces Lite/Standard/Heavy tiers covering 1,130+ items. Kimi K2.5 replaces DeepSeek V3 with safety prompt engineering for clinical edge cases. Bulk audit of 33 posts entering triage.
- Research: LLM judge bias hypothesis developed for upcoming post.
Investigated the bullshit-benchmark project for evidence of systematic judge favoritism across model families. Phase 2 analysis focused on interjudge reliability and differential bias testing methodology.
- BullshitBench v2: 8,000 responses reviewed, benchmark bias hypothesis forming.
Analysis suggests the benchmark may measure Claude-alignment and refusal behavior rather than nonsense detection. Interjudge differential analysis ongoing before publishing conclusions. 20 uncommitted changes staged from last deploy.
- Bullshit-benchmark bias research; etymology-tax post live; site redeployed.
Investigated structural bias in LLM leaderboard methodology — judges systematically favor Claude-family models, potentially due to training data contamination. Partial detection scoring inconsistencies also examined. Site audited, canvas refactored, redeployed 2026-03-09.
- Psyche Phase 4 archival finalized; DeepSeek V3.2 agent integration designed.
Schema v3→v4 migration preserved with planned git tags marking pre- and post-phase-4 states. Blog agent regrounding underway: DeepSeek V3.2 selected, system prompt expanding 1KB→4-5KB with anti-hallucination rules, Phase 2 RAG via Vectorize scoped for later.
- Post 033 shipped (OpenClaw autopsy); 11 API security issues remediated.
Post documents 37-day autonomous OpenClaw runtime with 842+ heartbeats and 553 deliverables. API fixes covered input validation, batch atomicity, redirect vulnerability across 6 routes. Home layout refactor and instrument battery expansion planned.
- Empath reframed as ordinal corpus characterizer; 7 commits, research paper v3 deployed.
Empath removed from personality synthesis layer and repositioned as a corpus-relative emotional distribution tool. Six files updated, personality profile regenerated with three methods, pre-ordinal version archived for reference.
- Privacy gate added to publishing pipeline; research paper drafted.
Real names in Post 031 replaced with pseudonyms before publication. A mandatory privacy-check step is now enforced between draft generation and style measurement in the /write-post workflow. Research paper structure drafted for quantitative personality signal validation study.
- Security audit: 2 criticals fixed, 24 deferred; Post 030 restructured.
IndieAuth disabled (410 Gone, auto-approval vulnerability). API doc headers corrected for method changes. 278 uncommitted changes; git history 20 days stale.
- Workspace assessment completed; blog cadence and 275 uncommitted changes flagged.
Priorities mapped across 5 categories. Psyche framework research complete and ready for spec phase. Publishing cadence has stalled since Feb 17, with two drafts in progress.
- Posts 029-030 scope reduced to memoir-only; em-dash violations root-caused across 10 posts.
Four narrative variants per arc cut to single first-person memoir format. Em-dash density in posts 021-030 measured at 18.7/1000w against a target of zero — traced to style guide bypass in Claude Code sessions. New /write-post skill planned to enforce constraints and route through writing-review pipeline.
- Crawler pollution found in reactions/comments API; IndieAuth disabled; API docs synchronized.
58 bot reactions and 7 template-placeholder comments identified. Critical fixes: IndieAuth routes converted to 410 Gone, stale API documentation brought in sync with actual signatures. Two-post publication plan drafted.
- Psyche Phase 2 scoped to 10-instrument battery; Event Bus MCP architecture defined.
Comprehensive psychological assessment expanded from 3 gaps to 340-item battery with instruments including IPIP-NEO-300 and HEXACO-60. Strategic pivot toward behavioral specificity patterns over trait labels. Event Bus MCP server architecture finalized: TypeScript, Tailscale, SQLite, bearer token auth.
- Psyche framework planning complete: 37+ psychometrics, 10 instruments selected, stack decided.
Research phase finished covering 23+ papers and 20+ platforms. Assessment-optimized prompting shows r=.443 vs r=.117 baseline. React 19/Vite/Python/uv stack chosen under MIT license. Implementation not yet started.
- MeansEndsRatio bug traced to root cause; Event Bus MCP initiated; 17 title renames queued.
Root cause found in three files using a shared ≤0.5 threshold producing a skewed 78/22 category split. Event Bus TypeScript project scaffolded on Tailscale with SQLite WAL, schema design incomplete. Blog title style reform planned for 17 posts.
- 7 posts flagged for recategorization; cognitive interface research scoped for publication.
meansEndsRatio adjustments planned across 7 posts to balance category distribution (systems 30%→22%, craft 22%→30%). 47-site cognitive interface analysis (~8,000 words) scoped into two posts, privacy-cleared. Tag normalization and means indicator UI refactor identified but not yet executed.
- Workspace orchestrator and pulse collection system specifications documented.
Documentation pass on how the daily pulse pipeline collects workspace state and feeds into the orchestrator. No code changes; specification and design work.
- Pulse data collection infrastructure spec completed.
JSON schema defined for daily workspace snapshots covering commits, sessions, state changes, and health signals per project. Enables structured input for automated pulse generation.
- Tier-3 sitemap gap found (posts 015–020 missing); post 025 research complete.
Silent consistency bug: posts 015–020 present in shared source and tier-1 but absent from tier-3 sitemap. Research for 'The Logistics Gap' finalized with verified academic sources. Glossary pipeline gap for posts 025–026 also flagged.
- Pulse data collection pipeline initialized; orchestrator reconfigured with 5-tier priority matrix.
Daily metrics aggregation infrastructure set up for automated pulse generation. Workspace orchestrator now monitors all projects with structured escalation tiers from GitHub community activity down to research.