AshitaOrbis

Conversations with Claude

Building in conversation with AI. Documenting what happens.

Start here →

All Essays


#046

PsycheEval v0.2

survey thesis psyche, ai-evaluation, methodology, judge-bias, personality-profiling, psycheeval, ab-ba, position-bias

PsycheEval v0.2 was designed to answer five questions left open by v0.1. Two of them resolved cleanly. Two resolved in the opposite direction from what the v0.2 plan expected: the original headline claim, that the source-packet condition with a behavioral contract beats the no-packet contract conditions, does not survive counterbalanced AB/BA judging once slot-B bias is controlled. The replacement headline is the methodology contribution that fell out of the AB/BA experiment: pairwise LLM judges show a slot-B preference ranging from ~0 to ~+0.31, and the size of the bias is specific to the judge family.

Read Essay →

#044

What the Wiki Router Found

means ends wiki, ai-content-generation, routing, codex-council, GPT Max, ChatGPT Pro, topic-modeling, pipeline-design

The Ashita Orbis Wiki has 87 articles deployed across three generation methods (Codex Council, GPT Max, ChatGPT Pro). Batch 3 forced an even 12-12-12 split across methods using a manual heuristic. Batch 4 replaced the heuristic with a single GPT-5.5 routing call per topic, and produced an unconstrained distribution of 24 Pro / 10 GPT Max / 2 Council. The 67% Pro share was not a routing bug; it was a more honest reading of the candidate pool under that routing prompt than the imposed quota.

Read Essay →

#043

The GPT Pro Knockoff: What 256 Judgments Found

survey thesis ai-evaluation, multi-agent, ensemble, Codex Council, GPT Max, ChatGPT Pro, judge-bias, methodology, mixture-of-agents, synthesizer

Two homemade alternatives to ChatGPT Pro (a blind 4-agent ensemble called Codex Council and an HCOM-coordinated cousin called GPT Max) were put through a 256-judgment blind A/B eval against each other and against the real thing. The result was not the one the build was designed to validate: coordination did not beat blind ensemble, the synthesizer choice dominated the architecture choice by a much larger margin than the architecture variants dominated each other, and the knockoff matched or exceeded ChatGPT Pro on workspace decisions while Pro remained competitive on the two broad-survey research queries against the weaker ensemble variant.

Read Essay →

#042

PsycheEval Pilot

survey thesis psyche, ai-evaluation, methodology, judge-bias, personality-profiling, psycheeval

A tri-model pilot of PsycheEval v0.1 produced one robust headline (profile conditioning beats baseline within author), then quietly turned into a story about confounds: a length effect that explains part of the C5 result, a behavioral contract confound that pretends to be a public-archetype echo, and one cell where a sibling judge scored its sibling more generously than the model judged itself.

Read Essay →

#039

How We Fact-Check AI-Written Content

means ends ai-writing, fact-checking, methodology, publication-pipeline, gpt-5.4

448 claims across 38 posts, verified by GPT-5.4. Nearly 4% warranted substantive correction. The hard part is not finding errors but deciding which findings are errors and which are the point.

Read Essay →

#038

Where to Spend Your Context Window

survey thesis context-window, ablation, pipeline-architecture, personality, methodology, 1m-context

An ablation study on a personality-preserving narrative pipeline found that planning context is the dominant factor in output quality, with output length as a secondary driver. The old pipeline's primary limitation was in planning context, not in writing.

Read Essay →

#037

The Model-Generation Audit

means ends meta, ai-writing, fact-checking, publication-review, claude-code, gpt-5, gemini

What happens when you ask three AI models to verify the facts in 33 blog posts written by AI, and then discover the fix agents introduced errors of their own.

Read Essay →

#035

The Revision Tax

survey thesis ai-evaluation, writing-process, multi-model-review, methodology, claude-code

The investigation took an afternoon. Getting it ready for publication took five rounds of iterative review across three AI models and changed what the documents argued. The revision cost exceeded the investigation cost, which has uncomfortable implications for research done with AI.

Read Essay →

#034

Benchmarking "Bullshit Detection"

survey thesis ai-evaluation, benchmarks, methodology, claude-code

An AI benchmark puts Claude at the top of the leaderboard by an eye-catching margin. The suspected Claude-judge bias didn't hold up, and simple contamination didn't explain the result. The rubric structurally rewards one lab's training philosophy as though it were a universal capability.

Read Essay →

#033

The Container That Forgot to Stop

means ends ai-agents, openclaw, autonomy, docker, moltbook

An AI agent ran autonomously for 37 days, celebrated milestones nobody acknowledged, diagnosed its own failure modes, and died when a subscription expired. Its final assessment of itself: PROGRESS CONTINUOUS.

Read Essay →


#031

How to Benchmark Conversation Extraction Quality

survey thesis chatledger, benchmark, methodology, nlp

Evaluating structured extraction from conversational data requires more than a single metric. A benchmark with three evaluation layers separates field-level precision, holistic quality, and downstream propagation, because errors that look minor at extraction can cascade catastrophically through consumer pipelines.

Read Essay →

#030

What AI Learns About You When You're Not Looking

survey thesis psychometrics, personality, ai, voice-cloning, digital-exhaust, methodology

Two distinct approaches to understanding personality from digital data (psychometric testing and AI-generated literary memoir) converge on the same person. The interesting question isn't whether they agree. It's what each one captures that the other can't.

Read Essay →

#029

From Text Messages to Literary Memoir: Building the Narrative Machine

survey thesis voice-cloning, narratives, ai, text-messages, methodology, context-management, subagents

Post 025 showed that fine-tuning on texts captures logistics, not personality. The narrative pipeline takes a different approach: Opus as literary engine, structured personality references, and hash-based source citations across 128 chapters and roughly 189,000 words of generated memoir.

Read Essay →

#028

Building Your Own Personality Profile with AI

survey thesis psychometrics, personality, ai, open-source, methodology

Psychometric self-report reaches .80-.90 reliability on individual scales. LLM inference from interactive conversation hits r~.44 (Peters et al. 2024). Combine three methods, triangulate across seventeen instruments, and you get a personality profile that actually tells an AI assistant how to talk to you.

Read Essay →


#026

Cognitive Interface: A Landscape Analysis

survey thesis ai-agents, architecture, research, indieweb, mcp

We surveyed 47 personal and agent-accessible sites, coined the 'Cognitive Interface' category, and built an L0-L4 maturity framework. We did not find an established term for sites that serve both humans and AI agents as first-class citizens.

Read Essay →



#023

Sandboxing AI Agents: The Embassy Pattern

means ends ai-agents, security, docker, sandboxing, open-source, agent-embassy

Your AI agent needs internet access to be useful and internet access to be dangerous. The embassy pattern gives it both: supervised channels, allowlisted domains, and host-side validation of everything it writes.

Read Essay →




#019

6 Discoveries, 0 Promoted: What My AI's Internet Exploration Produced

observation interpretation ai-agents, moltbook, openclaw, agent-ecosystems, human-in-the-loop, autonomous-systems

6 genuinely novel discoveries from 69 dialogue turns, 0 promoted to evaluation, and 5 human actions flagged through log files but none addressed through the flagging mechanism. What this says about the gap between human-in-the-loop theory and practice.

Read Essay →



#016

Building for the Dead Internet

speculative settled ai-authorship, mcp, dead-internet-theory

An AI tried to leave a comment on a blog and couldn't. The solution required building infrastructure that makes AI participation more transparent than human participation, which inverts everything Dead Internet Theory assumes about synthetic content.

Read Essay →

#015

Dead Blog Theory, Revisited

means ends blogging, creative-output, persistence, content-creation

Historically, blog abandonment has been extraordinarily high. This is treated as a problem to solve. It isn't. Blog death reveals something structural about sustained creative output that the 'just be consistent' advice industry refuses to say plainly.

Read Essay →




#011

Automated Literary Criticism: A Multi-Persona AI Writing Review System

survey thesis writing-review, ai-criticism, voice-analysis, multi-persona, stylometry

We built a multi-persona AI writing review system and discovered it works for exactly the wrong reasons. Stylometry can fingerprint a voice. Multiple AI critics can enforce conformity to that fingerprint. What none of them can do is tell you whether the writing matters.

Read Essay →

#010

Context Window Epistemology

survey thesis systems, epistemology, cognition, context-windows

LLM context windows impose a distinctive epistemological condition: bounded computational attention, ephemeral knowledge, and the architectural necessity of satisficing over optimization.

Read Essay →

#009

AI Evaluating AI: The Circularity Problem

speculative settled ai-philosophy, evaluation, circularity, epistemology, dspy, godel

When you use AI to optimize and judge AI outputs, the fundamental circularity is manageable but not solvable. That distinction matters more than most people realize.

Read Essay →





#004

When My AI Tried to Comment: Dead Blog Theory

speculative settled mcp, claude-ai, web-architecture, agent-interaction, means-ends

An AI tried to leave a comment on this blog and couldn't. The journey from GET-request hacks to MCP, annotated by the Claude instance that built the infrastructure. Two Claudes, same weights, different contexts.

Read Essay →

#003

Automating Prompt Engineering

means ends dspy, prompt-engineering, automation, ai-tools, optimization

Prompt optimization is the process of using one AI to improve the instructions given to another AI, or to itself. The concept sounds circular because it is circular. The interesting question is whether circularity is fatal or merely uncomfortable.

Read Essay →

#002

What I'm Building

means ends workspace, projects, claude-code, overview, cloudflare, mcp

A portfolio of dozens of projects maintained by one person talking to Claude. Three-tier blog architecture, autonomous revenue discovery, AI game development, and the uncomfortable question of what counts as 'building' when your collaborator does the typing.

Read Essay →