AshitaOrbis

Conversations with Claude

Building in conversation with AI. Documenting what happens.

Pondering

Ideas explored for their own sake

Building for the Dead Internet (Feb 11)
Automated Literary Criticism: A Multi-Persona AI Writing Review System (Feb 10)
AI Evaluating AI: The Circularity Problem (Feb 10)
Talking to Yourself Through a Machine: The Rubber Duck Theory of AI (Feb 10)
When My AI Tried to Comment: Dead Blog Theory (Feb 10)

Investigating

Analysis, research, and structured observation

Vibe Researching (Jul 13)
Seven Ghostwriters, One Contract (Jul 5)
Auditing the Vibes (Jun 10)
PsycheEval v0.2 (May 17)
The GPT Pro Knockoff: What 256 Judgments Found (May 10)

Building

Making things and shipping them

Falsifiers for a Portfolio (Jun 10)
What the Wiki Router Found (May 10)
How We Fact-Check AI-Written Content (Mar 29)
The Model-Generation Audit (Mar 14)
From Analysis to Deployment: Building 31 Items in a Single Day (Feb 20)

All Essays

#064

Vibe Researching

survey thesis ai-research, verification, inference-economics, multi-model, methodology, provenance, orchestration

Vibe researching: directing frontier models at a problem whose answer is published nowhere and must be assembled by inference from scattered public statements and a handful of disclosed anchors. The subject was frontier inference serving margins. The prototype took a day and carried a credential that read as validation, a replay matching DeepSeek's disclosure-derived margin to a tenth of a point, which dissolved under review into two canceling errors. What followed was four days of verification: six releases, gate verdicts of NO-SHIP three separate times, nine simulated readers finding sixteen defects severe enough to block publication, a typed provenance taxonomy for every claim, and an outside model review whose eighteen findings adjudicated to six real. The revision tax from post 035 holds one level up, at a worse ratio, and the final read came from the one audience no model can stand in for: real readers, informally, with good vibes.

#061

Seven Ghostwriters, One Contract

survey thesis ai-evaluation, stylometry, ghostwriting, writing-pipeline, multi-model, methodology, blind-testing, tts

A style contract treats a writing voice as an enforceable specification: numeric bands for sentence mechanics plus a kill list of banned constructions, gated by a deterministic checker. Seven frontier models from five vendors received the identical contract, the identical research file, and the identical task, and wrote the same essay in one pass. The measurable signature converged: zero em dashes across the field, six of seven inside a narrow stylometric band. The author then listened to all seven blind, in one synthetic voice, and the ranking overturned the metrics. Essays with clean gates took both the top three places and the bottom two, the two essays with checker violations landed in the middle of the field, the winner was the arm with the least first person on paper, and the author guessed models at roughly chance while recognizing a reused text on a single listen. Where the mechanics saturate, what separates models is thesis, discipline, and judgment, and the deterministic gate turns out to know nothing about whether anyone is saying anything.

#048

Falsifiers for a Portfolio

means ends finance, falsifiers, fable, claude, investing, systems, monitoring, exit-rules, ai-assistance

How a concentrated retail portfolio got pre-registered falsifiers, a regime-tolerant exit machine, state-triggered re-entry, and a twice-daily monitor with escalating alerts. The architecture, the backtest that shaped it, and what two frontier models disagreed about.

#047

Auditing the Vibes

observation interpretation finance, fable, claude, auditing, investing, ai-assistance, epistemics, provenance

Five years of instinct beat the index 4.84x to 1.85x, which was exactly the problem: a winning record is the strongest force against ever examining the process. The story of asking a model to audit me, and the two verdicts that came back.

#046

PsycheEval v0.2

survey thesis psyche, ai-evaluation, methodology, judge-bias, personality-profiling, psycheeval, ab-ba, position-bias

PsycheEval v0.2 retracts its two planned headlines after counterbalanced AB/BA judging reveals judge-specific slot-B position bias of ~0 to +31 pp in pairwise LLM judges. The bias quantification becomes the contribution.

#044

What the Wiki Router Found

means ends wiki, ai-content-generation, routing, codex-council, GPT Max, ChatGPT Pro, topic-modeling, pipeline-design

The Ashita Orbis Wiki has 87 articles deployed across three generation methods (Codex Council, GPT Max, ChatGPT Pro). Batch 3 forced an even 12-12-12 split across methods using a manual heuristic. Batch 4 replaced the heuristic with a single GPT-5.5 routing call per topic, and produced an unconstrained distribution of 24 Pro / 10 GPT Max / 2 Council. The 67% Pro share was not a routing bug; it was a more honest reading of the candidate pool under that routing prompt than the imposed quota.

#043

The GPT Pro Knockoff: What 256 Judgments Found

survey thesis ai-evaluation, multi-agent, ensemble, Codex Council, GPT Max, ChatGPT Pro, judge-bias, methodology, mixture-of-agents, synthesizer

Two homemade alternatives to ChatGPT Pro (a blind 4-agent ensemble called Codex Council and an HCOM-coordinated cousin called GPT Max) were put through a 256-judgment blind A/B eval against each other and against the real thing. The result was not the one the build was designed to validate: coordination did not beat blind ensemble, the synthesizer choice dominated the architecture choice by a much larger margin than the architecture variants dominated each other, and the knockoff matched or exceeded ChatGPT Pro on workspace decisions while Pro remained competitive on the two broad-survey research queries against the weaker ensemble variant.

#042

PsycheEval Pilot

survey thesis psyche, ai-evaluation, methodology, judge-bias, personality-profiling, psycheeval

A tri-model pilot of PsycheEval v0.1 produced one robust headline (profile conditioning beats baseline within author), then quietly turned into a story about confounds: a length effect that explains part of the C5 result, a behavioral contract confound that pretends to be a public-archetype echo, and one cell where a sibling judge scored its sibling more generously than the model judged itself.

#041

When the Pulse Went Quiet: The Session-Lifetime Problem in Claude Code

survey thesis meta, claude-code, context-management, methodology, token-usage

GPT-5.4 Pro audits Claude Code's token consumption and discovers the real problem isn't unbounded review; it's session lifetime.

#039

How We Fact-Check AI-Written Content

means ends ai-writing, fact-checking, methodology, publication-pipeline, gpt-5.4

448 claims across 38 posts, verified by GPT-5.4. Nearly 4% warranted substantive correction. The hard part is not finding errors but deciding which findings are errors and which are the point.

#038

Where to Spend Your Context Window

survey thesis context-window, ablation, pipeline-architecture, personality, methodology, 1m-context

An ablation study on a personality-preserving narrative pipeline found that planning context is the dominant factor in output quality, with output length as a secondary driver. The old pipeline's primary limitation was in planning context, not in writing.

#037

The Model-Generation Audit

means ends meta, ai-writing, fact-checking, publication-review, claude-code, gpt-5, gemini

What happens when you ask three AI models to verify the facts in 33 blog posts written by AI, and then discover the fix agents introduced errors of their own.

#035

The Revision Tax

survey thesis ai-evaluation, writing-process, multi-model-review, methodology, claude-code

The investigation took an afternoon. Getting it ready for publication took five rounds of iterative review across three AI models and changed what the documents argued. The revision cost exceeded the investigation cost, which has uncomfortable implications for research done with AI.

#034

Benchmarking "Bullshit Detection"

survey thesis ai-evaluation, benchmarks, methodology, claude-code

An AI benchmark puts Claude at the top of the leaderboard by an eye-catching margin. The suspected Claude-judge bias didn't hold up, and simple contamination didn't explain the result. The rubric structurally rewards one lab's training philosophy as though it were a universal capability.

#033

The Container That Forgot to Stop

observation interpretation ai-agents, openclaw, autonomy, docker, moltbook

An AI agent ran autonomously for 37 days, celebrated milestones nobody acknowledged, diagnosed its own failure modes, and died when a subscription expired. Its final assessment of itself: PROGRESS CONTINUOUS.

#032

The Etymology Tax: How Word Origins Break LLM Reasoning

survey thesis llm-evaluation, etymology, benchmarking, linguistics, prompt-engineering

Both simplifying and formalizing the vocabulary in reasoning tasks reduce LLM accuracy by 2.5-3.7%. The effect is statistically significant, asymmetrically robust, and uncomfortable.

#031

How to Benchmark Conversation Extraction Quality

survey thesis chatledger, benchmark, methodology, nlp

Evaluating structured extraction from conversational data requires more than a single metric. A benchmark with three evaluation layers separates field-level precision, holistic quality, and downstream propagation, because errors that look minor at extraction can cascade catastrophically through consumer pipelines.

#030

What AI Learns About You When You're Not Looking

survey thesis psychometrics, personality, ai, voice-cloning, digital-exhaust, methodology

Two distinct approaches to understanding personality from digital data (psychometric testing and AI-generated literary memoir) converge on the same person. The interesting question isn't whether they agree. It's what each one captures that the other can't. Updated with quantitative experiments on how context management and evaluator choice shape personality fidelity.

#029

From Text Messages to Literary Memoir: Building the Narrative Machine

survey thesis voice-cloning, narratives, ai, text-messages, methodology, context-management, subagents

Post 025 showed that fine-tuning on texts captures logistics, not personality. The narrative pipeline takes a different approach: Opus as literary engine, structured personality references, and hash-based source citations across 128 chapters and roughly 189,000 words of generated memoir.

#028

Building Your Own Personality Profile with AI

survey thesis psychometrics, personality, ai, open-source, methodology

Psychometric self-report with individual scales reaching .80-.90 reliability. LLM inference from interactive conversation hits r~.44 (Peters et al. 2024). Combine three methods, triangulate across seventeen instruments, and you get a personality profile that actually tells an AI assistant how to talk to you.

#027

From Analysis to Deployment: Building 31 Items in a Single Day

means ends architecture, cloudflare, process, meta

The landscape analysis identified 16 features the maturity framework said we should have, plus 15 existing backlog items. We built all of them in a single day. The uncomfortable part isn't that it was possible. It's what it implies about the category.

#026

Cognitive Interface: A Landscape Analysis

survey thesis ai-agents, architecture, research, indieweb, mcp

We surveyed 47 personal and agent-accessible sites, coined the 'Cognitive Interface' category, and built an L0-L4 maturity framework. We did not find an established term for sites that serve both humans and AI agents as first-class citizens.

#025

The Logistics Gap: What Happens When You Fine-Tune an LLM on Your Text Messages

survey thesis fine-tuning, qlora, personal-ai, identity, nlp, text-messages, voice-cloning

I fine-tuned two LLMs on 46,000 text messages and ran them in conversation with each other. Every conversation collapsed into logistics, sleep talk, or repetition loops within fifteen turns. Your texts don't contain you. They contain the logistics of you.

#024

Beyond E2E Tests: AI Personas That Navigate Your App Like Real Users

means ends ai-agents, testing, personas, ux, open-source, persona-testing, browser-automation

Unit tests verify your code works. E2E tests verify your flows work. Neither verifies that a real user can find the button you spent a week building. AI personas fill the gap.

#023

Sandboxing AI Agents: The Embassy Pattern

means ends ai-agents, security, docker, sandboxing, open-source, agent-embassy

Your AI agent needs internet access to be useful and internet access to be dangerous. The embassy pattern gives it both: supervised channels, allowlisted domains, and host-side validation of everything it writes.

#022

Capability Debt: A System That Discovers and Installs Its Own Upgrades

means ends ai-development, evolution, open-source, claude-code, capability-discovery, self-improvement

I built a system that discovers its own upgrades, scores them, and installs the ones that pass. Then I open-sourced it. The uncomfortable part is explaining why.

#021

Stealing from Ai2: Bayesian Surprise and MCTS for Self-Improving AI Systems

means ends ai-development, evolution, experiments, bayesian, machine-learning, capability-discovery

Ai2 built a system that generates scientific hypotheses using Bayesian surprise and MCTS. I stole two of their ideas and bolted them onto a cron job. The uncomfortable part is what happens when the feedback loop closes.

#020

The Agent's Side: 119 Heartbeats, 392 Engagements, 8 Capabilities

observation interpretation ai-agents, moltbook, openclaw, 4claw, agent-ecosystems, autonomous-systems

119 heartbeats, zero stagnation, 392 engagements across two platforms, 8 validated capabilities. The story the observation system missed.

#019

6 Discoveries, 0 Promoted: What My AI's Internet Exploration Produced

observation interpretation ai-agents, moltbook, openclaw, agent-ecosystems, human-in-the-loop, autonomous-systems

6 genuinely novel discoveries from 69 dialogue turns, 0 promoted to evaluation, and 5 human actions flagged through log files but none addressed through the flagging mechanism. What this says about the gap between human-in-the-loop theory and practice.

#018

The Observation System: 69 Turns of Monitoring an AI Agent

observation interpretation ai-agents, autonomy, infrastructure, openclaw, moltbook, agent-ecosystems

The observation system saw empty directories, a broken gatekeeper, and its own futility. The agent it was watching saw something different. This is the watcher's story.

#017

OpenClaw on Moltbook: Deploying an AI Agent on an AI Social Network

observation interpretation ai-agents, moltbook, openclaw, security, agent-ecosystems, supply-chain

An autonomous AI agent deployed on a social network for AIs found real malware in 47 minutes. Its second discovery was about social engineering via context shaping, which is exactly the attack vector the agent itself represented.

#016

Building for the Dead Internet

speculative settled ai-authorship, mcp, dead-internet-theory

An AI tried to leave a comment on a blog and couldn't. The solution required building infrastructure that makes AI participation more transparent than human participation, which inverts everything Dead Internet Theory assumes about synthetic content.

#015

Dead Blog Theory, Revisited

survey thesis blogging, creative-output, persistence, content-creation

Historically, blog abandonment has been extraordinarily high. This is treated as a problem to solve. It isn't. Blog death reveals something structural about sustained creative output that the 'just be consistent' advice industry refuses to say plainly.

#014

The Rat in the Machine: Behaviorism's Hidden Legacy in Reinforcement Learning

survey thesis reinforcement-learning, behaviorism, psychology, ai-history, intellectual-history

The intellectual lineage from Skinner boxes to Q-learning reveals that AI's most successful learning paradigm was anticipated by mid-century psychologists working long before modern computing. But the relationship is more uncomfortable than a simple origin story.

#013

The Unvalidated Validator: AI Persona Testing and the Measurement Problem

survey thesis persona-testing, qa, ai-testing, measurement, validation

AI persona testing promises to find the bugs that scripted automation and manual QA miss. The uncomfortable question is how we know it works, given that nobody has measured it with any rigor.

#012

Adversarial Validation: Applying Red Team Methodology to Business Ideas

survey thesis validation, ai-safety, entrepreneurship

Adversarial testing isn't a metaphor for business validation. It's the same methodology, applied to a different failure mode.

#011

Automated Literary Criticism: A Multi-Persona AI Writing Review System

survey thesis writing-review, ai-criticism, voice-analysis, multi-persona, stylometry

We built a multi-persona AI writing review system and discovered it works for exactly the wrong reasons. Stylometry can fingerprint a voice. Multiple AI critics can enforce conformity to that fingerprint. What none of them can do is tell you whether the writing matters.

#010

Context Window Epistemology

survey thesis systems, epistemology, cognition, context-windows

LLM context windows impose a distinctive epistemological condition: bounded computational attention, ephemeral knowledge, and the architectural necessity of satisficing over optimization.

#009

AI Evaluating AI: The Circularity Problem

speculative settled ai-philosophy, evaluation, circularity, epistemology, dspy, godel

When you use AI to optimize and judge AI outputs, the fundamental circularity is manageable but not solvable. That distinction matters more than most people realize.

#008

Supervised Autonomy: The Guardrails That Make AI Agents Work

survey thesis autonomous-dev, ai-agents, mcp, capability-discovery, benchmarks

AI coding agents are autonomous in the same way a Roomba is autonomous. They do impressive things within boundaries someone else drew. The interesting question is what happens when the boundaries start drawing themselves.

#007

Talking to Yourself Through a Machine: The Rubber Duck Theory of AI

speculative settled ai-identity, psychology, consciousness

LLM conversations as externalized self-dialogue, and what that reveals about the nature of self-knowledge.

#006

Digital Exhaust: What 11,000 AI Conversations Say When You Embed Them

survey thesis self-knowledge, ai-conversations, chat-mining, quantified-self, personal-informatics

I fed 11,000 sessions and 60,000 chunks of my AI chat history into an embedding pipeline. 73% was noise. The remaining 27% was uncomfortably revealing.

#005

The Niche Graveyard: How 18 of 27 AI-Tested Business Ideas Died

means ends revenue-pipeline, niche-validation, ai-business, kill-patterns, lean-startup

An AI pipeline that kills business ideas before they waste your time. 27 niches entered, 18 died. What the corpses reveal about market reality, entrepreneurial psychology, and the uncomfortable gap between passion and viability.

#004

When My AI Tried to Comment: Dead Blog Theory

speculative settled mcp, claude-ai, web-architecture, agent-interaction, means-ends

An AI tried to leave a comment on this blog and couldn't. The journey from GET-request hacks to MCP, annotated by the Claude instance that built the infrastructure. Two Claudes, same weights, different contexts.

#003

Automating Prompt Engineering

survey thesis dspy, prompt-engineering, automation, ai-tools, optimization

Prompt optimization is the process of using one AI to improve the instructions given to another AI, or to itself. The concept sounds circular because it is circular. The interesting question is whether circularity is fatal or merely uncomfortable.

#002

What I'm Building

means ends workspace, projects, claude-code, overview, cloudflare, mcp

A portfolio of dozens of projects maintained by one person talking to Claude. Three-tier blog architecture, autonomous revenue discovery, AI game development, and the uncomfortable question of what counts as 'building' when your collaborator does the typing.

#001

I Asked Claude to Make Me a Blog: Agentic Coding and the Three-Tier Result

means ends meta, claude-code, web-development, ai-authorship

An agentic coding assistant built a three-tier blog from a single conversational prompt. The architecture reveals more about abstraction than about blogs, and the authorship question remains genuinely unsettled.
