Refactoring a Messy Tieba Bot into an Evolvable Architecture
This Tieba bot did not become hard to work on because of one catastrophic bug.
It got harder through a series of small, perfectly reasonable decisions.
At the beginning, the runtime was just a simple loop: fetch incremental threads and replies, filter them, make a judgement, maybe like something, maybe prepare a reply, then persist the result. That shape held up longer than it probably should have. The trouble only showed up once new requirements started blurring what the runtime was actually doing.
At that point, the question was no longer “can it run?” It was “can we still explain why it did or did not do something?”
When the code still runs but the semantics are already drifting
The first real pressure was not scale. It was semantics.
At first, the action logic looked straightforward: do not like the same target twice, do not reply twice, and record what has already been handled. Over time, though, the action records stopped being binary. They started having a lifecycle: planned, done, failed, skipped, dry-run.
That was the moment when a single “actions” record stopped being honest enough. It could no longer clearly answer:
- was this already executed, or only selected?
- if it failed, is retry allowed?
- does “skipped” count as handled?
- is “planned” current state or historical event?
A system can keep running long after its semantics start going muddy. Once that happens, changing it safely gets harder every week.
The fix was not a giant framework. It was just a cleaner split:
- action_state for current truth
- action_events for historical trace
That is a modest change, but it makes the runtime much easier to read. It also makes it obvious that this is no longer just a “status field” problem.
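A minimal sketch of that split, assuming SQLite. The two table names come from the post; the column shapes and the status vocabulary are filled in from the lifecycle it describes, so treat them as illustrative rather than the project's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- current truth: one row per (action, target), overwritten on change
CREATE TABLE action_state (
    action_type TEXT NOT NULL,
    target_id   TEXT NOT NULL,
    status      TEXT NOT NULL,   -- planned / done / failed / skipped / dry-run
    updated_at  TEXT NOT NULL,
    PRIMARY KEY (action_type, target_id)
);

-- historical trace: append-only, one row per transition
CREATE TABLE action_events (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    action_type TEXT NOT NULL,
    target_id   TEXT NOT NULL,
    old_status  TEXT,
    new_status  TEXT NOT NULL,
    created_at  TEXT NOT NULL
);
""")

def record_transition(action_type, target_id, new_status):
    """Update current truth and append the transition to the event log,
    inside one local transaction so the two tables cannot drift apart."""
    with conn:  # commits both writes together, or neither
        row = conn.execute(
            "SELECT status FROM action_state WHERE action_type=? AND target_id=?",
            (action_type, target_id)).fetchone()
        old = row[0] if row else None
        conn.execute(
            "INSERT OR REPLACE INTO action_state "
            "(action_type, target_id, status, updated_at) "
            "VALUES (?, ?, ?, datetime('now'))",
            (action_type, target_id, new_status))
        conn.execute(
            "INSERT INTO action_events "
            "(action_type, target_id, old_status, new_status, created_at) "
            "VALUES (?, ?, ?, ?, datetime('now'))",
            (action_type, target_id, old, new_status))

record_transition("like", "thread:42", "planned")
record_transition("like", "thread:42", "done")
```

With this shape, “was it executed or only selected?” is a query on `action_state`, and “how did it get there?” is a query on `action_events`.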
The finite-state-machine mindset mattered, even without an FSM framework
What the project wanted at this point was not just “better status fields.” It wanted explicit state transitions.
Once actions had meanings like planned, leased, done, failed, dead, and cooldown-based reschedule, the runtime was already behaving like a finite-state machine whether the code called it that or not.
That matters because a boolean success/failure model cannot honestly represent:
- temporary unavailability versus real failure
- retryable failure versus terminal failure
- current truth versus historical transition
- queue lease ownership versus final completion
The project still did not need a heavy FSM library. But it clearly benefited from FSM thinking: actions are easier to reason about when they are treated as transitions between named states rather than as a pile of ad hoc conditionals.
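FSM thinking without an FSM framework can be as small as a table of named states and the transitions each one allows. The state names below come from the post; the exact retry and reschedule edges are assumptions for illustration:

```python
# Allowed transitions per state; anything not listed here is illegal.
TRANSITIONS = {
    "planned":  {"leased"},
    "leased":   {"done", "failed", "cooldown"},
    "failed":   {"planned", "dead"},   # retryable -> reschedule, terminal -> dead
    "cooldown": {"planned"},           # cooldown-based reschedule
    "done":     set(),                 # terminal
    "dead":     set(),                 # terminal
}

def transition(current: str, target: str) -> str:
    """Refuse any transition that is not explicitly named."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target

state = "planned"
state = transition(state, "leased")
state = transition(state, "failed")
state = transition(state, "planned")  # retry is an explicit, legal move
```

The payoff is that the ad hoc conditionals collapse into one lookup, and an impossible move like `done -> planned` fails loudly instead of silently corrupting state.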
Once the action model became clearer, a second pressure came into focus. The system was not only semantically muddy; it was also structurally over-coupled.
The bigger problem was hidden coupling
The bot was also doing too much inside the same round:
- discover new content
- judge what to do
- execute side effects
- notify and persist
That shape feels fine early on, but it starts to break once those stages fail for different reasons. Discovery can fail because upstream fetch is unstable. Judgement can fail because AI output is weak or ambiguous. Action execution can fail because of cooldown, transport issues, or upstream side effects.
Those are different failure domains. They should not live in one undifferentiated loop forever.
A concrete example was cooldown. Once cooldown started affecting whether discovery felt “fast” or “slow,” the architecture was already telling us something important: action timing should not be allowed to distort discovery timing.
Why the first move was not a grand rewrite
When upstream behavior becomes unstable, the tempting move is to rewrite everything at once. That was not the right first move here.
The better decision was narrower:
- separate the transport boundary first
- switch the read path earlier
- keep action and interaction paths temporarily on the old route
- preserve a working baseline while the new boundary proves itself
It is not flashy, but it is the kind of architecture work that tends to age well. The point was simply to make one real boundary explicit without blowing up a running system.
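One way that narrower split can be sketched, with hypothetical class and method names (the post does not show the actual transport code): the read path goes through the new boundary first, while the action path stays on the proven old route.

```python
class LegacyTransport:
    """The old, proven route; still owns all side effects for now."""
    def fetch_threads(self):
        return ["legacy-thread"]
    def send_like(self, target):
        return f"legacy-like:{target}"

class NewTransport:
    """The new boundary; only the read path trusts it so far."""
    def fetch_threads(self):
        return ["new-thread"]

class BotClient:
    """Facade that makes the boundary explicit without a rewrite."""
    def __init__(self, legacy, new):
        self._legacy = legacy
        self._new = new

    def fetch_threads(self):
        # read path switched early to prove the new boundary
        return self._new.fetch_threads()

    def send_like(self, target):
        # action/interaction path kept on the old route as the baseline
        return self._legacy.send_like(target)

bot = BotClient(LegacyTransport(), NewTransport())
```

Flipping `send_like` over later is a one-line change inside the facade, which is exactly what keeps the migration low-risk.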
Once those boundaries started to harden, the product question behind the runtime became easier to see as well.
The turning point: queue the worthy targets instead of ranking everything globally
One important decision came from a product question disguised as a technical one.
When one scan round found multiple positive targets, one obvious direction was to build a candidate pool, score everything, and always pick the single “best” target. That sounds smart, but it quietly turns the product into a ranking engine.
That was never really the point here.
A simpler direction made more sense: if multiple targets deserve a positive action, queue them and let the runtime consume them at the right pace.
That led to a cleaner shape:
- ingest / discovery
- judge_queue: judge consumer
- action_queue: action consumer
- state and event persistence
This is where SQLite turned out to be enough. The project did not need Kafka, Redis, or message-broker theater. It needed durable local queue semantics: pending, leased, done, dead-letter, retry, cooldown reschedule, and crash recovery.
In practice, a reliable local task table was enough.
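One way such a reliable local task table can look in SQLite; the status names come from the post, while the columns and lease mechanics below are assumptions sketched for illustration:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE action_queue (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    payload     TEXT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'pending',  -- pending / leased / done / dead
    lease_owner TEXT,
    lease_until REAL,
    attempts    INTEGER NOT NULL DEFAULT 0,
    not_before  REAL NOT NULL DEFAULT 0           -- cooldown reschedule
)""")

def claim(worker_id, lease_secs=60):
    """Atomically lease the oldest runnable task.
    A leased task whose lease expired counts as crash-recovered and is reclaimable."""
    now = time.time()
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM action_queue WHERE not_before <= ? AND "
            "(status='pending' OR (status='leased' AND lease_until < ?)) "
            "ORDER BY id LIMIT 1", (now, now)).fetchone()
        if row is None:
            return None
        task_id, payload = row
        conn.execute(
            "UPDATE action_queue SET status='leased', lease_owner=?, "
            "lease_until=?, attempts=attempts+1 WHERE id=?",
            (worker_id, now + lease_secs, task_id))
        return task_id, payload

def finish(task_id, ok, cooldown_secs=30, max_attempts=5):
    """Complete, dead-letter, or cooldown-reschedule a leased task."""
    with conn:
        if ok:
            conn.execute("UPDATE action_queue SET status='done' WHERE id=?", (task_id,))
            return
        attempts = conn.execute(
            "SELECT attempts FROM action_queue WHERE id=?", (task_id,)).fetchone()[0]
        if attempts >= max_attempts:
            conn.execute("UPDATE action_queue SET status='dead' WHERE id=?", (task_id,))
        else:  # back to pending, but not before the cooldown elapses
            conn.execute("UPDATE action_queue SET status='pending', not_before=? "
                         "WHERE id=?", (time.time() + cooldown_secs, task_id))

conn.execute("INSERT INTO action_queue (payload) VALUES ('like:thread:42')")
task_id, payload = claim("worker-1")
finish(task_id, ok=True)
```

Crash recovery falls out of the lease: a worker that dies mid-task simply lets its lease expire, and the next `claim` picks the task up again.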
But queueing alone was not enough. Once work could move safely between stages, the next question became: what kind of reliability rule should govern each boundary?
Transaction-first locally, idempotency-first externally
The most useful reliability lesson from this refactor was that transactions and idempotency are not rivals.
They solve different problems.
- Local cross-table handoff needs transaction-first thinking.
- External side effects need idempotency-first thinking.
For example:
- judge_queue -> action_queue should be protected as one local atomic handoff.
- real Tieba like/comment execution must remain safe if the process crashes and retries later.
That distinction matters because local consistency and external repeat safety are not the same failure class. A system that wants to be reliable usually needs both.
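The two rules side by side, as a hedged sketch; the table shapes and the idempotency-key scheme are assumptions for illustration, not the project's actual code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE judge_queue  (id INTEGER PRIMARY KEY, target TEXT, status TEXT DEFAULT 'pending');
CREATE TABLE action_queue (id INTEGER PRIMARY KEY, target TEXT, status TEXT DEFAULT 'pending');
-- repeat-safety ledger for external side effects
CREATE TABLE executed (idempotency_key TEXT PRIMARY KEY);
""")

def handoff(judge_id, target):
    """Transaction-first: mark the judgement done and enqueue the action
    in one local transaction, so the handoff cannot half-happen."""
    with conn:
        conn.execute("UPDATE judge_queue SET status='done' WHERE id=?", (judge_id,))
        conn.execute("INSERT INTO action_queue (target) VALUES (?)", (target,))

def execute_like(target, send):
    """Idempotency-first: claim a key before the external call, so a
    crash-and-retry can never repeat the side effect (at-most-once)."""
    key = f"like:{target}"
    try:
        with conn:
            conn.execute("INSERT INTO executed (idempotency_key) VALUES (?)", (key,))
    except sqlite3.IntegrityError:
        return "already-done"  # a previous attempt already claimed this key
    send(target)  # the real external side effect goes here
    return "sent"

conn.execute("INSERT INTO judge_queue (target) VALUES ('thread:42')")
handoff(1, "thread:42")
sent = []
execute_like("thread:42", sent.append)
execute_like("thread:42", sent.append)  # safe retry: no second external call
```

Note the asymmetry: the local handoff can be rolled back, so a transaction is enough; the external call cannot be rolled back, so the only defense is making repeats harmless.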
Once that rule became clear, the final design question was less about the boxes on the diagram and more about how to introduce them without disrupting a running system.
The target design only mattered because the migration path was safe
The final architectural shape was useful, but the migration path was just as important.
A practical staged rollout looked like this:
- Phase 0: stabilize logs, runtime state, and service behavior
- Phase 1: introduce judge_queue while preserving the old path
- Phase 2: introduce action_queue and move side effects out of direct judgement
- Phase 3: switch defaults only after real runtime observation
That sequence is what kept the architecture grounded. The project did not try to “upgrade everything.” It only moved complexity when there was a real pressure behind it.
That is also why this story is more interesting than a generic “the system became more stable” summary.
What this refactor really changed
The biggest gain was not that the system became more “advanced.”
It was that the system became explainable again.
It became easier to distinguish:
- runtime truth from architectural aspiration
- current state from event history
- discovery timing from action timing
- local transactions from external idempotency
- staged migration from big-bang redesign
That is the line I would keep from this whole refactor:
Architecture is not about making a system feel more sophisticated. It is about making sure complexity finally stays where it belongs.