Mokksy: A Mock Server That Actually Streams — and Why Your AI App Needs Integration Tests

I’ve spent a fair chunk of my career recovering projects. Not greenfield ones — the other kind. The ones where the team ships a release and something breaks in staging, or worse, in production. The payment gateway returns a 502 and the retry logic wasn’t tested. The response stream drops in the middle, and nobody knows what the HTTP client does next.
In finance and telecom, these aren’t hypothetical scenarios. They’re Tuesdays.
Every single time I’ve been brought in to stabilise a project — to take it from “we can’t release” to “we ship on Fridays with confidence” — the fix was the same. Not more unit tests. Not better mocks at the class level. Integration tests that treat the service as a black box, hit it over HTTP, and verify what actually comes back.
This article is about why that matters more than ever for AI applications, and about a tool I built to make it possible: Mokksy.
The testing gap nobody talks about #
Here’s a conversation I’ve had more times than I’d like:
“We have 90% code coverage. Why are we still getting production incidents?”
Because code coverage measures which lines the test runner touched, not whether your application actually works. There’s a meaningful difference between testing that your ChatService class calls the right method on a mocked HttpClient, and testing that your deployed application correctly handles a streaming response from an LLM provider over HTTP with Server-Sent Events.
White-box unit tests — the kind where you mock out every dependency with MockK or Mockito — are valuable. I’m not arguing against them. But they’re not sufficient. Here’s why:
They don’t cover your application configuration. In a Spring Boot application, your application.yml wires together dozens of beans, timeouts, retry policies, serialization settings, and content negotiation rules. A unit test that injects a mocked HTTP client bypasses all of that. The first time your real configuration is exercised is when you deploy.
They don’t cover HTTP client internals. When you mock at the service layer, you’re assuming the HTTP client behaves exactly as you expect. But what happens when the Ktor CIO engine changes how it handles text/event-stream? What happens when OkHttp’s connection pool interacts badly with your REST endpoint? What happens when Apache HttpClient 5.x subtly changes its chunked transfer decoding? You’ll never catch that with a unit test. You’ll catch it at 3am.
Here’s a concrete example. Your LLM integration streams chat completions over SSE. Your read timeout is set to 30 seconds in application.yml, but the unit test injects a mocked client that returns instantly. In production, the LLM provider takes 35 seconds on a complex prompt. The connection drops, your retry logic kicks in, and now you’ve got a duplicate request burning tokens. A unit test that mocks HttpClient would never have caught that — because the timeout configuration was never exercised.
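To make that concrete, here is a minimal sketch of the kind of client configuration involved (assuming a Ktor client; the 30-second values mirror the scenario above and are illustrative):

```kotlin
import io.ktor.client.*
import io.ktor.client.plugins.*

// These timeout values live in configuration. A unit test that injects
// a mocked client never exercises them -- only an integration test over
// real HTTP does.
val client = HttpClient {
    install(HttpTimeout) {
        requestTimeoutMillis = 30_000 // whole request, including streaming
        socketTimeoutMillis = 30_000  // maximum gap between packets on the wire
    }
}
```

An integration test against a mock server that deliberately delays its response past that limit is the only way to see what your stack actually does when the limit is hit.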
They’re fragile in ways that matter. White-box tests are tightly coupled to implementation details. Refactor how your service calls the downstream — same behaviour, different internal structure — and half your tests break. This creates a perverse incentive: the team avoids refactoring because the test suite is too painful to update. The codebase calcifies.
They require deep knowledge of the transport layer. To write a good white-box test for streaming LLM responses, the developer needs to understand SSE framing, chunked transfer encoding, backpressure semantics, and the specific quirks of whatever HTTP client library they’re using. That’s a lot to ask. An integration test just says: “I send this request, I expect this response.”
For high-risk systems — the kind where a bug means a failed trade, a dropped call, or a hallucinating financial adviser — this isn’t a matter of preference. It’s a matter of engineering discipline.
What does this have to do with AI? #
Modern AI applications are, at their core, HTTP services that talk to other HTTP services. Your chatbot sends requests to OpenAI, Anthropic, Gemini, a remote MCP server, or a self-hosted Ollama instance. It receives responses — often as a stream of Server-Sent Events that transfer the response token by token over a long-lived connection.
But the surface area is larger than traditional HTTP integrations. LLM calls involve long-lived connections that stay open for seconds or minutes. Latency is non-deterministic — a simple prompt might return in 200 milliseconds while a complex reasoning chain takes 40 seconds. Partial streaming responses can arrive with unpredictable timing between chunks. Rate limiting and token quotas add failure modes that don’t exist in conventional REST APIs. And if your application uses tool calling or function invocation, a single user request can trigger multiple cascading HTTP exchanges with the LLM provider.
This makes AI applications prime candidates for integration testing with a mock server. The problem is that the most popular mock server in the JVM ecosystem — WireMock — wasn’t built for this.
WireMock’s response model is fundamentally static: you define a response body, optionally apply templating, and the server returns it in one go. You can return a body that looks like SSE data — the right text/event-stream content type, the right data: framing — but it’s still a single response payload written to the socket at once. That’s not how SSE works in practice. A real SSE connection is a long-lived HTTP response where the server incrementally emits events over time. The connection stays open. Events arrive one by one. The client processes them as they come in.
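For illustration, this is roughly what an SSE-lookalike stub looks like in WireMock (a sketch; the payload is made up):

```kotlin
import com.github.tomakehurst.wiremock.client.WireMock.*

// The body *looks* like SSE framing, but WireMock writes the entire
// payload to the socket in one go: no incremental delivery, no control
// over the timing between events.
stubFor(
    post(urlEqualTo("/v1/chat/completions")).willReturn(
        aResponse()
            .withHeader("Content-Type", "text/event-stream")
            .withBody("data: {\"choices\":[{\"delta\":{\"content\":\"Hi\"}}]}\n\ndata: [DONE]\n\n")
    )
)
```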
What you can’t do with WireMock is model the things that actually break in production: a 500 ms pause between the third and fourth chunk, a connection that drops after emitting half the tokens, backpressure when the client can’t consume events fast enough, or a stream that hangs indefinitely to test your timeout handling.
For LLM applications, that’s precisely where the bugs live.
Enter Mokksy #
Mokksy is a mock HTTP server built with Kotlin and Ktor. It exists because I needed something that WireMock couldn’t do: true streaming and Server-Sent Events support, with precise control over timing.
The design philosophy is straightforward: give you a local HTTP server that your application talks to as if it were a real external service, with a clean Kotlin DSL for defining what it should respond with.
Getting started #
Add the dependency:
```kotlin
// build.gradle.kts
dependencies {
    testImplementation("dev.mokksy:mokksy-jvm:$latestVersion")
}
```
Create and start the server:
```kotlin
val mokksy = Mokksy().apply {
    runBlocking {
        startSuspend()
    }
}
```
Point your HTTP client at it:
```kotlin
val client = HttpClient {
    install(DefaultRequest) {
        url(mokksy.baseUrl())
    }
}
```
That’s it. You now have a local HTTP server that your application can talk to, and you control every response.
Simple request/response #
For a basic GET endpoint:
```kotlin
mokksy.get {
    path = beEqual("/ping")
    containsHeader("Authorization", "Bearer test-token")
} respondsWith {
    body = """{"status": "ok"}"""
}

// when
val result = client.get("/ping") {
    headers.append("Authorization", "Bearer test-token")
}

// then
result.status shouldBe HttpStatusCode.OK
result.bodyAsText() shouldBe """{"status": "ok"}"""
```
Mokksy uses Kotest assertions for request matching. If a request doesn’t match any stub, the server returns 404 Not Found — just like a real server would. No silent swallowing of unexpected calls.
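That behaviour can be asserted directly. A small sketch (the path is arbitrary):

```kotlin
// No stub is registered for this path, so Mokksy answers 404 Not Found,
// exactly as a real server would for an unknown route.
val unmatched = client.get("/does-not-exist")
unmatched.status shouldBe HttpStatusCode.NotFound
```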
For POST requests with body matching:
```kotlin
mokksy.post {
    path = beEqual("/v1/chat/completions")
    bodyContains("gpt-4")
} respondsWith {
    body = chatCompletionResponse
    httpStatus = HttpStatusCode.OK
    headers {
        append(HttpHeaders.ContentType, "application/json")
    }
}

// when
val result = client.post("/v1/chat/completions") {
    setBody("""{"model":"gpt-4","messages":[{"role":"user","content":"Hi"}]}""")
}

// then
result.status shouldBe HttpStatusCode.OK
result.bodyAsText() shouldBe chatCompletionResponse
```
Where Mokksy shines: streaming #
Here’s the part that matters for AI applications. Mokksy supports true SSE with Kotlin Flow, which means you can model exactly what a real LLM provider does — emit chunks with realistic timing:
```kotlin
mokksy.post {
    path = beEqual("/v1/chat/completions")
    bodyContains("\"stream\":true")
} respondsWithSseStream {
    flow = flow {
        emit(ServerSentEvent(data = """{"choices":[{"delta":{"content":"Hello"}}]}"""))
        delay(100.milliseconds)
        emit(ServerSentEvent(data = """{"choices":[{"delta":{"content":" world"}}]}"""))
        delay(50.milliseconds)
        emit(ServerSentEvent(data = "[DONE]"))
    }
}

// when
val result = client.post("/v1/chat/completions") {
    setBody("""{"model":"gpt-4","stream":true}""")
}

// then
result.status shouldBe HttpStatusCode.OK
result.contentType() shouldBe ContentType.Text.EventStream.withCharsetIfNeeded(Charsets.UTF_8)
val body = result.bodyAsText()
body shouldContain "Hello"
body shouldContain " world"
body shouldContain "[DONE]"
```
This is a real SSE stream. Your HTTP client opens a persistent connection, receives events as they arrive, and processes them incrementally. The delays between chunks are real delays. If your application has a timeout set to 80ms between chunks, this test will catch it.
You can simulate:
- Slow responses — add `delay()` calls to test timeout handling
- Partial failures — emit a few chunks then throw an exception to test error recovery
- Backpressure — control the flow rate to test buffering behaviour
- Empty streams — test what happens when the server sends headers but no data
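A partial-failure stub, for example, might look like this (a sketch built from the same DSL as above; the exception type and exactly how the broken stream surfaces in your client are assumptions worth verifying against your error handling):

```kotlin
import java.io.IOException

mokksy.post {
    path = beEqual("/v1/chat/completions")
} respondsWithSseStream {
    flow = flow {
        // emit half the answer, then fail mid-stream
        emit(ServerSentEvent(data = """{"choices":[{"delta":{"content":"Hel"}}]}"""))
        delay(100.milliseconds)
        throw IOException("connection reset") // simulates a dropped stream
    }
}
```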
None of this is possible with WireMock’s static response model.
Verifying what happened #
Mokksy records incoming requests and provides two complementary verification methods:
```kotlin
// Did every stub get called? Catches dead code paths.
mokksy.verifyNoUnmatchedStubs()

// Did any unexpected requests arrive? Catches unintended API calls.
mokksy.verifyNoUnexpectedRequests()
```
In practice, I run verifyNoUnexpectedRequests() after every test. If my application makes an HTTP call I didn’t anticipate, I want to know immediately — not after deployment.
The scope: your service, not the world #
Before diving into test structure, let me be clear about the architectural boundary.
The scope of these integration tests is a single deployable unit — your microservice or your monolith. You start your application, point it at a Mokksy instance instead of the real OpenAI API, and test it end-to-end over HTTP.
You’re not testing OpenAI. You’re testing your application’s behaviour when it talks to something that behaves like OpenAI. That’s the right boundary. You control both sides. The tests are deterministic. They run in milliseconds. They run in CI without API keys, rate limits, or network dependencies. Run them on your laptop while you’re on a flight ✈️.
“Why not Testcontainers with a real service?” #
This is the first objection I hear from senior engineers, and it’s a fair one. Why not spin up Ollama in Docker and test against the real thing?
Because the goal isn’t to test the LLM provider. The goal is to test your code — your serialization logic, your timeout handling, your retry policies, your error recovery, your stream processing. For that, you need control: deterministic failure modes, precise timing between chunks, zero external dependencies, no rate limits, no API keys, no quota issues, and the ability to simulate specific error conditions on demand. A real LLM running in Docker gives you none of that. It gives you non-deterministic responses, unpredictable latency, and a container that takes 30 seconds to start. That’s great for smoke testing. It’s terrible for a fast, reliable CI pipeline.
This approach applies equally to payment processors, trading APIs, voice streaming services, webhooks, or any HTTP integration. The technology changes. The principle doesn’t. It’s the same approach that has worked for me in finance — where the “external service” was a FIX protocol gateway — and in telecom — where it was a voice transcription API and LLM endpoint.
Recommended test structure #
Here’s the pattern I use for integration tests. A few deliberate choices worth noting: @TestInstance(PER_CLASS) lets you share the server instance across tests without restarting it for each method — important when server startup time matters for fast integration tests. Random port binding avoids collisions in parallel CI runs. Verification in @AfterEach catches unexpected requests immediately per test, whilst verifyNoUnmatchedStubs() in @AfterAll catches dead stubs at the class level — useful for detecting stubs that were set up but never exercised due to a code path change.
```kotlin
@TestInstance(TestInstance.Lifecycle.PER_CLASS)
class ChatServiceIntegrationTest {

    val mokksy = Mokksy()
    lateinit var client: HttpClient

    @BeforeAll
    suspend fun setup() {
        mokksy.startSuspend() // binds to random available port
        client = HttpClient {
            install(DefaultRequest) {
                url(mokksy.baseUrl())
            }
        }
    }

    @Test
    suspend fun `should handle streaming response`() {
        mokksy.post {
            path = beEqual("/v1/chat/completions")
        } respondsWithSseStream {
            flow = flow {
                emit(ServerSentEvent(data = """{"choices":[{"delta":{"content":"Hi"}}]}"""))
                delay(100.milliseconds)
                emit(ServerSentEvent(data = "[DONE]"))
            }
        }

        val response = client.post("/v1/chat/completions") {
            setBody("""{"model":"gpt-4","stream":true}""")
        }

        response.status shouldBe HttpStatusCode.OK
        response.bodyAsText() shouldBe
            "data: {\"choices\":[{\"delta\":{\"content\":\"Hi\"}}]}\r\ndata: [DONE]\r\n"
    }

    @AfterEach
    fun afterEach() {
        mokksy.verifyNoUnexpectedRequests()
    }

    @AfterAll
    suspend fun afterAll() {
        mokksy.verifyNoUnmatchedStubs()
        client.close()
        mokksy.shutdownSuspend()
    }
}
```
Beyond raw HTTP: AI-Mocks #
Mokksy handles generic HTTP mocking. But if you’re testing against a specific LLM provider’s API, you don’t want to hand-craft JSON payloads for every test. That’s what AI-Mocks is for — a provider-specific DSL layer built on top of Mokksy.
Instead of constructing raw SSE frames and JSON chat completion responses, you write this:
```kotlin
val openai = MockOpenai(verbose = true)

// define mock response
openai.completion {
    model = "gpt-4o-mini"
    userMessageContains("say 'Hello!'")
} responds {
    assistantContent = "Hello!"
    finishReason = "stop"
    delay = 200.milliseconds
}

// or stream it
openai.completion {
    model = "gpt-4o-mini"
} respondsStream {
    responseChunks = listOf("All", " we", " need", " is", " Love")
    delayBetweenChunks = 10.milliseconds
    finishReason = "stop"
}
```
AI-Mocks supports OpenAI, Anthropic, Google Gemini, Ollama, and Google’s Agent-to-Agent (A2A) protocol — all with streaming, error simulation, and integration tested against official SDKs, LangChain4j, and Spring AI.
But the foundation is Mokksy. It’s the HTTP server that makes the streaming work. AI-Mocks is a convenience layer that saves you from thinking in SSE frames and JSON-RPC payloads.
Wrapping up #
If you’re building AI applications on the JVM — with Spring Boot, Quarkus, Ktor, or anything else — and your testing strategy is limited to unit tests with mocked dependencies, you have a gap. That gap will show up in production, probably at the worst possible moment.
Integration tests with a real HTTP mock server close that gap. Mokksy gives you the streaming and SSE support that WireMock can’t, with a clean Kotlin DSL that makes the tests readable and maintainable.
Start small. Pick your most critical path — the one that talks to the LLM provider — and write one integration test with Mokksy. Run it in CI. You’ll sleep better.
Links: