Integration Testing on the JVM: My Ideal Process, End to End

Unit tests tell me a function does what I think it does. They don’t tell me my service starts, binds its ports, reads its config, talks to a database, consumes from Kafka, and survives an LLM provider returning a 503 mid-stream. That second category is where most production incidents live, and it’s the one I care about most.

This post lays out the integration-testing process I’ve converged on for JVM web services. The examples come from three repositories you can clone and run: koog-spring-boot-assistant (Spring Boot + WebFlux), quarkus-assistant-demo (Quarkus), and Mokksy (for the Docker-image variant). They’re Kotlin because suspending functions and a fluent DSL make these tests pleasant to read — but everything here applies to Java too, and with virtual threads on Java 21+ you get the same ergonomics without coroutines.

Assume a typical service: a REST API, perhaps a WebSocket or messaging endpoint (Kafka/SQS), a database, and an outbound dependency or two — here, an LLM provider. The system under test (SUT) is a real, booted application, not a sliced @WebMvcTest context.

Put end-to-end tests in their own module

The decision that pays off most: integration tests live in a separate module, not in src/test alongside your unit tests.

In the koog repository the root pom.xml declares two modules:

xml

<modules>
    <module>app</module>
    <module>integration-tests</module>
</modules>

The integration-tests module depends on app as a black box. It builds the application, then drives it from the outside over HTTP and WebSocket — the same surface a real client sees. No reaching into Spring beans, no @MockBean, no shared application-context tricks.

Three reasons the separation earns its keep:

The two suites run on different clocks. Unit tests are cheap and run on every save. Integration tests boot a real app and cost real seconds. Mix them, and your fast feedback loop inherits the slow suite’s startup cost.
Failsafe and Surefire already want this split. By Maven convention, Surefire runs unit tests in the test phase, while Failsafe runs integration tests in integration-test and fails the build in verify. mvn verify runs everything; mvn test stays fast.
The dependency direction stays honest. The test module can only see the public surface, which stops you from accidentally testing implementation details.

On Gradle, the same idea maps to a dedicated source set or a separate subproject. The module boundary is the point, not the build tool.
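As a rough sketch, the separate-subproject variant on Gradle could look like the fragment below. The two project names mirror the Maven modules above; the root project name, the dependency coordinates, and the assumption that plugin and library versions are managed in the root build are illustrative and not taken from the repositories.

```kotlin
// ---- settings.gradle.kts ----
rootProject.name = "assistant"              // hypothetical root project name
include("app", "integration-tests")

// ---- integration-tests/build.gradle.kts ----
plugins {
    kotlin("jvm")                           // version assumed to be declared in the root build
}

dependencies {
    // The app is on the test classpath only so it can be booted in-process;
    // the tests themselves still talk to it over HTTP and WebSocket.
    testImplementation(project(":app"))
    // Version assumed to come from a BOM or platform declared elsewhere.
    testImplementation("org.junit.jupiter:junit-jupiter")
}

tasks.test {
    useJUnitPlatform()
}
```

With a layout like this, `./gradlew :app:test` keeps the fast loop fast, and `./gradlew :integration-tests:test` runs the slow suite on its own.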

Bring the Environment up before the Server

There are two layers of infrastructure, and the order matters: the Environment starts first, the Server second.

The Environment aggregates everything the SUT depends on: a database, a Kafka or SQS simulator, an HTTP stub for third-party APIs (WireMock or Mokksy), and — for an AI service — an LLM simulator. In the koog repository the LLM side is ai-mocks, Mokksy’s OpenAI-shaped mock server:

kotlin

object TestEnvironment {
    val mockOpenai = MockOpenai(verbose = true)

    init {
        Awaitility.setDefaultTimeout(5.seconds.toJavaDuration())
        Awaitility.setDefaultPollDelay(500.milliseconds.toJavaDuration())
        Awaitility.setDefaultPollInterval(500.milliseconds.toJavaDuration())

        System.setProperty("OPENAI_API_KEY", "dummyOpenAIKey")
        System.setProperty("spring.profiles.active", "test")
    }
}

Every dependency binds to an ephemeral port. Don’t hardcode 5432 or 9092 — let the OS assign a free port and read it back. This is what lets the full suite run on a laptop while Docker is busy with three other projects, and it’s a hard requirement for parallel CI.
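To make the ephemeral-port idea concrete, here is a minimal sketch that uses only the JDK's built-in HTTP server as a stand-in for a real stub such as WireMock or Mokksy. The function names and the /health path are made up for the illustration; the part that matters is binding to port 0 and reading the assigned port back.

```kotlin
import com.sun.net.httpserver.HttpServer
import java.net.InetSocketAddress
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Starts a tiny stub on port 0, which asks the OS for any free port.
fun startStubOnFreePort(): HttpServer {
    val server = HttpServer.create(InetSocketAddress("127.0.0.1", 0), 0)
    server.createContext("/health") { exchange ->
        val body = "ok".toByteArray()
        exchange.sendResponseHeaders(200, body.size.toLong())
        exchange.responseBody.use { it.write(body) }
    }
    server.start()
    return server
}

// Reads the assigned port back from the server and calls the stub on it.
fun probe(server: HttpServer): String {
    val port = server.address.port
    val request = HttpRequest.newBuilder(URI.create("http://127.0.0.1:$port/health")).build()
    return HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())
        .body()
}
```

Two stubs started this way get different ports, so two copies of the suite can run side by side without colliding. The port read back from `server.address.port` is the value you would hand to the Server as configuration.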

Real downstreams belong in Testcontainers. For dependencies you can’t fake faithfully — a real Postgres, a real Redis — the Environment starts them as containers and hands their mapped ports to the Server. Start them individually, or point Testcontainers at a docker-compose.yml so your test topology and your local-dev topology are the same file. One source of truth beats two that drift apart.

For Kafka, reach for Redpanda rather than the full Kafka image. It’s Kafka-API-compatible, starts in a second or two instead of waiting on a ZooKeeper/KRaft dance, and Testcontainers ships a first-class RedpandaContainer. On a suite where startup time is the budget, that swap alone buys back minutes.

The LLM simulator deserves a callout, because it changes what “integration test” even means for an AI service. A real model is slow, nondeterministic, and costs money per call. Mokksy lets me assert on the request the app sends and script the response — including token-by-token streaming and deliberate failures. A flaky, expensive dependency becomes a fast, deterministic one I fully control.

Simulate external services; test the real contract separately

That phrase — one I fully control — is the whole argument, and it’s worth dwelling on, because the alternative is a trap I’ve watched good teams fall into.

At one payment service provider, the integration suite ran against real bank sandboxes, and in a few places against production environments with designated test accounts. Those tests are genuinely valuable: they’re the only thing that catches real integration problems and silent API or behavior drift on the bank’s side — the contract changing under you without a changelog. I wouldn’t give that signal up.

But as your primary test suite, they’re a liability. You don’t control the external service, so when the sandbox is down for maintenance, your build is red and it’s not your fault. You can’t make it return a 500, time out, or respond slowly on demand, so the failure paths — the ones that matter most in a payment system — go untested. And the latency and flakiness leak straight into your wall-clock budget.

So I split the two concerns. The bulk of behavior — happy paths, business rules, and especially failure injection — runs against a simulator I control, on every PR. A small, clearly labeled set of contract tests against the real sandbox runs on a schedule (nightly, or pre-release), gated behind an environment flag so it never blocks a PR. The simulator tells me my service behaves correctly; the scheduled contract tests tell me the real service still matches the assumptions my simulator encodes. When the two disagree, that’s the drift you actually wanted to catch — and it surfaces in an isolated, expected place instead of randomly reddening someone’s unrelated PR.

This exact itch is why I built Mokksy and its LLM-focused layer, AI-Mocks. I wanted a mock server I could assert requests against, script precise responses for, and — crucially — instruct to fail, stall, or stream on command, which is what makes the failure-injection tests later in this post possible at all. A faithful simulator you own is worth more day to day than a real dependency you merely borrow.

Feed the Environment’s ports into the Server

Once the Environment is up, you hold a bag of bound ports. The Server needs them as configuration before it boots: set them as system properties or environment variables, read them through your normal config mechanism, and the app never knows it’s under test.

The koog demo does this in the Server initializer:

kotlin

object Server {
    val port: Int
        get() = (applicationContext as ReactiveWebServerApplicationContext)
            .webServer.port

    private var applicationContext: ApplicationContext

    init {
        System.setProperty(
            "ai.koog.openai.base-url",
            TestEnvironment.mockOpenai.baseUrl(),
        )

        applicationContext = SpringApplication.run(
            Application::class.java,
            "--server.port=0",
            "--spring.profiles.active=test",
        )
    }
    // ...
}

--server.port=0 applies the same ephemeral-port trick to the SUT itself, and webServer.port reads back whatever the OS assigned. The property is set before SpringApplication.run — configuration has to be in place before the context refreshes.

Raw System.setProperty works, but it leaks: anything you set stays set for the rest of the JVM and can poison later tests. Two libraries close that gap:

system-stubs scopes system properties and environment variables to a test or lifecycle, then restores them afterward. No bleed-over between tests.
finchly offers a small, typed helper for reading test configuration and env vars, instead of scattering System.getenv calls across the suite.
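To show what "scoped" means here, the following is a hand-rolled sketch of the idea, not the system-stubs API: set properties for the duration of a block, then put back whatever was there before. The helper name withSystemProperties and the demo.* keys used with it are invented for the example.

```kotlin
// Sets the given properties while `block` runs, then restores the previous
// state, including the "was not set at all" case.
fun <T> withSystemProperties(overrides: Map<String, String>, block: () -> T): T {
    val previous: Map<String, String?> =
        overrides.keys.associateWith { System.getProperty(it) }
    overrides.forEach { (key, value) -> System.setProperty(key, value) }
    try {
        return block()
    } finally {
        previous.forEach { (key, old) ->
            if (old == null) System.clearProperty(key) else System.setProperty(key, old)
        }
    }
}
```

System properties are JVM-global, so a helper like this only isolates code that does not run concurrently with other tests touching the same keys. That is one reason to set the Server's configuration once, at boot, rather than per test.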

For deciding whether a test runs at all, JUnit's built-in conditions and JUnit Pioneer are the tools I reach for; the koog repository pulls Pioneer in. Use @EnabledIfEnvironmentVariable (part of JUnit Jupiter itself) to skip the LLM-hitting tests when no API key is present, Pioneer's @RetryingTest for the genuinely network-bound cases, and environment-driven toggles to run the full matrix on CI but a fast subset locally. Gating is what keeps the local suite honest about its time budget.

Boot the Server once per JVM

Spring Boot takes a few seconds to start. Booting it per test method would blow any budget you set, so the Server is a JVM-wide singleton that boots in a static initializer: it comes up exactly once, the first time a test class touches it, and every later test reuses it.

In Kotlin, object gives you this for free: a lazily-initialized singleton whose init block runs the first time the base test class references it. That’s why TestEnvironment and Server above are objects rather than classes. In Java, a static final field or a JUnit extension with a static-scoped store does the same job.
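The boot-once behavior is easy to see in isolation. In this sketch, FakeServer stands in for the real Server object and BootLog only exists to make the side effect observable; both names and the port value are invented for the illustration.

```kotlin
import java.util.concurrent.CopyOnWriteArrayList

// Records side effects so the boot order is observable.
object BootLog {
    val events = CopyOnWriteArrayList<String>()
}

// Stand-in for the real Server object: the init block is the "boot".
object FakeServer {
    val port: Int

    init {
        BootLog.events.add("boot")   // SpringApplication.run(...) in the real thing
        port = 18080                 // hypothetical value for the sketch
    }
}

// Stand-in for a test class touching the shared server.
fun portSeenByTest(): Int = FakeServer.port
```

Nothing is logged until the first call to `portSeenByTest()`, and any number of later calls leave exactly one "boot" entry in the log. The JVM's class-initialization rules also make that first access thread-safe, which matters once tests run in parallel.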

Running the SUT in the same JVM as the test buys a quietly enormous benefit: you can debug the whole stack by running an integration test in debug mode. Set a breakpoint in the test, set another deep inside a controller or service, hit debug, and both stop. No remote-debug agent, no attaching to a separate process, no port juggling — the test drives a real request through real application code, and you step straight through it. This single property has saved me more time than any other part of the setup.

“Booted” and “ready to serve traffic” are not the same thing, though. The context can be up while the HTTP listener, the connection pool, or a Kafka consumer is still warming. So I never Thread.sleep and hope — I probe a real endpoint with Awaitility until it answers correctly:

kotlin

fun awaitServerIsRunning() {
    val chatClient = ChatClient(port)
    await
        .ignoreExceptions()
        .until {
            runBlocking { chatClient.version() == "1.0" }
        }
}

ignoreExceptions() is doing real work here: during startup the endpoint throws connection-refused, which is expected rather than a failure. Awaitility swallows those and keeps polling until the version endpoint returns the value that means “fully wired.” This is your readiness check, and it has the same shape as a Kubernetes readiness probe — a useful property, since you’re exercising the signal your orchestrator will rely on in production.

Wrap the SUT in a test client that reads like the domain

Once the Server answers, wrap the raw HTTP client in a small test-client abstraction — a DSL that speaks the language of the feature, not of HTTP. Tests should talk about sending a message and getting an answer, not about content-type headers and status codes.

Here’s the chat client from the koog repository, trimmed to essentials:

kotlin

class ChatClient(val port: Int) : ChatSession {
    private val client = HttpClient {
        install(ContentNegotiation) { json() }
    }

    suspend fun sendMessage(
        message: String,
        requestId: String? = "REQ_${Uuid.random().toHexString()}",
        expectedStatusCode: HttpStatusCode = HttpStatusCode.OK,
    ): Answer {
        val response = client.post("http://localhost:$port/api/chat") {
            contentType(ContentType.Application.Json)
            setBody(ChatRequest(chatRequestId = requestId, message = message))
        }
        response.status shouldBe expectedStatusCode
        val answer = response.body<Answer>()
        answer.chatRequestId shouldBe requestId   // correlation check, always
        return answer
    }
}

Two details carry weight. First, the default requestId is a fresh UUID per call — the reason for that comes up shortly. Second, the client asserts the response echoes back the request ID it sent. That correlation check lives in the client, so every test gets it for free and no test can accept another’s answer by accident.

I usually extract a ChatSession interface so the same tests run against both REST and WebSocket transports. The WebSocket client implements the same sendMessage / sendMessageStreaming contract, and the tests barely change.

Let the base class assert readiness

A small abstract base class holds the shared wiring and the per-test guardrails:

kotlin

abstract class AbstractIntegrationTest {
    protected val mockOpenai = TestEnvironment.mockOpenai
    protected val server = Server
    protected val chatClient = ChatClient(server.port)

    @BeforeEach
    fun awaitServer() {
        server.awaitServerIsRunning()
    }

    @AfterEach
    fun afterEach() {
        mockOpenai.verifyNoUnmatchedRequests()
    }
}

@BeforeEach re-asserts readiness — cheap once the server is up, and a loud failure if a previous test left things in a bad state. @AfterEach verifies the LLM mock saw no unexpected calls, which catches a whole class of bugs where the app makes a request you never anticipated. Keep this class small; it’s infrastructure, not a place for test logic.

Write tests that fit on one screen

A test you can’t take in at a glance is a test you can’t trust when it goes red at 5pm. I optimize hard for readability: each test sets up its mocks, does one thing, and asserts on the result. If it spills past one screen, it’s doing too much. With JUnit 6 the suspend test methods stay flat — no nesting inside a runTest { } block.

Happy path — script the success case and assert the answer:

kotlin

class AiChatPositiveTest : AbstractIntegrationTest() {
    @Test
    suspend fun `Should answer a Question`() {
        val seed = nextInt()
        val question = "To be or not to be, $seed?"
        val expectedAnswer = "It's a good question: $question"

        mockOpenai.moderation { inputContains(question) } responds { flagged = false }
        mockOpenai.completion {
            userMessageContains(question)
        } respondsStream { responseFlow = flowOf(expectedAnswer) }

        val response = chatClient.sendMessage(question)

        response.message.trim() shouldBe expectedAnswer
    }
}

Negative path — the interesting failures are usually business rules, not crashes. Here moderation flags the input and the app must refuse gracefully:

kotlin

mockOpenai.moderation { inputContains(question) } responds {
    flagged = true
    category(ModerationCategory.VIOLENCE, 0.9)
}

val response = chatClient.sendMessage(question)
response.message.trim() shouldBe "Forgive me, but your message defies our guidelines."

Dependency failures — your service should survive every dependency returning every error code. That’s what parameterized tests are for, and it’s where the LLM simulator earns its keep: you can’t easily make the real OpenAI API return a 418.

kotlin

@ParameterizedTest
@ValueSource(ints = [400, 401, 403, 404, 418, 500, 503])
suspend fun `Should handle LLM request failure`(errorStatusCode: Int) {
    val question = "To be or not to be, ${nextInt()}?"

    mockOpenai.moderation { inputContains(question) } responds { flagged = false }
    mockOpenai.completion {
        userMessageContains(question)
    } respondsError { httpStatusCode = errorStatusCode }

    val response = chatClient.sendMessage(question, expectedStatusCode = HttpStatusCode.OK)
    response.message shouldBe "Alas, I cannot help thee now."  // graceful degradation
}

One parameterized test, seven failure modes, and a clear contract: a broken upstream never reaches the user as a 500.

Flow tests — for streaming or multi-step interactions, assert on the sequence and its timing. The WebSocket test scripts a delay between chunks and checks the response actually streamed rather than arriving in one lump:

kotlin

mockOpenai.completion { userMessageContains(question) } respondsStream {
    responseFlow = expectedTokens.asFlow().onEach { delay(500.milliseconds) }
}

val (tokens, duration) = measureTimedValue {
    wsClient.sendMessageStreaming(question).map { it.message }.toList()
}

tokens shouldBe expectedTokens
duration shouldBeGreaterThanOrEqualTo (500.milliseconds * tokens.size)

Don’t seed the database for anything your API can create

A tempting shortcut is to insert rows straight into the database in @BeforeEach so the data is simply there. Resist it. Anything your application can create through its API should be created through its API. A row inserted behind the app’s back skips validation, skips events, and skips the exact code path a real client hits — so the test passes while the create endpoint quietly rots. Build the fixture by calling POST /things, and you exercise the creation path for free, every time.

The exception is data that is genuinely outside your service’s scope. You’re testing your service, not the identity provider, so a well-known test user, a fixed API key, or a seeded tenant that your auth layer expects to exist is fair to provision directly, or through the dependency’s own setup. The line is ownership: if your service owns the lifecycle of that data, create it through your service; if it merely consumes data another system owns, stub or seed it and move on.

Design every test for parallel execution

This is the hinge of the whole approach. Unit tests are cheap, so you can have thousands of independent ones. Integration tests are expensive, because the Environment takes seconds to come up — so the suite must run in parallel to stay inside budget. JUnit makes this a configuration flag:

properties

junit.jupiter.execution.parallel.enabled=true
junit.jupiter.execution.parallel.config.strategy=dynamic
junit.jupiter.execution.parallel.mode.default=concurrent

The moment tests run concurrently against a shared, long-lived SUT, they will interfere with each other. That’s not a risk to mitigate; it’s a property to design around. Two rules make it work.

Rule 1 — every test uses unique data, and verifies that uniqueness in the response. The seed = nextInt() baked into every question above isn’t decoration; it guarantees test A never matches test B’s mock or reads test B’s answer. The request-ID correlation check in ChatClient is the other half: a test only accepts a response carrying its own ID. Unique in, verified unique out.

Rule 2 — never assume the size or contents of a shared collection. If twenty tests create records concurrently, list().size shouldBe 1 is a guaranteed flake. Assert that your record is present and your deleted record is absent — never the total count.

This reshapes the CRUD lifecycle test. You don’t assert on global state; you trace your own entity through it:

List → your ID is not present (don’t assert the list is empty).
Create → 201/202.
Get + List → busy-wait until your ID appears.
Delete → 202.
Get → busy-wait until 404.
List → your ID is gone.
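The two rules and the lifecycle above can be sketched without any real service. ThingStore below is a hypothetical in-memory stand-in for the SUT's API, shared by every "test"; because it is synchronous, the busy-wait steps collapse into plain checks, which would be polls against a real, asynchronous service.

```kotlin
import java.util.UUID
import java.util.concurrent.Callable
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Hypothetical stand-in for the SUT's API; all tests share one instance.
class ThingStore {
    private val things: MutableMap<String, String> = ConcurrentHashMap()
    fun create(id: String, name: String) { things[id] = name }
    fun get(id: String): String? = things[id]
    fun list(): Set<String> = things.keys.toSet()
    fun delete(id: String) { things.remove(id) }
}

// One lifecycle "test": it only ever makes claims about its own ID.
fun lifecycle(store: ThingStore) {
    val id = "thing-${UUID.randomUUID()}"   // Rule 1: unique data
    check(id !in store.list())              // not "the list is empty"
    store.create(id, "mine")
    check(store.get(id) == "mine")
    check(id in store.list())               // Rule 2: presence, never size
    store.delete(id)
    check(store.get(id) == null)
    check(id !in store.list())
}

// Runs many lifecycles concurrently and returns how many of them failed.
fun runConcurrently(store: ThingStore, tests: Int): Int {
    val pool = Executors.newFixedThreadPool(8)
    val results = (1..tests).map {
        pool.submit(Callable { runCatching { lifecycle(store) }.isSuccess })
    }
    pool.shutdown()
    pool.awaitTermination(30, TimeUnit.SECONDS)
    return results.count { !it.get() }
}
```

Every lifecycle passes no matter how the threads interleave, because none of them asserts on anything it does not own. Swap any presence check for a size check and the result depends on timing.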

Embrace eventual consistency instead of fighting it

Real systems are asynchronous. A create returns 202 Accepted and the write propagates afterward; an event fires and a projection updates a beat later. A test that does create() then immediately get() expecting 200 is testing a race — and it will lose that race on a loaded CI box.

So I build eventual consistency into the tests. Every read-after-write becomes a poll rather than a single assertion, with Awaitility handling the busy-wait under a sane timeout:

kotlin

await.atMost(5.seconds.toJavaDuration()).untilAsserted {
    runBlocking {   // the HTTP calls suspend; untilAsserted takes a plain lambda
        val found = client.get(id)
        found.status shouldBe HttpStatusCode.OK
        found.body<Record>().requestId shouldBe myRequestId
    }
}

This is slower per step than an instant assertion, and that’s fine: it’s correct, and it mirrors how clients actually consume your API. The time budget comes from parallelism, not from skipping the wait.
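For a feel of the mechanics without Awaitility or a real service, here is a self-contained sketch. LaggyStore is a hypothetical store whose writes become visible only after a delay, and eventually is a hand-rolled stand-in for untilAsserted, not Awaitility's implementation.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import kotlin.time.Duration
import kotlin.time.Duration.Companion.milliseconds
import kotlin.time.Duration.Companion.seconds

// Hypothetical async store: create() returns at once ("202 Accepted"),
// and the write becomes visible after `lag`.
class LaggyStore(private val lag: Duration) {
    private val visible = ConcurrentHashMap<String, String>()
    private val writer = Executors.newSingleThreadScheduledExecutor { r ->
        Thread(r, "laggy-writer").apply { isDaemon = true }
    }

    fun create(id: String, value: String) {
        writer.schedule(Runnable { visible[id] = value }, lag.inWholeMilliseconds, TimeUnit.MILLISECONDS)
    }

    fun get(id: String): String? = visible[id]
}

// Re-runs `assertion` until it stops throwing or the timeout expires,
// then rethrows the last failure.
fun eventually(
    timeout: Duration = 2.seconds,
    interval: Duration = 20.milliseconds,
    assertion: () -> Unit,
) {
    val deadline = System.nanoTime() + timeout.inWholeNanoseconds
    while (true) {
        try {
            assertion()
            return
        } catch (e: Throwable) {   // assertion failures of any flavor
            if (System.nanoTime() >= deadline) throw e
            Thread.sleep(interval.inWholeMilliseconds)
        }
    }
}
```

With a lag of a few hundred milliseconds, a `get` issued right after `create` would most likely return null, while `eventually { check(store.get(id) == "hello") }` passes as soon as the write lands. The timeout bounds how long a genuinely broken write can stall the suite.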

For messaging: drain into memory, search by predicate

Messaging tests need one crucial twist over HTTP. The naive shape — “poll the topic, expect to see my event” — breaks under parallelism. If test A’s poll() happens to pull test B’s message, that message is consumed and gone; test B then waits forever for an event it will never see. The broker’s at-least-once guarantee can’t help you when your own test code drops the message on the floor.

The fix: drain the topic continuously into an in-memory buffer, and let each test search that buffer by predicate. Start one consumer per topic when the Environment comes up, run it on a background thread, and append every message to a lock-free concurrent queue. Tests then query the buffer, not the broker:

kotlin

class CapturingConsumer<T>(topic: String, bootstrap: String, parse: (String) -> T) {
    // Lock-free, O(1) appends, weakly-consistent iteration that's safe to
    // scan while the background thread is still writing.
    private val messages = ConcurrentLinkedQueue<T>()

    init {
        thread(isDaemon = true, name = "test-consumer-$topic") {
            val consumer = KafkaConsumer<String, String>(/* ... */).apply {
                subscribe(listOf(topic))
            }
            while (!Thread.interrupted()) {
                consumer.poll(Duration.ofMillis(200))
                    .forEach { messages.add(parse(it.value())) }
            }
        }
    }

    // Awaitility takes java.time.Duration, hence the conversion.
    fun awaitMessage(predicate: (T) -> Boolean): T =
        await.atMost(10.seconds.toJavaDuration()).until(
            { messages.firstOrNull(predicate) },
            notNullValue(),
        )!!
}

The test stays small and obvious:

kotlin

chatClient.placeOrder(orderId = myId)

val event = orderEvents.awaitMessage { it.orderId == myId }
event.status shouldBe "PLACED"

Three properties fall out for free: the broker stays drained, so nothing backs up; no message is lost, because the consumer never stops reading; and parallel tests don’t fight over poll() calls, because they all scan the same shared buffer and filter for their own request ID.
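The same idea can be tried without a broker. In this sketch a LinkedBlockingQueue stands in for the topic, since taking from it is destructive in the same way a consumer's poll() is; CapturingBuffer and OrderEvent are names invented for the illustration.

```kotlin
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.TimeUnit
import kotlin.concurrent.thread

// Hypothetical event type for the sketch.
data class OrderEvent(val orderId: String, val status: String)

// `source` stands in for the broker: taking from it removes the message.
class CapturingBuffer<T : Any>(private val source: LinkedBlockingQueue<T>) {
    private val messages = ConcurrentLinkedQueue<T>()

    init {
        // Single background drainer: the only code that ever takes from the source.
        thread(isDaemon = true, name = "test-drainer") {
            while (true) {
                source.poll(50, TimeUnit.MILLISECONDS)?.let { messages.add(it) }
            }
        }
    }

    // Non-destructive search: messages meant for other tests stay in the buffer.
    fun awaitMessage(timeoutMillis: Long = 2_000, predicate: (T) -> Boolean): T {
        val deadline = System.currentTimeMillis() + timeoutMillis
        while (true) {
            messages.firstOrNull(predicate)?.let { return it }
            check(System.currentTimeMillis() < deadline) {
                "no matching message within $timeoutMillis ms"
            }
            Thread.sleep(10)
        }
    }
}
```

Two tests can wait on the same buffer in any order and each still finds its own event, because finding a message never removes it. The trade-off is that the buffer grows for the lifetime of the JVM, which is usually acceptable for a test run.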

When the SUT consumes a topic rather than producing one, flip the pattern: publish a test event into the topic, then poll the SUT’s API until the side effect appears. Either way you assert on what actually crossed the broker — not on an in-process publisher capture that proves only that your code called publish().

When you can’t boot the app in-process: run it in a container

The in-process Server is my default, largely for that debugging benefit. But sometimes it isn’t an option — the app isn’t a JVM process you can call SpringApplication.run on, or you specifically want to test the Docker image you’re about to ship, not just the code inside it. The architecture survives the switch almost untouched: keep the Environment, keep the test client, keep the unique-data discipline, and change only how the SUT comes up.

This is exactly how Mokksy verifies its own published image. An abstract base class holds every behavioral test and exposes a single getBaseUrl():

java

@TestInstance(TestInstance.Lifecycle.PER_CLASS)
public abstract class AbstractFileConfigIT {
    protected abstract String getBaseUrl();

    @Test
    void post_withBodyMatch_returnsConfiguredStatusAndHeaders() throws Exception {
        var response = post("/things", "{\"id\":\"42\"}");
        assertThat(response.statusCode()).isEqualTo(201);
        assertThat(response.headers().firstValue("Location")).hasValue("/things/42");
    }
    // ...the rest of the contract tests
}

One subclass runs the server in-process. Another, DockerJavaIT, runs the actual built image via Testcontainers and overrides nothing but the base URL:

java

@TestInstance(TestInstance.Lifecycle.PER_CLASS)
class DockerJavaIT extends AbstractFileConfigIT {

    private final GenericContainer<?> container =
        new GenericContainer<>(DockerImageName.parse("mokksy/server-jvm:snapshot"))
            .withImagePullPolicy(imageName -> false)          // use the locally built image
            .withEnv("MOKKSY_CONFIG", "/config/it-stubs.yaml")
            .withCopyFileToContainer(
                MountableFile.forClasspathResource("/it-stubs.yaml"),
                "/config/it-stubs.yaml")
            .withExposedPorts(8080)
            .waitingFor(Wait.forLogMessage(".*Responding at.*", 1)) // readiness, log-based
            .withStartupTimeout(Duration.ofSeconds(10));

    @BeforeAll  void beforeAll() { container.start(); }
    @AfterAll   void afterAll()  { container.stop();  }

    @Override
    protected String getBaseUrl() {
        return "http://" + container.getHost() + ":" + container.getFirstMappedPort();
    }
}

Same tests, two runtimes. The in-process variant gives fast feedback and breakpoints; the Docker variant proves the image, the entrypoint, the config-file wiring, and the container’s own readiness signal all work. Note the readiness check is still explicit — Wait.forLogMessage here instead of an HTTP probe, but the same principle: don’t proceed until the SUT says it’s ready.

The pattern isn’t even JVM-specific. The same shape — an Environment of containerized dependencies, a SUT brought up once, a thin client speaking the domain, unique data per test, polling for eventual consistency — translates cleanly to a Node service tested with Vitest or Jest, or to anything else. The one thing you may give up off-JVM is single-process debugging across test and SUT; stepping a debugger across that boundary is a real convenience on the JVM and less certain elsewhere. The discipline carries over regardless.

The metric that governs the design: wall-clock time

Everything above serves two numbers I treat as hard limits:

At most 3 to 5 minutes to run the full suite on a developer’s laptop.
Under 10 minutes on CI.

Cross those thresholds and people stop running tests locally and stop opening small PRs, because the feedback loop hurts. The whole architecture — separate module, boot-once Server, simulated dependencies over real ones, Redpanda over Kafka, aggressive parallelism — exists to defend those numbers. When the suite creeps toward the limit, the fix is almost always more parallelism or a faster simulator, rarely fewer tests or longer sleeps.

That’s the loop I keep returning to: a real booted app, real boundaries, simulated dependencies, unique data, eventual-consistency-aware assertions, and a clock I refuse to blow past. I’ve built this setup — and introduced it to teams — across a wide spread of domains: Forex high-frequency trading platforms, payment gateways, mobile payment providers, and high-scale communication providers such as Twilio. Different stacks, different latency and consistency demands; the pattern held up in every one of them.

I’d be glad to hear how the Kotlin and JVM community handles the parts I still find awkward — particularly making concurrent mode reliable for every class rather than falling back to same_thread, and keeping LLM simulators faithful as provider APIs drift. If you have a sharper approach, tell me.