
Building modern chat experiences with Microsoft Agent Framework and OpenAI ChatKit

14 Dec 2025

If you've built an agent with Microsoft Agent Framework, you know it takes less than ten lines to get a working prototype. What takes longer is wrapping that agent in a production-quality chat interface: one with real-time streaming, interactive widgets, file uploads and a polished feel. That's exactly the gap OpenAI ChatKit fills.

ChatKit is a batteries-included framework for chat UIs. It handles the tricky parts: streaming tokens as they arrive, rendering rich widgets, managing attachments and preserving conversation state. Pair it with Agent Framework and you get the best of both worlds: structured agent orchestration on the backend and a refined user experience on the frontend.

I've been tinkering with a sample called SwiftRover that brings these two together. It's a travel assistant with three distinct capabilities: real-time flight tracking, parking sign analysis using GPT-5.1 vision, and expense analysis using the o3 reasoning model. What makes it interesting isn't any single feature but how the integration patterns work across all of them. Let me walk you through what I've learned.

What is OpenAI ChatKit?

ChatKit is OpenAI's answer to the "demo to production" gap for chat interfaces. Instead of building your own streaming logic, message handling, file uploads and widget rendering, you get a drop-in solution.

On the frontend, it's a React component:

import { ChatKit, useChatKit } from "@openai/chatkit-react";

export default function App() {
  const chatkit = useChatKit({
    api: {
      url: "/chatkit",
      uploadStrategy: { type: "two_phase" },
    },
    composer: {
      attachments: {
        enabled: true,
        accept: { "image/*": [".png", ".jpg", ".jpeg"] },
      },
    },
  });

  return <ChatKit control={chatkit.control} />;
}

On the backend, you implement a ChatKitServer that handles messages and streams responses. The server emits events like ThreadItemAddedEvent, ThreadItemUpdatedEvent and ThreadItemDoneEvent. ChatKit consumes these to render streaming text, widgets and progress indicators.

The widget system is where it gets interesting. You can render cards, buttons, images and custom layouts, all streamed alongside text. This is how SwiftRover shows flight status cards and parking analysis results instead of plain text.

ChatKit handles the plumbing. You focus on what your agent actually does.

The Integration Challenge

Here's the problem: Agent Framework speaks one language (ChatMessage, AgentRunResponseUpdate) and ChatKit speaks another (ThreadItem, ThreadStreamEvent). Someone has to translate.

That's what the agent-framework-chatkit package does. It provides two key helpers:

  1. ThreadItemConverter: Converts ChatKit thread items to Agent Framework messages (and handles attachments)
  2. stream_agent_response(): Converts Agent Framework streaming output to ChatKit events

The flow looks like this:

User Message → ChatKit → ThreadItemConverter → ChatAgent.run_stream()
    → stream_agent_response() → ChatKit Events → UI
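That flow can be mimicked end to end with a stdlib-only toy. Every name below is an illustrative stand-in for the real Agent Framework and ChatKit APIs, just to show the shape of the translation:

```python
import asyncio

async def run_stream(messages):
    # Stand-in for ChatAgent.run_stream(): yields incremental updates.
    for token in ["Hello", ", ", "world"]:
        yield {"text": token}

async def stream_agent_response(agent_stream, thread_id):
    # Stand-in for the bridge: wraps each agent update in a
    # ChatKit-style event envelope.
    async for update in agent_stream:
        yield {
            "type": "thread.item_updated",
            "thread_id": thread_id,
            "delta": update["text"],
        }

async def main():
    collected = []
    async for event in stream_agent_response(run_stream([]), "thread_1"):
        collected.append(event)
    return collected

events = asyncio.run(main())
print("".join(e["delta"] for e in events))  # Hello, world
```

The real bridge does far more (tool calls, attachments, widgets), but the core job is exactly this: re-wrap one stream as another.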

Here's the bridge in action:

from agent_framework_chatkit import ThreadItemConverter, stream_agent_response

class MyChatKitServer(ChatKitServer):
    def __init__(self, ...):
        self.agent = ChatAgent(
            chat_client=AzureOpenAIChatClient(...),
            instructions="You are a helpful assistant.",
            tools=[get_flight_status, analyse_parking, ...],
        )
        self.converter = ThreadItemConverter(
            attachment_data_fetcher=self._fetch_attachment_data,
        )

    async def respond(self, thread, input_user_message, context):
        # Convert ChatKit messages to Agent Framework format
        thread_items = await self.store.load_thread_items(thread.id, ...)
        agent_messages = await self.converter.to_agent_input(thread_items)

        # Run the agent
        agent_stream = self.agent.run_stream(agent_messages)

        # Stream back as ChatKit events
        async for event in stream_agent_response(agent_stream, thread.id):
            yield event

The pattern is consistent across all three features in SwiftRover. The agent does its thing, the bridge translates, and ChatKit renders.

Architecture at a Glance

Three Features, One Pattern

1. Flight Tracking (Agent + Tool + Widget)

The flight tracker uses the standard agent-with-tools pattern. A tool fetches data from the aviationstack API, and the result triggers a custom widget.

async def get_flight_status(
    flight_iata: Annotated[str | None, Field(description="Flight IATA code")] = None,
    dep_iata: Annotated[str | None, Field(description="Departure airport")] = None,
) -> str:
    result = await fetch_flight_status(flight_iata, dep_iata)

    if isinstance(result, str):
        return result  # Error message

    # Wrap in marker class for widget detection
    return FlightResponse(summary_text, result)

The trick is the FlightResponse wrapper. In the respond() method, I intercept tool results to detect when to render a widget:

flight_data = None

async def intercept_stream():
    nonlocal flight_data  # defined in the enclosing respond() method
    async for update in agent_stream:
        if update.contents:
            for content in update.contents:
                if isinstance(content, FunctionResultContent):
                    if isinstance(content.result, FlightResponse):
                        flight_data = content.result.data
        yield update

# After streaming completes, render the widget
if flight_data:
    widget = render_flight_widget(flight_data)
    async for event in stream_widget(thread.id, widget):
        yield event

The widget itself is built from ChatKit primitives:

def render_flight_widget(data: FlightStatusData) -> WidgetRoot:
    return Card(children=[
        Row(children=[
            Image(src=airplane_icon, width=48, height=48),
            Col(children=[
                Title(value=f"Flight {data.flight_iata}"),
                Text(value=data.airline_name),
            ])
        ]),
        Box(background=status_color, children=[
            Text(value=data.flight_status.upper())
        ]),
        # ... departure/arrival details
    ])

2. Parking Sign Analysis (Vision + Direct Handler)

For image uploads, I bypass the agent entirely. When an image attachment arrives, it's routed directly to GPT-5.1 vision. This is just one more way of showing how the integration can work:

async def respond(self, thread, input_user_message, context):
    # Check for image attachments
    if input_user_message.attachments:
        for attachment in input_user_message.attachments:
            if attachment.type == "image":
                parking_image = await self._fetch_attachment_data(attachment.id)

                # Direct vision call, skip the agent
                result = await analyse_parking_sign(parking_image, content_type)

                widget = render_parking_widget(result)
                async for event in stream_widget(thread.id, widget):
                    yield event
                return

    # No image, continue with agent...

The vision analysis returns structured data that maps cleanly to a verdict widget:

@dataclass
class ParkingAnalysisData:
    can_park: bool
    verdict: str
    confidence: str  # high, medium, low
    restrictions: list[ParkingRestriction]
    advice: str

This is the same parking analysis pattern from ParkingGPT, but now integrated into a chat interface rather than a standalone mobile app.
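As a rough illustration of why the structured result maps so cleanly to a widget, here's a stdlib sketch that derives banner styling straight from the dataclass. The field names follow the post; the colour mapping itself is my own assumption:

```python
from dataclasses import dataclass, field

@dataclass
class ParkingRestriction:
    description: str

@dataclass
class ParkingAnalysisData:
    can_park: bool
    verdict: str
    confidence: str  # "high", "medium" or "low"
    restrictions: list[ParkingRestriction] = field(default_factory=list)
    advice: str = ""

def verdict_colour(data: ParkingAnalysisData) -> str:
    # Hypothetical mapping: dim the banner when the model is unsure,
    # otherwise green for "yes you can park", red for "no".
    if data.confidence == "low":
        return "amber"
    return "green" if data.can_park else "red"

result = ParkingAnalysisData(can_park=True, verdict="You can park here", confidence="high")
print(verdict_colour(result))  # green
```

Because the vision call returns typed fields rather than free text, the widget layer never has to parse prose to decide what to render.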

3. Expense Analysis (o3 Reasoning + Workflow)

The expense feature demonstrates ChatKit's workflow visualisation. While the o3 reasoning model works, users see "Thought for X seconds" along with streaming reasoning summaries.

if intent == QueryIntent.EXPENSE:
    # Create workflow item for thinking indicator
    thought_task = ThoughtTask(type="thought", title="Thinking...", content="")
    workflow_item = WorkflowItem(
        id=workflow_id,
        thread_id=thread.id,
        workflow=Workflow(
            type="reasoning",
            tasks=[thought_task],
            expanded=True,
        ),
    )
    yield ThreadItemAddedEvent(item=workflow_item)

    # Stream reasoning from o3
    async for event_type, data in analyse_expenses_streaming():
        if event_type == "reasoning_delta":
            thought_task.content += data
            yield ThreadItemUpdatedEvent(
                item_id=workflow_id,
                update=WorkflowTaskUpdated(task=thought_task, task_index=0),
            )

    # Finalize with timing
    final_workflow_item = WorkflowItem(
        id=workflow_id,
        thread_id=thread.id,
        workflow=Workflow(
            tasks=[ThoughtTask(title=f"Thought for {seconds}s", content=reasoning)],
            summary=CustomSummary(title=f"Thought for {seconds}s", icon="sparkle"),
            expanded=False,
        ),
    )
    yield ThreadItemDoneEvent(item=final_workflow_item)

The o3 Responses API provides reasoning_summary events that stream a summary of the model's reasoning, perfect for showing users that something meaningful is happening during those thinking seconds.

Intent Routing

With three distinct capabilities, routing matters. SwiftRover uses a fast model for intent classification:

async def classify_intent(user_message: str, has_image: bool) -> str:
    if has_image:
        return QueryIntent.PARKING  # Images go straight to vision

    # Fast classification with gpt-4.1-mini
    response = await client.responses.create(
        model="gpt-4.1-mini",
        input=[
            {"role": "system", "content": "Classify into: flight, parking, expense, general"},
            {"role": "user", "content": user_message}
        ],
        max_output_tokens=20,
    )

    return map_to_intent(response.output_text.strip())

This reserves the expensive reasoning models for when they're actually needed.
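The map_to_intent helper isn't shown above; a minimal version (my sketch, not the sample's actual code) just normalises the classifier's raw text and falls back to general when the model returns something unexpected:

```python
from enum import Enum

class QueryIntent(str, Enum):
    FLIGHT = "flight"
    PARKING = "parking"
    EXPENSE = "expense"
    GENERAL = "general"

def map_to_intent(label: str) -> QueryIntent:
    # Strip whitespace/casing from the model's output; anything
    # outside the known set routes to the general agent path.
    cleaned = label.strip().lower()
    try:
        return QueryIntent(cleaned)
    except ValueError:
        return QueryIntent.GENERAL

print(map_to_intent(" Flight\n"))  # QueryIntent.FLIGHT
```

The fallback matters: a classifier restricted to 20 output tokens will occasionally produce something off-script, and defaulting to the general path is the safe failure mode.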

Some important notes

Widget vs Text: Render widgets when the data is structured and benefits from visual hierarchy. Flight status, parking verdicts, and expense summaries all qualify. General chat stays as text.

Marker Classes: Wrap tool results in custom string subclasses (FlightResponse, ShowAirportSelector) to signal widget rendering. It's a clean pattern that keeps tool functions pure.
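A minimal version of that marker-class trick (the names match the post; the implementation is my sketch):

```python
class FlightResponse(str):
    """Marker class: behaves as a plain string for the tool contract,
    but carries structured data along for widget rendering."""

    def __new__(cls, summary: str, data: dict):
        obj = super().__new__(cls, summary)
        obj.data = data
        return obj

result = FlightResponse("BA117 is en route", {"flight_iata": "BA117"})
print(isinstance(result, str), result.data["flight_iata"])  # True BA117
```

Because the subclass is still a real str, anything downstream that expects a plain tool result keeps working; only the interceptor looks for the extra payload.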

Two-Phase Upload: ChatKit's upload strategy creates an attachment, returns an upload URL, then the client POSTs the actual bytes. Your AttachmentStore needs to handle both phases.
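Here's what those two phases might look like in an in-memory store. This is illustrative only; ChatKit's real AttachmentStore interface differs:

```python
import uuid

class InMemoryAttachmentStore:
    """Toy two-phase store: phase 1 registers metadata and returns an
    upload URL; phase 2 receives the actual bytes for that attachment."""

    def __init__(self):
        self._meta = {}
        self._blobs = {}

    def create_attachment(self, filename: str, mime_type: str) -> dict:
        # Phase 1: create the attachment record; no bytes exist yet.
        attachment_id = str(uuid.uuid4())
        self._meta[attachment_id] = {"filename": filename, "mime_type": mime_type}
        return {"id": attachment_id, "upload_url": f"/uploads/{attachment_id}"}

    def receive_upload(self, attachment_id: str, data: bytes) -> None:
        # Phase 2: the client POSTs the bytes to the upload URL.
        if attachment_id not in self._meta:
            raise KeyError("unknown attachment")
        self._blobs[attachment_id] = data

    def fetch(self, attachment_id: str) -> bytes:
        return self._blobs[attachment_id]

store = InMemoryAttachmentStore()
created = store.create_attachment("sign.jpg", "image/jpeg")
store.receive_upload(created["id"], b"\xff\xd8...")
print(len(store.fetch(created["id"])))
```

Splitting creation from upload is what lets the UI show an attachment placeholder immediately while the bytes are still in flight.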

Streaming Reasoning: The o3 model's reasoning summaries are gold for UX. Stream them into a WorkflowItem so users see thinking progress, not just a spinner.

Intent First: Classify before routing. A fast model call is cheaper than sending everything through a full agent loop when you can handle it directly.

What About .NET?

Currently, the agent-framework-chatkit package is Python-only. For .NET, you can explore an approach that uses the AG-UI (Agent-User Interaction) protocol instead: a framework-agnostic streaming protocol that provides interoperability across AI frameworks. Different approach, similar goal. I have a sample for that in my repo for you to play with!

The complete sample lives in the Generative AI repo, and a detailed README walks you through running it.

The integration between Agent Framework and ChatKit isn't complicated once you see the pattern. Convert messages, stream responses, render widgets. What makes it powerful is that each piece does its job well. Agent Framework handles orchestration, tools and guardrails. ChatKit handles the UI, streaming and polish. The bridge keeps them talking.

Until next time.