llama.cpp/tools/server/webui/src/stories/ChatMessage.stories.svelte
Pascal 12bbc3fa50 refactor: centralize CoT parsing in backend for streaming mode (#16394)
* refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing

- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages
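The simplified streaming update described above can be sketched as follows. This is a minimal illustration, not the actual webui code: `StreamDelta`, `AssistantMessage`, and `applyDelta` are invented names; only the `content` and `reasoning_content` field names come from the server's SSE deltas.

```typescript
// Hypothetical shapes for illustration; field names content/reasoning_content
// match the server-side SSE deltas, everything else is invented for the sketch.
interface StreamDelta {
	content?: string;
	reasoning_content?: string;
}

interface AssistantMessage {
	content: string;
	thinking: string;
}

// Append a streamed delta directly: no frontend <think> tag parsing is needed,
// because the backend has already split reasoning from regular content.
function applyDelta(msg: AssistantMessage, delta: StreamDelta): AssistantMessage {
	return {
		content: msg.content + (delta.content ?? ''),
		thinking: msg.thinking + (delta.reasoning_content ?? '')
	};
}
```

Because each delta carries at most one of the two fields, the frontend reduces to plain string concatenation, and partial reasoning can be persisted straight from the accumulated message when generation stops.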

* refactor: implement streaming-aware universal reasoning parser

Remove the streaming mode limitation from --reasoning-format by refactoring
try_parse_reasoning() to handle incremental parsing of <think> tags across
all formats.

- Rework try_parse_reasoning() to track whitespace, partial tags, and
  multiple reasoning segments, allowing proper separation of reasoning_content
  and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x
  formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent
  behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments

The parser now continues processing content after </think> closes instead of
stopping, enabling proper message.reasoning_content and message.content
separation in both streaming and non-streaming modes.

Fixes the issue where streaming responses would dump everything (including
post-thinking content) into reasoning_content while leaving content empty.
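The incremental split can be illustrated with the following TypeScript re-creation. The real parser is the C++ `try_parse_reasoning()` in the server's chat parser; the state shape and function names here are invented for the sketch. The key detail is holding back any buffered suffix that could still turn into a tag, so `<think>`/`</think>` split across chunk boundaries are handled correctly.

```typescript
// Illustrative re-creation of the incremental <think> split; names invented.
interface ParseState {
	inThink: boolean;
	pending: string;   // buffered tail that may still become a tag
	reasoning: string; // text inside <think>...</think>
	content: string;   // everything else
}

const OPEN = '<think>';
const CLOSE = '</think>';

function newState(): ParseState {
	return { inThink: false, pending: '', reasoning: '', content: '' };
}

// Feed one streamed chunk. Tags split across chunk boundaries are handled by
// holding back any suffix of `pending` that is a prefix of the expected tag.
function feed(state: ParseState, chunk: string): void {
	state.pending += chunk;
	for (;;) {
		const tag = state.inThink ? CLOSE : OPEN;
		const idx = state.pending.indexOf(tag);
		if (idx >= 0) {
			const before = state.pending.slice(0, idx);
			if (state.inThink) state.reasoning += before;
			else state.content += before;
			state.pending = state.pending.slice(idx + tag.length);
			state.inThink = !state.inThink; // keep parsing after </think>
			continue;
		}
		// No complete tag: hold back a possible partial tag, flush the rest.
		let hold = 0;
		for (let n = Math.min(state.pending.length, tag.length - 1); n > 0; n--) {
			if (state.pending.endsWith(tag.slice(0, n))) {
				hold = n;
				break;
			}
		}
		const cut = state.pending.length - hold;
		if (state.inThink) state.reasoning += state.pending.slice(0, cut);
		else state.content += state.pending.slice(0, cut);
		state.pending = state.pending.slice(cut);
		return;
	}
}
```

Because the loop flips `inThink` and continues after a closing tag instead of returning, post-thinking text lands in `content`, which is exactly the separation the streaming mode was previously missing.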

* refactor: address review feedback from allozaur

- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed)

- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows
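The three bullets above amount to an inject-once scheme, which can be sketched like this. The real logic lives in the C++ chat parser; the class and method names here are invented for illustration.

```typescript
// Illustrative sketch of the forced-prefix restoration; names invented.
class ReasoningAccumulator {
	private forcedPrefix = '';
	reasoningContent = '';

	// Called when thinking_forced_open applies (and again on every new
	// start_think detection): store the exact sequence seen on input.
	captureForcedPrefix(seen: string): void {
		this.forcedPrefix = seen;
	}

	// Inject the captured prefix once, before the first accumulated segment,
	// then clear it so later appends cannot duplicate it.
	appendSegment(segment: string): void {
		this.reasoningContent += this.forcedPrefix + segment;
		this.forcedPrefix = '';
	}
}
```

Clearing the prefix immediately after the first append is what keeps partial/streaming flows from emitting the opening sequence more than once.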

* refactor: address review feedback from ngxson

* debug: say goodbye to curl -N, hello one-click raw stream

- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: add Storybook example for raw LLM output and scope reasoning format toggle per story

- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example

* npm run format

* chat-parser: address review feedback from ngxson

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2025-10-08 23:18:41 +03:00


<script module lang="ts">
import { defineMeta } from '@storybook/addon-svelte-csf';
import ChatMessage from '$lib/components/app/chat/ChatMessages/ChatMessage.svelte';
const { Story } = defineMeta({
	title: 'Components/ChatScreen/ChatMessage',
	component: ChatMessage,
	parameters: {
		layout: 'centered'
	}
});
// Mock messages for different scenarios
const userMessage: DatabaseMessage = {
	id: '1',
	convId: 'conv-1',
	type: 'message',
	timestamp: Date.now() - 1000 * 60 * 5,
	role: 'user',
	content: 'What is the meaning of life, the universe, and everything?',
	parent: '',
	thinking: '',
	children: []
};

const assistantMessage: DatabaseMessage = {
	id: '2',
	convId: 'conv-1',
	type: 'message',
	timestamp: Date.now() - 1000 * 60 * 3,
	role: 'assistant',
	content:
		'The answer to the ultimate question of life, the universe, and everything is **42**.\n\nThis comes from Douglas Adams\' "The Hitchhiker\'s Guide to the Galaxy," where a supercomputer named Deep Thought calculated this answer over 7.5 million years. However, the question itself was never properly formulated, which is why the answer seems meaningless without context.',
	parent: '1',
	thinking: '',
	children: []
};
const assistantWithReasoning: DatabaseMessage = {
	id: '3',
	convId: 'conv-1',
	type: 'message',
	timestamp: Date.now() - 1000 * 60 * 2,
	role: 'assistant',
	content: "Here's the concise answer, now that I've thought it through carefully for you.",
	parent: '1',
	thinking:
		"Let's consider the user's question step by step:\n\n1. Identify the core problem\n2. Evaluate relevant information\n3. Formulate a clear answer\n\nFollowing this process ensures the final response stays focused and accurate.",
	children: []
};
const rawOutputMessage: DatabaseMessage = {
	id: '6',
	convId: 'conv-1',
	type: 'message',
	timestamp: Date.now() - 1000 * 60,
	role: 'assistant',
	content:
		'<|channel|>analysis<|message|>User greeted me. Initiating overcomplicated analysis: Is this a trap? No, just a normal hello. Respond calmly, act like a helpful assistant, and do not start explaining quantum physics again. Confidence 0.73. Engaging socially acceptable greeting protocol...<|end|>Hello there! How can I help you today?',
	parent: '1',
	thinking: '',
	children: []
};
let processingMessage = $state({
	id: '4',
	convId: 'conv-1',
	type: 'message',
	timestamp: 0, // No timestamp = processing
	role: 'assistant',
	content: '',
	parent: '1',
	thinking: '',
	children: []
});

let streamingMessage = $state({
	id: '5',
	convId: 'conv-1',
	type: 'message',
	timestamp: 0, // No timestamp = streaming
	role: 'assistant',
	content: '',
	parent: '1',
	thinking: '',
	children: []
});
</script>
<Story
	name="User"
	args={{
		message: userMessage
	}}
	play={async () => {
		const { updateConfig } = await import('$lib/stores/settings.svelte');
		updateConfig('disableReasoningFormat', false);
	}}
/>

<Story
	name="Assistant"
	args={{
		class: 'max-w-[56rem] w-[calc(100vw-2rem)]',
		message: assistantMessage
	}}
	play={async () => {
		const { updateConfig } = await import('$lib/stores/settings.svelte');
		updateConfig('disableReasoningFormat', false);
	}}
/>

<Story
	name="AssistantWithReasoning"
	args={{
		class: 'max-w-[56rem] w-[calc(100vw-2rem)]',
		message: assistantWithReasoning
	}}
	play={async () => {
		const { updateConfig } = await import('$lib/stores/settings.svelte');
		updateConfig('disableReasoningFormat', false);
	}}
/>

<Story
	name="RawLlmOutput"
	args={{
		class: 'max-w-[56rem] w-[calc(100vw-2rem)]',
		message: rawOutputMessage
	}}
	play={async () => {
		const { updateConfig } = await import('$lib/stores/settings.svelte');
		updateConfig('disableReasoningFormat', true);
	}}
/>
<Story
	name="WithReasoningContent"
	args={{
		message: streamingMessage
	}}
	asChild
	play={async () => {
		const { updateConfig } = await import('$lib/stores/settings.svelte');
		updateConfig('disableReasoningFormat', false);

		// Phase 1: stream reasoning content in random-sized chunks
		const reasoningText =
			'I need to think about this carefully. Let me break down the problem:\n\n1. The user is asking for help with something complex\n2. I should provide a thorough and helpful response\n3. I need to consider multiple approaches\n4. The best solution would be to explain step by step\n\nThis approach will ensure clarity and understanding.';
		let reasoningChunk = '';
		let i = 0;

		while (i < reasoningText.length) {
			const chunkSize = Math.floor(Math.random() * 5) + 3; // Random 3-7 characters
			reasoningChunk += reasoningText.slice(i, i + chunkSize);
			// Update the reactive state directly
			streamingMessage.thinking = reasoningChunk;
			i += chunkSize;
			await new Promise((resolve) => setTimeout(resolve, 50));
		}

		// Phase 2: stream the regular answer content
		const regularText =
			"Based on my analysis, here's the solution:\n\n**Step 1:** First, we need to understand the requirements clearly.\n\n**Step 2:** Then we can implement the solution systematically.\n\n**Step 3:** Finally, we test and validate the results.\n\nThis approach ensures we cover all aspects of the problem effectively.";
		let contentChunk = '';
		i = 0;

		while (i < regularText.length) {
			const chunkSize = Math.floor(Math.random() * 5) + 3; // Random 3-7 characters
			contentChunk += regularText.slice(i, i + chunkSize);
			// Update the reactive state directly
			streamingMessage.content = contentChunk;
			i += chunkSize;
			await new Promise((resolve) => setTimeout(resolve, 50));
		}

		streamingMessage.timestamp = Date.now();
	}}
>
	<div class="w-[56rem]">
		<ChatMessage message={streamingMessage} />
	</div>
</Story>
<Story
	name="Processing"
	args={{
		message: processingMessage
	}}
	play={async () => {
		const { updateConfig } = await import('$lib/stores/settings.svelte');
		updateConfig('disableReasoningFormat', false);

		// Import the chat store to simulate loading state
		const { chatStore } = await import('$lib/stores/chat.svelte');

		// Set loading state to true to trigger the processing UI: this shows the
		// "Generating..." text and parameter details
		chatStore.isLoading = true;
		await new Promise((resolve) => setTimeout(resolve, 100));
	}}
/>