| Kimi K2.6 | Kimi tool-call and reasoning format | 22 / 36 | 21 | 1 | 2 / 36 | 0 | 2 | Current main fails only the multi-step search-and-crawl workflow, in both streaming and non-streaming modes. The model returns no structured tool calls and asks for endpoint clarification instead of executing the workflow. No raw marker leakage was observed in current main. | Dynamo 1.2.0 had 18 parser/API-boundary failures and 3 endpoint timeouts. Model-native tool-call syntax appeared in reasoning instead of structured tool_calls, and some final assistant text was routed away from assistant content. Current main removes those Dynamo failures and leaves two model-workflow failures. |
| DeepSeek V4 Pro | DeepSeek tool-call and reasoning format | 0 / 46 | 0 | 0 | 0 / 46 | 0 | 0 | No failures in the captured current-main run. | No change needed. Dynamo 1.2.0 and current main are both clean. |
| GLM 5.1 | GLM tool-call format | 4 / 48 | 4 | 0 | 3 / 48 | 3 | 0 | Current main still fails delimiter-literal preservation in streaming and non-streaming modes: delimiter-looking text is dropped from or altered in the structured tool-call argument. One non-streaming no-tools request also timed out. | Current main improves from 4 to 3 Dynamo/runtime failures by removing a Dynamo 1.2.0 timeout in the multi-step search-and-crawl workflow. The delimiter-literal preservation issue remains. |
| MiniMax 2.7 | MiniMax tool-call format | 8 / 46 | 2 | 6 | 4 / 46 | 2 | 2 | Current main has four failures. A simple arithmetic auto-tool prompt answers in text instead of producing the requested structured tool call in streaming and non-streaming modes. A delimiter-like literal string prompt returns a structured tool call in both modes, but the marker-looking text inside the argument is not preserved exactly; this is counted as a parser/API-boundary failure. | Current main now uses the full 46-probe coverage and improves from 8 failures to 4. The multi-step tool-loop workflow and context echo auto-tool prompt that failed in Dynamo 1.2.0 now pass. Dynamo/parser-boundary failures remain at 2, while other failures drop from 6 to 2. |
| Gemma 4 31B IT | Gemma tool-call and reasoning format | 2 / 48 | 2 | 0 | 2 / 46 | 2 | 0 | Current main still fails delimiter-literal preservation in streaming and non-streaming modes. The response produces a structured tool call, but the SQL string is truncated before the expected literal marker text. | No improvement in failure count. Dynamo 1.2.0 and current main show the same failure class, although the current-main run covered 46 probes instead of 48. |
| Qwen3.6-35B-A3B | Qwen tool-call format | 1 / 48 | 1 | 0 | 0 / 46 | 0 | 0 | No failures in the captured current-main run. | Current main is clean. The Dynamo 1.2.0 non-streaming timeout in the multi-step search-and-crawl workflow is gone. |
| GPT-OSS 120B | GPT-OSS tool-call format | 14 / 48 | 2 | 12 | 14 / 48 | 2 | 12 | Current main still has 14 failures. Multi-tool and parallel-tool prompts produce only one structured tool call, a simple calculation prompt answers in text instead of calling the tool, a marker-literal string argument omits the requested marker-like text, and the search-and-crawl final answer still misses the expected evidence. No raw model-native marker leakage was observed. | The refreshed GPT-OSS current-main run is no longer worse than Dynamo 1.2.0 by count; both are 14 / 48. The prior main-only required-tool regression is gone, and the streaming multi-step workflow now returns final content instead of an empty assistant message, but the core multi-tool, parallel-tool, literal-marker, and final-answer gaps remain. |