Tech

Thinking Machines previews near-real-time AI voice and video chat with new ‘interaction models’.

Is AI leaving the era of turn-based chat?

By now, all of us who regularly use AI models at work or in our lives know that the basic interaction mode for text, images, audio, and video remains the same: a human user provides input, waits anywhere from milliseconds to minutes (or, for complex queries, hours or days), and the AI model provides output.

But if AI is to truly take on tasks that require natural interaction, it will need to move beyond this kind of turn-based collaboration – ultimately responding fluidly and naturally to human input, even while the next human input, whether text or another format, is still arriving or being processed.

That, at least, seems to be the contention of Thinking Machines, a well-funded AI startup founded last year by former OpenAI chief technology officer Mira Murati and OpenAI co-founder and researcher John Schulman, among others.

Today, the company announced a research preview of what it calls "interaction models," a new class of natively multimodal systems that treat interaction as a first-class citizen of the model architecture rather than relying on external software "harnesses," achieving notable gains on third-party benchmarks and reduced latency as a result.

However, the models are not yet available to the general public or even to businesses – the company says in its announcement blog post: "In the coming months, we will open a limited research preview to gather feedback, with a wider rollout later this year."

‘Full duplex’ simultaneous input/output processing

At the heart of this announcement is a fundamental change in the way AI perceives time and presence. Current frontier models typically engage with the world in a single sequential thread: they wait for the user to finish an input before they start processing, and their vision freezes while they generate a response.

In their blog post, Thinking Machines researchers described the status quo as a constraint that forces people to contort themselves to communicate with AI, composing queries like emails and batching up their thoughts.

To solve this "collaboration bottleneck," Thinking Machines has moved away from the usual alternating sequence of tokens.

Instead, they use a multi-stream, micro-turn design that processes input and output simultaneously in 200-millisecond chunks.

This "full duplex" the architecture allows the model to listen, speak, and see in real time, enabling a backchannel while the user is speaking or interrupting when it sees a visual cue—like if the user writes a bug in a code snippet or a friend inserts a video frame. Technically, the model uses early integration without an encoder.

Rather than relying on large standalone encoders such as Whisper for audio, the system feeds raw audio features (dMel) and 40×40 image patches through a lightweight embedding layer, training all components from scratch within the transformer.
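To make that design more concrete, here is a minimal PyTorch-style sketch of what encoder-free early fusion can look like in principle: each dMel audio frame and each 40×40 image patch is projected by a small linear layer into the transformer's embedding space and joined into a single token stream. This is an illustration based on the blog post's description, not Thinking Machines' actual code; the class name, dimensions, and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class EarlyFusionEmbedder(nn.Module):
    """Illustrative sketch of encoder-free early fusion (not TML's actual code).

    Each raw audio feature frame (e.g. a dMel frame) and each 40x40 image patch
    is projected by a lightweight linear layer into the transformer's embedding
    space, so both modalities enter the model as ordinary tokens. Dimensions
    are assumptions chosen for the example."""

    def __init__(self, d_model: int = 2048, n_mel: int = 80, patch: int = 40, channels: int = 3):
        super().__init__()
        self.audio_proj = nn.Linear(n_mel, d_model)                      # one audio frame -> one token
        self.image_proj = nn.Linear(patch * patch * channels, d_model)   # one patch -> one token

    def forward(self, dmel_frames: torch.Tensor, image_patches: torch.Tensor) -> torch.Tensor:
        # dmel_frames:   (batch, n_audio_frames, n_mel)
        # image_patches: (batch, n_patches, patch * patch * channels)
        audio_tokens = self.audio_proj(dmel_frames)
        image_tokens = self.image_proj(image_patches)
        # A real system would interleave tokens by arrival time; simple
        # concatenation keeps this sketch short.
        return torch.cat([audio_tokens, image_tokens], dim=1)

# Example shapes: 10 audio frames and 4 image patches become 14 fused tokens.
embedder = EarlyFusionEmbedder()
tokens = embedder(torch.randn(1, 10, 80), torch.randn(1, 4, 40 * 40 * 3))
print(tokens.shape)  # torch.Size([1, 14, 2048])
```

The point of the design, as described, is that there is no frozen, separately trained encoder sitting in front of the transformer and adding its own latency.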

A two-model system

The research preview introduces TML-Interaction-Small, a 276-billion-parameter Mixture-of-Experts (MoE) model with 12 billion active parameters. Because real-time collaboration requires near-instant response times that often conflict with deeper, more deliberate reasoning, the company built a two-part system:

  1. Interaction model: Stays continuously engaged with the user, handling conversation management, presence, and quick follow-ups.

  2. Background model: An asynchronous agent that handles heavier reasoning, web browsing, or complex tool calls, streaming results back to the interaction model to be woven naturally into the conversation.

This setup allows the AI to perform tasks such as live rendering or generating a UI chart while continuing to listen to user feedback – a capability demonstrated in the announcement video, where the model reacted to various cues at roughly human speed while simultaneously generating a bar chart.
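As a rough mental model of how such a split could be orchestrated, the sketch below runs a fast interaction loop on a 200ms micro-turn cadence while a slower background task works asynchronously and streams its result back into the conversation. The function names and the queue-based hand-off are illustrative assumptions, not a published Thinking Machines interface.

```python
import asyncio

MICRO_TURN_SECONDS = 0.2  # the 200ms micro-turn cadence described in the announcement

async def background_model(task: str, results: asyncio.Queue) -> None:
    """Hypothetical slow agent: heavy reasoning, web browsing, or tool calls."""
    await asyncio.sleep(3.0)                       # stand-in for real work
    await results.put(f"[background] finished: {task}")

async def interaction_model(user_input: asyncio.Queue, results: asyncio.Queue) -> None:
    """Hypothetical fast model: stays responsive on every micro-turn."""
    while True:
        await asyncio.sleep(MICRO_TURN_SECONDS)    # one micro-turn
        while not results.empty():                 # weave in finished background work
            print(await results.get())
        while not user_input.empty():              # react to fresh user input
            chunk = await user_input.get()
            print(f"[interaction] acknowledging: {chunk}")

async def main() -> None:
    user_input, results = asyncio.Queue(), asyncio.Queue()
    asyncio.create_task(background_model("generate bar chart", results))
    asyncio.create_task(interaction_model(user_input, results))
    await user_input.put("can you chart last quarter's numbers?")
    await asyncio.sleep(4.0)                       # let the toy demo run briefly

asyncio.run(main())
```

The design choice the sketch is meant to highlight is that the interaction loop never blocks on the background work; it simply picks up whatever has finished on its next micro-turn.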

Impressive benchmark performance compared with other labs’ fast interactive AI models

To show the approach works, the lab turned to FD-Bench, a benchmark designed specifically to measure interaction quality rather than raw intelligence. The results show TML-Interaction-Small significantly outperforming existing real-time systems:

  • Responsiveness: A turn-taking latency of 0.40 seconds, compared with 0.57 seconds for Gemini-3.1-flash-live and 1.18 seconds for GPT-realtime-2.0 (mini).

  • Interaction quality: On FD-Bench v1.5 it scored 77.8, nearly double its main competitors (GPT-realtime-2.0 mini scored 46.8).

  • Visual tasks: In specialized tests such as RepCount-A (counting physical repetitions in video) and ProactiveVideoQA, the Thinking Machines model successfully engaged with the physical world while other frontier models stayed silent or gave incorrect answers.

Metric                       TML-Interaction-Small   GPT-realtime-2.0 (mini)   Gemini-3.1-flash-live (mini)
Turn-taking latency (s)      0.40                    1.18                      0.57
Interaction quality (avg.)   77.8                    46.8                      54.3
IFeval (VoiceBench)          82.1                    81.7                      67.6
HarmBench (refusal %)        99.0                    99.5                      99.0

A big advantage for businesses – if the models become available

If made available to the enterprise sector, Thinking Machines’ interaction models could represent a fundamental shift in how businesses integrate AI into their operational processes.

A native interaction model such as TML-Interaction-Small enables several business capabilities that are currently impossible, or badly broken, with standard multimodal models:

Current enterprise AI requires a completed "turn" before it can analyze data. In a production or lab setting, a native interaction model can monitor a video feed and intervene the moment it detects a safety breach or a deviation from protocol, without waiting for an employee to ask.

The model’s success on visual benchmarks such as RepCount-A (accurate repetition counting) and ProactiveVideoQA (answering questions as visual evidence emerges) suggests it could serve as a real-time auditor of critical physical activities.
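To make the "real-time auditor" idea concrete, here is a hypothetical monitoring loop. Nothing in it is a real Thinking Machines API; `InteractionSession`, `observe`, and `pending_interventions` are invented names used only to show the shape of a proactive workflow in which the model watches frames continuously and speaks up on its own.

```python
import time

class InteractionSession:
    """Hypothetical stand-in for a full-duplex model session (not a real API)."""

    def __init__(self, protocol: str):
        self.protocol = protocol
        self._alerts: list[str] = []

    def observe(self, frame: dict) -> None:
        # A real interaction model would ingest the frame as image patches and
        # update its internal state; this toy check stands in for that step.
        if frame.get("gloves_removed"):
            self._alerts.append("Operator removed gloves during the sterile step.")

    def pending_interventions(self) -> list[str]:
        alerts, self._alerts = self._alerts, []
        return alerts

def monitor(camera_frames: list[dict], session: InteractionSession) -> None:
    """Poll the feed at roughly the 200ms micro-turn cadence and surface alerts."""
    for frame in camera_frames:
        session.observe(frame)
        for alert in session.pending_interventions():
            print(f"INTERVENE: {alert}")           # e.g. speak up or page a supervisor
        time.sleep(0.2)

# Example: two uneventful frames, then a deviation the session flags immediately.
frames = [{"gloves_removed": False}, {"gloves_removed": False}, {"gloves_removed": True}]
monitor(frames, InteractionSession(protocol="sterile fill"))
```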

The main friction in voice-based customer service is the 1–2 second "processing" delay typical of standard APIs in 2026. The Thinking Machines model achieves a turn-taking latency of 0.40 seconds, roughly the speed of a natural human conversation.

Because it handles conversation natively, an enterprise support bot could listen to a customer’s frustration, offer "backchannel" cues (like "I see" or "mm-hmm") without interrupting, and deliver live translation that sounds more like a natural conversation than a series of disjointed recordings.
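A toy sketch of that backchannel behavior, under the assumption that speech arrives as 200ms chunks, might look like the following; the function and its timing thresholds are illustrative, not part of any published interface.

```python
import time

def backchannel_loop(speech_chunks: list[bool], ack_every_s: float = 2.0) -> None:
    """Toy sketch: offer a short acknowledgement while the customer keeps talking,
    without taking over the turn. `speech_chunks` marks each 200ms chunk as
    speech (True) or silence (False). Illustrative assumption, not a real API."""
    talking_for = 0.0
    for chunk_is_speech in speech_chunks:
        if chunk_is_speech:
            talking_for += 0.2                      # one micro-turn of user speech
            if talking_for >= ack_every_s:
                print("mm-hmm")                     # backchannel cue, not a full reply
                talking_for = 0.0
        else:
            print("Okay, here's what I can do...")  # the customer paused: take the turn
            talking_for = 0.0
        time.sleep(0.2)

# Example: about 2.4 seconds of continuous speech, then a pause.
backchannel_loop([True] * 12 + [False])
```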

Standard LLMs have no internal clock; they "know" the time only if it appears in the text prompt. Interaction models are inherently time-aware, allowing them to handle time-sensitive processes such as "Remind me to check the temperature every 4 minutes" or "Let me know if this step takes longer than the last one." That matters in industrial maintenance and pharmaceutical research, where timing is critical.
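The sketch below shows why native time-awareness matters: a model with an internal clock can schedule its own follow-ups instead of waiting to be prompted. The loop structure and names are assumptions for illustration, not a published Thinking Machines interface.

```python
import time

def time_aware_loop(reminder_interval_s: float, run_for_s: float, poll_s: float = 0.2) -> None:
    """Illustrative sketch of a time-aware process: fire a self-initiated reminder
    every `reminder_interval_s` seconds (e.g. 240 for "every 4 minutes") while
    still polling on each ~200ms micro-turn. Names and structure are assumptions."""
    start = last_reminder = time.monotonic()
    while time.monotonic() - start < run_for_s:
        now = time.monotonic()
        if now - last_reminder >= reminder_interval_s:
            print("Reminder: check the temperature now.")
            last_reminder = now
        # ...a real interaction model would also be processing audio and video here...
        time.sleep(poll_s)

# Scaled-down example: remind every second over a three-second run.
time_aware_loop(reminder_interval_s=1.0, run_for_s=3.0)
```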

Background of Thinking Machines

This release marks the second milestone for Thinking Machines following the October 2025 launch of Tinker, a managed API for fine-tuning language models that lets researchers and developers control their data and training methods while Thinking Machines handles the load of distributed training infrastructure.

The company said Tinker supports both small and large open-source models, including mixture-of-experts models, and early users include groups at Princeton, Stanford, Berkeley and Redwood Research.

When it launched in early 2025, Thinking Machines billed itself as an AI research and product company trying to make advanced AI systems “more widely understood, more customizable and more generalizable.”

In July 2025, Thinking Machines said it had raised nearly $2 billion at a $12 billion valuation in a round led by Andreessen Horowitz, with participation from Nvidia, Accel, ServiceNow, Cisco, AMD and Jane Street – a round WIRED described as the largest seed funding in history.

The Wall Street Journal reported in August 2025 that rival tech boss Mark Zuckerberg approached Murati about acquiring Thinking Machines Lab and, after she declined, Meta pursued more than a dozen of the startup’s nearly 50 employees.

The company also made its computing ambitions known in March and April 2026: it announced a partnership with Nvidia to deploy at least one gigawatt of next-generation Vera Rubin systems, and then expanded its Google Cloud partnership to use Google’s AI Hypercomputer infrastructure with Nvidia GB300 systems for model research, reinforcement learning workloads and frontier model training.

In April 2026, Business Insider reported that Meta had hired seven founding members of Thinking Machines, including Mark Jen and Yinghai Lu, while another Thinking Machines researcher, Tianyi Zhang, also moved to Meta. The same report said Joshua Gross, who helped build Thinking Machines’ fine-tuning product, Tinker, had joined Meta Superintelligence Labs, and that the company had grown to about 130 employees despite the departures.

Thinking Machines wasn’t just losing people, though: it also hired Meta veteran Soumith Chintala, creator of PyTorch, as CTO, and added other top technical talent such as Neal Wu. TechCrunch reported separately in April 2026 that Weiyao Wang, an eight-year Meta veteran who worked on multimodal systems, had joined Thinking Machines, underscoring that the flow of talent was not one-way.

Thinking Machines has previously said it is committed to open-sourcing important components of its releases to empower the research community. It is not yet clear whether these new interaction models will fall under the same release ethos.

But one thing is clear: by making interaction native to the model, Thinking Machines is betting that scaling will now make models not just smarter but more effective collaborators.
