ICASSP 2026 Oral | Open-world HOI | MLLM reasoning

Towards Open-World Human-Object Interaction Reasoning with Multimodal Large Language Model

Eastman Z. Y. Wu, Yali Li, Shengjin Wang*

Department of Electronic Engineering, Tsinghua University; Beijing National Research Center for Information Science and Technology (BNRist), China; National Engineering Research Center of Dangerous Articles and Explosives Detection Technologies, Beijing, China. * Corresponding author.

A reasoning-first HOI detector that produces structured predictions and human-readable instance-level explanations.

69.28 mF1 on V-COCO (R50-DETR)
50.61 mF1 on HICO-DET (Oracle)

Background photo: Brooke Lark / Unsplash

HOI-MLLM formulates human-object interaction detection as structured multimodal reasoning, enabling open-vocabulary predictions, instance-level chain-of-thought explanations, and strong HOI detection performance.

01 Open-world: Reason about interactions beyond predefined concepts.
02 Interpretable: Generate HOI-specific reasoning chains before the final answer.
03 Structured: Produce parseable human, object, class, and interaction outputs.

MLLM Benchmark

Leading General-Purpose MLLMs on HOI F1

Beyond standard HOI benchmarks, HOI-MLLM is compared with recent models from the Gemini, GPT, Claude, GLM, Qwen, Grok, Moonshot, and other MLLM families. HOI-MLLM ranks first in HOI F1 while retaining structured, instance-level HOI outputs.

50.6% HOI-MLLM F1
49.9% Best GPT baseline
48.2% Best Qwen baseline
[Figure: HOI F1 comparison grouped by MLLM family. HOI-MLLM is shown against specialized HOI methods and general-purpose multimodal models.]

Overview

Abstract

Human-Object Interaction (HOI) detection extends conventional object detection by reasoning about higher-level semantic relationships between humans and objects. Most existing HOI detectors are built on DETR-style architectures and rely on external knowledge from large language models or vision-language models. However, they still face insufficient semantic understanding, restricted open-world knowledge, and weak interpretability.

We propose HOI-MLLM, a framework for HOI detection that leverages the reasoning capability of multimodal large language models. We construct balanced supervised fine-tuning data with curated chain-of-thought annotations, and adopt a two-stage training strategy that combines SFT warm-up with GRPO-based post-training guided by HOI-specific reward functions. Experiments on V-COCO and HICO-DET demonstrate state-of-the-art performance while producing interpretable reasoning chains.

Paradigm: HOI detection as structured multimodal generation.
Reasoning: Four-step CoT supervision for instance-level interaction inference.
Optimization: HOI-specific GRPO rewards for format, interaction, pairing, and F1.

Motivation

From Closed-Set Prediction to HOI Reasoning

[Figure: Comparison between traditional HOI detection and HOI-MLLM reasoning.]
Traditional HOI detectors rely on predefined concepts and external knowledge. HOI-MLLM directly reasons over visual cues and produces structured, instance-level HOI predictions.

Method

Structured Reasoning with MLLMs

1 Balanced HOI Data: Curate a smaller but balanced training set from long-tailed HOI benchmarks, prioritizing rare interactions at the instance level.
2 Structured Output: Generate parseable HOI results containing human boxes, object boxes, object classes, and interactions.
3 Chain-of-Thought: Supervise reflection, visual cue mining, interaction reasoning, and final structured HOI output.
4 GRPO Rewards: Optimize image-level format and interaction rewards together with instance-level pairing and F1 rewards.
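
As a rough illustration of what a "parseable" result could look like, the sketch below parses a response that places free-form reasoning before a machine-readable answer. The page does not specify HOI-MLLM's actual output schema, so the `<think>`/`<answer>` tags, the JSON keys, and the names `HOIInstance` and `parse_hoi_output` are all hypothetical and chosen only to carry the four fields named in the Structured Output step.

```python
import json
import re
from dataclasses import dataclass

# ASSUMPTION: tag names and JSON keys are hypothetical; the real HOI-MLLM
# output format is not described on this page.
RESPONSE_RE = re.compile(
    r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL
)


@dataclass(frozen=True)
class HOIInstance:
    human_box: tuple      # (x1, y1, x2, y2)
    object_box: tuple     # (x1, y1, x2, y2)
    object_class: str     # free-form label, not tied to a closed vocabulary
    interactions: tuple   # one or more interaction labels


def parse_hoi_output(text):
    """Return (reasoning, instances), or None if the response is malformed."""
    match = RESPONSE_RE.fullmatch(text.strip())
    if match is None:
        return None
    reasoning, answer = match.group(1).strip(), match.group(2)
    try:
        records = json.loads(answer)
        instances = [
            HOIInstance(
                human_box=tuple(float(v) for v in rec["human_box"]),
                object_box=tuple(float(v) for v in rec["object_box"]),
                object_class=str(rec["object_class"]),
                interactions=tuple(str(a) for a in rec["interactions"]),
            )
            for rec in records
        ]
    except (ValueError, KeyError, TypeError):
        # Covers invalid JSON, missing keys, and wrongly typed values.
        return None
    if any(len(i.human_box) != 4 or len(i.object_box) != 4 for i in instances):
        return None
    return reasoning, instances


sample = (
    "<think>The child grips a spoon with both hands near an open mouth.</think>"
    '<answer>[{"human_box": [10, 20, 110, 220], "object_box": [60, 80, 90, 120],'
    ' "object_class": "spoon", "interactions": ["hold", "lick"]}]</answer>'
)
parsed = parse_hoi_output(sample)
```

Returning `None` for any malformed response keeps the parser usable as a simple well-formedness check, which is one way a format-style reward could be derived from it.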

[Figure: HOI-MLLM supervised fine-tuning and GRPO post-training pipeline.]
The training pipeline combines SFT with chain-of-thought supervision and GRPO-based post-training using HOI-specific rewards.
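
To make the instance-level pairing and F1 rewards more concrete, here is a minimal sketch of how such a reward might be computed and then turned into GRPO-style group-relative advantages. The page does not give the reward definitions, so the greedy matching, the 0.5 IoU threshold, the handling of empty predictions, and the function names are assumptions for illustration only. The advantage step follows the usual GRPO idea of normalizing each sampled response's reward against the other responses for the same prompt.

```python
from statistics import mean, pstdev


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_reward(pred, gt, iou_thr=0.5):
    """F1 between predicted and ground-truth (human_box, object_box, verb) triplets.

    ASSUMPTION: greedy one-to-one matching with an IoU threshold on both boxes;
    the paper's actual pairing rule may differ.
    """
    if not pred and not gt:
        return 1.0  # assumed: correctly predicting "nothing" earns full reward
    unmatched = list(gt)
    true_pos = 0
    for h_box, o_box, verb in pred:
        for cand in unmatched:
            g_h, g_o, g_verb = cand
            if verb == g_verb and iou(h_box, g_h) >= iou_thr and iou(o_box, g_o) >= iou_thr:
                unmatched.remove(cand)  # each ground-truth triplet matches once
                true_pos += 1
                break
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std would also be a valid choice
    return [(r - mu) / (sigma + eps) for r in rewards]


human, spoon = (0, 0, 10, 10), (5, 5, 15, 15)
gt = [(human, spoon, "hold"), (human, spoon, "lick")]
responses = [
    [(human, spoon, "hold"), (human, spoon, "lick")],   # both triplets correct
    [(human, spoon, "hold")],                           # one of two recovered
    [((50, 50, 60, 60), spoon, "hold")],                # wrong human box
]
rewards = [f1_reward(p, gt) for p in responses]
advantages = group_relative_advantages(rewards)
```

In this toy group, the fully correct response receives a positive advantage and the mislocalized one a negative advantage, so the policy update favors responses whose pairs and interactions both match.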

Results

Strong HOI Detection and Reasoning Performance

V-COCO / R50-DETR Object Detections

| Method | mF1 | mPrec | mRec |
| --- | --- | --- | --- |
| PViC | 67.12 | 68.13 | 67.32 |
| CMMP | 66.55 | 60.73 | 74.76 |
| CMMP† | 67.65 | 60.98 | 77.47 |
| ADA-CM | 64.67 | 58.67 | 73.16 |
| ADA-CM† | 67.80 | 64.21 | 73.17 |
| EZ-HOI | 68.01 | 61.59 | 77.04 |
| EZ-HOI† | 68.42 | 62.56 | 77.68 |
| HOI-MLLM | 69.28 | 74.44 | 65.56 |

V-COCO / Oracle Object Boxes

| Method | mF1 | mPrec | mRec |
| --- | --- | --- | --- |
| PViC† | 79.57 | 83.74 | 84.03 |
| CMMP | 79.41 | 73.07 | 87.93 |
| CMMP† | 81.06 | 75.19 | 89.22 |
| ADA-CM | 78.18 | 73.84 | 83.95 |
| ADA-CM† | 79.02 | 74.64 | 85.22 |
| EZ-HOI | 80.60 | 77.86 | 84.41 |
| EZ-HOI† | 81.02 | 78.91 | 84.84 |
| HOI-MLLM | 81.94 | 85.32 | 79.35 |

HICO-DET / R50-DETR Object Detections

| Method | mF1 | mPrec | mRec |
| --- | --- | --- | --- |
| PViC | 28.68 | 25.29 | 44.10 |
| CMMP | 26.08 | 24.40 | 36.64 |
| CMMP† | 33.24 | 33.24 | 43.52 |
| ADA-CM | 28.20 | 27.37 | 35.62 |
| ADA-CM† | 34.14 | 32.25 | 44.69 |
| EZ-HOI | 27.82 | 26.00 | 36.58 |
| EZ-HOI† | 31.44 | 30.57 | 40.20 |
| HOI-MLLM | 30.32 | 35.71 | 36.21 |

HICO-DET / Oracle Object Boxes

| Method | mF1 | mPrec | mRec |
| --- | --- | --- | --- |
| PViC† | 44.41 | 39.73 | 61.63 |
| CMMP | 36.55 | 30.74 | 59.22 |
| CMMP† | 44.76 | 40.77 | 60.32 |
| ADA-CM | 39.78 | 36.68 | 53.68 |
| ADA-CM† | 45.81 | 41.97 | 61.39 |
| EZ-HOI | 42.72 | 37.82 | 59.48 |
| EZ-HOI† | 47.52 | 43.32 | 63.24 |
| HOI-MLLM | 50.61 | 51.37 | 56.47 |

Ablation variants (the A5 row matches the HOI-MLLM result on V-COCO / R50-DETR)

| Variant | SFT | Mixed | CoT | GRPO | mF1 | mPrec | mRec |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A1 | - | - | - | - | 31.31 | 41.92 | 27.46 |
| A2 | yes | - | - | - | 64.07 | 71.00 | 59.46 |
| A4 | yes | yes | yes | - | 68.10 | 73.85 | 63.71 |
| A5 | yes | yes | yes | yes | 69.28 | 74.44 | 65.56 |
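
A note on reading the three metric columns: the reported mF1 is not simply the harmonic mean of the reported mPrec and mRec. One plausible reading, stated here as an assumption because the page does not define the "m" prefix, is that precision, recall, and F1 are computed per class and then averaged. The sketch below checks the reported HOI-MLLM row on V-COCO / R50-DETR and uses made-up per-class values (`toy`) to show how the two quantities can differ.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


def macro_scores(per_class):
    """Average per-class (precision, recall) pairs into (mF1, mPrec, mRec).

    ASSUMPTION: macro averaging over classes; the page does not state how
    mF1, mPrec, and mRec are aggregated.
    """
    n = len(per_class)
    m_prec = sum(p for p, _ in per_class) / n
    m_rec = sum(r for _, r in per_class) / n
    m_f1 = sum(f1(p, r) for p, r in per_class) / n
    return m_f1, m_prec, m_rec


# Reported HOI-MLLM row (V-COCO / R50-DETR): mF1 69.28, mPrec 74.44, mRec 65.56.
harmonic_of_means = f1(74.44, 65.56)  # roughly 69.72, not the reported 69.28

# Made-up per-class values, for illustration only.
toy = [(0.9, 0.5), (0.6, 0.8)]
toy_m_f1, toy_m_prec, toy_m_rec = macro_scores(toy)
```

Under this reading, a model with high mPrec and lower mRec (as HOI-MLLM shows here) can have an mF1 slightly below the harmonic mean of the two averaged columns, because per-class imbalances are penalized before averaging.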

Qualitative Analysis

Open-World, Fine-Grained, Instance-Level Explanations

[Figure: Zoomed qualitative example of a child holding a spoon with chocolate residue.]

Case A / Open-vocabulary detail

Chocolate spoon interaction

The prediction draws on fine-grained cues that lie outside a closed HOI vocabulary: chocolate residue on the spoon, an open mouth, and a two-handed grip.

Visual cue: spoon contact, mouth state, residue
HOI output: hold spoon, lick spoon

[Figure: Zoomed qualitative example of baseball players where tagging and obstruction are inferred from context.]

Case B / Open-world context

Baseball rule reasoning

The model combines player locations, body pose, and field context to infer interactions that require scene knowledge rather than object naming alone.

Context cue: base path, sliding player, tagging motion
HOI output: tag person, obstruct person

[Figure: Zoomed qualitative example of a person holding, reading, and texting on a cell phone.]

Case C / Fine-grained visual cues

Cell phone reading and texting

The prediction relies on subtle posture and hand gestures: the person bends forward, holds the phone closely, and appears to tap or type.

Visual cue: bending posture, close phone holding, tapping gesture
HOI output: hold, text on, read cell phone

[Figure: Zoomed qualitative example where a dog is misdetected as a human and HOI-MLLM predicts no interaction.]

Case D / Reflective reasoning

No-interaction correction

The model rejects the mistaken human detection by recognizing that the box belongs to a dog in a water basin, then outputs no valid human-object interaction.

Reflective cue: invalid human box, dog appearance, water basin context
HOI output: no_interaction with tie

Reference

Citation

@inproceedings{wu2026hoi_mllm,
  title     = {Towards Open-World Human-Object Interaction Reasoning with Multimodal Large Language Model},
  author    = {Wu, Eastman Z. Y. and Li, Yali and Wang, Shengjin},
  booktitle = {ICASSP 2026 - IEEE International Conference on Acoustics, Speech and Signal Processing},
  year      = {2026}
}