ICASSP 2026 Oral · Open-world HOI · MLLM reasoning

Towards Open-World Human-Object Interaction Reasoning with Multimodal Large Language Model

Eastman Z. Y. Wu, Yali Li, Shengjin Wang*

Department of Electronic Engineering, Tsinghua University; Beijing National Research Center for Information Science and Technology (BNRist), China; National Engineering Research Center of Dangerous Articles and Explosives Detection Technologies, Beijing, China. * Corresponding author.


A reasoning-first HOI detector that produces structured predictions and human-readable instance-level explanations.

69.28 V-COCO R50-DETR mF1

50.61 HICO-DET Oracle mF1

Background photo: Brooke Lark / Unsplash

HOI-MLLM formulates human-object interaction detection as structured multimodal reasoning, enabling open-vocabulary predictions, instance-level chain-of-thought explanations, and strong HOI detection performance.

01 Open-world Reason about interactions beyond predefined concepts.

02 Interpretable Generate HOI-specific reasoning chains before the final answer.

03 Structured Produce parseable human, object, class, and interaction outputs.

MLLM Benchmark

Leading General-Purpose MLLMs on HOI F1

Beyond standard HOI benchmarks, HOI-MLLM is compared with recent Gemini, GPT, Claude, GLM, Qwen, Grok, Moonshot, and other MLLM families. The model ranks first while retaining structured, instance-level HOI outputs.

50.6% HOI-MLLM F1

49.9% Best GPT baseline

48.2% Best Qwen baseline


Figure: HOI F1 comparison grouped by MLLM family. HOI-MLLM is shown against specialized HOI methods and general-purpose multimodal large language models.

Overview

Abstract

Human-Object Interaction (HOI) detection extends conventional object detection by reasoning about higher-level semantic relationships between humans and objects. Most existing HOI detectors are built on DETR-style architectures and rely on external knowledge from large language models or vision-language models. However, they still face insufficient semantic understanding, restricted open-world knowledge, and weak interpretability.

We propose HOI-MLLM, a framework for HOI detection that leverages the reasoning capability of multimodal large language models. We construct balanced supervised fine-tuning data with curated chain-of-thought annotations, and adopt a two-stage training strategy that combines SFT warm-up with GRPO-based post-training guided by HOI-specific reward functions. Experiments on V-COCO and HICO-DET demonstrate state-of-the-art performance while producing interpretable reasoning chains.

Paradigm HOI detection as structured multimodal generation.

Reasoning Four-step CoT supervision for instance-level interaction inference.

Optimization HOI-specific GRPO rewards for format, interaction, pairing, and F1.

Motivation

From Closed-Set Prediction to HOI Reasoning

Figure: Comparison between traditional HOI detection and HOI-MLLM reasoning. Traditional HOI detectors rely on predefined concepts and external knowledge, whereas HOI-MLLM reasons directly over visual cues and produces structured, instance-level HOI predictions.

Method

Structured Reasoning with MLLMs

Balanced HOI Data

Curate a smaller but balanced training set from long-tailed HOI benchmarks, prioritizing rare interactions at the instance level.
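As a rough illustration of instance-level balancing, the sketch below caps how many instances each interaction label may contribute, so that frequent labels are trimmed while rare ones are kept in full. The function name `balance_by_interaction`, the `(instance_id, label)` tuple layout, and the `cap` parameter are assumptions made for this example, not the curation procedure used for HOI-MLLM.

```python
from collections import Counter, defaultdict

def balance_by_interaction(instances, cap):
    """Keep at most `cap` instances per interaction label (hypothetical helper)."""
    counts = Counter(label for _, label in instances)
    by_label = defaultdict(list)
    for inst in instances:
        by_label[inst[1]].append(inst)
    selected = []
    # Visit labels from rarest to most frequent; ties are broken alphabetically.
    for label in sorted(counts, key=lambda k: (counts[k], k)):
        selected.extend(by_label[label][:cap])
    return selected

# Toy long-tailed data: (instance_id, interaction label).
toy = [(i, "hold") for i in range(6)] + [(6, "lick"), (7, "tag"), (8, "tag")]
subset = balance_by_interaction(toy, cap=2)
balanced_counts = Counter(label for _, label in subset)
```

On the toy data, "hold" shrinks from six instances to two, while "lick" and "tag" keep every instance they have.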

Structured Output

Generate parseable HOI results containing human boxes, object boxes, object classes, and interactions.
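To make "parseable" concrete, here is a minimal sketch of a validator for one possible output layout. The JSON schema (keys `human_box`, `object_box`, `object_class`, `interactions`) and the function `parse_hoi_output` are hypothetical; the page only states which four fields the output contains, not how they are serialized.

```python
import json

# Assumed field names; the actual serialization used by HOI-MLLM may differ.
REQUIRED_KEYS = {"human_box", "object_box", "object_class", "interactions"}

def _is_box(value):
    """A box is a list of four numbers, e.g. [x1, y1, x2, y2]."""
    return (isinstance(value, list) and len(value) == 4
            and all(isinstance(v, (int, float)) and not isinstance(v, bool)
                    for v in value))

def parse_hoi_output(text):
    """Return a list of HOI instances, or None if the text is not well formed."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None
    for item in data:
        if not isinstance(item, dict) or set(item) != REQUIRED_KEYS:
            return None
        if not (_is_box(item["human_box"]) and _is_box(item["object_box"])):
            return None
        if not isinstance(item["object_class"], str):
            return None
        actions = item["interactions"]
        if not (isinstance(actions, list)
                and all(isinstance(a, str) for a in actions)):
            return None
    return data

sample = ('[{"human_box": [12, 30, 180, 400], "object_box": [90, 110, 140, 160], '
          '"object_class": "spoon", "interactions": ["hold", "lick"]}]')
parsed = parse_hoi_output(sample)
```

Returning `None` for any malformed response keeps the check binary, which is convenient if the same validator is reused as a format signal during training.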

Chain-of-Thought

Supervise reflection, visual cue mining, interaction reasoning, and final structured HOI output.

GRPO Rewards

Optimize image-level format and interaction rewards together with instance-level pairing and F1 rewards.
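The two image-level rewards could look roughly like the sketch below. It assumes a response template with `<think>` and `<answer>` tags and scores interactions with a set-level F1; the template, the function names, and the equal weights are illustrative assumptions, and the instance-level pairing and F1 rewards are not covered here.

```python
import re

# Hypothetical response template: reasoning in <think>, result in <answer>.
TEMPLATE = re.compile(r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*", re.DOTALL)

def format_reward(response):
    """1.0 if the whole response follows the assumed template, else 0.0."""
    return 1.0 if TEMPLATE.fullmatch(response) else 0.0

def interaction_reward(predicted, target):
    """Set-level F1 between predicted and ground-truth interaction labels."""
    predicted, target = set(predicted), set(target)
    if not predicted and not target:
        return 1.0  # correctly predicting "nothing" counts as a full match
    hits = len(predicted & target)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(target)
    return 2 * precision * recall / (precision + recall)

def image_level_reward(response, predicted, target,
                       w_format=0.5, w_interaction=0.5):
    """Weighted sum of the two image-level terms (weights are assumptions)."""
    return (w_format * format_reward(response)
            + w_interaction * interaction_reward(predicted, target))

good = "<think>The child holds a spoon.</think><answer>hold spoon</answer>"
```

A weighted sum is only one way to combine the terms; the relative scale of each reward is a design choice that this page does not specify.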

Figure: HOI-MLLM training pipeline. SFT with chain-of-thought supervision is followed by GRPO-based post-training using HOI-specific rewards.
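For readers unfamiliar with GRPO, its core step is to sample several responses for the same input, score each one, and normalize the scores within that group to obtain advantages. The sketch below shows only this normalization; the function name, the `eps` constant, and the use of the population standard deviation are choices made for this example rather than details taken from the HOI-MLLM training code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one sampled group to zero mean and unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four responses sampled for one image, scored by the HOI-specific rewards.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
flat = group_relative_advantages([0.7, 0.7, 0.7])
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones; a group in which all responses score the same yields zero advantages and therefore no learning signal.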

Results

Strong HOI Detection and Reasoning Performance

Benchmark	Setting	Method	mF1	mPrec	mRec
V-COCO / R50-DETR Object Detections
V-COCO	R50-DETR	PViC	67.12	68.13	67.32
V-COCO	R50-DETR	CMMP	66.55	60.73	74.76
V-COCO	R50-DETR	CMMP†	67.65	60.98	77.47
V-COCO	R50-DETR	ADA-CM	64.67	58.67	73.16
V-COCO	R50-DETR	ADA-CM†	67.80	64.21	73.17
V-COCO	R50-DETR	EZ-HOI	68.01	61.59	77.04
V-COCO	R50-DETR	EZ-HOI†	68.42	62.56	77.68
V-COCO	R50-DETR	HOI-MLLM	69.28	74.44	65.56
V-COCO / Oracle Object Boxes
V-COCO	Oracle	PViC†	79.57	83.74	84.03
V-COCO	Oracle	CMMP	79.41	73.07	87.93
V-COCO	Oracle	CMMP†	81.06	75.19	89.22
V-COCO	Oracle	ADA-CM	78.18	73.84	83.95
V-COCO	Oracle	ADA-CM†	79.02	74.64	85.22
V-COCO	Oracle	EZ-HOI	80.60	77.86	84.41
V-COCO	Oracle	EZ-HOI†	81.02	78.91	84.84
V-COCO	Oracle	HOI-MLLM	81.94	85.32	79.35
HICO-DET / R50-DETR Object Detections
HICO-DET	R50-DETR	PViC	28.68	25.29	44.10
HICO-DET	R50-DETR	CMMP	26.08	24.40	36.64
HICO-DET	R50-DETR	CMMP†	33.24	33.24	43.52
HICO-DET	R50-DETR	ADA-CM	28.20	27.37	35.62
HICO-DET	R50-DETR	ADA-CM†	34.14	32.25	44.69
HICO-DET	R50-DETR	EZ-HOI	27.82	26.00	36.58
HICO-DET	R50-DETR	EZ-HOI†	31.44	30.57	40.20
HICO-DET	R50-DETR	HOI-MLLM	30.32	35.71	36.21
HICO-DET / Oracle Object Boxes
HICO-DET	Oracle	PViC†	44.41	39.73	61.63
HICO-DET	Oracle	CMMP	36.55	30.74	59.22
HICO-DET	Oracle	CMMP†	44.76	40.77	60.32
HICO-DET	Oracle	ADA-CM	39.78	36.68	53.68
HICO-DET	Oracle	ADA-CM†	45.81	41.97	61.39
HICO-DET	Oracle	EZ-HOI	42.72	37.82	59.48
HICO-DET	Oracle	EZ-HOI†	47.52	43.32	63.24
HICO-DET	Oracle	HOI-MLLM	50.61	51.37	56.47
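A note on reading the three columns: for the HOI-MLLM row on V-COCO with R50-DETR detections, the harmonic mean of the reported mPrec (74.44) and mRec (65.56) is about 69.72, not the reported mF1 of 69.28. This is consistent with F1 being computed per class (or per image) and then averaged, although the page does not state the exact protocol. The sketch below illustrates that macro-averaging effect on made-up counts; the helper names and the toy numbers are assumptions.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def macro_scores(counts):
    """counts maps class name -> (tp, fp, fn); returns (mF1, mPrec, mRec)."""
    rows = [prf(*c) for c in counts.values()]
    n = len(rows)
    m_prec = sum(r[0] for r in rows) / n
    m_rec = sum(r[1] for r in rows) / n
    m_f1 = sum(r[2] for r in rows) / n
    return m_f1, m_prec, m_rec

# Made-up counts for two interaction classes, one frequent and one rare.
toy_counts = {"hold": (8, 2, 2), "lick": (1, 0, 3)}
m_f1, m_prec, m_rec = macro_scores(toy_counts)
harmonic_of_means = 2 * m_prec * m_rec / (m_prec + m_rec)
```

Here the mean of per-class F1 is lower than the harmonic mean of mPrec and mRec, the same direction of gap seen in the table row above.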

Variant	SFT	Mixed	CoT	GRPO	mF1	mPrec	mRec
A1	-	-	-	-	31.31	41.92	27.46
A2	yes	-	-	-	64.07	71.00	59.46
A4	yes	yes	yes	-	68.10	73.85	63.71
A5	yes	yes	yes	yes	69.28	74.44	65.56

Qualitative Analysis

Open-World, Fine-Grained, Instance-Level Explanations

Zoomed qualitative example of a child holding a spoon with chocolate residue.

Case A / Open-vocabulary detail

Chocolate spoon interaction

The prediction uses fine object attributes that are not part of a closed HOI vocabulary: chocolate residue, an open mouth, and two-handed spoon holding.

Visual cue: spoon contact, mouth state, residue
HOI output: hold spoon, lick spoon

Zoomed qualitative example of baseball players where tagging and obstruction are inferred from context.

Case B / Open-world context

Baseball rule reasoning

The model combines player locations, body pose, and field context to infer interactions that require scene knowledge rather than object naming alone.

Context cue: base path, sliding player, tagging motion
HOI output: tag person, obstruct person

Zoomed qualitative example of a person holding, reading, and texting on a cell phone.

Case C / Fine-grained visual cues

Cell phone reading and texting

The prediction relies on subtle posture and hand gestures: the person bends forward, holds the phone closely, and appears to tap or type.

Visual cue: bending posture, close phone holding, tapping gesture
HOI output: hold, text on, read cell phone

Zoomed qualitative example where a dog is misdetected as a human and HOI-MLLM predicts no interaction.

Case D / Reflective reasoning

No-interaction correction

The model rejects the mistaken human detection by recognizing that the box belongs to a dog in a water basin, then outputs no valid human-object interaction.

Reflective cue: invalid human box, dog appearance, water basin context
HOI output: no_interaction with tie

Reference

Citation

@inproceedings{wu2026hoi_mllm,
  title     = {Towards Open-World Human-Object Interaction Reasoning with Multimodal Large Language Model},
  author    = {Wu, Eastman Z. Y. and Li, Yali and Wang, Shengjin},
  booktitle = {ICASSP 2026 - IEEE International Conference on Acoustics, Speech and Signal Processing},
  year      = {2026}
}