A Multimodal Data Collection Framework for Dialogue-Driven Assistive Robotics to Clarify Ambiguities: A Wizard-of-Oz Pilot Study

Guangping Liu1, Nicholas Hawkins1, Billy Madden1, Tipu Sultan1, Flavio Esposito1, Madi Babaiasl1
1Saint Louis University

Our assistive data collection framework captures 53 trials of synchronized natural dialogue, wrist-mounted and egocentric camera streams, and whole-body robot states across five mobile assistive tasks in real-world settings, using the Wizard-of-Oz (WoZ) method.

Abstract

Integrated control of wheelchairs and wheelchair-mounted robotic arms (WMRAs) has strong potential to increase independence for users with severe motor limitations, yet existing interfaces often lack the flexibility needed for intuitive assistive interaction. Although data-driven AI methods show promise, progress is limited by the lack of multimodal datasets that capture natural Human–Robot Interaction (HRI), particularly conversational ambiguity in dialogue-driven control.

To address this gap, we propose a multimodal data collection framework that employs a dialogue-based interaction protocol and a two-room Wizard-of-Oz (WoZ) setup to simulate robot autonomy while eliciting natural user behavior.

The framework records five synchronized modalities across five assistive tasks: RGB-D video, conversational audio, inertial measurement unit (IMU) signals, end-effector Cartesian pose, and whole-body joint states. Using this framework, we collected a pilot dataset of 53 trials from five participants and validated its quality through motion smoothness analysis and user feedback. The results show that the framework effectively captures diverse ambiguity types and supports natural dialogue-driven interaction, demonstrating its suitability for scaling to a larger dataset for learning, benchmarking, and evaluation of ambiguity-aware assistive control. The dataset and code will be released at https://madibabaiasl.github.io/WheelArmWoZDataset/ upon paper acceptance, and a demonstration video is available at https://youtu.be/4Ei7vba7TNY.
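
As a rough illustration of how one synchronized trial could be organized, the sketch below defines a hypothetical per-trial container for the five modalities; the field names, array shapes, and naming convention are our assumptions, not the released dataset's actual schema.

from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class TrialRecord:
    """Hypothetical container for one synchronized WoZ trial (not the official schema)."""
    trial_id: str                 # e.g., "P03_task2_trial01" (naming assumed)
    task: str                     # one of the five assistive tasks
    timestamps: np.ndarray        # (T,) shared clock used to synchronize modalities
    rgbd: np.ndarray              # (T, H, W, 4) RGB-D frames from one camera stream
    audio: np.ndarray             # (N,) conversational audio samples
    imu: np.ndarray               # (T, 6) accelerometer and gyroscope readings
    ee_pose: np.ndarray           # (T, 7) end-effector position (x, y, z) plus quaternion
    joint_states: np.ndarray      # (T, J) whole-body joint positions
    dialogue: List[str] = field(default_factory=list)  # time-ordered utterances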

Video

Existing Multimodal Dataset Comparison

Dataset comparison

Prior datasets largely focus on multimodal manipulation or simulated dialogue interactions, but rarely provide real-world data that combines natural dialogue with both manipulation and navigation tasks.

Results

Task Demonstrations

The qualitative analysis presents videos and conversations from a sample trial of each task to assess video quality, WheelArm motions, and the dialogue between participants and the WheelArm. The figure below illustrates the high-quality multimodal data collected by our framework.

Two camera streams

General Information and Motion Analysis

We report the distribution of trials across tasks, trial durations, end-effector path lengths, and a jerk-based smoothness analysis of wheelchair and end-effector motion.

General information and motion analysis
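
As one way to reproduce the motion metrics above, the sketch below computes end-effector path length and a log dimensionless jerk smoothness score from a sampled Cartesian trajectory. The specific smoothness formulation and sampling conventions used in the paper are assumptions on our part.

import numpy as np


def path_length(positions: np.ndarray) -> float:
    """Total Cartesian path length of a (T, 3) position trajectory."""
    return float(np.sum(np.linalg.norm(np.diff(positions, axis=0), axis=1)))


def log_dimensionless_jerk(positions: np.ndarray, dt: float) -> float:
    """Log dimensionless jerk; more negative values indicate less smooth motion.

    This is a common smoothness metric, assumed here rather than taken
    from the paper's exact definition.
    """
    velocity = np.gradient(positions, dt, axis=0)
    jerk = np.gradient(np.gradient(velocity, dt, axis=0), dt, axis=0)
    duration = dt * (len(positions) - 1)
    peak_speed = np.max(np.linalg.norm(velocity, axis=1))
    jerk_integral = np.trapz(np.sum(jerk ** 2, axis=1), dx=dt)
    return float(-np.log(duration ** 3 / peak_speed ** 2 * jerk_integral))


if __name__ == "__main__":
    # Toy trajectory solely to exercise the two functions.
    t = np.linspace(0.0, 2.0, 200)[:, None]
    trajectory = np.hstack([t, np.sin(t), np.zeros_like(t)])
    dt = float(t[1, 0] - t[0, 0])
    print(path_length(trajectory), log_dimensionless_jerk(trajectory, dt))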

Dialogue Analysis

We analyze the distribution of tasks by ambiguity type, per-trial utterance counts, and the overall distribution of ambiguity types.

Dialogue analysis
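
To show how such dialogue statistics could be computed, the sketch below assumes a simple per-utterance annotation format with a speaker label and an optional ambiguity tag; the label names and fields are hypothetical, not the dataset's actual annotation scheme.

from collections import Counter
from typing import Dict, List

# Hypothetical annotation format: one dict per utterance in a trial.
example_trial: List[Dict[str, str]] = [
    {"speaker": "user", "text": "Can you bring me the cup?", "ambiguity": "referential"},
    {"speaker": "robot", "text": "The red cup or the white one?", "ambiguity": ""},
    {"speaker": "user", "text": "The red one, please.", "ambiguity": ""},
]


def dialogue_stats(trial: List[Dict[str, str]]) -> Dict[str, Counter]:
    """Count utterances per speaker and tally non-empty ambiguity tags."""
    return {
        "utterances_per_speaker": Counter(u["speaker"] for u in trial),
        "ambiguity_types": Counter(u["ambiguity"] for u in trial if u["ambiguity"]),
    }


print(dialogue_stats(example_trial))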

Questionnaire Analysis

The Likert-scale tables summarize user feedback on the dialogue-based interaction and WheelArm autonomy.

Questionnaire analysis

BibTeX

@inproceedings{liu2026biorob,
  author    = {Liu, Guangping and Hawkins, Nicholas and Madden, Billy and Sultan, Tipu and Esposito, Flavio and Babaiasl, Madi},
  title     = {A Multimodal Data Collection Framework for Dialogue-Driven Assistive Robotics to Clarify Ambiguities: A Wizard-of-Oz Pilot Study},
  booktitle = {BioRob},
  year      = {2026},
}