Celso M. de Melo

1 PromptGAR: Flexible Promptive Group Activity Recognition
Jin, Z., Feng, A., Chemburkar, A., de Melo, C., Proceedings of WACV, 2026
Show abstract

We present PromptGAR, a novel framework for Group Activity Recognition (GAR) that offering both input flexibility and high recognition accuracy. The existing approaches suffer from limited real-world applicability due to their reliance on full prompt annotations, fixed number of frames and instances, and the lack of actor consistency. To bridge the gap, we proposed PromptGAR, which is the first GAR model to provide input flexibility across prompts, frames, and instances without the need for retraining. We leverage diverse visual prompts—like bounding boxes, skeletal keypoints, and instance identities—by unifying them as point prompts. A recognition decoder then cross-updates class and prompt tokens for enhanced performance. To ensure actor consistency for extended activity durations, we also introduce a relative instance attention mechanism that directly encodes instance identities. Comprehensive evaluations demonstrate that PromptGAR achieves competitive performances both on full prompts and partial prompt inputs, establishing its effectiveness on input flexibility and generalization ability for real-world applications.

2 Referring Change Detection in Remote Sensing Imagery
Korkmaz, Y., Paranjape, J., de Melo, C., Patel, V., Proceedings of WACV, 2026
Show abstract

Change detection in remote sensing imagery is essential for applications such as urban planning, environmental monitoring, and disaster management. Traditional change detection methods typically identify all changes between two temporal images without distinguishing the types of transitions, which can lead to results that may not align with specific user needs. Although semantic change detection methods have attempted to address this by categorizing changes into predefined classes, these methods rely on rigid class definitions and fixed model architectures, making it difficult to mix datasets with different label sets or reuse models across tasks, as the output channels are tightly coupled with the number and type of semantic classes. To overcome these limitations, we introduce Referring Change Detection (RCD), which leverages natural language prompts to detect specific classes of changes in remote sensing images. By integrating language understanding with visual analysis, our approach allows users to specify the exact type of change they are interested in. However, training models for RCD is challenging due to the limited availability of annotated data and severe class imbalance in existing datasets. To address this, we propose a two-stage framework consisting of (I) RCDNet, a cross-modal fusion network designed for referring change detection, and (II) RCDGen, a diffusionbased synthetic data generation pipeline that produces realistic post-change images and change maps for a specified category using only pre-change image, without relying on semantic segmentation masks and thereby significantly lowering the barrier to scalable data creation. Experiments across multiple datasets show that our framework enables scalable and targeted change detection.

3 F-ViTA: Foundation Model Guided Visible to Thermal Translation
Paranjape, J., de Melo, C., Patel, V., Proceedings of WACV, 2026
Show abstract

Thermal imaging is crucial for scene understanding, particularly in low-light and nighttime conditions. However, collecting large thermal datasets is costly and labor-intensive due to the specialized equipment required for infrared image capture. To address this challenge, researchers have explored visible-to-thermal image translation. Most existing methods rely on Generative Adversarial Networks (GANs) or Diffusion Models (DMs), treating the task as a style transfer problem. As a result, these approaches attempt to learn both the modality distribution shift and underlying physical principles from limited training data. In this paper, we propose F-ViTA, a novel approach that leverages the general world knowledge embedded in foundation models to guide the diffusion process for improved translation. Specifically, we condition an InstructPix2Pix Diffusion Model with zero-shot masks and labels from foundation models such as SAM and Grounded DINO. This allows the model to learn meaningful correlations between scene objects and their thermal signatures in infrared imagery. Extensive experiments on five public datasets demonstrate that F-ViTA outperforms state-of-the-art (SOTA) methods. Furthermore, our model generalizes well to out-of-distribution (OOD) scenarios and can generate Long-Wave Infrared (LWIR), Mid-Wave Infrared (MWIR), and Near-Infrared (NIR) translations from the same visible image.

4 SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning
Ma, W., Chou, Y.C., Liu, Q., Wang, X., de Melo, C., Xie, J., Yuille, A., Proceedings of NeurIPS, 2025
Show abstract

Despite recent advances on multi-modal models, 3D spatial reasoning remains a challenging task for state-of-the-art open-source and proprietary models. Recent studies explore data-driven approaches and achieve enhanced spatial reasoning performance by fine-tuning models on 3D-related visual question-answering data. However, these methods typically perform spatial reasoning in an implicit manner and often fail on questions that are trivial to humans, even with long chain-ofthought reasoning. In this work, we introduce SpatialReasoner, a novel large visionlanguage model (LVLM) that addresses 3D spatial reasoning with explicit 3D representations shared between multiple stages–3D perception, computation, and reasoning. Explicit 3D representations provide a coherent interface that supports advanced 3D spatial reasoning and improves the generalization ability to novel question types. Furthermore, by analyzing the explicit 3D representations in multistep reasoning traces of SpatialReasoner, we study the factual errors and identify key shortcomings of current LVLMs. Results show that our SpatialReasoner achieves improved performance on a variety of spatial reasoning benchmarks, outperforming Gemini 2.0 by 9.2% on 3DSRBench, and generalizes better when evaluating on novel 3D spatial reasoning questions. Our study bridges the 3D parsing capabilities of prior visual foundation models with the powerful reasoning abilities of large language models, opening new directions for 3D spatial reasoning.

5 Bisecle: Binding and Separation in Continual Learning for Video Language Understanding
Tan, Y., Hu, X., Xue, H., de Melo, C., Salim, F., Proceedings of NeurIPS, 2025
Show abstract

Frontier vision-language models (VLMs) have made remarkable improvements in video understanding tasks. However, real-world videos typically exist as continuously evolving data streams (e.g., dynamic scenes captured by wearable glasses), necessitating models to continually adapt to shifting data distributions and novel scenarios. Considering the prohibitive computational costs of fine-tuning models on new tasks, usually, a small subset of parameters is updated while the bulk of the model remains frozen. This poses new challenges to existing continual learning frameworks in the context of large multimodal foundation models, i.e., catastrophic forgetting and update conflict. While the foundation models struggle with parameter-efficient continual learning, the hippocampus in the human brain has evolved highly efficient mechanisms for memory formation and consolidation. Inspired by the rapid Binding and pattern separation mechanisms in the hippocampus, in this work, we propose Bisecle for video-language continual learning, where a multi-directional supervision module is used to capture more cross-modal relationships and a contrastive prompt learning scheme is designed to isolate task-specific knowledge to facilitate efficient memory storage. Binding and separation processes further strengthen the ability of VLMs to retain complex experiences, enabling robust and efficient continual learning in video understanding tasks. We perform a thorough evaluation of the proposed Bisecle, demonstrating its ability to mitigate forgetting and enhance cross-task generalization on several VideoQA benchmarks.

6 AHA - Predicting What Matters Next: Online Highlight DetectionWithout Looking Ahead
Chang, A., de Melo, C., Lukin, S., Proceedings of NeurIPS, 2025
Show abstract

Real-time understanding of continuous video streams is essential for intelligent agents operating in high-stakes environments, including autonomous vehicles, surveillance drones, and disaster response robots. Yet, most existing video understanding and highlight detection methods assume access to the entire video during inference, making them unsuitable for online or streaming scenarios. In particular, current models optimize for offline summarization, failing to support step-by-step reasoning needed for real-time decision-making. We introduce AHA, an autoregressive highlight detection framework that predicts the relevance of each video frame against a task described in natural language. Without accessing future video frames, AHA utilizes a multimodal vision-language model and lightweight, decoupled heads trained on a large, curated dataset of human-centric video labels. To enable scalability, we introduce the Dynamic SinkCache mechanism that achieves constant memory usage across infinite-length streams without degrading performance on standard benchmarks. This encourages the hidden representation to capture high-level task objectives, enabling effective frame-level rankings for informativeness, relevance, and uncertainty with respect to the natural language task. AHA achieves state-of-the-art (SOTA) performance on highlight detection benchmarks, surpassing even prior offline, full-context approaches and video-language models by +5.9% on TVSum and +8.3% on Mr.Hisum in mAP (mean Average Precision). We explore AHA’s potential for real-world robotics applications given a task-oriented natural language input and a continuous, robot-centric video. Both experiments demonstrate AHA’s potential effectiveness as a real-time reasoning module for downstream planning and long-horizon understanding.

7 FLEET: Formal Language-Grounded Scheduling for Heterogeneous Robot
Rivera, C., Byrd, G., Booker, M., Kemp, B., Gaines, A., Holmes, E., Uplinger, J., de Melo, C., Handelman, D., Proceedings of CoRR, 2025
Show abstract

Coordinating heterogeneous robot teams from free-form natural-language instructions is hard: language-only planners struggle with long-horizon coordination and hallucination, while purely formal methods require closed-world models. We present FLEET, a hybrid decentralized framework that turns language into optimized multi-robot schedules. An LLM front-end produces (i) a task graph with durations and precedence and (ii) a capability-aware robot–task fitness matrix; a formal back-end solves a makespan-minimization problem while the underlying robots execute their free-form subtasks with agentic closed-loop control. Across multiple free-form language-guided autonomy coordination benchmarks, FLEETimproves success over state of the art generative planners on two-agent teams across heterogeneous tasks. Ablations show that mixed integer linear programming (MILP) primarily improves temporal structure, while LLM-derived fitness is decisive for capability-coupled tasks; together they deliver the highest overall performance. We demonstrate the translation to real world challenges with hardware trials using a pair of quadruped robots with disjoint capabilities.

8 A Compound 3D-Informed Design toward Spatially-Intelligent Large Multimodal Models
Ma, W., de Melo, C., Yuille, A., Chen, J., Proceedings of CVPR, 2025
Show abstract

Humans naturally understand 3D spatial relationships, enabling complex reasoning like predicting collisions of vehicles from different directions. Current large multimodal models (LMMs), however, lack of this capability of 3D spatial reasoning. This limitation stems from the scarcity of 3D training data and the bias in current model designs toward 2D data. In this paper, we systematically study the impact of 3D-informed data, architecture, and training setups, introducing 3DI-LMM, an LMM with advanced 3D spatial reasoning abilities. To address data limitations, we develop two types of 3D-informed training datasets: (1) 3D-informed probing data focused on object’s 3D location and orientation, and (2) 3D-informed conversation data for complex spatial relationships. Notably, we are the first to curate VQA data that incorporate 3D orientation relationships. Furthermore, we systematically integrate these two types of training data with the architectural and training designs of LMMs, providing a roadmap for optimal design aimed at achieving superior 3D reasoning capabilities. Our 3DI-LMM advances machines toward highly capable 3Dinformed reasoning, surpassing GPT-4o performance by 8.7%. Our systematic empirical design and the resulting findings offer valuable insights for future research in this direction.

9 PulseCheck457: A Diagnostic Benchmark for Comprehensive Spatial Reasoning of Large Multimodal Models
Wang, X., Ma, W., Zhang, T., de Melo, C., Chen, J., Yuille, A., Proceedings of CVPR, 2025
Show abstract

Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present PulseCheck457, a scalable and unbiased synthetic dataset designed with 4 key capability for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single object recognition to our new proposed complex 6D spatial reasoning tasks. We evaluated various large multimodal models (LMMs) on PulseCheck457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings.

10 Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval
Reddy, A., et al., Proceedings of CVPR, 2025
Show abstract

In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon three main components: a fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss during training. We find that this interaction and training paradigm leads to strong individual, yet compatible, representations for encoding video content. These representations lead to increases in performance on common text-to-video retrieval benchmarks compared to other bi-encoder methods.

11 ConceptAgent: LLM-Driven Precondition Grounding and Tree Search for Robust Task Planning and Execution
Rivera, C., et al., Proceedings of ICRA, 2025
Show abstract

Robotic planning and execution in open-world environments is a complex problem due to the vast state spaces and high variability of task embodiment. Recent advances in perception algorithms, combined with Large Language Models (LLMs) for planning, offer promising solutions to these challenges, as the common sense reasoning capabilities of LLMs provide a strong heuristic for efficiently searching the action space. However, prior work fails to address the possibility of hallucinations from LLMs, which results in failures to execute the planned actions largely due to logical fallacies at high- or low-levels. To contend with automation failure due to such hallucinations, we introduce ConceptAgent, a natural language-driven robotic platform designed for task execution in unstructured environments. With a focus on scalability and reliability of LLM-based planning in complex state and action spaces, we present innovations designed to limit these shortcomings, including 1) Predicate Grounding to prevent and recover from infeasible actions, and 2) an embodied version of LLM-guided Monte Carlo Tree Search with self reflection. In simulation experiments, ConceptAgent achieved a 19% task completion rate across three room layouts and 30 easy level embodied tasks outperforming other state-of-the-art LLM-driven reasoning baselines that scored 10.26% and 8.11% on the same benchmark. Additionally, ablation studies on moderate to hard embodied tasks revealed a 20% increase in task completion from the baseline agent to the fully enhanced ConceptAgent, highlighting the individual and combined contributions of Predicate Grounding and LLM-guided Tree Search to enable more robust automation in complex state and action spaces.

12 A Mamba-based Siamese Network for Remote Sensing Change Detection
Paranjape, J., de Melo, C., Patel, V., Proceedings of WACV, 2025
Show abstract

Change detection in remote sensing images is an essential tool for analyzing a region at different times. It finds varied applications in monitoring environmental changes, man-made changes as well as corresponding decision-making and prediction of future trends. Deep learning methods like Convolutional Neural Networks (CNNs) and Transformers have achieved remarkable success in detecting significant changes, given two images at different times. In this paper, we propose a Mamba-based Change Detector (M-CD) that segments out the regions of interest even better. Mamba-based architectures demonstrate linear-time training capabilities and an improved receptive field over transformers. Our experiments on four widely used change detection datasets demonstrate significant improvements over existing state-of-the-art (SOTA) methods.

13 Texture- and shape-based adversarial attacks for overhead image vehicle detection
Yeghiazaryan, M., Namburu, S., Kim, E., Panev, S., de Melo, C., De la Torre, F., Hodgins, J., Proceedings of ICIP, 2025
Show abstract

Detecting vehicles in aerial images is difficult due to complex backgrounds, small object sizes, shadows, and occlusions. Although recent deep learning advancements have improved object detection, these models remain susceptible to adversarial attacks (AAs), challenging their reliability. Traditional AA strategies often ignore practical implementation constraints. Our work proposes realistic and practical constraints on texture (lowering resolution, limiting modified areas, and color ranges) and analyzes the impact of shape modifications on attack performance. We conducted extensive experiments with three object detector architectures, demonstrating the performance-practicality trade-off: more practical modifications tend to be less effective, and vice versa

14 Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models
Ranasinghe, Y., VS, Vibashan, Uplinger, J., de Melo, C., Patel, V., Proceedings of AVSS, 2025
Show abstract

Automatic target recognition (ATR) plays a critical role in tasks such as navigation and surveillance, where safety and accuracy are paramount. In extreme use cases, such as military applications, these factors are often challenged due to the presence of unknown terrains, environmental conditions, and novel object categories. Current object detectors, including open-world detectors, lack the ability to confidently recognize novel objects or operate in unknown environments, as they have not been exposed to these new conditions. However, Large Vision-Language Models (LVLMs) exhibit emergent properties that enable them to recognize objects in varying conditions in a zero-shot manner. Despite this, LVLMs struggle to localize objects effectively within a scene. To address these limitations, we propose a novel pipeline that combines the detection capabilities of open-world detectors with the recognition confidence of LVLMs, creating a robust system for zero-shot ATR of novel classes and unknown domains. In this study, we compare the performance of various LVLMs for recognizing military vehicles, which are often underrepresented in training datasets. Additionally, we examine the impact of factors such as distance range, modality, and prompting methods on the recognition performance, providing insights into the development of more reliable ATR systems for novel conditions and classes.

15 Guarding Barlow Twins Against Overfitting with Mixed Samples
Bandara, W., de Melo, C., Patel, V., Proceedings of AVSS, 2025
Show abstract

Self-supervised Learning (SSL) aims to learn transferable feature representations for downstream applications without relying on labeled data. The Barlow Twins algorithm, renowned for its widespread adoption and straightforward implementation compared to its counterparts like contrastive learning methods, minimizes feature redundancy while maximizing invariance to common corruptions. Optimizing for the above objective forces the network to learn useful representations, while avoiding noisy or constant features, resulting in improved downstream task performance with limited adaptation. Despite Barlow Twins’ proven effectiveness in pre-training, the underlying SSL objective can inadvertently cause feature overfitting due to the lack of strong interaction between the samples unlike the contrastive learning approaches. From our experiments, we observe that optimizing for the Barlow Twins objective doesn’t necessarily guarantee sustained improvements in representation quality beyond a certain pre-training phase, and can potentially degrade downstream performance on some datasets. To address this challenge, we introduce Mixed Barlow Twins, which aims to improve sample interaction during Barlow Twins training via linearly interpolated samples. This results in an additional regularization term to the original Barlow Twins objective, assuming linear interpolation in the input space translates to linearly interpolated features in the feature space. Pre-training with this regularization effectively mitigates feature overfitting and further enhances the downstream performance on CIFAR-10, CIFAR-100, TinyImageNet, STL-10, and ImageNet datasets

16 ConceptGraphs: Open-vocabulary 3D scene graphs for perception and planning
Kuwajerwala, A., Gu, Q., Morin, S., Jatavallabhula, K., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., Gan, C., de Melo, C., Tenenbaum, J., Torralba, A., Shkurti, F., Paull, L., Proceedings of ICRA, 2024
Show abstract

For robots to perform a wide variety of tasks, they require a 3D representation of the world that is semantically rich, yet compact and efficient for task-driven perception and planning. Recent approaches have attempted to leverage features from large vision-language models to encode semantics in 3D representations. However, these approaches tend to produce maps with per-point feature vectors, which do not scale well in larger environments, nor do they contain semantic spatial relationships between entities in the environment, which are useful for downstream planning. In this work, we propose ConceptGraphs, an open-vocabulary graph-structured representation for 3D scenes. ConceptGraphs is built by leveraging 2D foundation models and fusing their output to 3D by multiview association. The resulting representations generalize to novel semantic classes, without the need to collect large 3D datasets or finetune models. We demonstrate the utility of this representation through a number of downstream planning tasks that are specified through abstract (language) prompts and require complex reasoning over spatial and semantic concepts.

17 Vilco-bench: Video language continual learning benchmark
Tang, T., Deldari, S., Xue, H., de Melo, C., Salim, F., Proceedings of NeurIPS, 2024
Show abstract

Video language continual learning involves continuously adapting to information from video and text inputs, enhancing a model’s ability to handle new tasks while retaining prior knowledge. This field is a relatively under-explored area, and establishing appropriate datasets is crucial for facilitating communication and research in this field. In this study, we present the first dedicated benchmark, ViLCo-Bench, designed to evaluate continual learning models across a range of video-text tasks. The dataset comprises ten-minute-long videos and corresponding language queries collected from publicly available datasets. Additionally, we introduce a novel memory-efficient framework that incorporates self-supervised learning and mimics long-term and short-term memory effects. This framework addresses challenges including memory complexity from long video clips, natural language complexity from open queries, and text-video misalignment. We posit that ViLCo-Bench, with greater complexity compared to existing continual learning benchmarks, would serve as a critical tool for exploring the video-language domain, extending beyond conventional class-incremental tasks, and addressing complex and limited annotation issues. The curated data, evaluations, and our novel method are available at https://github. com/cruiseresearchgroup/ViLCo.

18 Exploring the impact of rendering method and motion quality on model performance when using multi-view synthetic data for action recognition
Panev, S., Kim, E., Namburu, S., Nikolova, D., de Melo, C., De la Torre, F., Hodgins, J., Proceedings of WACV, 2024
Show abstract

This paper explores the use of synthetic data in a human action recognition (HAR) task to avoid the challenges of obtaining and labeling real-world datasets. We introduce a new dataset suite comprising five datasets, eleven common human activities, three synchronized camera views (aerial and ground) in three outdoor environments, and three visual domains (real and two synthetic). For the synthetic data, two rendering methods (standard computer graphics and neural rendering) and two sources of human motions (motion capture and video-based motion reconstruction) were employed. We evaluated each dataset type by training popular activity recognition models and comparing the performance on the real test data. Our results show that synthetic data achieve slightly lower accuracy (4-8%) than real data. On the other hand, a model pre-trained on synthetic data and fine-tuned on limited real data surpasses the performance of either domain alone. Standard computer graphics (CG)-rendered data delivers better performance than the data generated from the neural-based rendering method. The results suggest that the quality of the human motions in the training data also affects the test results: motion capture delivers higher test accuracy. Additionally, a model trained on CG aerial view synthetic data exhibits greater robustness against camera viewpoint changes than one trained on real data.

19 Unsupervised video domain adaptation with masked pre-training and collaborative self-training
Reddy, A., Paul, W., Rivera, C., Shah, K., de Melo, C., Chellappa, R., Proceedings of CVPR, 2024
Show abstract

In this work, we tackle the problem of unsupervised domain adaptation (UDA) for video action recognition. Our approach, which we call UNITE, uses an image teacher model to adapt a video student model to the target domain. UNITE first employs self-supervised pre-training to promote discriminative feature learning on target domain videos using a teacher-guided masked distillation objective. We then perform self-training on masked target data, using the video student model and image teacher model together to generate improved pseudolabels for unlabeled target videos. Our self-training process successfully leverages the strengths of both models to achieve strong transfer performance across domains. We evaluate our approach on multiple video domain adaptation benchmarks and observe significant improvements upon previously reported results.

20 Entropic open-set active learning
Safaei, B., Vibashan, V.S., de Melo, C., Patel, V., Proceedings of AAAI, 2024
Show abstract

Active Learning (AL) aims to enhance the performance of deep models by selecting the most informative samples for annotation from a pool of unlabeled data. Despite impressive performance in closed-set settings, most AL methods fail in real-world scenarios where the unlabeled data contains unknown categories. Recently, a few studies have attempted to tackle the AL problem for the open-set setting. However, these methods focus more on selecting known samples and do not efficiently utilize unknown samples obtained during AL rounds. In this work, we propose an Entropic Open-set AL (EOAL) framework which leverages both known and unknown distributions effectively to select informative samples during AL rounds. Specifically, our approach employs two different entropy scores. One measures the uncertainty of a sample with respect to the known-class distributions. The other measures the uncertainty of the sample with respect to the unknown-class distributions. By utilizing these two entropy scores we effectively separate the known and unknown samples from the unlabeled data resulting in better sampling. Through extensive experiments, we show that the proposed method outperforms existing state-of-the-art methods on CIFAR-10, CIFAR-100, and TinyImageNet datasets.

21 Synthetic-to-Real Adaptation for Complex Action Recognition in Surveillance Applications
Lu, S., Jin, Z., Rajendran, V., Harari, M., Feng, A., and de Melo, C., Proceedings of Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications, 2024
Show abstract

In this paper, we propose to enhance action recognition accuracy by leveraging synthetic data and domain adaptation. Specifically, We achieve this through the creation of a synthetic dataset mimicking the Multi-View Extended Video with Activities (MEVA) dataset and the introduction of a multi-modal model for domain adaptation. This synthetic-to-real adaptation approach improves recognition accuracy by leveraging the synthetic data to enhance model generalization. Firstly, we focus on creating and utilizing synthetic datasets generated through a high-fidelity physically-based rendering system. The sensor simulation incorporates domain randomization and photo-realistic rendering to reduce the domain gap between the synthetic and real data, effectively addressing the persistent challenges of real data scarcity in action recognition. Complementing the synthetic dataset generation, we leverage the multi-modal models in the synthetic-toreal adaptation experiments that utilize RGB images and skeleton features. Our experiments show that even relatively straightforward techniques, such as synthetic data pre-training, provide improvements to the models. Our work highlights the effectiveness of the approach and its practical applications across various domains, including surveillance systems, threat identification, and disaster response.

22 Enhancing human action recognition with GAN-based data augmentation
Pulakurthia, P., de Melo, C., Rao, R., and Rabbani, M., Proceedings of Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications, 2024
Show abstract

Deep Neural Networks (DNNs) have emerged as powerful tools for human action recognition, yet their reliance on vast amounts of high-quality labeled data poses significant challenges. The traditional approach of collecting and labeling large volumes of real-world data is not only costly but also raises ethical concerns. A promising alternative is to generate synthetic data. However, the existing synthetic data generation pipelines require complex simulation environments. Our work presents a novel solution by employing Generative Adversarial Networks (GANs) to generate synthetic yet realistic training data from a small existing real-world dataset, thereby bypassing the need for elaborate simulation environments. Central to our approach is a training pipeline that extracts motion from each training video and augments it across varied subject appearances within the training set. This method increases the diversity in both motion and subject representation, thus significantly enhancing the model’s performance in accurately recognizing human gestures. The model’s performance is rigorously evaluated in diverse scenarios, including ground and aerial views, to demonstrate the method’s versatility and effectiveness. The findings of our study highlight the efficiency of GAN-based data augmentation, utilizing a minimal real dataset to create synthetic data without relying on complex simulators. Moreover, useful insights are provided by analyzing the critical factors influencing gesture recognition performance, such as the diversity in gesture motion and the diversity in subject appearance.

23 An evaluation of large pre-trained models for gesture recognition using synthetic videos
Reddy, A., Shah, K., Rivera, C., Paul, W., de Melo, C., and Chellappa, R., Proceedings of Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications, 2024
Show abstract

In this work, we explore the possibility of using synthetically generated data for video-based gesture recognition with large pre-trained models. We consider whether these models have sufficiently robust and expressive representation spaces to enable “training-free” classification. Specifically, we utilize various state-of-the-art video encoders to extract features for use in k-nearest neighbors classification, where the training data points are derived from synthetic videos only. We compare these results with another training-free approach— zero-shot classification using text descriptions of each gesture. In our experiments with the RoCoG-v2 dataset, we find that using synthetic training videos yields significantly lower classification accuracy on real test videos compared to using a relatively small number of real training videos. We also observe that video backbones that were fine-tuned on classification tasks serve as superior feature extractors, and that the choice of fine-tuning data has a substantial impact on k-nearest neighbors performance. Lastly, we find that zero-shot text-based classification performs poorly on the gesture recognition task, as gestures are not easily described through natural language.

24 Real-time human action recognition from aerial videos using autozoom and synthetic data
Xian, R., Vogel, B., de Melo, C., Harrison, A., Manocha, D., Proceedings of Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications, 2024
Show abstract

In this paper, we propose a novel approach for real-time human action recognition (HAR) on resource-constrained UAVs. Our approach tackles the limited availability of labeled UAV video data (compared to ground-based datasets) by incorporating synthetic data augmentation to improve the performance of a lightweight action recognition model. This combined strategy offers a robust and efficient solution for UAV-based HAR.We evaluate our method on the RoCoG v21 and UAV-Human2 datasets, showing a notable increase in top-1 accuracy across all scenarios on RoCoG: 9.1% improvement when training with synthetic data only, 6.9% with real data only, and the highest improvement of 11.8% with a combined approach. Additionally, using an X3D backbone further improves accuracy on the UAV-Human dataset by 5.5%. Our models deployed on a Qualcomm Robotics RB5 platform achieve real-time predictions at approximately 10 frames per second (fps) and demonstrate a superior trade-off between performance and inference rate on both low-power edge devices and high-end desktops.

25 ConceptFusion: Open-set multimodal 3D mapping
Jatavallabhula1, K., Kuwajerwala, A., Gu, Q., Omama, M., Chen, T., Maalouf, A., Li, S., Iyer, G., Saryazdi, S., Keetha, N., Tewari, A., Tenenbaum, J., de Melo, C., Krishna, M., Paull, L., Shkurti, F., Torralba, A., Proceedings of Robotics: Science and Systems (RSS), 2023
Show abstract

Building 3D maps of the environment is central to robot navigation, planning, and interaction with objects in a scene. Most existing approaches that integrate semantic concepts with 3D maps largely remain confined to the closed-set setting: they can only reason about a finite set of concepts, pre-defined at training time. Further, these maps can only be queried using class labels, or in recent work, using text prompts. We address both these issues with ConceptFusion, a scene representation that is: (i) fundamentally open-set, enabling reasoning beyond a closed set of concepts (ii) inherently multi-modal, enabling a diverse range of possible queries to the 3D map, from language, to images, to audio, to 3D geometry, all working in concert. ConceptFusion leverages the open-set capabilities of today’s foundation models pre-trained on internet-scale data to reason about concepts across modalities such as natural language, images, and audio. We demonstrate that pixel-aligned open-set features can be fused into 3D maps via traditional SLAM and multi-view fusion approaches. This enables effective zero-shot spatial reasoning, not needing any additional training or finetuning, and retains long-tailed concepts better than supervised approaches, outperforming them by more than 40% margin on 3D IoU. We extensively evaluate ConceptFusion on a number of real-world datasets, simulated home environments, a real-world tabletop manipulation task, and an autonomous driving platform. We showcase new avenues for blending foundation models with 3D open-set multimodal mapping.

26 Stmt: A spatial-temporal mesh transformer for mocap-based action recognition
Zhu, X., Huang, P.-Y., Liang, J., de Melo, C., Hauptmann, A., Proceedings of CVPR, 2023
Show abstract

27 Open-Set automatic target recognition
Safaei, B., Vibashan, V.S.; de Melo, C.; Hu, S.; Patel, V., Proceedings of ICASSP, 2023
Show abstract

Automatic Target Recognition (ATR) is a category of computer vision algorithms which attempts to recognize targets on data obtained from different sensors. ATR algorithms are extensively used in real-world scenarios such as military and surveillance applications. Existing ATR algorithms are developed for traditional closed-set methods where training and testing have the same class distribution. Thus, these algorithms have not been robust to unknown classes not seen during the training phase, limiting their utility in real-world applications. To this end, we propose an Open-set Automatic Target Recognition framework where we enable open-set recognition capability for ATR algorithms. In addition, we introduce a plugin Category-aware Binary Classifier (CBC) module to effectively tackle unknown classes seen during inference. The proposed CBC module can be easily integrated with any existing ATR algorithms and can be trained in an end-to-end manner. Experimental results show that the proposed approach outperforms many open-set methods on the DSIAC and CIFAR-10 datasets. To the best of our knowledge, this is the first work to address the open-set classification problem for ATR algorithms. Source code is available at: https://github.com/bardisafa/Open-set-ATR.

28 Synthetic-to-real domain adaptation for action recognition: A dataset and baseline performances
Reddy, A., Shah, K., Paul, W., Mocharla, R., Hoffman, J., Katyal, K., Manocha, D., de Melo, C., & Chellappa, R., Proceedings of International Conference on Robotics and Automation (ICRA), 2023
Show abstract

Human action recognition is a challenging problem, particularly when there is high variability in factors such as subject appearance, backgrounds and viewpoint. While deep neural networks (DNNs) have been shown to perform well on action recognition tasks, they typically require large amounts of high-quality labeled data to achieve robust performance across a variety of conditions. Synthetic data has shown promise as a way to avoid the substantial costs, and potential practical and ethical issues associated with collecting and labeling enormous amounts of data in the real-world. However, synthetic data may differ from real data in important ways. This phenomenon, known as domain shift, can limit the utility of synthetic data in robotics applications. To mitigate the effects of domain shift, substantial effort is being dedicated to the development of domain adaptation (DA) techniques. Yet, much remains to be understood on how best to develop these techniques. In this paper, we introduce a new dataset, called Robot Control Gestures (RoCoG-v2), composed of corresponding real and synthetic videos, to support the study of synthetic-to-real domain shift in video action recognition. Our work expands upon existing datasets by focusing the action classes on gestures for humanrobot teaming, as well as by enabling investigation of domain shift in both ground and aerial views. We present baseline results using state-of-the-art action recognition and domain adaptation algorithms and offer initial insight on tackling the synthetic-to-real and ground-to-air domain shifts. A link to the dataset and corresponding documentation can be found at https://github.com/reddyav1/RoCoG-v2.

29 AZTR: Aerial video action recognition with auto zoom and temporal reasoning
Wang, X., Xian, R., Guan, T., de Melo, C., Nogar, S., Bera, A., & Manocha, D., Proceedings of International Conference on Robotics and Automation (ICRA), 2023
Show abstract

We propose a novel approach for aerial video action recognition. Our method is designed for videos captured using UAVs and can run on edge or mobile devices. We present a learning-based approach that uses customized auto zoom to automatically identify the human target and scale it appropriately. This makes it easier to extract the key features and reduces the computational overhead. We also present an efficient temporal reasoning algorithm to capture the action information along the spatial and temporal domains within a controllable computational cost. Our approach has been implemented and evaluated both on the desktop with high-end GPUs and on the low power Robotics RB5 Platform for robots and drones. In practice, we achieve 6.1-7.4% improvement over SOTA in Top-1 accuracy on the RoCoG-v2 dataset, 8.3- 10.4% improvement on the UAV-Human dataset and 3.2% improvement on the Drone Action dataset.

30 Multi-view action recognition using contrastive learning
Shah, K., Shah, A., Lau, C., de Melo, C., & Chellappa, R., Proceedings of Winter Conference on Applications of Computer Vision (WACV), 2023
Show abstract

In this work, we present a method for RGB-based action recognition using multi-view videos. We present a supervised contrastive learning framework to learn a feature embedding robust to changes in viewpoint, by effectively leveraging multi-view data. We use an improved supervised contrastive loss and augment the positives with those coming from synchronized viewpoints. We also propose a new approach to use classifier probabilities to guide the selection of hard negatives in the contrastive loss, to learn a more discriminative representation. Negative samples from confusing classes based on posterior are weighted higher. We also show that our method leads to better domain generalization compared to the standard supervised training based on synthetic multi-view data. Extensive experiments on real (NTU-60, NTU-120, NUMA) and synthetic (RoCoG) data demonstrate the effectiveness of our approach.

31 Synthetic data for automatic target recognition from small drones
de Melo, C., Conover, D., Poster, D., Leung, S., Nguyen, R., Conroy, J., Proceedings of Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications, 2023
Show abstract

Automatic target recognition (ATR) technology is likely to play an increasingly prevalent role in maintaining situational awareness in the modern battlefield. Progress in deep learning has enabled considerable progress in the development of ATR algorithms; however, these algorithms require large amounts of high-quality annotated data to train and that is often the main bottleneck. Synthetic data offers a potential solution to this problem, especially given recent proliferation of tools and techniques to synthesize custom data. Here, we focus on ATR, in the visible domain, from the perspective of a small drone, which represents a domain of growing importance to the Army. We describe custom simulators built to support synthetic data for multiple targets in a variety of environments. We describe a field experiment where we compared a baseline (YOLOv5) model, trained on off-the-shelf large generic public datasets, with a model augmented with specialized synthetic data. We deployed the models on a VOXL platform in a small drone. Our results showed a considerable boost in performance when using synthetic data of over 40% in target detection accuracy (average precision with at least 50% overlap). We discuss the value of synthetic data for this domain, the opportunities it creates, but also the novel challenges it introduces.

32 Adversarial learning using synthetic IR imagery
Uplinger, J., Schesser, D., Meyer, C., Conroy, J., de Melo, C., Proceedings of Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications, 2023
Show abstract

33 The influence of emotional expressions of an industrial robot on human collaborative decision-making
Usui, K., Terada, K., & de Melo, C., Proceedings of Affective Computing and Intelligent Interaction (ACII), 2022
Show abstract

In recent years, robots have been equipped with the ability to express emotions and have begun building social relationships with people. However, the significance and effectiveness of incorporating emotion in industrial robots, which have a strong instrumental nature, is not fully understood. We investigated how emotional expressions of an industrial robot influence human collaborative decision-making. The participants (n=52), in a laboratory experiment, engaged in a dessert survival task with an arm robot in a 2 (emotion expression: present vs. absent) × 2 (competence: high vs. low) between-participants study. Emotion was expressed using color through a LED strip of lights - e.g., anger was conveyed by flashing red. The results showed that emotion expression and competence did not influence the final agreement and, in fact, emotion expressions made the interaction longer, emphasizing the difficulty in communicating emotion and the reason for those expressions. We discuss lessons learnt and provide insight on improving the value of emotion expression in industrial robots.

34 Not just streaks: Towards ground truth for single image deraining
Ba, Y., Zhang, H., Yang, E., Suzuki, A., Pfahnl, A., Chandrappa, C., de Melo, C., You, S., Soatto, S., Wong, Al, & Kadambi, A., Proceedings of European Conference on Computer Vision (ECCV), 2022
Show abstract

We propose a large-scale dataset of real-world rainy and clean image pairs and a method to remove degradations, induced by rain streaks and rain accumulation, from the image. As there exists no real-world dataset for deraining, current state-of-the-art methods rely on synthetic data and thus are limited by the sim2real domain gap; moreover, rigorous evaluation remains a challenge due to the absence of a real paired dataset. We fill this gap by collecting a real paired deraining dataset through meticulous control of non-rain variations. Our dataset enables paired training and quantitative evaluation for diverse real-world rain phenomena (e.g. rain streaks and rain accumulation). To learn a representation robust to rain phenomena, we propose a deep neural network that reconstructs the underlying scene by minimizing a rain-robust loss between rainy and clean images. Extensive experiments demonstrate that our model outperforms the state-of-the-art deraining methods on real rainy images under various conditions. Project website: https://visual.ee.ucla.edu/gt_rain.htm/.

35 The impact of partner expressions on felt emotion in the iterated prisoner’s dilemma: An event-level analysis
Angelika-Nikita, M., de Melo, C., Terada, K., Terada, K., & Gratch, J., Proceedings of the Ninth Annual Conference on Advances in Cognitive Systems (ACS), 2021
Show abstract

Social games like the prisoner’s dilemma are often used to develop models of the role of emotion in social decision-making. Here we examine an understudied aspect of emotion in such games: how an individual’s feelings are shaped by their partner’s expressions. Prior research has tended to focus on other aspects of emotion. Research on felt-emotion has focused on how an individual’s feelings shape how they treat their partner, or whether these feelings are authentically expressed. Research on expressed-emotion has focused on how an individual’s decisions are shaped by their partner’s expressions, without regard for whether these expressions actually evoke feelings. Here, we use computer-generated characters to examine how an individual’s moment-to-moment feelings are shaped by (1) how they are treated by their partner and (2) what their partner expresses during this treatment. Surprisingly, we find that partner expressions are far more important than actions in determining self-reported feelings. In other words, our partner can behave in a selfish and exploitive way, but if they show a collaborative pattern of expressions, we will feel greater pleasure collaborating with them. These results also emphasize the importance of context in determining how someone will feel in response to an expression (i.e., knowing a partner is happy is insufficient; we must know what they are happy-at). We discuss the implications of this work for cognitive-system design, emotion theory, and methodological practice in affective computing.

36 Vision-based gesture recognition in human-robot teams using synthetic data
de Melo, C., Rothrock, B., Gurram, P., Ulutan, O., & Manjunath, B. S., Proceedings of International Conference on Intelligent Robots and Systems (IROS), 2020
Show abstract

Building successful collaboration between humans and robots requires efficient, effective, and natural communication. Here we study a RGB-based deep learning approach for controlling robots through gestures (e.g., “follow me”). To address the challenge of collecting high-quality annotated data from human subjects, synthetic data is considered for this domain. We contribute a dataset of gestures that includes real videos with human subjects and synthetic videos from our custom simulator. A solution is presented for gesture recognition based on the state-of-the-art I3D model. Comprehensive testing was conducted to optimize the parameters for this model. Finally, to gather insight on the value of synthetic data, several experiments are described that systematically study the properties of synthetic data (e.g., gesture variations, character variety, generalization to new gestures). We discuss practical implications for the design of effective human-robot collaboration and the usefulness of synthetic data for deep learning.

37 Reducing task load with an embodied intelligent virtual assistant for improved performance in collaborative decision making
Kim, K., de Melo, C., Norouzi, N., Bruder, G., & Welch, G., Proceedings of IEEE on Virtual Reality and 3D User Interfaces (IEEE VR), 2020
Show abstract

Collaboration in a group has the potential to achieve more effective solutions for challenging problems, but collaboration per se is not an easy task, rather a stressful burden if the collaboration partners do not communicate well with each other. While Intelligent Virtual Assistants (IVAs), such as Amazon Alexa, are becoming part of our daily lives, there are increasing occurrences in which we collaborate with such IVAs for our daily tasks. Although IVAs can provide important support to users, the limited verbal interface in the current state of IVAs lacks the ability to provide effective non-verbal social cues, which is critical for improving collaborative performance and reducing task load. In this paper, we investigate the effects of IVA embodiment on collaborative decision making. In a within-subjects study, participants performed a desert survival task in three conditions: (1) performing the task alone, (2) working with a disembodied voice assistant, and (3) working with an embodied assistant. Our results show that both assistant conditions led to higher performance over when performing the task alone, but interestingly the reported task load with the embodied assistant was significantly lower than with the disembodied voice assistant. We discuss the findings with implications for effective and efficient collaborations with IVAs while also emphasizing the increased social presence and richness of the embodied assistant.

38 Shaping cooperation between humans and agents with emotion expressions and framing
de Melo, C., Khooshabeh, P., Amir, O., & Gratch, J., Proceedings of Autonomous Agents and Multiagent Systems (AAMAS 18), 2018
Show abstract

Emotion expressions can help solve social dilemmas where individual interest is pitted against the collective interest. Building on research that shows that emotions communicate intentions to others, we reinforce that people can infer whether emotionally expressive computer agents intend to cooperate or compete. We further show important distinctions between computer agents that are perceived to be driven by humans (i.e., avatars) vs. by algorithms (i.e., agents). Our results reveal that, when the emotion expression reflects an intention to cooperate, participants will cooperate more with avatars than with agents; however, when the emotion reflects an intention to compete, participants cooperate just as little with avatars as with agents. Finally, we present first evidence that the way the dilemma is described – or framed – can influence people’s decision-making. We discuss implications for the design of autonomous agents that foster cooperation with humans, beyond what game theory predicts in social dilemmas.

39 Increasing fairness by delegating decisions to autonomous agents
de Melo, C., Marsella, S., & Gratch, J., Proceedings of Autonomous Agents and Multiagent Systems (AAMAS 17), 2017
Show abstract

There has been growing interest in autonomous agents that act on our behalf, or represent us, across various domains such as negotiation, transportation, health, finance, defense, etc. As these agent representatives become immersed in society, it is critical we understand whether and, if so, how they disrupt the traditional patterns of interaction with others. In this paper we study how programming agents to represent us, shapes our decisions in social settings. Here we show that, when acting through agent representatives, people are considerably less likely to accept unfair offers from others, when compared to direct interaction with others. This result, thus, demonstrates that agent representatives have the potential to promote fairer outcomes. Moreover, we show that this effect can also occur when people are asked to “program” human representatives, thus revealing that the effect is caused by the act of programming itself. We argue this happens because programming requires the programmer to deliberate on all possible situations that might arise and, thus, promote consideration of social norms – such as fairness – when making their decisions. These results have important theoretical, practical, and ethical implications for designing and the nature of people's decision making when they act through agents that act on our behalf.

40 "Do as I say, not as I do:" Challenges in delegating decisions to automated agents
de Melo, C., Marsella, S., & Gratch, J., Proceedings of Autonomous Agents and Multiagent Systems (AAMAS 16), 2016
Show abstract

There has been growing interest, across various domains, in computer agents that can decide on behalf of humans. These agents have the potential to save considerable time and help humans reach better decisions. One implicit assumption, however, is that, as long as the algorithms that simulate decision-making are correct and capture how humans make decisions, humans will treat these agents similarly to other humans. Here we show that interaction with agents that act on our behalf or on behalf of others is richer and more interesting than initially expected. Our results show that, on the one hand, people are more selfish with agents acting on behalf of others, than when interacting directly with others. We propose that agents increase the social distance with others which, subsequently, leads to increased demand. On the other hand, when people task an agent to interact with others, people show more concern for fairness than when interacting directly with others. In this case, higher psychological distance leads people to consider their social image and the long-term consequences of their actions and, thus, behave more fairly. To support these findings, we present an experiment where people engaged in the ultimatum game, either directly or via an agent, with others or agents representing others. We show that these patterns of behavior also occur in a variant of the ultimatum game – the impunity game – where others have minimal power over the final outcome. Finally, we study how social value orientation – i.e., people’s propensity for cooperation – impact these effects. These results have important implications for our understanding of the psychological mechanisms underlying interaction with agents, as well as practical implications for the design of successful agents that act on our behalf or on behalf of others.

41 Beyond believability: Quantifying the differences between real and virtual humans.
de Melo, C., & Gratch, J., Proceedings of the 15th International Conference on Intelligent Virtual Agents (IVA 15), 2015
Show abstract

“Believable” agents are supposed to “suspend the audience's disbelief” and provide the “illusion of life”. However, beyond such high-level definitions, which are prone to subjective interpretation, there is not much more to help researchers systematically create or assess whether their agents are believable. In this paper we propose a more pragmatic and useful benchmark than believability for designing virtual agents. This benchmark requires people, in a specific social situation, to act with the virtual agent in the same manner as they would with a real human. We propose that perceptions of mind in virtual agents, especially pertaining to agency – the ability to act and plan – and experience – the ability to sense and feel emotion – are critical for achieving this new benchmark. We also review current computational systems that fail, pass, and even surpass this benchmark and show how a theoretical framework based on perceptions of mind can shed light into these systems. We also discuss a few important cases where it is better if virtual humans do not pass the benchmark. We discuss implications for the design of virtual agents that can be as natural and efficient to interact with as real humans.

42 People show envy, not guilt, when making decisions with machines.
de Melo, C., & Gratch, J., Proceedings of the 6th International Conference on Affective Computing and Intelligent Interaction (ACII 15), 2015
Show abstract

Research shows that people consistently reach more efficient solutions than those predicted by standard economic models, which assume people are selfish. Artificial intelligence, in turn, seeks to create machines that can achieve these levels of efficiency in human-machine interaction. However, as reinforced in this paper, people's decisions are systematically less efficient – i.e., less fair and favorable – with machines than with humans. To understand the cause of this bias, we resort to a well-known experimental economics model: Fehr and Schmidt's inequity aversion model. This model accounts for people's aversion to disadvantageous outcome inequality (envy) and aversion to advantageous outcome inequality (guilt). We present an experiment where participants engaged in the ultimatum and dictator games with human or machine counterparts. By fitting this data to Fehr and Schmidt's model, we show that people acted as if they were just as envious of humans as of machines; but, in contrast, people showed less guilt when making unfavorable decisions to machines. This result, thus, provides critical insight into this bias people show, in economic settings, in favor of humans. We discuss implications for the design of machines that engage in social decision making with humans.

43 The importance of cognition and affect for artificially intelligent decision makers.
de Melo, C., Gratch, J., Carnevale, P., Proceedings of the 28th Conference on Artificial Intelligence (AAAI 14), 2014
Show abstract

Agency – the capacity to plan and act – and experience – the capacity to sense and feel – are two critical aspects that determine whether people will perceive non-human entities, such as autonomous agents, to have a mind. There is evidence that the absence of either can reduce cooperation. We present an experiment that tests the necessity of both for cooperation with agents. In this experiment we manipulated people's perceptions about the cognitive and affective abilities of agents, when engaging in the ultimatum game. The results indicated that people offered more money to agents that were perceived to make decisions according to their intentions (high agency), rather than randomly (low agency). Additionally, the results showed that people offered more money to agents that expressed emotion (high experience), when compared to agents that did not (low experience). We discuss the implications of this agency-experience theoretical framework for the design of artificially intelligent decision makers.

44 Using virtual confederates to research intergroup bias and conflict.
de Melo, C., Carnevale, P., & Gratch, J., Best Paper Proceedings of the Annual Meeting of the Academy of Management (AOM 14), 2014
Show abstract

Virtual confederates–i.e., three-dimensional virtual characters that look and act like humans–have been gaining in popularity as a research method in the social and medical sciences. Interest in this research method stems from the potential for increased experimental control, ease of replication, facilitated access to broader samples and lower costs. We argue that virtual confederates are also a promising research tool for the study of intergroup behavior. To support this claim we replicate and extend with virtual confederates key findings in the literature. In Experiment 1 we demonstrate that people apply racial stereotypes to virtual confederates, and show a corresponding bias in terms of money offered in the dictator game. In Experiment 2 we show that people also show an in-group bias when group membership is artificially created and based on interdependence through shared payoffs in a nested social dilemma. Our results further demonstrate that social categorization and bias can occur not only when people believe confederates are controlled by humans (i.e., they are avatars), but also when confederates are believed to be controlled by computer algorithms (i.e., they are agents). The results, nevertheless, show a basic bias in favor of avatars (the in-group in the “human category”) to agents (the out-group). Finally, our results (Experiments 2 and 3) establish that people can combine, in additive fashion, the effects of these social categories; a mechanism that, accordingly, can be used to reduce intergroup bias. We discuss implications for research in social categorization, intergroup bias and conflict.

45 The effect of agency on the impact of emotion expressions on people's decision making.
de Melo, C., Gratch, J., Carnevale, P., Proceedings of the International Conference of Affective Computing and Intelligent Interaction (ACII 13), 2013
Show abstract

Recent research in neuroeconomics reveals that people show different behavior and lower activation of brain regions associated with mentalizing (i.e., the inference of other's mental states) when engaged in decision making tasks with a computer, when compared to a human. These findings are important for affective computing because they suggest people's decision making might be influenced differently according to whether they believe the emotional expressions shown by a computer are being generated by a computer algorithm or a human. To test this, we had people engage in a social dilemma (Experiment 1) or a negotiation (Experiment 2) with virtual humans that were either agents (i.e., controlled by computers) or avatars (i.e., controlled by humans). The results show a clear agency effect: in Experiment 1, people cooperated more with virtual humans that showed facial cooperative displays (e.g., joy after mutual cooperation) rather than competitive displays (e.g., joy when the participant was exploited) but, the effect was only significant with avatars; in Experiment 2, people conceded more to an angry than a neutral virtual human but, once again, the effect was only significant with avatars.

46 The effect of virtual agent's emotion displays and appraisals on people's decision making in negotiation.
de Melo, C., Carnevale, P., & Gratch, J., Proceedings of The 12th International Conference on Intelligent Virtual Agents (IVA 12), 2012
Show abstract

There is growing evidence that emotion displays can impact people's decision making in negotiation. However, despite increasing interest in AI and HCI on negotiation as a means to resolve differences between humans and agents, emotion has been largely ignored. We explore how emotion displays in virtual agents impact people's decision making in human-agent negotiation. This paper presents an experiment (N=204) that studies the effects of virtual agents' displays of joy, sadness, anger and guilt on people's decision to counteroffer, accept or drop out from the negotiation, as well as on people's expectations about the agents' decisions. The paper also presents evidence for a mechanism underlying such effects based on appraisal theories of emotion whereby people retrieve, from emotion displays, information about how the agent is appraising the ongoing interaction and, from this information, infer about the agent's intentions and reach decisions themselves. We discuss implications for the design of intelligent virtual agents that can negotiate effectively

47 Bayesian model of the social effects of emotion in decision-making in multiagent systems.
de Melo, C., Carnevale, P., Read, S., Antos, D., & Gratch, J., Proceedings of Autonomous Agents and Multiagent Systems (AAMAS 12), 2012
Show abstract

Research in the behavioral sciences suggests that emotion can serve important social functions and that, more than a simple manifestation of internal experience, emotion displays communicate one's beliefs, desires and intentions. In a recent study we have shown that, when engaged in the iterated prisoner's dilemma with agents that display emotion, people infer, from the emotion displays, how the agent is appraising the ongoing interaction (e.g., is the situation favorable to the agent? Does it blame me for the current state-of-affairs?). From these appraisals people, then, infer whether the agent is likely to cooperate in the future. In this paper we propose a Bayesian model that captures this social function of emotion. The model supports probabilistic predictions, from emotion displays, about how the counterpart is appraising the interaction which, in turn, lead to predictions about the counterpart's intentions. The model's parameters were learnt using data from the empirical study. Our evaluation indicated that considering emotion displays improved the model's ability to predict the counterpart's intentions, in particular, how likely it was to cooperate in a social dilemma. Using data from another empirical study where people made inferences about the counterpart's likelihood of cooperation in the absence of emotion displays, we also showed that the model could, from information about appraisals alone, make appropriate inferences about the counterpart's intentions. Overall, the paper suggests that appraisals are valuable for computational models of emotion interpretation. The relevance of these results for the design of multiagent systems where agents, human or not, can convey or recognize emotion is discussed.

48 A computer model of the interpersonal effect of emotion displayed in a social dilemma.
de Melo, C., Carnevale, P., Antos, D., & Gratch, J., Proceedings of Affective Computing and Intelligent Interaction (ACII 11), 2011
Show abstract

The paper presents a computational model for decision-making in a social dilemma that takes into account the other party's emotion displays. The model is based on data collected in a series of recent studies where participants play the iterated prisoner's dilemma with agents that, even though following the same action strategy, show different emotion displays according to how the game unfolds. We collapse data from all these studies and fit, using maximum likelihood estimation, probabilistic models that predict likelihood of cooperation in the next round given different features. Model 1 predicts based on round outcome alone. Model 2 predicts based on outcome and emotion displays. Model 3 also predicts based on outcome and emotion but, considers contrast effects found in the empirical studies regarding the order with which participants play cooperators and non-cooperators. To evaluate the models, we replicate the original studies but, substitute the humans for the models. The results reveal that Model 3 best replicates human behavior in the original studies and Model 1 does the worst. The results, first, emphasize recent research about the importance of nonverbal cues in social dilemmas and, second, reinforce that people attend to contrast effects in their decision-making. Theoretically, the model provides further insight into how people behave in social dilemmas. Pragmatically, the model could be used to drive an agent that is engaged in a social dilemma with a human (or another agent).

49 The effect of expression of anger and happiness in computer agents on negotiations with humans.
de Melo, C., Carnevale, P., & Gratch, J., Proceedings of Autonomous Agents and Multiagent Systems (AAMAS 11), 2011
Show abstract

There is now considerable evidence in social psychology, economics, and related disciplines that emotion plays an important role in negotiation. For example, humans make greater concessions in negotiation to an opposing human who expresses anger, and they make fewer concessions to an opponent who expresses happiness, compared to a no-emotion-expression control. However, in AI, despite the wide interest in negotiation as a means to resolve differences between agents and humans, emotion has been largely ignored. This paper explores whether expression of anger or happiness by computer agents, in a multi-issue negotiation task, can produce effects that resemble effects seen in human-human negotiation. The paper presents an experiment where participants play with agents that express emotions (anger vs. happiness vs. control) through different modalities (text vs. facial displays). An important distinction in our experiment is that participants are aware that they negotiate with computer agents. The data indicate that the emotion effects observed in past work with humans also occur in agent-human negotiation, and occur independently of modality of expression. The implications of these results are discussed for the fields of automated negotiation, intelligent virtual agents and artificial intelligence.

50 The influence of emotion expression on perceptions of trustworthiness in negotiation.
Antos, D., de Melo, C., Gratch, J., & Grosz, B., Proceedings of The 25th Conference on Artificial Intelligence (AAAI 11), 2011
Show abstract

When interacting with computer agents, people make inferences about various characteristics of these agents, such as their reliability and trustworthiness. These perceptions are significant, as they influence people's behavior towards the agents, and may foster or inhibit repeated interactions between them. In this paper we investigate whether computer agents can use the expression of emotion to influence human perceptions of trustworthiness. In particular, we study human-computer interactions within the context of a negotiation game, in which players make alternating offers to decide on how to divide a set of resources. A series of negotiation games between a human and several agents is then followed by a “trust game.” In this game people have to choose one among several agents to interact with, as well as how much of their resources they will trust to it. Our results indicate that, among those agents that displayed emotion, those whose expression was in accord with their actions (strategy) during the negotiation game were generally preferred as partners in the trust game over those whose emotion expressions and actions did not mesh. Moreover, we observed that when emotion does not carry useful new information, it fails to strongly influence human decision-making behavior in a negotiation setting.

51 The influence of emotions in embodied agents on human decision-making.
de Melo, C., Carnevale, P., & Gratch, J., Proceedings of Intelligent Virtual Agents (IVA 10), 2010
Show abstract

Acknowledging the social functions that emotions serve, there has been growing interest in the interpersonal effect of emotion in human decision making. Following the paradigm of experimental games from social psychology and experimental economics, we explore the interpersonal effect of emotions expressed by embodied agents on human decision making. The paper describes an experiment where participants play the iterated prisoner's dilemma against two different agents that play the same strategy (tit-for-tat), but communicate different goal orientations (cooperative vs. individualistic) through their patterns of facial displays. The results show that participants are sensitive to differences in the facial displays and cooperate significantly more with the cooperative agent. The data indicate that emotions in agents can influence human decision making and that the nature of the emotion, as opposed to mere presence, is crucial for these effects. We discuss the implications of the results for designing human-computer interfaces and understanding human-human interaction.

52 Evolving expression of emotions through color in virtual humans using genetic algorithms.
de Melo, C., & Gratch, J., Proceedings of the 1st International Conference on Computational Creativity (ICCC 10), 2010
Show abstract

For centuries artists have been exploring the formal elements of art (lines, space, mass, light, color, sound, etc.) to express emotions. This paper takes this insight to explore new forms of expression for virtual humans which go beyond the usual bodily, facial and vocal expression channels. In particular, the paper focuses on how to use color to influence the perception of emotions in virtual humans. First, a lighting model and filters are used to manipulate color. Next, an evolutionary model, based on genetic algorithms, is developed to learn novel associations between emotions and color. An experiment is then conducted where non-experts evolve mappings for joy and sadness, without being aware that genetic algorithms are used. In a second experiment, the mappings are analyzed with respect to its features and how general they are. Results indicate that the average fitness increases with each new generation, thus suggesting that people are succeeding in creating novel and useful mappings for the emotions. Moreover, the results show consistent differences between the evolved images of joy and the evolved images of sadness.

53 Expression of emotions using wrinkles, blushing, sweating and tears.
de Melo, C., & Gratch, J., Proceedings of the Intelligent Virtual Agents (IVA 09), 2009
Show abstract

Wrinkles, blushing, sweating and tears are physiological manifestations of emotions in humans. Therefore, the simulation of these phenomena is important for the goal of building believable virtual humans which interact naturally and effectively with humans. This paper describes a real-time model for the simulation of wrinkles, blushing, sweating and tears. A study is also conducted to assess the influence of the model on the perception of surprise, sadness, anger, shame, pride and fear. The study follows a repeated-measures design where subjects compare how well is each emotion expressed by virtual humans with or without these phenomena. The results reveal a significant positive effect on the perception of surprise, sadness, anger, shame and fear. The relevance of these results is discussed for the fields of virtual humans and expression of emotions.

54 Expression of moral emotions in cooperating agents.
de Melo, C., Zheng, L., & Gratch, J., Proceedings of Intelligent Virtual Agents (IVA 09), 2009
Show abstract

Moral emotions have been argued to play a central role in the emergence of cooperation in human-human interactions. This work describes an experiment which tests whether this insight carries to virtual human-human interactions. In particular, the paper describes a repeated-measures experiment where subjects play the iterated prisoner's dilemma with two versions of the virtual human: (a) neutral, which is the control condition; (b) moral, which is identical to the control condition except that the virtual human expresses gratitude, distress, remorse, reproach and anger through the face according to the action history of the game. Our results indicate that subjects cooperate more with the virtual human in the moral condition and that they perceive it to be more human-like. We discuss the relevance these results have for building agents which are successful in cooperating with humans.

55 The effect of color on expression of joy and sadness in virtual humans.
de Melo, C., & Gratch, J., Proceedings of the Affective Computing and Intelligent Interaction (ACII 09), 2009
Show abstract

For centuries artists have been exploring color to express emotions. Following this insight, the paper describes an approach to learn how to use color to influence the perception of emotions in virtual humans. First, a model of lighting and filters inspired on the visual arts is integrated with a virtual human platform to manipulate color. Next, an evolutionary model, based on genetic algorithms, is created to evolve mappings between emotions and lighting and filter parameters. A first study is, then, conducted where subjects evolve mappings for joy and sadness without being aware of the evolutionary model. In a second study, the features which characterize the mappings are analyzed. Results show that virtual human images of joy tend to be brighter, more saturated and have more colors than images of sadness. The paper discusses the relevance of the results for the fields of expression of emotions and virtual humans.

56 Creative expression of emotions in virtual humans.
de Melo, C., & Gratch, J., Proceedings of the International Conference on the Foundations of Digital Games (FDG 09), 2009
Show abstract

We summarize our work on creative expression of emotion based on techniques from the arts.

57 Evolving expression of emotions in virtual humans using lights and pixels.
de Melo, C., & Gratch, J., Proceedings of Intelligent Virtual Agents (IVA 08), 2008
Show abstract

We summarize our work on using genetic algorithms to evolve emotion expression through lighting and color.

58 Expression of emotions in virtual humans using lights, shadows, composition and filters.
de Melo, C., & Paiva, A., Proceedings of Affective Computing and Intelligent Interaction (ACII 07), 2007
Show abstract

Artists use words, lines, shapes, color, sound and their bodies to express emotions. Virtual humans use postures, gestures, face and voice to express emotions.Why are they limiting themselves to the body? The digital medium affords the expression of emotions using lights, camera, sound and the pixels in the screen itself. Thus, leveraging on accumulated knowledge from the arts, this work proposes a model for the expression of emotions in virtual humans which goes beyond embodiment and explores lights, shadows, composition and filters to convey emotions. First, the model integrates the OCC emotion model for emotion synthesis. Second, the model defines a pixel-based lighting model which supports extensive expressive control of lights and shadows. Third, the model explores the visual arts techniques of composition in layers and filtering to manipulate the virtual human pixels themselves. Finally, the model introduces a markup language to define mappings between emotional states and multimodal expression.

59 Mainstream games in the multi-agent classroom.
de Melo, C., Prada, R., Raimundo, G., Pardal, J., Pinto, H., & Paiva, A., Proceedings of IEEE/WIC/ACM Intelligent Agent Technology (IAT 06), 2006
Show abstract

Computer games make learning fun and support learning through doing. Edutainment software tries to capitalize on this however, it has failed in reaching the levels of motivation and engagement seen in mainstream games. In this context, we have integrated a mainstream first-person shooter game, Counter-Strike, into the curriculum of our Autonomous Agents and Multi-agent Systems course. In this paper we describe this integration and a platform to support the creation of Counter-Strike agents. In addition, a questionnaire was posed to our students to assess the success of our approach. Results show that students found the idea of applying a first-person-shooter game motivating and the integration with the curriculum useful for their education.

60 A story about gesticulation expression.
de Melo, C., & Paiva, A., Proceedings of the Intelligent Virtual Agents Conference (IVA 06), 2006
Show abstract

Gesticulation is essential for the storytelling experience thus, virtual storytellers should be endowed with gesticulation expression. This work proposes a gesticulation expression model based on psycholinguistics. The model supports: (a) real-time gesticulation animation described as sequences of constraints on static (Portuguese Sign Language hand shapes, orientations and positions) and dynamic (motion profiles) features; (b) multimodal synchronization between gesticulation and speech; (c) automatic reproduction of annotated gesticulation according to GestuRA, a gesture transcription algorithm. To evaluate the model two studies, involving 147 subjects, were conducted. In both cases, the idea consisted of comparing the narration of the Portuguese traditional story “The White Rabbit” by a human storyteller with a version by a virtual storyteller. Results indicate that synthetic gestures fared well when compared to real gestures however, subjects preferred the human storyteller.

61 Environment expression: Expressing emotions through cameras, lights and music.
de Melo, C., & Paiva, A., Proceedings of Affective Computing and Intelligent Agents (ACII 05), 2005
Show abstract

Environment expression is about going beyond the usual Human emotion expression channels in virtual worlds. This work proposes an integrated storytelling model – the environment expression model – capable of expressing emotions through three channels: cinematography, illumination and music. Stories are organized into prioritized points of interest which can be characters or dialogues. Characters synthesize cognitive emotions based on the OCC emotion theory. Dialogues have collective emotional states which reflect the participants' emotional state. During storytelling, at each instant, the highest priority point of interest is focused through the expression channels. The cinematography channel and the illumination channel reflect the point of interest's strongest emotion type and intensity. The music channel reflects the valence of the point of interest's mood. Finally, a study was conducted to evaluate the model. Results confirm the influence of environment expression on emotion perception and reveal moderate success of this work's approach.

62 Environment expression: Telling stories through cameras, lights and music.
de Melo, C., & Paiva, A., Proceedings of The International Conference on Virtual Storytelling (ICV 05), 2005
Show abstract

This work proposes an integrated model – the environment expression model – which supports storytelling through three channels: cinematography, illumination and music. Stories are modeled as a set of points of interest which can be characters, dialogues or sceneries. At each instant, audience's focus is drawn to the highest priority point of interest. Expression channels reflect the type and emotional state of this point of interest. A study, using a cartoon-like application, was also conducted to evaluate the model. Results were inconclusive regarding influence on story interpretation but, succeeded in showing preference for stories told with environment expression.