Publications

1 ConceptGraphs: Open-vocabulary 3D scene graphs for perception and planning
Kuwajerwala, A., Gu, Q., Morin, S., Jatavallabhula, K., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., Gan, C., de Melo, C., Tenenbaum, J., Torralba, A., Shkurti, F., Paull, L., Proceedings of ICRA, 2024

For robots to perform a wide variety of tasks, they require a 3D representation of the world that is semantically rich, yet compact and efficient for task-driven perception and planning. Recent approaches have attempted to leverage features from large vision-language models to encode semantics in 3D representations. However, these approaches tend to produce maps with per-point feature vectors, which do not scale well in larger environments, nor do they contain semantic spatial relationships between entities in the environment, which are useful for downstream planning. In this work, we propose ConceptGraphs, an open-vocabulary graph-structured representation for 3D scenes. ConceptGraphs is built by leveraging 2D foundation models and fusing their output to 3D by multiview association. The resulting representations generalize to novel semantic classes, without the need to collect large 3D datasets or finetune models. We demonstrate the utility of this representation through a number of downstream planning tasks that are specified through abstract (language) prompts and require complex reasoning over spatial and semantic concepts.

2 Incorporating physics into data-driven computer vision
Kadambi, A., de Melo, C., Hsieh, C.-J., Srivastava, M., & Soatto, S., Nature Machine Intelligence, 2023

Many computer vision techniques infer properties of our physical world from images. While images are formed through the physics of light and mechanics, computer vision techniques are typically data-driven. This trend is mostly driven by performance: classical techniques from physics-based vision often do not score as high in metrics, compared to modern deep learning. However, recent research, covered in this Perspective, has shown that physical models can be included as a constraint in data-driven pipelines. In doing so, one can combine the performance benefits of a data-driven method with the advantages offered by a physics-based method, such as interpretability, falsifiability, and generalizability. The aim of this Perspective is to provide an overview of specific approaches for integrating physical models into artificial intelligence (AI) pipelines, referred to as physics-based machine learning. We discuss technical approaches that range from modifications to the dataset, network design, loss functions, optimization, and regularization schemes.

3 Social functions of machine emotional expressions
de Melo, C., Gratch, J., Marsella, S., & Pelachaud, C., Proceedings of the IEEE, 2023

Virtual humans and social robots frequently generate behaviors that human observers naturally see as expressing emotion. In this review article, we highlight that these expressions can have important benefits for human-machine interaction. We first summarize the psychological findings on how emotional expressions achieve important social functions in human relationships and highlight that artificial emotional expressions can serve analogous functions in human-machine interaction. We then review computational methods for determining what expressions make sense to generate within the context of an interaction and how to realize those expressions across multiple modalities, such as facial expressions, voice, language, and touch. The use of synthetic expressions raises a number of ethical concerns, and we conclude with a discussion of principles to achieve the benefits of machine emotion in ethical ways.

4 Synthetic-to-real domain adaptation for action recognition: A dataset and baseline performances
Reddy, A., Shah, K., Paul, W., Mocharla, R., Hoffman, J., Katyal, K., Manocha, D., de Melo, C., & Chellappa, R., Proceedings of International Conference on Robotics and Automation (ICRA), 2023

Human action recognition is a challenging problem, particularly when there is high variability in factors such as subject appearance, backgrounds, and viewpoint. While deep neural networks (DNNs) have been shown to perform well on action recognition tasks, they typically require large amounts of high-quality labeled data to achieve robust performance across a variety of conditions. Synthetic data has shown promise as a way to avoid the substantial costs, and potential practical and ethical issues, associated with collecting and labeling enormous amounts of data in the real world. However, synthetic data may differ from real data in important ways. This phenomenon, known as domain shift, can limit the utility of synthetic data in robotics applications. To mitigate the effects of domain shift, substantial effort is being dedicated to the development of domain adaptation (DA) techniques. Yet, much remains to be understood on how best to develop these techniques. In this paper, we introduce a new dataset, called Robot Control Gestures (RoCoG-v2), composed of corresponding real and synthetic videos, to support the study of synthetic-to-real domain shift in video action recognition. Our work expands upon existing datasets by focusing the action classes on gestures for human-robot teaming, as well as by enabling investigation of domain shift in both ground and aerial views. We present baseline results using state-of-the-art action recognition and domain adaptation algorithms and offer initial insight on tackling the synthetic-to-real and ground-to-air domain shifts. A link to the dataset and corresponding documentation can be found at https://github.com/reddyav1/RoCoG-v2.

5 Next-generation deep learning based on simulators and synthetic data
de Melo, C., Torralba, A., Guibas, L., DiCarlo, J., Chellappa, R., & Hodgins, J., Trends in Cognitive Sciences, 2021

Deep learning (DL) is being successfully applied across multiple domains, yet these models learn in a most artificial way: they require large quantities of labeled data to grasp even simple concepts. Thus, the main bottleneck is often access to supervised data. Here, we highlight a trend toward a potential solution to this challenge: synthetic data. Synthetic data are becoming accessible due to progress in rendering pipelines, generative adversarial models, and fusion models. Moreover, advancements in domain adaptation techniques help close the statistical gap between synthetic and real data. Paradoxically, this artificial solution is also likely to enable more natural learning, as seen in biological systems, including continual, multimodal, and embodied learning. Complementary to this, simulators and deep neural networks (DNNs) will also have a critical role in providing insight into the cognitive and neural functioning of biological systems. We also review the strengths of, and opportunities and novel challenges associated with, synthetic data.

6 Emotion expressions shape human social norms and reputations
de Melo, C., Terada, K., & Santos, F., iScience, 2021

The emergence of pro-social behaviors remains a key open challenge across disciplines. In this context, there is growing evidence that expressing emotions may foster human cooperation. However, it remains unclear how emotions shape individual choices and interact with other cooperation mechanisms. Here, we provide a comprehensive experimental analysis of the interplay of emotion expressions with two important mechanisms: direct and indirect reciprocity. We show that cooperation in an iterated prisoner's dilemma emerges from the combination of the opponent's initial reputation, past behaviors, and emotion expressions. Moreover, all factors influenced the social norm adopted when assessing the actions of others (i.e., how their counterparts' reputations are updated), thus reflecting longer-term consequences. We expose a new class of emotion-based social norms, where emotions are used to forgive those that defect but also to punish those that cooperate. These findings emphasize the importance of emotion expressions in fostering, directly and indirectly, cooperation in society.

7 The interplay of emotion expressions and strategy in promoting cooperation in the iterated prisoner's dilemma
de Melo, C., & Terada, K., Scientific Reports, 2020

The iterated prisoner's dilemma has been used to study human cooperation for decades. The recent discovery of extortion and generous strategies renewed interest in the role of strategy in shaping behavior in this dilemma. But what if players could perceive each other's emotional expressions? Despite increasing evidence that emotion signals influence decision making, the effects of emotion in this dilemma have been mostly neglected. Here we show that emotion expressions moderate the effect of generous strategies, increasing or reducing cooperation according to the intention communicated by the signal; in contrast, expressions by extortionists had no effect on participants' behavior, revealing a limitation of highly competitive strategies. We provide evidence that these effects are mediated mostly by inferences about others' intentions made from strategy and emotion. These findings provide insight into the value, as well as the limits, of behavioral strategies and emotion signals for cooperation.

8 Vision-based gesture recognition in human-robot teams using synthetic data
de Melo, C., Rothrock, B., Gurram, P., Ulutan, O., & Manjunath, B. S., Proceedings of International Conference on Intelligent Robots and Systems (IROS), 2020

Building successful collaboration between humans and robots requires efficient, effective, and natural communication. Here we study an RGB-based deep learning approach for controlling robots through gestures (e.g., "follow me"). To address the challenge of collecting high-quality annotated data from human subjects, synthetic data is considered for this domain. We contribute a dataset of gestures that includes real videos with human subjects and synthetic videos from our custom simulator. A solution is presented for gesture recognition based on the state-of-the-art I3D model. Comprehensive testing was conducted to optimize the parameters for this model. Finally, to gather insight on the value of synthetic data, several experiments are described that systematically study the properties of synthetic data (e.g., gesture variations, character variety, generalization to new gestures). We discuss practical implications for the design of effective human-robot collaboration and the usefulness of synthetic data for deep learning.

9 Human cooperation when acting through autonomous machines
de Melo, C., Marsella, S., & Gratch, J., Proceedings of the National Academy of Sciences U.S.A., 116, 3482-3487, 2019

Recent times have seen an emergence of intelligent machines that act autonomously on our behalf, such as autonomous vehicles. Despite promises of increased efficiency, it is not clear whether this paradigm shift will change how we decide when our self-interest (e.g., comfort) is pitted against the collective interest (e.g., environment). Here we show that acting through machines changes the way people solve these social dilemmas and we present experimental evidence showing that participants program their autonomous vehicles to act more cooperatively than if they were driving themselves. We show this happens because programming causes selfish short-term rewards to become less salient, leading to considerations of broader societal goals. We also show that the programmed behavior is influenced by past experience. Finally, we report evidence that the effect generalizes beyond the domain of autonomous vehicles. We discuss implications for designing autonomous machines that contribute to a more cooperative society.

10 Reading people's minds from emotion expressions in interdependent decision making
de Melo, C., Carnevale, P., Read, S., & Gratch, J., Journal of Personality and Social Psychology, 106(1), 73-88, 2014

How do people make inferences about other people's minds from their emotion displays? The ability to infer others' beliefs, desires, and intentions from their facial expressions should be especially important in interdependent decision making, when people make decisions based on beliefs about the others' intention to cooperate. Five experiments tested the general proposition that people follow principles of appraisal when making inferences from emotion displays, in context. Experiment 1 found that the same emotion display produced opposite effects depending on context: when the other was competitive, a smile on the other's face evoked a more negative response than when the other was cooperative. Experiment 2 found that the essential information from emotion displays was derived from appraisals (e.g., is the current state of affairs conducive to my goals? Who is to blame for it?): facial displays of emotion had the same impact on people's decision making as textual expressions of the corresponding appraisals. Experiments 3, 4, and 5 used multiple mediation analyses and a causal-chain design: results supported the proposition that beliefs about others' appraisals mediate the effects of emotion displays on expectations about others' intentions. We suggest a model based on appraisal theories of emotion that posits an inferential mechanism whereby people retrieve, from emotion expressions, information about others' appraisals, which then leads to inferences about others' mental states. This work has implications for the design of algorithms that drive agent behavior in human-agent strategic interaction, an emerging domain at the interface of computer science and social psychology.