Incorporating physics into data-driven computer vision
Kadambi, A., de Melo, C., Hsieh, C.-J., Srivastava, M., & Soatto, S., Nature Machine Intelligence, 2023
Many computer vision techniques infer properties of our physical world from images. While images are formed through the physics of light and mechanics, computer vision techniques are typically data-driven. This trend is mostly driven by performance: classical techniques from physics-based vision often do not score as high on metrics as modern deep learning. However, recent research, covered in this Perspective, has shown that physical models can be included as a constraint in data-driven pipelines. In doing so, one can combine the performance benefits of a data-driven method with advantages offered by a physics-based method, such as interpretability, falsifiability, and generalizability. The aim of this Perspective is to provide an overview of specific approaches for integrating physical models into artificial intelligence (AI) pipelines, referred to as physics-based machine learning. We discuss technical approaches that include modifications to the dataset, network design, loss functions, optimization, and regularization schemes.
Social functions of machine emotional expressions
de Melo, C., Gratch, J., Marsella, S., & Pelachaud, C., Proceedings of IEEE, 2023
Virtual humans and social robots frequently generate behaviors that human observers naturally see as expressing emotion. In this review article, we highlight that these expressions can have important benefits for human-machine interaction. We first summarize the psychological findings on how emotional expressions achieve important social functions in human relationships and highlight that artificial emotional expressions can serve analogous functions in human-machine interaction. We then review computational methods for determining what expressions make sense to generate within the context of an interaction and how to realize those expressions across multiple modalities such as facial expressions, voice, language and touch. The use of synthetic expressions raises a number of ethical concerns and we conclude with a discussion of principles to achieve the benefits of machine emotion in ethical ways.
ConceptFusion: Open-set Multimodal 3D Mapping
Jatavallabhula, K., Kuwajerwala, A., Gu, Q., Omama, M., Chen, T., Maalouf, A., Li, S., Iyer, G., Saryazdi, S., Keetha, N., Tewari, A., Tenenbaum, J., de Melo, C., Krishna, M., Paull, L., Shkurti, F., & Torralba, A., Proceedings of Robotics: Science and Systems (RSS), 2023
Building 3D maps of the environment is central to robot navigation, planning, and interaction with objects in a scene. Most existing approaches that integrate semantic concepts with 3D maps largely remain confined to the closed-set setting: they can only reason about a finite set of concepts, pre-defined at training time. Further, these maps can only be queried using class labels, or in recent work, using text prompts. We address both these issues with ConceptFusion, a scene representation that is: (i) fundamentally open-set, enabling reasoning beyond a closed set of concepts, and (ii) inherently multi-modal, enabling a diverse range of possible queries to the 3D map, from language, to images, to audio, to 3D geometry, all working in concert. ConceptFusion leverages the open-set capabilities of today’s foundation models pre-trained on internet-scale data to reason about concepts across modalities such as natural language, images, and audio. We demonstrate that pixel-aligned open-set features can be fused into 3D maps via traditional SLAM and multi-view fusion approaches. This enables effective zero-shot spatial reasoning, without any additional training or finetuning, and retains long-tailed concepts better than supervised approaches, outperforming them by a margin of more than 40% on 3D IoU. We extensively evaluate ConceptFusion on a number of real-world datasets, simulated home environments, a real-world tabletop manipulation task, and an autonomous driving platform. We showcase new avenues for blending foundation models with 3D open-set multimodal mapping.
Synthetic-to-real domain adaptation for action recognition: A dataset and baseline performances
Reddy, A., Shah, K., Paul, W., Mocharla, R., Hoffman, J., Katyal, K., Manocha, D., de Melo, C., & Chellappa, R., Proceedings of International Conference on Robotics and Automation (ICRA), 2023
Human action recognition is a challenging problem, particularly when there is high variability in factors such as subject appearance, backgrounds and viewpoint. While deep neural networks (DNNs) have been shown to perform well on action recognition tasks, they typically require large amounts of high-quality labeled data to achieve robust performance across a variety of conditions. Synthetic data has shown promise as a way to avoid the substantial costs and the potential practical and ethical issues associated with collecting and labeling enormous amounts of data in the real world. However, synthetic data may differ from real data in important ways. This phenomenon, known as domain shift, can limit the utility of synthetic data in robotics applications. To mitigate the effects of domain shift, substantial effort is being dedicated to the development of domain adaptation (DA) techniques. Yet, much remains to be understood on how best to develop these techniques. In this paper, we introduce a new dataset, called Robot Control Gestures (RoCoG-v2), composed of corresponding real and synthetic videos, to support the study of synthetic-to-real domain shift in video action recognition. Our work expands upon existing datasets by focusing the action classes on gestures for human-robot teaming, as well as by enabling investigation of domain shift in both ground and aerial views. We present baseline results using state-of-the-art action recognition and domain adaptation algorithms and offer initial insight on tackling the synthetic-to-real and ground-to-air domain shifts. A link to the dataset and corresponding documentation can be found at https://github.com/reddyav1/RoCoG-v2.
Next-generation deep learning based on simulators and synthetic data
de Melo, C., Torralba, A., Guibas, L., DiCarlo, J., Chellappa, R., & Hodgins, J., Trends in Cognitive Sciences, 2021
Deep learning (DL) is being successfully applied across multiple domains, yet these models learn in a most artificial way: they require large quantities of labeled data to grasp even simple concepts. Thus, the main bottleneck is often access to supervised data. Here, we highlight a trend in a potential solution to this challenge: synthetic data. Synthetic data are becoming accessible due to progress in rendering pipelines, generative adversarial models, and fusion models. Moreover, advancements in domain adaptation techniques help close the statistical gap between synthetic and real data. Paradoxically, this artificial solution is also likely to enable more natural learning, as seen in biological systems, including continual, multimodal, and embodied learning. Complementary to this, simulators and deep neural networks (DNNs) will also have a critical role in providing insight into the cognitive and neural functioning of biological systems. We also review the strengths of, and opportunities and novel challenges associated with, synthetic data.
Emotion expressions shape human social norms and reputations
de Melo, C., Terada, K., & Santos, F., iScience, 2021
The emergence of pro-social behaviors remains a key open challenge across disciplines. In this context, there is growing evidence that expressing emotions may foster human cooperation. However, it remains unclear how emotions shape individual choices and interact with other cooperation mechanisms. Here, we provide a comprehensive experimental analysis of the interplay of emotion expressions with two important mechanisms: direct and indirect reciprocity. We show that cooperation in an iterated prisoner's dilemma emerges from the combination of the opponent's initial reputation, past behaviors, and emotion expressions. Moreover, all factors influenced the social norm adopted when assessing the actions of others (i.e., how their counterparts' reputations are updated), thus reflecting longer-term consequences. We expose a new class of emotion-based social norms, where emotions are used to forgive those that defect but also punish those that cooperate. These findings emphasize the importance of emotion expressions in fostering, directly and indirectly, cooperation in society.
The interplay of emotion expressions and strategy in promoting cooperation in the iterated prisoner's dilemma
de Melo, C., & Terada, K., Scientific Reports, 2020
The iterated prisoner's dilemma has been used to study human cooperation for decades. The recent discovery of extortion and generous strategies renewed interest in the role of strategy in shaping behavior in this dilemma. But what if players could perceive each other's emotional expressions? Despite increasing evidence that emotion signals influence decision making, the effects of emotion in this dilemma have been mostly neglected. Here we show that emotion expressions moderate the effect of generous strategies, increasing or reducing cooperation according to the intention communicated by the signal; in contrast, expressions by extortionists had no effect on participants' behavior, revealing a limitation of highly competitive strategies. We provide evidence that these effects are mediated mostly by inferences about others' intentions made from strategy and emotion. These findings provide insight into the value, as well as the limits, of behavioral strategies and emotion signals for cooperation.
Vision-based gesture recognition in human-robot teams using synthetic data
de Melo, C., Rothrock, B., Gurram, P., Ulutan, O., & Manjunath, B. S., Proceedings of International Conference on Intelligent Robots and Systems (IROS), 2020
Building successful collaboration between humans and robots requires efficient, effective, and natural communication. Here we study an RGB-based deep learning approach for controlling robots through gestures (e.g., “follow me”). To address the challenge of collecting high-quality annotated data from human subjects, synthetic data is considered for this domain. We contribute a dataset of gestures that includes real videos with human subjects and synthetic videos from our custom simulator. A solution is presented for gesture recognition based on the state-of-the-art I3D model. Comprehensive testing was conducted to optimize the parameters for this model. Finally, to gather insight on the value of synthetic data, several experiments are described that systematically study the properties of synthetic data (e.g., gesture variations, character variety, generalization to new gestures). We discuss practical implications for the design of effective human-robot collaboration and the usefulness of synthetic data for deep learning.
Human cooperation when acting through autonomous machines
de Melo, C., Marsella, S., & Gratch, J., Proceedings of the National Academy of Sciences U.S.A., 116, 3482-3487, 2019
Recent times have seen an emergence of intelligent machines that act autonomously on our behalf, such as autonomous vehicles. Despite promises of increased efficiency, it is not clear whether this paradigm shift will change how we decide when our self-interest (e.g., comfort) is pitted against the collective interest (e.g., environment). Here we show that acting through machines changes the way people solve these social dilemmas and we present experimental evidence showing that participants program their autonomous vehicles to act more cooperatively than if they were driving themselves. We show this happens because programming causes selfish short-term rewards to become less salient, leading to considerations of broader societal goals. We also show that the programmed behavior is influenced by past experience. Finally, we report evidence that the effect generalizes beyond the domain of autonomous vehicles. We discuss implications for designing autonomous machines that contribute to a more cooperative society.
Reading people's minds from emotion expressions in interdependent decision making
de Melo, C., Carnevale, P., Read, S., & Gratch, J., Journal of Personality and Social Psychology, 106(1), 73-88, 2014
How do people make inferences about other people's minds from their emotion displays? The ability to infer others' beliefs, desires and intentions from their facial expressions should be especially important in interdependent decision making when people make decisions from beliefs about the others' intention to cooperate. Five experiments tested the general proposition that people follow principles of appraisal when making inferences from emotion displays, in context. Experiment 1 found that the same emotion display produced opposite effects depending on context: when the other was competitive, a smile on the other's face evoked a more negative response than when the other was cooperative. Experiment 2 found that the essential information from emotion displays was derived from appraisals (e.g., is the current state-of-affairs conducive to my goals? Who is to blame for it?); facial displays of emotion had the same impact on people's decision making as textual expressions of the corresponding appraisals. Experiments 3, 4 and 5 used multiple mediation analyses and a causal-chain design: results supported the proposition that beliefs about others' appraisals mediate the effects of emotion displays on expectations about others' intentions. We suggest a model based on appraisal theories of emotion that posits an inferential mechanism whereby people retrieve, from emotion expressions, information about others' appraisals, which then lead to inferences about others' mental states. This work has implications for the design of algorithms that drive agent behavior in human-agent strategic interaction, an emerging domain at the interface of computer science and social psychology.