Ella: Embodied Social Agents with Lifelong Memory

1 University of Massachusetts Amherst 2 Johns Hopkins University 3 Tsinghua University

Overview

We introduce Ella, an embodied social agent capable of lifelong learning within a community in a 3D open world, where agents accumulate experiences and acquire knowledge through everyday visual observations and social interactions. At the core of Ella's capabilities is a structured long-term multimodal memory system that stores, updates, and retrieves information effectively. It consists of a name-centric semantic memory for organizing acquired knowledge and a spatiotemporal episodic memory for capturing multimodal experiences. By integrating this lifelong memory system with foundation models, Ella retrieves relevant information for decision-making, plans daily activities, builds social relationships, and evolves autonomously while coexisting with other intelligent beings in the open world. We conduct capability-oriented evaluations in a dynamic 3D open world where 15 agents engage in social activities for days and are assessed with a suite of unseen controlled evaluations. Experimental results show that Ella can influence, lead, and cooperate effectively with other agents to achieve goals, showcasing its ability to learn through observation and social interaction. Our findings highlight the transformative potential of combining structured memory systems with foundation models for advancing embodied intelligence.


Ella: Embodied Lifelong Learning Agent

We build long-term memory in two forms: (a) a name-centric semantic memory that organizes knowledge in a name-centric graph, including a hierarchical scene graph that serves as the spatial memory; and (b) a spatiotemporal episodic memory that stores experience as a series of events, each consisting of time, location, and multimodal content. (c) Ella first generates a daily schedule from the knowledge and experiences retrieved from long-term memory, then updates the memory based on (d) visual observations of the environment and (e) social interactions with other agents, and (f) reacts accordingly, which includes (f1) revising the schedule, (f2) interacting with the environment, and (f3) engaging in a conversation.
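The two memory forms described above can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not Ella's actual implementation: the class and method names (`Event`, `LifelongMemory`, `add_fact`, `recall`), the 2D locations, the text stand-in for multimodal content, and the keyword-plus-radius retrieval rule are all hypothetical choices made for the example. The source does not specify how retrieval is performed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Event:
    """One episodic-memory entry: when, where, and what was experienced."""
    time: float                    # simulation time (assumed unit: hours)
    location: Tuple[float, float]  # 2D position (a simplification)
    content: str                   # text stand-in for multimodal content


@dataclass
class LifelongMemory:
    """Hypothetical container for the two memory forms."""
    # Semantic memory: name -> set of (relation, other_name) edges.
    # A scene hierarchy can be expressed with edges such as "located_in".
    semantic: Dict[str, Set[Tuple[str, str]]] = field(default_factory=dict)
    # Episodic memory: events in the order they were recorded.
    episodic: List[Event] = field(default_factory=list)

    def add_fact(self, name: str, relation: str, other: str) -> None:
        """Add a directed edge name -[relation]-> other to the name graph."""
        self.semantic.setdefault(name, set()).add((relation, other))
        self.semantic.setdefault(other, set())  # make sure the node exists

    def add_event(self, event: Event) -> None:
        """Append a new experience to episodic memory."""
        self.episodic.append(event)

    def facts_about(self, name: str) -> List[Tuple[str, str]]:
        """Return the outgoing edges of a name, sorted for determinism."""
        return sorted(self.semantic.get(name, set()))

    def recall(self, keyword: str, near: Tuple[float, float],
               radius: float, since: float = 0.0) -> List[Event]:
        """Return matching events, most recent first.

        An event matches if its content contains the keyword
        (case-insensitive), it lies within `radius` of `near`, and it
        happened at or after `since`. This filter is an assumption.
        """
        hits = []
        for e in self.episodic:
            dx = e.location[0] - near[0]
            dy = e.location[1] - near[1]
            close_enough = dx * dx + dy * dy <= radius * radius
            if close_enough and e.time >= since and keyword.lower() in e.content.lower():
                hits.append(e)
        return sorted(hits, key=lambda e: e.time, reverse=True)
```

In this sketch, an agent would call `add_fact` and `add_event` as it observes the world and talks to others, then call `facts_about` and `recall` before planning its schedule. A real system would likely replace the keyword filter with a learned similarity search over multimodal embeddings.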


Controlled Finals

Influence Battle
Leadership Quest
