VIGR Seminar

About


VIGR Seminar is a lively and interactive forum, held bi-weekly, where students and faculty affiliated with the visual computing community gather for a one-hour presentation on recent research. The topics span Computer Vision, Machine Learning, Robotics, Graphics, and related themes. The Seminar is held on Thursdays from 2:00 pm to 3:00 pm (unless otherwise noted) in CEPSR 620 (currently on Zoom). See our YouTube channel for all talks.

Talks




Bio: Samuel Ainsworth is a Senior Research Scientist at Cruise AI Research where he studies imitation learning, robustness, and efficiency. He completed his undergraduate degree in Computer Science and Applied Mathematics at Brown University and received his PhD from the School of Computer Science and Engineering at the University of Washington. His research interests span reinforcement learning, deep learning, programming languages, and drug discovery. He has previously worked on recommender systems, Bayesian optimization, and variational inference at organizations such as The New York Times and Google.

Title: Git Re-Basin: Merging Models modulo Permutation Symmetries
Speaker: Samuel Ainsworth
Time: January 26, 2023; 1:00 pm ET


Abstract: The success of deep learning is due in large part to our ability to solve certain massive non-convex optimization problems with relative ease. Though non-convex optimization is NP-hard, simple algorithms -- often variants of stochastic gradient descent -- exhibit surprising effectiveness in fitting large neural networks in practice. We argue that neural network loss landscapes contain (nearly) a single basin after accounting for all possible permutation symmetries of hidden units a la Entezari et al. (2021). We introduce three algorithms to permute the units of one model to bring them into alignment with a reference model in order to merge the two models in weight space. This transformation produces a functionally equivalent set of weights that lie in an approximately convex basin near the reference model. Experimentally, we demonstrate the single basin phenomenon across a variety of model architectures and datasets, including the first (to our knowledge) demonstration of zero-barrier linear mode connectivity between independently trained ResNet models on CIFAR-10 and CIFAR-100. Additionally, we identify intriguing phenomena relating model width and training time to mode connectivity. Finally, we discuss shortcomings of the linear mode connectivity hypothesis, including a counterexample to the single basin theory.
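As a rough, self-contained illustration of the weight-space merging idea (not the exact matching algorithms from the talk), the sketch below aligns the hidden units of one single-hidden-layer MLP to a reference MLP by solving a linear assignment problem on the weights, then linearly interpolates the aligned weights. The two-layer architecture and the particular unit-similarity score are assumptions made for brevity.

```python
# Minimal sketch: align the hidden units of `other` to `ref` via a linear
# assignment on the weights, then average in weight space. Single-hidden-layer
# MLPs (W1: hidden x in, b1: hidden, W2: out x hidden) are assumed for brevity;
# this is illustrative, not the exact Git Re-Basin algorithms.
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_and_merge(ref, other, lam=0.5):
    (W1r, b1r, W2r), (W1o, b1o, W2o) = ref, other
    # Similarity of ref unit i to other unit j, using incoming and outgoing weights.
    sim = W1r @ W1o.T + np.outer(b1r, b1o) + W2r.T @ W2o
    _, perm = linear_sum_assignment(sim, maximize=True)   # best unit-to-unit matching
    # Permuting hidden units (and matching rows/columns) leaves other's function unchanged.
    W1o, b1o, W2o = W1o[perm], b1o[perm], W2o[:, perm]
    return (lam * W1r + (1 - lam) * W1o,
            lam * b1r + (1 - lam) * b1o,
            lam * W2r + (1 - lam) * W2o)

# Usage with random weights (hidden=16, in=8, out=4).
rng = np.random.default_rng(0)
make_mlp = lambda: (rng.normal(size=(16, 8)), rng.normal(size=16), rng.normal(size=(4, 16)))
merged_W1, merged_b1, merged_W2 = align_and_merge(make_mlp(), make_mlp())
```

Because the permutation produces a functionally equivalent copy of the second model, the interpolation happens between two networks that compute the same functions as the originals.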




Bio: Animesh Garg is a Stephen Fleming Early Career Professor at the School of Interactive Computing at Georgia Tech. He leads the People, AI, and Robotics (PAIR) research group. He is on the core faculty in the Robotics and Machine Learning programs. Animesh is also a Senior Researcher at Nvidia Research. Animesh earned a Ph.D. from UC Berkeley and was a postdoc at the Stanford AI Lab. He is on leave from the department of Computer Science at the University of Toronto and the CIFAR Chair position at the Vector Institute. His work aims to build Generalizable Autonomy which involves a confluence of representations and algorithms for reinforcement learning, control, and perception. He currently studies three aspects: learning structured inductive biases in sequential decision-making, using data-driven causal discovery, and transfer to real robots — all in the purview of embodied systems.

Title: Building Blocks of Generalizable Autonomy: Duality of Discovery & Bias
Speaker: Animesh Garg
Time: January 19, 2023; 1:00 pm ET


Abstract: Generalization in embodied intelligence, such as robotics, requires interactive learning across families of tasks, which is essential for discovering efficient representation and inference mechanisms. Current systems need a lot of hand-holding to learn even a single cognitive concept or dexterous skill, say “open a door”, let alone generalize to new windows and cupboards! This is far from our vision of everyday robots, which would require a broader concept of generalization and a continual update of representations. This study of the science of embodied AI opens three key questions: (a) Representational biases & Causal inference for interactive decision-making, (b) Perceptual representations learned by and for interaction, and (c) Systems and abstractions for scalable learning. This talk will focus on the notions of structure in Embodied AI for both perception and decision-making, uncovering the many facets of inductive biases in off-policy reinforcement learning in robotics. We will first talk about the need for and kinds of structure in perception for robotics; thereafter, we will discuss the existence of structure in different aspects of decision-making with RL. Finally, I will propose a framework for generalization through the separation of the `what` and `how` of skills in embodied domains.




Bio: Georgia Gkioxari is an incoming faculty at Caltech. She received a PhD in Computer Science and Electrical Engineering from the University of California at Berkeley under the supervision of Jitendra Malik in 2016. She spent 6 years as a research scientist at Facebook AI Research (now Meta AI). In 2017, Georgia received the Marr Prize at ICCV for "Mask R-CNN". In 2021, she received the PAMI Young Researcher Award and the PAMI Mark Everingham Prize for the Detectron library suite. She hates writing her bio!

Title: Toward 3D Visual Recognition: New Benchmarks and Methods for 3D in the Wild
Speaker: Georgia Gkioxari
Time: December 8, 2022; 1:00 pm ET


Abstract: How to make machines recognize objects and scenes from visual data is a longstanding problem of computer vision and AI. Over the last decade, 2D visual recognition, which encompasses image classification, object detection, and instance segmentation, has advanced visual perception. Progress has been so striking that industry products built around content understanding successfully integrate models we computer vision researchers build. Today, we are entering a new, more exciting era that aims to build interactive, anthropocentric technology, such as AR, VR, robotics, and assistive tech. To enable these, visual perception needs to graduate from 2D to 3D and unlock the ability to recognize the 3D world instead of its 2D image projection. In this talk, I will present a line of work around the problem of 3D visual recognition, starting from insights, new benchmarks, and tasks to models and learning schemes for 3D object recognition in the wild, with and without 3D supervision.




Bio: Prithviraj (Raj) Ammanabrolu is a postdoctoral researcher on the Mosaic team at the Allen Institute for AI. He received his PhD in Computer Science from the School of Interactive Computing at the Georgia Institute of Technology, where he was advised by Professor Mark Riedl as a research assistant in the Entertainment Intelligence Lab.

Title: Modeling Worlds in Text
Speaker: Prithviraj (Raj) Ammanabrolu
Time: December 6, 2022; 1:00 pm ET


Abstract: How do we develop interactive agents that can operate via language? Agents that can both learn from feedback and generate contextually relevant language grounded in the world around them. One way of doing this is by creating world-modeling agents that have an intrinsic motivation, curiosity, to learn the underlying rules and commonsense axioms of the world they are operating in and use that to better inform their actions. This talk focuses on the task of building world models of text-based environments. Text-based environments, or interactive language games, are reinforcement learning environments in which agents perceive and interact with the world using textual natural language. These environments contain long, multi-step puzzles or quests woven through a world that is filled with hundreds of characters, locations, and objects. Task domains range from agents making recipes and performing science experiments all the way to defeating monsters to collect treasure. I will first describe what a world model looks like (spoiler: it looks a lot like a knowledge graph), how to dynamically build it via an information-seeking intrinsic motivation, and how automated agents can use it to more effectively effect change via natural language.
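To make the knowledge-graph world model and the information-seeking reward concrete, here is a toy, hypothetical sketch (not the speaker's system): the agent stores (subject, relation, object) triples extracted from text observations, and its curiosity bonus is simply the number of previously unknown facts an observation reveals. The extractor and the pipe-separated observation format are placeholder assumptions.

```python
# Toy sketch of a knowledge-graph world model for a text game with an
# information-seeking intrinsic reward. The triple extractor and the observation
# format are assumed placeholders, not the speaker's actual system.

class KGWorldModel:
    def __init__(self):
        self.triples = set()          # (subject, relation, object)

    def update(self, extracted_triples):
        """Add triples parsed from a text observation; return how many are new."""
        new = set(extracted_triples) - self.triples
        self.triples |= new
        return len(new)

def intrinsic_reward(world_model, observation, extract):
    """Curiosity bonus: number of previously unknown facts revealed."""
    return world_model.update(extract(observation))

# Example with a trivial extractor that expects "subject|relation|object" lines.
extract = lambda obs: [tuple(line.split("|")) for line in obs.splitlines() if "|" in line]
wm = KGWorldModel()
print(intrinsic_reward(wm, "kitchen|has|stove\nyou|in|kitchen", extract))  # 2 new facts
print(intrinsic_reward(wm, "kitchen|has|stove", extract))                  # 0, already known
```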




Bio: Silvia is a fourth-year Computer Science PhD student at the University of Toronto. She is advised by Alec Jacobson and works in Computer Graphics and Geometry Processing. She is a Vanier Doctoral Scholar, an Adobe Research Fellow, and the winner of the 2021 University of Toronto Arts & Science Dean's Doctoral Excellence Scholarship. She has interned twice at Adobe Research and twice at the Fields Institute of Mathematics. She is also a founder and organizer of the Toronto Geometry Colloquium and a member of WiGRAPH. She is currently surveying potential postdoc and faculty positions starting in Fall 2024.

Title: Uncertain Surface Reconstruction
Speaker: Silvia Sellán
Time: November 14, 2022; 1:00 pm ET


Abstract: We propose a method to introduce uncertainty to the surface reconstruction problem. Specifically, we introduce a statistical extension of the classic Poisson Surface Reconstruction algorithm for recovering shapes from 3D point clouds. Instead of outputting an implicit function, we represent the reconstructed shape as a modified Gaussian Process, which allows us to conduct statistical queries (e.g., the likelihood of a point in space being on the surface or inside a solid). We show that this perspective improves PSR's integration into the online scanning process, broadens its application realm, and opens the door to other lines of research such as applying task-specific priors.
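For intuition about the kind of statistical query the abstract mentions, here is a generic Gaussian-process implicit-surface sketch; it is not the modified Gaussian Process from the talk, and the RBF kernel, length scale, and off-surface sampling scheme are all assumptions. Given a Gaussian posterior over the implicit value at a query point, the probability of being inside the solid is just a Gaussian CDF.

```python
# Generic Gaussian-process implicit-surface sketch (not the modified GP from the
# talk): fit a GP to signed observations built from an oriented point cloud, then
# answer a statistical query: the probability that a query point lies inside.
import numpy as np
from scipy.stats import norm

def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def gp_posterior(X, y, Xq, noise=1e-4):
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xq)
    sol = np.linalg.solve(K, Ks)
    mean = sol.T @ y
    var = np.clip(np.diag(rbf(Xq, Xq) - Ks.T @ sol), 1e-12, None)
    return mean, var

# 2D toy shape: oriented points on the unit circle; off-surface samples along the
# normals are labeled +eps (outside) and -eps (inside), surface points are 0.
t = np.linspace(0, 2 * np.pi, 40, endpoint=False)
P = np.stack([np.cos(t), np.sin(t)], axis=1)
N = P.copy()                                   # outward normals of the unit circle
eps = 0.1
X = np.concatenate([P, P + eps * N, P - eps * N])
y = np.concatenate([np.zeros(len(P)), np.full(len(P), eps), np.full(len(P), -eps)])

Xq = np.array([[0.0, 0.0], [2.0, 0.0]])        # an interior point and an exterior point
mean, var = gp_posterior(X, y, Xq)
p_inside = norm.cdf(-mean / np.sqrt(var))      # P(implicit value < 0) at each query
print(p_inside)                                # larger for (0, 0) than for (2, 0)
```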




Bio: Wei-Chiu Ma is a final-year Ph.D. candidate at MIT, working with Antonio Torralba and Raquel Urtasun. His research lies at the intersection of computer vision, robotics, and graphics, with a focus on in-the-wild 3D modeling and simulation and their applications to self-driving vehicles. Wei-Chiu is a recipient of the Siebel Scholarship. His work has been covered by media outlets such as WIRED, DeepLearning.AI, and MIT News. Previously, Wei-Chiu was a Sr. Research Scientist at Uber ATG R&D. He received his M.S. in Robotics from CMU, where he was advised by Kris Kitani, and his B.S. in EE from National Taiwan University.

Title: Learning in-the-wild 3D Modeling and Simulation
Speaker: Wei-Chiu Ma
Time: October 20, 2022; 1:00 pm ET


Abstract: Humans have extraordinary capabilities of comprehending and reasoning about our 3D visual world. With just a few casual glances, we can grasp the 3D structure and appearances of the surroundings and imagine all sorts of “what-if” scenarios in our minds. The long-term goal of my research is to equip computers with similar abilities. In this talk, I will present a few of our efforts toward this goal.

I will start by discussing 3D reconstruction from sparse views, where the camera poses are unknown and the images have little or no overlap. While existing approaches struggle in this setting, humans seem adept at making sense of it. Our key insight is to imitate humans and distill prior knowledge about objects into the algorithms. By leveraging those priors, we can significantly expand the applicable domains of existing 3D systems and unleash the potential of multiple downstream tasks in extreme-view setups. Then, I will talk about how to recover scene representations from a set of images with imperfect camera poses. While most existing works assume perfect calibration, camera poses are usually noisy in the real world. To this end, we present a coarse-to-fine optimization framework that allows one to jointly estimate the underlying 3D representations and camera poses. The framework can be plugged into existing methods when transferring them to real-world setups. Finally, I will showcase how we exploit “what-if” scenarios to build sensor simulation systems for self-driving and enable efficient robot manipulation.




Bio: James Tompkin is the John E. Savage Assistant Professor of Computer Science at Brown University. His research at the intersection of computer vision, computer graphics, and human-computer interaction helps develop new visual computing tools and experiences. His doctoral work at University College London on large-scale video processing and exploration techniques led to creative exhibition work in the Museum of the Moving Image in New York City. Postdoctoral work at Max-Planck-Institute for Informatics and Harvard University helped create new methods to edit content within images and videos. Recent research has developed new techniques for low-level depth reconstruction, view synthesis for VR, and content editing and generation.

Title: More Cameras and Better Cameras: Time-of-Flight Radiance Fields and Scalable Hybrid Structures for Scene Reconstruction
Speaker: James Tompkin
Time: October 13, 2022; 1:00 pm ET


Abstract: Scene reconstruction enables applications across visual computing, including media creation and editing, and capturing the real-world for telecommunication. Building better cameras through multimodal sensing could help us as long as we can integrate their signals effectively, and I will discuss one way to add time-of-flight imaging to volume reconstruction to aid monocular dynamic scene reconstruction (imaging.cs.cmu.edu/torf/). More cameras can also help us reconstruct larger scenes as long as we can scale our methods effectively, and I will discuss one way to achieve this through hybrid neural fields for high-quality indoor reconstruction and interactive rendering (https://xchaowu.github.io/papers/scalable-nisr/). Finally, in Q&A, we will discuss the merits of cameras that 'do more with less', or whether 'more really is better'.




Bio: Antoine Miech is a Senior Research Scientist at DeepMind. He completed his computer vision Ph.D. at Inria and Ecole Normale Supérieure, working with Dr. Ivan Laptev and Dr. Josef Sivic. His main research interests are vision and language understanding, and he is best known for his work on HowTo100M. Prior to joining DeepMind, he was awarded the Google Ph.D. Fellowship in 2018 for his initial work on video understanding.

Title: Flamingo: a Visual Language Model for Few-Shot Learning
Speaker: Antoine Miech
Time: October 6, 2022; 1:00 pm ET


Abstract: Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. In this talk, I will introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer, captioning tasks, which evaluate the ability to describe a scene or an event, and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.
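As an illustration of how a pretrained language model can be bridged to visual inputs, below is a sketch of a gated cross-attention block in the spirit of the ones described in the Flamingo paper: frozen language-model activations attend to visual tokens, and tanh gates initialized at zero let the new pathway switch on gradually during training. The dimensions, module layout, and the omission of the Perceiver Resampler are simplifications, not the actual implementation.

```python
# Sketch of a gated cross-attention block in the spirit of Flamingo's bridging
# layers. Dimensions and details are illustrative assumptions.
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.ln_attn = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln_ffw = nn.LayerNorm(dim)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Gates start at zero so the pretrained LM is unchanged at initialization.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        q = self.ln_attn(text_tokens)
        attended, _ = self.attn(q, visual_tokens, visual_tokens)  # text attends to vision
        x = text_tokens + torch.tanh(self.attn_gate) * attended
        x = x + torch.tanh(self.ffw_gate) * self.ffw(self.ln_ffw(x))
        return x

block = GatedCrossAttentionBlock()
text = torch.randn(2, 16, 512)     # (batch, text tokens, dim)
vision = torch.randn(2, 64, 512)   # (batch, visual tokens, dim)
out = block(text, vision)          # same shape as `text`
```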




Bio: Hila is a Ph.D. candidate at Tel Aviv University, advised by Prof. Lior Wolf. Her research focuses on developing reliable XAI algorithms and leveraging them to promote model accuracy and fairness.

Title: Transformer Explainability: obtaining reliable relevancy scores and using them to promote accuracy and robustness
Speaker: Hila Chefer
Time: August 4, 2022; 11:00 am ET


Abstract: Transformers have revolutionized deep learning research across many disciplines, starting from NLP and expanding to vision, speech, and more. In my talk, I will explore several milestones toward interpreting all families of Transformers, including unimodal, bi-modal, and encoder-decoder Transformers. I will present working examples and results that cover some of the most prominent models, including CLIP, ViT, and LXMERT. I will then present our recent explainability-driven fine-tuning technique that significantly improves the robustness of Vision Transformers (ViTs). The loss we employ ensures that the model bases its prediction on the relevant parts of the input rather than supportive cues (e.g., background).
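For readers unfamiliar with transformer relevancy maps, the sketch below implements attention rollout (Abnar & Zuidema, 2020), a simpler baseline than the explainability methods covered in the talk; it is included only to make the notion of a per-token relevancy score concrete. The ViT-style token count is an assumption.

```python
# Sketch of attention rollout, a simple baseline for transformer relevancy and
# not the speaker's method. Input: per-layer attention tensors of shape
# (heads, tokens, tokens); output: a relevancy score per patch token w.r.t. CLS.
import numpy as np

def attention_rollout(attentions):
    n = attentions[0].shape[-1]
    rollout = np.eye(n)
    for A in attentions:
        A = A.mean(axis=0)                    # average over heads
        A = 0.5 * A + 0.5 * np.eye(n)         # account for residual connections
        A = A / A.sum(axis=-1, keepdims=True)
        rollout = A @ rollout                 # propagate through the layer
    return rollout[0, 1:]                     # CLS row: relevance of each patch token

# Example with random, row-normalized attention maps for a 12-layer, 12-head ViT.
rng = np.random.default_rng(0)
layers = [rng.random((12, 197, 197)) for _ in range(12)]
layers = [A / A.sum(-1, keepdims=True) for A in layers]
print(attention_rollout(layers).shape)        # (196,) -> reshape to a 14x14 heatmap
```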




Bio: Mido Assran is a Visiting Researcher at Meta AI (FAIR) and holds a Vanier Scholarship and a Vadasz Doctoral Fellowship at McGill University (advised by Michael Rabbat) and Mila, the Quebec AI Institute, led by Yoshua Bengio. Mido's research focuses on self-supervised representation learning, and, in particular, learned feature-extraction from visual data for low-shot prediction. Mido also works in the education space — with youth, teachers, and school boards — leveraging constructivist learning theories to introduce new teaching pedagogies and improve school curricula (think self-supervised learning but for kids!).

Title: Label-Efficient Representation Learning: Masked Siamese Networks and Predicting View Assignments with Support Samples
Speaker: Mido Assran
Time: July 28, 2022; 2:00 pm ET


Abstract: Self-Supervised Learning (SSL) has emerged as an effective strategy for unsupervised learning of image representations, eliminating the need to manually annotate vast quantities of data. Since a good representation should be adaptable to prediction tasks with very few labels, I will discuss two methods for representation learning with very few annotated examples, the so-called setting of “Label-Efficient Learning.” The first, PAWS (ICCV 2021, Oral), learns by predicting view assignments with support samples and is the first method to match the performance of fully supervised learning on ImageNet using 10x fewer labels. The second, MSN (ECCV 2022), combines advances in Vision Transformers to learn by matching the representation of an image view containing randomly masked patches to the representation of the original unmasked image, establishing a new state of the art for semi-supervised learning on ImageNet: 73% top-1 accuracy with only 5000 annotated examples and 76% top-1 accuracy with only 1% of ImageNet-1K labels.
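The sketch below shows a simplified version of the view-matching objective in the spirit of MSN (my paraphrase, not the released code): the masked view is softly assigned to a set of learnable prototypes and trained to match the sharpened, stop-gradient assignment of the unmasked view, whereas PAWS uses labeled support samples in place of prototypes. Temperatures and the omission of regularizers such as mean-entropy maximization are assumptions made for brevity.

```python
# Simplified sketch of a masked-view matching loss in the spirit of MSN:
# both views are softly assigned to learnable prototypes, and the masked view's
# assignment is trained to match the unmasked (target) view's assignment.
# Temperatures and the encoder are illustrative; regularizers are omitted.
import torch
import torch.nn.functional as F

def matching_loss(z_masked, z_target, prototypes, tau_student=0.1, tau_teacher=0.025):
    z_masked = F.normalize(z_masked, dim=-1)
    z_target = F.normalize(z_target, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    log_p_student = F.log_softmax(z_masked @ protos.T / tau_student, dim=-1)
    with torch.no_grad():                                # target assignments: no gradient
        p_teacher = F.softmax(z_target @ protos.T / tau_teacher, dim=-1)
    return -(p_teacher * log_p_student).sum(dim=-1).mean()  # cross-entropy of assignments

# Example: batch of 8 embeddings (dim 256) and 1024 prototypes.
z_m, z_t = torch.randn(8, 256), torch.randn(8, 256)
prototypes = torch.nn.Parameter(torch.randn(1024, 256))
loss = matching_loss(z_m, z_t, prototypes)
loss.backward()
```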




Bio: Jianbo studied Computer Science and Mathematics as an undergraduate at Cornell University, where he received his B.A. in 1994. He received his Ph.D. degree in Computer Science from the University of California at Berkeley in 1998, for his thesis on the Normalized Cuts image segmentation algorithm. He joined The Robotics Institute at Carnegie Mellon University in 1999 as a research faculty member. Since 2003, he has been with the Department of Computer & Information Science at the University of Pennsylvania.

Jianbo's group is developing vision algorithms for both human and image recognition. Their ultimate goal is to develop computational algorithms to understand human behavior and interaction with objects in video, and to do so at multiple levels of abstraction: from basic body and limb tracking, to human identification, gesture recognition, and activity inference. Jianbo and his group are working to develop a visual thinking model that allows computers to understand their surroundings and achieve higher-level cognitive abilities such as machine memory and learning.

Title: Mental Model in Escape Room: a First Person POV
Speaker: Jianbo Shi
Time: July 14, 2022; 2:00 pm ET


Abstract: In an Escape-room setting, we study different human behaviors when completing time-constrained tasks involving sequential decision-making and actions. We aim to construct a human mental model linking attention, episodic memory, and hand-object interaction. We record from two egocentric cameras: a head-mounted GoPro and Gaze-tracking glasses. We also record from up to four third-person cameras. Additionally, we created a detailed 3D map of the room. We collected data from 26 total subjects, with some subjects pairing up for multiplayer scenarios. In this talk, I will discuss the progress we have made and the challenges we faced.




Bio: Jiajun Wu is an Assistant Professor of Computer Science at Stanford University, working on computer vision, machine learning, and computational cognitive science. Before joining Stanford, he was a Visiting Faculty Researcher at Google Research. He received his PhD in Electrical Engineering and Computer Science at Massachusetts Institute of Technology. Wu's research has been recognized through the ACM Doctoral Dissertation Award Honorable Mention, the AAAI/ACM SIGAI Doctoral Dissertation Award, the MIT George M. Sprowls PhD Thesis Award in Artificial Intelligence and Decision-Making, the 2020 Samsung AI Researcher of the Year, the IROS Best Paper Award on Cognitive Robotics, and faculty research awards and graduate fellowships from Samsung, Amazon, Facebook, Nvidia, and Adobe.

Title: Understanding the Visual World Through Code
Speaker: Jiajun Wu
Time: July 7, 2022; 2:00 pm ET


Abstract: Much of our visual world is highly regular: objects are often symmetric and have repetitive parts; indoor scenes such as corridors often consist of objects organized in a repetitive layout. How can we infer and represent such regular structures from raw visual data, and later exploit them for better scene recognition, synthesis, and editing? In this talk, I will present our recent work on developing neuro-symbolic methods for scene understanding. Here, symbolic programs and neural nets play complementary roles: symbolic programs are more data-efficient to train and generalize better to new scenarios, as they robustly capture high-level structure; deep nets effectively extract complex, low-level patterns from cluttered visual data. I will demonstrate the power of such hybrid models in three different domains: 2D image editing, 3D shape modeling, and human motion understanding.




Bio: Pascal Mettes is an assistant professor in computer vision at the University of Amsterdam. He received his PhD (2017) under Prof. Cees Snoek and was a postdoc (2018-2019) at the University of Amsterdam. He was furthermore a visiting researcher at Columbia University (2016) under Prof. Shih-Fu Chang and at the University of Tübingen (2021) under Prof. Zeynep Akata. His research focuses on discovering and embedding prior knowledge in deep networks for visual understanding.

Title: Hyperbolic and Hyperspherical Visual Understanding
Speaker: Pascal Mettes
Time: June 30, 2022; 2:00 pm ET


Abstract: Deep learning for visual recognition thrives on examples but commonly ignores broader available knowledge about hierarchical relations between classes. My team and I focus on the question of how to integrate hierarchical and inductive knowledge about categorization into deep networks. In this talk, I will dive into three of our recent works that integrate such knowledge through hyperbolic and hyperspherical geometry. As a starting point, I will briefly outline what hyperbolic geometry entails, as well as its potential for representation learning. The first paper introduces a hyperbolic prototype network that is able to embed semantic action hierarchies for video search and recognition [CVPR'20]. The second paper introduces Hyperbolic Image Segmentation, where we provide a tractable formulation for hierarchical pixel-level optimization in hyperbolic space, opening new doors in segmentation [CVPR'22]. For the third paper, we go back to a classical inductive bias, namely maximum separation between classes, and show that, contrary to recent literature, this inductive bias need not be treated as an optimization problem but has a closed-form hyperspherical solution. The solution takes the form of one fixed matrix and requires only a single line of code to add to your network, yet directly boosts categorization, long-tailed recognition, and open-set recognition [Preprint'22]. The last part of the talk discusses open research questions and future potential for hyperbolic and hyperspherical learning in computer vision.
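To make the "one fixed matrix" concrete, here is one standard closed-form construction of C maximally separated unit prototypes (pairwise cosine similarity of -1/(C-1)); it lives in R^C rather than the paper's exact formulation and is meant only to illustrate the idea of a fixed, non-learned classifier matrix.

```python
# One standard closed-form for C maximally separated unit prototypes
# (pairwise cosine similarity -1/(C-1)), used as a fixed, non-learned classifier.
# This is an illustrative construction in R^C, not necessarily the exact matrix
# or dimensionality from the paper discussed in the talk.
import numpy as np

def max_separation_prototypes(num_classes):
    C = num_classes
    P = np.eye(C) - np.ones((C, C)) / C        # center the standard basis vectors
    return np.sqrt(C / (C - 1)) * P            # rescale rows to unit norm

P = max_separation_prototypes(10)
print(np.allclose((P**2).sum(1), 1.0))         # rows are unit vectors -> True
print(np.round(P @ P.T, 3)[0, 1])              # off-diagonal cosine = -1/9 ~ -0.111

# Usage: L2-normalize features and take logits = features @ P.T, training only
# the feature extractor while keeping P fixed.
```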




Bio: Dr. Tianmin Shu is a postdoctoral associate in the Department of Brain and Cognitive Sciences at the Massachusetts Institute of Technology, working with Josh Tenenbaum and Antonio Torralba. He studies machine social intelligence and computational social cognition, with the goal of engineering and reverse engineering social intelligence. His work received the 2017 Cognitive Science Society Computational Modeling Prize in Perception/Action and several best paper awards at NeurIPS workshops and an IROS workshop. His research has also been covered by multiple media outlets, such as New Scientist, Science News, and VentureBeat. He received his PhD degree from the University of California, Los Angeles, in 2019.

Title: Cognitively Inspired Machine Social Intelligence
Speaker: Tianmin Shu
Time: June 29, 2022; 2:00 pm ET


Abstract: No other species possesses a social intelligence quite like that of humans. Our ability to understand one another's minds and actions, and to interact with one another in rich and complex ways, is the basis for much of our success. To build AI systems that can successfully interact with humans in real-world settings, we need to equip them with human-level social intelligence. In this talk, I will demonstrate how we can engineer human-level machine social intelligence by building cognitively inspired machine learning and AI models that can achieve the human-like ability to understand and interact with humans. In particular, I will focus on two areas of social intelligence—social scene understanding and multi-agent cooperation. I will also talk about future directions emerging from these two areas.




Bio: Dave is a second-year PhD student at UC Berkeley advised by Alexei Efros and currently collaborating with Adobe Research. Before that, he graduated from Columbia University where he was introduced to computer vision and advised by Carl Vondrick.

Title: BlobGAN: Spatially Disentangled Scene Representations
Speaker: Dave Epstein
Time: June 2, 2022; 2:00 pm ET


Abstract: I'll present recent work on learning generative representations of scenes. We train a model to create realistic images through a representation bottleneck in the form of a spatial collection of blobs. This simple construction causes our model to discover different objects in the world and learn how to arrange them to synthesize proper scenes. The emergent representation then allows many different forms of semantic image manipulation on both real and generated images, despite training without any supervision at all.
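As a toy illustration of the blob bottleneck (the parameterization and splatting formula here are assumptions, not the model's actual layout), the sketch below splats a handful of blobs, each with a center, scale, and feature vector, into a spatial feature grid that a generator could then decode.

```python
# Toy sketch of splatting a collection of blobs (center, scale, feature vector)
# into a spatial feature grid. The isotropic-Gaussian opacity and normalization
# are illustrative assumptions; the actual model uses richer blob parameters
# and a StyleGAN-like decoder.
import numpy as np

def splat_blobs(centers, scales, features, grid_size=16):
    ys, xs = np.meshgrid(np.linspace(0, 1, grid_size), np.linspace(0, 1, grid_size),
                         indexing="ij")
    grid = np.stack([xs, ys], axis=-1)                           # (H, W, 2) coordinates
    d2 = ((grid[None] - centers[:, None, None, :]) ** 2).sum(-1)  # (K, H, W)
    opacity = np.exp(-0.5 * d2 / scales[:, None, None] ** 2)      # one soft blob per row
    weights = opacity / (opacity.sum(0, keepdims=True) + 1e-8)    # normalize across blobs
    return np.einsum("khw,kc->hwc", weights, features)            # (H, W, C) feature grid

K, C = 4, 8
rng = np.random.default_rng(0)
feat_grid = splat_blobs(rng.random((K, 2)), np.full(K, 0.1), rng.normal(size=(K, C)))
print(feat_grid.shape)  # (16, 16, 8)
```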




Bio: Philipp is an Assistant Professor in the Department of Computer Science at the University of Texas at Austin. He received his Ph.D. in 2014 from the CS Department at Stanford University and then spent two wonderful years as a PostDoc at UC Berkeley. His research interests lie in Computer Vision, Machine Learning, and Computer Graphics. He is particularly interested in deep learning, image understanding, and vision and action.

Title: Learning to perceive and navigate from many views
Speaker: Philipp Krähenbühl
Time: May 26, 2022; 2:00 pm ET


Abstract: Autonomous navigation is undergoing an exciting transformation: on the hardware end, increasingly expensive active depth sensors are being replaced with arrays of cheaper RGB cameras; on the intelligence side, an ever-increasing number of behaviors are learned directly from data. In the first part of this talk, I will show how transformer-based architectures are a particularly good fit for learning a 3D perception system from many 2D RGB views. In particular, I will show how to rephrase geometric reasoning in multi-view perception systems with a single transformer architecture, and thus learn 3D perception directly from data. Our model is simple, easily parallelizable, and runs in real time. The presented architecture performs at the state of the art on the nuScenes dataset, with 4x faster inference than prior work. In the second part, I will focus on learning a navigation policy from many views. Specifically, I will show how imitation learning from the viewpoint of other traffic participants significantly boosts the performance of a driving policy. The resulting system outperforms prior approaches on the public CARLA Leaderboard by a wide margin, improving driving score by 25 points and route completion rate by 24 points.




Bio: David Fouhey is an assistant professor at the University of Michigan. He received a Ph.D. in robotics from Carnegie Mellon University and was then a postdoctoral fellow at UC Berkeley. His work has been recognized by an NSF CAREER award and NSF and NDSEG fellowships. He has spent time at the University of Oxford's Visual Geometry Group, INRIA Paris, and Microsoft Research.

Title: Understanding 3D Rooms and Interacting Hands
Speaker: David Fouhey
Time: May 17, 2022; 11:00 am ET


Abstract: The long-term goal of my research is to enable computers to understand the physical world from images, including both 3D properties and how humans or robots could interact with things. This talk will summarize two recent directions aimed at enabling this goal. I will begin with learning to reconstruct full 3D scenes, including invisible surfaces, from a single RGB image and present work that can be trained with the ordinary unstructured 3D scans that sensors usually collect. Our method uses implicit functions, which have shown great promise when learned on watertight meshes. When trained on non-watertight meshes, we show that the conventional setup incentivizes neural nets to systematically distort their predictions. We offer a simple solution with a distance-like function that leads to strong results for full scene reconstruction on Matterport3D and other datasets. I will then focus on understanding what humans are doing with their hands. Hands are a primary means for humans to manipulate the world, but fairly basic information about what they are doing is often off limits to computers (or, at least, in challenging data). I will describe some of our efforts on understanding hand state, including work on learning to segment hands and hand-held objects in images via a system that learns from large-scale video data.
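As a generic illustration of supervising an implicit function directly from raw, non-watertight scans (not the specific distance-like function from the talk), the sketch below computes unsigned-distance regression targets from scan points with a k-d tree; no watertight mesh or inside/outside labels are required.

```python
# Generic illustration (not the specific function from the talk): supervising an
# implicit network with unsigned distances computed directly from raw,
# non-watertight scan points, which avoids needing watertight meshes or
# inside/outside labels.
import numpy as np
from scipy.spatial import cKDTree

def unsigned_distance_targets(scan_points, query_points):
    """Distance from each query point to its nearest scanned surface point."""
    tree = cKDTree(scan_points)
    dist, _ = tree.query(query_points)
    return dist  # regression target: train f(query) to predict this value

# Example: a partial, one-sided "scan" of a plane and random queries in the volume.
rng = np.random.default_rng(0)
scan = np.column_stack([rng.uniform(0, 1, 5000), rng.uniform(0, 1, 5000), np.zeros(5000)])
queries = rng.uniform(-0.5, 1.5, size=(1024, 3))
targets = unsigned_distance_targets(scan, queries)   # supervise an implicit network with these
```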




Bio: Serena Yeung is an Assistant Professor of Biomedical Data Science and, by courtesy, of Computer Science and of Electrical Engineering at Stanford University. She leads the Stanford Medical AI and Computer Vision Lab (MARVL), and her research interests are in the areas of computer vision, machine learning, and deep learning, focusing on applications to healthcare. She also serves as Associate Director of Data Science for the Stanford Center for Artificial Intelligence in Medicine & Imaging (AIMI), and is affiliated with the Stanford Clinical Excellence Research Center (CERC). Prior to joining the Stanford faculty in 2019, Serena was a Technology for Equitable and Accessible Medicine (TEAM) Postdoctoral Fellow at Harvard University, where she was hosted by Susan Murphy and John Halamka. Serena received her Ph.D. from Stanford University in 2018, where she was advised by Fei-Fei Li and Arnold Milstein. During her Ph.D., she also spent time at Facebook AI Research in 2016 and Google Cloud AI in 2017.

Title: Overcoming data and label bottlenecks in scene understanding for healthcare applications
Speaker: Serena Yeung
Time: May 12, 2022; 2:00 pm ET


Abstract: There is an abundance of diverse visual data in healthcare, ranging from radiology and dermatology images to video of surgical procedures and patient care activities. In this talk, I will discuss our group's ongoing work in developing computer vision methods to interpret image and video data from challenging healthcare domains, towards enabling new uses of AI in healthcare. I will focus on two types of video data: surgical videos, and videos of human behavior in healthcare contexts. First, I will present work on developing computer vision for fine-grained scene understanding in surgical videos, towards applications for surgeon training and real-time assistance. Then, I will present work on 3D human and scene understanding in videos, towards applications of ambient intelligence in hospitals and behavioral studies. I will discuss methods for learning with limited amounts of labels to enable adaptation of these complex computer vision tasks to challenging in-the-wild domains such as healthcare.




Bio: Arsha Nagrani is a Research Scientist at Google AI Research. She obtained her PhD from the University of Oxford, where her thesis received the ELLIS PhD Award. During her PhD she also spent time at Google AI in California and the Wadhwani AI non-profit organisation in Mumbai. Her work has been recognised by a Best Student Paper Award at Interspeech, an Outstanding Paper Award at ICASSP, a Google PhD Fellowship and a Townsend Scholarship, and has been covered by news outlets such as New Scientist, MIT Tech Review, and Verdict. Her research is focused on cross-modal and multi-modal machine learning techniques for video understanding.

Title: Multimodal Learning for Videos
Speaker: Arsha Nagrani
Time: April 28, 2022; 2:00 pm ET


Abstract: How can a video recognition system benefit from other modalities such as audio and text? This talk will focus on what multimodal learning is, why we need it (from both a human and machine perspective) and how to use it. We will cover some recent papers accepted to CVPR and NeurIPS, and finally brainstorm on some of the biggest challenges facing multimodal learning.




Bio: Ishan Misra is a Research Scientist at Meta AI Research (FAIR) where he works on Computer Vision and Machine Learning. His research interest is in reducing the need for supervision in visual learning. In recent years, Ishan has worked on self-supervised techniques in vision such as PIRL, SwAV, SEER, DINO etc. that learn highly performant visual representations without using human labels. These methods work at scale (billions of images) and across modalities (images, video, 3D, audio). Ishan got his PhD in Robotics from Carnegie Mellon.

Title: Building General Purpose Vision Models
Speaker: Ishan Misra
Time: April 14, 2022; 2:00 pm ET


Abstract: Modern computer vision models are good at specialized tasks. Given the right architecture and the right supervision, supervised learning can yield great specialist models. Specialist models also have severe limitations. In this talk, I will focus on two: specialist models cannot work on tasks beyond what they saw training labels for, or on new types of visual data. I will present our recent efforts to design better architectures, training paradigms, and loss functions to address these issues. (a) Our first work, called Omnivore, presents a single model that can operate on images, videos, and single-view 3D data. Omnivore leads to shared representations across visual modalities, without using paired input data. (b) I will introduce Mask2Former, which is a general architecture for all types of pixel-labeling tasks, such as semantic, instance, or panoptic segmentation, on images or video. (c) I will conclude the talk with Detic, a simple way to train detectors by leveraging classification data, which leads to a 20,000+ class detector.




Bio: Angjoo Kanazawa is an Assistant Professor in the Department of Electrical Engineering and Computer Science at the University of California at Berkeley. Her research is at the intersection of Computer Vision, Computer Graphics, and Machine Learning, focusing on the visual perception of the dynamic 3D world behind everyday photographs and video. Previously, she was a research scientist at Google NYC with Noah Snavely, and prior to that she was a BAIR postdoc at UC Berkeley advised by Jitendra Malik, Alyosha Efros, and Trevor Darrell. She completed her PhD in Computer Science at the University of Maryland, College Park with her advisor David Jacobs. She also spent time at the Max Planck Institute for Intelligent Systems with Michael Black. She has been named a Rising Star in EECS and is a recipient of the Anita Borg Memorial Scholarship, the Best Paper Award at Eurographics 2016, and a Google Research Scholar Award in 2021. She also serves on the advisory board of Wonder Dynamics, whose goal is to utilize AI technologies to make VFX effects more accessible for indie filmmakers.

Title: Towards Capturing Reality
Speaker: Angjoo Kanazawa
Time: April 7, 2022; 2:00 pm ET


Abstract: What is the set of all things that we can ever see? The seminal work of Adelson and Bergen in '91 answered this ambitious question with the Plenoptic Function, a conceptual device that captures the complete holographic representation of the visual world. Recent advances in differentiable rendering, namely the Neural Radiance Field (NeRF), come closest to actually modeling the plenoptic function from a set of calibrated image observations. NeRF demonstrated exciting potential for photorealistic 3D reconstruction; however, its original form has many shortcomings that make it impractical for casual photorealistic 3D capture. In this talk, I will discuss the line of work my group has been conducting to make NeRF more practical. Specifically, I will talk about few-shot prediction with pixelNeRF, which can predict a NeRF from one or a few images; real-time rendering with PlenOctrees, which improved rendering time by 3000x; and lastly our latest work, Plenoxels (for plenoptic voxel elements), which can be optimized to the same fidelity as NeRF without any neural networks, with a two-orders-of-magnitude speedup.
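All of the NeRF-family methods mentioned share the same volume-rendering step, sketched below for a single ray: densities and colors at samples along the ray are alpha-composited into a pixel color. How the samples are produced (an MLP, an octree, or a voxel grid) is out of scope, and the numbers here are random placeholders.

```python
# Minimal sketch of the volume-rendering step shared by NeRF-style methods:
# densities and colors sampled along a ray are alpha-composited into one pixel.
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """sigmas: (N,) densities; colors: (N, 3); deltas: (N,) distances between samples."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                            # opacity per segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]     # transmittance T_i
    weights = trans * alphas                                           # per-sample contribution
    return (weights[:, None] * colors).sum(axis=0), weights

# Example: 64 random samples along one ray.
rng = np.random.default_rng(0)
rgb, w = composite_ray(rng.uniform(0, 5, 64), rng.uniform(0, 1, (64, 3)), np.full(64, 0.02))
print(rgb, w.sum())   # pixel color and accumulated opacity (always <= 1)
```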




Organizers