PPR Seminar

Advances in Perception, Prediction, and Reasoning

Hosted by Tejas Gokhale at UMBC



Schedule


May 01, 2024
4:00 -- 5:15 PM
ENGR 231

Webex Link

Serena Booth
AAAS AI Policy Fellow, United States Senate




April 29, 2024
4:00 -- 5:15 PM
ENGR 231

Webex Link

Michael Saxon
Ph.D. Candidate, University of California, Santa Barbara


Rigorous measurement in text-to-image systems (and AI more broadly?)
As the large pretrained models underlying generative AI systems have grown larger, more inscrutable, and more widely deployed, interest in understanding their nature as emergent rather than engineered systems has grown. I believe that to move this "ersatz natural science" of AI forward, we need to focus on building rigorous observational tools for these systems, tools that can characterize capabilities unambiguously. At their best, benchmarks and metrics could meet this need, but at present they are often treated as mere leaderboards to chase and only very indirectly measure the capabilities of interest. This talk covers three works on this topic: first, a work laying out the high-level case for building a subfield of "model metrology" focused on building better benchmarks and metrics; then, two works on metrology in the generative image domain: one that assesses multilingual conceptual knowledge in text-to-image (T2I) systems, and a meta-benchmark demonstrating that many T2I prompt faithfulness benchmarks fail to capture the compositionality characteristics of T2I systems that they purport to measure. This line of inquiry is intended to help move benchmarking toward the ideal of rigorous tools of scientific observation.

Bio Michael Saxon is a PhD candidate and NSF Fellow in the NLP Group at the University of California, Santa Barbara. His research sits at the intersection of generative model benchmarking, multimodality, and AI ethics. He is particularly interested in making meaningful evaluations of hard-to-measure new capabilities in these artifacts. Michael earned his BS in Electrical Engineering and MS in Computer Engineering at Arizona State University, advised by Visar Berisha and Sethuraman Panchanathan, in 2018 and 2020 respectively.

April 24, 2024
4:00 -- 5:15 PM
ENGR 231

Webex Link

Catherine Ordun
Vice President, AI, Booz Allen Hamilton


Visible-Thermal Image Registration and Translation for Remote Medical Applications
Thermal imagery captured in the Long Wave Infrared (LWIR) spectrum has long played a vital role in thermal physiology. Signs of stress and inflammation that are unseen in the visible spectrum can be detected in LWIR due to the principles of blackbody radiation. As a result, thermal facial imagery provides a unique modality for physiological assessment of states such as chronic pain. In this presentation, I will present my research on image registration for aligning visible-thermal images, a prerequisite for image-to-image translation using conditional GANs and diffusion models. I will also share recent work, conducted with the National Institutes of Health, applying this research in a real-world setting to cancer patients suffering from chronic pain.

Bio Dr. Catherine Ordun is a Vice President at Booz Allen Hamilton, leading AI Rapid Prototyping and Tech Transfer solutions for mission-critical problems for the Federal Government. She drives AI rapid prototyping to support proof-of-concepts across multiple AI domains, in addition to AI tech transfer to support algorithm reuse and consumption. She also leads multimodal AI research supporting the National Cancer Institute on chronic cancer pain detection. Dr. Ordun is a Ph.D. graduate of the UMBC Department of Information Systems, advised by Drs. Sanjay Purushotham and Edward Raff, and obtained her bachelor's degree from Georgia Tech, her master's from Emory, and an MBA from GWU Business School.

April 17, 2024
4:00 -- 5:15 PM
ENGR 231

Webex Link

Yu Zeng
Ph.D. Candidate, Johns Hopkins University


Learning to Synthesize Images with Multimodal and Hierarchical Inputs
In recent years, image synthesis and manipulation have experienced remarkable advancements driven by deep learning algorithms and web-scale data, yet there persists a notable disconnect between the intricate nature of human ideas and the simplistic input structures employed by existing models. In this talk, I will present our research toward a more natural approach to controllable image synthesis, inspired by the coarse-to-fine workflow of human artists and the inherently multimodal nature of human thought processes. We consider inputs of the semantic and visual modalities at varying levels of hierarchy. For the semantic modality, we introduce a general framework for modeling semantic inputs at different levels, which includes image-level text prompts and pixel-level label maps as two extremes and brings a series of mid-level regional descriptions with varying precision. For the visual modality, we explore the use of low-level and high-level visual inputs, aligning with the natural hierarchy of visual processing. Additionally, as the misuse of generated images becomes a societal threat, in the second part of this talk I will present our findings on the trustworthiness of deep generative models and discuss potential future research directions.

Bio Yu Zeng is a PhD candidate at Johns Hopkins University advised by Vishal M. Patel. Her research interests lie in computer vision and deep learning. She has focused on two main areas: (1) deep generative models for image synthesis and editing, and (2) label-efficient deep learning. By combining these research areas, she aims to bridge human creativity and machine intelligence through user-friendly and socially responsible models while minimizing the need for intensive human supervision. Yu has collaborated with researchers at NVIDIA and Adobe through internships. Prior to her PhD, she worked as a researcher at Tencent Games. Yu's research has been recognized by the KAUST Rising Stars in AI program, and her PhD study has been supported by the JHU Kewei Yang and Grace Xin Fellowship.

March 05, 2024
2:15 -- 3:30 PM
ITE 325-B

Webex Link

Co-hosted by Lara Martin

Li "Harry" Zhang
Ph.D. Candidate, University of Pennsylvania


Structured Event Reasoning with Large Language Models
Reasoning about real-life events is a unifying challenge in AI and NLP with profound utility in a variety of domains, while any fallacy in high-stakes applications like law, medicine, and science could be catastrophic. Able to work with diverse text in these domains, large language models (LLMs) have proven capable of answering questions and solving problems. In this talk, I demonstrate that end-to-end LLMs still systematically fail on reasoning tasks involving complex events. Moreover, their black-box nature offers little interpretability and user control. To address these issues, I propose two general approaches to using LLMs in conjunction with a structured representation of events. The first is a language-based representation involving relations of sub-events that can be learned by LLMs via fine-tuning. The second is a symbolic representation involving states of entities that can be leveraged by either LLMs or deterministic solvers. On a suite of event reasoning tasks, I show that both approaches outperform end-to-end LLMs in terms of performance and trustworthiness.

Bio Li "Harry" Zhang is a 5th-year PhD student working on Natural Language Processing (NLP) and artificial intelligence at the University of Pennsylvania, advised by Prof. Chris Callison-Burch. He earned his Bachelor's degree at the University of Michigan, mentored by Prof. Rada Mihalcea and Prof. Dragomir Radev. He has published more than 20 papers in NLP conferences that have been cited more than 1,000 times. He has reviewed more than 50 papers in those venues and has served as Session Chair and Program Chair at many conferences and workshops. A musician, producer, and content creator with over 50,000 subscribers, he is also passionate about research on AI music.

Feb 08, 2024
3:30 -- 4:45 PM
ITE 325-B

Webex Link

★★ PPR Distinguished Speaker ★★
Yezhou Yang
Associate Professor, Arizona State University


Visual Concept Learning Beyond Appearances: Modernizing a Couple of Classic Ideas
The goal of Computer Vision, as framed by Marr, is to develop algorithms that answer "what", "where", and "when" from visual appearance. The speaker, among others, recognizes the importance of studying underlying entities and relations beyond visual appearance, following an Active Perception paradigm. This talk will present the speaker's efforts over the last decade, ranging from 1) reasoning beyond appearance for vision-and-language tasks (VQA, captioning, T2I, etc.) and addressing their evaluation misalignment, through 2) reasoning about implicit properties, to 3) their roles in a robotic visual concept learning framework. The talk will also feature projects from the Active Perception Group (APG) at the ASU School of Computing and Augmented Intelligence (SCAI) addressing emerging national challenges in the automated mobility and intelligent transportation domains.

Bio Yezhou (YZ) Yang is an Associate Professor and a Fulton Entrepreneurial Professor in the School of Computing and Augmented Intelligence (SCAI) at Arizona State University. He founded and directs the ASU Active Perception Group, and currently serves as the topic lead (situation awareness) at the Institute of Automated Mobility, Arizona Commerce Authority. He is also a thrust lead (AVAI) at Advanced Communications Technologies (ACT, a Science and Technology Center under the New Economy Initiative, Arizona). His work includes exploring visual primitives and representation learning in visual (and language) understanding, grounding them by natural language and high-level reasoning over the primitives for intelligent systems, secure/robust AI, and V&L model evaluation alignment. Yang is a recipient of the Qualcomm Innovation Fellowship in 2011, the NSF CAREER award in 2018, and the Amazon AWS Machine Learning Research Award in 2019. He received his Ph.D. from the University of Maryland at College Park, and B.E. from Zhejiang University, China. He is a co-founder of ARGOS Vision Inc, an ASU spin-off company.

Dec 04, 2023
4:00 -- 5:15 PM
ENGR 231

Webex Link

Man Luo
Postdoctoral Research Fellow, Mayo Clinic


Advancing Multimodal Retrieval and Generation: From General to Biomedical Domains
This talk explores advancements in multimodal retrieval and generation across general and biomedical domains. The first work introduces a multimodal retriever and reader pipeline for vision-based question answering, using image-text queries to retrieve and interpret relevant textual knowledge. The second work simplifies this approach with an efficient end-to-end retrieval model, removing dependencies on intermediate models like object detectors. The final part presents a biomedical-focused multimodal generation model, capable of classifying and explaining labels in images with text prompts. Together, these works demonstrate significant progress in integrating visual and textual data processing in diverse applications.

Bio Dr. Man Luo is a Postdoctoral Research Fellow at Mayo Clinic, working with Dr. Imon Banerjee and Dr. Bhavik Patel. Her research lies at the intersection of information retrieval and reading comprehension within natural language processing (NLP) and multimodal domains, with a focus on retrieving and utilizing external knowledge efficiently and with strong generalization. Currently, she is interested in knowledge retrieval, multimodal understanding, and applications of LLMs and VLMs in biomedical and healthcare settings. She earned her Ph.D. in 2023 from Arizona State University, advised by Dr. Chitta Baral, and has collaborated with industrial research labs at Salesforce, Meta, and Google.

Nov 29, 2023
4:00 -- 5:15 PM
ENGR 231

Webex Link

Kowshik Thopalli
Postdoctoral Researcher, Lawrence Livermore National Laboratory


Making Machine Learning Models Safer: Data and Model Perspectives
As machine learning systems are increasingly deployed in real-world settings like healthcare, finance, and scientific applications, ensuring their safety and reliability is crucial. However, many state-of-the-art ML models still suffer from issues like poor out-of-distribution generalization, sensitivity to input corruptions, large data requirements, and inadequate calibration, limiting their robustness and trustworthiness for critical real-world applications. In this talk, I will first present a broad overview of different safety considerations for modern ML systems. I will then discuss our recent efforts in making ML models safer from two complementary perspectives: (i) manipulating data and (ii) enriching model capabilities by developing novel training mechanisms. I will discuss our work on designing new data augmentation techniques for object detection, followed by a demonstration of how, in the absence of data from the desired target domains, one could leverage pre-trained generative models for efficient synthetic data generation. Next, I will present a new paradigm for training deep networks called model anchoring and show how one could achieve properties similar to an ensemble through a single model. I will specifically discuss how model anchoring can significantly enrich the class of hypothesis functions being sampled and demonstrate its effectiveness through improved performance on several safety benchmarks. I will conclude by highlighting exciting future research directions for producing robust ML models by leveraging multi-modal foundation models.

Bio Kowshik Thopalli is a Machine Learning Scientist and a post-doctoral researcher at Lawrence Livermore National Laboratory. His research focuses on developing reliable machine learning models that are robust under distribution shifts. He has published papers on a variety of techniques to address model robustness, including domain adaptation, domain generalization, and test-time adaptation using geometric and meta-learning approaches. His expertise also encompasses integrating diverse knowledge sources, such as domain expert guidance and generative models, to improve model data efficiency, accuracy, and resilience to distribution shifts. He received his Ph.D. in 2023 from Arizona State University.

Nov 27, 2023
4:00 -- 5:15 PM
ENGR 231

Webex Link

Eadom Dessalene
Ph.D. Candidate, University of Maryland College Park


Learning Actions from Humans in Video
The prevalent computer vision paradigm in action understanding is to directly transfer advances in object recognition to the action domain. In this presentation, I discuss the motivations for an alternative, embodied approach centered on modeling actions rather than objects, survey our recent work along these lines, and outline promising future directions.

Bio Eadom Dessalene is a Ph.D. Candidate at the University of Maryland, College Park, advised by Yiannis Aloimonos and Cornelia Fermuller in the Perception and Robotics Group. Eadom received his bachelor's degree in Computer Science from George Mason University. He has made several important contributions to research on video understanding and egocentric vision through publications in CVPR, ICLR, T-PAMI, and ICRA, as well as winning first place in the 2020 EPIC Kitchens Action Anticipation Challenge.