We are delighted to introduce our two inspiring keynote speakers coming from both academia and industry.

Xin Luna Dong

Xin Luna Dong

Xin Luna Dong is a Principal Scientist at Amazon, leading the efforts of constructing Amazon Product Knowledge Graph.

She was one of the major contributors to the Google Knowledge Vault project, and has led the Knowledge-based Trust project, which is called the “Google Truth Machine” by Washington’s Post. She has co-authored book “Big Data Integration”, was awarded ACM Distinguished Member, VLDB Early Career Research Contribution Award for “advancing the state of the art of knowledge fusion”, and Best Demo award in Sigmod 2005. She serves in VLDB endowment and PVLDB advisory committee, and is a PC co-chair for VLDB 2021, ICDE Industry 2019, VLDB Tutorial 2019, Sigmod 2018 and WAIM 2015.​​

Talk: Ceres: Harvesting Knowledge from the Semi-structured Web
Knowledge graphs have been used to support a wide range of applications and enhance search and QA for Google, Amazon Alexa, etc. However, we often miss long-tail knowledge, including unpopular entities, unpopular relations, and unpopular verticals. In this talk we describe our efforts in harvesting knowledge from semi-structured websites, which are often populated according to some templates using vast volume of data stored in underlying databases. We describe our AutoCeres ClosedIE system, which improves the accuracy of fully automatic knowledge extraction from 60%+ of state-of-the-art to 90%+ on semi-structured data. We also describe OpenCeres, the first ever OpenIE system on semi-structured data, that is able to identify new relations not readily included in existing ontologies. In addition, we describe our other efforts in ontology alignment, entity linkage, graph mining, and QA, that allow us to best leverage the knowledge we extract for search and QA.

Michel Dumontier

Michel D

Dr. Michel Dumontier is the Distinguished Professor of Data Science at Maastricht University and co-founder of the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. His research focuses on the development of computational methods for scalable and responsible discovery science. Dr. Dumontier obtained his BSc (Biochemistry) in 1998 from the University of Manitoba, and his PhD (Bioinformatics) in 2005 from the University of Toronto. Previously a faculty member at Carleton University in Ottawa and Stanford University in Palo Alto, Dr. Dumontier founded and directs the interfaculty Institute of Data Science at Maastricht University to develop sociotechnological systems for responsible data science by design. His work is supported through the Dutch National Research Agenda, the Netherlands Organisation for Scientific Research, Horizon 2020, the European Open Science Cloud, the US National Institutes of Health and a Marie-Curie Innovative Training Network. He is the editor-in-chief for the journal Data Science and is internationally recognized for his contributions in bioinformatics, biomedical informatics, and semantic technologies including ontologies and linked data.

Talk: Accelerating Discovery Science with an Internet of FAIR Data and Services

Biomedicine has always been a fertile and challenging domain for computational discovery science. Indeed, the existence of millions of scientific articles, thousands of databases, and hundreds of ontologies, offer exciting opportunities to reuse our collective knowledge, were we not stymied by incompatible formats, overlapping and incomplete vocabularies, unclear licensing, and heterogeneous access points. In this talk, I will discuss our work to create computational standards, platforms, and methods to wrangle knowledge into simple, but effective representations based on semantic web technologies that are maximally FAIR – Findable, Accessible, Interoperable, and Reuseable – and to further use these for biomedical knowledge discovery. But only with additional crucial developments will this emerging Internet of FAIR data and services enable automated scientific discovery on a global scale.

Carlo Curino

Carlo Curino is the lead of Gray Systems Lab (GSL). Before this Carlo was a Principal Scientist in Cloud and Information Services Lab (CISL), working on large-scale distributed systems, with a focus on scheduling for BigData clusters; this line of research was co-developed with several team members and open-sourced as part of Apache Hadoop/YARN. Intrinsically, this research work enables us to operate the largest YARN clusters in the world (deployed on 250k + servers within Microsoft). Prior to joining Microsoft was a Research Scientist at Yahoo!; primarily working entity deduplication and scale and mobile+cloud platforms. Carlo spent two years as a Post Doc Associate at CSAIL MIT working with Prof. Samuel Madden and Prof. Hari Balakrishnan. At MIT he also served as the primary lecturer for the course on databases CS630, taught in collaboration with Mike Stonebraker. Carlo received a Bachelor in Computer Science at Politecnico di Milano. He participated in a joint project between University of Illinois at Chicago (UIC) and Politecnico di Milano, obtaining a Master Degree in Computer Science at UIC and the Laurea Specialistica (cum laude) in Politecnico di Milano. During the PhD at Politecnico di Milano, Carlo spent two years as a visiting researcher at UCLA, working with Prof. Carlo Zaniolo (UCLA) and Prof Alin Deutsch (UCSD). Research interests: ML-for-Systems and Systems-for-ML, large scale distributed systems, performance tuning, and scheduling. Previous research work: mobile+cloud platforms, entity dedup at scale, relational databases and cloud computing, workload management and performance analysis, schema evolution, and temporal databases.

Talk: Cloudy With High Chance Of DBMS: A 10-year Prediction For Enterprise-Grade ML

Machine learning (ML) has proven itself in high-value web applications such as search ranking and is emerging as a powerful tool in a much broader range of enterprise scenarios including voice recognition and conversational understanding for customer support, auto-tuning for videoconferencing, intelligent feedback loops in large-scale sysops, manufacturing and autonomous vehicle management, complex financial predictions, just to name a few. Meanwhile, as the value of data is increasingly recognized and monetized, concerns about securing valuable data and risks to individual privacy have been growing. Furthermore broader adoption leads to less experience development teams, further increasing risks of misuse of these technologies. Rigorous data management has emerged as a key requirement in enterprise settings. How will these trends (ML growing popularity, and stricter data governance) intersect? What are the unmet requirements for applying ML in enterprise settings? What are the technical challenges for the DB community to solve? In this talk, we present a vision of how ML and database systems are likely to come together, and highlight fascinating early results towards in building Enterprise Grade Machine Learning (EGML).

Timnit Gebru

Timnit Gebru co-leads the Ethical Artificial Intelligence research team at Google, working to reduce the potential negative impacts of AI. Timnit earned her doctorate under the supervision of Fei-Fei Li at Stanford University in 2017 and did a postdoc at Microsoft Research NYC in the FATE team. She is also the cofounder of Black in AI, a place for sharing ideas, fostering collaborations and discussing initiatives to increase the presence of Black people in the field of Artificial Intelligence.

Talk: Lessons from Archives – Strategies for Collecting Sociocultural Data in Machine Learning

A growing body of work shows that many problems in fairness, accountability, transparency, and ethics in machine learning systems are rooted in decisions surrounding the data collection and annotation process. We argue that a new specialization should be formed within machine learning that is focused on methodologies for data collection and annotation: efforts that require institutional frameworks and procedures. Specifically for sociocultural data, parallels can be drawn from archives and libraries. Archives are the longest standing communal effort to gather human information and archive scholars have already developed the language and procedures to address and discuss many challenges pertaining to data collection such as consent, power, inclusivity, transparency, and ethics privacy. We discuss these five key approaches in document collection practices in archives that can inform data collection in sociocultural machine learning.