Bardh Prenkaj, Giovanni Stilo and Lorenzo Madeddu
Online education is one of the wealthiest industries in the world. Its relevance has increased further due to the COVID-19 emergency, which forced nations to quickly convert their education systems to online environments. Despite the benefits of distance learning, students enrolled in online degree programs have a higher chance of dropping out than those attending a conventional classroom environment. Detecting student withdrawals early is fundamental to building the next generation of learning environments. In machine learning, this is known as the student dropout prediction (SDP) problem. In this tutorial, intermediate-level academics, industry practitioners, and institutional officers will learn about existing work and current progress in this domain. We provide a mathematical formalisation of the SDP problem, and we comprehensively review the aspects most useful to consider in this specific domain: definition of the prediction problem, input modelling, adopted prediction technique, evaluation framework, standard benchmark datasets, and privacy concerns.
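As an illustration of how the SDP problem is typically formalised, one can model each student as a vector of features derived from their activity log and predict dropout as a binary outcome. The sketch below is hypothetical (the features and weights are invented for illustration, not taken from the tutorial):

```python
# Illustrative sketch: student dropout prediction framed as binary
# classification over weekly activity counts. The feature choices and
# weights below are invented for illustration, not learned from data.
import math

def make_features(weekly_clicks):
    """Turn a student's weekly click counts into a small feature vector:
    total activity, last-week activity, and a simple trend (last - first)."""
    total = sum(weekly_clicks)
    last = weekly_clicks[-1]
    trend = weekly_clicks[-1] - weekly_clicks[0]
    return [total, last, trend]

def predict_dropout(weekly_clicks, w=(-0.05, -0.3, -0.1), b=2.0):
    """Toy logistic model: higher recent activity -> lower dropout risk."""
    x = make_features(weekly_clicks)
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))  # probability of dropping out

# An active student should look safer than an inactive one.
active = predict_dropout([30, 28, 35, 40])
inactive = predict_dropout([10, 4, 1, 0])
```

Real SDP systems would of course learn the weights from labelled historical cohorts and use far richer input modelling, which is exactly the design space the tutorial surveys.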
Anastasia Giachanou and Paolo Rosso
Social media platforms have given users the opportunity to publish content and express their opinions online quickly and easily. The ease of posting content online and the anonymity of social media have increased the amount of harmful content that is published. This tutorial focuses on the detection of harmful information published online, in particular two types: fake news and hate speech. The tutorial will start with an introduction to online harmful information, including definitions and characteristics of its different types. We will then present and discuss different approaches that have been proposed for fake news and hate speech detection, along with details of the evaluation process, available datasets and shared evaluation tasks. The tutorial will conclude with a discussion of open issues and future directions in the field of online harmful information detection.
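Since the abstract mentions the evaluation process, it may help to recall the metrics that shared tasks in this area commonly report. A minimal reference implementation for binary labels (1 = harmful, 0 = benign) might look like this:

```python
# Minimal sketch of the standard evaluation metrics for harmful-content
# detection: precision, recall and F1 on binary labels.

def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One false negative and one false positive out of five examples.
p, r, f = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```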
Qingsong Guo and Jiaheng Lu
Numerous data models have been proposed for practical purposes, which poses a great challenge for big data management. Specifying a database query in a formal query language is typically a challenging task. In the context of multi-model data, this problem becomes even harder because it requires users to deal with data of different types. There is usually no unified schema to help users issue their queries, or only an incomplete one, as data come from disparate sources. Multi-Model DataBases (MMDBs) have been developed to facilitate the management of multi-model data. In this tutorial we offer a comprehensive presentation of a wide range of multi-model data query languages and compare their key properties. The tutorial also offers participants hands-on experience in issuing queries over MMDBs. In addition, we address the essence of multi-model query processing and provide insights into the research challenges and directions for future work.
Shaoxu Song and Aoqian Zhang
Data quality issues have been widely recognized in IoT data and hinder downstream applications. Improving IoT data quality, however, is particularly challenging, given the distinct features of IoT data such as pervasive noise, unaligned timestamps, consecutive errors, misplaced columns, correlated errors and so on. In this tutorial, we review the state-of-the-art techniques for IoT data quality management. In particular, we discuss how dedicated approaches improve various data quality dimensions, including validity, completeness and consistency. We further highlight recent advances in deep learning techniques for IoT data quality. Finally, we point out open problems in IoT data quality management, such as benchmarking and the interpretation of data quality issues.
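To make the "unaligned timestamps" issue concrete, one textbook repair (a sketch under simplifying assumptions, not a technique attributed to the tutorial) is to resample irregular sensor readings onto a regular time grid and fill the gaps by linear interpolation between the nearest observations:

```python
# Illustrative sketch: repair unaligned IoT timestamps by resampling
# onto a regular grid with linear interpolation. Timestamps are ints
# for simplicity; real pipelines handle drift, outliers and streaming.

def resample(readings, start, stop, step):
    """readings: sorted list of (timestamp, value) with irregular gaps.
    Returns values on the regular grid start, start+step, ..., < stop."""
    out = []
    for t in range(start, stop, step):
        before = [(ts, v) for ts, v in readings if ts <= t]
        after = [(ts, v) for ts, v in readings if ts >= t]
        if before and after:
            (t0, v0), (t1, v1) = before[-1], after[0]
            if t0 == t1:
                out.append(v0)  # grid point coincides with a reading
            else:
                # linear interpolation between the surrounding readings
                out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
        elif before:  # past the last reading: hold the last value
            out.append(before[-1][1])
        else:  # before the first reading: hold the first value
            out.append(after[0][1])
    return out

# Readings at t = 0, 3, 7 are resampled onto the grid t = 0, 2, 4, 6.
grid = resample([(0, 10.0), (3, 16.0), (7, 8.0)], start=0, stop=8, step=2)
```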
Fattane Zarrinkalam, Guangyuan Piao, Stefano Faralli and Ebrahim Bagheri
The abundance of user-generated content on social media provides the opportunity to build models that accurately and effectively extract, mine and predict users’ interests, with the hope of enabling more effective user engagement, better-quality delivery of appropriate services and higher user satisfaction. While traditional methods for building user profiles relied on AI-based preference elicitation techniques that users could consider intrusive and undesirable, more recent advances focus on non-intrusive yet accurate ways of determining users’ interests and preferences. In this tutorial, we will cover five important aspects related to the effective mining of user interests: 1) the information sources used for extracting user interests; 2) the various types of user interest profiles that have been proposed in the literature; 3) techniques that have been adopted or proposed for mining user interests; 4) the scalability and resource requirements of state-of-the-art methods; and 5) the evaluation methodologies adopted in the literature for validating the appropriateness of the mined user interest profiles. We will also introduce existing challenges, open research questions and exciting opportunities for further work.
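As a toy instance of non-intrusive interest mining from user-generated content, one baseline technique is to weight the terms in a user's posts by TF-IDF against a background corpus. This sketch is purely illustrative (the data and scoring details are invented; real systems add entity linking, topic models and temporal decay):

```python
# Illustrative sketch: a keyword-based interest profile mined from a
# user's posts with TF-IDF weighting against a background corpus.
import math
from collections import Counter

def interest_profile(user_posts, background_posts, top_k=3):
    """Score terms in the user's posts by TF-IDF and return the top_k
    terms as a simple bag-of-words interest profile."""
    docs = background_posts + [" ".join(user_posts)]
    n_docs = len(docs)
    tf = Counter(" ".join(user_posts).split())
    profile = {}
    for term, count in tf.items():
        df = sum(1 for d in docs if term in d.split())
        profile[term] = count * math.log(n_docs / df)  # tf * idf
    ranked = sorted(profile.items(), key=lambda kv: -kv[1])
    return [term for term, _ in ranked][:top_k]

top = interest_profile(
    ["enjoying the football match", "great football season"],
    ["the weather is great", "a great season finale"],
)
```

Terms frequent in the user's posts but rare in the background ("football" here) rank highest, which is the intuition behind many profile-construction techniques the tutorial surveys.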
This half-day tutorial addresses the fundamentals and advances of deep Bayesian learning for a variety of information systems, ranging from speech recognition to document summarization, text classification, information extraction, image caption generation, sentence/image generation, dialogue management, sentiment classification, recommendation systems, question answering and machine translation, to name a few. Traditionally, “deep learning” is taken to be a learning process from source inputs to target outputs where the inference or optimization is based on a real-valued deterministic model. The “semantic structure” in words, sentences, entities, images, videos, actions and documents may not be well expressed or correctly optimized in mathematical logic or computer programs. The “distribution function” in a discrete or continuous latent variable model for natural sentences or images may not be properly decomposed or estimated. Systematic and elaborate transfer learning is required to bridge source and target domains. This tutorial addresses the fundamentals of statistical models and neural networks, and focuses on a series of advanced Bayesian and deep models, including the variational autoencoder (VAE), stochastic temporal convolutional network, stochastic recurrent neural network, sequence-to-sequence model, attention mechanism, memory-augmented neural network, skip neural network, temporal difference VAE, predictive state neural network, and generative or normalizing flows. Enhancing the prior/posterior representation is also addressed. We present how these models are connected and why they work for information and knowledge management over symbolic and complex patterns in temporal and spatial data. Variational inference and sampling methods are formulated to tackle the optimization of complicated models. Word, sentence and image embeddings are merged with structural or semantic constraints.
A series of case studies is presented to tackle different issues in neural Bayesian information processing. Finally, we will point out a number of directions and outlooks for future study. This tutorial serves three objectives: to introduce novices to the major topics of deep Bayesian learning, to motivate and explain a topic of emerging importance for data mining and information retrieval, and to present a novel synthesis combining distinct lines of machine learning work.
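For readers unfamiliar with the variational inference underlying VAE-style models, the central object is the evidence lower bound (ELBO): for observed data $x$, latent variable $z$, decoder $p_\theta(x \mid z)$, prior $p(z)$ and approximate posterior $q_\phi(z \mid x)$,

```latex
\log p_\theta(x)
  \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  \;-\; \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)
  \;=\; \mathcal{L}(\theta, \phi; x)
```

Maximizing $\mathcal{L}$ jointly over $\theta$ and $\phi$ trades off reconstruction quality against keeping the posterior close to the prior, which is the optimization the tutorial's variational inference and sampling methods address.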
Si Zhang and Hanghang Tong
In the era of big data, networks often come from multiple sources, such as the social networks of diverse platforms (e.g., Facebook, Twitter), protein-protein interaction (PPI) networks of different tissues, transaction networks at multiple financial institutions and knowledge graphs derived from a variety of knowledge bases (e.g., DBpedia, Freebase). The very first step before exploring insights from these multi-sourced networks is to integrate and unify the different networks. In general, network alignment is the task of uncovering the correspondences among nodes across different graphs. The challenges of network alignment include: (1) the heterogeneity of the multi-sourced networks, e.g., different structural patterns; (2) the variety of real-world networks, e.g., how to leverage the rich contextual information; and (3) the computational complexity. The goal of this tutorial is to (1) provide a comprehensive overview of the recent advances in network alignment and (2) identify the open challenges and future trends. We believe this can benefit numerous application problems and attract researchers and practitioners from both the data mining area and other interdisciplinary areas. In particular, we start by introducing the background, problem definition and key challenges of network alignment. Next, our emphasis will be on (1) recent techniques for addressing the network alignment problem and other related problems, with a careful balance between algorithms and applications, and (2) the open challenges and future trends.
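To give a feel for the problem definition, here is a deliberately naive baseline (not one of the published methods the tutorial covers): score node pairs across two graphs by the similarity of a purely structural signature and match them greedily. It illustrates why structure alone is a weak signal and why heterogeneity makes alignment hard:

```python
# Illustrative sketch: naive network alignment by greedy matching on
# structural signatures (sorted neighbour degrees). Graphs are given
# as adjacency dicts: node -> list of neighbours.

def signature(graph, node):
    """Structural signature: the sorted degrees of a node's neighbours."""
    return sorted(len(graph[nbr]) for nbr in graph[node])

def align(g1, g2):
    """Greedy one-to-one matching of nodes by signature similarity
    (Jaccard overlap of the signature value sets)."""
    pairs = []
    for u in g1:
        for v in g2:
            s1, s2 = set(signature(g1, u)), set(signature(g2, v))
            sim = len(s1 & s2) / max(len(s1 | s2), 1)
            pairs.append((sim, u, v))
    pairs.sort(reverse=True)  # best candidate pairs first
    matched1, matched2, mapping = set(), set(), {}
    for sim, u, v in pairs:
        if u not in matched1 and v not in matched2:
            mapping[u] = v
            matched1.add(u)
            matched2.add(v)
    return mapping

# Two isomorphic 3-node stars: the hub 'a' should map to the hub 'x'.
mapping = align(
    {"a": ["b", "c"], "b": ["a"], "c": ["a"]},
    {"x": ["y", "z"], "y": ["x"], "z": ["x"]},
)
```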
Computer vision (CV) is a field of artificial intelligence that trains computers to interpret and understand the visual world for a variety of exciting downstream tasks, such as self-driving cars, checkout-less shopping, smart cities, cancer detection, and more. The field of CV has been revolutionized by deep learning over the last decade. This tutorial looks under the hood of modern-day CV systems and builds out some of these pipelines in a Jupyter Notebook using Python, OpenCV, Keras and TensorFlow. While the primary focus is on digital images from cameras and videos, this tutorial will also introduce 3D point clouds, and classification and segmentation algorithms for processing them.
More concretely, we will briefly overview the basics of computer vision and object detection, progressing from object detection’s earlier attempts based on dense multiscale sliding windows of Histogram of Oriented Gradients (HOG) features in conjunction with support vector machine classifiers, to modern-day pipelines based upon deep fully convolutional neural networks (FCNNs). These modern pipelines rely on complex FCNN architectures (often 50-60 layers deep) and multi-task loss functions, and are either two-stage (e.g., Faster R-CNN) or single-stage (e.g., YOLO/SSD) in nature. Recent revolutionary architectures such as the DEtection TRansformer (DETR) will also be presented. Core concepts will be demonstrated with examples, code, and exercises. The tutorial will culminate with a demonstration (and a challenge) of how to build, train, and evaluate computer vision applications, with a primary focus on building an object detection application from scratch to detect logos in images/video.
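Two building blocks shared by both the two-stage and single-stage pipelines above are intersection-over-union (IoU) between bounding boxes and greedy non-maximum suppression (NMS). A minimal framework-free sketch (boxes are `(x1, y1, x2, y2)` with `x1 < x2`, `y1 < y2`):

```python
# Illustrative sketch: IoU and greedy NMS, the post-processing core of
# most object detection pipelines (Faster R-CNN, YOLO, SSD).

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that
    overlap it by more than thresh, repeat. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate detections and one distant one: NMS keeps two boxes.
keep = nms(
    [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)],
    [0.9, 0.8, 0.7],
)
```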
Juan F. Sequeda and Claudio Gutierrez
Knowledge Graphs can be considered to fulfil an early vision in Computer Science of creating intelligent systems that integrate knowledge and data at large scale. Stemming from scientific advances in research areas such as the Semantic Web, Databases, Knowledge Representation, NLP and Machine Learning, among others, Knowledge Graphs have rapidly gained popularity in academia and industry in recent years. The integration of such disparate disciplines and techniques gives Knowledge Graphs their richness, but also presents a challenge to practitioners and theoreticians: understanding how current advances developed from early techniques in order, on the one hand, to take full advantage of them and, on the other, to avoid reinventing the wheel. This tutorial will provide historical context on the roots of Knowledge Graphs, grounded in the advances of Logic, Data and their combination.
Deepak P, Joemon M. Jose, and Sanil V
Data in digital form is expanding at an exponential rate, far outpacing any chance of getting any significant fraction labelled manually. This has resulted in a heightened research emphasis on unsupervised learning, i.e., learning in the absence of labels. In fact, unsupervised learning has often been dubbed the next frontier of AI. Unsupervised learning is the only plausible model for analyzing the bulk of passively collected data that spans various domains, e.g., social media footprints, safety/surveillance cameras, IoT devices, sensors, smartphone apps, medical wearables, traffic sensing devices and public Wi-Fi access. While fairness in supervised learning, such as classification tasks, has inspired a large amount of research in the past few years, work on fair unsupervised learning has been relatively slow to pick up. This tutorial aims to provide an overview of: (i) fairness principles, drawing abundantly from political philosophy and placed within the backdrop of motivating scenarios from unsupervised learning; (ii) current research in fair algorithms for unsupervised learning; and (iii) new directions to extend the state of the art in fair unsupervised learning. While we intend to broadly cover all tasks in unsupervised learning, our focus will be on clustering, retrieval and representation learning. In a unique departure from conventional data science tutorials, we will place significant emphasis on presenting and debating pertinent literature from ethics and philosophy. Overall, there will be a strong emphasis on interdisciplinarity, with the instructor team having expertise in both computer science and political philosophy.
We start the tutorial with an introduction, followed by a set of motivating unsupervised analytics scenarios that illustrate the need to address fairness considerations. Next, we will outline several principles of fairness, including streams explored in the ML literature (e.g., individual and group fairness) as well as popular notions of fairness within political philosophy, chiefly Rawlsian fairness and several notions within the Rawlsian family. This will be followed by an analysis of classical unsupervised learning algorithms from the perspective of fairness, as well as a reasonably comprehensive review of fair unsupervised learning algorithms. We will then outline several interesting directions for future work, targeted at young researchers in the audience who may be interested in embarking on fair ML research.
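To make group fairness concrete in the clustering setting, one widely used measure is "balance": the worst-case ratio, over clusters, between the minority and majority protected-group counts in that cluster (1.0 = every cluster perfectly mixed, 0.0 = some cluster contains a single group). A minimal sketch, with invented data, for the two-group case:

```python
# Illustrative sketch: the "balance" group-fairness measure for a
# clustering, for two protected groups.

def balance(clusters, group_of):
    """clusters: list of lists of points; group_of: point -> group label.
    Returns the minimum, over clusters, of the minority/majority ratio."""
    worst = 1.0
    for cluster in clusters:
        counts = {}
        for point in cluster:
            g = group_of(point)
            counts[g] = counts.get(g, 0) + 1
        if len(counts) < 2:  # a single-group cluster is maximally unfair
            return 0.0
        worst = min(worst, min(counts.values()) / max(counts.values()))
    return worst

# Group is encoded in the first character of each point's name:
# cluster 1 is perfectly balanced, cluster 2 has ratio 1/2.
b_score = balance([["a1", "b1"], ["a2", "a3", "b2"]], lambda p: p[0])
```

Fair clustering algorithms then constrain or penalize clusterings with low balance; the tutorial reviews this and other fairness notions in depth.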
Manish Gupta, Vasudeva Varma, Sonam Damani and Kedhar Nath Narahari
In recent years, the fields of NLP and information retrieval have made tremendous progress thanks to deep learning models like RNNs and LSTMs, and Transformer-based models like BERT. These models, however, are enormous in size, while real-world applications demand small model sizes, low response times and low power consumption. We will discuss six types of methods (pruning, quantization, knowledge distillation, parameter sharing, matrix decomposition, and other Transformer-based methods) for compressing such models to enable their deployment in real industry NLP projects. Given the critical need for applications built with efficient and small models, and the large amount of recently published work in this area, we believe this tutorial is very timely. We will organize the related work done by the ‘deep learning for NLP’ community in the past few years and present it as a coherent story.
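The two simplest of these six families can be sketched in a few lines. The toy, framework-free version below shows magnitude pruning (zero out the smallest-magnitude weights) and uniform post-training quantization (snap each weight to one of 2^bits levels); production systems apply these per-layer or per-tensor with calibration:

```python
# Illustrative sketch: magnitude pruning and uniform quantization,
# two of the six model-compression families, on a flat weight list.

def prune(weights, sparsity):
    """Zero out the given fraction of smallest-magnitude weights."""
    k = int(len(weights) * sparsity)
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else -1.0
    return [0.0 if abs(w) <= threshold else w for w in weights]

def quantize(weights, bits=8):
    """Uniform quantization to 2^bits levels over the weight range."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((w - lo) / scale) * scale for w in weights]

# Pruning 40% of five weights zeroes the two smallest in magnitude.
pruned = prune([0.9, -0.01, 0.4, 0.02, -0.7], sparsity=0.4)
```

Both techniques shrink storage and can speed up inference (sparse kernels for pruning, integer arithmetic for quantization), at the cost of some accuracy that distillation or fine-tuning is typically used to recover.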