Title: Data Science Driven Methods for Sustainable and Failure Tolerant Edge Systems
We are experiencing a paradigm shift in our society, where every item around us is becoming a computer, facilitating life-changing applications like self-driving cars, telemedicine, precision agriculture or virtual reality. On the one hand, executing such resource-demanding applications requires powerful IT facilities. On the other hand, the requirements often include latencies below 100 ms or even below 10 ms, the so-called "tactile internet". To facilitate low latency, computation has to be placed in the vicinity of the end users by utilizing the concept of Edge Computing. In this talk we explain the challenges of Edge systems in combination with the tactile internet. We discuss recent problems of geographically distributed machine learning applications and novel approaches to balancing competing priorities such as energy efficiency and the staleness of machine learning models.
Available failure resilience mechanisms designed for Cloud computing or generic distributed systems cannot be applied to Edge systems because of their timeliness requirements, hyper-heterogeneity and resource scarcity. Therefore, we discuss a novel machine learning based mechanism that evaluates the failure resilience of a service deployed redundantly on the edge infrastructure. Our approach learns the spatiotemporal dependencies between edge server failures and combines them with topological information to incorporate link failures, using Dynamic Bayesian Networks (DBNs). Finally, we infer the probability that a certain set of servers fails or disconnects concurrently during service runtime.
Ivona Brandic is University Professor for High Performance Computing Systems at the Institute of Information Systems Engineering, Vienna University of Technology (TU Wien), where she leads the High Performance Computing Systems Research Group. In 2015 she was awarded the FWF START prize, the highest Austrian award for early career researchers. Since 2016 she has been a member of the Young Academy of the Austrian Academy of Sciences. She received her PhD degree in 2007 and her venia docendi for practical computer science in 2013, both from Vienna University of Technology. From 2009 to 2012 she led the Austrian national FoSII (Foundations of Self-governing ICT Infrastructures) project funded by the Vienna Science and Technology Fund (WWTF). She was a management committee member of the European Commission's COST Action on Energy Efficient Large Scale Distributed Systems and of the COST Action on Sustainable Ultrascale Computing (NESUS). From June to August 2008 she was a visiting researcher at the University of Melbourne, Australia. I. Brandic was on the Editorial Board of IEEE Magazine on Cloud Computing, IEEE TPDS and IEEE TCC. In 2011 she received the Distinguished Young Scientist Award from the Vienna University of Technology for her project on Holistic Energy Efficient Hybrid Clouds. Her interests comprise virtualized HPC systems, energy efficient ultra-scale distributed systems, massive-scale data analytics, Cloud & workflow Quality of Service (QoS), and service-oriented distributed systems. She has published more than 50 scientific journal, magazine and conference publications and has co-authored a textbook on federated and self-manageable Cloud infrastructures. I. Brandic has been invited as an expert evaluator for the European Commission and many national research organizations. In 2019 she chaired the CHIST-ERA panel (ANR) on Smart Distribution of Computing in Dynamic Networks (SDCDN).
She is a board member of the Center for Artificial Intelligence and Machine Learning (CAIML) and a faculty member of the Vienna Center for Engineering in Medicine at TU Wien.
Title: Building warehouse-scale computers
Imagine some product team inside Google wants 100,000 CPU cores + RAM + flash + accelerators + disk in a couple of months. We need to decide where to put them, when; whether to deploy new machines, or re-purpose/reconfigure old ones; ensure we have enough power, cooling, networking, physical racks, data centers and (over a longer time-frame) wind power; cope with variances in delivery times from supply logistics hiccups; do multi-year cost-optimal placement+decisions in the face of literally thousands of different machine configurations; keep track of parts; schedule repairs, upgrades, and installations; and generally make all this happen behind the scenes at minimum cost.
And then after breakfast, we get to dynamically allocate resources (on the small-minutes timescale) to the product groups that need them most urgently, accurately reflecting the cost (opex/capex) of all the machines and infrastructure we just deployed, and monitoring and controlling the datacenter power and cooling systems to achieve minimum overheads, even as we replace all of these on the fly.
This talk will highlight some of the exciting problems we’re working on inside Google to ensure we can supply the needs of an organization that is experiencing (literally) exponential growth in computing capacity.
John Wilkes has been at Google since 2008, where he is working on automation for building warehouse-scale computers, with a current focus on delivering network capacity. Before this, he worked on cluster management for Google's compute infrastructure (Borg, Omega, Kubernetes). He is interested in far too many aspects of distributed systems, but a recurring theme has been technologies that allow systems to manage themselves. He received a PhD in computer science from the University of Cambridge, joined HP Labs in 1982, and was elected an HP Fellow and an ACM Fellow in 2002 for his work on storage system design. Along the way, he’s been program committee chair for SOSP, FAST, EuroSys and HotCloud, and has served on the steering committees for EuroSys, FAST, SoCC and HotCloud. He’s listed as an inventor on 50+ US patents, and has an adjunct faculty appointment at Carnegie Mellon University. In his spare time he continues, stubbornly, trying to learn how to blow glass.
Title: Performance Optimization of HPC Applications in Large-Scale Cluster Systems
In modern HPC clusters, the performance of an application depends on a combination of several aspects. To improve application performance successfully, all of these aspects should be analyzed and optimized. In particular, as modern CPUs contain more and more cores, the speed of floating-point computation has increased rapidly, making data access one of the main bottlenecks in most HPC applications. Furthermore, performance diagnosis for HPC applications can be extremely complex, and the performance bottlenecks of an application may vary with the scale of parallelism. In this presentation, a multi-layered data access (MLDA) optimization methodology is introduced, which developers can follow to optimize their HPC applications. We provide several examples of applying the MLDA method to real-world HPC applications in areas including weather, ocean, materials science, CFD, and MHD.
Dr. Li received his degree in Engineering from the Civil Engineering Department of Tianjin University in 2019, with Professor Qinghe Zhang as his advisor. During his Ph.D. studies, he worked on the development of the numerical model NDFEM, based on the discontinuous finite element method, and on applying the model to the simulation of physical problems such as nearshore hydrodynamics and tsunami waves. After graduation, he joined the AI&HPC software department of Inspur Information, where he works on the development of the performance analysis tool TEYE. He has been involved in the analysis and optimization of several HPC applications. His research areas include performance modeling and HPC application optimization.
Li L., Zhang Q. Development of an efficient wetting and drying treatment for shallow water modeling using the quadrature‐free RKDG method[J]. International Journal for Numerical Methods in Fluids, 2020.
Li L., Zhang Q. A new vertex-based limiting approach for nodal discontinuous Galerkin methods on arbitrary unstructured meshes[J]. Computers & Fluids, 2017, 159:316-326.
Li L., Zhang Q. A Quadrature-Free Scheme for Nodal Discontinuous Galerkin Method on Arbitrary Quadrilateral Unstructured Meshes[J]. Journal of Tianjin University Science and Technology, 2018.