Google Open Source Blog: machine learning

Showing posts with label machine learning. Show all posts

Transforming Kubernetes and GKE into the leading platform for AI/ML

Wednesday, May 21, 2025

The world is rapidly embracing the power of AI/ML, from training cutting-edge foundation models to deploying intelligent applications at scale. As these workloads become more sophisticated and demanding, the infrastructure required to support them must evolve. Kubernetes has emerged as the standard for container orchestration, but AI/ML introduces unique challenges that push traditional infrastructure to its limits.

AI training jobs often require massive scale, needing to coordinate thousands of specialized hardware like GPUs and TPUs. Reliability is critical, as failures can be costly for long running, large-scale training jobs. Efficient resource sharing across teams and workloads is essential given the expense of accelerators. Furthermore, deploying and scaling AI models for inference demands low latency and faster startup times for large container images and models.

At Google, we are deeply invested in the AI/ML revolution. This is why we are doubling down on our commitment to advancing Kubernetes as the foundational open standard for these workloads. Our strategy centers on evolving the core Kubernetes platform to meet the needs of the "next trillion core hours," specifically focusing on batch and AI/ML. We then bring these advancements, alongside enterprise-grade management and optimizations, to users through Google Kubernetes Engine (GKE).

Here's how we are transforming Kubernetes and GKE:

Redefining Kubernetes' relationship with specialized hardware

Kubernetes was initially designed for more uniform CPU compute. The surge of AI/ML brought new requirements for seamless integration and efficient management of expensive, sparse, and diverse accelerators. To support these new demands, Google has been a key investor in upstream Kubernetes to offer robust support for a diverse portfolio of the latest accelerators, including multiple generations of TPUs and a wide range of NVIDIA GPUs.

A core Kubernetes enhancement driven by Google and the community to better support AI/ML workloads is Dynamic Resource Allocation (DRA). This framework, developed in the heart of Kubernetes, provides a more flexible and extensible way for workloads to request and consume specialized hardware resources beyond traditional CPU and memory, which is crucial for efficiently managing accelerators. Building on such foundational open-source capabilities, GKE can then offer features like Custom Compute Classes, which improve the obtainability of these resources through intelligent fallback priorities across different capacity types like reservations, on-demand, and Spot instances. Google's active contributions to advanced resource management and scheduling capabilities within the Kubernetes community ensure that the platform evolves to meet the sophisticated demands of AI/ML, making efficient use of these specialized hardware resources more broadly accessible.

Unlocking scale and reliability

AI/ML workloads demand unprecedented scale and have new failure modes compared to traditional applications. GKE is built to handle this, supporting up to 65,000 nodes in a single cluster. We've demonstrated the ability to run the largest publicly announced training jobs, coordinating 50,000 TPU chips with near-ideal scaling efficiency.

Critically, we are enhancing core Kubernetes capabilities to support the scale and reliability needed for AI/ML. For instance, to better manage distributed AI workloads like serving large models split across multiple hosts, Google has been instrumental in developing features like JobSet (emerging from earlier concepts like LeaderWorkerSet) within the Kubernetes community (SIG Apps). This provides robust orchestration for co-scheduled, interdependent groups of Pods. We are also actively working upstream to improve Kubernetes reliability and stability through initiatives like Production Readiness Reviews, promoting safer upgrade paths, and enhancing etcd stability for the benefit of all Kubernetes users.

Optimizing Kubernetes performance for efficient inference

Low-latency and cost-efficient inference is critical for AI applications. For serving, the GKE Inference Gateway routes requests based on model server metrics like KVCache utilization and pending queue length, reducing serving costs by up to 30% and tail latency by 60% compared to traditional load balancing. We've even achieved vLLM fungibility across TPUs and GPUs, allowing users to serve the same model on either accelerator without incremental effort.

To address slow startup times for large AI/ML container images (often 20GB+), GKE offers rapid scale-out features. Secondary boot disks allow preloading container images and data, resulting in up to 29x faster container mounting time. GCS FUSE enables streaming data directly from Cloud Storage, leading to faster model load times. Furthermore, GKE Inference Quickstart provides data-driven, optimized Kubernetes deployment configurations, saving extensive benchmarking effort and enabling up to 30% lower cost, 60% lower tail latency, and 40% higher throughput.

Simplifying the Kubernetes experience and enhancing observability for AI/ML

We understand that data scientists and ML researchers may not be Kubernetes experts. Google aims to simplify the setup and management of AI-optimized Kubernetes clusters. This includes contributions to Kubernetes usability efforts and SIG-Usability. Managed offerings like GKE provide multiple paths to set up AI-optimized environments, from default configurations to customizable blueprints. Offerings like GKE Autopilot further abstract away infrastructure management, aiming for the ease of use that benefits all users.
Ensuring visibility into AI/ML workloads is paramount. Google actively supports and contributes to the integration of standard open-source observability tools within the Kubernetes ecosystem, such as Prometheus, Grafana, and OpenTelemetry. Building on this open foundation, GKE then provides enhanced, out-of-the-box observability integrated with popular AI frameworks & tools, including specific insights into workload startup latency and end-to-end tracing.

Looking ahead: continued investment in Open Source Kubernetes for AI/ML

The transformation continues. Our roadmap includes exciting developments in upstream Kubernetes for easily deploying and managing large-scale clusters, support for new GPU & TPU generations integrated through open-source mechanisms, and continued community-driven innovations in fast startup, reliability, and ease of use for AI/ML workloads.

Google is committed to making Kubernetes the premier open-source platform for AI/ML, pushing the boundaries of scale, performance, and efficiency while maintaining stability and ease of use. By driving innovation in core Kubernetes and building powerful, deeply integrated capabilities in our managed offering, GKE, we are empowering organizations to accelerate their AI/ML initiatives and unlock the next generation of intelligent applications built on an open foundation.

Come explore the possibilities with Kubernetes and GKE for your AI/ML workloads!

By Francisco Cabrera & Federico Bongiovanni, GCP Google Kubernetes Engine

A Robust Open Ecosystem for All: Accelerating AI Infrastructure

Tuesday, December 3, 2024

JAX now runs on AWS Trainium: Open Source Fuels AI Innovation

Open source software is the foundation of machine learning. It accelerates innovation through an ethos of flexibility and collaboration. This philosophy drives the open development of JAX, our high-performance array computing library, as well as OpenXLA, the compiler and runtime infrastructure it relies on.

Today we're excited to highlight how this commitment to openness, together with JAX and OpenXLA's modular designs, enables seamless integration of AWS Trainium and Trainium2 chips accelerators into the JAX ecosystem. Users get more portability, more choice, and faster progress.

JAX and OpenXLA, abstraction and modularity

JAX is a Python library for high-performance, large-scale numerical computing and machine learning. Its unique compiler-oriented design makes numerical computation familiar and portable while also accelerator-friendly and scalable. It combines a NumPy-like API with composable transformations for automatic differentiation, vectorization, parallelization, and more. Under the hood, JAX leverages the XLA compiler to optimize and scale computations over a broad set of backends.

This abstraction layer is key to its portability: JAX presents a consistent interface while XLA optimizes performance, whether you're running on CPUs, GPUs, TPUs, or something new.

In fact, OpenXLA infrastructure is designed to be modular and extensible to new platforms. By developing a PJRT plugin and leveraging existing XLA compiler components, JAX code can target new platforms, even when scaling from a single device to thousands.

Enter AWS Trainium and Inferentia

We are excited to announce that AWS Trainium is the latest platform to embrace JAX and OpenXLA. With the JAX Neuron plugin, AWS Trainium and Inferentia can be used as native JAX devices.

This new backend demonstrates how abstraction and modularity make JAX and OpenXLA especially extensible and amenable to collaboration, even on new hardware. We're thrilled to have diverse hardware partners like AMD, Arm, Intel, Nvidia, and AWS taking advantage of JAX's portability and performance. If you're interested in bringing new platforms into the JAX and OpenXLA ecosystem, please reach out!

A multi-platform ecosystem fosters open collaboration in advancing AI infrastructure. Our goal is to drive continuous development of open standards and to accelerate progress. And if you're a machine learning developer or numerical computing user, we're excited for you to try JAX on any platform you choose.

By Matthew Johnson - Principal Scientist, with additional contributors: Aditi Joshi, Fenghui Zhang, Roy Frostig, and Carlos Araya

OpenXLA Dev Lab 2024: Building Groundbreaking ML Systems Together

Thursday, May 9, 2024

AMD, Arm, AWS, Google, NVIDIA, Intel, Tesla, SambaNova, and more come together to crack the code for colossal AI workloads

As AI models grow increasingly complex and compute-intensive, the need for efficient, scalable, and hardware-agnostic infrastructure has never been greater. OpenXLA is a deep learning compiler framework that makes it easy to speed up and massively scale AI models on a wide range of hardware types—from GPUs and CPUs to specialized chips like Google TPUs and AWS Trainium. It is compatible with popular modeling frameworks—JAX, PyTorch, and TensorFlow—and delivers leading performance. OpenXLA is the acceleration infrastructure of choice for global-scale AI-powered products like Amazon.com Search, Google Gemini, Waymo self-driving vehicles, and x.AI's Grok.

The OpenXLA Dev Lab

On April 25th, the OpenXLA Dev Lab played host to over 100 expert ML practitioners from 10 countries, representing industry leaders like AMD, Arm, AWS, ByteDance, Cerebras, Cruise, Google, NVIDIA, Intel, Tesla, SambaNova, and more. The full-day event, tailored to AI hardware vendors and infrastructure engineers, broke the mold of previous OpenXLA Summits by focusing purely on “Lab Sessions”, akin to office hours for developers, and hands-on Tutorials. The energy of the event was palpable as developers worked side-by-side, learning and collaborating on both practical challenges and exciting possibilities for AI infrastructure.

Figure 1: Developers from around the world congregated at the OpenXLA Dev Lab.

The Dev Lab was all about three key things:

Educate and Empower: Teach developers how to implement OpenXLA's essential workflows and advanced features through hands-on tutorials.
Offer Expert Guidance: Provide personalized office hours led by OpenXLA experts to help developers refine their ideas and contributions.
Foster Community: Encourage collaboration, knowledge-sharing, and lasting connections among the brilliant minds in the OpenXLA community.

Tutorials

The Tutorials included:

Integrating an AI Compiler & Runtime into PJRT

Learn how PJRT connects ML frameworks to AI accelerators, standardizing their interaction for easy model deployment on diverse hardware.
Explore the PJRT C API for framework-hardware communication.
Implement a PJRT Plugin, a Python package that implements the C API.
Discover plugin examples for Apple Metal, CUDA, Intel GPU, and TPU.

Led by Jieying Luo and Skye Wanderman-Milne

Extracting StableHLO Graphs + Intro to StableHLO Quantizer

Learn to export StableHLO from JAX, PyTorch, and TensorFlow using static/dynamic shapes and SavedModel format.
Hack along with the tutorial using the JAX, PyTorch, and TensorFlow Colab notebooks provided on OpenXLA.org.
Simplify quantization with StableHLO Quantizer; a framework and device-agnostic tool.
Explore streamlined parameter selection and model rewriting for lower precision.

Led by Kevin Gleason, Jen Ha, and Xing Liu

Optimizing PyTorch/XLA Auto-sharding for Your Hardware

Discover this experimental feature that automates distributing large-scale PyTorch models across XLA devices.
Learn how it partitions and distributes for out-of-the-box performance without manual intervention
Explore future directions such as customizable cost models for different hardware

Led by Yeounoh Chung and Pratik Fegade

Optimizing Compute and Communication Scheduling with XLA

Scale ML models on multi-GPUs with SPMD partitioning, collective communication, HLO optimizations.
Explore tensor parallelism, latency hiding scheduler, pipeline parallelism.
Learn collective optimizations, pipeline parallelism for efficient large-scale training.

Led by Frederik Gossen, TJ Xu, and Abhinav Goel

Lab Sessions

Lab Sessions featured use case-specific office hours for AMD, Arm, AWS, ByteDance, Intel, NVIDIA, SambaNova, Tesla, and more. OpenXLA engineers were on hand to provide development teams with dedicated support and walkthrough specific pain points and designs. In addition, Informational Roundtables that covered broader topics like GPU ML Performance Optimization, JAX, and PyTorch-XLA GPU were available for those without specific use cases. This approach led to productive exchanges and fine-grained exploration of critical contribution areas for ML hardware vendors.

four photos of participants and vendors at OpenXLA Dev Lab

Don’t just take our word for it – here’s some of the feedback we received from developers:

"OpenXLA is awesome, and it's great to see the community interest around it. We're excited about the potential of PJRT and StableHLO to improve the portability of ML workloads onto novel hardware such as ours. We appreciate the support that we have been getting."

— Mark Gottscho, Senior Manager and Technical Lead at SambaNova

"Today I learned a lot about Shardy and about some of the bugs I found in the GSPMD partitioner, and I got to learn a lot of cool stuff."

— Patrick Toulme, Machine Learning Engineer at AWS

“I learned a lot, a lot about how XLA is making tremendous progress in building their community.”

— Tejash Shah, Product Manager at NVIDIA

“Loved the format this year - please continue … lots of learning, lots of interactive sessions. It was great!”

— Om Thakkar, AI Software Engineer at Intel

Technical Innovations and The Bold Road Ahead

The event kicked off with a keynote by Robert Hundt, Distinguished Engineer at Google, who outlined OpenXLA's ambitious plans for 2024, particularly three major areas of focus:

Large-scale training
GPU and PyTorch compute performance
Modularity and extensibility

Empowering Large-Scale Training

OpenXLA is introducing powerful features to enable model training at record-breaking scales. One of the most notable additions is Shardy, a tool coming soon to OpenXLA that automates and optimizes how large AI workloads are divided across multiple processing units, ensuring efficient use of resources and faster time to solution. Building on the success of its predecessor, SPMD, Shardy empowers developers with even more fine-grained control over partitioning decisions, all while maintaining the productivity benefits that SPMD is known for.

Diagram of sharding representation with a simple rank 2 tensor and 4 devices.

Figure 2: Sharding representation example with a simple rank 2 tensor and 4 devices.

In addition to Shardy, developers can expect a suite of features designed to optimize computation and communication overlap, including:

Automatic profile-guided latency estimation
Collective pipelining
Heuristics-based collective combiners

These innovations will enable developers to push the boundaries of large-scale training and achieve unprecedented performance and efficiency.

OpenXLA Delivers on TorchBench Performance

OpenXLA has also made significant strides in enhancing performance, particularly on GPUs with key PyTorch-based generative AI models. PyTorch-XLA GPU is now neck and neck with TorchInductor for TorchBench Full Graph Models and has a TorchBench pass rate within 5% of TorchInductor.

A bar graph showing a performance comparison of TorchInductor vs. PyTorch-XLA GPU on Google Cloud NVIDIA H100 GPUs

Figure 3: Performance comparison of TorchInductor vs. PyTorch-XLA GPU on Google Cloud NVIDIA H100 GPUs. “Full graph models” represent all TorchBench models that can be fully represented by StableHLO

Behind these impressive gains lies XLA GPU's global cost model, a game-changer for developers. In essence, this cost model acts as a sophisticated decision-making system, intelligently determining how to best optimize computations for specific hardware. The cost model delivers state-of-the-art performance through a priority-based queue for fusion decisions and is highly extensible, allowing third-party developers to seamlessly integrate their backend infrastructure for both general-purpose and specialized accelerators. The cost model's adaptability ensures that computation optimizations are tailored to specific accelerator architectures, while less suitable computations can be offloaded to the host or other accelerators.

OpenXLA is also breaking new ground with novel kernel programming languages, Pallas and Mosaic, which empower developers to write highly optimized code for specialized hardware. Mosaic demonstrates remarkable efficiency in programming key AI accelerators, surpassing widely used libraries in GPU code generation efficiency for models with 64, 128, and 256 Q head sizes, as evidenced by its enhanced utilization of TensorCores.

A bar graph showing a performance comparison of Flash Attention vs. Mosaic GPU on NVIDIA H100 GPUs

Figure 4: Performance comparison of Flash Attention vs. Mosaic GPU on NVIDIA H100 GPUs.

Modular and Extensible AI Development

In addition to performance enhancements, OpenXLA is committed to making the entire stack more modular and extensible. Several initiatives planned for 2024 include:

Strengthening module interface contracts
Enhancing code sharing between platforms
Enabling a shared high-level compiler flow through runtime configuration and component registries

A flow diagram showing modules and subcomponents of the OpenXLA stack.

Figure 5: Modules and subcomponents of the OpenXLA stack.

These improvements will make it easier for developers to build upon and extend OpenXLA.

Alibaba's success with PyTorch XLA FSDP within their TorchAcc framework is a prime example of the benefits of OpenXLA's modularity and extensibility. By leveraging these features, Alibaba achieved state-of-the-art performance for the LLaMa 2 13B model, surpassing the previous benchmark set by Megatron. This demonstrates the power of the developer community in extending OpenXLA to push the boundaries of AI development.

A bar graph showing a performance comparison of TorchAcc and Megatron for LLaMa 2 13B at different number of GPUs.

Figure 6: Performance comparison of TorchAcc and Megatron for LLaMa 2 13B at different numbers of GPUs.

Join the OpenXLA Community

If you missed the Dev Lab, don't worry! You can still access StableHLO walkthroughs on openxla.org, as well as the GitHub Gist for the PJRT session. Additionally, the recorded keynote and tutorials are available on our YouTube channel. Explore these resources and join our global community – whether you're an AI systems expert, model developer, student, or just starting out, there's a place for you in our innovative ecosystem.

Acknowledgements

Adam Paszke, Allen Hutchison, Amin Vahdat, Andrew Leaver, Andy Davis, Artem Belevich, Abhinav Goel, Bart Chrzaszcz, Benjamin Kramer, Berkin Ilbeyi, Bill Jia, Cyril Bortolato, David Dunleavy, Eugene Zhulenev, Florian Reichl, Frederik Gossen, George Karpenkov, Gunhyun Park, Han Qi, Jack Cao, Jacques Pienaar, Jaesung Chung, Jen Ha, Jianting Cao, Jieying Luo, Jiewen Tan, Jini Khetan, Kevin Gleason, Kyle Lucke, Kuy Mainwaring, Lauren Clemens, Manfei Bai, Marisa Miranda, Michael Levesque-Dion, Milad Mohammadi, Nisha Miriam Johnson, Penporn Koanantakool, Puneith Kaul, Robert Hundt, Sandeep Dasgupta, Sayce Falk, Shauheen Zahirazami, Skye Wanderman-Milne, Yeounoh Chung, Pratik Fegade, Peter Hawkins, Vaibhav Singh, Tamás Danyluk, Thomas Joerg, TJ Xu, and Tom Natan

By James Rubin, Aditi Joshi, and Elliot English – on behalf of the OpenXLA Project

Kubeflow joins the CNCF family

Tuesday, July 25, 2023

We are thrilled to announce a major milestone in the journey of the Kubeflow project. After a comprehensive review process and several months of meticulous preparation, Kubeflow has been accepted by the Cloud Native Computing Foundation (CNCF) as an incubating project. This momentous step marks a new chapter in our collaborative and open approach to accelerating machine learning (ML) in the cloud native ecosystem.

The acceptance of Kubeflow into the incubation stage by the CNCF reflects not just the project's maturity, but also its widespread adoption and expanding user base. It underscores the tremendous value of the diverse suite of components that Kubeflow provides, including Notebooks, Pipelines, Training Operators, Katib, Central Dashboard, Manifests, and many more. These tools have been instrumental in creating a cohesive, end-to-end ML platform that streamlines the development and deployment of ML workflows.

Furthermore, the alignment of Kubeflow with the CNCF acknowledges the project's foundational reliance on several existing CNCF projects such as Argo, Cert-Manager, and Istio. The joining of Kubeflow with the CNCF will serve to strengthen these existing relationships and foster greater collaboration among cloud native projects, leading to even more robust and innovative solutions for users.

Looking ahead, Google and the Kubeflow community are eager to collaborate with the CNCF on the transition process. Rest assured, our commitment to Kubeflow's ongoing development remains unwavering during this transition. We will continue to support new feature development, plan and execute upcoming releases, and strive to deliver further improvements to the Kubeflow project.

We extend our heartfelt thanks to the CNCF Technical Oversight Committee and the wider CNCF community for their support and recognition of the Kubeflow project. We look forward to this exciting new phase in our shared journey towards advancing machine learning in the cloud native landscape.

As Kubeflow continues to evolve, we invite developers, data scientists, ML engineers, and all other interested individuals to join us in shaping the future of cloud native machine learning. Let's innovate together, with Kubeflow and the CNCF, to make machine learning workflows more accessible, manageable, and scalable than ever before!

By James Liu – GCP Cloud AI

Kubeflow applies to become a CNCF incubating project

Monday, October 24, 2022

Google has pioneered AI and ML and has a history of innovative technology donations to the open source community (e.g. TensorFlow and Jax). Google is also the initial developer and largest contributor to Kubernetes, and brings with it a wealth of experience to the project and its community. Building an ML Platform on our state-of-the-art Google Kubernetes Engine (GKE), we have learned best practices from our users, and in 2017, we used that experience to create and open source the Kubeflow project.

In May 2020, with the v1.0 release, Kubeflow reached maturity across a core set of its stable applications. During that year, we also graduated Kubeflow Serving as an independent project, KServe, which is now incubating in Linux Foundation AI & Data.

Today, Kubeflow has developed into an end-to-end, extendable ML platform, with multiple distinct components to address specific stages of the ML lifecycle: model development (Kubeflow Notebooks), model training (Kubeflow Pipelines and Kubeflow Training Operator), model serving (KServe), and automated machine learning (Katib).

The Kubeflow project now has close to 200 contributors from over 30 organizations, and the Kubeflow community has hosted several summits and contributor meetups across the world. The broader Kubeflow ecosystem includes a number distributions across multiple cloud service providers and on-prem environments. Kubeflow’s powerful development experience helps data scientists build, train, and deploy their ML models, enabling enterprise ML operation teams to deploy and scale advanced workflows in a variety of infrastructures.

Google’s application for Kubeflow to become a CNCF incubating project is the next big milestone for the Kubeflow community, and we’re thrilled to see how developers will continue to build and innovate in ML using this project.

What's next? The pull request we’ve opened today to join the CNCF as an incubating project is only the first step. Google and the Kubeflow community will work with the CNCF and their Technical Oversight Committee (TOC), to meet the incubation stage requirements. While the due diligence and eventual TOC decision can take a few months, the Kubeflow project will continue developing and releasing throughout this process.

If Kubeflow is accepted into CNCF, the project’s assets will be transferred to the CNCF, including the source code, trademark, and website, and other collaboration and social media accounts. At Google, we believe that using open source comes with a responsibility to contribute, maintain, and improve those projects. In that spirit, we will continue supporting the Kubeflow project and work with the community towards the next level of innovation.

Thanks to everyone who has contributed to Kubeflow over the years! We are excited for what lays ahead for the Kubeflow community.

By Thea Lamkin, Senior Program Manager and Mark Chmarny, Senior Technical Program Manager – Google Open Source

Lyra V2 - a better, faster, and more versatile speech codec

Friday, September 30, 2022

Since we open sourced the first version of Lyra on GitHub last year, we are delighted to see a vibrant community growing around it, with thousands of stars, hundreds of forks, and many comments and pull requests. There are people who fixed and formatted our code, built continuous integration for the project, and even added support for Web Assembly.

We are incredibly grateful for all these contributions, and we also heard the community's feedback, asking us to improve Lyra. Some examples of what developers wanted were to run Lyra on more platforms, develop applications in more languages; and for a model that computes faster with more bitrate options and lower latency, and better audio quality with fewer artifacts.

That's why we are now releasing Lyra V2, with a new architecture that enjoys a wider platform support, provides scalable bitrate capabilities, has better performance, and generates higher quality audio. With this release, we hope to continue to evolve with the community, and with its collective creativity, see new applications being developed and new directions emerging.

New Architecture

Lyra V2 is based on an end-to-end neural audio codec called SoundStream. The architecture has a residual vector quantizer (RVQ) sitting before and after the transmission channel, which quantizes the encoded information into a bitstream and reconstructs it on the decoder side.

The integration of RVQ into the architecture allows changing the bitrate of Lyra V2 at any time by selecting the number of quantizers to use. When more quantizers are used, higher quality audio is generated (at a cost of a higher bitrate). In Lyra V2, we support three different bitrates: 3.2 kps, 6 kbps, and 9.2 kbps. This enables developers to choose a bitrate most suitable for their network condition and quality requirements.

Lyra V2's model is exported in TensorFlow Lite, TensorFlow's lightweight cross-platform solution for mobile and embedded devices, which supports various platforms and hardware accelerations. The code is tested on Android phones and Linux, with experimental Mac and Windows support. Operation on iOS and other embedded platforms is not currently supported, although we expect it is possible with additional effort. Moreover, this paradigm opens Lyra to any future platform supported by TensorFlow Lite.

Better Performance

With the new architecture, the delay is reduced from 100 ms with the previous version to 20 ms. In this regard, Lyra V2 is comparable to the most widely used audio codec Opus for WebRTC, which has a typical delay of 26.5 ms, 46.5 ms, and 66.5 ms.

Lyra V2 also encodes and decodes five times faster than the previous version. On a Pixel 6 Pro phone, Lyra V2 takes 0.57 ms to encode and decode a 20 ms audio frame, which is 35 times faster than real time. The reduced complexity means that more phones can run Lyra V2 in real time than V1, and that the overall battery consumption is lowered.

Higher Quality

Driven by the advance of machine learning research over the years, the quality of the generated audio is also improved. Our listening tests show that the audio quality (measured in MUSHRA score, an indication of subjective quality) of Lyra V2 at 3.2 kbps, 6 kbps, and
9.2 kbps measures up to Opus at 10 kbps, 13 kbps, and 14 kbps respectively.

	Sample 1
Original
Opus@6kbps
LyraV1
Opus@10kbps
[email protected]
Opus@13k
LyraV2@6kbps
Opus@14kbps
[email protected]

	Sample 2
Original
Opus@6kbps
LyraV1
Opus@10kbps
[email protected]
Opus@13k
LyraV2@6kbps
Opus@14kbps
[email protected]

This makes Lyra V2 a competitive alternative to other state-of-the-art telephony codecs. While Lyra V1 already compares favorably to the Adaptive Multi-Rate (AMR-NB) codec, Lyra V2 further outperforms Enhanced Voice Services (EVS) and Adaptive Multi-Rate Wideband (AMR-WB), and is on par with Opus, all the while using only 50% - 60% of their bandwidth.

	Sample 1
Original
AMR-NB
LyraV1
EVS
AMR-WB
Opus@13k
LyraV2@6kbps

	Sample 2
Original
AMR-NB
LyraV1
EVS
AMR-WB
Opus@13k
LyraV2@6kbps

This means more devices can be connected in bandwidth-constrained environments, or that additional information can be sent over the network to reduce voice choppiness through forward error correction and packet loss concealment.

Open Source Release

Lyra V2 continues to provide what is already in Lyra V1 (the build tools, the testing frameworks, the C++ encoding and decoding API, the signal processing toolchain, and the example Android app). Developers who have experience with the Lyra V1 API will find that the V2 API looks familiar, but with a few changes. For example, now it's possible to change bitrates during encoding (more information is available in the release notes). In addition, the model definitions and weights are included as .tflite files. As with V1, this release is a beta version and the API and bitstream are expected to change. The code for running Lyra is open sourced under the Apache license. We can’t wait to see what innovative applications people will create with the new and improved Lyra!

By Ｈengchin Yeh - Chrome

Acknowledgements

The following people helped make the open source release possible: from Chrome: Alejandro Luebs, Michael Chinen, Andrew Storus, Tom Denton, Felicia Lim, Bastiaan Kleijn, Jan Skoglund, Yaowu Xu, Jamieson Brettle, Omer Osman, Matt Frost, Jim Bankoski; and from Google Research: Neil Zeghidour, Marco Tagliasacchi

Accelerate your models to production with Google Cloud and PyTorch

Monday, September 12, 2022

We believe in the power of choice for Machine Learning development, and continue to invest resources to make it easy for ML practitioners to train, deploy, and orchestrate models from a single unified data and AI cloud platform. We’re excited to announce our role as a founding member of the newly formed PyTorch Foundation, which will better position Google Cloud to make meaningful contributions to the PyTorch community. As a member of the board, we will deepen our open source investment to deliver on the Foundation’s mission to drive adoption of AI tooling by building an ecosystem of open source projects with PyTorch. We strongly believe in choice and will continue to invest in frameworks such as JAX and Tensorflow and support integrations with other OSS Projects including Spark, Airflow, XGBoost, and others.

In this blog, we provide an overview of existing resources to help you get started with PyTorch on Google Cloud. We also talk about how ML practitioners can leverage our end-to-end ML platform to train, tune, and deploy PyTorch models.

PyTorch on Google Cloud

Open source in the cloud is important because it gives you flexibility and control over where you train and deploy your ML workloads. PyTorch is extensively used in the research space and in recent years it has gained immense traction in the industry due to its ease of use and deployment. In fact, according to a survey of Kaggle users, PyTorch is the fastest growing ML framework today.

ML practitioners using PyTorch tell us that it can be challenging to advance their ML project past experimentation. This is why Google Cloud has built integrations with PyTorch that make it easier to train, deploy, and orchestrate models in production. Some examples are:

PyTorch integrates directly with Vertex AI, a fully managed ML platform that provides the tools you need to take a model from PyTorch to production, like the Pytorch DL containers or the Vertex AI workbench PyTorch one-click JupyterLab environment.
PyTorch/XLA, an open source library, uses the XLA deep learning compiler to enable PyTorch to run on Cloud TPUs. Cloud TPUs are custom accelerators designed by Google, optimized for perf/TCO with large scale ML workload PyTorch/XLA also enables XLA driven optimizations on GPUs.
TorchX provides an adapter to run and orchestrate TorchX components as part of Kubeflow Pipelines that you can easily scale on Vertex AI Pipelines.
With our OSS contributions to Apache Beam, we have made PyTorch models easy to deploy in batch or stream, data processing pipelines. Running on Google Dataflow, these pipelines will scale to very large workloads in a fully managed and simple to maintain environment.

To learn more and start using PyTorch on Google Cloud, check out the resources below:

PyTorch on Vertex AI Resources

How To train and tune PyTorch models on Vertex AI: Learn how to use Vertex AI Training to build and train a sentiment text classification model using PyTorch and Vertex AI Hyperparameter Tuning to tune hyperparameters of PyTorch models.
How to deploy PyTorch models on Vertex AI: Walk through the deployment of a Pytorch model using TorchServe as a custom container, by deploying the model artifacts to a Vertex Prediction service.
Orchestrating PyTorch ML Workflows on Vertex AI Pipelines: See how to build and orchestrate ML pipelines for training and deploying PyTorch models on Google Cloud Vertex AI using Vertex AI Pipelines.
Scalable ML Workflows using PyTorch on Kubeflow Pipelines and Vertex Pipelines: Take a look at examples of PyTorch-based ML workflows on two pipelines frameworks: OSS Kubeflow Pipelines, part of the Kubeflow project, and Vertex AI Pipelines. We share new PyTorch built-in components added to the Kubeflow Pipelines.

PyTorch/XLA and Cloud TPU/GPU

Scaling deep learning workloads with PyTorch / XLA and Cloud TPU VM: Describes the challenges associated with scaling deep learning jobs to distributed training settings, using the Cloud TPU VM and shows how to stream training data from Google Cloud Storage (GCS) to PyTorch / XLA models running on Cloud TPU Pod slices.
PyTorch/XLA: Performance debugging on Cloud TPU VM: Part I: In the first part of the performance debugging series on Cloud TPU, we lay out the conceptual framework for PyTorch/XLA in the context of training performance. We introduced a case study to make sense of preliminary profiler logs and identify the corrective actions.
PyTorch/XLA: Performance debugging on Cloud TPU VM: Part II: In the second part, we deep dive into further analysis of the performance debugging to discover more performance improvement opportunities.
PyTorch/XLA: Performance debugging on Cloud TPU VM: Part III: In the final part of the performance debugging series, we introduce user defined code annotation and visualize these annotations in the form of a trace.
Train ML models with Pytorch Lightning on TPUs: Learn how easy it is to start training models with PyTorch Lightning on TPUs with its built-in TPU support.

PyTorch on Apache Beam and Google Cloud Dataflow

Integrating ML models into production pipelines with Dataflow: Learn how to use Apache Beam's RunInference transform, with either single or multi model pipelines at scale.

Other resources

Increase your productivity using PyTorch Lightning: Learn how to use PyTorch Lightning on Vertex AI Workbench (was previously Notebooks).

By Erwin Huizing and Grace Reed – Cloud AI and ML

Lyra - enabling voice calls for the next billion users

Tuesday, April 6, 2021

The past year has shown just how vital online communication is to our lives. Never before has it been more important to clearly understand one another online, regardless of where you are and whatever network conditions are available. That’s why in February we introduced Lyra: a revolutionary new audio codec using machine learning to produce high-quality voice calls.

As part of our efforts to make the best codecs universally available, we are open sourcing Lyra, allowing other developers to power their communications apps and take Lyra in powerful new directions. This release provides the tools needed for developers to encode and decode audio with Lyra, optimized for the 64-bit ARM android platform, with development on Linux. We hope to expand this codebase and develop improvements and support for additional platforms in tandem with the community.

The Lyra Architecture

Lyra’s architecture is separated into two pieces, the encoder and decoder. When someone talks into their phone the encoder captures distinctive attributes from their speech. These speech attributes, also called features, are extracted in chunks of 40ms, then compressed and sent over the network. It is the decoder’s job to convert the features back into an audio waveform that can be played out over the listener’s phone speaker. The features are decoded back into a waveform via a generative model. Generative models are a particular type of machine learning model well suited to recreate a full audio waveform from a limited number of features. The Lyra architecture is very similar to traditional audio codecs, which have formed the backbone of internet communication for decades. Whereas these traditional codecs are based on digital signal processing (DSP) techniques, the key advantage for Lyra comes from the ability of the generative model to reconstruct a high-quality voice signal.

The Impact

While mobile connectivity has steadily increased over the past decade, the explosive growth of on-device compute power has outstripped access to reliable high speed wireless infrastructure. For regions where this contrast exists—in particular developing countries where the next billion internet users are coming online—the promise that technology will enable people to be more connected has remained elusive. Even in areas with highly reliable connections, the emergence of work-from-anywhere and telecommuting have further strained mobile data limits. While Lyra compresses raw audio down to 3kbps for quality that compares favourably to other codecs, such as Opus, it is not aiming to be a complete alternative, but can save meaningful bandwidth in these kinds of scenarios.

These trends provided motivation for Lyra and are the reason our open source library focuses on its potential for real time voice communication. There are also other applications we recognize Lyra may be uniquely well suited for, from archiving large amounts of speech, and saving battery by leveraging the computationally cheap Lyra encoder, to alleviating network congestion in emergency situations where many people are trying to make calls at once. We are excited to see the creativity the open source community is known for applied to Lyra in order to come up with even more unique and impactful applications.

The Open Source Release

The Lyra code is written in C++ for speed, efficiency, and interoperability, using the Bazel build framework with Abseil and the GoogleTest framework for thorough unit testing. The core API provides an interface for encoding and decoding at the file and packet levels. The complete signal processing toolchain is also provided, which includes various filters and transforms. Our example app integrates with the Android NDK to show how to integrate the native Lyra code into a Java-based android app. We also provide the weights and vector quantizers that are necessary to run Lyra.

We are releasing Lyra as a beta version today because we wanted to enable developers and get feedback as soon as possible. As a result, we expect the API and bitstream to change as it is developed. All of the code for running Lyra is open sourced under the Apache license, except for a math kernel, for which a shared library is provided until we can implement a fully open solution over more platforms. We look forward to seeing what people do with Lyra now that it is open sourced. Check out the code and demo on GitHub, let us know what you think, and how you plan to use it!

By Andrew Storus and Michael Chinen – Chrome

Acknowledgements

The following people helped make the open source release possible:
Hengchin Yeh, Alejandro Luebs, Jamieson Brettle, Tom Denton, Felicia Lim, Bastiaan Kleijn, Jan Skoglund, Yaowu Xu, Matt Frost, Jim Bankoski (Chrome), Chenjie Gu, Zach Gleicher, Tom Walters, Norman Casagrande, Luis Cobo, Erich Elsen (DeepMind).

From MLPerf to MLCommons: moving machine learning forward

Thursday, December 3, 2020

Today, the community of machine learning researchers and engineers behind the MLPerf benchmark is launching an open engineering consortium called MLCommons. For us, this is the next step in a journey that started almost three years ago.

Early in 2018, we gathered a group of industry researchers and academics who had published work on benchmarking machine learning (ML), in a conference room to propose the creation of an industry standard benchmark to measure ML performance. Everyone had doubts: creating an industry standard is challenging under the best conditions and ML was (and is) a poorly understood stochastic process running on extremely diverse software and hardware. Yet, we all agreed to try.

Together, along with a growing community of researchers and academics, we created a new benchmark called MLPerf. The effort took off. MLPerf is now an industry standard with over 2,000 submitted results and multiple benchmarks suites that span systems from smartphones to supercomputers. Over that time, the fastest result submitted to MLPerf for training the classic ML network ResNet improved by over 13x.

We created MLPerf because we believed in three principles:

Machine learning has tremendous potential: Already, machine learning helps billions of people find and understand information through tools like Google’s search engine and translation service. Active research in machine learning could one day save millions of lives through improvements in healthcare and automotive safety.

Transforming machine learning from promising research into wide-spread industrial practice requires investment in common infrastructure -- especially metrics: Much like computing in the ‘80s, real innovation is mixed with hype and adopting new ideas is slow and cumbersome. We need good metrics to identify the best ideas, and good infrastructure to make adoption of new techniques fast and easy.

Developing common infrastructure is best done by an open, fast-moving collaboration: We need the vision of academics and the resources of industry. We need the agility of startups and the scale of leading tech companies. Working together, a diverse community can develop new ideas, launch experiments, and rapidly iterate to arrive at shared solutions.

Our belief in the principles behind MLPerf has only gotten stronger, and we are excited to be part of the next step for the MLPerf community with the launch of MLCommons.

MLCommons aims to accelerate machine learning to benefit everyone. MLCommons will build a a common set of tools for ML practitioners including:

Benchmarks to measure progress: MLCommons will leverage MLPerf to measure speed, but also expand benchmarking other aspects of ML such as accuracy and algorithmic efficiency. ML models continue to increase in size and consequently cost. Sustaining growth in capability will require learning how to do more (accuracy) with less (efficiency).

Public datasets to fuel research: MLCommons new People’s Speech project seeks to develop a public dataset that, in addition to being larger than any other public speech dataset by more than an order of magnitude, better reflects diverse languages and accents. Public datasets drive machine learning like nothing else; consider ImageNet’s impact on the field of computer vision.

Best practices to accelerate development: MLCommons will make it easier to develop and deploy machine learning solutions by fostering consistent best practices. For instance, MLCommons’ MLCube project provides a common container interface for machine learning models to make them easier to share, experiment with (including benchmark), develop, and ultimately deploy.

Google believes in the potential of machine learning, the importance of common infrastructure, and the power of open, collaborative development. Our leadership in co-founding, and deep support in sustaining, MLPerf and MLCommons has echoed our involvement in other efforts like TensorFlow and NNAPI. Together with the MLCommons community, we can improve machine learning to benefit everyone.

Want to get involved? Learn more at mlcommons.org.

By Peter Mattson – ML Metrics, Naveen Kumar – ML Performance, and Cliff Young – Google Brain

Releasing the Healthcare Text Annotation Guidelines

Friday, October 30, 2020

The Healthcare Text Annotation Guidelines are blueprints for capturing a structured representation of the medical knowledge stored in digital text. In order to automatically map the textual insights to structured knowledge, the annotations generated using these guidelines are fed into a machine learning algorithm that learns to systematically extract the medical knowledge in the text. We’re pleased to release to the public the Healthcare Text Annotation Guidelines as a standard.

Google Cloud recently launched AutoML Entity Extraction for Healthcare, a low-code tool used to build information extraction models for healthcare applications. There remains a significant execution roadblock on AutoML DIY initiatives caused by the complexity of translating the human cognitive process into machine-readable instructions. Today, this translation occurs thanks to human annotators who annotate text for relevant insights. Yet, training human annotators is a complex endeavor which requires knowledge across fields like linguistics and neuroscience, as well as a good understanding of the business domain. With AutoML, Google wanted to democratize who can build AI. The Healthcare Text Annotation Guidelines are a starting point for annotation projects deployed for healthcare applications.

The guidelines provide a reference for training annotators in addition to explicit blueprints for several healthcare annotation tasks. The annotation guidelines cover the following:

The task of medical entity extraction with examples from medical entity types like medications, procedures, and body vitals.
Additional tasks with defined examples, such as entity relation annotation and entity attribute annotation. For instance, the guidelines specify how to relate a medical procedure entity to the source medical condition entity, or how to capture the attributes of a medication entity like dosage, frequency, and route of administration.
Guidance for annotating an entity’s contextual information like temporal assessment (e.g., current, family history, clinical history), certainty assessment (e.g., unlikely, somewhat likely, likely), and subject (e.g., patient, family member, other).

Google consulted with industry experts and academic institutions in the process of assembling the Healthcare Text Annotation Guidelines. We took inspiration from other open source and research projects like i2b2 and added context to the guidelines to support information extraction needs for industry-applications like Healthcare Effectiveness Data and Information Set (HEDIS) quality reporting. The data types contained in the Healthcare Text Annotation Guidelines are a common denominator across information extraction applications. Each industry application can have additional information extraction needs that are not captured in the current version of the guidelines. We chose to open source this asset so the community can tailor this project to their needs.

We’re thrilled to open source this project. We hope the community will contribute to the refinement and expansion of the Healthcare Text Annotation Guidelines, so they mirror the ever-evolving nature of healthcare.

By Andreea Bodnari, Product Manager and Mikhail Begun, Program Manager—Google Cloud AI

Peer Bonus Experiences: Building tiny models for the ML community with TensorFlow

Friday, October 23, 2020

Almost all the current state-of-the-art machine learning (ML) models take quite a lot of disk space. This makes them particularly inefficient in production situations. A bulky machine learning model can be exposed as a REST API and hosted on cloud services, but that same bulk may lead to hefty infrastructure costs. And some applications may need to operate in low-bandwidth environments, making cloud-hosted models less practical.

In a perfect world, your models would live alongside your application, saving data transfer costs and complying with any regulatory requirements restricting what data can be sent to the cloud. But storing multi-gigabyte models on today’s devices just isn’t practical. The field of on-device ML is dedicated to the development of tools and techniques to produce tiny—yet high performing!—ML models. Progress has been slow, but steady!

There has never been a better time to learn about on-device ML and successfully apply it in your own projects. With frameworks like TensorFlow Lite, you have an exceptional toolset to optimize your bulky models while retaining as much performance as possible. TensorFlow Lite also makes it very easy for mobile application developers to integrate ML models with tools like metadata and ML Model Binding, Android codegen, and others.

What is TensorFlow Lite?

“TensorFlow Lite is a production ready, cross-platform framework for deploying ML on mobile devices and embedded systems.” - TensorFlow Youtube

TensorFlow Lite provides first-class support for Native Android and iOS-based integrations (with many additional features, such as delegates). TensorFlow Lite also supports other tiny computing platforms, such as microcontrollers. TensorFlow Lite’s optimization APIs produce world-class, fast, and well-performing machine learning models.

Venturing into TensorFlow Lite

Last year, I started playing around with TensorFlow Lite while developing projects for Raspberry Pi for Computer Vision, using the official documentation and this course to fuel my initial learning. Following this interest, I decided to join a voluntary working group focused on creating sample applications, writing out tutorials, and creating tiny models. This working group consists of individuals from different backgrounds passionate about teaching on-device machine learning to others. The group is coordinated by Khanh LeViet (TensorFlow Lite team) and Hoi Lam (Android ML team). This is by far one of the most active working groups I have ever seen. And, back in our starting days, Khanh proposed a few different state-of-art machine learning models that were great fits for on-device machine learning:

DeepLabV3 segmentation models that classify each image pixels to a particular category like so -

Selfie2Anime model that can turn a selfie into an anime like so -

These ideas were enough for us to start spinning up Jupyter notebooks and VSCode. After months of work, we now have strong collaborations between machine learning GDEs and a bunch of different TensorFlow Lite models, sample applications, and tutorials for the community to learn from. Our collaborations have been fueled by the power of open source and all the tiny models that we have built together are available on TensorFlow Hub. There are numerous open source applications that we have built that demonstrate how to use these models.

The Cartoonizer model cartoonizes uploaded images

Margaret and I co-authored an end-to-end tutorial that was published from the official TensorFlow blog and published the TensorFlow Lite models on TensorFlow Hub. So far, the response we have received for this work has been truly mesmerizing. I’ve also shared my experiences with TensorFlow Lite in these blog posts and conference talks:

A Tale of Model Quantization in TF Lite
Plunging into Model Pruning in Deep Learning
A few good stuff in TF Lite
Doing more with TF Lite
Model Optimization 101

The power of collaboration

The working group is a tremendous opportunity for machine learning GDEs, Googlers, and passionate community individuals to collaborate and learn. We get to learn together, create together, and celebrate the joy of teaching others. I am immensely thankful, grateful, and humbled to be a part of this group. Lastly, I would like to wholeheartedly thank Khanh for being a pillar of support to us and for nominating me for the Google Open Source Peer Bonus Award.

By Sayak Paul, PyImageSearch—Guest Author

Introducing TensorFlow Recorder

Friday, August 7, 2020

When training computer vision machine learning models, data loading can often be a performance bottleneck, causing your GPU or TPU resources to be underutilized while waiting for data to be loaded into the model. Storing your dataset in the efficient TensorFlow Record (TFRecord) format is a great way to solve these problems, but creating TFRecords can unfortunately often require a great deal of complex code.

Last week we open sourced the TensorFlow Recorder project (also known as TFRecorder), which makes it possible for data scientists, data engineers, or AI/ML engineers to create image based TFRecords with just a few lines of code. Using TFRecords is incredibly important for creating efficient TensorFlow ML pipelines, but until now they haven’t been so easy to create. Before TFRecorder, in order to create TFRecords at scale you would have had to write a data pipeline that parsed your structured data, loaded images from storage, and serialized the results into the TFRecord format. TFRecorder allows you to write TFRecords directly from a Pandas dataframe or CSV without writing any complicated code.

You can see an example of TFRecoder below, but first let’s talk about some of the specific advantages of TFRecords.

How TFRecords Can Help

Using the TFRecord file format allows you to store your data in sets of files, each containing a sequence of protocol buffers serialized as a binary record that can be read very efficiently, which will help reduce the data loading bottleneck mentioned above.

Data loading performance can be further improved by implementing prefetching and parallel interleave along with using the TFRecord format. Prefetching reduces the time of each model training step(s) by fetching the data for the next training step while your model is executing training on the current step. Parallel interleave allows you to read from multiple TFRecords shards (pieces of a TFRecord file) and apply preprocessing of those interleaved data streams. This reduces the latency required to read a training batch and is especially helpful when reading data from the network.

Using TensorFlow Recorder

Creating a TFRecord using TFRecorder requires only a few lines of code. Here’s how it works.

import pandas as pd
import tfrecorder
df = pd.read_csv(...)
df.tensorflow.to_tfrecord(output_dir="gs://my/bucket")

TFRecorder currently expects data to be in the same format as Google AutoML Vision.

This format looks like a pandas dataframe or CSV formatted as:

split	image_uri	label
TRAIN	gs://my/bucket/image1.jpg	cat

Where:

split can take on the values TRAIN, VALIDATION, and TEST
image_uri specifies a local or google cloud storage location for the image file.
label can be either a text-based label that will be integerized or an integer

In the future, we hope to extend TensorFlow Recorder to work with data in any format.

While this example would work well to convert a few thousand images into TFRecords, it probably wouldn’t scale well if you have millions of images. To scale up to huge datasets, TensorFlow Recorder provides connectivity with Google Cloud Dataflow, which is a serverless Apache Beam pipeline runner. Scaling up to DataFlow requires only a little bit more configuration.

df.tensorflow.to_tfrecord(
output_dir="gs://my/bucket",
runner="DataFlowRunner",
project="my-project",
region="us-central1)

What’s next?

We’d love for you to try out TensorFlow Recorder. You can get it from GitHub or simply pip install tfrecorder. Tensorflow Recorder is very new and we’d greatly appreciate your feedback, suggestions, and pull requests.

By Mike Bernico and Carlos Ezequiel, Google Cloud AI Engineers