Adopting Bazel for Web at Scale

How and Why We Migrated Airbnb’s Large-Scale Web Monorepo to Bazel

Nov 12, 2024

By: Brie Bunge and Sharmila Jesupaul

At Airbnb, we’ve recently adopted Bazel, Google’s open source build tool, as our universal build system across backend, web, and iOS platforms. This post will cover our experience adopting Bazel for Airbnb’s large-scale (over 11 million lines of code) web monorepo. We’ll share how we prepared the code base, the principles that guided the migration, and the process of migrating selected CI jobs. Our goal is to share information that would have been valuable to us when we embarked on this journey and to contribute to the growing discussion around Bazel for web development.

Historically, we wrote bespoke build scripts and caching logic for various continuous integration (CI) jobs that proved challenging to maintain and consistently reached scaling limits as the repo grew. For example, our linter, ESLint, and TypeScript’s type checking did not support multi-threaded concurrency out-of-the-box. We extended our unit testing tool, Jest, to be the runner for these tools because it had an API to leverage multiple workers.

Continually creating workarounds for tooling that did not support concurrency was not sustainable, and it incurred a long-run maintenance cost. To tackle these challenges and to best support our growing codebase, we found that Bazel’s sophistication, parallelism, caching, and performance fulfilled our needs.

Additionally, Bazel is language agnostic. This facilitated consolidation onto a single, universal build system across Airbnb and allowed us to share common infrastructure and expertise. Now, an engineer who works on our backend monorepo can switch to the web monorepo and know how to build and test things.

When we began the migration in 2021, there was no publicized industry precedent for integrating Bazel with web at scale outside of Google. Open source tooling didn’t work out-of-the-box, and leveraging remote build execution (RBE) introduced additional challenges. Our web codebase is large and contains many loose files, which led to performance issues when transmitting them to the remote environment. Additionally, we established migration principles that included improving or maintaining overall performance and reducing the impact on developers contributing to the monorepo during the transition. We effectively achieved both of these goals. Read on for more details.

We did some work up front to make the repository Bazel-ready, namely cycle breaking and automated BUILD.bazel file generation.

Cycle Breaking

Our monorepo is laid out with projects under a top-level frontend/ directory. To start, we wanted to add BUILD.bazel files to each of the ~1000 top-level frontend directories. However, doing so created cycles in the dependency graph, which Bazel does not allow because build targets must form a directed acyclic graph (DAG). Breaking these cycles often felt like battling a hydra: removing one spawned more in its place. To accelerate the process, we modeled the problem as finding the minimum feedback arc set (MFAS)¹, the minimal set of edges whose removal leaves a DAG. Removing this set caused the least disruption, required the least effort, and surfaced pathological edges.
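To make the idea of "edges whose removal leaves a DAG" concrete, here is a minimal sketch. It is not the MFAS approximation from the paper cited in the footnote; it swaps in a much simpler heuristic that removes every back edge found by a depth-first search. That always yields a DAG, but the removed set is generally not minimal. The function name and the edge-list input format are assumptions made for illustration.

```python
from collections import defaultdict


def find_back_edges(edges):
    """Return a list of edges whose removal leaves the graph acyclic.

    `edges` is an iterable of (source, target) pairs. This is a plain
    DFS back-edge heuristic, NOT a minimum feedback arc set.
    """
    graph = defaultdict(list)
    nodes = []
    known = set()
    for src, dst in edges:
        graph[src].append(dst)
        for node in (src, dst):
            if node not in known:
                known.add(node)
                nodes.append(node)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {node: WHITE for node in nodes}
    back_edges = []

    for root in nodes:
        if color[root] != WHITE:
            continue
        color[root] = GRAY
        stack = [(root, iter(graph[root]))]
        while stack:
            node, children = stack[-1]
            for child in children:
                if color[child] == GRAY:
                    # The child is an ancestor on the current path: a cycle.
                    back_edges.append((node, child))
                elif color[child] == WHITE:
                    color[child] = GRAY
                    stack.append((child, iter(graph[child])))
                    break
            else:
                # All children processed: this node is finished.
                color[node] = BLACK
                stack.pop()
    return back_edges
```

Running the function again on the remaining edges returns an empty list, which is a quick way to confirm that no cycles are left.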

Automated BUILD.bazel Generation

We automatically generate BUILD.bazel files for the following reasons:

  1. Most contents are knowable from statically analyzable import / require statements.
  2. Automation allowed us to quickly iterate on BUILD.bazel changes as we refined our rule definitions.
  3. It would take time for the migration to complete and we didn’t want to ask users to keep these files up-to-date when they weren’t yet gaining value from them.
  4. Manually keeping these files up-to-date would constitute an additional Bazel tax, regressing the developer experience.

We have a CLI tool called sync-configs that generates dependency-based configurations in the monorepo (e.g., tsconfig.json, project configuration, now BUILD.bazel). It uses jest-haste-map and watchman with a custom version of the dependencyExtractor to determine the file-level dependency graph and part of Gazelle to emit BUILD.bazel files. This CLI tool is similar to Gazelle but also generates additional web specific configuration files such as tsconfig.json files used in TypeScript compilation.
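As a rough illustration of how statically analyzable imports can drive BUILD.bazel generation, the sketch below derives project-level dependencies from import and require statements. It is not sync-configs: a regular expression stands in for a real dependency extractor, the rule name example_library is hypothetical, and it assumes that cross-project imports start with the project directory name and that projects map to //frontend/<project> labels.

```python
import re

# Rough stand-in for a real parser: matches `from "x"`, `import "x"`
# and `require("x")`. It can be fooled by strings and comments.
IMPORT_RE = re.compile(
    r"""(?:\bfrom\s+|\bimport\s+|\brequire\(\s*)["']([^"']+)["']"""
)


def project_deps(project, sources, known_projects):
    """Return sorted Bazel labels for the projects that `project` imports.

    `sources` is an iterable of file contents. The "first path segment
    is the project name" convention is an assumption for this sketch.
    """
    deps = set()
    for source in sources:
        for specifier in IMPORT_RE.findall(source):
            target = specifier.split("/", 1)[0]
            if target in known_projects and target != project:
                deps.add(f"//frontend/{target}")
    return sorted(deps)


def render_build_file(project, deps):
    """Render BUILD.bazel text using a hypothetical `example_library` rule."""
    lines = ["example_library(", f'    name = "{project}",', "    deps = ["]
    lines += [f'        "{dep}",' for dep in deps]
    lines += ["    ],", ")"]
    return "\n".join(lines) + "\n"
```

Because the output is a pure function of the sources, it can be regenerated whenever rule definitions change, which is the property that makes iteration on generated files cheap.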

With preparation work complete, we proceeded to migrate CI jobs to Bazel. This was a massive undertaking, so we divided the work into incremental milestones. We audited our CI jobs and chose to migrate the ones that would benefit the most: type checking, linting, and unit testing². To reduce the burden on our developers, we assigned the central Web Platform team the responsibility for porting CI jobs to Bazel. We proceeded one job at a time to deliver incremental value to developers sooner, gain confidence in our approach, focus our efforts, and build momentum. With each job, we ensured that the developer experience was high-quality, that performance improved, that CI failures were reproducible locally, and that the tooling Bazel replaced was fully deprecated and removed.

We started with the TypeScript (TS) CI job. We first tried the open source ts_project rule³. However, it didn’t work well with RBE due to the sheer number of inputs, so we wrote a custom rule to reduce the number and size of the inputs.

The biggest source of inputs came from node_modules. Prior to this, the files for each npm package were being uploaded individually. Since Bazel works well with Java, we packaged up a full tar and a TS-specific tar (only containing the *.ts and package.json) for each npm package along the lines of Java JAR files (essentially zips).
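The packaging step can be sketched as follows, assuming an npm package that is already installed on disk. The function name and the output file names are made up for illustration, and the real rule may differ in many details (compression, determinism, scoped package names).

```python
import tarfile
from pathlib import Path


def pack_package(package_dir, out_dir):
    """Write a full tar and a TS-only tar for one npm package.

    The TS-only tar keeps just *.ts files (which includes *.d.ts)
    and package.json. Returns (full_tar_path, ts_tar_path).
    """
    package_dir = Path(package_dir)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    full_tar = out_dir / f"{package_dir.name}.tar"
    ts_tar = out_dir / f"{package_dir.name}.ts.tar"

    # Sort so that archive member order is stable between runs.
    files = sorted(p for p in package_dir.rglob("*") if p.is_file())
    with tarfile.open(full_tar, "w") as full, tarfile.open(ts_tar, "w") as ts:
        for path in files:
            arcname = path.relative_to(package_dir.parent).as_posix()
            full.add(path, arcname=arcname)
            if path.suffix == ".ts" or path.name == "package.json":
                ts.add(path, arcname=arcname)
    return full_tar, ts_tar
```

A type-checking action then only needs the small TS-only archive as an input, so one file is uploaded per package instead of every file inside it.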

Another source of inputs came through transitive dependencies. Transitive node_modules and d.ts files in the sandbox were being included because technically they can be needed for subsequent project compilations. For example, suppose project foo depends on bar, and types from bar are exposed in foo’s emit. As a result, project baz which depends on foo would also need bar’s outputs in the sandbox. For long chains of dependencies, this can bloat the inputs significantly with files that aren’t actually needed. TypeScript has a --listFiles flag that tells us which files are part of the compilation. We can package up this limited set of files along with the emitted d.ts files into an output tsc.tar.gz file⁴. With this, targets need only include direct dependencies, rather than all transitive dependencies⁵.
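A minimal sketch of the pruning step, assuming the output of tsc --listFiles has already been captured as text with one file path per line. The function name, the decision to drop paths outside the workspace root, and the archive layout are illustrative assumptions rather than a description of the actual rule.

```python
import tarfile
from pathlib import Path


def prune_with_list_files(list_files_output, root, emitted, out_path):
    """Archive only the files tsc reported, plus the emitted .d.ts files.

    Returns the sorted root-relative paths that were written to the archive.
    """
    root = Path(root).resolve()
    keep = set()
    for line in list_files_output.splitlines():
        line = line.strip()
        if not line:
            continue
        path = Path(line)
        if not path.is_absolute():
            path = root / path
        path = path.resolve()
        # Skip diagnostics, missing files and anything outside the root.
        if path.is_file() and root in path.parents:
            keep.add(path)
    keep.update(Path(p).resolve() for p in emitted)

    with tarfile.open(out_path, "w:gz") as tar:
        for path in sorted(keep):
            tar.add(path, arcname=path.relative_to(root).as_posix())
    return sorted(p.relative_to(root).as_posix() for p in keep)
```

In the foo, bar, baz example, the archive for foo would carry the parts of bar that foo actually used, so baz can depend on foo alone.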

Diagram showing how we use tars and the --listFiles flag to prune inputs/outputs of :types targets

This custom rule unblocked switching to Bazel for TypeScript, as the job was now well under our CI runtime budget.

Bar chart showing the speed up from switching to using our custom genrule

We migrated the ESLint job next. Bazel works best with actions that are independent and have a narrow set of inputs. Some of our lint rules (e.g., special internal rules, import/export, import/extensions) inspected files outside of the linted file. We restricted our lint rules to those that could operate in isolation as a way of reducing input size and having only to lint directly affected files. This meant moving or deleting lint rules (e.g., those that were made redundant with TypeScript). As a result, we reduced CI times by over 70%.

Time series graph showing the runtime speed-up in early May from only running ESLint on directly affected targets

Our next challenge was enabling Jest. This presented unique challenges, as we needed to bring along a much larger set of first and third-party dependencies, and there were more Bazel-specific failures to fix.

Worker and Docker Cache

We tarred up dependencies to reduce input size, but extraction was still slow. To address this, we introduced caching. One layer of cache is on the remote worker and another is on the worker’s Docker container, baked into the image at build time. The Docker layer exists to avoid losing our cache when remote workers are auto-scaled. We run a cron job once a week to update the Docker image with the newest set of cached dependencies, striking a balance of keeping them fresh while avoiding image thrashing. For more details, check out this Bazel Community Day talk.

Diagram showing symlinked npm dependencies to a Docker cache and worker cache

This added caching provided us with a ~25% speed up of our Jest unit testing CI job overall and reduced the time to extract our dependencies from 1–3 minutes to 3–7 seconds per target. This implementation required us to enable the NodeJS preserve-symlinks option and patch some of our tools that followed symlinks to their real paths. We extended this caching strategy to our Babel transformation cache, another source of poor performance.
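The extract-once-then-symlink pattern can be sketched like this. The cache layout (one directory per tarball content hash) and the function name are assumptions for illustration, and the sketch assumes a POSIX filesystem where creating symlinks needs no special privileges.

```python
import hashlib
import os
import tarfile
from pathlib import Path


def link_cached_dependency(tar_path, cache_dir, link_path):
    """Extract `tar_path` into a content-addressed cache at most once,
    then point `link_path` at the extracted directory via a symlink."""
    tar_path, cache_dir, link_path = Path(tar_path), Path(cache_dir), Path(link_path)
    digest = hashlib.sha256(tar_path.read_bytes()).hexdigest()
    extracted = cache_dir / digest

    if not extracted.exists():
        # Extract into a private staging directory, then publish it with
        # a rename so readers never observe a half-extracted tree.
        staging = cache_dir / f"{digest}.tmp-{os.getpid()}"
        staging.mkdir(parents=True, exist_ok=True)
        # Use the safer "data" extraction filter where Python supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        with tarfile.open(tar_path) as tar:
            tar.extractall(staging, **kwargs)
        os.replace(staging, extracted)

    link_path.parent.mkdir(parents=True, exist_ok=True)
    if link_path.is_symlink():
        link_path.unlink()
    link_path.symlink_to(extracted, target_is_directory=True)
    return extracted
```

Since the sandbox then contains symlinks rather than real files, any tool that resolves them to their real paths sees the cache location instead of the sandbox location, which is why the preserve-symlinks option and tool patches were needed.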

Implicit Dependencies

Next, we needed to fix Bazel-specific test failures. Most of these were due to missing files. For any inputs not statically analyzable (e.g., referenced as a string without an import, babel plugin string referenced in .babelrc), we added support for a Bazel keep comment (e.g., // bazelKeep: path/to/file) which acts as though the file were imported. The advantages of this approach are:

1. It is colocated with the code that uses the dependency,

2. BUILD.bazel files don’t need to be manually edited to add/move # keep comments,

3. There is no effect on runtime.
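Extracting keep comments is straightforward to sketch. The comment syntax below follows the example in the text (// bazelKeep: path/to/file); the function name is hypothetical, and how the extracted paths are merged into the dependency graph is left out.

```python
import re

# Matches whole-line comments of the form `// bazelKeep: path/to/file`.
KEEP_RE = re.compile(r"^[ \t]*//[ \t]*bazelKeep:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def keep_dependencies(source):
    """Return the paths named by bazelKeep comments, in source order."""
    return KEEP_RE.findall(source)
```

A generator can treat each returned path exactly like an imported file when it computes the inputs of a target.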

A small number of tests were unsuitable for Bazel because they required a large view of the repository or a dynamic and implicit set of dependencies. We moved these tests out of our unit testing job to separate CI checks.

Preventing Backsliding

With over 20,000 test files and hundreds of people actively working in the same repository, we needed to pursue test fixes such that they would not be undone as product development progressed.

Our CI has three types of build queues:

1. “Required”, which blocks changes,

2. “Optional”, which is non-blocking,

3. “Hidden”, which is non-blocking and not shown on PRs.

As we fixed tests, we moved them from “hidden” to “required” via a rule attribute. To ensure a single source of truth, tests run in “required” under Bazel were not run under the Jest setup being replaced.

# frontend/app/script/__tests__/BUILD.bazel
jest_test(
    name = "jest_test",
    is_required = True,  # makes this target a required check on pull requests
    deps = [
        ":source_library",
    ],
)

Example jest_test rule. This signifies that this target will run on the “required” build queue.

We wrote a script comparing before and after Bazel to determine migration-readiness, using the metrics of test runtime, code coverage stats, and failure rate. Fortunately, the bulk of tests could be enabled without additional changes, so we enabled these in batches. We divided and conquered the remaining burndown list of failures with the central team, Web Platform, fixing and updating tests in Bazel to avoid putting this burden on our developers. After a grace period, we fully disabled and deleted the non-Bazel Jest infrastructure and removed the is_required param.
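A comparison of this kind could look roughly like the sketch below. The three metrics come from the text, but the data structure, the function name, and the pass criteria (no slower, no coverage loss, no higher failure rate, each with an adjustable tolerance) are assumptions and not the thresholds actually used.

```python
from dataclasses import dataclass


@dataclass
class SuiteStats:
    runtime_seconds: float
    coverage_percent: float
    failure_rate: float  # fraction of runs that failed, 0.0 to 1.0


def is_migration_ready(before, after, runtime_slack=1.0, coverage_tolerance=0.0):
    """Return True when the Bazel run (`after`) is at least as good as the
    existing setup (`before`) on all three metrics, within tolerances."""
    return (
        after.runtime_seconds <= before.runtime_seconds * runtime_slack
        and after.coverage_percent >= before.coverage_percent - coverage_tolerance
        and after.failure_rate <= before.failure_rate
    )
```

Targets that pass can be enabled in batches, and the rest form the burndown list.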

In tandem with our CI migration, we ensured that developers can run Bazel locally to reproduce and iterate on CI failures. Our migration principles included delivering only what was on par with or superior to the existing developer experience and performance. JavaScript tools have developer-friendly CLI experiences (e.g., watch mode, targeting select files, rich interactivity) and IDE integrations that we wanted to retain. By default, frontend developers can continue using the tools they know and love, and in cases where it is beneficial they can opt into Bazel. Discrepancies between Bazel and non-Bazel are rare, and when they do occur, developers have a means of resolving the issue. For example, developers can run a single script, failed-on-pr, which re-runs locally any targets that failed in CI so that issues are easy to reproduce.

Annotations on a failing build with scripts to recreate the failures, e.g. yak script jest:failed-on-pr

We also normalize platform-specific binaries so that we can reuse the cache between Linux and macOS builds. This speeds up local development and CI jobs by sharing the cache between a developer’s local MacBook and the Linux machines in CI. For native npm packages (node-gyp dependencies), we exclude platform-specific files and build the package on the execution machine, that is, the machine that runs the test or build. We also use “universal binaries” (e.g., for node and zstd), where all platform binaries are included as inputs (so that inputs are consistent no matter which platform the action is run from) and the proper binary is chosen at runtime.
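The runtime selection behind a “universal binary” can be sketched as below. The <system>-<arch> directory layout and the function name are hypothetical; the point is only that every platform’s binary is shipped as an input and the choice is deferred until the action runs.

```python
import platform
from pathlib import Path


def select_binary(bundle_dir, name, system=None, machine=None):
    """Pick the binary for the current platform from a bundle that
    contains one subdirectory per platform, e.g. linux-x86_64/node."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    # Normalize common aliases for the same architecture.
    arch = {"amd64": "x86_64", "aarch64": "arm64"}.get(machine, machine)
    candidate = Path(bundle_dir) / f"{system}-{arch}" / name
    if not candidate.is_file():
        raise FileNotFoundError(f"no {name} binary bundled for {system}-{arch}")
    return candidate
```

Because the bundle is identical on every machine, the action’s input digest does not depend on where it runs, which is what lets macOS and Linux builds share cache entries.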

Adopting Bazel for our core CI jobs yielded significant performance improvements for TypeScript type checking (34% faster), ESLint linting (35% faster), and Jest unit tests (42% faster incremental runs, 29% overall). Moreover, our CI can now better scale as the repo grows.

Next, to further improve Bazel performance, we will be focusing on persisting a warm Bazel host across CI runs, taming our build graph, powering CI jobs that do not use Bazel with the Bazel build graph, and potentially exploring SquashFS to further compress and optimize our Bazel sandboxes.

We hope that sharing our journey has provided insights for organizations considering a Bazel migration for web.

Thank you Madison Capps, Meghan Dow, Matt Insler, Janusz Kudelka, Joe Lencioni, Rae Liu, James Robinson, Joel Snyder, Elliott Sprehn, Fanying Ye, and various other internal and external partners who helped bring Bazel to Airbnb.

We are also grateful to the broader Bazel community for being welcoming and sharing ideas.

[1]: This problem is NP-complete, though approximation algorithms have been devised that still guarantee no cycles; we chose the implementation outlined in “Breaking Cycles in Noisy Hierarchies”.

[2]: After initial evaluation, we considered migrating web asset bundling as out of scope (though we may revisit this in the future) due to high level of effort, unknowns in the bundler landscape, and neutral return on investment given our recent adoption of Metro, as Metro’s architecture already factors in scalability features (e.g. parallelism, local and remote caching, and incremental builds).

[3]: There are newer TS rules that may work well for you here.

[4]: We later switched to using zstd instead of gzip because it produces archives that are better compressed and more deterministic, keeping tarballs consistent across different platforms.

[5]: While unnecessary files may still be included, it’s a much narrower set (and could be pruned as a further optimization).
