Finding a Bug in Go’s arm64 Compiler

How a Subtle Compiler Bug Corrupted Go Programs on ARM64

We usually hunt for bugs in our own code: race conditions, logic errors, memory leaks. But what happens when the bug isn’t in your code at all, but in the compiler? A recent deep dive uncovered just such an issue: a subtle, hard-to-trace bug in the Go compiler for the arm64 architecture that led to silent data corruption.

This is a story of meticulous debugging that serves as a powerful lesson for all developers working on modern hardware.

The Phantom Menace: A Bug That Wasn’t There

The first sign of trouble appeared as baffling, non-deterministic behavior in a Go application running on an arm64 platform. Calculations that should have been straightforward were occasionally producing incorrect results, specifically NaN (Not a Number) values where none should exist.

Initial debugging efforts pointed toward the usual suspects: race conditions or uninitialized memory. However, standard debuggers such as Delve and GDB offered no clues. The problem was elusive, disappearing and reappearing seemingly at random, which made it nearly impossible to pin down.

The critical breakthrough came from recognizing a classic debugging phenomenon known as a “Heisenbug”: a bug that changes or vanishes when you try to observe it. When fmt.Println statements were added to trace the values, the bug disappeared entirely. This is a tell-tale sign that the issue may be related to compiler optimizations, because the extra call changes how the compiler allocates registers and orders the surrounding code.

Isolating the Issue: The Power of a Minimal Example

With the suspicion shifted from the application code to the compiler, the next step was crucial: creating a minimal reproducible example (MRE). This involved stripping down the complex application to the smallest possible piece of code that could still trigger the error.

After a painstaking process of elimination, the bug was isolated to a small function involving floating-point arithmetic. This tiny, self-contained example consistently failed on arm64 but worked perfectly on other architectures like amd64, confirming the issue was platform-specific.
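The original reproducer is not shown here, so the following is only a sketch of the shape such a test case might take. The function compute and the helper countNaNs are hypothetical names chosen for illustration. The idea is to run a small floating-point function many times on inputs that can never legitimately produce NaN and count how often one appears anyway.

```go
package main

import (
	"fmt"
	"math"
)

// compute is a hypothetical stand-in for the isolated function; it is not
// the code from the original investigation. The noinline directive keeps a
// real call boundary in the generated code.
//
//go:noinline
func compute(a, b float64) float64 {
	return math.Sqrt(a*a+b*b) / (a + b + 1)
}

// countNaNs calls compute on n finite, non-negative inputs and reports how
// many results were NaN. With correct code generation the count is 0.
func countNaNs(n int) int {
	bad := 0
	for i := 0; i < n; i++ {
		x := float64(i)
		if math.IsNaN(compute(x, x+1)) {
			bad++
		}
	}
	return bad
}

func main() {
	fmt.Println("NaN results:", countNaNs(1_000_000))
}
```

Building the same file for both targets (for example with GOARCH=arm64 and GOARCH=amd64) and comparing the counts on real hardware is one way to check whether a failure is platform-specific.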

Into the Weeds: Analyzing the Assembly

With a reliable way to reproduce the bug, the investigation moved to the lowest level: the machine code generated by the compiler. By examining the assembly output (go tool compile -S), the root cause was finally unearthed.
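Assembly listings are long, so it helps to narrow them to the lines of interest. The small filter below is a sketch, not a tool used in the original investigation: it assumes the listing is piped in on standard input (for example from go build -gcflags=-S, which prints the compiler's assembly listing) and prints only the lines that contain one of the substrings given on the command line, such as a symbol name or an instruction mnemonic. The name matchLines is made up for this example.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// matchLines returns the lines of listing that contain at least one of the
// given substrings, in their original order.
func matchLines(listing string, needles ...string) []string {
	var out []string
	for _, line := range strings.Split(listing, "\n") {
		for _, n := range needles {
			if strings.Contains(line, n) {
				out = append(out, line)
				break // avoid emitting the same line twice
			}
		}
	}
	return out
}

func main() {
	// Read the whole listing from stdin; needles come from the arguments.
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, line := range matchLines(string(data), os.Args[1:]...) {
		fmt.Println(line)
	}
}
```

Filtering on a call instruction together with a floating-point move mnemonic, for instance, makes it easier to see which values are kept in registers across a call.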

The problem centered on the handling of floating-point registers during a function call. The arm64 architecture has specific rules about which registers a called function must preserve (callee-saved) and which it may freely overwrite (caller-saved).

The investigation revealed that the Go compiler was making an incorrect assumption about the FMOV instruction. It assumed that a specific floating-point register (F28) would be preserved across a particular function call. However, the arm64 procedure call standard allows that register to be modified. Consequently, the compiler would leave a value in the register, call a function that overwrote the register’s contents, and then use the corrupted value in subsequent calculations, leading to the NaN errors.
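The failure mode can be pictured with a toy model. The sketch below is not compiler code and does not model real arm64 hardware: the register file is just a map, and the names regs, callee, buggyCaller, and fixedCaller are invented for illustration. It only shows why reusing a caller-saved register after a call is unsafe unless the value is saved and restored around the call.

```go
package main

import (
	"fmt"
	"math"
)

// regs is a toy floating-point register file shared by caller and callee.
type regs map[string]float64

// callee overwrites F28, which it is allowed to do in this model because
// F28 is treated as caller-saved.
func callee(r regs) {
	r["F28"] = math.NaN()
}

// buggyCaller keeps a live value in F28 across the call and reuses it
// afterwards, mirroring the mistaken assumption described above.
func buggyCaller(r regs, v float64) float64 {
	r["F28"] = v
	callee(r)
	return r["F28"] * 2
}

// fixedCaller saves the value before the call and restores it afterwards,
// so the callee's overwrite has no effect on the result.
func fixedCaller(r regs, v float64) float64 {
	r["F28"] = v
	spill := r["F28"] // stands in for a stack slot
	callee(r)
	r["F28"] = spill
	return r["F28"] * 2
}

func main() {
	fmt.Println(math.IsNaN(buggyCaller(regs{}, 1.5))) // true
	fmt.Println(fixedCaller(regs{}, 1.5))             // 3
}
```

In the toy model the corrupted value surfaces as NaN in the very next operation, which matches the symptom reported in the application.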

This was not a logic error in the Go program, but a fundamental misunderstanding in the compiler’s code generation for a specific instruction on a specific architecture.

Resolution and Key Takeaways for Developers

Once the cause was identified, a patch was developed and submitted to the Go project, correcting the compiler’s register allocation logic. The fix makes the compiler aware that this specific instruction can modify the register, so it no longer relies on the flawed assumption when optimizing.

This deep dive offers several critical lessons for software engineers:

  • Compiler Bugs Are Real: While rare, compilers are complex pieces of software and are not infallible. If you encounter a bug that defies all logical explanation, consider the possibility of a toolchain issue.
  • Suspect Optimizations in “Heisenbugs”: If adding a print statement or running your code in a debugger makes a bug disappear, it is a strong indicator of a compiler optimization bug, a timing issue, or a memory layout problem.
  • Master the Minimal Reproducible Example: The ability to isolate a bug into a small, self-contained test case is one of the most valuable skills in debugging. It’s essential for reporting issues and getting them fixed quickly.
  • Don’t Be Afraid of Assembly: While you may not write it daily, having a basic understanding of how to read assembly language for your target architecture can be an indispensable tool for diagnosing the most obscure, low-level bugs.

Source: https://blog.cloudflare.com/how-we-found-a-bug-in-gos-arm64-compiler/
